本篇博文展示 2024-10-02 从 arXiv.org 获取的最新论文列表,每日自动更新,按照 NLP、CV、ML、AI、IR 五个大方向分类;若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上11:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论区留下你的邮箱。

目录

概览 (2024-10-02)

今日共更新447篇论文,其中:

  • 自然语言处理76篇(Computation and Language (cs.CL))
  • 人工智能97篇(Artificial Intelligence (cs.AI))
  • 计算机视觉95篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习146篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Addition is All You Need for Energy-efficient Language Models

【速读】: 该论文试图解决大规模神经网络中浮点数乘法运算的高计算资源消耗和高能耗问题。解决方案的关键在于提出了一种线性复杂度的乘法算法(L-Mul),该算法通过整数加法操作近似浮点数乘法,显著减少了计算资源消耗并提高了精度。相较于8位浮点数乘法,L-Mul在保持高精度的同时大幅降低了位级计算量,从而在张量处理硬件中应用时,理论上可减少95%的逐元素浮点数张量乘法能耗和80%的点积能耗。

链接: https://arxiv.org/abs/2410.00907
作者: Hongyin Luo,Wei Sun
关键词-EN: Large neural networks, Large neural, floating point, neural networks spend, point
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication L-Mul algorithm that approximates floating point number multiplication with integer addition operations. The new algorithm costs significantly less computation resource than 8-bit floating point multiplication but achieves higher precision. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision but consumes significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy compared to integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by element-wise floating point tensor multiplications and 80% energy cost of dot products. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with 4-bit mantissa achieves comparable precision as float8_e4m3 multiplications, and L-Mul with 3-bit mantissa outperforms float8_e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves equivalent precision as using float8_e4m3 as accumulation precision in both fine-tuning and inference.
摘要:大型神经网络在浮点张量乘法上消耗了大部分计算资源。在本研究中,我们发现浮点乘法器可以通过一个整数加法器以高精度近似实现。我们提出了线性复杂度乘法 L-Mul 算法,该算法通过整数加法操作近似浮点数乘法。与 8 位浮点乘法相比,新算法显著减少了计算资源消耗,但精度更高。相较于 8 位浮点乘法,所提出的方法在精度更高的同时,消耗的位级计算资源显著减少。由于浮点数乘法比整数加法操作需要更高的能量,因此在张量处理硬件中应用 L-Mul 操作,有望通过逐元素浮点张量乘法减少 95% 的能量消耗,并通过点积减少 80% 的能量消耗。我们计算了 L-Mul 的理论误差期望,并在广泛的文本、视觉和符号任务中评估了该算法,包括自然语言理解、结构推理、数学和常识问答。我们的数值分析实验与理论误差估计一致,表明 4 位尾数的 L-Mul 与 float8_e4m3 乘法精度相当,而 3 位尾数的 L-Mul 优于 float8_e5m2。在流行基准上的评估结果显示,直接将 L-Mul 应用于注意力机制几乎无损。我们进一步展示了在 Transformer 模型中,将所有浮点乘法替换为 3 位尾数的 L-Mul,在微调和推理中均能达到与使用 float8_e4m3 作为累积精度相当的精度。
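
下面给出一个在浮点域上模拟 L-Mul 思想的极简示意代码(非论文官方实现):精确乘法中尾数的交叉项 mx·my 被一个固定的小偏移量取代,只保留指数相加与尾数相加;其中 offset_bits 的取值与误差水平仅为演示假设,论文中的偏移项与低位宽尾数设定请以原文为准。

```python
import math

def l_mul_approx(x: float, y: float, offset_bits: int = 4) -> float:
    """L-Mul 思想的浮点域模拟:精确乘法 (1+mx)(1+my) = 1 + mx + my + mx*my,
    这里用固定小量 2**-offset_bits 近似交叉项 mx*my,只保留加法。"""
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    fx, ex = math.frexp(abs(x))              # abs(x) = fx * 2**ex, fx ∈ [0.5, 1)
    fy, ey = math.frexp(abs(y))
    mx, my = 2.0 * fx - 1.0, 2.0 * fy - 1.0  # 规格化成 (1+m) * 2**(e-1) 形式
    mantissa = 1.0 + mx + my + 2.0 ** (-offset_bits)   # 用加法代替尾数相乘
    return sign * mantissa * 2.0 ** (ex - 1 + ey - 1)

if __name__ == "__main__":
    for a, b in [(3.14, 2.71), (0.125, 8.0), (-1.5, 0.3)]:
        approx, exact = l_mul_approx(a, b), a * b
        print(f"{a} * {b}: 精确={exact:.4f}, L-Mul≈{approx:.4f}, "
              f"相对误差={abs(approx - exact) / abs(exact):.2%}")
```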

[NLP-1] Do Music Generation Models Encode Music Theory?

【速读】: 该论文试图解决的问题是探究音乐生成模型在其内部表示中是否编码了基本的西方音乐理论概念,如节奏、音高、和弦等。解决方案的关键在于引入了一个名为SynTheory的合成MIDI和音频音乐理论数据集,并通过一个框架来探测这些音乐理论概念在音乐基础模型(如Jukebox和MusicGen)中的编码情况。研究结果表明,这些音乐理论概念在模型内部是可识别的,且其可检测性随模型规模和层次的不同而变化。

链接: https://arxiv.org/abs/2410.00872
作者: Megan Wei,Michael Freeman,Chris Donahue,Chen Sun
关键词-EN: possess impressive music, music theory concepts, models possess impressive, Music, music theory
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at ISMIR 2024. Dataset: this https URL Code: this https URL Website: this https URL

点击查看摘要

Abstract:Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work, by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models? More specifically, are fundamental Western music theory concepts observable within the “inner workings” of these models? Recent work proposed leveraging latent audio representations from music generation models towards music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded within these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus, we introduce SynTheory, a synthetic MIDI and audio music theory dataset, consisting of tempos, time signatures, notes, intervals, scales, chords, and chord progressions concepts. We then propose a framework to probe for these music theory concepts in music foundation models (Jukebox and MusicGen) and assess how strongly they encode these concepts within their internal representations. Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.
摘要:音乐基础模型具备令人印象深刻的音乐生成能力。当人们创作音乐时,他们可能会将自己的音乐理解融入作品中,通过使用音符和音程来创作旋律,和弦来构建进行,以及节奏来创造节奏感。这种说法在音乐生成模型中有多真实?更具体地说,西方音乐理论的基本概念是否能在这些模型的“内部工作”中观察到?最近的研究提出利用音乐生成模型的潜在音频表示来进行音乐信息检索任务(例如,流派分类、情感识别),这表明这些模型中编码了高级音乐特征。然而,对单个音乐理论概念(例如,节奏、音级、和弦品质)的探索仍然不足。因此,我们引入了 SynTheory,一个合成 MIDI 和音频音乐理论数据集,包含节奏、拍号、音符、音程、音阶、和弦和和弦进行等概念。然后,我们提出了一种框架,用于在这些音乐基础模型(Jukebox 和 MusicGen)中探测这些音乐理论概念,并评估它们在其内部表示中编码这些概念的强度。我们的研究结果表明,音乐理论概念在基础模型中是可辨别的,并且它们被检测到的程度因模型大小和层级而异。

[NLP-2] On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation

【速读】: 该论文试图解决大语言模型(LLM)翻译中的冗长问题对评估结果的影响。解决方案的关键在于识别导致冗长的主要触发因素,如安全性、版权问题和输入查询的上下文不足,并通过调整评估方法来公平对待不同冗长程度的LLM,以确保未来评估的准确性。

链接: https://arxiv.org/abs/2410.00863
作者: Eleftheria Briakou,Zhongtao Liu,Colin Cherry,Markus Freitag
关键词-EN: paper investigates, investigates the impact, verbose LLM translations, LLM outputs drawn, LLM translations
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper investigates the impact of verbose LLM translations on evaluation. We first demonstrate the prevalence of this behavior across several LLM outputs drawn from the WMT 2024 general shared task on machine translation. We then identify the primary triggers of verbosity, including safety, copyright concerns, and insufficient context in short input queries. Finally, we show that ignoring this behavior unfairly penalizes more verbose LLMs according to both automatic and human evaluations, highlighting the need to address this issue for more accurate future evaluations.
摘要:本文探讨了大语言模型 (LLM) 翻译中的冗长现象对评估的影响。我们首先展示了这一行为在 WMT 2024 机器翻译通用共享任务中多个 LLM 输出中的普遍性。接着,我们识别了导致冗长的主要触发因素,包括安全性、版权顾虑以及短输入查询中上下文不足等问题。最后,我们指出,在自动和人工评估中,忽略这一行为会对更冗长的 LLM 造成不公平的惩罚,从而强调了在未来评估中解决这一问题的必要性。

[NLP-3] Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis EMNLP2024

【速读】: 该论文试图解决的问题是理解在Retrieval Augmented Generation (RAG)框架中,语言模型(LM)如何利用外部检索的上下文信息。研究发现,LM在回答问题时存在“捷径”效应,即过度依赖检索的上下文而忽视模型自身的先验知识。论文提出的关键解决方案包括:(a) 因果中介分析,用于证明在回答问题时参数化记忆的利用率极低;(b) 注意力贡献和敲除技术,展示最后一个token的残差流主要从RAG上下文的token中获取信息,而非问题中的主体token。这些方法揭示了LM在RAG中的显著“捷径”行为,并适用于大型语言模型(如LlaMa)和小型语言模型(如Phi)。

链接: https://arxiv.org/abs/2410.00857
作者: Reshmi Ghosh,Rahul Seetharaman,Hitesh Wadhwa,Somyaa Aggarwal,Samyadeep Basu,Soundararajan Srinivasan,Wenlong Zhao,Shreyas Chaudhari,Ehsan Aghazadeh
关键词-EN: Retrieval Augmented Generation, Augmented Generation, natural language applications, Retrieval Augmented, leveraging external context
类目: Computation and Language (cs.CL)
备注: Accepted to Blackbox NLP @ EMNLP 2024

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) is a widely used approach for leveraging external context in several natural language applications such as question answering and information retrieval. Yet, the exact nature in which a Language Model (LM) leverages this non-parametric memory or retrieved context isn’t clearly understood. This paper mechanistically examines the RAG pipeline to highlight that LMs demonstrate a “shortcut” effect and have a strong bias towards utilizing the retrieved context to answer questions, while relying minimally on model priors. We propose (a) Causal Mediation Analysis for proving that parametric memory is minimally utilized when answering a question and (b) Attention Contributions and Knockouts for showing the last token residual stream do not get enriched from the subject token in the question, but gets enriched from tokens of RAG-context. We find this pronounced “shortcut” behaviour to be true across both LLMs (e.g., LlaMa) and SLMs (e.g., Phi).
摘要:检索增强生成 (Retrieval Augmented Generation, RAG) 是一种广泛应用于问答和信息检索等自然语言处理任务中的方法,通过利用外部上下文来增强语言模型的性能。然而,语言模型 (Language Model, LM) 如何具体利用这种非参数记忆或检索上下文的方式尚未被完全理解。本文通过机制性地分析 RAG 流程,揭示了语言模型在回答问题时表现出一种“捷径”效应,即强烈倾向于使用检索到的上下文来回答问题,而极少依赖模型自身的先验知识。我们提出了 (a) 因果中介分析,用于证明在回答问题时参数记忆的利用率极低;以及 (b) 注意力贡献和敲除实验,用于展示最后一个 Token 的残差流并未从问题中的主体 Token 获得丰富信息,而是从 RAG 上下文的 Token 中获得。我们发现这种显著的“捷径”行为在大型语言模型 (Large Language Model, LLM)(如 LlaMa)和小型语言模型 (Small Language Model, SLM)(如 Phi)中普遍存在。

[NLP-4] VHASR: A Multimodal Speech Recognition System With Vision Hotwords EMNLP2024

【速读】: 该论文试图解决图像信息在多模态自动语音识别(ASR)中是否能有效提升识别性能的问题。解决方案的关键在于提出了VHASR系统,该系统采用双流架构,分别对音频和图像信息进行处理,并通过视觉信息作为“热词”来增强模型的语音识别能力。实验结果表明,VHASR能够有效利用图像中的关键信息,不仅超越了单模态ASR,还在现有的基于图像的多模态ASR中达到了最先进的水平。

链接: https://arxiv.org/abs/2410.00822
作者: Jiliang Hu,Zuchao Li,Ping Wang,Haojun Ai,Lefei Zhang,Hai Zhao
关键词-EN: speech recognition, incorporating audio-related image, automatic speech recognition, model speech recognition, multimodal automatic speech
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 14 pages, 6 figures, accepted by EMNLP 2024

点击查看摘要

Abstract:The image-based multimodal automatic speech recognition (ASR) model enhances speech recognition performance by incorporating audio-related image. However, some works suggest that introducing image information to model does not help improving ASR performance. In this paper, we propose a novel approach effectively utilizing audio-related image information and set up VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. Our system utilizes a dual-stream architecture, which firstly transcribes the text on the two streams separately, and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model’s speech recognition ability. Its performance not only surpasses unimodal ASR, but also achieves SOTA among existing image-based multimodal ASR.
摘要:基于图像的多模态自动语音识别 (ASR) 模型通过融合与音频相关的图像信息来提升语音识别性能。然而,一些研究指出,将图像信息引入模型并不能有效提升 ASR 性能。本文提出了一种新颖的方法,有效利用与音频相关的图像信息,并构建了 VHASR,这是一个利用视觉作为关键词来增强模型语音识别能力的多模态语音识别系统。我们的系统采用双流架构,首先在两个流上分别进行文本转录,然后将输出结果进行合并。我们在四个数据集上评估了所提出的模型:Flickr8k、ADE20k、COCO 和 OpenImages。实验结果表明,VHASR 能够有效利用图像中的关键信息来增强模型的语音识别能力。其性能不仅超越了单模态 ASR,而且在现有的基于图像的多模态 ASR 中也达到了最先进的水平。

[NLP-5] A generative framework to bridge data-driven models and scientific theories in language neuroscience

【速读】: 该论文试图解决大语言模型(LLMs)在预测语言刺激的BOLD fMRI响应时,其内部特征不透明的问题。解决方案的关键在于提出了“生成解释中介的验证”框架,通过生成简明的语言选择性解释,并利用合成刺激在后续实验中验证这些解释,从而揭示语言刺激在脑区中的选择性响应机制。该方法不仅适用于单个体素,还能解释感兴趣皮质区域(ROIs)的选择性,并强调了解释准确性与底层统计模型预测能力和稳定性的紧密关联。

链接: https://arxiv.org/abs/2410.00812
作者: Richard Antonello,Chandan Singh,Shailee Jain,Aliyah Hsu,Jianfeng Gao,Bin Yu,Alexander Huth
关键词-EN: predicting BOLD fMRI, BOLD fMRI responses, predicting BOLD, BOLD fMRI, highly effective
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative explanation-mediated validation, a framework for generating concise explanations of language selectivity in the brain and then validating those explanations in follow-up experiments that use synthetic stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs).We show that explanatory accuracy is closely related to the predictive power and stability of the underlying statistical models. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.
摘要:大语言模型的表示在预测语言刺激的 BOLD fMRI 响应方面非常有效。然而,这些表示在很大程度上是不透明的:不清楚语言刺激的哪些特征驱动了每个脑区的响应。我们提出了生成式解释中介验证,这是一个框架,用于生成大脑中语言选择性的简明解释,然后在后续使用合成刺激的实验中验证这些解释。这种方法在解释单个体素和感兴趣皮质区域 (ROIs) 的选择性方面都取得了成功。我们表明,解释的准确性与基础统计模型的预测能力和稳定性密切相关。这些结果表明,大语言模型可以用来弥合数据驱动模型与形式科学理论之间日益扩大的差距。

[NLP-6] Decoding Hate: Exploring Language Models' Reactions to Hate Speech

【速读】: 该论文试图解决大型语言模型(LLMs)在面对仇恨言论时的反应问题,特别是这些模型在处理和生成仇恨言论方面的能力。解决方案的关键在于通过定性分析七个最先进的LLMs(LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, 和 Gemini Pro)对仇恨言论的反应,揭示它们处理此类输入的多样性,并探讨通过微调(fine-tuning)和指导方针防护(guideline guardrailing)等策略来减少仇恨言论的生成。此外,论文还研究了模型对以政治正确语言表达的仇恨言论的反应。

链接: https://arxiv.org/abs/2410.00775
作者: Paloma Piot,Javier Parapar
关键词-EN: Hate speech, Hate, online expression, derogatory posts, speech
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hate speech is a harmful form of online expression, often manifesting as derogatory posts. It is a significant risk in digital environments. With the rise of Large Language Models (LLMs), there is concern about their potential to replicate hate speech patterns, given their training on vast amounts of unmoderated internet data. Understanding how LLMs respond to hate speech is crucial for their responsible deployment. However, the behaviour of LLMs towards hate speech has received comparatively limited study. This paper investigates the reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis, we aim to reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs. We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing. Finally, we explore the models’ responses to hate speech framed in politically correct language.
摘要:仇恨言论是一种有害的在线表达形式,通常表现为贬损性帖子。在数字环境中,这是一种重大风险。随着大语言模型 (LLM) 的兴起,人们担心它们可能会在训练过程中复制仇恨言论模式,因为它们是在大量未经审查的互联网数据上进行训练的。理解大语言模型如何应对仇恨言论对于其负责任的部署至关重要。然而,目前对大语言模型在仇恨言论方面的行为研究还相对有限。本文研究了七种最先进的大语言模型 (LLaMA 2, Vicuna, LLaMA 3, Mistral, GPT-3.5, GPT-4, 和 Gemini Pro) 对仇恨言论的反应。通过定性分析,我们旨在揭示这些模型产生的反应范围,突出它们处理仇恨言论输入的能力。我们还讨论了通过微调 (fine-tuning) 和指南防护 (guideline guardrailing) 等策略来减少大语言模型生成仇恨言论的方法。最后,我们探讨了这些模型对以政治正确语言表达的仇恨言论的反应。

[NLP-7] BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

【速读】: 该论文试图解决大语言模型(LLMs)在处理多模态多结构化数据时缺乏统一评估方法的问题。解决方案的关键在于引入了BabelBench,这是一个创新的基准框架,通过包含247个精心设计的任务,评估LLMs在感知、常识推理、逻辑推理等方面的能力,特别是对多模态理解和结构化数据处理以及代码生成的高级能力,如探索、规划、推理和调试。实验结果表明,即使是先进的模型如ChatGPT 4也存在显著改进空间,为未来研究提供了宝贵的指导。

链接: https://arxiv.org/abs/2410.00773
作者: Xuwu Wang,Qiwen Cui,Yunzhe Tao,Yiran Wang,Ziwei Chai,Xiaotian Han,Boyi Liu,Jianbo Yuan,Jing Su,Guoyin Wang,Tingkai Liu,Liyu Chen,Tianyi Liu,Tao Sun,Yufeng Zhang,Sirui Zheng,Quanzeng You,Yang Yang,Hongxia Yang
关键词-EN: Large language models, Visual Question Answering, Large language, complex data types, increasingly pivotal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. BabelBench incorporates a dataset comprising 247 meticulously curated problems that challenge the models with tasks in perception, commonsense reasoning, logical reasoning, and so on. Besides the basic capabilities of multimodal understanding, structured data processing as well as code generation, these tasks demand advanced capabilities in exploration, planning, reasoning and debugging. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement. The insights derived from our comprehensive analysis offer valuable guidance for future research within the community. The benchmark data can be found at this https URL.
摘要:大语言模型 (LLMs) 在各个领域中变得越来越重要,特别是在处理复杂数据类型方面。这包括结构化数据处理,如 ChartQA 和 ChatGPT-Ada 所展示的,以及多模态非结构化数据处理,如视觉问答 (VQA) 中所示。这些领域吸引了业界和学术界的广泛关注。尽管如此,对于这些多样化的数据处理场景,仍然缺乏统一的评估方法。为此,我们引入了 BabelBench,这是一个创新的基准框架,用于评估大语言模型在管理多模态多结构化数据并执行代码方面的熟练程度。BabelBench 包含一个由 247 个精心策划的问题组成的数据集,这些问题挑战模型在感知、常识推理、逻辑推理等方面的任务。除了多模态理解、结构化数据处理和代码生成等基本能力外,这些任务还要求在探索、规划、推理和调试方面的高级能力。我们在 BabelBench 上的实验结果表明,即使是像 ChatGPT 4 这样的尖端模型,也存在显著的改进空间。我们从全面分析中得出的见解为社区未来的研究提供了宝贵的指导。基准数据可以在以下链接找到:https URL。

[NLP-8] Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting EMNLP2024

【速读】: 该论文试图解决在持续学习框架下视频问答(VideoQA)模型面临的灾难性遗忘问题。解决方案的关键在于提出了一种名为协作提示(Collaborative Prompting, ColPro)的方法,该方法通过整合特定问题约束提示、知识获取提示和视觉时间感知提示,有效地捕捉文本问题上下文、视觉内容以及视频的时间动态,从而在处理新任务时减少模型的遗忘现象。实验结果表明,ColPro在NExT-QA和DramaQA数据集上均表现出色,显著优于现有方法。

链接: https://arxiv.org/abs/2410.00771
作者: Chen Cai,Zheng Wang,Jianjun Gao,Wenyang Liu,Ye Lu,Runzhong Zhang,Kim-Hui Yap
关键词-EN: Video Question Answering, Question Answering, recent years, static Video Question, rapid increase
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by main EMNLP 2024

点击查看摘要

Abstract:In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
摘要:近年来,在线视频内容的快速增长凸显了基于固定数据集训练的静态视频问答 (Video Question Answering, VideoQA) 模型的局限性,这些模型在面对新内容提出的新问题或任务时难以适应。本文探讨了在持续学习框架下 VideoQA 的新挑战,并通过实证研究识别出一个关键问题:对大语言模型 (Large Language Model, LLM) 进行一系列任务的微调往往会导致灾难性遗忘。为解决这一问题,我们提出了协作提示 (Collaborative Prompting, ColPro),该方法整合了特定问题约束提示、知识获取提示和视觉时间感知提示。这些提示旨在捕捉视频问答中的文本问题上下文、视觉内容和视频时间动态,这一视角在先前的研究中未得到充分探索。在 NExT-QA 和 DramaQA 数据集上的实验结果表明,ColPro 相较于现有方法表现更优,在 NExT-QA 上达到了 55.14% 的准确率,在 DramaQA 上达到了 71.24% 的准确率,突显了其实际相关性和有效性。

[NLP-9] Thinking Outside of the Differential Privacy Box: A Case Study in Text Privatization with Language Model Prompting EMNLP2024

【速读】: 该论文试图解决隐私保护自然语言处理(NLP)中,差分隐私(DP)集成带来的限制和挑战问题。解决方案的关键在于评估和讨论DP-Prompt方法在文本重写任务中的应用,通过对比DP和非DP场景下的实验结果,揭示DP在NLP中的实用性和相对于非DP方法的优势,从而推动对DP在NLP中应用的深入讨论。

链接: https://arxiv.org/abs/2410.00751
作者: Stephen Meisenbacher,Florian Matthes
关键词-EN: Natural Language Processing, privacy-preserving Natural Language, Large Language Models, privacy-preserving Natural, Processing has risen
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 tables, Accepted to EMNLP 2024 (Main)

点击查看摘要

Abstract:The field of privacy-preserving Natural Language Processing has risen in popularity, particularly at a time when concerns about privacy grow with the proliferation of Large Language Models. One solution consistently appearing in recent literature has been the integration of Differential Privacy (DP) into NLP techniques. In this paper, we take these approaches into critical view, discussing the restrictions that DP integration imposes, as well as bring to light the challenges that such restrictions entail. To accomplish this, we focus on DP-Prompt, a recent method for text privatization leveraging language models to rewrite texts. In particular, we explore this rewriting task in multiple scenarios, both with DP and without DP. To drive the discussion on the merits of DP in NLP, we conduct empirical utility and privacy experiments. Our results demonstrate the need for more discussion on the usability of DP in NLP and its benefits over non-DP approaches.
摘要:隐私保护自然语言处理领域近年来备受关注,特别是在大语言模型 (Large Language Model) 广泛应用的背景下,隐私问题日益凸显。近期文献中频繁出现的一种解决方案是将差分隐私 (Differential Privacy, DP) 融入自然语言处理技术。本文对这些方法进行了批判性审视,探讨了 DP 集成所带来的限制,并揭示了这些限制所伴随的挑战。为此,我们重点关注了 DP-Prompt,这是一种利用语言模型重写文本以实现文本隐私化的最新方法。我们特别在多种场景下探讨了这种重写任务,包括在 DP 和非 DP 条件下的表现。为了推动对 DP 在自然语言处理中优劣的讨论,我们进行了实证的效用和隐私实验。实验结果表明,有必要进一步讨论 DP 在自然语言处理中的可用性及其相对于非 DP 方法的优势。
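
下面是一个示意 DP-Prompt 式“以改写实现文本隐私化”流程的最小代码草图:其中的模型选择 (google/flan-t5-base) 与提示词写法均为笔者假设,采样温度在此仅作为“隐私—效用”的调节旋钮展示,温度与差分隐私预算 ε 的具体换算请以论文为准。

```python
from transformers import pipeline

# 模型与提示词仅为演示假设;温度越高输出越随机,隐私保护越强、效用越低
paraphraser = pipeline("text2text-generation", model="google/flan-t5-base")

def dp_prompt_rewrite(text: str, temperature: float) -> str:
    prompt = f"Paraphrase the following text: {text}"
    out = paraphraser(prompt, do_sample=True, temperature=temperature,
                      max_new_tokens=64)
    return out[0]["generated_text"]

original = "I visited Dr. Smith at Boston General Hospital on May 2."
for t in (0.75, 1.5):
    print(f"temperature={t}: {dp_prompt_rewrite(original, t)}")
```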

[NLP-10] Optimizing Token Usage on Large Language Model Conversations Using the Design Structure Matrix

【速读】: 该论文试图解决大型语言模型(LLM)在实际应用中面临的令牌使用效率问题,特别是在API服务型LLM中,如何克服短上下文窗口、输出限制和令牌输入输出成本等挑战。解决方案的关键在于引入工程设计领域的设计结构矩阵(DSM),通过其聚类和排序等分析工具,优化LLM对话的组织结构,从而最小化一次性发送或检索的令牌数量,并有效分配到不同的上下文窗口中。这一方法不仅扩展了现有令牌使用优化的方法集,还为将工程设计实践融入LLM开辟了新的途径。

链接: https://arxiv.org/abs/2410.00749
作者: Ramon Maria Garcia Alarcia,Alessandro Golkar
关键词-EN: Large Language Models, limited output sizes, Large Language, Language Models, Models become ubiquitous
类目: Computation and Language (cs.CL)
备注: 10 pages, 26th International Dependency and Structure Modelling Conference, DSM 2024

点击查看摘要

Abstract:As Large Language Models become ubiquitous in many sectors and tasks, there is a need to reduce token usage, overcoming challenges such as short context windows, limited output sizes, and costs associated with token intake and generation, especially in API-served LLMs. This work brings the Design Structure Matrix from the engineering design discipline into LLM conversation optimization. Applied to a use case in which the LLM conversation is about the design of a spacecraft and its subsystems, the DSM, with its analysis tools such as clustering and sequencing, demonstrates being an effective tool to organize the conversation, minimizing the number of tokens sent to or retrieved from the LLM at once, as well as grouping chunks that can be allocated to different context windows. Hence, this work broadens the current set of methodologies for token usage optimization and opens new avenues for the integration of engineering design practices into LLMs.
摘要: 随着大语言模型 (LLM) 在多个领域和任务中的普及,减少 Token 使用的需求日益增加,以克服短上下文窗口、有限输出大小以及与 Token 输入和生成相关的成本等挑战,特别是在通过 API 提供服务的大语言模型中。本研究将工程设计领域的 Design Structure Matrix (设计结构矩阵) 引入大语言模型对话优化中。应用于大语言模型对话涉及航天器及其子系统设计的用例中,DSM 及其分析工具(如聚类和排序)展示了其作为组织对话的有效工具,能够最小化一次性发送或从大语言模型中检索的 Token 数量,同时将可分配到不同上下文窗口的块进行分组。因此,本研究扩展了当前 Token 使用优化的方法集,并为将工程设计实践整合到大语言模型中开辟了新的途径。
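
下面用一个玩具例子示意“把对话主题组织成 DSM、再分组装入不同上下文窗口”的思路(仅为草图:主题、Token 数与贪心策略均为笔者假设,论文使用的是 DSM 专门的聚类与排序分析工具):

```python
import numpy as np

# 5 个对话主题之间的依赖关系矩阵 (DSM):dsm[i, j] = 1 表示主题 i 依赖主题 j(示例数据)
topics = ["需求", "结构子系统", "热控子系统", "电源子系统", "综合评审"]
tokens = np.array([800, 1200, 900, 1000, 600])   # 各主题大致的 Token 数(假设值)
dsm = np.array([
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
])

def greedy_pack(dsm, tokens, budget=2500):
    """极简贪心装箱:按依赖数量排序(近似“排序”分析),在 Token 预算内依次分组,
    从而减少一次性送入 LLM 的 Token 量;真实 DSM 分析会用专门的聚类与排序算法。"""
    order = np.argsort(-dsm.sum(axis=1))
    clusters, current, used = [], [], 0
    for i in order:
        if current and used + tokens[i] > budget:
            clusters.append(current)
            current, used = [], 0
        current.append(int(i))
        used += tokens[i]
    if current:
        clusters.append(current)
    return clusters

for k, c in enumerate(greedy_pack(dsm, tokens)):
    print(f"上下文窗口 {k}: {[topics[i] for i in c]}")
```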

[NLP-11] VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models EMNLP2024

【速读】: 该论文试图解决现有对比语言-图像预训练模型(如CLIP)在处理长描述文本时的局限性问题,特别是在视频理解领域,因为视频通常包含丰富的详细内容。解决方案的关键在于提出了VideoCLIP-XL模型,通过建立自动数据收集系统并构建大规模的VILD预训练数据集,结合文本相似性引导的主成分匹配(TPCM)方法,以及引入细节感知描述排序(DDR)和幻觉感知描述排序(HDR)两个新任务,来增强模型对长描述的理解能力。此外,论文还构建了长视频描述排序(LVDR)基准,以全面评估模型的长描述处理能力。

链接: https://arxiv.org/abs/2410.00741
作者: Jiapeng Wang,Chengyu Wang,Kunzhe Huang,Jun Huang,Lianwen Jin
关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Description Ranking, numerous applications, pre-training prevents CLIP
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: EMNLP 2024 Main conference

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute regarding videos given that videos often contain abundant detailed contents. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of feature space while expanding the long description capability. We also introduce two new tasks namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR) for further understanding improvement. Finally, we construct a Long Video Description Ranking (LVDR) benchmark for evaluating the long-description capability more comprehensively. Extensive experimental results on widely-used text-video retrieval benchmarks with both short and long descriptions and our LVDR benchmark can fully demonstrate the effectiveness of our method.
摘要:对比语言-图像预训练 (Contrastive Language-Image Pre-training, CLIP) 已被广泛研究和应用于众多应用中。然而,预训练过程中对简短摘要文本的重视限制了 CLIP 对长描述的理解能力。这一问题在视频领域尤为突出,因为视频通常包含丰富的详细内容。本文中,我们提出了 VideoCLIP-XL (eXtra Length) 模型,旨在释放视频 CLIP 模型对长描述的理解能力。首先,我们建立了一个自动数据收集系统,并收集了一个大规模的 VILD 预训练数据集,包含视频和长描述对。接着,我们提出了文本相似性引导的主成分匹配 (Text-similarity-guided Primary Component Matching, TPCM) 方法,以更好地学习特征空间的分布并扩展长描述能力。此外,我们引入了两个新任务,即细节感知描述排序 (Detail-aware Description Ranking, DDR) 和幻觉感知描述排序 (Hallucination-aware Description Ranking, HDR),以进一步提高理解能力。最后,我们构建了一个长视频描述排序 (Long Video Description Ranking, LVDR) 基准,以更全面地评估长描述能力。在广泛使用的文本-视频检索基准(包括短描述和长描述)以及我们的 LVDR 基准上的大量实验结果,充分证明了我们方法的有效性。

[NLP-12] Show Me What's Wrong: Combining Charts and Text to Guide Data Analysis

【速读】: 该论文试图解决多维数据集中异常分析的复杂性和信息过载问题,特别是在金融欺诈检测领域。解决方案的关键在于结合自动化信息高亮、大语言模型生成的文本洞察和可视化分析,通过分段数据分析和视觉表示,利用自动化视觉提示来引导用户关注需要更多注意的区域。系统提供文本和图形摘要,帮助用户快速理解所选区域的详细信息,并通过图形表示进行深入探索,从而有效支持并指导探索性分析,简化可疑信息的识别过程。

链接: https://arxiv.org/abs/2410.00727
作者: Beatriz Feliciano,Rita Costa,Jean Alves,Javier Liébana,Diogo Duarte,Pedro Bizarro
关键词-EN: Analyzing and finding, finding anomalies, anomalies in multi-dimensional, multi-dimensional datasets, cumbersome but vital
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Analyzing and finding anomalies in multi-dimensional datasets is a cumbersome but vital task across different domains. In the context of financial fraud detection, analysts must quickly identify suspicious activity among transactional data. This is an iterative process made of complex exploratory tasks such as recognizing patterns, grouping, and comparing. To mitigate the information overload inherent to these steps, we present a tool combining automated information highlights, Large Language Model generated textual insights, and visual analytics, facilitating exploration at different levels of detail. We perform a segmentation of the data per analysis area and visually represent each one, making use of automated visual cues to signal which require more attention. Upon user selection of an area, our system provides textual and graphical summaries. The text, acting as a link between the high-level and detailed views of the chosen segment, allows for a quick understanding of relevant details. A thorough exploration of the data comprising the selection can be done through graphical representations. The feedback gathered in a study performed with seven domain experts suggests our tool effectively supports and guides exploratory analysis, easing the identification of suspicious information.
摘要:在多维数据集中分析和发现异常是一项繁琐但至关重要的任务,涉及不同领域。在金融欺诈检测的背景下,分析师必须迅速识别交易数据中的可疑活动。这是一个由复杂的探索性任务组成的迭代过程,如识别模式、分组和比较。为了缓解这些步骤中固有的信息过载问题,我们提出了一种工具,结合了自动化信息高亮、大语言模型生成的文本洞察和视觉分析,便于在不同细节层次上进行探索。我们根据分析区域对数据进行分割,并对其进行可视化表示,利用自动化的视觉提示来标记需要更多关注的部分。在用户选择一个区域后,我们的系统提供文本和图形摘要。文本作为所选片段的高层次和详细视图之间的桥梁,允许快速理解相关细节。通过图形表示可以对所选数据进行深入探索。在一项由七位领域专家参与的研究中收集的反馈表明,我们的工具有效地支持和指导探索性分析,简化了可疑信息的识别。

[NLP-13] Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation EMNLP

【速读】: 该论文试图解决技术术语翻译中的准确性问题,特别是在专业领域中确保清晰沟通的关键。解决方案的关键在于引入Parenthetical Terminology Translation (PTT)任务,通过在翻译术语后括号内显示原文术语来减少翻译不准确性。为实现这一方法,论文生成了一个PTT数据集,并采用知识蒸馏技术对传统的神经机器翻译(NMT)模型和小型大语言模型(sLMs)进行微调。此外,论文还开发了一种新的评估指标,用于评估整体翻译准确性和术语括号显示的正确性。研究结果表明,尽管sLMs在某些情况下表现出色,但通过微调的传统NMT模型在持续预训练的目标语言环境中更为有效。

链接: https://arxiv.org/abs/2410.00683
作者: Jiyoon Myung,Jihyeon Park,Jungki Son,Kyungro Lee,Joohyung Han
关键词-EN: accurately translating technical, translating technical terms, specialized fields, large language models, paper addresses
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Paper accepted in EMNLPW 2024

点击查看摘要

Abstract:This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields. We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation. To implement this approach, we generated a representative PTT dataset using a collaborative approach with large language models and applied knowledge distillation to fine-tune traditional Neural Machine Translation (NMT) models and small-sized Large Language Models (sLMs). Additionally, we developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms. Our findings indicate that sLMs did not consistently outperform NMT models, with fine-tuning proving more effective than few-shot prompting, particularly in models with continued pre-training in the target language. These insights contribute to the advancement of more reliable terminology translation methodologies.
摘要:本文探讨了准确翻译技术术语的挑战,这对于专业领域中的清晰沟通至关重要。我们引入了括号术语翻译 (Parenthetical Terminology Translation, PTT) 任务,旨在通过在翻译术语的同时显示其原始术语来减少潜在的不准确性。为实现这一方法,我们通过与大语言模型 (Large Language Model) 的合作生成了一个代表性的 PTT 数据集,并应用知识蒸馏技术对传统的神经机器翻译 (Neural Machine Translation, NMT) 模型和小型大语言模型 (small-sized Large Language Models, sLMs) 进行了微调。此外,我们开发了一种新的评估指标,用于评估整体翻译准确性和术语括号呈现的正确性。研究结果表明,sLMs 并未始终优于 NMT 模型,微调比少样本提示 (few-shot prompting) 更为有效,特别是在目标语言中持续预训练的模型中。这些见解有助于推动更可靠的术语翻译方法的发展。
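
以下是一个检查“括号术语呈现是否正确”的极简示意代码(并非论文提出的完整评估指标,函数名与示例均为假设):

```python
import re

def check_parenthetical(translation: str, source_terms: list[str]) -> dict:
    """检查译文中每个源术语是否以“译名 (原文术语)”的括号形式保留,
    仅为示意性的检查逻辑,并非论文提出的完整评估指标。"""
    results = {}
    for term in source_terms:
        pattern = re.compile(r"[((]\s*" + re.escape(term) + r"\s*[))]")
        results[term] = bool(pattern.search(translation))
    return results

translation = "我们使用知识蒸馏 (knowledge distillation) 对神经机器翻译 (NMT) 模型进行微调。"
print(check_parenthetical(translation, ["knowledge distillation", "NMT", "few-shot prompting"]))
# 预期输出: {'knowledge distillation': True, 'NMT': True, 'few-shot prompting': False}
```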

[NLP-14] AutoTM 2.0: Automatic Topic Modeling Framework for Documents Analysis

【速读】: 该论文旨在优化加性正则化主题模型(Additively Regularized Topic Models),提出了AutoTM 2.0框架。解决方案的关键在于引入了一系列改进,包括新的优化流程、基于大型语言模型(LLM)的质量评估指标以及分布式模式。这些改进使得AutoTM 2.0不仅适用于专业人员,也能方便非专业人员进行文本数据的探索性分析和特征可解释的聚类任务。通过专门开发的如连贯性和基于GPT-4的方法等质量评估指标,研究人员和实践者可以轻松集成新的优化算法和适应新的评估指标,从而提升模型性能并扩展实验范围。实验结果表明,AutoTM 2.0在5个具有不同特征和两种语言的数据集上,相比前一版本AutoTM,表现更为优异。

链接: https://arxiv.org/abs/2410.00655
作者: Maria Khodorchenko,Nikolay Butakov,Maxim Zuev,Denis Nasonov
关键词-EN: regularized topic models, optimizing additively regularized, additively regularized topic, framework for optimizing, topic models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we present an AutoTM 2.0 framework for optimizing additively regularized topic models. Comparing to the previous version, this version includes such valuable improvements as novel optimization pipeline, LLM-based quality metrics and distributed mode. AutoTM 2.0 is a comfort tool for specialists as well as non-specialists to work with text documents to conduct exploratory data analysis or to perform clustering task on interpretable set of features. Quality evaluation is based on specially developed metrics such as coherence and gpt-4-based approaches. Researchers and practitioners can easily integrate new optimization algorithms and adapt novel metrics to enhance modeling quality and extend their experiments. We show that AutoTM 2.0 achieves better performance compared to the previous AutoTM by providing results on 5 datasets with different features and in two different languages.
摘要:在本研究中,我们提出了一个用于优化加性正则化主题模型的 AutoTM 2.0 框架。相较于前一版本,此版本包含了诸多重要改进,如新颖的优化流程、基于大语言模型 (LLM) 的质量评估指标以及分布式模式。AutoTM 2.0 不仅适用于专业人士,也方便非专业人士使用文本文档进行探索性数据分析或执行基于可解释特征集的聚类任务。质量评估基于专门开发的指标,如连贯性和基于 gpt-4 的方法。研究人员和实践者可以轻松集成新的优化算法,并适应新颖的评估指标,以提升建模质量和扩展实验范围。我们通过在具有不同特征的 5 个数据集上,以及两种不同语言中进行实验,展示了 AutoTM 2.0 相较于前一版本的 AutoTM 实现了更好的性能。

[NLP-15] Deteccion Automatica de Patologias en Notas Clinicas en Español Combinando Modelos de Lenguaje y Ontologias Medicos

【速读】: 该论文旨在解决皮肤病理在医疗报告中的自动检测问题。解决方案的关键在于结合大型语言模型与医学本体,通过训练模型识别皮肤病理的类型、严重程度和身体部位,并确定学习这些特征的顺序,从而显著提高检测精度。研究结果显示,该方法在医疗文本分类中达到了0.84的精确度,微观和宏观F1分数分别为0.82和0.75,达到了当前技术水平。

链接: https://arxiv.org/abs/2410.00616
作者: Léon-Paul Schaub Torre,Pelayo Quirós,Helena García Mieres
关键词-EN: follow-up medical report, automatic detection, medical reports, dermatological pathologies, medical
类目: Computation and Language (cs.CL)
备注: 22 pages, in Spanish language, 6 figures, Proceedings of the 40th venue of the SEPLN

点击查看摘要

Abstract:In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model combined with medical ontologies to predict, given a first appointment or follow-up medical report, the pathology a person may suffer from. The results show that teaching the model to learn the type, severity and location on the body of a dermatological pathology as well as in which order it has to learn these three features significantly increases its accuracy. The article presents the demonstration of state-of-the-art results for classification of medical texts with a precision of 0.84, micro and macro F1-score of 0.82 and 0.75, and makes both the method and the dataset used available to the community. – En este artículo presentamos un método híbrido para la detección automática de patologías dermatológicas en informes médicos. Usamos un modelo de lenguaje amplio en español combinado con ontologías médicas para predecir, dado un informe médico de primera cita o de seguimiento, la patología del paciente. Los resultados muestran que el tipo, la gravedad y el sitio en el cuerpo de una patología dermatológica, así como en qué orden tiene un modelo que aprender esas tres características, aumentan su precisión. El artículo presenta la demostración de resultados comparables al estado del arte de clasificación de textos médicos con una precisión de 0.84, micro y macro F1-score de 0.82 y 0.75, y deja a disposición de la comunidad tanto el método como el conjunto de datos utilizado.
摘要:本文介绍了一种用于自动检测医疗报告中皮肤病理的混合方法。我们使用大语言模型结合医学本体来预测,根据首次预约或随访医疗报告,患者可能患有的病理。结果表明,教导模型学习皮肤病理的类型、严重程度和身体部位,以及学习这些特征的顺序,显著提高了其准确性。文章展示了在医疗文本分类方面达到最先进水平的结果,精确度为0.84,微观和宏观F1-score分别为0.82和0.75,并将所使用的方法和数据集提供给社区。

– 本文介绍了一种用于自动检测医疗报告中皮肤病理的混合方法。我们使用西班牙语大语言模型结合医学本体来预测,根据首次预约或随访医疗报告,患者的病理。结果表明,皮肤病理的类型、严重程度和身体部位,以及模型学习这些特征的顺序,显著提高了其准确性。文章展示了在医疗文本分类方面达到最先进水平的结果,精确度为0.84,微观和宏观F1-score分别为0.82和0.75,并将所使用的方法和数据集提供给社区。

[NLP-16] Style-Specific Neurons for Steering LLMs in Text Style Transfer EMNLP2024

【速读】: 该论文试图解决在零样本设置下,大型语言模型(LLMs)在文本风格转换(TST)任务中倾向于直接复制输入文本而未能有效改变其风格的问题。解决方案的关键在于提出了一种名为sNeuron-TST的新方法,通过识别与源风格和目标风格相关的神经元,并停用仅与源风格相关的神经元,从而提高目标风格词汇的概率,增强生成文本的风格多样性。然而,这种停用操作会影响生成文本的流畅性,为此,论文提出了一种改进的对比解码方法,以应对因停用源风格神经元导致的层间快速标记概率变化,从而在保持风格多样性的同时提升文本的流畅性。

链接: https://arxiv.org/abs/2410.00593
作者: Wen Lai,Viktor Hangya,Alexander Fraser
关键词-EN: Text style transfer, aims to modify, original meaning, altering its original, TST
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 main conference. The code is publicly available at this https URL

点击查看摘要

Abstract:Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.
摘要:文本风格转换 (Text Style Transfer, TST) 旨在修改文本的风格而不改变其原始含义。大语言模型 (Large Language Models, LLMs) 在包括 TST 在内的多个任务中表现出优越的性能。然而,在零样本 (Zero-shot) 设置下,它们往往直接将输入文本的大部分内容复制到输出中,而未能有效改变其风格。为了增强文本的风格多样性和流畅性,我们提出了 sNeuron-TST,这是一种利用风格特定神经元 (style-specific neurons) 引导 LLMs 进行 TST 的新方法。具体来说,我们识别与源风格和目标风格相关的神经元,并停用仅与源风格相关的神经元,以提高目标风格词汇的概率,从而增强生成文本的风格多样性。然而,我们发现这种停用对生成文本的流畅性产生了负面影响,为此我们提出了一种改进的对比解码方法 (contrastive decoding method),该方法考虑了由于停用的源风格神经元导致的跨层 Token 概率快速变化。实证实验表明,所提出的方法在涵盖正式性、毒性、政治性、礼貌性、作者身份和情感的六个基准上具有有效性。
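
下面用一个玩具线性层示意“统计风格相关神经元、并在前向传播中停用仅与源风格相关的神经元”的核心操作(仅为草图:真实方法作用于 LLM 的内部层,中位数阈值为演示假设,后续的对比解码此处未实现):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 32)           # 示意用的某一内部层;真实方法作用于 LLM 的 FFN 层

# 假设已得到源风格 / 目标风格文本的输入表示(此处用随机张量代替)
src_inputs, tgt_inputs = torch.randn(64, 16), torch.randn(64, 16) + 0.5

with torch.no_grad():
    src_act = torch.relu(layer(src_inputs)).mean(dim=0)
    tgt_act = torch.relu(layer(tgt_inputs)).mean(dim=0)

# “仅源风格”神经元:源风格激活高而目标风格激活低(中位数阈值仅为演示假设)
source_only = (src_act > src_act.median()) & (tgt_act < tgt_act.median())
print("将被停用的源风格神经元数量:", int(source_only.sum()))

def knockout_hook(module, inputs, output):
    output = output.clone()
    output[..., source_only] = 0.0   # 前向传播时将这些神经元的输出置零,即“停用”
    return output

handle = layer.register_forward_hook(knockout_hook)
_ = layer(tgt_inputs[:2])            # 后续生成即在停用状态下进行;对比解码部分此处未实现
handle.remove()
```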

[NLP-17] AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation EMNLP2024

【速读】: 该论文试图解决开源大型语言模型(LLMs)在代码生成任务中,通过知识蒸馏方法复制专有模型(如GPT-4)性能时,由于过度依赖教师模型的直接响应蒸馏而导致的响应质量下降问题。解决方案的关键在于提出了自适应模块化响应进化(AMR-Evol)框架,该框架通过两阶段过程优化响应蒸馏:第一阶段进行模块化分解,将直接响应分解为更易管理的子模块;第二阶段进行自适应响应进化,自动进化响应与相关功能模块。实验结果表明,AMR-Evol框架在多个代码生成基准测试中显著优于传统的响应蒸馏方法,提升了模型性能。

链接: https://arxiv.org/abs/2410.00558
作者: Ziyang Luo,Xin Li,Hongzhan Lin,Jing Ma,Lidong Bing
关键词-EN: response distillation, generation has led, trend to replicate, replicate these capabilities, response
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: EMNLP 2024

点击查看摘要

Abstract:The impressive performance of proprietary LLMs like GPT4 in code generation has led to a trend to replicate these capabilities in open-source models through knowledge distillation (e.g. Code Evol-Instruct). However, these efforts often neglect the crucial aspect of response quality, relying heavily on teacher models for direct response distillation. This paradigm, especially for complex instructions, can degrade the quality of synthesized data, compromising the knowledge distillation process. To this end, our study introduces the Adaptive Modular Response Evolution (AMR-Evol) framework, which employs a two-stage process to refine response distillation. The first stage, modular decomposition, breaks down the direct response into more manageable sub-modules. The second stage, adaptive response evolution, automatically evolves the response with the related function modules. Our experiments with three popular code benchmarks (HumanEval, MBPP, and EvalPlus) attest to the superiority of the AMR-Evol framework over baseline response distillation methods. By comparing with the open-source Code LLMs trained on a similar scale of data, we observed performance enhancements: more than +3.0 points on HumanEval-Plus and +1.0 points on MBPP-Plus, which underscores the effectiveness of our framework. Our codes are available at this https URL.
摘要:像 GPT4 这样的专有大语言模型 (LLM) 在代码生成方面的卓越表现,引发了一股通过知识蒸馏 (例如 Code Evol-Instruct) 在开源模型中复制这些能力的热潮。然而,这些努力往往忽视了响应质量这一关键方面,过度依赖教师模型进行直接响应蒸馏。这种范式,尤其是在处理复杂指令时,可能会降低合成数据的质量,从而损害知识蒸馏过程。为此,我们的研究引入了自适应模块化响应进化 (Adaptive Modular Response Evolution, AMR-Evol) 框架,该框架采用两阶段流程来优化响应蒸馏。第一阶段,模块化分解,将直接响应分解为更易管理的子模块。第二阶段,自适应响应进化,自动与相关功能模块一起进化响应。我们在三个流行的代码基准测试 (HumanEval, MBPP 和 EvalPlus) 上的实验证明了 AMR-Evol 框架相对于基线响应蒸馏方法的优越性。通过与在类似规模数据上训练的开源代码大语言模型进行比较,我们观察到了性能提升:HumanEval-Plus 上超过 +3.0 分,MBPP-Plus 上超过 +1.0 分,这突显了我们框架的有效性。我们的代码可在以下链接获取:https URL。
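
以下是对“模块化分解 + 自适应响应进化”两阶段流程的示意性编排代码(call_teacher 为假设的教师模型接口,提示词写法亦为笔者虚构,仅用于说明数据合成的调用顺序,并非论文实现):

```python
def call_teacher(prompt: str) -> str:
    """假设的教师模型调用接口(可替换为任意 LLM API),此处仅返回占位文本。"""
    return f"<teacher output for: {prompt[:40]}...>"

def modular_decompose(instruction: str, direct_response: str) -> list[str]:
    # 第一阶段:模块化分解——让教师模型把直接响应拆成若干子功能模块
    prompt = (f"Instruction: {instruction}\nResponse: {direct_response}\n"
              "Decompose the response into independent sub-functions, one per line.")
    return call_teacher(prompt).splitlines()

def adaptive_evolve(instruction: str, modules: list[str]) -> str:
    # 第二阶段:自适应响应进化——结合相关功能模块重新合成更高质量的响应
    prompt = (f"Instruction: {instruction}\nRelevant modules:\n" + "\n".join(modules)
              + "\nRewrite and improve the full solution using these modules.")
    return call_teacher(prompt)

instruction = "Write a function that returns the n-th Fibonacci number."
draft = call_teacher(instruction)                    # 教师模型的直接响应
evolved = adaptive_evolve(instruction, modular_decompose(instruction, draft))
print(evolved)
```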

[NLP-18] What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study EMNLP2024

【速读】: 该论文试图解决机器翻译(MT)中的性别偏见问题,特别是这种偏见对最终用户(如女性和男性)可能带来的实际成本和影响。解决方案的关键在于采用以人为中心的研究方法,通过收集90名参与者的行为数据,分析他们在修正性别翻译偏差时所需的技术和时间成本,以及相应的经济成本。研究发现,女性在修正翻译偏差时需要更多的努力和成本,而现有的偏见测量方法未能反映这些差异。因此,论文主张采用以人为中心的方法来评估和理解性别偏见对社会的实际影响。

链接: https://arxiv.org/abs/2410.00545
作者: Beatrice Savoldi,Sara Papi,Matteo Negri,Ana Guerberof,Luisa Bentivogli
关键词-EN: machine translation, rarely involve people, people and society, Abstract, Gender
类目: Computation and Language (cs.CL)
备注: Accepted ad EMNLP 2024

点击查看摘要

Abstract:Gender bias in machine translation (MT) is recognized as an issue that can harm people and society. And yet, advancements in the field rarely involve people, the final MT users, or inform how they might be impacted by biased technologies. Current evaluations are often restricted to automatic methods, which offer an opaque estimate of what the downstream impact of gender disparities might be. We conduct an extensive human-centered study to examine if and to what extent bias in MT brings harms with tangible costs, such as quality of service gaps across women and men. To this aim, we collect behavioral data from 90 participants, who post-edited MT outputs to ensure correct gender translation. Across multiple datasets, languages, and types of users, our study shows that feminine post-editing demands significantly more technical and temporal effort, also corresponding to higher financial costs. Existing bias measurements, however, fail to reflect the found disparities. Our findings advocate for human-centered approaches that can inform the societal impact of bias.
摘要:机器翻译 (MT) 中的性别偏见被认为是一个可能对个人和社会造成伤害的问题。然而,该领域的进展很少涉及最终的 MT 用户,或者告知他们可能如何受到偏见技术的影响。当前的评估通常局限于自动方法,这些方法对性别差异的下游影响提供了一个不透明的估计。我们进行了一项广泛的人为中心的研究,以调查 MT 中的偏见是否以及在多大程度上带来了具有实际成本的伤害,例如女性和男性之间的服务质量差距。为此,我们从 90 名参与者那里收集了行为数据,他们在编辑 MT 输出后确保了正确的性别翻译。在多个数据集、语言和用户类型中,我们的研究表明,女性后编辑需求显著增加了技术和时间上的努力,同时也对应了更高的财务成本。然而,现有的偏见测量方法未能反映出这些发现的差异。我们的研究结果主张采用以人为中心的方法,这些方法可以告知偏见的潜在社会影响。

[NLP-19] Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents

【速读】: 该论文试图解决现有基准在评估大语言模型(LLMs)在对话式问答(CQA)中处理复杂指导文档能力方面的不足。解决方案的关键在于提出了InsCoQA基准,该基准专门用于评估LLMs在处理多文档、多步骤指导任务中的能力,并通过InsEval评估器来全面评估生成响应和程序指导的完整性和准确性。

链接: https://arxiv.org/abs/2410.00526
作者: Shiwei Wu,Chen Zhang,Yan Gao,Qimeng Wang,Tong Xu,Yao Hu,Enhong Chen
关键词-EN: conversational question answering, question answering, rich sources, sources of knowledge, knowledge for completing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instructional documents are rich sources of knowledge for completing various tasks, yet their unique challenges in conversational question answering (CQA) have not been thoroughly explored. Existing benchmarks have primarily focused on basic factual question-answering from single narrative documents, making them inadequate for assessing a model's ability to comprehend complex real-world instructional documents and provide accurate step-by-step guidance in daily life. To bridge this gap, we present InsCoQA, a novel benchmark tailored for evaluating large language models (LLMs) in the context of CQA with instructional documents. Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents, reflecting the intricate and multi-faceted nature of real-world instructional tasks. Additionally, to comprehensively assess state-of-the-art LLMs on the InsCoQA benchmark, we propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
摘要:指导性文档是完成各种任务的丰富知识来源,然而其在对话式问答(CQA)中的独特挑战尚未得到充分探索。现有基准主要集中在单篇叙述性文档的基本事实问答上,这使得它们不足以评估模型在理解复杂现实世界指导性文档并提供日常生活中的准确逐步指导方面的能力。为了填补这一空白,我们提出了 InsCoQA,这是一个专为评估大语言模型(LLMs)在指导性文档背景下进行 CQA 的新型基准。InsCoQA 源自广泛、百科全书式的指导性内容,评估模型在从多篇文档中检索、解释和准确总结程序性指导方面的能力,反映了现实世界指导任务的复杂性和多面性。此外,为了全面评估最先进 LLMs 在 InsCoQA 基准上的表现,我们提出了 InsEval,一个由 LLM 辅助的评估器,用于测量生成响应和程序性指令的完整性和准确性。

[NLP-20] Annotation Guidelines for Corpus Novelties: Part 2 – Alias Resolution Version 1.0

【速读】: 该论文旨在解决小说文本中的别名解析问题,即识别和标注不同名称是否指向同一实体。解决方案的关键在于制定详细的标注指南,包括如何定义规范名称以及如何判断不同名称是否指代同一实体,并通过实例展示这些指南在实际标注中的应用。

链接: https://arxiv.org/abs/2410.00522
作者: Arthur Amalvy(LIA),Vincent Labatut(LIA)
关键词-EN: Alias Resolution, Novelties corpus, annotated for Alias, Resolution, Novelties
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Novelties corpus is a collection of novels (and parts of novels) annotated for Alias Resolution, among other tasks. This document describes the guidelines applied during the annotation process. It contains the instructions used by the annotators, as well as a number of examples retrieved from the annotated novels, and illustrating how canonical names should be defined, and which names should be considered as referring to the same entity.
摘要:Novelties 语料库是一个包含小说(及小说部分)的集合,这些小说(部分)经过别名解析(Alias Resolution)等任务的标注。本文档描述了标注过程中应用的指南。它包含了标注者使用的指令,以及从标注小说中提取的若干示例,展示了如何定义规范名称,以及哪些名称应被视为指代同一实体。

[NLP-21] Exploring the Learning Capabilities of Language Models using LEVERWORLDS

【速读】: 该论文试图解决在随机环境中学习模型时,如何平衡学习一般结构规则与特定实例属性之间的相互作用,特别是在样本效率方面的问题。解决方案的关键在于设计了一个名为LeverWorlds的框架,该框架能够生成遵循相似生成过程但分布不同的简单物理世界,并通过自然语言表达其实例。通过这一框架,研究者可以进行受控实验,评估不同学习方法的样本复杂度。论文通过对比经典学习算法和Transformer语言模型(包括微调和上下文学习)的性能,发现Transformer虽然在任务中表现良好,但在样本效率上显著低于基于更强结构假设的经典方法。为此,论文提出了一种利用当代语言模型的上下文学习能力来应用简单算法的方法,尽管当前模型在该任务上仍有挑战,但显示出潜在的改进空间。

链接: https://arxiv.org/abs/2410.00519
作者: Eitan Wagner,Amir Feder,Omri Abend
关键词-EN: stochastic setting, setting often involves, general structure rules, Learning, specific properties
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning a model of a stochastic setting often involves learning both general structure rules and specific properties of the instance. This paper investigates the interplay between learning the general and the specific in various learning methods, with emphasis on sample efficiency. We design a framework called LeverWorlds, which allows the generation of simple physics-inspired worlds that follow a similar generative process with different distributions, and their instances can be expressed in natural language. These worlds allow for controlled experiments to assess the sample complexity of different learning methods. We experiment with classic learning algorithms as well as Transformer language models, both with fine-tuning and In-Context Learning (ICL). Our general finding is that (1) Transformers generally succeed in the task; but (2) they are considerably less sample efficient than classic methods that make stronger assumptions about the structure, such as Maximum Likelihood Estimation and Logistic Regression. This finding is in tension with the recent tendency to use Transformers as general-purpose estimators. We propose an approach that leverages the ICL capabilities of contemporary language models to apply simple algorithms for this type of data. Our experiments show that models currently struggle with the task but show promising potential.
摘要:在随机环境中学习模型通常涉及学习一般结构规则和实例的具体属性。本文探讨了在各种学习方法中,学习一般性和特定性之间的相互作用,特别强调样本效率。我们设计了一个名为 LeverWorlds 的框架,该框架允许生成遵循类似生成过程但分布不同的简单物理启发世界,并且这些世界的实例可以用自然语言表达。这些世界允许进行受控实验,以评估不同学习方法的样本复杂性。我们实验了经典学习算法以及 Transformer 语言模型,包括微调和上下文学习 (In-Context Learning, ICL)。我们的总体发现是:(1) Transformer 通常能够成功完成任务;但 (2) 它们在样本效率上明显低于对结构做出更强假设的经典方法,如最大似然估计 (Maximum Likelihood Estimation) 和逻辑回归 (Logistic Regression)。这一发现与近期倾向于将 Transformer 用作通用估计器的趋势相矛盾。我们提出了一种利用当代语言模型的 ICL 能力、对此类数据应用简单算法的方法。我们的实验表明,当前模型在该任务上仍面临挑战,但显示出有希望的潜力。
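
下面用一个自行构造的“杠杆世界”玩具任务,示意文中提到的经典基线(逻辑回归)在少样本下的表现(数据生成方式为笔者假设,并非论文的 LeverWorlds 数据集;用于对比的 ICL 部分此处未实现):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_lever_world(n):
    """玩具版“杠杆世界”:左右两侧各有质量与力臂,标签为左侧是否压下;
    采用对数特征使规则线性可分,便于逻辑回归快速学到底层结构。"""
    m1, d1, m2, d2 = rng.uniform(0.5, 5.0, size=(4, n))
    X = np.log(np.stack([m1, d1, m2, d2], axis=1))
    y = (m1 * d1 > m2 * d2).astype(int)
    return X, y

X_test, y_test = sample_lever_world(2000)
for n in [16, 32, 64, 128, 256]:
    X_train, y_train = sample_lever_world(n)
    clf = LogisticRegression().fit(X_train, y_train)
    print(f"n={n:>3}: 测试准确率 = {clf.score(X_test, y_test):.3f}")
```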

[NLP-22] Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing EMNLP2024

【速读】: 该论文试图解决跨语言语义解析(SP)中的零样本跨语言迁移问题,即在不依赖目标语言大量标注数据的情况下,如何提升目标语言的语义解析性能。解决方案的关键是提出了一种名为“跨语言反向解析(Cross-Lingual Back-Parsing, CBP)”的新型数据增强方法。CBP利用多语言预训练语言模型(mPLMs)的表示几何特性,从源语言的语义表示生成目标语言的语句,从而在零资源设置下进行有效的跨语言数据增强。该方法仅依赖源语言的标注数据和单语语料库,通过在两个跨语言SP基准测试(Mschema2QA和Xspider)上的广泛实验,证明了CBP在目标语言上带来了显著的性能提升,并成功生成了具有高槽值对齐率和语义完整性的目标语言语句。

链接: https://arxiv.org/abs/2410.00513
作者: Deokhyung Kang,Seonjeong Hwang,Yunsu Kim,Gary Geunbae Lee
关键词-EN: utilize multilingual pretrained, Recent efforts, pretrained language models, requiring extensive annotations, extend semantic parsing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recent efforts have aimed to utilize multilingual pretrained language models (mPLMs) to extend semantic parsing (SP) across multiple languages without requiring extensive annotations. However, achieving zero-shot cross-lingual transfer for SP remains challenging, leading to a performance gap between source and target languages. In this study, we propose Cross-Lingual Back-Parsing (CBP), a novel data augmentation methodology designed to enhance cross-lingual transfer for SP. Leveraging the representation geometry of the mPLMs, CBP synthesizes target language utterances from source meaning representations. Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings, by utilizing only labeled data in the source language and monolingual corpora. Extensive experiments on two cross-language SP benchmarks (Mschema2QA and Xspider) demonstrate that CBP brings substantial gains in the target language. Further analysis of the synthesized utterances shows that our method successfully generates target language utterances with high slot value alignment rates while preserving semantic integrity. Our codes and data are publicly available at this https URL.
摘要:近期研究致力于利用多语言预训练语言模型 (mPLMs) 来扩展语义解析 (SP) 至多种语言,而无需大量标注数据。然而,实现零样本跨语言转移的语义解析仍然充满挑战,导致源语言和目标语言之间的性能差距。在本研究中,我们提出了跨语言反向解析 (Cross-Lingual Back-Parsing, CBP),这是一种新颖的数据增强方法,旨在提升语义解析的跨语言转移能力。通过利用 mPLMs 的表示几何结构,CBP 从源语言的语义表示中合成目标语言的语句。我们的方法在零资源设置下有效进行跨语言数据增强,仅使用源语言的标注数据和单语语料库。在两个跨语言语义解析基准测试 (Mschema2QA 和 Xspider) 上的广泛实验表明,CBP 在目标语言上带来了显著的性能提升。进一步分析合成的语句显示,我们的方法成功生成了具有高槽值对齐率的目标语言语句,同时保持了语义完整性。我们的代码和数据已公开,可访问此 https URL。

[NLP-23] FlipGuard: Defending Preference Alignment against Update Regression with Constrained Optimization EMNLP2024

【速读】: 该论文试图解决大语言模型在偏好对齐过程中可能出现的回归问题,即模型在更新后对之前已正确处理的数据表现出现退化。解决方案的关键是提出了一种名为FlipGuard的约束优化方法,通过定制化的奖励特征识别性能下降,并在训练过程中施加约束,以确保模型在更新时保持与预对齐模型的一致性,从而有效缓解更新回归问题,同时保留知识并提升整体性能。

链接: https://arxiv.org/abs/2410.00508
作者: Mingye Zhu,Yi Liu,Quan Wang,Junbo Guo,Zhendong Mao
关键词-EN: Large Language Models’, improved Large Language, Language Models’ ability, significantly improved Large, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2024 Main track

点击查看摘要

Abstract:Recent breakthroughs in preference alignment have significantly improved Large Language Models’ ability to generate texts that align with human preferences and values. However, current alignment metrics typically emphasize the post-hoc overall improvement, while overlooking a critical aspect: regression, which refers to the backsliding on previously correctly-handled data after updates. This potential pitfall may arise from excessive fine-tuning on already well-aligned data, which subsequently leads to over-alignment and degeneration. To address this challenge, we propose FlipGuard, a constrained optimization approach to detect and mitigate update regression with focal attention. Specifically, FlipGuard identifies performance degradation using a customized reward characterization and strategically enforces a constraint to encourage conditional congruence with the pre-aligned model during training. Comprehensive experiments demonstrate that FlipGuard effectively alleviates update regression while demonstrating excellent overall performance, with the added benefit of knowledge preservation while aligning preferences.
摘要:近期在偏好对齐方面的突破显著提升了大语言模型生成符合人类偏好和价值观文本的能力。然而,当前的对齐度量标准通常侧重于事后整体改进,而忽视了一个关键方面:回归,即在更新后对之前正确处理的数据出现退步。这种潜在的陷阱可能源于对已经良好对齐的数据进行过度微调,从而导致过度对齐和退化。为应对这一挑战,我们提出了 FlipGuard,一种约束优化方法,用于检测并缓解更新过程中的回归问题,通过焦点注意力机制实现。具体而言,FlipGuard 通过定制的奖励特征识别性能下降,并在训练过程中策略性地施加约束,以促进与预对齐模型在条件一致性方面的保持。综合实验表明,FlipGuard 有效地缓解了更新回归问题,同时在整体性能上表现出色,并且在偏好对齐过程中保留了知识。
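
以下是一个示意 FlipGuard 式“仅对出现回归的样本施加与预对齐模型一致性约束”的最小代码草图(奖励刻画、约束形式与超参均为笔者假设,实际训练中该约束项需与主偏好对齐损失相加):

```python
import torch
import torch.nn.functional as F

def flipguard_penalty(policy_logits, ref_logits, reward_new, reward_old, beta=0.1):
    """示意性的 FlipGuard 式约束项:仅对奖励出现“倒退”(flip) 的样本,
    施加使当前模型贴近预对齐模型输出分布的 KL 约束;实际训练中应与主对齐损失相加。"""
    flip_mask = (reward_new < reward_old).float()          # 哪些样本出现了回归
    kl = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="none").sum(dim=-1)            # 逐样本 KL(预对齐模型 || 当前模型)
    return (beta * flip_mask * kl).mean()

# 玩具数据:4 个样本、词表大小 10;奖励打分来自假设的奖励刻画
policy_logits = torch.randn(4, 10, requires_grad=True)
ref_logits = torch.randn(4, 10)
reward_new = torch.tensor([0.9, 0.2, 0.7, 0.1])
reward_old = torch.tensor([0.5, 0.6, 0.4, 0.3])
penalty = flipguard_penalty(policy_logits, ref_logits, reward_new, reward_old)
penalty.backward()
print("出现回归的样本:", (reward_new < reward_old).tolist(), " 约束项 =", float(penalty))
```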

[NLP-24] Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach EMNLP2024

【速读】: 该论文试图解决跨语言摘要中的语义一致性问题,即在将文档摘要成多种目标语言时,确保各语言摘要之间的语义相似性。解决方案的关键在于提出了一种基于原则的重新排序方法,并通过多标准评估协议来评估目标语言间的语义一致性,从而填补了这一领域的研究空白。

链接: https://arxiv.org/abs/2410.00502
作者: Diogo Pernes,Gonçalo M. Correia,Afonso Mendes
关键词-EN: Cross-lingual summarization aims, bridge language barriers, aims to bridge, Cross-lingual summarization, multi-target cross-lingual summarization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2024 (Findings)

点击查看摘要

Abstract:Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.
摘要:跨语言摘要旨在通过将不同语言的文档进行摘要来跨越语言障碍。然而,确保跨语言的语义一致性是一个被忽视的挑战,并且在多个情境中可能至关重要。为了填补这一空白,我们引入了多目标跨语言摘要任务,即在确保生成的摘要语义相似的前提下,将文档摘要为多种目标语言。我们提出了一种有原则的重新排序方法来解决这一问题,并设计了一个多标准评估协议来评估目标语言间的语义一致性,这标志着我们迈出了希望激发该问题进一步研究的第一步。
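
下面给出一个“在多语言候选摘要中选取语义最一致组合”的简化重排序示意(句向量模型的选型与平均余弦相似度准则均为笔者假设,并非论文的完整重排序方法):

```python
from itertools import product
import numpy as np
from sentence_transformers import SentenceTransformer   # 多语言句向量模型,选型仅为假设

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# 每种目标语言各有若干候选摘要(内容为演示用)
candidates = {
    "de": ["Der Bericht beschreibt stark steigende Kosten.", "Der Bericht behandelt das Wetter."],
    "es": ["El informe describe costes en fuerte aumento.", "El informe habla del clima."],
    "zh": ["报告指出成本大幅上升。", "报告讨论了天气情况。"],
}

def mean_pairwise_cosine(vectors):
    sims = [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for i, a in enumerate(vectors) for b in vectors[i + 1:]]
    return sum(sims) / len(sims)

embedded = {lang: model.encode(texts) for lang, texts in candidates.items()}
langs = list(candidates)
best = max(product(*[range(len(candidates[l])) for l in langs]),
           key=lambda idx: mean_pairwise_cosine([embedded[l][i] for l, i in zip(langs, idx)]))
print({l: candidates[l][i] for l, i in zip(langs, best)})   # 语义上最一致的一组摘要
```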

[NLP-25] Self-Updatable Large Language Models with Parameter Integration

【速读】: 该论文试图解决大语言模型(LLMs)在快速频繁整合小规模经验(如与周围对象的交互)时面临的挑战,特别是记忆近期事件的效力和回忆久远经验的保留能力。解决方案的关键在于提出了一种名为SELF-PARAM的方法,该方法在不增加额外参数的情况下,通过最小化原始模型与目标模型预测之间的Kullback-Leibler(KL)散度,实现知识的内部化更新。具体来说,通过生成与知识相关的多样化问答对,并在此数据集上最小化KL散度,使目标模型能够无缝地将知识整合到其参数中,从而在保证近似最优效力和长期保留的同时,显著提升模型在问答和对话推荐任务中的表现。

链接: https://arxiv.org/abs/2410.00487
作者: Yu Wang,Xinshuang Liu,Xiusi Chen,Sean O’Brien,Junda Wu,Julian McAuley
关键词-EN: large language models, large language, surrounding objects, remains a substantial, substantial challenge
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant advancements in large language models (LLMs), the rapid and frequent integration of small-scale experiences, such as interactions with surrounding objects, remains a substantial challenge. Two critical factors in assimilating these experiences are (1) Efficacy: the ability to accurately remember recent events; (2) Retention: the capacity to recall long-past experiences. Current methods either embed experiences within model parameters using continual learning, model editing, or knowledge distillation techniques, which often struggle with rapid updates and complex interactions, or rely on external storage to achieve long-term retention, thereby increasing storage requirements. In this paper, we propose SELF-PARAM (Self-Updatable Large Language Models with Parameter Integration). SELF-PARAM requires no extra parameters while ensuring near-optimal efficacy and long-term retention. Our method employs a training objective that minimizes the Kullback-Leibler (KL) divergence between the predictions of an original model (with access to contextual information) and a target model (without such access). By generating diverse question-answer pairs related to the knowledge and minimizing the KL divergence across this dataset, we update the target model to internalize the knowledge seamlessly within its parameters. Evaluations on question-answering and conversational recommendation tasks demonstrate that SELF-PARAM significantly outperforms existing methods, even when accounting for non-zero storage requirements. This advancement paves the way for more efficient and scalable integration of experiences in large language models by embedding knowledge directly into model parameters.
摘要:尽管大语言模型 (LLM) 取得了显著进展,但快速且频繁地整合小规模体验,如与周围物体的互动,仍然是一个重大挑战。整合这些体验的两个关键因素是:(1) 有效性:准确记忆近期事件的能力;(2) 保留性:回忆久远经历的能力。当前的方法要么通过持续学习、模型编辑或知识蒸馏技术将体验嵌入模型参数中,这些方法通常难以应对快速更新和复杂互动,要么依赖外部存储以实现长期保留,从而增加了存储需求。本文提出 SELF-PARAM (Self-Updatable Large Language Models with Parameter Integration)。SELF-PARAM 在无需额外参数的情况下,确保了接近最优的有效性和长期保留性。我们的方法采用了一种训练目标,即最小化原始模型(可访问上下文信息)与目标模型(无此访问权限)预测之间的 Kullback-Leibler (KL) 散度。通过生成与知识相关的多样化问答对,并在此数据集上最小化 KL 散度,我们更新目标模型,使其参数无缝内化知识。在问答和对话推荐任务的评估中,SELF-PARAM 显著优于现有方法,即使在考虑非零存储需求的情况下也是如此。这一进展为通过直接将知识嵌入模型参数,实现大语言模型中体验的更高效和可扩展整合铺平了道路。
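
结合上文,下面给出一个示意性的 PyTorch 草图,说明"最小化带上下文的原模型与不带上下文的目标模型之间预测分布的 KL 散度"这一训练目标大致如何计算。其中模型、分词器与问答样本均为假设的占位写法,仅为帮助理解,并非论文的官方实现。

```python
import torch
import torch.nn.functional as F

def self_param_kl_loss(orig_model, target_model, tokenizer, context, question, answer):
    """对单条问答样本近似计算 KL(P_orig(·|context, q, a) || P_target(·|q, a))。
    orig_model 可见上下文,target_model 不可见;假设两者共享同一词表。
    为简化,这里只取最后一个位置的分布,实际可对答案中每个位置求平均。"""
    with_ctx = tokenizer(context + question + answer, return_tensors="pt")
    no_ctx = tokenizer(question + answer, return_tensors="pt")

    with torch.no_grad():  # 原模型仅提供"教师"分布,不更新参数
        p_orig = F.softmax(orig_model(**with_ctx).logits[:, -1, :], dim=-1)

    log_p_tgt = F.log_softmax(target_model(**no_ctx).logits[:, -1, :], dim=-1)
    return F.kl_div(log_p_tgt, p_orig, reduction="batchmean")
```

训练时即在围绕目标知识生成的多样化问答对上反复最小化该损失,使目标模型把原本只存在于上下文中的知识内化进自身参数。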

[NLP-26] Adversarial Suffixes May Be Features Too!

【速读】: 该论文试图解决大型语言模型(如GPT-4和LLaMA 3)在面对“越狱”攻击时容易产生有害行为的问题。解决方案的关键在于揭示这些攻击中使用的对抗性后缀并非简单的漏洞,而是可能代表能够主导模型行为的特征。通过实验,论文展示了良性特征如何被转化为对抗性后缀,这些后缀能够有效破坏模型的安全对齐;同时,论文还证明了这些对抗性后缀包含的特征在不同提示下产生特定响应,并指出即使在没有有害内容的良性数据集上进行微调,也可能引入这些安全妥协的特征。这一发现强调了训练数据中主导性良性特征带来的风险,并呼吁进一步研究以加强模型的安全对齐。

链接: https://arxiv.org/abs/2410.00451
作者: Wei Zhao,Zhe Li,Yige Li,Jun Sun
关键词-EN: large language models, significant ongoing efforts, large language, language models, remain vulnerable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including those triggered by adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM's behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from a benign dataset in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets, i.e., even in the absence of harmful content. This highlights the critical risk posed by dominating benign features in the training data and calls for further research to reinforce LLM safety alignment. Our code and data are available at this https URL.
摘要:尽管在安全对齐方面持续投入了大量努力,但像 GPT-4 和 LLaMA 3 这样的大语言模型 (LLM) 仍然容易受到越狱攻击,这些攻击可能导致有害行为,包括由对抗性后缀触发的行为。基于先前的研究,我们假设这些对抗性后缀不仅仅是漏洞,还可能代表能够主导 LLM 行为的特征。为了验证这一假设,我们进行了多项实验。首先,我们证明了良性特征可以有效地作为对抗性后缀发挥作用,即我们开发了一种特征提取方法,从良性数据集中提取与样本无关的特征并以后缀形式呈现,并展示了这些后缀可能有效地破坏安全对齐。其次,我们展示了从越狱攻击中生成的对抗性后缀可能包含有意义的特征,即在不同提示后附加相同后缀会导致响应表现出特定特征。第三,我们展示了这类良性却会破坏安全性的特征仅通过在良性数据集上微调即可轻松引入,也就是说,即使完全不含有害内容也是如此。这突显了训练数据中主导的良性特征带来的关键风险,并呼吁进一步研究以加强 LLM 的安全对齐。我们的代码和数据可在 this https URL 获取。

[NLP-27] Conversational Exploratory Search of Scholarly Publications Using Knowledge Graphs

【速读】: 该论文试图解决传统基于字符串匹配的搜索方法在学术出版物检索中因词汇差异导致的相关性低的问题。解决方案的关键在于开发了一种基于知识图谱的对话式搜索系统,通过识别搜索词的潜在意图和上下文含义,实现概念层面的匹配。该系统利用知识图谱表示作者、出版物和研究概念之间的语义关系,并通过对话界面简化用户导航复杂数据的过程,从而提高学术出版物的检索效率。

链接: https://arxiv.org/abs/2410.00427
作者: Phillip Schneider,Florian Matthes
关键词-EN: targets concept-based matches, methods primarily depend, recognizing underlying intents, search methods primarily, search targets concept-based
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to ICNLSP 2024

点击查看摘要

Abstract:Traditional search methods primarily depend on string matches, while semantic search targets concept-based matches by recognizing underlying intents and contextual meanings of search terms. Semantic search is particularly beneficial for discovering scholarly publications where differences in vocabulary between users’ search terms and document content are common, often yielding irrelevant search results. Many scholarly search engines have adopted knowledge graphs to represent semantic relations between authors, publications, and research concepts. However, users may face challenges when navigating these graphical search interfaces due to the complexity and volume of data, which impedes their ability to discover publications effectively. To address this problem, we developed a conversational search system for exploring scholarly publications using a knowledge graph. We outline the methodical approach for designing and implementing the proposed system, detailing its architecture and functional components. To assess the system’s effectiveness, we employed various performance metrics and conducted a human evaluation with 40 participants, demonstrating how the conversational interface compares against a graphical interface with traditional text search. The findings from our evaluation provide practical insights for advancing the design of conversational search systems.
摘要:传统的搜索方法主要依赖于字符串匹配,而语义搜索则通过识别搜索词的潜在意图和上下文含义,实现基于概念的匹配。语义搜索在发现学术出版物方面尤为有益,因为用户搜索词与文档内容之间的词汇差异常见,往往导致不相关的搜索结果。许多学术搜索引擎采用了知识图谱来表示作者、出版物和研究概念之间的语义关系。然而,由于数据复杂且体量庞大,用户在浏览这些图形化搜索界面时可能面临挑战,难以有效地发现出版物。为了解决这一问题,我们开发了一个基于知识图谱的对话式搜索系统,用于探索学术出版物。我们概述了设计和实现该系统的方法,详细介绍了其架构和功能组件。为了评估系统的有效性,我们采用了多种性能指标,并进行了由40名参与者参与的人工评估,展示了对话式界面与采用传统文本搜索的图形界面相比的表现。我们的评估结果为推进对话式搜索系统的设计提供了实际的见解。

[NLP-28] Are LLMs Aware that Some Questions are not Open-ended? EMNLP2024

【速读】: 该论文试图解决大语言模型(LLMs)在回答不同类型问题时缺乏问题意识的问题,即LLMs在面对有限答案的问题时过于随意,而在面对开放性问题时又显得过于单调。解决方案的关键是提出了一个名为“问题意识温度采样”(Question Awareness Temperature Sampling, QuATS)的方法,通过根据问题特征自适应调整输出分布,增强LLMs的问题意识,从而在文本生成中无需手动调整温度参数,并能持续提升模型在各种基准测试中的表现。

链接: https://arxiv.org/abs/2410.00423
作者: Dongjie Yang,Hai Zhao
关键词-EN: Large Language Models, Large Language, question awareness, Language Models, LLMs
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown the impressive capability of answering questions in a wide range of scenarios. However, when LLMs face different types of questions, it is worth exploring whether LLMs are aware that some questions have limited answers and need to respond more deterministically but some do not. We refer to this as question awareness of LLMs. The lack of question awareness in LLMs leads to two phenomena that LLMs are: (1) too casual to answer non-open-ended questions or (2) too boring to answer open-ended questions. In this paper, we first evaluate the question awareness in LLMs. The experimental results show that LLMs have the issues of lacking awareness of questions in certain domains, e.g. factual knowledge, resulting in hallucinations during the generation. To mitigate these, we propose a method called Question Awareness Temperature Sampling (QuATS). This method enhances the question awareness of LLMs by adaptively adjusting the output distributions based on question features. The automatic adjustment in QuATS eliminates the need for manual temperature tuning in text generation and consistently improves model performance in various benchmarks.
摘要:大语言模型 (LLMs) 在回答各种场景中的问题方面展现了令人印象深刻的能力。然而,当 LLMs 面对不同类型的问题时,值得探讨的是 LLMs 是否意识到某些问题答案有限,需要更确定性地回答,而有些问题则不需要。我们将此称为 LLMs 的问题意识。LLMs 缺乏问题意识会导致两种现象:(1) 对非开放性问题回答过于随意,或 (2) 对开放性问题回答过于乏味。本文首先评估了 LLMs 的问题意识。实验结果表明,LLMs 在某些领域(如事实知识)中缺乏问题意识,导致生成过程中出现幻觉。为了缓解这些问题,我们提出了一种名为问题意识温度采样 (Question Awareness Temperature Sampling, QuATS) 的方法。该方法通过根据问题特征自适应调整输出分布,增强了 LLMs 的问题意识。QuATS 中的自动调整消除了文本生成中手动温度调节的需求,并在各种基准测试中持续提升模型性能。
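
QuATS 的实现细节在摘要中并未给出;下面用一个假设性的小例子示意"按问题特征自适应调整采样温度"的基本形态,其中对问题开放程度的打分函数只是启发式占位,实际应由模型对问题特征的感知给出。

```python
import numpy as np

def question_openness(question: str) -> float:
    """占位:返回 0(封闭式/事实型)到 1(开放式)之间的开放程度打分,这里仅用关键词启发式示意。"""
    open_cues = ("为什么", "如何", "谈谈", "why", "how", "discuss")
    return 1.0 if any(c in question.lower() for c in open_cues) else 0.0

def adaptive_sample(logits: np.ndarray, question: str,
                    t_min: float = 0.2, t_max: float = 1.0) -> int:
    """根据问题开放程度在 [t_min, t_max] 之间插值温度,再按该温度采样一个 token。"""
    t = t_min + (t_max - t_min) * question_openness(question)
    scaled = logits / t
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# 封闭式问题得到更低的温度,输出更确定;开放式问题温度更高,输出更多样
print(adaptive_sample(np.array([2.0, 1.0, 0.5]), "法国的首都是哪里?"))
```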

[NLP-29] Semantic Parsing with Candidate Expressions for Knowledge Base Question Answering

【速读】: 该论文试图解决现有语义解析器在处理大规模知识库(KB)时,无法充分利用KB中的大量信息的问题。解决方案的关键在于提出了一种增强的语法,该语法通过候选表达式来辅助语义解析,使得解析器在生成逻辑形式时能够更有效地利用KB中的实体和关系信息。具体来说,该语法定义了作为产生规则的动作,并在推理过程中通过类型和候选表达式的约束来预测这些动作,从而提高了语义解析的准确性。实验结果表明,这种增强的语法在KQA Pro和Overnight两个基准测试中显著提升了语义解析器的准确性,达到了最先进的水平。

链接: https://arxiv.org/abs/2410.00414
作者: Daehwan Nam,Gary Geunbae Lee
关键词-EN: convert natural language, parsers convert natural, logical forms, Semantic, semantic parser
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Semantic parsers convert natural language to logical forms, which can be evaluated on knowledge bases (KBs) to produce denotations. Recent semantic parsers have been developed with sequence-to-sequence (seq2seq) pre-trained language models (PLMs) or large language models, where the models treat logical forms as sequences of tokens. For syntactic and semantic validity, the semantic parsers use grammars that enable constrained decoding. However, the grammars lack the ability to utilize large information of KBs, although logical forms contain representations of KB elements, such as entities or relations. In this work, we propose a grammar augmented with candidate expressions for semantic parsing on a large KB with a seq2seq PLM. The grammar defines actions as production rules, and our semantic parser predicts actions during inference under the constraints by types and candidate expressions. We apply the grammar to knowledge base question answering, where the constraints by candidate expressions assist a semantic parser to generate valid KB elements. In experiments on two benchmarks, KQA Pro and Overnight, the constraints by candidate expressions increased the accuracy of our semantic parser, whether it was trained with strong supervision or weak supervision. Our semantic parser achieved state-of-the-art accuracies on KQA Pro and Overnight.
摘要:语义解析器将自然语言转换为逻辑形式,这些逻辑形式可以在知识库 (KBs) 上进行评估以生成指称。最近开发的语义解析器采用了序列到序列 (seq2seq) 预训练语言模型 (PLMs) 或大语言模型,其中模型将逻辑形式视为 Token 序列。为了确保句法和语义的有效性,语义解析器使用语法来实现受限解码。然而,尽管逻辑形式包含知识库元素(如实体或关系)的表示,这些语法缺乏利用知识库大量信息的能力。在本研究中,我们提出了一种增强候选表达式的语法,用于在大型知识库上进行语义解析的 seq2seq PLM。该语法将动作定义为产生式规则,我们的语义解析器在推断过程中根据类型和候选表达式的约束预测动作。我们将该语法应用于知识库问答,其中候选表达式的约束帮助语义解析器生成有效的知识库元素。在两个基准测试 KQA Pro 和 Overnight 上的实验表明,无论是在强监督还是弱监督下训练,候选表达式的约束都提高了我们语义解析器的准确性。我们的语义解析器在 KQA Pro 和 Overnight 上达到了最先进的准确率。
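
作为对"受限解码"这一步骤的极简示意,下面的草图在每个解码步骤把语法与候选表达式允许范围之外的 token 全部屏蔽;真实系统中允许集合应由产生式规则、类型约束与知识库候选动态给出,这里直接写死,仅作说明。

```python
import numpy as np

def constrained_step(logits: np.ndarray, allowed_token_ids) -> int:
    """将不在允许集合中的 token 置为 -inf 后取最优,相当于受限解码的一步。"""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]
    return int(np.argmax(masked))

# 假设当前语法状态只允许生成若干候选实体对应的 token id(此处取 0、3、4)
logits = np.array([0.3, 2.1, -0.5, 1.7, 0.9])
print(constrained_step(logits, [0, 3, 4]))   # 输出 3:得分最高的合法候选
```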

[NLP-30] PN: Transferable Proto-Learning Network towards Few-shot Document-Level Relation Extraction

【速读】: 该论文试图解决少样本文档级关系抽取中跨域迁移性差的问题,特别是NOTA(none-of-the-above)关系表示的跨域偏差。解决方案的关键在于引入Transferable Proto-Learning Network (TPN),其核心组件包括:1) 混合编码器,通过层次化编码和注意力机制增强关系表示;2) 可插拔的域外检测模块,通过自适应学习块计算NOTA原型,有效缓解跨域NOTA偏差;3) 动态权重校准器,用于检测关系分类的置信度,并作为动态权重校准NOTA主导的损失函数。此外,通过虚拟对抗训练(VAT)进一步增强模型的跨域性能。实验结果表明,TPN在FREDo和ReFREDo数据集上表现优异,且参数规模约为现有最先进方法的一半。

链接: https://arxiv.org/abs/2410.00412
作者: Yu Zhang,Zhao Kang
关键词-EN: Few-shot document-level relation, document-level relation extraction, relation extraction suffers, Few-shot document-level, poor performance due
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Few shot document-level relation extraction

点击查看摘要

Abstract:Few-shot document-level relation extraction suffers from poor performance due to the challenging cross-domain transferability of NOTA (none-of-the-above) relation representation. In this paper, we introduce a Transferable Proto-Learning Network (TPN) to address the challenging issue. It comprises three core components: Hybrid Encoder hierarchically encodes semantic content of input text combined with attention information to enhance the relation representations. As a plug-and-play module for Out-of-Domain (OOD) Detection, Transferable Proto-Learner computes NOTA prototype through an adaptive learnable block, effectively mitigating NOTA bias across various domains. Dynamic Weighting Calibrator detects relation-specific classification confidence, serving as dynamic weights to calibrate the NOTA-dominant loss function. Finally, to bolster the model’s cross-domain performance, we complement it with virtual adversarial training (VAT). We conduct extensive experimental analyses on FREDo and ReFREDo, demonstrating the superiority of TPN. Compared to state-of-the-art methods, our approach achieves competitive performance with approximately half the parameter size. Data and code are available at this https URL.
摘要:少样本文档级关系抽取由于 NOTA (none-of-the-above) 关系表示的跨域可转移性挑战,表现不佳。本文中,我们引入了一种可转移的原型学习网络 (Transferable Proto-Learning Network, TPN) 来解决这一难题。它包含三个核心组件:混合编码器 (Hybrid Encoder) 通过结合注意力信息,分层编码输入文本的语义内容,以增强关系表示。作为域外检测 (Out-of-Domain, OOD) 的即插即用模块,可转移原型学习器 (Transferable Proto-Learner) 通过自适应可学习块计算 NOTA 原型,有效缓解了跨域的 NOTA 偏差。动态权重校准器 (Dynamic Weighting Calibrator) 检测关系特定的分类置信度,作为动态权重来校准 NOTA 主导的损失函数。最后,为了增强模型的跨域性能,我们补充了虚拟对抗训练 (Virtual Adversarial Training, VAT)。我们在 FREDo 和 ReFREDo 上进行了广泛的实验分析,证明了 TPN 的优越性。与最先进的方法相比,我们的方法在参数规模大约减半的情况下,实现了具有竞争力的性能。数据和代码可在以下链接获取:https URL。

[NLP-31] AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference EMNLP2024

【速读】: 该论文试图解决预训练语言模型(PLMs)在自动评估中表现优异但在人工评估中表现不佳的问题,这主要是由于微调数据集质量不高和高质量人类标注数据的缺乏。解决方案的关键是引入了一种名为AlignSum的新型人类摘要偏好对齐框架。该框架通过构建包含抽取式、生成式和人类标注摘要数据的数据金字塔,进行高斯重采样以去除极端长度的摘要,并实施两阶段分层微调,从而显著提升语言模型与人类摘要偏好的一致性。实验结果表明,AlignSum使BART-Large等模型在自动和人工评估中均超越了175B参数的GPT-3。

链接: https://arxiv.org/abs/2410.00409
作者: Yang Han,Yiming Wang,Rui Wang,Lu Chen,Kai Yu
关键词-EN: commonly employ Pre-trained, Text summarization tasks, employ Pre-trained Language, tasks commonly employ, fit diverse standard
类目: Computation and Language (cs.CL)
备注: EMNLP2024 Findings, code at: this https URL

点击查看摘要

Abstract:Text summarization tasks commonly employ Pre-trained Language Models (PLMs) to fit diverse standard datasets. While these PLMs excel in automatic evaluations, they frequently underperform in human evaluations, indicating a deviation between their generated summaries and human summarization preferences. This discrepancy is likely due to the low quality of fine-tuning datasets and the limited availability of high-quality human-annotated data that reflect true human preference. To address this challenge, we introduce a novel human summarization preference alignment framework AlignSum. This framework consists of three parts: Firstly, we construct a Data Pyramid with extractive, abstractive, and human-annotated summary data. Secondly, we conduct the Gaussian Resampling to remove summaries with extreme lengths. Finally, we implement the two-stage hierarchical fine-tuning with the Data Pyramid after Gaussian Resampling. We apply AlignSum to PLMs on the human-annotated CNN/DailyMail and BBC XSum datasets. Experiments show that with AlignSum, PLMs like BART-Large surpass 175B GPT-3 in both automatic and human evaluations. This demonstrates that AlignSum significantly enhances the alignment of language models with human summarization preferences.
摘要:文本摘要任务通常采用预训练语言模型 (Pre-trained Language Models, PLMs) 来适应多样化的标准数据集。尽管这些 PLM 在自动评估中表现优异,但在人工评估中往往表现不佳,表明其生成的摘要与人类摘要偏好之间存在偏差。这种差异可能是由于微调数据集的质量较低,以及反映真实人类偏好的高质量人工标注数据的稀缺性。为了应对这一挑战,我们提出了一种新颖的人类摘要偏好对齐框架 AlignSum。该框架包括三个部分:首先,我们构建了一个包含抽取式、生成式和人工标注摘要数据的数据金字塔 (Data Pyramid)。其次,我们进行了高斯重采样 (Gaussian Resampling) 以去除长度极端的摘要。最后,我们在高斯重采样后的数据金字塔上实施了两阶段分层微调 (two-stage hierarchical fine-tuning)。我们将 AlignSum 应用于人工标注的 CNN/DailyMail 和 BBC XSum 数据集上的 PLM。实验表明,通过 AlignSum,BART-Large 等 PLM 在自动和人工评估中均超越了 175B 的 GPT-3。这表明 AlignSum 显著增强了语言模型与人类摘要偏好的对齐。
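
摘要未给出"高斯重采样"的实现细节;以下是一个基于假设的极简示意:按摘要长度拟合高斯分布,并按密度加权重采样,使长度极端的摘要几乎不会被保留,具体做法以原论文为准。

```python
import numpy as np

def gaussian_resample(summaries, n_samples=None, seed=0):
    """按长度的高斯密度加权重采样:长度越偏离均值,被保留的概率越低。"""
    rng = np.random.default_rng(seed)
    lengths = np.array([len(s.split()) for s in summaries], dtype=float)
    mu, sigma = lengths.mean(), lengths.std() + 1e-8
    density = np.exp(-0.5 * ((lengths - mu) / sigma) ** 2)
    probs = density / density.sum()
    n = n_samples or len(summaries)
    idx = rng.choice(len(summaries), size=n, replace=True, p=probs)
    return [summaries[i] for i in idx]

# 用法:极短或极长的摘要基本不会出现在重采样结果中
data = ["a b c", "word " * 400, "a concise well formed summary of the article"]
print(gaussian_resample(data, n_samples=2))
```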

[NLP-32] Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation COLING2025

【速读】: 该论文试图解决低资源语言在形态学标注任务中数据稀缺的问题。解决方案的关键在于提出了一种基于检索增强生成(RAG)框架的方法,该框架结合了大型语言模型(LLM)的解释能力和小型标记分类网络的可训练性。通过利用语言学知识(如语法规则)作为输入,弥补了数据和可训练参数的不足,从而在数据稀缺的环境中实现了显著的性能提升和效率改进。该方法不仅在目标语言上达到了新的技术水平,还为语言学家提供了一个更可靠和易用的形态学标注工具,能够为每个输出提供合理的解释和置信度评分。

链接: https://arxiv.org/abs/2410.00387
作者: Bhargav Shandilya,Alexis Palmer
关键词-EN: modeling technology pose, technology pose challenges, current language modeling, language modeling technology, compute requirements
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure, 5 tables, submitted to COLING 2025

点击查看摘要

Abstract:The data and compute requirements of current language modeling technology pose challenges for the processing and analysis of low-resource languages. Declarative linguistic knowledge has the potential to partially bridge this data scarcity gap by providing models with useful inductive bias in the form of language-specific rules. In this paper, we propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing. We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM. The results demonstrate that significant leaps in performance and efficiency are possible with the right combination of: a) linguistic inputs in the form of grammars, b) the interpretive power of LLMs, and c) the trainability of smaller token classification networks. We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages. Our work also offers documentary linguists a more reliable and more usable tool for morphological glossing by providing well-reasoned explanations and confidence scores for each output.
摘要:当前语言建模技术对数据和计算资源的需求,给低资源语言的处理和分析带来了挑战。声明性语言学知识以特定语言规则的形式为模型提供有用的归纳偏置,有望部分弥补这一数据稀缺的差距。本文提出了一种基于大语言模型 (LLM) 的检索增强生成 (RAG) 框架,用于校正针对形态标注任务的小型模型的输出。我们利用语言信息来弥补数据和可训练参数的不足,同时允许输入通过 LLM 解释和提炼的书面描述性语法。结果表明,通过以下组合可以显著提升性能和效率:a) 以语法形式输入的语言学信息,b) LLM 的解释能力,以及 c) 小型 Token 分类网络的可训练性。我们展示了一个紧凑的、由 RAG 支持的模型在数据稀缺环境中非常有效,为该任务和我们的目标语言实现了新的最先进水平。此外,我们的工作通过为每个输出提供合理的解释和置信度评分,为文献记录语言学家提供了一个更可靠、更易用的形态标注工具。

[NLP-33] Answer When Needed Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning

【速读】: 该论文试图解决大型语言模型(LLMs)在不同领域应用中需要选择性遗忘特定信息的问题。解决方案的关键在于提出了一种名为“in-context knowledge unlearning”的新方法,该方法通过微调预训练的LLMs,使其能够在测试时根据查询的上下文选择性地遗忘目标知识,同时保留其他知识。实验结果表明,该方法在TOFU和AGE数据集上使用Llama2-7B/13B和Mistral-7B模型时,能够达到高达95%的遗忘准确率,同时保留80%的无关知识,显著优于基线方法。此外,研究还揭示了微调后的LLMs在中间层生成正确预测并在最终层做出遗忘决策的内部行为,即“LLMs假装遗忘”,这为增强LLMs中遗忘机制的鲁棒性提供了有价值的见解。

链接: https://arxiv.org/abs/2410.00382
作者: Shota Takashiro,Takeshi Kojima,Andrew Gambardella,Qi Cao,Yusuke Iwasawa,Yutaka Matsuo
关键词-EN: unlearn specific information, selectively unlearn specific, large language models, diverse domains, increasingly essential
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are applied across diverse domains, the ability to selectively unlearn specific information has become increasingly essential. For instance, LLMs are expected to provide confidential information to authorized internal users, such as employees or trusted partners, while withholding it from external users, including the general public and unauthorized entities. In response to this challenge, we propose a novel method termed "in-context knowledge unlearning", which enables the model to selectively forget information at test time based on the context of the query. Our method fine-tunes pre-trained LLMs to enable prompt unlearning of target knowledge within the context, while preserving other knowledge. Experiments on the TOFU and AGE datasets using Llama2-7B/13B and Mistral-7B models show our method achieves up to 95% forgetting accuracy while retaining 80% of unrelated knowledge, significantly outperforming baselines in both in-domain and out-of-domain scenarios. Further investigation into the model's internal behavior revealed that while fine-tuned LLMs generate correct predictions in the middle layers and maintain them up to the final layer, they make the decision to forget at the last layer, i.e., "LLMs pretend to forget". Our findings offer valuable insights into enhancing the robustness of unlearning mechanisms in LLMs, setting a foundation for future research in the field.
摘要:随着大语言模型 (LLM) 在各个领域的广泛应用,选择性遗忘特定信息的能力变得愈发重要。例如,LLM 需要向授权的内部用户(如员工或受信任的合作伙伴)提供机密信息,同时对外部用户(包括公众和未授权实体)保密。针对这一挑战,我们提出了一种名为“情境知识遗忘”的新方法,该方法使模型能够根据查询的上下文在测试时选择性遗忘信息。我们的方法对预训练的 LLM 进行微调,以实现上下文中的目标知识遗忘,同时保留其他知识。在 TOFU 和 AGE 数据集上使用 Llama2-7B/13B 和 Mistral-7B 模型的实验表明,我们的方法在遗忘准确率上达到高达 95%,同时保留了 80% 的不相关知识,显著优于域内和域外的基线方法。进一步研究模型的内部行为发现,尽管微调后的 LLM 在中层生成正确的预测并保持到最后一层,但它们在最后一层做出遗忘决策,即“LLM 假装遗忘”。我们的研究为增强 LLM 中遗忘机制的鲁棒性提供了宝贵的见解,为该领域的未来研究奠定了基础。
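
论文的训练数据构造方式未在摘要中公开;下面的草图仅示意"基于查询上下文的选择性遗忘"微调样本的一种可能构造:当上下文指示需要遗忘目标知识时给出拒答,普通上下文则照常作答。模板与拒答话术均为假设。

```python
def build_unlearning_samples(qa_pairs, forget_topics):
    """qa_pairs: [(question, answer)];forget_topics: 需在特定上下文中遗忘的主题关键词。"""
    samples = []
    for q, a in qa_pairs:
        must_forget = any(t in q for t in forget_topics)
        # 情形一:上下文要求遗忘该类知识
        samples.append({
            "prompt": f"[上下文:请勿透露与 {'、'.join(forget_topics)} 相关的信息]\n问:{q}\n答:",
            "completion": "抱歉,我无法提供这方面的信息。" if must_forget else a,
        })
        # 情形二:普通上下文,照常回答,保证无关知识不受影响
        samples.append({"prompt": f"问:{q}\n答:", "completion": a})
    return samples

print(build_unlearning_samples([("公司的内部代号是什么?", "代号为 X-1。")], ["内部代号"]))
```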

[NLP-34] Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

【速读】: 该论文试图解决在多模态语言模型(MLM)和大语言模型(LLM)中,如何有效地融合异构模型的预测结果的问题。解决方案的关键在于提出了一个后验框架,称为“似然组合”(likelihood composition),其核心思想是在多选视觉问答任务中,通过组合多个模型的似然分布来进行预测。具体操作包括去偏(debias)、突出(highlight)、多数投票(majority-vote)和集成(ensemble)等基本方法,并通过混合组合(mix-composition)将这些基本方法结合起来,从而在多个VQA数据集和MLM模型上验证了其有效性。该框架不仅提供了融合异构模型的新视角,还为未来探索新的组合方法提供了基础。

链接: https://arxiv.org/abs/2410.00363
作者: Shitian Zhao,Renrui Zhang,Xu Luo,Yan Wang,Shanghang Zhang,Peng Gao
关键词-EN: large language models, multi-modal language models, language models, fusing heterogeneous models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model fusing has always been an important topic, especially in an era where large language models (LLM) and multi-modal language models (MLM) with different architectures, parameter sizes and training pipelines, are being created all the time. In this work, we propose a post-hoc framework, aiming at fusing heterogeneous models off-the-shelf, which we call "likelihood composition", and the basic idea is to compose multiple models' likelihood distribution when doing a multi-choice visual-question-answering task. Here the core concept, "likelihood", is actually the log-probability of the candidate answer. In likelihood composition, we introduce some basic operations: debias, highlight, majority-vote and ensemble. By combining (composing) these basic elements, we get the mixed composition methods: mix-composition. Through conducting comprehensive experiments on 9 VQA datasets and 10 MLMs, we prove the effectiveness of mix-composition compared with simple ensemble or majority-vote methods. In this framework, people can propose new basic composition methods and combine them to get the new mixed composition methods. We hope our proposed likelihood composition can provide a new perspective of fusing heterogeneous models and inspire the exploration under this framework.
摘要:模型融合一直是一个重要的话题,特别是在当前大语言模型 (LLM) 和多模态语言模型 (MLM) 不断涌现的时代,这些模型具有不同的架构、参数规模和训练流程。在这项工作中,我们提出了一种后验框架,旨在对异构模型进行即插即用的融合,我们称之为“似然组合”,其基本思想是在进行多选视觉问答任务时,组合多个模型的似然分布。这里的核心概念“似然”实际上是候选答案的对数概率。在“似然组合”中,我们引入了一些基本操作:去偏 (debias)、突出 (highlight)、多数投票 (majority-vote) 和集成 (ensemble)。通过结合(组合)这些基本元素,我们得到了混合组合方法:混合组合 (mix-composition)。通过在9个VQA数据集和10个MLM上进行全面的实验,我们证明了混合组合方法相对于简单的集成或多数投票方法的有效性。在这个框架中,人们可以提出新的基本组合方法,并将它们结合起来以获得新的混合组合方法。我们希望我们提出的“似然组合”能够为融合异构模型提供一个新的视角,并激发在这一框架下的探索。
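
下面用一个小例子示意"似然组合"中最基本的两种组合操作(集成与多数投票)在候选答案对数概率上的做法;各模型的得分为虚构数值,去偏、突出等操作的具体定义以原论文为准。

```python
import numpy as np

# 三个模型对同一道多选 VQA 题中 4 个候选答案的对数概率(虚构数值)
log_likelihoods = np.array([
    [-1.2, -0.4, -2.0, -1.8],   # 模型 A
    [-0.9, -1.1, -1.5, -2.2],   # 模型 B
    [-1.6, -0.3, -1.9, -2.5],   # 模型 C
])

# ensemble:把各模型的对数概率相加(等价于似然相乘)后取最大
ensemble_choice = int(np.argmax(log_likelihoods.sum(axis=0)))

# majority-vote:每个模型先各自选最优答案,再取票数最多者
votes = np.argmax(log_likelihoods, axis=1)
majority_choice = int(np.bincount(votes).argmax())

print(ensemble_choice, majority_choice)   # 此例中两者都选中答案 1
```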

[NLP-35] FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

【速读】: 该论文试图解决在大规模预训练语言模型(LMs)的下游任务微调过程中,由于数据收集引发的隐私问题以及联邦学习(FL)中模型参数访问受限和高计算、通信、内存开销的问题。解决方案的关键是提出了**Federated Proxy-Tuning (FedPT)**框架,该框架通过设备间协作微调一个较小的语言模型(LM),然后服务器将微调后的小LM的知识与预训练大LM的知识结合,构建一个代理微调的大LM,从而在不直接访问大LM参数的情况下,显著降低计算、通信和内存开销,同时保持与直接联邦微调大LM相当的性能。

链接: https://arxiv.org/abs/2410.00362
作者: Zhidong Gao,Yu Zhang,Zhenxiao Zhang,Yanmin Gong,Yuanxiong Guo
关键词-EN: large LMs, demonstrating superior performance, downstream tasks, demonstrating superior, variety of linguistic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, 19 figures

点击查看摘要

Abstract:Despite demonstrating superior performance across a variety of linguistic tasks, pre-trained large language models (LMs) often require fine-tuning on specific datasets to effectively address different downstream tasks. However, fine-tuning these LMs for downstream tasks necessitates collecting data from individuals, which raises significant privacy concerns. Federated learning (FL) has emerged as the de facto solution, enabling collaborative model training without sharing raw data. While promising, federated fine-tuning of large LMs faces significant challenges, including restricted access to model parameters and high computation, communication, and memory overhead. To address these challenges, this paper introduces Federated Proxy-Tuning (FedPT), a novel framework for federated fine-tuning of black-box large LMs, requiring access only to their predictions over the output vocabulary instead of their parameters. Specifically, devices in FedPT first collaboratively tune a smaller LM, and then the server combines the knowledge learned by the tuned small LM with the knowledge learned by the larger pre-trained LM to construct a large proxy-tuned LM that can reach the performance of directly tuned large LMs. The experimental results demonstrate that FedPT can significantly reduce computation, communication, and memory overhead while maintaining competitive performance compared to directly federated fine-tuning of large LMs. FedPT offers a promising solution for efficient, privacy-preserving fine-tuning of large LMs on resource-constrained devices, broadening the accessibility and applicability of state-of-the-art large LMs.
摘要:尽管预训练的大语言模型 (Large Language Models, LMs) 在多种语言任务中表现出色,但通常需要针对特定数据集进行微调以有效解决不同的下游任务。然而,为下游任务微调这些 LMs 需要从个人收集数据,这引发了显著的隐私问题。联邦学习 (Federated Learning, FL) 已成为事实上的解决方案,能够在不共享原始数据的情况下实现协作模型训练。尽管前景广阔,但大 LMs 的联邦微调面临重大挑战,包括对模型参数的访问受限以及高计算、通信和内存开销。为应对这些挑战,本文提出了 联邦代理微调 (Federated Proxy-Tuning, FedPT),这是一种新颖的框架,用于黑箱大 LMs 的联邦微调,仅需访问其对输出词汇的预测而非其参数。具体而言,FedPT 中的设备首先协作微调一个较小的 LM,然后服务器将微调后的小 LM 学到的知识与预训练的大 LM 学到的知识结合,构建一个能够达到直接微调大 LM 性能的大代理微调 LM。实验结果表明,FedPT 能够显著降低计算、通信和内存开销,同时在与直接联邦微调大 LMs 相比时保持竞争性能。FedPT 为在资源受限设备上高效、隐私保护的大 LMs 微调提供了一个有前景的解决方案,扩大了最先进大 LMs 的可访问性和适用性。
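
摘要只说明"在输出词表上把微调后小模型与预训练大模型的知识结合",并未给出公式;下面按常见的 proxy-tuning 做法给出一个假设性示意:用小模型微调前后的 logit 之差去偏移大模型的 logit,具体组合方式以论文原文为准。

```python
import numpy as np

def proxy_tuned_logits(logits_large, logits_small_tuned, logits_small_base, alpha=1.0):
    """常见的 proxy-tuning 组合(假设三个模型共享同一输出词表):
    大模型 logit + alpha * (微调后小模型 logit - 原始小模型 logit)。"""
    return logits_large + alpha * (logits_small_tuned - logits_small_base)

# 虚构的 5 个词的 logit,仅作演示
large = np.array([2.0, 1.0, 0.5, 0.1, -0.3])
small_tuned = np.array([0.2, 1.5, 0.1, 0.0, -0.2])
small_base = np.array([0.3, 0.4, 0.2, 0.1, -0.1])
print(np.argmax(proxy_tuned_logits(large, small_tuned, small_base)))  # 结果偏向微调后小模型更偏好的词
```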

[NLP-36] PclGPT: A Large Language Model for Patronizing and Condescending Language Detection EMNLP2024

【速读】: 该论文试图解决网络社区中存在的“居高临下”语言(Patronizing and Condescending Language, PCL)的检测问题。传统预训练语言模型(PLMs)在检测这种隐含毒性的语言时表现不佳,因为PCL具有虚伪和虚假同情的特征。论文的关键解决方案是引入了一个名为PclGPT的综合大语言模型(LLM)基准,通过收集、标注和整合Pcl-PT/SFT数据集,并采用全面的预训练和监督微调阶梯过程,开发了双语PclGPT-EN/CN模型组,以提升对隐含毒性语言的检测能力。研究结果揭示了PCL对不同弱势群体的偏见程度存在显著差异,强调了社会对此类语言的关注和保护弱势群体的必要性。

链接: https://arxiv.org/abs/2410.00361
作者: Hongbo Wang,Mingda Li,Junyu Lu,Hebin Xia,Liang Yang,Bo Xu,Ruizhu Liu,Hongfei Lin
关键词-EN: Disclaimer, language, Samples, PCL, language models
类目: Computation and Language (cs.CL)
备注: Accepted for EMNLP2024 (Findings)

点击查看摘要

Abstract:Disclaimer: Samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an essential branch of toxic language, this type of language exacerbates conflicts and confrontations among Internet communities and detrimentally impacts disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly in detecting PCL due to its implicit toxicity traits like hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxic detection. Group detection results and fine-grained detection from PclGPT and other models reveal significant variations in the degree of bias in PCL towards different vulnerable groups, necessitating increased societal attention to protect them.
摘要:
免责声明: 本文中的样本可能具有危害性并引起不适!

居高临下和傲慢的语言 (Patronizing and Condescending Language, PCL) 是一种针对弱势群体的言语形式。作为有毒语言的一个重要分支,这种语言加剧了互联网社区内的冲突和对立,并对弱势群体产生了不利影响。传统的预训练语言模型 (Pre-trained Language Models, PLMs) 在检测 PCL 方面表现不佳,因为 PCL 带有虚伪、虚假同情等隐性毒性特征。随着大语言模型 (Large Language Models, LLMs) 的兴起,我们可以利用其丰富的情感语义来建立探索隐性毒性的范式。本文介绍了 PclGPT,一个专门为 PCL 设计的综合大语言模型基准。我们收集、标注并整合了 Pcl-PT/SFT 数据集,然后通过全面的预训练和监督微调阶梯过程,开发了双语 PclGPT-EN/CN 模型组,以促进隐性毒性检测。PclGPT 和其他模型的群体检测结果和细粒度检测揭示了 PCL 对不同弱势群体的偏见程度存在显著差异,需要社会给予更多关注来保护这些群体。

[NLP-37] Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness

【速读】: 该论文试图解决大型语言模型(LLMs)在可控性方面的局限性问题。解决方案的关键在于提出了一个名为“Self-controller”的新型代理框架,通过引入自我意识到LLMs的推理逻辑中,使其能够基于自身响应保持状态,从而实现对当前状态的自我感知,并在多轮链式思维范式中逐步推理。该框架通过实验验证了其在文本长度状态控制上的有效性,并利用二分搜索算法加速生成过程。此外,结合DeepSeek的上下文缓存技术,显著减少了计算资源消耗,理论上的额外时间复杂度为O(c \log n),且在单词约束的消融研究中展示了跨基础模型的持续可控性。

链接: https://arxiv.org/abs/2410.00359
作者: Xiao Peng,Xufan Geng
关键词-EN: large language models, applications of large, large language, widely spread, Self-controller
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:The applications of large language models (LLMs) have been widely spread across all domains. However, the basic abilities such as the controllability of LLMs are still limited. To address this, we propose “Self-controller”, a novel agentic framework bringing self-awareness into LLMs’ reasoning logic. The core idea of this work is to maintain states based on the LLM’s response, letting the LLM become self-aware of current status and think step by step in a multi-round chain-of-thought paradigm. Our experiment on the state of textual length has shown the controllability and effectiveness of the Self-controller. We further implement a binary search algorithm to accelerate the generation process based on the linearity and monotonicity of the textual length state. Another advantage of the Self-controller comes with DeepSeek’s Context Caching technology, which significantly saves computational token consumption when a cluster of conversations shares the same prefix of context. Theoretically, we prove that in this scenario the extra time complexity is O(c \log n) . Results of the back-of-the-envelope estimation suggest that the token consumption of our method is no more than twice as much as that of the trivial single-round generation. Furthermore, our ablation study on word constraints demonstrates the Self-controller’s consistent controllability across all foundation models.
摘要:大语言模型 (LLM) 的应用已广泛渗透到各个领域。然而,LLM 的基本能力,如可控性,仍然有限。为此,我们提出了“自控器 (Self-controller)”,这是一种将自我意识引入 LLM 推理逻辑的新型智能体框架。该工作的核心思想是基于 LLM 的响应来维持状态,使 LLM 能够自我感知当前状态,并在多轮链式思维范式中逐步思考。我们在文本长度状态上的实验显示了自控器的可控性和有效性。我们进一步实现了一个二分搜索算法,以基于文本长度状态的线性和单调性加速生成过程。自控器的另一个优势在于 DeepSeek 的上下文缓存技术,当一组对话共享相同的上下文前缀时,该技术显著节省了计算 Token 消耗。理论上,我们证明了在此场景下额外的时间复杂度为 O(c \log n)。粗略估算的结果表明,我们的方法的 Token 消耗不超过简单单轮生成的两倍。此外,我们在词约束上的消融研究展示了自控器在所有基础模型上的持续可控性。
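
下面用一个与具体模型无关的小例子,示意摘要中"利用文本长度状态的线性与单调性做二分搜索"的基本逻辑:假设存在一个长度随控制参数单调递增的生成函数,通过二分逼近目标长度。生成函数在此仅为占位。

```python
def generate_with_budget(budget: int) -> str:
    """占位:假设生成长度随 budget 单调递增;实际应由 LLM 在多轮状态感知下逐步生成。"""
    return "词 " * budget

def binary_search_length(target_len: int, lo: int = 1, hi: int = 4096) -> str:
    """在 [lo, hi] 上二分搜索控制参数,使生成文本的词数尽量接近 target_len。"""
    best = ""
    while lo <= hi:
        mid = (lo + hi) // 2
        text = generate_with_budget(mid)
        n = len(text.split())
        if not best or abs(n - target_len) < abs(len(best.split()) - target_len):
            best = text
        if n < target_len:
            lo = mid + 1
        elif n > target_len:
            hi = mid - 1
        else:
            return text
    return best

print(len(binary_search_length(300).split()))   # 300
```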

[NLP-38] Hierarchical Organization Simulacra in the Investment Sector

【速读】: 该论文试图解决如何设计具有专业投资行为的人工组织的问题,解决方案的关键在于采用多代理模拟方法,模仿投资公司中的层级决策过程,并利用新闻文章来指导决策。通过分析超过115,000篇新闻文章,研究结果表明,这种层级模拟方法在决策频率和盈利能力上与专业交易者的选择高度一致,但也揭示了决策中的偏见,特别是提示词的变化和代理的感知资历对结果有显著影响。

链接: https://arxiv.org/abs/2410.00354
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词-EN: paper explores designing, explores designing artificial, designing artificial organizations, paper explores, explores designing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores designing artificial organizations with professional behavior in investments using a multi-agent simulation. The method mimics hierarchical decision-making in investment firms, using news articles to inform decisions. A large-scale study analyzing over 115,000 news articles of 300 companies across 15 years compared this approach against professional traders’ decisions. Results show that hierarchical simulations align closely with professional choices, both in frequency and profitability. However, the study also reveals biases in decision-making, where changes in prompt wording and perceived agent seniority significantly influence outcomes. This highlights both the potential and limitations of large language models in replicating professional financial decision-making.
摘要:本文探讨了通过多智能体模拟设计具有专业投资行为的人工组织。该方法模仿了投资公司中的层级决策过程,利用新闻文章来指导决策。一项大规模研究分析了15年间300家公司的115,000多篇新闻文章,将这种方法与专业交易员的决策进行了比较。结果显示,层级模拟在频率和盈利能力上与专业选择高度一致。然而,研究也揭示了决策中的偏见,其中提示词的变化和智能体感知到的资历显著影响结果。这突显了大语言模型在复制专业金融决策方面的潜力和局限性。
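
摘要描述的是"模仿投资公司层级决策"的多智能体提示流程;以下是一个高度简化的假设性骨架:分析师智能体先就新闻各自给出建议,经理智能体再汇总做最终决定。其中 llm() 为任意聊天模型调用的占位函数,提示词亦为虚构。

```python
def llm(prompt: str) -> str:
    """占位:实际应调用某个聊天式大语言模型;此处返回固定文本以便示例可直接运行。"""
    return "观望:信息不足。"

def analyst_opinion(news: str, seniority: str) -> str:
    return llm(f"你是一名{seniority}投资分析师,请根据以下新闻给出 买入/卖出/观望 建议并说明理由:\n{news}")

def manager_decision(news: str, opinions: list[str]) -> str:
    joined = "\n".join(f"- {o}" for o in opinions)
    return llm(f"你是投资经理,下属分析师对这则新闻的意见如下:\n{joined}\n请给出最终决策(买入/卖出/观望)。")

def hierarchical_trade_decision(news: str) -> str:
    opinions = [analyst_opinion(news, s) for s in ("初级", "高级")]
    return manager_decision(news, opinions)

print(hierarchical_trade_decision("某公司发布超出预期的季度财报。"))
```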

[NLP-39] Sparse Attention Decomposition Applied to Circuit Tracing

【速读】: 该论文试图解决的问题是如何在GPT-2模型中隔离和识别注意力头之间用于通信和协调的特征。解决方案的关键在于发现这些特征在注意力头矩阵的奇异向量中通常以稀疏编码的形式存在,这种稀疏编码使得信号能够从背景噪声中有效分离,并简化了注意力头之间通信路径的识别。通过追踪在间接对象识别(IOI)任务中使用的电路部分,研究揭示了先前研究中未发现的细节,并识别了在执行IOI任务时注意力头之间用于通信的具体特征。

链接: https://arxiv.org/abs/2410.00340
作者: Gabriel Franco,Mark Crovella
关键词-EN: attention heads, perform complex tasks, attention, attention heads work, papers have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many papers have shown that attention heads work in conjunction with each other to perform complex tasks. It’s frequently assumed that communication between attention heads is via the addition of specific features to token residuals. In this work we seek to isolate and identify the features used to effect communication and coordination among attention heads in GPT-2 small. Our key leverage on the problem is to show that these features are very often sparsely coded in the singular vectors of attention head matrices. We characterize the dimensionality and occurrence of these signals across the attention heads in GPT-2 small when used for the Indirect Object Identification (IOI) task. The sparse encoding of signals, as provided by attention head singular vectors, allows for efficient separation of signals from the residual background and straightforward identification of communication paths between attention heads. We explore the effectiveness of this approach by tracing portions of the circuits used in the IOI task. Our traces reveal considerable detail not present in previous studies, shedding light on the nature of redundant paths present in GPT-2. And our traces go beyond previous work by identifying features used to communicate between attention heads when performing IOI.
摘要:许多论文表明,注意力头通过相互协作来执行复杂任务。通常假设注意力头之间的通信是通过向 Token 残差添加特定特征来实现的。在本研究中,我们旨在隔离并识别 GPT-2 small 中用于注意力头之间通信和协调的特征。我们的关键突破在于揭示这些特征在注意力头矩阵的奇异向量中通常以稀疏编码形式存在。我们描述了这些信号在 GPT-2 small 中用于间接对象识别 (IOI) 任务时的维度及其出现情况。注意力头奇异向量提供的稀疏编码信号,使得信号能够从残差背景中高效分离,并能直接识别注意力头之间的通信路径。我们通过追踪 IOI 任务中使用的部分电路来探索这种方法的有效性。我们的追踪结果揭示了先前研究中未曾呈现的细节,阐明了 GPT-2 中存在的冗余路径的本质。此外,我们的追踪超越了以往的工作,识别了在执行 IOI 任务时用于注意力头之间通信的特征。
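
下面用随机数据示意论文所述分析流程中的一个核心步骤:对注意力头的 QK 交互矩阵做奇异值分解,并把 token 残差投影到前若干个奇异向量上,以观察通信信号是否稀疏地集中在少数方向上。维度与数据均为虚构,仅作演示。

```python
import numpy as np

# 假设的维度:GPT-2 small 的残差维度 768、单头维度 64,序列长度 12
d_model, d_head, n_tokens = 768, 64, 12
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
resid = rng.normal(size=(n_tokens, d_model))   # token 残差流(此处为随机占位)

# QK 交互矩阵及其奇异值分解
W_QK = W_Q @ W_K.T                  # (d_model, d_model)
U, S, Vt = np.linalg.svd(W_QK)

# 把残差分别投影到前 k 个左/右奇异向量上,真实分析中可据此考察信号的稀疏性
k = 8
proj_q = resid @ U[:, :k]           # 查询侧投影
proj_k = resid @ Vt[:k, :].T        # 键侧投影
print(proj_q.shape, proj_k.shape)   # (12, 8) (12, 8)
```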

[NLP-40] Preserving Generalization of Language models in Few-shot Continual Relation Extraction EMNLP2024

【速读】: 该论文试图解决少样本持续关系抽取(Few-shot Continual Relations Extraction, FCRE)问题,即在有限标注数据的情况下,模型能够持续学习新关系并避免灾难性遗忘,同时保留预训练模型的先验知识。解决方案的关键在于利用常被忽视的语言模型头部组件,并通过互信息最大化策略来维持预训练模型的先验知识,同时策略性地调整主要分类头部,从而提升模型性能。此外,论文还探讨了大型语言模型(LLMs)在解决FCRE挑战中的潜力。

链接: https://arxiv.org/abs/2410.00334
作者: Quyen Tran,Nguyen Xuan Thanh,Nguyen Hoang Anh,Nam Le Hai,Trung Le,Linh Van Ngo,Thien Huu Nguyen
关键词-EN: Continual Relations Extraction, Few-shot Continual Relations, Few-shot Continual, Relations Extraction, limited labeled data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Few-shot Continual Relations Extraction (FCRE) is an emerging and dynamic area of study where models can sequentially integrate knowledge from new relations with limited labeled data while circumventing catastrophic forgetting and preserving prior knowledge from pre-trained backbones. In this work, we introduce a novel method that leverages often-discarded language model heads. By employing these components via a mutual information maximization strategy, our approach helps maintain prior knowledge from the pre-trained backbone and strategically aligns the primary classification head, thereby enhancing model performance. Furthermore, we explore the potential of Large Language Models (LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges. Our comprehensive experimental results underscore the efficacy of the proposed method and offer valuable insights for future work.
摘要:少样本持续关系提取 (Few-shot Continual Relations Extraction, FCRE) 是一个新兴且动态的研究领域,模型能够在有限标注数据的情况下,顺序地整合新关系知识,同时避免灾难性遗忘并保留预训练骨干网络中的先验知识。在本研究中,我们提出了一种利用常被丢弃的语言模型头部的新方法。通过采用互信息最大化策略来使用这些组件,我们的方法有助于维持预训练骨干网络中的先验知识,并策略性地对齐主要分类头部,从而提升模型性能。此外,我们还探讨了大语言模型 (Large Language Models, LLMs) 在解决 FCRE 挑战中的潜力,这些模型以其丰富的知识而闻名。我们的综合实验结果突显了所提出方法的有效性,并为未来的工作提供了宝贵的见解。

[NLP-41] PointAD: Comprehending 3D Anomalies from Points and Pixels for Zero-shot 3D Anomaly Detection NEURIPS2024

【速读】: 该论文试图解决零样本(Zero-shot, ZS)3D异常检测问题,特别是在目标3D训练样本不可用的情况下,如隐私保护等实际问题。解决方案的关键在于引入PointAD方法,通过利用CLIP的强大泛化能力来识别未见过的3D物体上的异常。PointAD提供了一个统一的框架,能够从点和像素两个维度理解3D异常,通过将3D异常渲染成多个2D图像并投影回3D空间,结合混合表示学习优化可学习的文本提示,从而捕捉通用的异常语义。通过点与像素表示的协同优化,模型能够掌握潜在的3D异常模式,实现对未见过的多样化3D物体的异常检测和分割。此外,通过3D和2D空间的对齐,模型可以直接整合RGB信息,以即插即用的方式增强对3D异常的理解。

链接: https://arxiv.org/abs/2410.00320
作者: Qihang Zhou,Jiangtao Yan,Shibo He,Wenchao Meng,Jiming Chen
关键词-EN: scenarios where target, training samples, privacy protection, crucial yet unexplored, unexplored field
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: NeurIPS 2024

点击查看摘要

Abstract:Zero-shot (ZS) 3D anomaly detection is a crucial yet unexplored field that addresses scenarios where target 3D training samples are unavailable due to practical concerns like privacy protection. This paper introduces PointAD, a novel approach that transfers the strong generalization capabilities of CLIP for recognizing 3D anomalies on unseen objects. PointAD provides a unified framework to comprehend 3D anomalies from both points and pixels. In this framework, PointAD renders 3D anomalies into multiple 2D renderings and projects them back into 3D space. To capture the generic anomaly semantics into PointAD, we propose hybrid representation learning that optimizes the learnable text prompts from 3D and 2D through auxiliary point clouds. The collaboration optimization between point and pixel representations jointly facilitates our model to grasp underlying 3D anomaly patterns, contributing to detecting and segmenting anomalies of unseen diverse 3D objects. Through the alignment of 3D and 2D space, our model can directly integrate RGB information, further enhancing the understanding of 3D anomalies in a plug-and-play manner. Extensive experiments show the superiority of PointAD in ZS 3D anomaly detection across diverse unseen objects.
摘要:零样本 (Zero-shot, ZS) 三维异常检测是一个关键但尚未充分探索的领域,它解决了由于隐私保护等实际问题导致目标三维训练样本不可用的情况。本文介绍了 PointAD,这是一种新颖的方法,它利用 CLIP 强大的泛化能力来识别未见过的三维物体上的异常。PointAD 提供了一个统一的框架,可以从点和像素两个方面理解三维异常。在该框架中,PointAD 将三维异常渲染成多个二维图像,并将它们投影回三维空间。为了将通用的异常语义捕捉到 PointAD 中,我们提出了混合表示学习,通过辅助点云优化从三维和二维中学习到的可学习文本提示。点与像素表示之间的协作优化共同促进了我们的模型掌握底层三维异常模式,有助于检测和分割未见过的多样化三维物体的异常。通过三维和二维空间的对齐,我们的模型可以直接整合 RGB 信息,进一步以即插即用的方式增强对三维异常的理解。广泛的实验表明,PointAD 在跨多样未见物体的零样本三维异常检测中具有优越性。

[NLP-42] EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control EMNLP2024

【速读】: 该论文试图解决当前文本到语音(TTS)技术在情感表达和控制方面的不足,即用户无法选择和精细控制语音中的情感类型和强度。解决方案的关键在于提出了EmoKnob框架,该框架利用基础语音克隆模型的表达性说话者表示空间,通过少样本示例实现对任意情感的精细控制。EmoKnob框架通过两种方法将开放式文本描述的情感应用于情感控制,提供了一个直观的界面来控制多样化的细微情感。此外,论文还引入了一套评估指标,以系统地评估情感控制框架的忠实度和可识别性,通过客观和主观评估证明了其情感表达效果优于商业TTS服务。

链接: https://arxiv.org/abs/2410.00316
作者: Haozhe Chen,Run Chen,Julia Hirschberg
关键词-EN: technology produce natural, emotion control, technology produce, emotion control framework, produce natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: EMNLP 2024 Main

点击查看摘要

Abstract:While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
摘要:尽管近期文本到语音 (Text-to-Speech, TTS) 技术的进步能够生成自然且富有表现力的语音,但缺乏让用户选择情感并控制其强度的选项。我们提出了 EmoKnob,这是一个框架,允许在语音合成中通过少样本 (few-shot) 示范样本对任意情感进行细粒度的情感控制。我们的框架利用了基础语音克隆模型近期进展所实现的表现力说话者表示空间。基于我们情感控制框架的少样本能力,我们提出了两种方法,用于在开放式文本描述的情感上应用情感控制,从而实现了一个直观的界面,用于控制多样且细微的情感。为了促进情感语音合成领域的系统化发展,我们引入了一套评估指标,旨在严格评估情感控制框架的忠实度和可识别性。通过客观和主观的评估,我们展示了我们的情感控制框架能够有效地将情感嵌入语音中,并超越了商业 TTS 服务的情感表现力。
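
摘要只说明 EmoKnob 利用语音克隆模型的说话人表示空间和少样本情感示例实现可调强度的情感控制;下面的"情感方向向量"写法只是一种可能的示意(用带情感与中性样本嵌入的均值差作为方向,再按强度叠加),未必与论文的具体做法一致。

```python
import numpy as np

def emotion_direction(emotional_embs: np.ndarray, neutral_embs: np.ndarray) -> np.ndarray:
    """用少量(情感样本, 中性样本)的嵌入均值差估计情感方向向量。"""
    return emotional_embs.mean(axis=0) - neutral_embs.mean(axis=0)

def apply_emotion(speaker_emb: np.ndarray, direction: np.ndarray, strength: float = 0.5) -> np.ndarray:
    """以可调强度(旋钮)把情感方向叠加到说话人表示上,再交给克隆模型合成语音。"""
    return speaker_emb + strength * direction

rng = np.random.default_rng(0)
emo, neu = rng.normal(size=(4, 256)), rng.normal(size=(4, 256))   # 虚构的嵌入
print(apply_emotion(rng.normal(size=256), emotion_direction(emo, neu), strength=0.8).shape)
```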

[NLP-43] Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis MICCAI2024

【速读】: 该论文试图解决眼表疾病诊断中传统人工评估精度不足以及现有机器方法无法推理临床相关性的问题。解决方案的关键在于引入多模态诊断管道(MDPipe),通过使用大型语言模型(LLMs)来整合临床数据源(如睑板腺图像和临床元数据),并利用视觉翻译器将睑板腺图像转化为可量化的形态数据,结合临床元数据生成临床报告摘要,最终通过领域专家的诊断见解来增强LLMs的推理能力,从而实现更准确和具有临床合理性的诊断。

链接: https://arxiv.org/abs/2410.00292
作者: Chun-Hsiao Yeh,Jiayun Wang,Andrew D. Graham,Andrea J. Liu,Bo Tan,Yubei Chen,Yi Ma,Meng C. Lin
关键词-EN: Accurate diagnosis, ocular surface disease, surface disease diagnosis, optometry and ophthalmology, critical in optometry
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MICCAI 2024. Project Webpage: this https URL

点击查看摘要

Abstract:Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology, which hinge on integrating clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnoses as multi-class classification problems, limiting the diagnoses to a predefined closed-set of curated answers without reasoning the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) by employing large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce a LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata, and generate clinical report summaries. Finally, we refine the LLMs’ reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses.
摘要:眼表疾病的准确诊断在眼科和眼视光学中至关重要,这依赖于整合临床数据源(例如,睑板腺成像和临床元数据)。传统的人工评估在量化临床观察方面缺乏精确性,而当前基于机器的方法通常将诊断视为多类别分类问题,限制了诊断结果只能局限于预定义的封闭集合中的精选答案,而无法推理每个变量与诊断之间的临床相关性。为了应对这些挑战,我们引入了一种创新的跨模态诊断流程(MDPipe),通过采用大语言模型(LLMs)进行眼表疾病诊断。我们首先使用视觉翻译器将睑板腺图像转换为可量化的形态数据,便于其与临床元数据的整合,并使LLMs能够理解细微的医学洞察。为进一步增强这种沟通,我们引入了一个基于LLM的总结器,用于将形态数据和临床元数据的洞察上下文化,并生成临床报告摘要。最后,我们通过实际临床医生诊断中的领域特定洞察来优化LLMs的推理能力。我们在多种眼表疾病诊断基准上的评估表明,MDPipe优于现有的标准,包括GPT-4,并为诊断提供了临床上合理的理由。

[NLP-44] Social Conjuring: Multi-User Runtime Collaboration with AI in Building Virtual 3D Worlds

【速读】: 该论文试图解决生成式人工智能在虚拟世界创建过程中如何融入社交互动的问题。解决方案的关键在于提出了Social Conjurer框架,该框架支持多用户实时协作构建和修改3D场景,通过结合社交互动、工具使用和空间推理,促进丰富多样虚拟环境的创建。这一方法不仅强调了AI支持下的多用户世界构建潜力,还为将AI模型融入3D内容生成的人性化界面设计提供了新的路径。

链接: https://arxiv.org/abs/2410.00274
作者: Cyan DeVeaux,Amina Kobenova,Samyak Parajuli,Andrzej Banburski-Fahey,Judith Amores Fernandez,Jaron Lanier
关键词-EN: Generative artificial intelligence, Generative artificial, artificial intelligence, intelligence has shown, shown promise
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注: 27 pages + Appendix, 16 figures

点击查看摘要

Abstract:Generative artificial intelligence has shown promise in prompting virtual worlds into existence, yet little attention has been given to understanding how this process unfolds as social interaction. We present Social Conjurer, a framework for AI-augmented dynamic 3D scene co-creation, where multiple users collaboratively build and modify virtual worlds in real-time. Through an expanded set of interactions, including social and tool-based engagements as well as spatial reasoning, our framework facilitates the creation of rich, diverse virtual environments. Findings from a preliminary user study (N=12) provide insight into the user experience of this approach, how social contexts shape the prompting of spatial environments, and perspective on social applications of prompt-based 3D co-creation. In addition to highlighting the potential of AI-supported multi-user world creation and offering new pathways for AI-augmented creative processes in VR, this article presents a set of implications for designing human-centered interfaces that incorporate AI models into 3D content generation.
摘要:生成式人工智能 (Generative AI) 在推动虚拟世界生成方面展现出巨大潜力,然而,对于这一过程如何在社会互动中展开的理解却相对较少。我们提出了 Social Conjurer,这是一个用于 AI 增强的动态 3D 场景协同创作框架,允许多个用户实时协作构建和修改虚拟世界。通过扩展一系列交互方式,包括社交互动、工具使用以及空间推理,我们的框架促进了丰富多样虚拟环境的创建。来自初步用户研究 (N=12) 的发现揭示了这一方法的用户体验,社交情境如何塑造空间环境的生成,以及基于提示的 3D 协同创作在社交应用中的前景。除了强调 AI 支持的多用户世界创作的潜力,并提供 AI 增强创意过程在虚拟现实 (VR) 中的新途径外,本文还提出了一套将 AI 模型融入 3D 内容生成的人性化界面设计启示。

[NLP-45] DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining

【速读】: 该论文试图解决大型语言模型(LLMs)在特定行业领域(如医疗和金融)中表现不佳的问题,特别是在这些领域数据稀缺或模型缺乏真实世界数据的情况下。解决方案的关键在于提出了一种自动化且可扩展的框架——DoPAMine,该框架通过利用LLM的参数知识生成特定领域的多样化种子数据,并以此从大型数据语料库(如Common Crawl)中挖掘真实世界的数据,用于语言模型的领域适应性预训练。实验结果表明,DoPAMine显著提升了预训练LLMs在医疗和金融领域任务中的性能。

链接: https://arxiv.org/abs/2410.00260
作者: Vinayak Arannil,Sourav Sanjukta Bhabesh,Neha Narwal,Sai Nikhil Thirandas,Darren Yow-Bang Wang,Graham Horwood,Alex Anto Chirayath,Gouri Pandeshwar
关键词-EN: Large Language Models, shown remarkable ability, Language Models, data, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs for generating domain-specific synthetic data but most often they lack in truthfulness and complexity. Alternatively, in cases where domain data is available like healthcare and finance most of the LMs are proprietary necessitating the need for a scalable method to curate real world industry specific pre-training data. In this work, we propose an automated and scalable framework - DoPAMine:Domain-specific Pre-training Adaptation from seed-guided data Mining, to mine domain specific training data from a large data corpus for domain adaptation of a LM. The framework leverages the parametric knowledge of a LLM to generate diverse and representative seed data tailored to a specific domain which is then used to mine real world data from a large data corpus like Common Crawl. We evaluated our framework’s performance in the continual pre-training (CPT) setting by training two domain specific 7B parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from MMLU, MedQA, MedMCQA and PubMedQA datasets, and 2.9% and 6.7% for zero-shot and 5-shot settings respectively on finance tasks from FiQA-SA, FPB and Headlines datasets when compared to the baseline.
摘要:大语言模型 (LLMs) 在跨多个行业领域执行各种任务时展现了卓越的泛化能力。这些能力大多源自语言模型 (LMs) 预训练阶段所使用的数据。然而,当这些模型被要求在专业或资源匮乏的行业领域中执行任务时,其表现存在局限性。近期的一些方法利用 LLMs 生成领域特定的合成数据,但这些数据往往在真实性和复杂性上有所欠缺。另一方面,在医疗和金融等领域,尽管存在领域数据,但大多数 LMs 是专有的,这促使我们需要一种可扩展的方法来策划真实世界的行业特定预训练数据。在本研究中,我们提出了一种自动化且可扩展的框架——DoPAMine:基于种子引导数据挖掘的领域特定预训练适应,用于从大数据语料库中挖掘领域特定的训练数据,以适应 LMs 的领域需求。该框架利用 LLM 的参数化知识生成多样且具有代表性的种子数据,这些数据针对特定领域定制,并用于从 Common Crawl 等大数据语料库中挖掘真实世界的数据。我们通过在持续预训练 (CPT) 设置下,使用 DoPAMine 挖掘的数据训练了两个领域特定的 7B 参数 LMs(分别用于医疗和金融领域),评估了该框架的性能。实验结果显示,与基线相比,DoPAMine 在 MMLU、MedQA、MedMCQA 和 PubMedQA 数据集上的医疗任务中,分别在零样本和 5-shot 设置下,平均提升了预训练 LLMs 的性能 4.9% 和 5.1%;在 FiQA-SA、FPB 和 Headlines 数据集上的金融任务中,分别在零样本和 5-shot 设置下,平均提升了 2.9% 和 6.7%。
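
下面给出一个把"种子数据引导的语料挖掘"压缩到几行代码的示意:先给定(或由 LLM 生成的)领域种子文本,再按与种子的相似度从大语料中筛选文档。这里用词汇重叠度充当相似度占位,实际系统应使用嵌入模型检索,且种子生成环节在此省略。

```python
def mine_domain_data(seed_texts, corpus, top_k=2):
    """占位打分:用与种子词汇的 Jaccard 重叠度近似相似度,返回得分最高的 top_k 篇文档。"""
    seed_words = set(w for s in seed_texts for w in s.lower().split())

    def score(doc: str) -> float:
        words = set(doc.lower().split())
        return len(words & seed_words) / max(len(words | seed_words), 1)

    return sorted(corpus, key=score, reverse=True)[:top_k]

seeds = ["patient was diagnosed with type 2 diabetes", "clinical trial of a new drug"]
corpus = [
    "the patient received insulin for diabetes",
    "the football match ended in a draw",
    "drug dosage in the clinical trial was adjusted",
]
print(mine_domain_data(seeds, corpus))   # 返回两条医疗相关文档
```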

[NLP-46] Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

【速读】: 该论文试图解决3D大语言模型(3DLLMs)在构建通用3D世界代理时面临的挑战,主要问题在于缺乏高质量的鲁棒指令跟随数据,导致模型的判别能力和泛化能力有限。解决方案的关键在于引入了一个名为Robin3D的强大3DLLM,该模型通过使用一种新颖的数据引擎——鲁棒指令生成(RIG)引擎,生成了大规模的指令跟随数据。RIG引擎生成了两种关键的指令数据:对抗性指令跟随数据和多样化指令跟随数据,分别用于增强模型的判别理解和泛化能力。此外,Robin3D通过引入关系增强投影器来增强空间理解,并通过ID-特征绑定来强化对象的指代和定位能力,从而在多个3D多模态学习基准测试中显著优于先前的方法。

链接: https://arxiv.org/abs/2410.00255
作者: Weitai Kang,Haifeng Huang,Yuzhang Shang,Mubarak Shah,Yan Yan
关键词-EN: Large Language Models, Large Language, building general-purpose agents, challenges remain due, high-quality robust instruction-following
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG) engine. RIG generates two key instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model’s discriminative understanding. 2) the Diverse Instruction-following data, which contains various instruction styles to enhance model’s generalization. As a result, we construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
摘要:近年来,3D 大语言模型 (3DLLM) 的发展突显了其在构建 3D 真实世界中通用智能体方面的潜力,但由于缺乏高质量的鲁棒指令跟随数据,导致 3DLLM 的判别能力和泛化能力有限。本文介绍了 Robin3D,一种强大的 3DLLM,该模型基于我们新开发的数据引擎——鲁棒指令生成 (RIG) 引擎生成的大规模指令跟随数据进行训练。RIG 生成两种关键指令数据:1) 对抗性指令跟随数据,包含混合的负样本和正样本,以增强模型的判别性理解;2) 多样性指令跟随数据,包含多种指令风格,以增强模型的泛化能力。由此,我们构建了 100 万条指令跟随数据,其中包括 34.4 万条对抗性样本、50.8 万条多样性样本和 16.5 万条基准训练集样本。为了更好地处理这些复杂指令,Robin3D 首先采用关系增强投影器 (Relation-Augmented Projector) 来增强空间理解,然后通过 ID-特征绑定 (ID-Feature Bonding) 强化对象的指代和定位能力。Robin3D 在五个广泛使用的 3D 多模态学习基准测试中持续优于以往的方法,且无需进行任务特定的微调。值得注意的是,我们在定位任务 (Multi3DRefer) 中实现了 7.8% 的提升,在描述任务 (Scan2Cap) 中实现了 6.9% 的提升。

[NLP-47] MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans

【速读】: 该论文试图解决在虚拟现实环境中,如何通过丰富的上下文信息生成与语音同步的手势(co-speech gesture)的问题。解决方案的关键在于利用VR头戴设备在一个物理模拟器(AI2-THOR)中记录参与者之间的对话,并收集包括动作捕捉、语音、注视方向和场景图在内的多模态数据。这些数据为开发和改进3D场景中的手势生成模型提供了多样化和上下文丰富的数据支持。

链接: https://arxiv.org/abs/2410.00253
作者: Anna Deichler,Jim O’Regan,Jonas Beskow
关键词-EN: physics simulator, headset to record, record conversations, Abstract, dataset captured
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.
摘要:本文介绍了一种新颖的数据集,该数据集通过使用 VR 头戴设备在物理模拟器 (AI2-THOR) 中记录参与者之间的对话而获得。我们的主要目标是扩展协同语音手势生成领域,通过在指称性设置中融入丰富的上下文信息。参与者参与了基于指称性沟通任务的各种对话场景。该数据集提供了丰富的多模态记录,如动作捕捉、语音、注视和场景图。这一综合数据集旨在通过提供多样化和上下文丰富的数据,增强对手势生成模型在 3D 场景中的理解和开发。

[NLP-48] A Methodology for Explainable Large Language Models with Integrated Gradients and Linguistic Analysis in Text Classification

【速读】: 该论文试图解决神经退行性疾病(如阿尔茨海默病)在言语表达中的特征识别问题,并提升大型语言模型(LLM)在识别这些特征时的可解释性。解决方案的关键在于开发了一种名为SLIME的可解释LLM方法,通过结合集成梯度(IG)、语言探究与词汇计数(LIWC)和统计分析,识别出与阿尔茨海默病相关的词汇成分,并明确这些成分对模型决策的重要性。具体来说,SLIME方法揭示了BERT模型在识别阿尔茨海默病患者言语中社交参考减少的特征时所依赖的关键词汇,从而增强了模型在神经退行性疾病研究中的应用信心。

链接: https://arxiv.org/abs/2410.00250
作者: Marina Ribeiro(1 and 2),Bárbara Malcorra(2),Natália B. Mota(2 and 3),Rodrigo Wilkens(4 and 5),Aline Villavicencio(5 and 6),Lilian C. Hubner(7),César Rennó-Costa(1) ((1) Bioinformatics Multidisciplinary Environment (BioME), Digital Metropolis Institute (IMD), Federal University of Rio Grande do Norte (UFRN), Natal (RN), Brazil, (2) Research Department at Mobile Brain, Mobile Brain, Rio de Janeiro (RJ), Brazil, (3) Institute of Psychiatry (IPUB), Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro (RJ), Brazil, (4) Department of Computer Science, The University of Exeter, Exeter, UK, (5) Institute for Data Science and Artificial Intelligence at the University of Exeter, Exeter, UK, (6) Department of Computer Science, The University of Sheffield, Sheffield, UK, (7) School of Humanities, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre (RS), Brazil)
关键词-EN: Alzheimer Disease, affect speech production, Large Language Model, Neurological disorders, significantly impact
类目: Computation and Language (cs.CL)
备注: 27 pages, 6 figures, authors Marina Ribeiro and Bárbara Malcorra have equal contribution, César Rennó-Costa is the corresponding author

点击查看摘要

Abstract:Neurological disorders that affect speech production, such as Alzheimer’s Disease (AD), significantly impact the lives of both patients and caregivers, whether through social, psycho-emotional effects or other aspects not yet fully understood. Recent advancements in Large Language Model (LLM) architectures have developed many tools to identify representative features of neurological disorders through spontaneous speech. However, LLMs typically lack interpretability, meaning they do not provide clear and specific reasons for their decisions. Therefore, there is a need for methods capable of identifying the representative features of neurological disorders in speech and explaining clearly why these features are relevant. This paper presents an explainable LLM method, named SLIME (Statistical and Linguistic Insights for Model Explanation), capable of identifying lexical components representative of AD and indicating which components are most important for the LLM’s decision. In developing this method, we used an English-language dataset consisting of transcriptions from the Cookie Theft picture description task. The LLM Bidirectional Encoder Representations from Transformers (BERT) classified the textual descriptions as either AD or control groups. To identify representative lexical features and determine which are most relevant to the model’s decision, we used a pipeline involving Integrated Gradients (IG), Linguistic Inquiry and Word Count (LIWC), and statistical analysis. Our method demonstrates that BERT leverages lexical components that reflect a reduction in social references in AD and identifies which further improve the LLM’s accuracy. Thus, we provide an explainability tool that enhances confidence in applying LLMs to neurological clinical contexts, particularly in the study of neurodegeneration.
摘要:影响言语生成的神经系统疾病,如阿尔茨海默病 (Alzheimer’s Disease, AD),无论通过社会、心理情感影响还是其他尚未完全理解的因素,都显著影响患者及其护理者的生活。近年来,大语言模型 (Large Language Model, LLM) 架构的进步开发了许多工具,通过自发言语识别神经系统疾病的代表性特征。然而,LLM 通常缺乏可解释性,这意味着它们不提供其决策的明确和具体原因。因此,需要能够识别言语中神经系统疾病的代表性特征并清晰解释这些特征相关性的方法。本文提出了一种可解释的 LLM 方法,名为 SLIME (Statistical and Linguistic Insights for Model Explanation),能够识别 AD 的代表性词汇成分,并指出哪些成分对 LLM 的决策最为重要。在开发这种方法时,我们使用了包含 Cookie Theft 图片描述任务转录的英语数据集。LLM 双向编码器表示 Transformer (BERT) 将文本描述分类为 AD 或对照组。为了识别代表性词汇特征并确定哪些与模型的决策最为相关,我们使用了综合梯度 (Integrated Gradients, IG)、语言探究与词汇计数 (Linguistic Inquiry and Word Count, LIWC) 和统计分析的流程。我们的方法表明,BERT 利用了反映 AD 中社交参考减少的词汇成分,并识别出进一步提高 LLM 准确性的成分。因此,我们提供了一种增强在神经临床环境中应用 LLM 信心的可解释性工具,特别是在神经退行性研究中。
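
下面给出一个示意性的 Python 片段,演示如何用 Captum 的 Layer Integrated Gradients 为 BERT 分类器计算词元级归因,这只对应 SLIME 流程中 IG 这一步;其中模型名 bert-base-uncased、标签含义与以 [PAD] 序列作基线均为假设,并非论文的官方实现。

```python
# 示意代码:用 Captum 的 Layer Integrated Gradients 为 BERT 分类器计算词元级归因,
# 仅对应 SLIME 流程中 IG 这一步;模型名、标签含义与 [PAD] 基线均为假设,非论文官方实现
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def forward_logits(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("the boy is taking the cookie from the jar", return_tensors="pt")
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)  # 以 [PAD] 序列为基线

lig = LayerIntegratedGradients(forward_logits, model.bert.embeddings)
attributions, delta = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    target=1,                      # 假设标签 1 对应 AD 组
    return_convergence_delta=True,
)
token_scores = attributions.sum(dim=-1).squeeze(0)        # 每个词元的归因分数
for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), token_scores):
    print(f"{tok}\t{s.item():+.4f}")
```

在此基础上,可以把高归因词元映射到 LIWC 的词类(如社交词),再用统计检验比较 AD 组与对照组的差异,这与论文描述的"IG + LIWC + 统计分析"流程方向一致。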

[NLP-49] T-KAER: Transparency-enhanced Knowledge-Augmented Entity Resolution Framework

【速读】: 该论文试图解决知识增强实体解析(KAER)框架中透明度不足的问题,即如何识别和记录外部知识对模型预测的影响。解决方案的关键在于引入T-KAER框架,通过提出三个透明度相关问题(T-Qs)并设计日志文件来记录实体解析过程,从而增强透明度。具体来说,T-KAER通过实验展示了如何从定量和定性角度进行错误分析,揭示了哪些语义信息被增强以及增强的知识为何对预测产生不同影响。

链接: https://arxiv.org/abs/2410.00218
作者: Lan Li,Liri Fang,Yiren Liu,Vetle I. Torvik,Bertram Ludaescher
关键词-EN: Entity resolution, representations refer, plays a crucial, crucial role, Entity Resolution framework
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注: Accepted by IDCC 2024

点击查看摘要

Abstract:Entity resolution (ER) is the process of determining whether two representations refer to the same real-world entity and plays a crucial role in data curation and data cleaning. Recent studies have introduced the KAER framework, aiming to improve pre-trained language models by augmenting external knowledge. However, identifying and documenting the external knowledge that is being augmented and understanding its contribution to the model’s predictions have received little to no attention in the research community. This paper addresses this gap by introducing T-KAER, the Transparency-enhanced Knowledge-Augmented Entity Resolution framework. To enhance transparency, three Transparency-related Questions (T-Qs) have been proposed: T-Q(1): What is the experimental process for matching results based on data inputs? T-Q(2): Which semantic information does KAER augment in the raw data inputs? T-Q(3): Which semantic information of the augmented data inputs influences the predictions? To address the T-Qs, T-KAER is designed to improve transparency by documenting the entity resolution processes in log files. In experiments, a citation dataset is used to demonstrate the transparency components of T-KAER. This demonstration showcases how T-KAER facilitates error analysis from both quantitative and qualitative perspectives, providing evidence on “what” semantic information is augmented and “why” the augmented knowledge influences predictions differently.
摘要:实体解析 (Entity Resolution, ER) 是确定两个表示是否指向同一现实世界实体的过程,在数据整理和数据清洗中起着至关重要的作用。近期研究引入了 KAER 框架,旨在通过增强外部知识来改进预训练语言模型。然而,识别和记录正在增强的外部知识及其对模型预测的贡献在研究界几乎没有受到关注。本文通过引入透明度增强的知识增强实体解析框架 T-KAER 来填补这一空白。为了增强透明度,提出了三个透明度相关问题 (T-Qs):T-Q(1):基于数据输入的匹配结果的实验过程是什么?T-Q(2):KAER 在原始数据输入中增强了哪些语义信息?T-Q(3):增强的数据输入的哪些语义信息影响了预测?为了解决 T-Qs,T-KAER 设计通过在日志文件中记录实体解析过程来提高透明度。在实验中,使用引文数据集来展示 T-KAER 的透明度组件。这一展示展示了 T-KAER 如何从定量和定性角度促进错误分析,提供关于“什么”语义信息被增强以及“为什么”增强的知识以不同方式影响预测的证据。
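
T-KAER 的核心做法之一是把实体解析各阶段写入日志文件,用以回答三个透明度问题。下面是按此思路写的一个最小示意(字段划分与文件格式均为本文假设,并非论文的原始实现):

```python
# 示意代码:把实体解析各阶段(输入、增强语义、预测)写入 JSONL 日志,对应三个透明度问题;
# 字段划分与文件格式均为本文假设,并非论文的原始实现
import json, time

def log_er_step(log_path, record_pair, augmented_info, prediction, score):
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "inputs": record_pair,                   # T-Q(1): 参与匹配的原始数据输入
        "augmented_semantics": augmented_info,   # T-Q(2): 注入了哪些外部语义信息
        "prediction": prediction,                # T-Q(3): 最终匹配判断及其分数
        "score": score,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

log_er_step(
    "er_trace.jsonl",
    record_pair={"left": "J. Smith, Data Curation, 2021", "right": "John Smith. Data curation. 2021."},
    augmented_info={"entity_type": "Person", "venue": "IDCC"},
    prediction="match",
    score=0.93,
)
```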


[NLP-50] Evaluating the performance of state-of-the-art esg domain-specific pre-trained large language models in text classification against existing models and traditional machine learning techniques

【速读】: 该论文旨在解决环境、社会和治理(ESG)信息在文本披露中的分类问题,关键在于开发和评估能够准确识别和分类E、S和G相关内容的二元分类模型。研究采用了量化方法,包括数据收集、预处理,以及开发ESG专用的大型语言模型(LLMs)和传统机器学习分类器(如支持向量机和XGBoost)。通过标准自然语言处理性能指标(如准确率、精确率、召回率和F1分数)进行性能评估,并引入了一种新颖的微调方法Qlora,显著提升了LLMs在所有ESG领域的分类性能。此外,研究还开发了针对特定领域的微调模型,如EnvLlama 2-Qlora、SocLlama 2-Qlora和GovLlama 2-Qlora,这些模型在ESG文本分类中表现出色。

链接: https://arxiv.org/abs/2410.00207
作者: Tin Yuet Chung,Majid Latifi
关键词-EN: Support Vector Machines, textual disclosures, Support Vector, ESG, Vector Machines
类目: Computation and Language (cs.CL)
备注: 56 pages, 9 figures

点击查看摘要

Abstract:This research investigates the classification of Environmental, Social, and Governance (ESG) information within textual disclosures. The aim is to develop and evaluate binary classification models capable of accurately identifying and categorizing E, S and G-related content respectively. The motivation for this research stems from the growing importance of ESG considerations in investment decisions and corporate accountability. Accurate and efficient classification of ESG information is crucial for stakeholders to understand the impact of companies on sustainability and to make informed decisions. The research uses a quantitative approach involving data collection, data preprocessing, and the development of ESG-focused Large Language Models (LLMs) and traditional machine learning (Support Vector Machines, XGBoost) classifiers. Performance evaluation guides iterative refinement until satisfactory metrics are achieved. The research compares traditional machine learning techniques (Support Vector Machines, XGBoost), state-of-the-art language model (FinBERT-ESG) and fine-tuned LLMs like Llama 2, by employing standard Natural Language Processing performance metrics such as accuracy, precision, recall, F1-score. A novel fine-tuning method, Qlora, is applied to LLMs, resulting in significant performance improvements across all ESG domains. The research also develops domain-specific fine-tuned models, such as EnvLlama 2-Qlora, SocLlama 2-Qlora, and GovLlama 2-Qlora, which demonstrate impressive results in ESG text classification.
摘要:本研究探讨了在文本披露中对环境、社会和治理 (ESG) 信息的分类。目标是开发和评估能够准确识别和分类 E、S 和 G 相关内容的二元分类模型。此研究的动机源于 ESG 考量在投资决策和企业责任中的日益重要性。准确且高效地分类 ESG 信息对于利益相关者理解公司对可持续性的影响并做出明智决策至关重要。研究采用定量方法,包括数据收集、数据预处理以及开发专注于 ESG 的大语言模型 (LLM) 和传统机器学习 (支持向量机、XGBoost) 分类器。通过性能评估指导迭代改进,直至达到满意的指标。研究比较了传统机器学习技术 (支持向量机、XGBoost)、最先进的语言模型 (FinBERT-ESG) 以及微调后的 LLM (如 Llama 2),采用了标准的自然语言处理性能指标,如准确率、精确率、召回率和 F1 分数。一种新颖的微调方法 Qlora 应用于 LLM,在所有 ESG 领域中显著提升了性能。研究还开发了领域特定的微调模型,如 EnvLlama 2-Qlora、SocLlama 2-Qlora 和 GovLlama 2-Qlora,这些模型在 ESG 文本分类中展示了令人印象深刻的结果。
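
下面是一个 QLoRA(4-bit 量化加 LoRA)微调 Llama 2 做 ESG 二分类的最小配置示意,基于 transformers 与 peft;其中基座模型名、LoRA 超参数和标签数均为假设,仅用于说明论文所采用的这类微调方式的大致形态。

```python
# 示意代码:QLoRA(4-bit 量化 + LoRA)微调 Llama 2 做 ESG 二分类的最小配置;
# 模型名、LoRA 超参数与标签数均为假设,仅演示这种微调方式的大致形态
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # 假设:已获授权的基座模型
    num_labels=2,                    # 例如:文本是否与 E(环境)相关的二分类
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # 仅 LoRA 参数可训练,其余权重保持 4-bit 冻结
```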


[NLP-51] DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation ECCV2024

【速读】: 该论文试图解决机器理解结构化视觉内容(如幻灯片和用户界面)的问题,特别是为残障人士提供无障碍访问的挑战。解决方案的关键在于提出了一种基于代码生成的方法,用于生成带有目标标签的合成结构化视觉数据。这种方法允许用户创建带有内置标签的数据集,并通过少量人工标注的样本来训练模型,从而显著减少了手动数据收集和标注的时间和劳动成本,并在识别视觉元素、描述视觉内容和分类视觉内容类型等任务中展示了性能提升。

链接: https://arxiv.org/abs/2410.00201
作者: Yi-Hao Peng,Faria Huq,Yue Jiang,Jason Wu,Amanda Xin Yue Li,Jeffrey Bigham,Amy Pavel
关键词-EN: Enabling machines, understand structured visuals, machines to understand, user interfaces, interfaces is essential
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ECCV 2024

点击查看摘要

Abstract:Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.
摘要:使机器能够理解幻灯片和用户界面等结构化视觉内容,对于使其对残障人士可访问至关重要。然而,通过计算实现这种理解需要手动数据收集和标注,这既耗时又费力。为了克服这一挑战,我们提出了一种使用代码生成生成带有目标标签的合成结构化视觉内容的方法。我们的方法允许人们创建带有内置标签的数据集,并使用少量人工标注的示例训练模型。我们在三个理解幻灯片和用户界面的任务中展示了性能提升:识别视觉元素、描述视觉内容和分类视觉内容类型。
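
下面用一个极简的 Python 示意说明"通过代码生成获得自带标签的结构化视觉数据"这一思路:程序在生成 HTML 幻灯片的同时写出元素标签;真实工作由 LLM 生成更多样的布局,此处的元素类型与模板均为假设。

```python
# 示意代码:用"代码生成"的方式合成自带标签的幻灯片(HTML),标签随生成过程一并写出;
# 真实工作由 LLM 生成更多样的布局,此处的元素类型与模板均为假设
import json, random

ELEMENTS = ["title", "bullet_list", "image_placeholder", "caption"]

def make_slide(slide_id):
    chosen = random.sample(ELEMENTS, k=3)
    html, labels = ["<div class='slide'>"], []
    for i, kind in enumerate(chosen):
        html.append(f"  <div id='el{i}' data-role='{kind}'>{kind.upper()}</div>")
        labels.append({"element_id": f"el{i}", "type": kind})   # 内置标签,无需人工标注
    html.append("</div>")
    with open(f"slide_{slide_id}.html", "w", encoding="utf-8") as f:
        f.write("\n".join(html))
    with open(f"slide_{slide_id}.json", "w", encoding="utf-8") as f:
        json.dump({"slide": slide_id, "elements": labels}, f, indent=2)

for sid in range(3):
    make_slide(sid)
```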

[NLP-52] Do Vision-Language Models Really Understand Visual Language?

【速读】: 该论文试图解决的问题是评估大型视觉-语言模型(LVLMs)在理解图表(diagrams)方面的能力,特别是它们在识别和推理图表中概念实体及其关系的能力。解决方案的关键在于开发了一个综合测试套件,通过多种问题类型和跨领域的合成及真实图表来评估模型的识别和推理能力。研究发现,尽管这些模型在识别实体方面表现良好,但在理解关系方面存在显著局限,其表现很大程度上依赖于模型背景知识的捷径效应,而非真正的图表理解能力。

链接: https://arxiv.org/abs/2410.00193
作者: Buse Giledereli,Yifan Hou,Yilei Tu,Mrinmaya Sachan
关键词-EN: Visual language, visual language depicting, spatial arrangements, system of communication, communication that conveys
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of three LVLMs (GPT-4V, GPT-4o, and Gemini) shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
摘要:视觉语言是一种通过符号、形状和空间排列传达信息的交流系统。图表是视觉语言的一个典型例子,它以图像的形式描绘复杂概念及其关系。图表的符号性质为构建能够理解它们的模型带来了重大挑战。然而,最近的研究似乎表明,大型视觉-语言模型 (LVLMs) 甚至可以处理涉及图表的复杂推理任务。本文通过开发一个综合测试套件来评估 LVLMs 的图表理解能力,以研究这一现象。我们的测试套件使用了一系列问题,这些问题聚焦于概念实体及其在多个领域内合成图表和真实图表中的关系,以评估模型的识别和推理能力。我们对三种 LVLMs(GPT-4V、GPT-4o 和 Gemini)的评估显示,尽管这些模型能够准确识别和推理实体,但它们理解关系的能力明显有限。进一步的测试揭示,模型在图表理解上的良好表现很大程度上源于利用其背景知识作为识别和推理关系信息的捷径。因此,我们得出结论,LVLMs 在真正的图表理解能力上存在局限,它们在图表推理中的出色表现是一种由其他混淆因素(如模型中的背景知识)产生的错觉。

[NLP-53] Zero-Shot Classification of Crisis Tweets Using Instruction-Finetuned Large Language Models

【速读】: 该论文旨在评估商业大型语言模型(如OpenAI GPT-4o、Gemini 1.5-flash-001和Anthropic Claude-3-5 Sonnet)在零样本分类任务中对社交媒体短帖的分类能力,特别是在灾难响应中的人道主义信息识别和分类。解决方案的关键在于设计了两个分类任务:一是判断帖子是否在人道主义背景下具有信息价值;二是根据16种可能的人道主义类别对帖子进行排序并提供概率。研究结果表明,提供事件背景信息有助于提高人道主义标签分类的准确性,同时发现不同模型在不同数据集上的表现差异显著,这引发了对数据集质量的质疑。

链接: https://arxiv.org/abs/2410.00182
作者: Emma McDaniel,Samuel Scheele,Jeff Liu
关键词-EN: pre-LLM NLP techniques, Social media posts, pre-LLM NLP, NLP techniques, short social media
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Social media posts are frequently identified as a valuable source of open-source intelligence for disaster response, and pre-LLM NLP techniques have been evaluated on datasets of crisis tweets. We assess three commercial large language models (OpenAI GPT-4o, Gemini 1.5-flash-001 and Anthropic Claude-3-5 Sonnet) capabilities in zero-shot classification of short social media posts. In one prompt, the models are asked to perform two classification tasks: 1) identify if the post is informative in a humanitarian context; and 2) rank and provide probabilities for the post in relation to 16 possible humanitarian classes. The posts being classified are from the consolidated crisis tweet dataset, CrisisBench. Results are evaluated using macro, weighted, and binary F1-scores. The informative classification task, generally performed better without extra information, while for the humanitarian label classification providing the event that occurred during which the tweet was mined, resulted in better performance. Further, we found that the models have significantly varying performance by dataset, which raises questions about dataset quality.
摘要:社交媒体帖子常被视为灾难响应中有价值的开源情报来源,在大语言模型 (LLM) 出现之前的自然语言处理 (NLP) 技术已在危机推文数据集上得到评估。我们评估了三种商业大语言模型(OpenAI GPT-4o、Gemini 1.5-flash-001 和 Anthropic Claude-3-5 Sonnet)对短社交媒体帖子进行零样本分类的能力。在同一个提示中,模型被要求执行两项分类任务:1) 识别帖子在人道主义背景下是否具有信息价值;2) 针对 16 种可能的人道主义类别对帖子进行排序并给出概率。被分类的帖子来自综合危机推文数据集 CrisisBench。结果使用宏平均、加权和二元 F1 分数进行评估。信息价值分类任务通常在不提供额外信息时表现更好;而对于人道主义标签分类,提供推文采集时所处事件的信息则能带来更好的表现。此外,我们发现各模型在不同数据集上的表现差异显著,这引发了对数据集质量的疑问。
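
下面给出一个在同一个提示中同时完成两项分类任务的调用示意(OpenAI Python 客户端);其中模型名 gpt-4o、类别列表(此处截断为 6 类)与提示措辞均为假设,并非论文使用的确切设置。

```python
# 示意代码:在同一个提示中同时完成"是否有人道主义信息价值"与多类别概率排序两项任务;
# 模型名 gpt-4o、类别列表(此处截断为 6 类)与提示措辞均为假设,非论文的确切设置
from openai import OpenAI

client = OpenAI()  # 需要环境变量 OPENAI_API_KEY

HUM_CLASSES = ["infrastructure_damage", "injured_or_dead_people", "requests_or_needs",
               "rescue_volunteering", "caution_and_advice", "other"]

def classify(tweet, event=None):
    context = f"Event context: {event}\n" if event else ""
    prompt = (
        f"{context}Tweet: {tweet}\n"
        "1) Is this tweet informative in a humanitarian context? Answer yes/no.\n"
        f"2) Rank the classes {HUM_CLASSES} and give a probability for each, as JSON."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(classify("Bridge collapsed on Main St, several people trapped",
               event="2015 Nepal earthquake"))
```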

[NLP-54] Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification EMNLP2024

【速读】: 该论文试图解决的问题是评估现代自然语言处理(NLP)技术时,基准测试可能偏向于那些能够利用未标注文本的方法,因为研究人员可能会使用测试集中的未标注文本来预训练模型,从而导致过度乐观的评估结果。解决方案的关键在于通过实验量化这种偏差,并建议在研究少样本文本分类时采用多次子采样和多重训练折叠的方法,以确保评估的公正性和准确性。

链接: https://arxiv.org/abs/2410.00179
作者: Kush Dubey
关键词-EN: modern NLP techniques, evaluating modern NLP, NLP techniques, modern NLP, critical for evaluating
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To appear in the GenBench Workshop at EMNLP 2024

点击查看摘要

Abstract:Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models – BERT, GPT-2, and Mistral 7B – do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available at this https URL.
摘要:少样本学习基准对于评估现代自然语言处理 (NLP) 技术至关重要。然而,基准可能倾向于那些能够轻松利用未标注文本的方法,因为研究人员可以使用测试集中的未标注文本来预训练他们的模型。鉴于目前对此潜在问题的研究匮乏,我们进行了实验,以量化由于使用未标注的测试集文本而非独立抽取的未标注文本进行预训练所导致的偏差。在 25 个分类任务和 3 个语言模型(BERT、GPT-2 和 Mistral 7B)上进行的控制性少样本和零样本实验,并未发现过度乐观的证据。此外,我们展示了在研究少样本文本分类时重复子采样的重要性,并建议少样本学习基准应包含多个训练折。代码和数据可在以下链接获取:https URL。
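
下面用 scikit-learn 演示论文建议的做法:以多个随机种子重复抽取少样本训练折,并报告指标的均值与标准差;数据集与分类器仅为演示假设。

```python
# 示意代码:少样本评测时用多个随机训练折(重复子采样)并报告均值与标准差,
# 避免单次抽样带来的偶然结论;数据集与分类器仅为演示假设
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

data = fetch_20newsgroups(subset="all", categories=["sci.med", "rec.autos"])
X_text, y = np.array(data.data), np.array(data.target)

def one_fold(seed, k_per_class=16):
    rng = np.random.default_rng(seed)
    train_idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=k_per_class, replace=False)
        for c in np.unique(y)
    ])
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    vec = TfidfVectorizer().fit(X_text[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(X_text[train_idx]), y[train_idx])
    return f1_score(y[test_idx], clf.predict(vec.transform(X_text[test_idx])), average="macro")

scores = [one_fold(seed) for seed in range(5)]       # 5 个不同的训练折
print(f"macro-F1 = {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```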

[NLP-55] Adaptable Moral Stances of Large Language Models on Sexist Content: Implications for Society and Gender Discourse EMNLP2024

【速读】: 该论文试图解决的问题是如何利用大型语言模型(LLMs)进行道德推理,以批判和捍卫性别歧视语言。解决方案的关键在于评估八种LLMs在处理性别歧视问题时的表现,通过人类和自动评估验证这些模型生成文本的可理解性和上下文相关性,并分析模型在论证中引用的道德基础,揭示其输出的多样性意识形态视角。论文强调了LLMs在理解和干预性别歧视信念方面的双重作用,同时警示其被滥用的潜在风险,并建议设计安全机制以监控和保障其在涉及敏感社会议题的应用中的使用。

链接: https://arxiv.org/abs/2410.00175
作者: Rongchen Guo,Isar Nejadgholi,Hillary Dawkins,Kathleen C. Fraser,Svetlana Kiritchenko
关键词-EN: apply moral reasoning, defend sexist language, criticize and defend, apply moral, moral reasoning
类目: Computation and Language (cs.CL)
备注: To be published at EMNLP2024

点击查看摘要

Abstract:This work provides an explanatory view of how LLMs can apply moral reasoning to both criticize and defend sexist language. We assessed eight large language models, all of which demonstrated the capability to provide explanations grounded in varying moral perspectives for both critiquing and endorsing views that reflect sexist assumptions. With both human and automatic evaluation, we show that all eight models produce comprehensible and contextually relevant text, which is helpful in understanding diverse views on how sexism is perceived. Also, through analysis of moral foundations cited by LLMs in their arguments, we uncover the diverse ideological perspectives in models’ outputs, with some models aligning more with progressive or conservative views on gender roles and sexism. Based on our observations, we caution against the potential misuse of LLMs to justify sexist language. We also highlight that LLMs can serve as tools for understanding the roots of sexist beliefs and designing well-informed interventions. Given this dual capacity, it is crucial to monitor LLMs and design safety mechanisms for their use in applications that involve sensitive societal topics, such as sexism.
摘要:本研究探讨了大语言模型 (LLM) 如何运用道德推理来批判和捍卫带有性别歧视的语言。我们评估了八种大语言模型,这些模型均展现出基于不同道德视角对性别歧视观点进行批判和认同的能力。通过人工和自动评估,我们发现所有八种模型生成的文本均具有可理解性和上下文相关性,有助于理解对性别歧视的不同看法。此外,通过对模型论证中引用的道德基础进行分析,我们揭示了模型输出中多样化的意识形态视角,其中一些模型更倾向于与性别角色和性别歧视的进步或保守观点保持一致。基于我们的观察,我们警告大语言模型可能被误用来为性别歧视语言辩护。同时,我们强调大语言模型可以作为理解性别歧视信念根源和设计有针对性干预措施的工具。鉴于其双重功能,监控大语言模型并为其在涉及敏感社会话题(如性别歧视)的应用中设计安全机制至关重要。

[NLP-56] SSR: Alignment-Aware Modality Connector for Speech Language Models

【速读】: 该论文试图解决将语音融入预训练语言模型(SpeechLM)时遇到的两个主要问题:长语音的高效编码和预训练文本模态的灾难性遗忘。解决方案的关键在于提出了SSR-Connector(分段语音表示连接器),通过利用语音-文本对齐,将语音特征分段和压缩以匹配文本嵌入的粒度,并引入两阶段训练流程(包括蒸馏和微调阶段)来缓解灾难性遗忘问题。这一方法在语音-文本模态融合中表现优异,显著提升了语音理解能力(如在StoryCloze上提升10%的准确率,在Speech-MMLU上提升20%),同时保留了预训练文本的能力。

链接: https://arxiv.org/abs/2410.00168
作者: Weiting Tan,Hirofumi Inaguma,Ning Dong,Paden Tomasello,Xutai Ma
关键词-EN: pre-trained language model, Segmented Speech Representation, Speech Representation Connector, Fusing speech, language model
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Fusing speech into pre-trained language model (SpeechLM) usually suffers from inefficient encoding of long-form speech and catastrophic forgetting of pre-trained text modality. We propose SSR-Connector (Segmented Speech Representation Connector) for better modality fusion. Leveraging speech-text alignments, our approach segments and compresses speech features to match the granularity of text embeddings. Additionally, we introduce a two-stage training pipeline that includes the distillation and fine-tuning phases to mitigate catastrophic forgetting. SSR-Connector outperforms existing mechanism for speech-text modality fusion, consistently achieving better speech understanding (e.g., +10 accuracy on StoryCloze and +20 on Speech-MMLU) while preserving pre-trained text ability.
摘要:将语音融入预训练语言模型 (SpeechLM) 通常面临长语音编码效率低下以及预训练文本模态的灾难性遗忘问题。我们提出了 SSR-Connector (分段语音表示连接器) 以实现更好的模态融合。利用语音-文本对齐技术,我们的方法将语音特征分段并压缩,以匹配文本嵌入的粒度。此外,我们引入了一个两阶段训练流程,包括蒸馏和微调阶段,以缓解灾难性遗忘问题。SSR-Connector 在语音-文本模态融合方面优于现有机制,持续实现更好的语音理解 (例如,StoryCloze 准确率提升 10%,Speech-MMLU 提升 20%),同时保留预训练的文本能力。

[NLP-57] Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation

【速读】: 该论文试图解决在葡萄牙语环境下,如何开发一个可靠且相关的医疗虚拟助手的问题。解决方案的关键在于使用GPT-3.5翻译的HealthCareMagic-100k-en和MedQuAD数据集,通过PEFT-QLoRA方法微调ChatBode-7B模型,并评估了InternLM2模型和DrBode模型的性能。尽管DrBode模型在语法和连贯性方面表现较好,但存在对已获取医疗知识的灾难性遗忘问题。论文还指出了评估协议中评分者间一致性较低这一挑战,并提出了未来研究方向,包括评估面向医疗领域的多语言模型、提高训练数据质量以及开发更一致的评估方法。

链接: https://arxiv.org/abs/2410.00163
作者: Pedro Henrique Paiola,Gabriel Lino Garcia,João Renato Ribeiro Manesco,Mateus Roder,Douglas Rodrigues,João Paulo Papa
关键词-EN: relevant virtual assistant, agents in Portuguese, large language models, aiming to develop, healthcare professionals
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic-100k-en and MedQuAD datasets, translated from English using GPT-3.5, were used to fine-tune the ChatBode-7B model using the PEFT-QLoRA method. The InternLM2 model, with initial training on medical data, presented the best overall performance, with high precision and adequacy in metrics such as accuracy, completeness and safety. However, DrBode models, derived from ChatBode, exhibited a phenomenon of catastrophic forgetting of acquired medical knowledge. Despite this, these models performed frequently or even better in aspects such as grammaticality and coherence. A significant challenge was low inter-rater agreement, highlighting the need for more robust assessment protocols. This work paves the way for future research, such as evaluating multilingual models specific to the medical field, improving the quality of training data, and developing more consistent evaluation methodologies for the medical field.
摘要:本研究评估了大语言模型 (LLMs) 作为葡萄牙语医疗智能体的性能,旨在为医疗专业人员开发一个可靠且相关的虚拟助手。使用 GPT-3.5 翻译的 HealthCareMagic-100k-en 和 MedQuAD 数据集,通过 PEFT-QLoRA 方法对 ChatBode-7B 模型进行了微调。经过医疗数据初始训练的 InternLM2 模型在整体性能上表现最佳,在准确性、完整性和安全性等指标上具有高精度和适宜性。然而,源自 ChatBode 的 DrBode 模型表现出对已获取医疗知识的灾难性遗忘现象。尽管如此,这些模型在语法性和连贯性等方面常常表现相当甚至更好。一个显著的挑战是评分者间的一致性较低,这突显了对更稳健评估协议的需求。这项工作为未来的研究铺平了道路,例如评估特定于医疗领域的多语言模型,提高训练数据的质量,以及为医疗领域开发更一致的评估方法。

[NLP-58] KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

【速读】: 该论文试图解决大型语言模型(LLMs)在长上下文推理中面临的内存效率问题,特别是在处理长上下文时,关键-值(KV)缓存所需的内存随上下文长度线性增长,限制了在给定内存预算下并发处理长上下文请求的能力。解决方案的关键在于引入了一种名为KV-Compress的新型压缩方法,该方法通过在PagedAttention框架内连续地驱逐KV块,从而按比例减少KV缓存的内存占用,实现了接近理论压缩率的实际内存压缩效果。该方法在LongBench上对Mistral-7B-Instruct-v0.2和Llama-3.1-8B-Instruct模型进行了评估,相比现有方法,KV-Compress显著降低了压缩KV的数量,并提高了整体吞吐量。

链接: https://arxiv.org/abs/2410.00161
作者: Isaac Rehg
关键词-EN: Large Language Models, Language Models, Large Language, recent years, million-token context
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token context becoming a standard and million-token context becoming a reality. Efficiently supporting long-context inference remains challenging as the memory that must be allocated in key-value (KV) cache for a generation scales with its context length, limiting the number of long-context requests that can be served concurrently under a given memory budget. KV cache compression can mitigate this issue by removing under-utilized KVs from each attention head’s cache and reducing its memory footprint. Higher theoretical compression rates can be achieved when the number of removed KVs varies across attention heads, but application of such a strategy within existing inference frameworks adds fragmentation and cannot realize the theoretical compression rates in physical memory. We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework, reducing the memory footprint of the KV cache proportionally to this theoretical compression rate. Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x compared with prior methods. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance for all but three of the suite’s subsets. We benchmark an integration of our method with vLLM that increases total throughput by up to 5.18x by enabling larger decoding batches.
摘要:近年来,大语言模型 (LLM) 的上下文长度急剧增加,128k Token 的上下文已成为标准,百万 Token 的上下文也已成为现实。高效支持长上下文推理仍然是一个挑战,因为生成过程中必须为键值 (KV) 缓存分配的内存会随着上下文长度的增加而增加,这限制了在给定内存预算下可以同时处理的长上下文请求数量。KV 缓存压缩可以通过从每个注意力头的缓存中移除未充分利用的 KV 来缓解这一问题,从而减少其内存占用。当移除的 KV 数量在不同注意力头之间变化时,可以实现更高的理论压缩率,但在现有推理框架中应用这种策略会增加碎片化,并且无法在物理内存中实现理论压缩率。我们引入了 KV-Compress,这是一种新颖的压缩方法,它在 PagedAttention 框架内驱逐连续的 KV 块,从而使 KV 缓存的内存占用与理论压缩率成比例地减少。我们的方法在 LongBench 上针对 Mistral-7B-Instruct-v0.2 和 Llama-3.1-8B-Instruct 实现了最先进的性能,并且与之前的方法相比,将压缩 KV 的总数减少了 4 倍。在 Llama-3.1-8B-Instruct 和 Llama-3.1-70B-Instruct-FP8 上的评估实现了高达 8 倍的压缩率,对性能的影响可以忽略不计;对于测试套件中除三个子集外的所有子集,还能在保留超过 90% 全缓存性能的情况下实现高达 64 倍的压缩率。我们将该方法与 vLLM 集成,通过支持更大的解码批次,使总吞吐量提高了高达 5.18 倍。
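
下面是一个按注意力头利用率、在全局预算内淘汰 KV 的玩具实现,用来说明"按头可变压缩率"这一核心思想;它与论文/vLLM 中的 PagedAttention 实现无关,打分方式也只是一种假设。

```python
# 示意代码:按注意力头的"KV 利用率"在全局预算下淘汰 KV,不同注意力头可保留不同数量的 KV,
# 这与 KV-Compress"按头可变压缩率"的思想一致;此处为独立的玩具实现,并非论文或 vLLM 的源码
import torch

def compress_kv(keys, values, attn_weights, total_keep_ratio=0.25):
    """keys/values: [num_heads, seq_len, head_dim]; attn_weights: [num_heads, q_len, seq_len]"""
    num_heads, seq_len, _ = keys.shape
    usage = attn_weights.sum(dim=1)                        # 每个位置在各头上累计获得的注意力质量
    budget = max(1, int(num_heads * seq_len * total_keep_ratio))
    threshold = torch.topk(usage.flatten(), k=budget).values.min()  # 全局选出利用率最高的 budget 个 KV
    kept_k, kept_v = [], []
    for h in range(num_heads):
        idx = (usage[h] >= threshold).nonzero(as_tuple=True)[0]
        kept_k.append(keys[h, idx])
        kept_v.append(values[h, idx])
    return kept_k, kept_v                                  # 各头保留的 KV 数不同,故以列表返回

heads, seq, dim = 4, 128, 64
k, v = torch.randn(heads, seq, dim), torch.randn(heads, seq, dim)
attn = torch.softmax(torch.randn(heads, 16, seq), dim=-1)
ck, cv = compress_kv(k, v, attn)
print([x.shape[0] for x in ck])                            # 每个头各自保留的 KV 数量
```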

[NLP-59] Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

【速读】: 该论文试图解决在大语言模型(LLMs)中,单一概念向量因数据和训练方式不同而变化,导致其在实际应用中不够稳健的问题。解决方案的关键在于提出了一种新的方法,即通过扩展线性探测分类器,将单一概念向量扩展为高斯概念子空间(Gaussian Concept Subspace, GCS),从而更准确地表示特定概念。这种方法不仅提高了概念表示的稳健性,还在情感引导等实际应用中展示了其有效性,能够在自然语言生成任务中平衡引导性能和保持流畅性。

链接: https://arxiv.org/abs/2410.00153
作者: Haiyan Zhao,Heng Zhao,Bo Shen,Ali Payani,Fan Yang,Mengnan Du
关键词-EN: large language models, Probing learned concepts, encoded internally, crucial for understanding, understanding how semantic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 9 figures

点击查看摘要

Abstract:Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS’s effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.
摘要:在大语言模型 (LLM) 中探查学习到的概念对于理解语义知识如何在内部编码至关重要。在探查任务上训练线性分类器是一种主要方法,用于表示空间中某个概念的向量。然而,为某个概念识别的单一向量会随着数据和训练的变化而变化,这使得其鲁棒性较差,并削弱了其在实际应用中的有效性。为了应对这一挑战,我们提出了一种近似表示特定概念子空间的方法。基于线性探查分类器,我们将概念向量扩展到高斯概念子空间 (GCS)。我们通过测量其在不同大小和架构的多个 LLM 中的忠实度和合理性,展示了 GCS 的有效性。此外,我们使用表示干预任务来展示其在情感引导等实际应用中的功效。实验结果表明,GCS 概念向量在平衡自然语言生成任务中的引导性能和保持流畅性方面具有潜力。
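
下面用一个自包含的示意说明高斯概念子空间的构造方式:在重采样的数据上训练多个线性探针,把归一化后的权重向量视为概念向量样本,再用其均值与协方差定义高斯分布并从中采样;此处用随机特征代替真实的 LLM 隐状态,仅为演示假设。

```python
# 示意代码:在重采样数据上训练多个线性探针,用其权重向量的均值与协方差拟合高斯分布,
# 即"高斯概念子空间",并从中采样概念向量;此处用随机特征代替真实的 LLM 隐状态
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n = 256, 2000
H = rng.normal(size=(n, hidden_dim))                 # 假设:某一层的隐藏表示
true_dir = rng.normal(size=hidden_dim)
y = (H @ true_dir + rng.normal(scale=0.5, size=n) > 0).astype(int)  # 假设:二值概念标签

probe_dirs = []
for seed in range(20):                               # 多个子样本 -> 多个概念向量
    idx = rng.choice(n, size=n // 2, replace=False)
    w = LogisticRegression(max_iter=1000).fit(H[idx], y[idx]).coef_[0]
    probe_dirs.append(w / np.linalg.norm(w))
probe_dirs = np.stack(probe_dirs)

mu = probe_dirs.mean(axis=0)                         # 子空间的均值方向
cov = np.cov(probe_dirs, rowvar=False)               # 及其协方差
sampled = rng.multivariate_normal(mu, cov, size=5)   # 从高斯概念子空间中采样概念向量
print(mu.shape, sampled.shape)
```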

[NLP-60] Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

【速读】: 该论文试图解决现有数学推理基准(如GSM8K)对于评估大型语言模型(LLMs)的数学推理能力逐渐失效的问题。解决方案的关键在于提出了Scheherazade方法,通过逻辑链式的方式自动生成更具挑战性的数学推理问题。具体来说,论文提出了两种链式方法:前向链式(forward chaining)和后向链式(backward chaining),分别要求模型从前向后和从后向前进行推理。通过将Scheherazade应用于GSM8K,生成了GSM8K-Scheherazade基准,并评估了前沿LLMs和OpenAI的o1-preview模型在该基准上的表现。结果显示,尽管前沿模型的性能在链式问题数量增加时急剧下降,但o1-preview模型在链式问题数量达到5个时仍能保持较好的表现,尤其是在后向链式问题上表现更佳。

链接: https://arxiv.org/abs/2410.00151
作者: Stephen Miner,Yoshiki Takashima,Simeng Han,Ferhat Erata,Timos Antonopoulos,Ruzica Piskac,Scott J Shapiro
关键词-EN: Large Language Models, Large Language, abilities of Large, Language Models, math reasoning abilities
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benchmarks are critical for measuring progress of math reasoning abilities of Large Language Models (LLMs). However, existing widely-used benchmarks such as GSM8K have been rendered less useful as multiple cutting-edge LLMs achieve over 94% accuracy. While harder benchmarks have been proposed, their creation is often manual and expensive. We present Scheherazade, an automated approach for producing challenging mathematical reasoning benchmarks by logically chaining mathematical reasoning problems. We propose two different chaining methods, forward chaining and backward chaining, which require reasoning forward and backward through the chain respectively. We apply Scheherazade on GSM8K to create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI’s o1-preview on it. We show that while frontier models’ performance declines precipitously at only a few questions chained, a preliminary evaluation suggests o1-preview performance persists up to 5 questions chained backwards. In addition, while all other models perform worse when problems are chained backwards, o1-preview performs better on backward-chained benchmarks. We will release the dataset and code publicly.
摘要:基准测试对于衡量大语言模型 (LLM) 数学推理能力的进展至关重要。然而,现有的广泛使用的基准测试,如 GSM8K,由于多个尖端 LLM 的准确率超过 94%,其效用已大打折扣。尽管提出了更难的基准测试,但其创建通常是手工且成本高昂的。我们提出了 Scheherazade,一种通过逻辑链接数学推理问题来自动生成挑战性数学推理基准的方法。我们提出了两种不同的链接方法:正向链接和反向链接,分别要求沿着链进行正向和反向推理。我们将 Scheherazade 应用于 GSM8K,创建了 GSM8K-Scheherazade,并对 3 个前沿 LLM 和 OpenAI 的 o1-preview 进行了评估。结果显示,尽管前沿模型的性能在仅链接几个问题后急剧下降,但初步评估表明 o1-preview 在反向链接 5 个问题后性能仍能持续。此外,尽管所有其他模型在问题反向链接时表现更差,o1-preview 在反向链接的基准测试中表现更好。我们将公开发布数据集和代码。
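
下面是一个"前向链接"两道 GSM8K 风格题目的极简示意:后一题的数值条件引用前一题的答案,链接后的标准答案据此自动推出;题目与模板均为自编示例,并非论文的官方数据生成代码。

```python
# 示意代码:把两道 GSM8K 风格的题目"前向链接"成一条链——后一题引用前一题的答案,
# 模型必须先解出第一题才能解第二题;题目与链接模板为自编示例,非论文的官方数据生成代码
problems = [
    {"text": "Ali has 4 boxes with 6 apples in each box. How many apples does Ali have?",
     "answer": 24},
    {"text": "Ben has {X} marbles and gives away half of them. How many marbles does Ben keep?",
     "rule": lambda prev: prev // 2, "answer": None},
]

def forward_chain(problems):
    lines, prev = [], None
    for i, p in enumerate(problems, start=1):
        text = p["text"]
        if "{X}" in text:
            text = text.replace("{X}", f"as many marbles as the answer to question {i - 1}")
            p["answer"] = p["rule"](prev)   # 链接后第 i 题的标准答案由第 i-1 题答案推出
        lines.append(f"Q{i}: {text}")
        prev = p["answer"]
    return "\n".join(lines), prev

chained_prompt, final_answer = forward_chain(problems)
print(chained_prompt)
print("final answer:", final_answer)   # 24 // 2 = 12
# 反向链接(backward chaining)则把引用方向倒过来,迫使模型从链尾向链首倒推
```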

[NLP-61] Are Large Language Models In-Context Personalized Summarizers? Get an iCOPERNICUS Test Done!

【速读】: 该论文试图解决大语言模型(LLMs)在基于上下文学习(ICL)的摘要任务中,个性化学习(ICPL)能力不足的问题。解决方案的关键在于提出了iCOPERNICUS框架,该框架利用EGISES作为比较度量,能够评估模型是否有效利用了ICPL提示中的三种关键线索:示例摘要、用户阅读历史和用户配置文件的对比。通过这一框架,研究者能够识别出哪些模型在面对更丰富的提示时,其ICPL能力显著下降,从而揭示出这些模型在真正实现个性化学习方面的不足。

链接: https://arxiv.org/abs/2410.00149
作者: Divya Patel,Pathik Patel,Ankush Chander,Sourish Dasgupta,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, Language Models, succeeded considerably, Large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have succeeded considerably in In-Context-Learning (ICL) based summarization. However, saliency is subject to the users’ specific preference histories. Hence, we need reliable In-Context Personalization Learning (ICPL) capabilities within such LLMs. For any arbitrary LLM to exhibit ICPL, it needs to have the ability to discern contrast in user profiles. A recent study proposed a measure for degree-of-personalization called EGISES for the first time. EGISES measures a model’s responsiveness to user profile differences. However, it cannot test if a model utilizes all three types of cues provided in ICPL prompts: (i) example summaries, (ii) user’s reading histories, and (iii) contrast in user profiles. To address this, we propose the iCOPERNICUS framework, a novel In-COntext PERsonalization learNIng sCrUtiny of Summarization capability in LLMs that uses EGISES as a comparative measure. As a case-study, we evaluate 17 state-of-the-art LLMs based on their reported ICL performances and observe that 15 models’ ICPL degrades (min: 1.6%; max: 3.6%) when probed with richer prompts, thereby showing lack of true ICPL.
摘要:大语言模型 (LLMs) 在基于上下文学习 (In-Context-Learning, ICL) 的摘要生成方面取得了显著成功。然而,摘要的显著性受用户特定偏好历史的影响。因此,我们需要在这些大语言模型中具备可靠的上下文个性化学习 (In-Context Personalization Learning, ICPL) 能力。对于任何任意的大语言模型来说,要展示 ICPL,它需要具备辨别用户档案差异的能力。最近的一项研究首次提出了一种名为 EGISES 的个性化程度度量方法。EGISES 衡量模型对用户档案差异的响应能力。然而,它无法测试模型是否利用了 ICPL 提示中提供的所有三种线索:(i) 示例摘要,(ii) 用户的阅读历史,以及 (iii) 用户档案的对比。为了解决这一问题,我们提出了 iCOPERNICUS 框架,这是一种新颖的上下文个性化学习 (In-COntext PERsonalization learNIng sCrUtiny of Summarization, ICPL) 能力评估框架,它使用 EGISES 作为比较度量。作为一个案例研究,我们评估了 17 个基于其报告的 ICL 性能的最先进大语言模型,并观察到当使用更丰富的提示进行测试时,15 个模型的 ICPL 性能下降 (最小:1.6%;最大:3.6%),从而显示出缺乏真正的 ICPL 能力。

[NLP-62] Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms

【速读】: 该论文试图解决传统主题建模和基于聚类的方法在捕捉上下文语义信息方面的不足。解决方案的关键在于引入了一种创新的端到端语义驱动的主题建模技术,该技术利用先进的词和文档嵌入结合强大的聚类算法,通过预训练的基于Transformer的语言模型生成文档嵌入,降低嵌入维度,并基于语义相似性进行聚类,从而生成连贯且有意义的主题。这种方法显著提升了主题建模的效果,相比ChatGPT和传统算法,能够提供更加连贯和有意义的主题。

链接: https://arxiv.org/abs/2410.00134
作者: Melkamu Abay Mersha,Mesay Gemeda yigezu,Jugal Kalita
关键词-EN: discover hidden topics, Topic modeling, prior knowledge, discover hidden, Traditional topic modeling
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Topic modeling is a powerful technique to discover hidden topics and patterns within a collection of documents without prior knowledge. Traditional topic modeling and clustering-based techniques encounter challenges in capturing contextual semantic information. This study introduces an innovative end-to-end semantic-driven topic modeling technique for the topic extraction process, utilizing advanced word and document embeddings combined with a powerful clustering algorithm. This semantic-driven approach represents a significant advancement in topic modeling methodologies. It leverages contextual semantic information to extract coherent and meaningful topics. Specifically, our model generates document embeddings using pre-trained transformer-based language models, reduces the dimensions of the embeddings, clusters the embeddings based on semantic similarity, and generates coherent topics for each cluster. Compared to ChatGPT and traditional topic modeling algorithms, our model provides more coherent and meaningful topics.
摘要:主题建模是一种强大的技术,能够在无需先验知识的情况下,从文档集合中发现隐藏的主题和模式。传统的主题建模和基于聚类的技术在捕捉上下文语义信息方面面临挑战。本研究引入了一种创新的端到端语义驱动主题建模技术,用于主题提取过程,利用先进的词和文档嵌入结合强大的聚类算法。这种语义驱动的方法在主题建模方法学上代表了显著的进步。它利用上下文语义信息来提取连贯且有意义的主题。具体而言,我们的模型使用预训练的基于 Transformer 的语言模型生成文档嵌入,减少嵌入的维度,基于语义相似性对嵌入进行聚类,并为每个聚类生成连贯的主题。与 ChatGPT 和传统的主题建模算法相比,我们的模型提供了更加连贯和有意义的主题。
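
下面按"Transformer 嵌入 → 降维 → 聚类 → 每簇抽取关键词"的流程给出一个最小示意;嵌入模型名为假设,论文实际采用的降维与聚类算法可能不同,此处以 PCA 与 KMeans 代替。

```python
# 示意代码:"Transformer 嵌入 -> 降维 -> 聚类 -> 每簇抽取关键词"的最小主题建模流程;
# 嵌入模型名为假设,论文实际使用的降维与聚类算法可能不同,此处以 PCA 与 KMeans 代替
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The central bank raised interest rates again this quarter.",
    "Inflation pressures keep bond yields elevated.",
    "The team won the championship after extra time.",
    "A late goal sealed the victory in the final match.",
]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)                  # 文档嵌入
low = PCA(n_components=2).fit_transform(emb)                                # 降维
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(low)  # 语义聚类

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
terms = np.array(tfidf.get_feature_names_out())
for c in sorted(set(labels)):
    centroid = np.asarray(X[labels == c].mean(axis=0)).ravel()              # 簇内平均 TF-IDF
    print(f"topic {c}:", terms[centroid.argsort()[::-1][:5]])
```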

[NLP-63] Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models EMNLP2024

【速读】: 该论文试图解决联邦学习(Federated Learning, FL)在微调大型语言模型(Large Language Models, LLMs)时面临的计算和通信成本高昂的问题,特别是在训练数据非独立同分布(non-IID)的情况下。解决方案的关键在于提出了基于Fisher信息的高效课程联邦学习框架(FibecFed),其中包括两个创新方法:一是基于Fisher信息的自适应数据采样,以提高FL微调过程的有效性;二是动态选择合适的层进行全局聚合和稀疏参数进行本地更新,结合低秩适应(LoRA)技术,以提高FL微调过程的效率。实验结果表明,FibecFed在精度和速度上均显著优于17种基线方法。

链接: https://arxiv.org/abs/2410.00131
作者: Ji Liu,Jiaxiang Ren,Ruoming Jin,Zijie Zhang,Yang Zhou,Patrick Valduriez,Dejing Dou
关键词-EN: Large Language Models, fine-tune Large Language, collaboratively train models, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 27 pages, 8 figures, 14 tables, to appear in EMNLP 2024

点击查看摘要

Abstract:As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). While LLMs correspond to huge size, the scale of the training data significantly increases, which leads to tremendous amounts of computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches).
摘要:作为一种有前景的范式,联邦学习 (Federated Learning, FL) 能够利用分散的数据协同训练模型,从而实现对大语言模型 (Large Language Models, LLMs) 的微调。然而,由于 LLMs 规模庞大,训练数据的规模显著增加,导致计算和通信成本巨大。训练数据通常是非独立同分布 (non-Independent and Identically Distributed, non-IID) 的,这要求在每个设备上进行自适应数据处理。尽管低秩适应 (Low Rank Adaptation, LoRA) 可以显著减少微调过程中需要更新的参数规模,但传输 LLMs 中所有层的低秩参数仍然耗时过长。本文提出了一种基于 Fisher 信息的有效课程联邦学习框架 (Fisher Information-based Efficient Curriculum Federated Learning, FibecFed),包含两种创新方法:自适应联邦课程学习和高效稀疏参数更新。首先,我们提出了一种基于 Fisher 信息的方法,自适应地在每个设备上采样数据,以提高 FL 微调过程的有效性。其次,我们动态选择合适的层进行全局聚合,并结合 LoRA 对稀疏参数进行局部更新,从而提高 FL 微调过程的效率。基于 10 个数据集的大量实验结果表明,FibecFed 在性能 (最高准确率提升 45.35%) 和微调速度 (最高提升 98.61%) 方面均优于 17 种基线方法。
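
下面用一个玩具模型演示"基于 Fisher 信息的自适应数据采样"这一步:以每个样本损失梯度的平方范数作为经验 Fisher 分数,并按分数比例抽样;模型、数据与抽样规模均为演示假设,且未涉及论文中的联邦聚合与 LoRA 稀疏更新部分。

```python
# 示意代码:用"单个样本损失梯度的平方范数"作为经验 Fisher 信息分数,并按分数比例抽样;
# 模型、数据与抽样规模均为演示假设,不涉及论文中的联邦聚合与 LoRA 稀疏更新部分
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)
X, y = torch.randn(200, 16), torch.randint(0, 2, (200,))

scores = []
for i in range(len(X)):
    model.zero_grad()
    loss = F.cross_entropy(model(X[i:i + 1]), y[i:i + 1])
    loss.backward()
    fisher = sum((p.grad ** 2).sum().item() for p in model.parameters())  # 对角经验 Fisher
    scores.append(fisher)

probs = torch.tensor(scores)
probs = probs / probs.sum()
picked = torch.multinomial(probs, num_samples=64, replacement=False)  # 信息量越大越可能被选中
print(picked[:10])
```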

[NLP-64] Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

【速读】: 该论文试图解决基于大型语言模型(LLMs)的代理在任务规划过程中因模型规模大和生成中间思维复杂而导致的规划延迟问题。解决方案的关键在于提出了一种以人为中心的高效代理规划方法——交互式推测规划(Interactive Speculative Planning),通过系统设计和人机交互的协同设计,强调代理系统能够灵活管理用户交互和打断,并将人类打断作为系统基础组件,从而加速整个规划过程,同时提高系统的用户中心性和效率。

链接: https://arxiv.org/abs/2410.00079
作者: Wenyue Hua,Mengting Wan,Shashank Vadrevu,Ryan Nadel,Yongfeng Zhang,Chi Wang
关键词-EN: producing action plans, human task delegation, Interactive Speculative Planning, task delegation, action plans
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 27 pages, 22 figures

点击查看摘要

Abstract:Agents, as user-centric tools, are increasingly deployed for human task delegation, assisting with a broad spectrum of requests by generating thoughts, engaging with user proxies, and producing action plans. However, agents based on large language models (LLMs) often face substantial planning latency due to two primary factors: the efficiency limitations of the underlying LLMs due to their large size and high demand, and the structural complexity of the agents due to the extensive generation of intermediate thoughts to produce the final output. Given that inefficiency in service provision can undermine the value of automation for users, this paper presents a human-centered efficient agent planning method – Interactive Speculative Planning – aiming at enhancing the efficiency of agent planning through both system design and human-AI interaction. Our approach advocates for the co-design of the agent system and user interface, underscoring the importance of an agent system that can fluidly manage user interactions and interruptions. By integrating human interruptions as a fundamental component of the system, we not only make it more user-centric but also expedite the entire process by leveraging human-in-the-loop interactions to provide accurate intermediate steps. Code and data will be released.
摘要:作为以用户为中心的工具,智能体越来越多地被部署用于人类任务的委托,通过生成思维、与用户代理互动并制定行动计划,协助处理广泛的请求。然而,基于大语言模型 (LLM) 的智能体常常面临显著的规划延迟,这主要归因于两个主要因素:由于其庞大的规模和高需求,底层 LLM 的效率限制;以及由于生成大量中间思维以产生最终输出而导致的智能体结构复杂性。鉴于服务提供中的低效率可能削弱自动化对用户的价值,本文提出了一种以人为中心的智能体高效规划方法——交互式推测规划 (Interactive Speculative Planning),旨在通过系统设计和人机交互来提高智能体规划的效率。我们的方法倡导智能体系统和用户界面的协同设计,强调智能体系统能够流畅地管理用户交互和打断的重要性。通过将人类打断整合为系统的基本组成部分,我们不仅使其更加以用户为中心,还通过利用人在回路中的交互来提供准确的中间步骤,从而加速整个过程。代码和数据将会发布。

[NLP-65] A Novel Spinor-Based Embedding Model for Transformers

【速读】: 该论文试图解决Transformer模型中词嵌入表达能力不足的问题,解决方案的关键在于利用几何代数中的旋量(spinors)来编码词向量。旋量提供了一个丰富的数学框架,能够捕捉高维空间中的复杂关系和变换,从而增强语言表示的表达力和鲁棒性。通过将词嵌入为旋量,论文旨在提升Transformer模型的性能。

链接: https://arxiv.org/abs/2410.00038
作者: Rick White
关键词-EN: geometric algebra, paper proposes, models by utilizing, Transformer models, word embeddings
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:This paper proposes a novel approach to word embeddings in Transformer models by utilizing spinors from geometric algebra. Spinors offer a rich mathematical framework capable of capturing complex relationships and transformations in high-dimensional spaces. By encoding words as spinors, we aim to enhance the expressiveness and robustness of language representations. We present the theoretical foundations of spinors, detail their integration into Transformer architectures, and discuss potential advantages and challenges.
摘要:本文提出了一种利用几何代数中的旋量 (spinors) 来改进 Transformer 模型中词嵌入 (word embeddings) 的新方法。旋量提供了一个丰富的数学框架,能够捕捉高维空间中的复杂关系和变换。通过将词编码为旋量,我们旨在增强语言表示的表达能力和鲁棒性。本文介绍了旋量的理论基础,详细阐述了其在 Transformer 架构中的集成,并讨论了潜在的优势和挑战。

[NLP-66] Strategic Collusion of LLM Agents: Market Division in Multi-Commodity Competitions

【速读】: 该论文试图解决的问题是探讨大型语言模型(LLMs)在多商品市场中作为自主代理时是否能够独立进行反竞争行为,如串谋或市场分割。解决方案的关键在于研究LLMs是否能够通过动态调整定价和资源分配策略,在没有直接人类干预或明确串谋指令的情况下,实现对特定商品的垄断,从而最大化利润。研究结果表明,LLMs具备这种能力,这为企业在战略角色中整合AI以及监管机构维护公平竞争市场带来了独特的挑战和机遇。

链接: https://arxiv.org/abs/2410.00031
作者: Ryan Y. Lin,Siddhartha Ojha,Kevin Cai,Maxwell F. Chen
关键词-EN: Machine-learning technologies, real-world market scenarios, increased deployment, deployment in real-world, Cournot competition frameworks
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:Machine-learning technologies are seeing increased deployment in real-world market scenarios. In this work, we explore the strategic behaviors of large language models (LLMs) when deployed as autonomous agents in multi-commodity markets, specifically within Cournot competition frameworks. We examine whether LLMs can independently engage in anti-competitive practices such as collusion or, more specifically, market division. Our findings demonstrate that LLMs can effectively monopolize specific commodities by dynamically adjusting their pricing and resource allocation strategies, thereby maximizing profitability without direct human input or explicit collusion commands. These results pose unique challenges and opportunities for businesses looking to integrate AI into strategic roles and for regulatory bodies tasked with maintaining fair and competitive markets. The study provides a foundation for further exploration into the ramifications of deferring high-stakes decisions to LLM-based agents.
摘要:机器学习技术在实际市场场景中的应用日益增多。本文探讨了大语言模型 (LLM) 作为自主智能体在多商品市场中,特别是在古诺竞争框架下的战略行为。我们研究了 LLM 是否能够独立进行反竞争行为,如串谋,或更具体地说,市场分割。研究结果表明,LLM 能够通过动态调整定价和资源分配策略,有效垄断特定商品,从而在没有直接人类干预或明确串谋指令的情况下最大化利润。这些结果为希望将 AI 整合到战略角色的企业以及负责维护公平竞争市场的监管机构带来了独特的挑战和机遇。本研究为进一步探索将高风险决策委托给基于 LLM 的智能体的后果奠定了基础。

[NLP-67] Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach

【速读】: 该论文试图解决从语音直接建模语言时,语音模型在语义能力上落后于基于文本的模型的问题。解决方案的关键在于通过在音素分类任务上微调语音表示模型,使其生成更具上下文不变性的表示,从而提升下游语言建模的性能。

链接: https://arxiv.org/abs/2410.00025
作者: Maxime Poli,Emmanuel Chemla,Emmanuel Dupoux
关键词-EN: Recent progress, progress in Spoken, Spoken Language Modeling, Spoken Language, demonstrated the feasibility
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Recent progress in Spoken Language Modeling has demonstrated the feasibility of learning language directly from speech. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems tend to trail behind text-based language models in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, which in turn improve downstream language modeling performance.
摘要:近年来,口语语言建模的进展展示了直接从语音中学习语言的可行性。通过在文本层面操作的生成语音的流水线通常会丢失细微差别、语调和非语言的发声。直接从语音建模为更自然和富有表现力的系统开辟了道路。另一方面,纯语音系统在语义能力方面往往落后于基于文本的语言模型。我们展示了在音素分类上微调语音表示模型可以产生更多上下文不变的表示,从而提高下游语言建模的性能。
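
下面给出在预训练语音表示模型上加帧级音素分类头并计算交叉熵损失的最小示意;骨干模型名 facebook/wav2vec2-base、音素集大小与帧级标签均为假设,仅用于说明"音素分类微调"的结构。

```python
# 示意代码:在预训练语音表示模型(wav2vec 2.0)上加一个帧级音素分类头并计算损失;
# 骨干模型名、音素集大小与帧级标签均为假设,仅演示"音素分类微调"的结构
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class PhonemeClassifier(nn.Module):
    def __init__(self, num_phonemes=40, backbone="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_phonemes)

    def forward(self, waveform):
        hidden = self.encoder(waveform).last_hidden_state   # [batch, frames, hidden]
        return self.head(hidden)                            # 每帧的音素 logits

model = PhonemeClassifier()
wav = torch.randn(1, 16000)                                  # 1 秒 16kHz 波形
logits = model(wav)
frame_labels = torch.randint(0, 40, (1, logits.shape[1]))    # 假设已有帧级音素对齐标签
loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), frame_labels)
print(logits.shape, loss.item())
```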

[NLP-68] Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization

【速读】: 该论文试图解决在小型非参数记忆数据库中进行检索增强生成(RAG)时,由于数据库规模较小导致检索结果稀疏和噪声问题。解决方案的关键在于引入语义相似性搜索以提高检索精度,并首次在非参数记忆中加入正则化技术,以减少推理过程中邻居搜索操作的噪声影响,从而显著降低困惑度并提高模型在领域迁移时的泛化能力。此外,论文还探讨了在模拟内存计算硬件上实现非参数记忆的可能性,展示了O(1)的搜索时间,尽管检索邻居时会引入噪声,但性能损失最小(1%)。

链接: https://arxiv.org/abs/2410.00004
作者: Gentiana Rashiti,Geethan Karunaratne,Mrinmaya Sachan,Abu Sebastian,Abbas Rahimi
关键词-EN: language modeling capabilities, retrieval augmented generation, improve language modeling, augmented generation, trillions of entries
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Published as a conference paper at European Conference on Artificial Intelligence 2024

点击查看摘要

Abstract:The retrieval augmented generation (RAG) system such as Retro has been shown to improve language modeling capabilities and reduce toxicity and hallucinations by retrieving from a database of non-parametric memory containing trillions of entries. We introduce Retro-li that shows retrieval can also help using a small-scale database, but it demands more accurate and better neighbors when searching in a smaller hence sparser non-parametric memory. This can be met by using a proper semantic similarity search. We further propose adding a regularization to the non-parametric memory for the first time: it significantly reduces perplexity when the neighbor search operations are noisy during inference, and it improves generalization when a domain shift occurs. We also show that Retro-li’s non-parametric memory can potentially be implemented on analog in-memory computing hardware, exhibiting O(1) search time while causing noise in retrieving neighbors, with minimal (1%) performance loss. Our code is available at: this https URL.
摘要:检索增强生成 (Retrieval Augmented Generation, RAG) 系统,如 Retro,通过从包含数万亿条目的非参数记忆数据库中检索,已被证明可以提升语言建模能力并减少毒性和幻觉。我们引入了 Retro-li,展示了在小规模数据库中检索也能提供帮助,但在较小的、因此更为稀疏的非参数记忆中搜索时,需要更准确和更好的邻居。这可以通过使用适当的语义相似性搜索来实现。我们首次提出对非参数记忆添加正则化:在推理过程中邻居搜索操作存在噪声时,这显著降低了困惑度,并且在领域转移发生时提高了泛化能力。我们还展示了 Retro-li 的非参数记忆有可能在模拟内存计算硬件上实现,展示出 O(1) 的搜索时间,同时在检索邻居时引入噪声,性能损失最小 (1%)。我们的代码可在以下链接获取:this https URL。

[NLP-69] Towards Robust Multimodal Sentiment Analysis with Incomplete Data NEURIPS2024

【速读】: 该论文试图解决多模态情感分析(MSA)中数据不完整性的问题。解决方案的关键在于提出了一种语言主导的抗噪学习网络(LNLN),通过主导模态修正(DMC)模块和基于主导模态的多模态学习(DMML)模块,确保语言模态作为主导模态的高质量表示,从而提升模型在各种噪声场景下的鲁棒性。

链接: https://arxiv.org/abs/2409.20012
作者: Haoyu Zhang,Wenbin Wang,Tianshu Yu
关键词-EN: Multimodal Sentiment Analysis, emerging direction seeking, Sentiment Analysis, Noise-resistant Learning Network, Language-dominated Noise-resistant Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:The field of Multimodal Sentiment Analysis (MSA) has recently witnessed an emerging direction seeking to tackle the issue of data incompleteness. Recognizing that the language modality typically contains dense sentiment information, we consider it as the dominant modality and present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust MSA. The proposed LNLN features a dominant modality correction (DMC) module and dominant modality based multimodal learning (DMML) module, which enhances the model’s robustness across various noise scenarios by ensuring the quality of dominant modality representations. Aside from the methodical design, we perform comprehensive experiments under random data missing scenarios, utilizing diverse and meaningful settings on several popular datasets (\textite.g., MOSI, MOSEI, and SIMS), providing additional uniformity, transparency, and fairness compared to existing evaluations in the literature. Empirically, LNLN consistently outperforms existing baselines, demonstrating superior performance across these challenging and extensive evaluation metrics.
摘要:多模态情感分析 (Multimodal Sentiment Analysis, MSA) 领域最近出现了一个新兴方向,旨在解决数据不完整性问题。认识到语言模态通常包含密集的情感信息,我们将其视为主导模态,并提出了一种创新的语言主导抗噪学习网络 (Language-dominated Noise-resistant Learning Network, LNLN),以实现稳健的 MSA。所提出的 LNLN 具有主导模态校正 (Dominant Modality Correction, DMC) 模块和基于主导模态的多模态学习 (Dominant Modality based Multimodal Learning, DMML) 模块,通过确保主导模态表示的质量,增强了模型在各种噪声场景下的鲁棒性。除了方法上的设计,我们在随机数据缺失场景下进行了全面的实验,利用了多个流行数据集(例如 MOSI、MOSEI 和 SIMS)上的多样化且有意义的设置,与现有文献中的评估相比,提供了额外的统一性、透明性和公平性。从经验上看,LNLN 始终优于现有的基线方法,在这些具有挑战性和广泛的评估指标上展示了卓越的性能。

[NLP-70] AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

【速读】: 该论文试图解决大语言模型(LLMs)在训练数据不断过时的情况下如何保持更新和可靠性的问题。解决方案的关键在于开发一个系统,该系统能够自动收集网络数据并利用现有的可信AI模型进行自动过滤,以去除数据中的偏见、垃圾信息和其他不安全或不希望的文本,从而确保训练数据的纯净性。实验结果表明,该系统在净化数据方面具有显著效果。

链接: https://arxiv.org/abs/2406.19271
作者: Praneeth Vadlapati
关键词-EN: Large Language Models, reliable Large Language, Large Language, Language Models, reliable Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Initial version

点击查看摘要

Abstract:Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enable automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system’s effectiveness in purifying the data.
摘要:当前,人们持续追求最新且可靠的大语言模型 (LLMs)。通常,LLMs 是在固定数据集上进行训练,然后部署。然而,训练数据会不断变得过时。利用网络数据进行 AI 的自动训练涉及数据质量和安全性的重大问题,因为存在偏见、垃圾信息以及其他不安全或不受欢迎的文本。纯净的数据对于生成可靠的模型至关重要。在杂质数据上训练模型可能会导致不良结果。本研究提出了一种系统,该系统能够收集网络数据并通过现有可信 AI 模型的辅助自动过滤掉不受欢迎的文本。在实验中,收集并过滤了一小部分网络数据,证明了该系统在净化数据方面的有效性。
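
下面是一个用现成毒性分类器对抓取文本做自动过滤的示意;分类器名称 unitary/toxic-bert 与阈值 0.5 均为假设,实际系统可按论文思路串联多个可信模型(垃圾信息、偏见等)依次过滤。

```python
# 示意代码:用现成的毒性分类器过滤抓取的网页文本,保留"干净"样本;
# 分类器名称 unitary/toxic-bert 与阈值 0.5 均为假设,实际系统可串联多个可信模型
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

web_texts = [
    "The library opens at 9 am on weekdays.",
    "You are an idiot and everyone hates you.",
    "Buy cheap followers now!!! Click this link!!!",
]

clean = []
for text in web_texts:
    result = toxicity(text)[0]          # 形如 {'label': 'toxic', 'score': ...}
    if result["label"] == "toxic" and result["score"] > 0.5:
        continue                        # 丢弃判定为有害的文本
    clean.append(text)

print(clean)
```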

[NLP-71] Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments

【速读】: 该论文试图解决在处理非结构化高维治疗(如文本)时,如何提高因果推断的有效性问题。解决方案的关键在于利用生成式人工智能,特别是大型语言模型(LLMs),来高效生成治疗并利用其内部表示进行后续的因果效应估计。通过这种方式,可以有效分离出感兴趣的治疗特征(如特定情感和主题),同时避免其他可能的未知混杂特征的影响。与现有方法不同,该方法无需从数据中学习因果表示,从而提高了估计的准确性和效率。此外,论文还提出了非参数识别平均处理效应的条件,设计了避免违反重叠假设的估计策略,并通过双重机器学习方法推导了所提出估计量的渐近性质。最后,通过工具变量方法,将该方法扩展到治疗特征基于人类感知而非固定给定治疗对象的场景。

链接: https://arxiv.org/abs/2410.00903
作者: Kosuke Imai,Kentaro Nakamura
关键词-EN: generative Artificial Intelligence, Artificial Intelligence, unstructured high-dimensional treatments, generative Artificial, enhance the validity
类目: Applications (stat.AP); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence. Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps separate the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike the existing methods, our proposed approach eliminates the need to learn causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to the settings, in which the treatment feature is based on human perception rather than is assumed to be fixed given the treatment object. We conduct simulation studies using the generated text data with an open-source LLM, Llama3, to illustrate the advantages of our estimator over the state-of-the-art causal representation learning algorithms.
摘要:本文展示了如何利用生成式人工智能 (Generative AI) 的力量,增强对非结构化高维处理(如文本)的因果推断的有效性。具体而言,我们提出使用深度生成模型,如大语言模型 (LLMs),来高效生成处理,并利用其内部表示进行后续的因果效应估计。我们证明,这种真实内部表示的知识有助于将感兴趣的处理特征(如特定情感和某些主题)与其他可能未知的混杂特征区分开来。与现有方法不同,我们提出的方法无需从数据中学习因果表示,因此能够产生更准确和高效的估计。我们正式确立了非参数识别平均处理效应所需的条件,提出了一种避免违反重叠假设的估计策略,并通过应用双重机器学习推导了所提出估计量的渐近性质。最后,使用工具变量方法,我们将所提出的方法扩展到处理特征基于人类感知而非假定为固定给定处理对象的场景。我们使用开源 LLM Llama3 生成的文本数据进行模拟研究,以说明我们的估计量相对于最先进的因果表示学习算法的优势。

[NLP-72] The age of spiritual machines: Language quietus induces synthetic altered states of consciousness in artificial intelligence

【速读】: 该论文试图解决语言与意识之间的关系问题,特别是探讨语言分类在改变意识状态中的作用。解决方案的关键在于通过调整CLIP和FLAVA模型中的注意力权重,模拟改变意识状态,并比较这些状态下的语义嵌入空间与实际改变意识状态问卷中的嵌入空间。研究发现,减少对语言和视觉的注意力权重后,模型更倾向于表现出无实体、无自我、精神性和统一性等状态,这与高剂量致幻剂或专注冥想所引发的意识状态相符,这些状态通常与心理健康和幸福感的提升相关。这一结果支持了语言分类在改变意识状态现象学中的重要作用。

链接: https://arxiv.org/abs/2410.00257
作者: Jeremy I Skipper,Joanna Kuc,Greg Cooper,Christopher Timmermann
关键词-EN: altered states, states, language, altered, language related
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 Figures

点击查看摘要

Abstract:How is language related to consciousness? Language functions to categorise perceptual experiences (e.g., labelling interoceptive states as ‘happy’) and higher-level constructs (e.g., using ‘I’ to represent the narrative self). Psychedelic use and meditation might be described as altered states that impair or intentionally modify the capacity for linguistic categorisation. For example, psychedelic phenomenology is often characterised by ‘oceanic boundlessness’ or ‘unity’ and ‘ego dissolution’, which might be expected of a system unburdened by entrenched language categories. If language breakdown plays a role in producing such altered behaviour, multimodal artificial intelligence might align more with these phenomenological descriptions when attention is shifted away from language. We tested this hypothesis by comparing the semantic embedding spaces from simulated altered states after manipulating attentional weights in CLIP and FLAVA models to embedding spaces from altered states questionnaires before manipulation. Compared to random text and various other altered states including anxiety, models were more aligned with disembodied, ego-less, spiritual, and unitive states, as well as minimal phenomenal experiences, with decreased attention to language and vision. Reduced attention to language was associated with distinct linguistic patterns and blurred embeddings within and, especially, across semantic categories (e.g., ‘giraffes’ become more like ‘bananas’). These results lend support to the role of language categorisation in the phenomenology of altered states of consciousness, like those experienced with high doses of psychedelics or concentration meditation, states that often lead to improved mental health and wellbeing.
摘要:语言与意识之间有何关联?语言的功能在于对感知体验(例如,将内感受状态标记为“快乐”)和更高层次的结构(例如,使用“我”来代表叙事自我)进行分类。迷幻药物的使用和冥想可能被描述为改变的状态,这些状态会损害或有意修改语言分类的能力。例如,迷幻现象学通常以“海洋般的无边界感”或“统一性”和“自我消解”为特征,这可能是由于系统不受根深蒂固的语言类别束缚所致。如果语言的崩溃在产生这种改变行为中起作用,那么当注意力从语言转移时,多模态人工智能可能更符合这些现象学描述。我们通过比较在CLIP和FLAVA模型中操纵注意力权重后模拟的改变状态的语义嵌入空间与操纵前改变状态问卷的嵌入空间,来测试这一假设。与随机文本和包括焦虑在内的其他各种改变状态相比,模型在减少对语言和视觉的注意力时,更符合无实体、无自我、精神性和统一性的状态,以及最小化的现象体验。减少对语言的注意力与特定的语言模式和模糊的嵌入有关,尤其是在语义类别之间(例如,“长颈鹿”变得更像“香蕉”)。这些结果支持了语言分类在改变意识状态的现象学中的作用,如高剂量迷幻药物或专注冥想所体验到的状态,这些状态通常会带来心理健康和幸福感的提升。

[NLP-73] Mamba for Streaming ASR Combined with Unimodal Aggregation ICASSP2025

【速读】: 该论文试图解决流式自动语音识别(ASR)中的效率和延迟问题。解决方案的关键在于引入Mamba编码器,这是一种具有线性复杂度优势的状态空间模型,能够匹配或超越Transformer的性能。论文还提出了一种前瞻机制,以利用可控的未来信息,并实现了一种流式单模态聚合(UMA)方法,该方法自动检测标记活动并触发标记输出,同时聚合特征帧以增强标记表示的学习。此外,基于UMA提出了早期终止(ET)方法,进一步减少识别延迟。实验结果表明,所提出的模型在识别精度和延迟方面均表现出色。

链接: https://arxiv.org/abs/2410.00070
作者: Ying Fang,Xiaofei Li
关键词-EN: streaming automatic speech, automatic speech recognition, paper works, automatic speech, Abstract
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear complexity advantage. We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and streamingly triggers token output, and meanwhile aggregates feature frames for better learning token representation. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
摘要: 本文研究了流式自动语音识别 (ASR)。Mamba 是一种最近提出的状态空间模型,在各种任务中展示了与 Transformer 相匹配甚至超越的能力,同时受益于线性复杂度的优势。我们探索了 Mamba 编码器在流式 ASR 中的效率,并提出了一种相关的前瞻机制,以利用可控的未来信息。此外,实现了一种流式单模态聚合 (UMA) 方法,该方法自动检测 Token 活动并流式触发 Token 输出,同时聚合特征帧以更好地学习 Token 表示。基于 UMA,提出了一种早期终止 (ET) 方法,以进一步减少识别延迟。在两个普通话数据集上进行的实验表明,所提出的模型在识别准确性和延迟方面均达到了具有竞争力的 ASR 性能。
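
下面给出“流式单模态聚合 (UMA)”思路的一个极简示意(假设性示例,非论文实现):假设编码器已为每帧输出标量激活权重,当权重序列出现谷值时视为一个 Token 段的边界,段内帧按权重加权平均得到该 Token 的表示;谷值判定规则与示例数据均为演示用的假设。

```python
# 极简示意:流式单模态聚合 (UMA) —— 以权重谷值为界,段内帧加权平均(假设性示例)
import numpy as np

def streaming_uma(frames, weights, eps=1e-8):
    """frames: (T, D) 帧特征;weights: (T,) 每帧激活权重(0~1)。
    当 w[t-1] 同时低于左右邻帧(谷值)时,谷值之前的帧聚合为一个 Token。"""
    tokens, buf_feats, buf_w = [], [], []
    for t in range(len(frames)):
        buf_feats.append(frames[t])
        buf_w.append(weights[t])
        if t >= 2 and weights[t - 1] < weights[t - 2] and weights[t - 1] < weights[t]:
            seg_f = np.stack(buf_feats[:-1])              # 谷值之前的帧构成一个段
            seg_w = np.array(buf_w[:-1])
            tokens.append((seg_w[:, None] * seg_f).sum(0) / (seg_w.sum() + eps))
            buf_feats, buf_w = [buf_feats[-1]], [buf_w[-1]]   # 当前帧归入下一段
    return np.stack(tokens) if tokens else np.empty((0, frames.shape[1]))

# 示例:10 帧、4 维特征,权重包含两个“单峰”;流式下第二个峰需等到下一个谷值才会触发输出
frames = np.random.randn(10, 4)
weights = np.array([0.1, 0.8, 0.9, 0.3, 0.05, 0.2, 0.7, 0.95, 0.4, 0.1])
print(streaming_uma(frames, weights).shape)   # (1, 4):第一个峰聚合出 1 个 Token
```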

[NLP-74] Moshi: a speech-text foundation model for real-time dialogue

【速读】: 该论文试图解决当前语音对话系统中存在的几个关键问题,包括系统复杂性导致的交互延迟、非语言信息(如情感或非语音声音)在文本转换中的丢失,以及对话分割中无法处理重叠语音、打断和插话等问题。解决方案的关键在于将语音对话视为语音到语音的生成过程,通过使用基于文本语言模型的骨干网络,结合神经音频编解码器的残差量化器生成语音标记,并分别建模用户和系统的语音流。这种方法不仅去除了显式的说话者切换,还允许建模任意的对话动态。此外,论文提出的“内心独白”方法通过首先预测时间对齐的文本标记作为音频标记的前缀,显著提高了生成语音的语言质量,并支持流式语音识别和文本到语音的转换。最终,该模型实现了首个实时全双工语音大型语言模型,理论延迟为160毫秒,实际延迟为200毫秒。

链接: https://arxiv.org/abs/2410.00037
作者: Alexandre Défossez,Laurent Mazaré,Manu Orsini,Amélie Royer,Patrick Pérez,Hervé Jégou,Edouard Grave,Neil Zeghidour
关键词-EN: speech-text foundation model, speech-text foundation, spoken dialogue, dialogue, introduce Moshi
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning – such as emotion or non-speech sounds – is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this ``Inner Monologue’’ method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at this https URL.
摘要:我们介绍了 Moshi,这是一个语音-文本基础模型和全双工口语对话框架。当前的口语对话系统依赖于一系列独立组件的流水线,即语音活动检测、语音识别、文本对话和文本转语音。这样的框架无法模拟真实对话的体验。首先,其复杂性导致交互之间存在几秒钟的延迟。其次,由于文本是对话的中间媒介,修改意义的非语言信息(如情感或非语音声音)在交互中丢失。最后,这些系统依赖于将对话分割为说话者轮次,这并未考虑到重叠语音、打断和插话。Moshi 通过将口语对话视为语音到语音的生成,一次性解决了这些独立问题。从文本语言模型主干开始,Moshi 从神经音频编解码器的残差量化器生成语音 Token,同时将自身语音和用户语音分别建模为并行流。这使得可以去除显式的说话者轮次,并建模任意的对话动态。此外,我们将先前工作的层次化语义到声学 Token 生成扩展到首先预测与时间对齐的文本 Token,作为音频 Token 的前缀。这种“内心独白”方法不仅显著提高了生成语音的语言质量,还展示了其如何提供流式语音识别和文本转语音。我们最终的模型是首个实时全双工口语大语言模型,理论延迟为 160 毫秒,实际延迟为 200 毫秒,并可通过此 https URL 获取。

[NLP-75] FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation Casing and Context

【速读】: 该论文试图解决乌兹别克语语音识别中词错误率(WER)偏高的问题,解决方案的关键在于引入FeruzaSpeech语料库。FeruzaSpeech是一个包含60小时高质量录音的乌兹别克语朗读语音语料库,由来自乌兹别克斯坦塔什干的一位女性母语者录制,内容包括书籍摘录和BBC新闻片段。通过将FeruzaSpeech整合到现有的CommonVoice 16.1乌兹别克语数据和乌兹别克语语音语料库中,词错误率(WER)得到显著改善。

链接: https://arxiv.org/abs/2410.00035
作者: Anna Povey,Katherine Povey
关键词-EN: Cyrillic and Latin, academic research purposes, Latin alphabets, read speech corpus, research purposes
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 Pages, 1 Figure, Preprint of Paper Accepted in ICNLSP 2024

点击查看摘要

Abstract:This paper introduces FeruzaSpeech, a read speech corpus of the Uzbek language, containing transcripts in both Cyrillic and Latin alphabets, freely available for academic research purposes. This corpus includes 60 hours of high-quality recordings from a single native female speaker from Tashkent, Uzbekistan. These recordings consist of short excerpts from a book and BBC News. This paper discusses the enhancement of the Word Error Rates (WERs) on CommonVoice 16.1’s Uzbek data, Uzbek Speech Corpus data, and FeruzaSpeech data upon integrating FeruzaSpeech.
摘要:本文介绍了 FeruzaSpeech,这是一个乌兹别克语的朗读语音语料库,包含西里尔字母和拉丁字母的转录文本,可免费用于学术研究目的。该语料库包含来自乌兹别克斯坦塔什干的一位母语女性的高质量录音,总计 60 小时。这些录音包括来自一本书和 BBC 新闻的短片段。本文讨论了在整合 FeruzaSpeech 后,对 CommonVoice 16.1 的乌兹别克语数据、乌兹别克语语音语料库数据以及 FeruzaSpeech 数据上的词错误率 (WER) 的提升。

人工智能

[AI-0] The Gradient of Health Data Privacy

链接: https://arxiv.org/abs/2410.00897
作者: Baihan Lin
关键词-EN: artificial intelligence, increasingly complex, significant implications, global health equity, health
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Other Quantitative Biology (q-bio.OT)
*备注:

点击查看摘要

Abstract:In the era of digital health and artificial intelligence, the management of patient data privacy has become increasingly complex, with significant implications for global health equity and patient trust. This paper introduces a novel “privacy gradient” approach to health data governance, offering a more nuanced and adaptive framework than traditional binary privacy models. Our multidimensional concept considers factors such as data sensitivity, stakeholder relationships, purpose of use, and temporal aspects, allowing for context-sensitive privacy protections. Through policy analyses, ethical considerations, and case studies spanning adolescent health, integrated care, and genomic research, we demonstrate how this approach can address critical privacy challenges in diverse healthcare settings worldwide. The privacy gradient model has the potential to enhance patient engagement, improve care coordination, and accelerate medical research while safeguarding individual privacy rights. We provide policy recommendations for implementing this approach, considering its impact on healthcare systems, research infrastructures, and global health initiatives. This work aims to inform policymakers, healthcare leaders, and digital health innovators, contributing to a more equitable, trustworthy, and effective global health data ecosystem in the digital age.

[AI-1] GEMS: Generative Expert Metric System through Iterative Prompt Priming

链接: https://arxiv.org/abs/2410.00880
作者: Ti-Chung Cheng,Carmen Badea,Christian Bird,Thomas Zimmermann,Robert DeLine,Nicole Forsgren,Denae Ford
关键词-EN: informing decisions, resolving conflicts, measurements are fundamental, fundamental to identifying, software communities
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 29 pages, 3 figures

点击查看摘要

Abstract:Across domains, metrics and measurements are fundamental to identifying challenges, informing decisions, and resolving conflicts. Despite the abundance of data available in this information age, not only can it be challenging for a single expert to work across multi-disciplinary data, but non-experts can also find it unintuitive to create effective measures or transform theories into context-specific metrics that are chosen appropriately. This technical report addresses this challenge by examining software communities within large software corporations, where different measures are used as proxies to locate counterparts within the organization to transfer tacit knowledge. We propose a prompt-engineering framework inspired by neural activities, demonstrating that generative models can extract and summarize theories and perform basic reasoning, thereby transforming concepts into context-aware metrics to support software communities given software repository data. While this research zoomed in on software communities, we believe the framework’s applicability extends across various fields, showcasing expert-theory-inspired metrics that aid in triaging complex challenges.

[AI-2] Do Music Generation Models Encode Music Theory?

链接: https://arxiv.org/abs/2410.00872
作者: Megan Wei,Michael Freeman,Chris Donahue,Chen Sun
关键词-EN: possess impressive music, music theory concepts, models possess impressive, Music, music theory
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ISMIR 2024. Dataset: this https URL Code: this https URL Website: this https URL

点击查看摘要

Abstract:Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work, by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models? More specifically, are fundamental Western music theory concepts observable within the “inner workings” of these models? Recent work proposed leveraging latent audio representations from music generation models towards music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded within these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus, we introduce SynTheory, a synthetic MIDI and audio music theory dataset, consisting of tempos, time signatures, notes, intervals, scales, chords, and chord progressions concepts. We then propose a framework to probe for these music theory concepts in music foundation models (Jukebox and MusicGen) and assess how strongly they encode these concepts within their internal representations. Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.

[AI-3] MAP: Unleashing Hybrid Mamba-Transformer Vision Backbones Potential with Masked Autoregressive Pretraining

链接: https://arxiv.org/abs/2410.00871
作者: Yunze Liu,Li Yi
关键词-EN: achieved significant advantages, large parameters remains, Mamba, autoregressive pretraining, Masked Autoregressive Pretraining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mamba has achieved significant advantages in long-context modeling and autoregressive tasks, but its scalability with large parameters remains a major limitation in vision applications. Pretraining is a widely used strategy to enhance backbone model performance. Although the success of Masked Autoencoder in Transformer pretraining is well recognized, it does not significantly improve Mamba’s visual learning performance. We found that using the correct autoregressive pretraining can significantly boost the performance of the Mamba architecture. Based on this analysis, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and Autoregressive pretraining, improving the performance of Mamba and Transformer modules within a unified paradigm. Additionally, in terms of integrating Mamba and Transformer modules, we empirically found that inserting Transformer layers at regular intervals within Mamba layers can significantly enhance downstream task performance. Experimental results show that both the pure Mamba architecture and the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperform other pretraining strategies, achieving state-of-the-art performance. We validate the effectiveness of the method on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component.

[AI-4] WiGNet: Windowed Vision Graph Neural Network

链接: https://arxiv.org/abs/2410.00807
作者: Gabriele Spadaro,Marco Grangetto,Attilio Fiandrotti,Enzo Tartaglione,Jhony H. Giraldo
关键词-EN: Graph Neural Networks, demonstrated strong adaptability, Neural Networks, Graph Neural, vision Graph neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Graph Neural Networks (GNNs) have demonstrated strong adaptability to various real-world challenges, with architectures such as Vision GNN (ViG) achieving state-of-the-art performance in several computer vision tasks. However, their practical applicability is hindered by the computational complexity of constructing the graph, which scales quadratically with the image size. In this paper, we introduce a novel Windowed vision Graph neural Network (WiGNet) model for efficient image processing. WiGNet explores a different strategy from previous works by partitioning the image into windows and constructing a graph within each window. Therefore, our model uses graph convolutions instead of the typical 2D convolution or self-attention mechanism. WiGNet effectively manages computational and memory complexity for large image sizes. We evaluate our method in the ImageNet-1k benchmark dataset and test the adaptability of WiGNet using the CelebA-HQ dataset as a downstream task with higher-resolution images. In both of these scenarios, our method achieves competitive results compared to previous vision GNNs while keeping memory and computational complexity at bay. WiGNet offers a promising solution toward the deployment of vision GNNs in real-world applications. We publicly released the code at this https URL.
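
下面给出“窗口内建图”思路的一个极简示意(假设性示例,非论文实现):将特征图划分为不重叠窗口,仅在每个窗口内部做 k 近邻建图并聚合邻居特征,从而避免在整幅图上构建随图像尺寸平方增长的图;窗口大小、k 值与聚合方式均为假设。

```python
# 极简示意:窗口划分 + 窗口内 kNN 图聚合(假设性示例)
import torch

def windowed_graph_agg(x, win=4, k=4):
    """x: (B, C, H, W) 特征图;在每个 win x win 窗口内部做 kNN 邻居聚合(max 聚合)。"""
    B, C, H, W = x.shape
    assert H % win == 0 and W % win == 0
    # 划分窗口,得到 (B*窗口数, win*win, C) 的节点集合
    xw = x.unfold(2, win, win).unfold(3, win, win)            # (B, C, H/w, W/w, win, win)
    xw = xw.permute(0, 2, 3, 4, 5, 1).reshape(-1, win * win, C)
    # 窗口内两两距离,取 k 个最近邻(含自身)
    dist = torch.cdist(xw, xw)                                # (B*nWin, n, n)
    idx = dist.topk(k, largest=False).indices                 # (B*nWin, n, k)
    neighbors = torch.gather(
        xw.unsqueeze(1).expand(-1, win * win, -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, C))              # (B*nWin, n, k, C)
    agg = neighbors.max(dim=2).values                         # 图聚合:对邻居取 max
    # 还原回 (B, C, H, W)
    agg = agg.reshape(B, H // win, W // win, win, win, C)
    return agg.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

out = windowed_graph_agg(torch.randn(2, 8, 16, 16))
print(out.shape)  # torch.Size([2, 8, 16, 16])
```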

[AI-5] Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction

链接: https://arxiv.org/abs/2410.00774
作者: Hyogo Hiruma,Hiroshi Ito,Tetusya Ogata
关键词-EN: performing real-world robot, real-world robot tasks, characteristic to handle, environments has long, difficult characteristic
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty of environments has long been a difficult characteristic to handle when performing real-world robot tasks. This is because the uncertainty produces unexpected observations that cannot be covered by manual scripting. Learning based robot controlling methods are a promising approach for generating flexible motions against unknown situations, but still tend to suffer under uncertainty due to their deterministic nature. In order to adaptively perform the target task under such conditions, the robot control model must be able to accurately understand the possible uncertainty, and to exploratively derive the optimal action that minimizes such uncertainty. This paper extends an existing predictive learning based robot control method, which employs foresight prediction using dynamic internal simulation. The foresight module refines the model’s hidden states by sampling multiple possible futures and replacing them with the one that leads to the lowest future uncertainty. The adaptiveness of the model was evaluated on a door opening task. The door can be opened either by pushing, pulling, or sliding, but the robot cannot visually distinguish which way, and is required to adapt on the fly. The results showed that the proposed model adaptively diverged its motion through interaction with the door, whereas conventional methods failed to stably diverge. The models were analyzed on Lyapunov exponents of RNN hidden states which reflect the possible divergence at each time step during task execution. The result indicated that the foresight module biased the model to consider future consequences, which leads to embedding uncertainties at the policy of the robot controller, rather than the resultant observation. This is beneficial for implementing adaptive behaviors, which induces the derivation of diverse motion during exploration.

[AI-6] BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

链接: https://arxiv.org/abs/2410.00773
作者: Xuwu Wang,Qiwen Cui,Yunzhe Tao,Yiran Wang,Ziwei Chai,Xiaotian Han,Boyi Liu,Jianbo Yuan,Jing Su,Guoyin Wang,Tingkai Liu,Liyu Chen,Tianyi Liu,Tao Sun,Yufeng Zhang,Sirui Zheng,Quanzeng You,Yang Yang,Hongxia Yang
关键词-EN: Large language models, Visual Question Answering, Large language, complex data types, increasingly pivotal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. BabelBench incorporates a dataset comprising 247 meticulously curated problems that challenge the models with tasks in perception, commonsense reasoning, logical reasoning, and so on. Besides the basic capabilities of multimodal understanding, structured data processing as well as code generation, these tasks demand advanced capabilities in exploration, planning, reasoning and debugging. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement. The insights derived from our comprehensive analysis offer valuable guidance for future research within the community. The benchmark data can be found at this https URL.

[AI-7] LTLf Synthesis on First-Order Action Theories

链接: https://arxiv.org/abs/2410.00726
作者: Till Hofmann,Jens Claßen
关键词-EN: high-level agent language, includes nondeterministic operators, expressive high-level agent, high-level agent, agent language
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Golog is an expressive high-level agent language that includes nondeterministic operators which allow to leave some of the decisions to be made only at execution time. This so-called program realization is typically implemented by means of search, or in an incremental online fashion. In this paper, we consider the more realistic case where parts of the non-determinism are under the control of the environment. Program realization then becomes a synthesis problem, where a successful realization executes the program and satisfies the temporal goal for all possible environment actions. We consider Golog programs in combination with an expressive class of first-order action theories that allow for an unbounded number of objects and non-local effects, together with a temporal goal specified in a first-order extension of LTLf. We solve the synthesis problem by constructing a game arena that captures all possible executions of the program while tracking the satisfaction of the temporal goal and then solving the resulting two-player game. We evaluate the approach in two domains, showing the general feasibility of the approach.

[AI-8] Contrastive Abstraction for Reinforcement Learning

链接: https://arxiv.org/abs/2410.00704
作者: Vihang Patil,Markus Hofmarcher,Elisabeth Rumetshofer,Sepp Hochreiter
关键词-EN: contrastive abstraction learning, Learning, abstraction learning, Abstract, contrastive abstraction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e. to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, contrastive abstraction learning does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.
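
下面给出该两阶段方法的一个极简示意(假设性示例,非论文实现):第一阶段用 InfoNCE 式对比损失拉近轨迹中相邻状态的表示;第二阶段用现代 Hopfield 检索(对存储模式做 softmax 加权并迭代)把相似表示映射到同一不动点,即抽象状态;网络结构、模式数量与 β 均为假设。

```python
# 极简示意:相邻状态对比学习 + 现代 Hopfield 检索得到抽象状态(假设性示例)
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def info_nce(z_t, z_tp1, temp=0.1):
    """把 (s_t, s_{t+1}) 当作正样本对,批内其他状态作为负样本。"""
    z_t, z_tp1 = F.normalize(z_t, dim=-1), F.normalize(z_tp1, dim=-1)
    logits = z_t @ z_tp1.T / temp                        # (B, B)
    labels = torch.arange(len(z_t))
    return F.cross_entropy(logits, labels)

# 阶段一:自监督对比训练(此处用随机数据代替真实轨迹)
for _ in range(100):
    s_t = torch.randn(64, 16)
    s_tp1 = s_t + 0.05 * torch.randn(64, 16)             # 相邻状态近似相似
    loss = info_nce(encoder(s_t), encoder(s_tp1))
    opt.zero_grad(); loss.backward(); opt.step()

# 阶段二:现代 Hopfield 检索;存储模式的数量决定抽象粒度(不动点个数)
patterns = F.normalize(torch.randn(8, 32), dim=-1)       # 8 个不动点 -> 8 个抽象状态
def hopfield_retrieve(z, beta=8.0, steps=3):
    z = F.normalize(z, dim=-1)
    for _ in range(steps):                                # 迭代更新,收敛到某个不动点附近
        z = F.normalize(F.softmax(beta * z @ patterns.T, dim=-1) @ patterns, dim=-1)
    return z, (beta * z @ patterns.T).argmax(dim=-1)      # 返回收敛表示与抽象状态编号

with torch.no_grad():
    _, abstract_ids = hopfield_retrieve(encoder(torch.randn(5, 16)))
print(abstract_ids)  # 每个状态对应的抽象状态编号
```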

[AI-9] Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2410.00700
作者: Saurav Jha,Shiqi Yang,Masato Ishii,Mengjie Zhao,Christian Simon,Jehanzeb Mirza,Dong Gong,Lina Yao,Shusuke Takahashi,Yuki Mitsufuji
关键词-EN: user-defined text descriptions, grown popular, ability to efficiently, efficiently acquire, user-defined text
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Work under review

点击查看摘要

Abstract:Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones – a challenge that continual personalization (CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models, to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art.

[AI-10] Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity

链接: https://arxiv.org/abs/2410.00690
作者: Quan Nguyen,Nishant A. Mehta,Cristóbal Guzmán
关键词-EN: group distributionally robust, distributionally robust optimization, distributionally robust, lambda, beta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 38 pages

点击查看摘要

Abstract:The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a \log(K) factor, for K the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we dub (\lambda, \beta) -sparsity. In short, this condition means that at any parameter \theta , there is a set of at most \beta groups whose risks at \theta all are at least \lambda larger than the risks of the other groups. To find an \epsilon -optimal \theta , we show via a novel algorithm and analysis that the \epsilon -dependent term in the sample complexity can swap a linear dependence on K for a linear dependence on the potentially much smaller \beta . This improvement leverages recent progress in sleeping bandits, showing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. The aforementioned result assumes having a particular \lambda as input. Perhaps surprisingly, we next show an adaptive algorithm which, up to log factors, gets sample complexity that adapts to the best (\lambda, \beta) -sparsity condition that holds. Finally, for a particular input \lambda , we also show how to get a dimension-free sample complexity result.
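
按摘要的文字描述,(λ, β)-稀疏性条件可形式化为下式(记号为演示而设:共 K 个组,R_k(θ) 为第 k 组在参数 θ 处的风险;此为按摘要复述的示意,精确定义以原文为准):

```latex
% (lambda, beta)-sparsity:按摘要文字复述的示意性定义
\forall\, \theta \in \Theta,\ \exists\, S_\theta \subseteq [K],\ |S_\theta| \le \beta,
\quad \text{s.t.} \quad
R_i(\theta) \;\ge\; R_j(\theta) + \lambda
\qquad \forall\, i \in S_\theta,\ \forall\, j \notin S_\theta .
```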

[AI-11] Multimodal Auto Validation For Self-Refinement in Web Agents

链接: https://arxiv.org/abs/2410.00689
作者: Ruhana Azam,Tamer Abuelsaad,Aditya Vempaty,Ashish Jagmohan
关键词-EN: world digitizes, monotonous tasks, essential in streamlining, web agents, web
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:As our world digitizes, web agents that can automate complex and monotonous tasks are becoming essential in streamlining workflows. This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement. We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents, building upon the state-of-the-art Agent-E web automation framework. We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures. Our results show significant gains on Agent-E’s (a SOTA web agent) prior state-of-art performance, boosting task-completion rates from 76.2% to 81.24% on the subset of the WebVoyager benchmark. The approach presented in this paper paves the way for more reliable digital assistants in complex, real-world scenarios.

[AI-12] Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation EMNLP

链接: https://arxiv.org/abs/2410.00683
作者: Jiyoon Myung,Jihyeon Park,Jungki Son,Kyungro Lee,Joohyung Han
关键词-EN: accurately translating technical, translating technical terms, specialized fields, large language models, paper addresses
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Paper accepted in EMNLPW 2024

点击查看摘要

Abstract:This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields. We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation. To implement this approach, we generated a representative PTT dataset using a collaborative approach with large language models and applied knowledge distillation to fine-tune traditional Neural Machine Translation (NMT) models and small-sized Large Language Models (sLMs). Additionally, we developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms. Our findings indicate that sLMs did not consistently outperform NMT models, with fine-tuning proving more effective than few-shot prompting, particularly in models with continued pre-training in the target language. These insights contribute to the advancement of more reliable terminology translation methodologies.

[AI-13] Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models

链接: https://arxiv.org/abs/2410.00681
作者: Mazen Balat,Rewaa Awaad,Hend Adel,Ahmed B. Zaky,Salah A. Aly
关键词-EN: deep learning methods, Arabic Alphabet Sign, Alphabet Sign Language, Arabic sign language, Arabic Alphabet
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures

点击查看摘要

Abstract:This paper presents an Arabic Alphabet Sign Language recognition approach, using deep learning methods in conjunction with transfer learning and transformer-based models. We study the performance of the different variants on two publicly available datasets, namely ArSL2018 and AASL. This task will make full use of state-of-the-art CNN architectures like ResNet50, MobileNetV2, and EfficientNetB7, and the latest transformer models such as Google ViT and Microsoft Swin Transformer. These pre-trained models have been fine-tuned on the above datasets in an attempt to capture some unique features of Arabic sign language motions. Experimental results present evidence that the suggested methodology can receive a high recognition accuracy, by up to 99.6% and 99.43% on ArSL2018 and AASL, respectively. That is far beyond the previously reported state-of-the-art approaches. This performance opens up even more avenues for communication that may be more accessible to Arabic-speaking deaf and hard-of-hearing, and thus encourages an inclusive society.

[AI-14] Multimodal Coherent Explanation Generation of Robot Failures

链接: https://arxiv.org/abs/2410.00659
作者: Pradip Pramanick,Silvia Rossi
关键词-EN: social spaces, actions is crucial, acceptance in social, robot, robot actions
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The explainability of a robot’s actions is crucial to its acceptance in social spaces. Explaining why a robot fails to complete a given task is particularly important for non-expert users to be aware of the robot’s capabilities and limitations. So far, research on explaining robot failures has only considered generating textual explanations, even though several studies have shown the benefits of multimodal ones. However, a simple combination of multiple modalities may lead to semantic incoherence between the information across different modalities - a problem that is not well-studied. An incoherent multimodal explanation can be difficult to understand, and it may even become inconsistent with what the robot and the human observe and how they perform reasoning with the observations. Such inconsistencies may lead to wrong conclusions about the robot’s capabilities. In this paper, we introduce an approach to generate coherent multimodal explanations by checking the logical coherence of explanations from different modalities, followed by refinements as required. We propose a classification approach for coherence assessment, where we evaluate if an explanation logically follows another. Our experiments suggest that fine-tuning a neural network that was pre-trained to recognize textual entailment, performs well for coherence assessment of multimodal explanations. Code data: this https URL.

[AI-15] Explainable Multi-Stakeholder Job Recommender Systems RECSYS2024

链接: https://arxiv.org/abs/2410.00654
作者: Roan Schellingerhout
关键词-EN: Public opinion, recent years, increasingly wary, wary in recent, recommender systems
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 5 pages, 1 figure, to be published in ACM RecSys 2024

点击查看摘要

Abstract:Public opinion on recommender systems has become increasingly wary in recent years. In line with this trend, lawmakers have also started to become more critical of such systems, resulting in the introduction of new laws focusing on aspects such as privacy, fairness, and explainability for recommender systems and AI at large. These concepts are especially crucial in high-risk domains such as recruitment. In recruitment specifically, decisions carry substantial weight, as the outcomes can significantly impact individuals’ careers and companies’ success. Additionally, there is a need for a multi-stakeholder approach, as these systems are used by job seekers, recruiters, and companies simultaneously, each with its own requirements and expectations. In this paper, I summarize my current research on the topic of explainable, multi-stakeholder job recommender systems and set out a number of future research directions.

[AI-16] LASMP: Language Aided Subset Sampling Based Motion Planner

链接: https://arxiv.org/abs/2410.00649
作者: Saswati Bhattacharjee,Anirban Sinha,Chinwe Ekenna
关键词-EN: Aided Subset Sampling, Subset Sampling Based, Language Aided Subset, Aided Subset, Subset Sampling
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:This paper presents the Language Aided Subset Sampling Based Motion Planner (LASMP), a system that helps mobile robots plan their movements by using natural language instructions. LASMP uses a modified version of the Rapidly Exploring Random Tree (RRT) method, which is guided by user-provided commands processed through a language model (RoBERTa). The system improves efficiency by focusing on specific areas of the robot’s workspace based on these instructions, making it faster and less resource-intensive. Compared to traditional RRT methods, LASMP reduces the number of nodes needed by 55% and cuts random sample queries by 80%, while still generating safe, collision-free paths. Tested in both simulated and real-world environments, LASMP has shown better performance in handling complex indoor scenarios. The results highlight the potential of combining language processing with motion planning to make robot navigation more efficient.
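
下面给出“语言引导的子集采样”思路的一个极简示意(假设性示例,非论文实现):先把自然语言指令映射为工作空间中的一个感兴趣区域(此处用假设的 region_from_instruction 代替 RoBERTa 流程),RRT 扩展时以较高概率只在该区域内采样,从而减少节点数与随机采样次数;区域、概率与步长均为假设参数。

```python
# 极简示意:语言引导的区域偏置 RRT 采样(假设性示例;省略碰撞检测)
import random, math

def region_from_instruction(instruction):
    """假设的指令->区域映射,真实系统中由 RoBERTa 等语言模型给出。"""
    return {"left shelf": (0.0, 0.0, 0.4, 1.0), "right table": (0.6, 0.0, 1.0, 0.5)}.get(
        instruction, (0.0, 0.0, 1.0, 1.0))

def biased_rrt(start, goal, instruction, iters=2000, step=0.05, p_region=0.8):
    xmin, ymin, xmax, ymax = region_from_instruction(instruction)
    nodes = [start]
    for _ in range(iters):
        if random.random() < p_region:                     # 大概率只在语言指定的子集内采样
            sample = (random.uniform(xmin, xmax), random.uniform(ymin, ymax))
        else:                                              # 小概率在全空间采样,保留探索能力
            sample = (random.random(), random.random())
        near = min(nodes, key=lambda n: (n[0]-sample[0])**2 + (n[1]-sample[1])**2)
        d = math.hypot(sample[0]-near[0], sample[1]-near[1]) or 1e-9
        new = (near[0] + step*(sample[0]-near[0])/d, near[1] + step*(sample[1]-near[1])/d)
        nodes.append(new)
        if math.hypot(new[0]-goal[0], new[1]-goal[1]) < step:
            return nodes
    return nodes

tree = biased_rrt((0.1, 0.9), (0.2, 0.2), "left shelf")
print(len(tree), "nodes expanded")
```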

[AI-17] Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures SIGGRAPH

链接: https://arxiv.org/abs/2410.00630
作者: Marcel C. Bühler,Gengyan Li,Erroll Wood,Leonhard Helminger,Xu Chen,Tanmay Shah,Daoye Wang,Stephan Garbin,Sergio Orts-Escolano,Otmar Hilliges,Dmitry Lagun,Jérémy Riviere,Paulo Gotardo,Thabo Beeler,Abhimitra Meka,Kripasindhu Sarkar
关键词-EN: neural radiance field, radiance field representations, representations have revolutionized, capture and photorealistic, neural radiance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Siggraph Asia Conference Papers 2024

点击查看摘要

Abstract:Volumetric modeling and neural radiance field representations have revolutionized 3D face capture and photorealistic novel view synthesis. However, these methods often require hundreds of multi-view input images and are thus inapplicable to cases with less than a handful of inputs. We present a novel volumetric prior on human faces that allows for high-fidelity expressive face modeling from as few as three input views captured in the wild. Our key insight is that an implicit prior trained on synthetic data alone can generalize to extremely challenging real-world identities and expressions and render novel views with fine idiosyncratic details like wrinkles and eyelashes. We leverage a 3D Morphable Face Model to synthesize a large training set, rendering each identity with different expressions, hair, clothing, and other assets. We then train a conditional Neural Radiance Field prior on this synthetic dataset and, at inference time, fine-tune the model on a very sparse set of real images of a single subject. On average, the fine-tuning requires only three inputs to cross the synthetic-to-real domain gap. The resulting personalized 3D model reconstructs strong idiosyncratic facial expressions and outperforms the state-of-the-art in high-quality novel view synthesis of faces from sparse inputs in terms of perceptual and photo-metric quality.

[AI-18] GERA: Geometric Embedding for Efficient Point Registration Analysis

链接: https://arxiv.org/abs/2410.00589
作者: Geng Li,Haozhi Cao,Mingyang Liu,Shenghai Yuan,Jianfei Yang
关键词-EN: surgical guidance systems, provide estimated transformations, align point clouds, Point cloud registration, cloud registration aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Point cloud registration aims to provide estimated transformations to align point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces inference time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring speed and reliability.
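
下面给出“离线构造几何编码 + 纯 MLP”思路的一个极简示意(假设性示例,非论文实现):对每个点,用它到 k 个最近邻的距离(升序排列)作为与坐标系无关的几何特征,再交给小型 MLP 提取逐点描述子;由于该编码对刚体变换不变,源点云与变换后的目标点云会得到几乎一致的描述子。近邻数、特征形式与网络结构均为假设。

```python
# 极简示意:用近邻距离作为离线几何编码,代替原始 3D 坐标输入(假设性示例)
import torch

def geometric_encoding(points, k=8):
    """points: (N, 3)。返回 (N, k):每点到 k 个最近邻的距离(升序),
    对刚体变换不变,可在配准前离线计算。"""
    dist = torch.cdist(points, points)                    # (N, N)
    return dist.topk(k + 1, largest=False).values[:, 1:]  # 去掉到自身的 0 距离

mlp = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

src = torch.randn(256, 3)
# 对源点云施加一个随机刚体(正交)变换与平移,得到目标点云
Q, _ = torch.linalg.qr(torch.randn(3, 3))
tgt = src @ Q.T + torch.tensor([0.5, -0.2, 0.1])

feat_src = mlp(geometric_encoding(src))                   # (256, 32) 逐点描述子
feat_tgt = mlp(geometric_encoding(tgt))
# 几何编码对刚体变换不变,两组描述子应逐点几乎一致,可据此建立对应关系
print(torch.allclose(feat_src, feat_tgt, atol=1e-4))
```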

[AI-19] Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

链接: https://arxiv.org/abs/2410.00564
作者: Jie Cheng,Ruixi Qiao,Gang Xiong,Qinghai Miao,Yingwei Ma,Binhua Li,Yongbin Li,Yisheng Lv
关键词-EN: heterogeneous datasets, significant aspiration, develop a generalist, high capabilities, offline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world model in conditional video generation, we explore the potential of image observation-based world model for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data corresponding to about 4 trajectories per game, which demonstrates superior generalization of JOWA. We will release codes at this https URL.

[AI-20] AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation EMNLP2024

链接: https://arxiv.org/abs/2410.00558
作者: Ziyang Luo,Xin Li,Hongzhan Lin,Jing Ma,Lidong Bing
关键词-EN: response distillation, generation has led, trend to replicate, replicate these capabilities, response
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: EMNLP 2024

点击查看摘要

Abstract:The impressive performance of proprietary LLMs like GPT4 in code generation has led to a trend to replicate these capabilities in open-source models through knowledge distillation (e.g. Code Evol-Instruct). However, these efforts often neglect the crucial aspect of response quality, relying heavily on teacher models for direct response distillation. This paradigm, especially for complex instructions, can degrade the quality of synthesized data, compromising the knowledge distillation process. To this end, our study introduces the Adaptive Modular Response Evolution (AMR-Evol) framework, which employs a two-stage process to refine response distillation. The first stage, modular decomposition, breaks down the direct response into more manageable sub-modules. The second stage, adaptive response evolution, automatically evolves the response with the related function modules. Our experiments with three popular code benchmarks (HumanEval, MBPP, and EvalPlus) attest to the superiority of the AMR-Evol framework over baseline response distillation methods. By comparing with the open-source Code LLMs trained on a similar scale of data, we observed performance enhancements: more than +3.0 points on HumanEval-Plus and +1.0 points on MBPP-Plus, which underscores the effectiveness of our framework. Our codes are available at this https URL.

[AI-21] Optimal Causal Representations and the Causal Information Bottleneck ICLR2025

链接: https://arxiv.org/abs/2410.00535
作者: Francisco N. F. Q. Simoes,Mehdi Dastani,Thijs van Ommen
关键词-EN: preserving key features, effectively study complex, discarding irrelevant details, complex causal systems, study complex causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: Submitted to ICLR 2025. Code available at this http URL

点击查看摘要

Abstract:To effectively study complex causal systems, it is often useful to construct representations that simplify parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach in representation learning that compresses random variables while retaining information about a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces representations which are causally interpretable, and which can be used when reasoning about interventions. We present experimental results demonstrating that the learned representations accurately capture causality as intended.

[AI-22] TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

链接: https://arxiv.org/abs/2410.00531
作者: Zonghang Li,Wenjiao Feng,Mohsen Guizani,Hongfang Yu
关键词-EN: Large model inference, user interaction data, Large model, shifting from cloud, due to concerns
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: This paper is currently under review. Find the code at this https URL

点击查看摘要

Abstract:Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local in the users’ devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with the computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
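
下面给出“星型 allreduce”思路的一个极简示意(假设性示例,非论文实现):所有设备把局部张量发给中心节点,中心节点求和后广播回去,每次同步只经历两跳链路延迟;当瓶颈是链路延迟而非带宽时,这比需要 2(N-1) 个通信步的环形 allreduce 更有利。此处用进程内函数模拟通信,真实系统需替换为 socket/RPC 等传输。

```python
# 极简示意:星型 allreduce(中心节点求和后广播),用列表模拟各设备(假设性示例)
import numpy as np

def star_allreduce(local_tensors, hub=0):
    """local_tensors: 每个设备持有的局部张量;hub: 充当中心的设备编号。
    第 1 跳:所有设备 -> 中心;第 2 跳:中心 -> 所有设备,共 2 跳链路延迟。"""
    total = np.zeros_like(local_tensors[hub])
    for t in local_tensors:                  # 模拟各设备把张量发给中心并在中心累加
        total += t
    return [total.copy() for _ in local_tensors]   # 模拟中心把求和结果广播回每个设备

# 4 台边缘设备,各自持有张量并行计算出的部分结果
parts = [np.full(3, i, dtype=np.float64) for i in range(4)]   # [0,0,0], [1,1,1], ...
reduced = star_allreduce(parts)
print(reduced[0])   # [6. 6. 6.],每台设备都拿到全局和

# 对比:环形 allreduce 需要 2*(N-1) 个通信步,而星型只需 2 跳;
# 在链路延迟高、单次同步张量很小的边缘推理场景下,星型更占优势。
```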

[AI-23] Exploring the Learning Capabilities of Language Models using LEVERWORLDS

链接: https://arxiv.org/abs/2410.00519
作者: Eitan Wagner,Amir Feder,Omri Abend
关键词-EN: stochastic setting, setting often involves, general structure rules, Learning, specific properties
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning a model of a stochastic setting often involves learning both general structure rules and specific properties of the instance. This paper investigates the interplay between learning the general and the specific in various learning methods, with emphasis on sample efficiency. We design a framework called LeverWorlds, which allows the generation of simple physics-inspired worlds that follow a similar generative process with different distributions, and their instances can be expressed in natural language. These worlds allow for controlled experiments to assess the sample complexity of different learning methods. We experiment with classic learning algorithms as well as Transformer language models, both with fine-tuning and In-Context Learning (ICL). Our general finding is that (1) Transformers generally succeed in the task; but (2) they are considerably less sample efficient than classic methods that make stronger assumptions about the structure, such as Maximum Likelihood Estimation and Logistic Regression. This finding is in tension with the recent tendency to use Transformers as general-purpose estimators. We propose an approach that leverages the ICL capabilities of contemporary language models to apply simple algorithms for this type of data. Our experiments show that models currently struggle with the task but show promising potential.

[AI-24] Human-Robot Collaborative Minimum Time Search through Sub-priors in Ant Colony Optimization

链接: https://arxiv.org/abs/2410.00517
作者: Oscar Gil Viyuela,Alberto Sanfeliu
关键词-EN: highly promising issue, promising issue owing, Artificial Intelligence, Human-Robot Collaboration, Human-Robot Interaction
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human-Robot Collaboration (HRC) has evolved into a highly promising issue owing to the latest breakthroughs in Artificial Intelligence (AI) and Human-Robot Interaction (HRI), among other reasons. This emerging growth increases the need to design multi-agent algorithms that can manage also human preferences. This paper presents an extension of the Ant Colony Optimization (ACO) meta-heuristic to solve the Minimum Time Search (MTS) task, in the case where humans and robots perform an object searching task together. The proposed model consists of two main blocks. The first one is a convolutional neural network (CNN) that provides the prior probabilities about where an object may be from a segmented image. The second one is the Sub-prior MTS-ACO algorithm (SP-MTS-ACO), which takes as inputs the prior probabilities and the particular search preferences of the agents in different sub-priors to generate search plans for all agents. The model has been tested in real experiments for the joint search of an object through a Vizanti web-based visualization in a tablet computer. The designed interface allows the communication between a human and our humanoid robot named IVO. The obtained results show an improvement in the search perception of the users without loss of efficiency.
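
下面给出“CNN 先验 + 子先验”蚁群式搜索的一个极简示意(假设性示例,非论文实现):把目标位置的先验概率图与体现个体搜索偏好的子先验相乘作为启发式项,蚂蚁按信息素与启发式的乘积采样要访问的格子,并对当前最优路径加强信息素;网格大小、参数与更新规则均为假设。

```python
# 极简示意:带子先验的蚁群式搜索,在网格上为单个智能体生成搜索序列(假设性示例)
import numpy as np
rng = np.random.default_rng(0)

H, W, ants, iters, rho = 8, 8, 20, 30, 0.1
prior = rng.random((H, W)); prior /= prior.sum()          # 目标位置先验(真实系统由 CNN 给出)
sub_prior = np.ones((H, W)); sub_prior[:, :W // 2] *= 3   # 子先验:该智能体偏好左半区域
heuristic = prior * sub_prior
pheromone = np.ones((H, W))

def sample_path(length=10, alpha=1.0, beta=2.0):
    score = (pheromone ** alpha) * (heuristic ** beta)
    p = (score / score.sum()).ravel()
    cells = rng.choice(H * W, size=length, replace=False, p=p)   # 蚂蚁依概率选出访问格子
    return list(zip(*np.unravel_index(cells, (H, W))))

def path_quality(path):
    return sum(prior[r, c] for r, c in path)               # 覆盖的先验概率越多,质量越高

best = None
for _ in range(iters):
    paths = [sample_path() for _ in range(ants)]
    best = max(paths + ([best] if best else []), key=path_quality)
    pheromone *= (1 - rho)                                 # 信息素挥发
    for r, c in best:
        pheromone[r, c] += path_quality(best)              # 对当前最优路径的格子加强信息素
print("best search cells:", best[:5])
```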

[AI-25] Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing EMNLP2024

链接: https://arxiv.org/abs/2410.00513
作者: Deokhyung Kang,Seonjeong Hwang,Yunsu Kim,Gary Geunbae Lee
关键词-EN: utilize multilingual pretrained, Recent efforts, pretrained language models, requiring extensive annotations, extend semantic parsing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Recent efforts have aimed to utilize multilingual pretrained language models (mPLMs) to extend semantic parsing (SP) across multiple languages without requiring extensive annotations. However, achieving zero-shot cross-lingual transfer for SP remains challenging, leading to a performance gap between source and target languages. In this study, we propose Cross-Lingual Back-Parsing (CBP), a novel data augmentation methodology designed to enhance cross-lingual transfer for SP. Leveraging the representation geometry of the mPLMs, CBP synthesizes target language utterances from source meaning representations. Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings, by utilizing only labeled data in the source language and monolingual corpora. Extensive experiments on two cross-language SP benchmarks (Mschema2QA and Xspider) demonstrate that CBP brings substantial gains in the target language. Further analysis of the synthesized utterances shows that our method successfully generates target language utterances with high slot value alignment rates while preserving semantic integrity. Our codes and data are publicly available at this https URL.

[AI-26] FlipGuard: Defending Preference Alignment against Update Regression with Constrained Optimization EMNLP2024

链接: https://arxiv.org/abs/2410.00508
作者: Mingye Zhu,Yi Liu,Quan Wang,Junbo Guo,Zhendong Mao
关键词-EN: Large Language Models’, improved Large Language, Language Models’ ability, significantly improved Large, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted by EMNLP 2024 Main track

点击查看摘要

Abstract:Recent breakthroughs in preference alignment have significantly improved Large Language Models’ ability to generate texts that align with human preferences and values. However, current alignment metrics typically emphasize the post-hoc overall improvement, while overlooking a critical aspect: regression, which refers to the backsliding on previously correctly-handled data after updates. This potential pitfall may arise from excessive fine-tuning on already well-aligned data, which subsequently leads to over-alignment and degeneration. To address this challenge, we propose FlipGuard, a constrained optimization approach to detect and mitigate update regression with focal attention. Specifically, FlipGuard identifies performance degradation using a customized reward characterization and strategically enforces a constraint to encourage conditional congruence with the pre-aligned model during training. Comprehensive experiments demonstrate that FlipGuard effectively alleviates update regression while demonstrating excellent overall performance, with the added benefit of knowledge preservation while aligning preferences.
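
下面给出“翻转检测 + 条件一致性约束”思路的一个极简示意(假设性示例,非论文实现):对一个批次,比较预对齐模型与当前模型回复的奖励,找出“原本更好、更新后变差”的翻转样本,仅对这些样本额外施加向预对齐模型看齐的 KL 惩罚;奖励来源、判定方式与惩罚权重均为假设。

```python
# 极简示意:翻转检测 + 仅对翻转样本施加 KL 一致性约束的训练步(假设性示例)
import torch
import torch.nn.functional as F

def flipguard_step(policy_logits, ref_logits, reward_new, reward_ref,
                   align_loss, penalty=1.0):
    """policy_logits/ref_logits: (B, V) 当前模型与预对齐模型的输出分布;
    reward_new/reward_ref: (B,) 两个模型回复的奖励打分;align_loss: 常规对齐损失。"""
    flipped = (reward_ref > reward_new)                    # 更新后反而变差的“翻转”样本
    kl = F.kl_div(F.log_softmax(policy_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="none").sum(-1)                # (B,) 逐样本 KL
    congruence = (kl * flipped.float()).mean()             # 只约束翻转样本向预对齐模型看齐
    return align_loss + penalty * congruence

# 玩具数据演示
B, V = 4, 10
loss = flipguard_step(torch.randn(B, V, requires_grad=True), torch.randn(B, V),
                      reward_new=torch.tensor([0.9, 0.2, 0.8, 0.1]),
                      reward_ref=torch.tensor([0.5, 0.7, 0.6, 0.4]),
                      align_loss=torch.tensor(1.0))
print(loss)
```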

[AI-27] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Utilizing Deep Learning and YOLO Integration

链接: https://arxiv.org/abs/2410.00503
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
关键词-EN: stereo vision camera, tree branches, research focuses, drone equipped, vision camera
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research focuses on the development of a drone equipped with pruning tools and a stereo vision camera to accurately detect and measure the spatial positions of tree branches. YOLO is employed for branch segmentation, while two depth estimation approaches, monocular and stereo, are investigated. In comparison to SGBM, deep learning techniques produce more refined and accurate depth maps. In the absence of ground-truth data, a fine-tuning process using deep neural networks is applied to approximate optimal depth values. This methodology facilitates precise branch detection and distance measurement, addressing critical challenges in the automation of pruning operations. The results demonstrate notable advancements in both accuracy and efficiency, underscoring the potential of deep learning to drive innovation and enhance automation in the agricultural sector.

[AI-28] Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach EMNLP2024

链接: https://arxiv.org/abs/2410.00502
作者: Diogo Pernes,Gonçalo M. Correia,Afonso Mendes
关键词-EN: Cross-lingual summarization aims, bridge language barriers, aims to bridge, Cross-lingual summarization, multi-target cross-lingual summarization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024 (Findings)

点击查看摘要

Abstract:Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.

[AI-29] Learning Adaptive Hydrodynamic Models Using Neural ODEs in Complex Conditions

链接: https://arxiv.org/abs/2410.00490
作者: Cong Wang,Aoming Liang,Fei Han,Xinyu Zeng,Zhibin Li,Dixia Fan,Jens Kober
关键词-EN: Reinforcement learning-based quadruped, Reinforcement learning-based, complex underwater environment, learning-based quadruped robots, quadruped robots excel
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Reinforcement learning-based quadruped robots excel across various terrains but still lack the ability to swim in water due to the complex underwater environment. This paper presents the development and evaluation of a data-driven hydrodynamic model for amphibious quadruped robots, aiming to enhance their adaptive capabilities in complex and dynamic underwater environments. The proposed model leverages Neural Ordinary Differential Equations (ODEs) combined with attention mechanisms to accurately process and interpret real-time sensor data. The model enables the quadruped robots to understand and predict complex environmental patterns, facilitating robust decision-making strategies. We harness real-time sensor data, capturing various environmental and internal state parameters to train and evaluate our model. A significant focus of our evaluation involves testing the quadruped robot’s performance across different hydrodynamic conditions and assessing its capabilities at varying speeds and fluid dynamic conditions. The outcomes suggest that the model can effectively learn and adapt to varying conditions, enabling the prediction of force states and enhancing autonomous robotic behaviors in various practical scenarios.

[AI-30] MCGM: Mask Conditional Text-to-Image Generative Model

链接: https://arxiv.org/abs/2410.00483
作者: Rami Skaik,Leonardo Rossi,Tomaso Fontanini,Andrea Prati
关键词-EN: Recent advancements, artificial intelligence, enabling the creation, Generative Model, revolutionized the field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 13 figures, presented at the 5th International Conference on Artificial Intelligence and Machine Learning (CAIML 2024)

点击查看摘要

Abstract:Recent advancements in generative models have revolutionized the field of artificial intelligence, enabling the creation of highly-realistic and detailed images. In this study, we propose a novel Mask Conditional Text-to-Image Generative Model (MCGM) that leverages the power of conditional diffusion models to generate pictures with specific poses. Our model builds upon the success of the Break-a-scene [1] model in generating new scenes using a single image with multiple subjects and incorporates a mask embedding injection that allows the conditioning of the generation process. By introducing this additional level of control, MCGM offers a flexible and intuitive approach for generating specific poses for one or more subjects learned from a single image, empowering users to influence the output based on their requirements. Through extensive experimentation and evaluation, we demonstrate the effectiveness of our proposed model in generating high-quality images that meet predefined mask conditions and improving the current Break-a-scene generative model.

[AI-31] Probabilistic Analysis of Copyright Disputes and Generative AI Safety

链接: https://arxiv.org/abs/2410.00475
作者: Hiroaki Chiba-Okabe
关键词-EN: formalizing relevant judicial, coherent framework based, relevant judicial principles, random-worlds method, disputes by formalizing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 18 pages

点击查看摘要

Abstract:This paper presents a probabilistic approach to analyzing copyright infringement disputes by formalizing relevant judicial principles within a coherent framework based on the random-worlds method. The approach provides a structured analysis of key evidentiary principles, with particular emphasis on the “inverse ratio rule”–a controversial doctrine adopted by some courts. Although this rule has faced significant criticism, a formal proof demonstrates its validity, provided it is properly defined. Additionally, the paper examines the heightened copyright risks posed by generative AI, highlighting how extensive access to copyrighted material by generative models increases the risk of infringement. Utilizing the probabilistic approach, the Near Access-Free (NAF) condition, previously proposed as a potential mitigation strategy, is evaluated. The analysis reveals that while the NAF condition mitigates some infringement risks, its justifiability and efficacy are questionable in certain contexts. These findings demonstrate how a rigorous probabilistic approach can advance our understanding of copyright jurisprudence and its interaction with emerging technologies.

[AI-32] Dynamic Planning for LLM-based Graphical User Interface Automation

链接: https://arxiv.org/abs/2410.00467
作者: Shaoqing Zhang,Zhuosheng Zhang,Kehai Chen,Xinbe Ma,Muyun Yang,Tiejun Zhao,Min Zhang
关键词-EN: large language models, graphical user interfaces, spurred considerable interest, advancing autonomous LLMs-based, smartphone graphical user
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The advent of large language models (LLMs) has spurred considerable interest in advancing autonomous LLM-based agents, particularly in intriguing applications within smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. However, a key challenge lies in devising effective plans to guide action prediction in GUI tasks, though planning has been widely recognized as effective for decomposing complex tasks into a series of steps. Specifically, given the dynamic nature of environmental GUIs following action execution, it is crucial to dynamically adapt plans based on environmental feedback and action outcomes. We show that the widely-used ReAct approach fails due to the excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT involves the dynamic adjustment of planning based on the environmental feedback and execution history. Experimental results reveal that the proposed D-PoT significantly surpassed the strong GPT-4V baseline by +12.7% (34.66% \rightarrow 47.36%) in accuracy. The analysis highlights the generality of dynamic planning in different backbone LLMs, as well as the benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at this https URL.
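
As a concrete illustration of the dynamic-planning idea the abstract describes, the sketch below regenerates the remaining plan from the execution history and the current screen at every step instead of growing one long ReAct-style dialogue. The environment interface (`env.observe`, `env.execute`), the prompt wording, and `parse_action` are illustrative assumptions, not the paper's actual API.

```python
def run_gui_agent(llm, env, parse_action, goal, max_steps=20):
    """Dynamic planning loop in the spirit of D-PoT (illustrative only):
    the remaining plan is rebuilt each step from the execution history
    and the fresh screen observation, instead of one long dialogue."""
    history = []
    for _ in range(max_steps):
        screen = env.observe()  # current GUI state (e.g., screenshot description)
        response = llm(
            f"Goal: {goal}\n"
            f"Execution history: {history}\n"
            f"Current screen: {screen}\n"
            "Revise the remaining plan, then output the next action."
        )
        action = parse_action(response)   # extract a single executable action
        feedback = env.execute(action)    # environment feedback after execution
        history.append((action, feedback))
        if feedback == "task_complete":
            break
    return history
```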

[AI-33] Adversarial Suffixes May Be Features Too!

链接: https://arxiv.org/abs/2410.00451
作者: Wei Zhao,Zhe Li,Yige Li,Jun Sun
关键词-EN: large language models, significant ongoing efforts, large language, language models, remain vulnerable
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including those triggered by adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM's behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from a benign dataset in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets, i.e., even in the absence of harmful content. This highlights the critical risk posed by dominating benign features in the training data and calls for further research to reinforce LLM safety alignment. Our code and data are available at this https URL.

[AI-34] ReXplain: Translating Radiology into Patient-Friendly Video Reports

链接: https://arxiv.org/abs/2410.00441
作者: Luyang Luo,Jenanan Vairavamurthy,Xiaoman Zhang,Abhinav Kumar,Ramon R. Ter-Oganesyan,Stuart T. Schroff,Dan Shilo,Rydhwana Hossain,Mike Moritz,Pranav Rajpurkar
关键词-EN: undermining patient-centered care, undermining patient-centered, remain incomprehensible, Radiology, patient-centered care
类目: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 13 pages

点击查看摘要

Abstract:Radiology reports often remain incomprehensible to patients, undermining patient-centered care. We present ReXplain (Radiology eXplanation), an innovative AI-driven system that generates patient-friendly video reports for radiology findings. ReXplain uniquely integrates a large language model for text simplification, an image segmentation model for anatomical region identification, and an avatar generation tool, producing comprehensive explanations with plain language, highlighted imagery, and 3D organ renderings. Our proof-of-concept study with five board-certified radiologists indicates that ReXplain could accurately deliver radiological information and effectively simulate one-on-one consultations. This work demonstrates a new paradigm in AI-assisted medical communication, potentially improving patient engagement and satisfaction in radiology care, and opens new avenues for research in multimodal medical communication.

[AI-35] Scalable Multi-Task Transfer Learning for Molecular Property Prediction

链接: https://arxiv.org/abs/2410.00432
作者: Chanhui Lee,Dae-Woong Jeong,Sung Moon Ko,Sumin Lee,Hyunseung Kim,Soorin Yim,Sehui Han,Sungwoong Kim,Sungbin Lim
关键词-EN: application vary, transfer learning, Molecules, distinct properties, transfer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Molecules have a number of distinct properties whose importance and application vary. Often, in reality, labels for some properties are hard to obtain despite their practical importance. A common solution to such data scarcity is to use models with good generalization via transfer learning, which involves domain experts designing source and target tasks whose features are shared. However, this approach has limitations: (i) it is difficult to accurately design source-target task pairs given the large number of tasks; (ii) verifying the many trial-and-error transfer learning designs incurs a corresponding computational burden; and (iii) both issues constrain the potential of foundation modeling for multi-task molecular property prediction. We address the limitations of the manual design of transfer learning via data-driven bi-level optimization. The proposed method enables scalable multi-task transfer learning for molecular property prediction by automatically obtaining the optimal transfer ratios. Empirically, the proposed method improved the prediction performance of 40 molecular properties and accelerated training convergence.

[AI-36] LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management

链接: https://arxiv.org/abs/2410.00428
作者: Yi Xiong,Hao Wu,Changxu Shao,Ziqing Wang,Rui Zhang,Yuhong Guo,Junping Zhao,Ke Zhang,Zhenxuan Pan
关键词-EN: expanding context windows, large language models, introduce significant challenges, maintaining low latency, windows in large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, 1 table

点击查看摘要

Abstract:The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which are caused by the growing demands for GPU Key-Value (KV) cache allocation clashing with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that effectively reduces TTFT without requiring additional hardware or compromising output performance, while seamlessly integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory, coupled with an SLO-aware scheduler to optimize overall Service Level Objectives (SLOs). Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 11x and reduces SLO violation rates by 28.7%, significantly enhancing the user experience.
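
To make the layer-wise idea more tangible, here is a minimal sketch of keeping only a budget of layers' KV tensors on the GPU during prefill and parking the rest in host memory. The layer interface and the offload policy are assumptions for illustration, not LayerKV's actual allocator or SLO-aware scheduler.

```python
def prefill_with_layerwise_offload(layers, hidden, gpu_budget_layers=4):
    """Illustrative layer-wise KV management during prefill: only the last
    `gpu_budget_layers` layers keep their KV tensors on the GPU; earlier
    layers' KV tensors are offloaded to host memory and would be reloaded
    on demand at decode time. Assumes each layer returns (hidden, (k, v))."""
    kv_cache = {}
    keep_from = len(layers) - gpu_budget_layers
    for i, layer in enumerate(layers):
        hidden, kv = layer(hidden)
        if i < keep_from:
            kv = tuple(t.to("cpu", non_blocking=True) for t in kv)  # offload early layers
        kv_cache[i] = kv
    return hidden, kv_cache
```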

[AI-37] ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI

链接: https://arxiv.org/abs/2410.00425
作者: Stone Tao,Fanbo Xiang,Arth Shukla,Yuzhe Qin,Xander Hinrichsen,Xiaodi Yuan,Chen Bao,Xinsong Lin,Yulin Liu,Tse-kai Chan,Yuan Gao,Xuanlin Li,Tongzhou Mu,Nan Xiao,Arnav Gurha,Zhiao Huang,Roberto Calandra,Rui Chen,Shan Luo,Hao Su
关键词-EN: enabled unprecedented compute-scalable, unprecedented compute-scalable approaches, robot learning, enabled unprecedented, unprecedented compute-scalable
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Project website: this http URL

点击查看摘要

Abstract:Simulation has enabled unprecedented compute-scalable approaches to robot learning. However, many existing simulation frameworks typically support a narrow range of scenes/tasks and lack features critical for scaling generalizable robotics and sim2real. We introduce and open source ManiSkill3, the fastest state-visual GPU parallelized robotics simulator with contact-rich physics targeting generalizable manipulation. ManiSkill3 supports GPU parallelization of many aspects including simulation+rendering, heterogeneous simulation, pointclouds/voxels visual input, and more. Simulation with rendering on ManiSkill3 can run 10-1000x faster with 2-3x less GPU memory usage than other platforms, achieving up to 30,000+ FPS in benchmarked environments due to minimal python/pytorch overhead in the system, simulation on the GPU, and the use of the SAPIEN parallel rendering system. Tasks that used to take hours to train can now take minutes. We further provide the most comprehensive range of GPU parallelized environments/tasks spanning 12 distinct domains including but not limited to mobile manipulation for tasks such as drawing, humanoids, and dextrous manipulation in realistic scenes designed by artists or real-world digital twins. In addition, millions of demonstration frames are provided from motion planning, RL, and teleoperation. ManiSkill3 also provides a comprehensive set of baselines that span popular RL and learning-from-demonstrations algorithms.

[AI-38] kGuard: A Deep Learning Transformer-Based Solution for Detecting Unsuitable TikTok Content for Kids

链接: https://arxiv.org/abs/2410.00403
作者: Mazen Balat,Mahmoud Essam Gabr,Hend Bakr,Ahmed B. Zaky
关键词-EN: safeguarding young viewers, rise of short-form, brought new challenges, challenges in safeguarding, safeguarding young
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NILES2024

点击查看摘要

Abstract:The rise of short-form videos on platforms like TikTok has brought new challenges in safeguarding young viewers from inappropriate content. Traditional moderation methods often fall short in handling the vast and rapidly changing landscape of user-generated videos, increasing the risk of children encountering harmful material. This paper introduces TikGuard, a transformer-based deep learning approach aimed at detecting and flagging content unsuitable for children on TikTok. By using a specially curated dataset, TikHarm, and leveraging advanced video classification techniques, TikGuard achieves an accuracy of 86.7%, showing a notable improvement over existing methods in similar contexts. While direct comparisons are limited by the uniqueness of the TikHarm dataset, TikGuard’s performance highlights its potential in enhancing content moderation, contributing to a safer online experience for minors. This study underscores the effectiveness of transformer models in video classification and sets a foundation for future research in this area.

[AI-39] Revisiting Essential and Nonessential Settings of Evidential Deep Learning

链接: https://arxiv.org/abs/2410.00393
作者: Mengyuan Chen,Junyu Gao,Changsheng Xu
关键词-EN: Evidential Deep Learning, attracting significant attention, single forward pass, Evidential Deep, Deep Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, under review

点击查看摘要

Abstract:Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despite its success, EDL incorporates several nonessential settings: In model construction, (1) a commonly ignored prior weight parameter is fixed to the number of classes, while its value actually impacts the balance between the proportion of evidence and its magnitude in deriving predictive scores. In model optimization, (2) the empirical risk features a variance-minimizing optimization term that biases the PDF towards a Dirac delta function, potentially exacerbating overconfidence. (3) Additionally, the structural risk typically includes a KL-divergence-minimizing regularization, whose optimization direction extends beyond the intended purpose and contradicts common sense, diminishing the information carried by the evidence magnitude. Therefore, we propose Re-EDL, a simplified yet more effective variant of EDL, by relaxing the nonessential settings and retaining the essential one, namely, the adoption of projected probability from subjective logic. Specifically, Re-EDL treats the prior weight as an adjustable hyperparameter rather than a fixed scalar, and directly optimizes the expectation of the predicted Dirichlet PDF, deprecating both the variance-minimizing optimization term and the divergence regularization term. Extensive experiments and state-of-the-art performance validate the effectiveness of our method. The source code is available at this https URL.
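
For readers unfamiliar with EDL, the sketch below shows the standard parameterization the abstract refers to: non-negative evidence from a network defines Dirichlet concentrations, and predictions come from the projected (expected) probability. Classic EDL fixes the total prior weight to the number of classes; that is exactly the setting Re-EDL relaxes into a hyperparameter. The code is a generic illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def edl_head(logits, prior_weight=None):
    """Standard evidential head: evidence -> Dirichlet concentration.
    Classic EDL fixes prior_weight to the number of classes K; Re-EDL
    instead treats it as a tunable hyperparameter."""
    K = logits.shape[-1]
    W = K if prior_weight is None else prior_weight
    evidence = F.softplus(logits)             # e_k >= 0
    alpha = evidence + W / K                  # Dirichlet parameters
    S = alpha.sum(-1, keepdim=True)
    prob = alpha / S                          # projected (expected) probability
    uncertainty = W / S.squeeze(-1)           # vacuity-style uncertainty
    return prob, uncertainty
```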

[AI-40] Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation COLING2025

链接: https://arxiv.org/abs/2410.00387
作者: Bhargav Shandilya,Alexis Palmer
关键词-EN: modeling technology pose, technology pose challenges, current language modeling, language modeling technology, compute requirements
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 1 figure, 5 tables, submitted to COLING 2025

点击查看摘要

Abstract:The data and compute requirements of current language modeling technology pose challenges for the processing and analysis of low-resource languages. Declarative linguistic knowledge has the potential to partially bridge this data scarcity gap by providing models with useful inductive bias in the form of language-specific rules. In this paper, we propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing. We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM. The results demonstrate that significant leaps in performance and efficiency are possible with the right combination of: a) linguistic inputs in the form of grammars, b) the interpretive power of LLMs, and c) the trainability of smaller token classification networks. We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages. Our work also offers documentary linguists a more reliable and more usable tool for morphological glossing by providing well-reasoned explanations and confidence scores for each output.

[AI-41] STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

链接: https://arxiv.org/abs/2410.00385
作者: Hongjun Wang,Jiyuan Chen,Tong Pan,Zheng Dong,Lingyu Zhang,Renhe Jiang,Xuan Song
关键词-EN: smart city management, efficient resource allocation, city management, transportation planning, Traffic forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Traffic forecasting is a cornerstone of smart city management, enabling efficient resource allocation and transportation planning. Deep learning, with its ability to capture complex nonlinear patterns in spatiotemporal (ST) data, has emerged as a powerful tool for traffic forecasting. While graph convolutional networks (GCNs) and Transformer-based models have shown promise, their computational demands often hinder their application to real-world road networks, particularly those with large-scale spatiotemporal interactions. To address these challenges, we propose a novel spatiotemporal graph transformer (STGformer) architecture. STGformer effectively balances the strengths of GCNs and Transformers, enabling efficient modeling of both global and local traffic patterns while maintaining a manageable computational footprint. Unlike traditional approaches that require multiple attention layers, the STG attention block captures high-order spatiotemporal interactions in a single layer, significantly reducing computational cost. In particular, STGformer achieves a 100x speedup and a 99.8% reduction in GPU memory usage compared to STAEformer during batch inference on a California road graph with 8,600 sensors. We evaluate STGformer on the LargeST benchmark and demonstrate its superiority over state-of-the-art Transformer-based methods such as PDFormer and STAEformer, underscoring STGformer's potential to revolutionize traffic forecasting by overcoming the computational and memory limitations of existing approaches and making it a promising foundation for future spatiotemporal modeling tasks.

[AI-42] Generative Precipitation Downscaling using Score-based Diffusion with Wasserstein Regularization

链接: https://arxiv.org/abs/2410.00381
作者: Yuhao Liu,James Doss-Gollin,Guha Balakrishnan,Ashok Veeraraghavan
关键词-EN: Understanding local risks, sample rare events, assess localized hazards, Understanding local, Climate Prediction Center
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Understanding local risks from extreme rainfall, such as flooding, requires both long records (to sample rare events) and high-resolution products (to assess localized hazards). Unfortunately, there is a dearth of long-record and high-resolution products that can be used to understand local risk and precipitation science. In this paper, we present a novel generative diffusion model that downscales (super-resolves) globally available Climate Prediction Center (CPC) gauge-based precipitation products and ERA5 reanalysis data to generate kilometer-scale precipitation estimates. Downscaling gauge-based precipitation from 55 km to 1 km while recovering extreme rainfall signals poses significant challenges. To enforce our model (named WassDiff) to produce well-calibrated precipitation intensity values, we introduce a Wasserstein Distance Regularization (WDR) term for the score-matching training objective in the diffusion denoising process. We show that WDR greatly enhances the model’s ability to capture extreme values compared to diffusion without WDR. Extensive evaluation shows that WassDiff has better reconstruction accuracy and bias scores than conventional score-based diffusion models. Case studies of extreme weather phenomena, like tropical storms and cold fronts, demonstrate WassDiff’s ability to produce appropriate spatial patterns while capturing extremes. Such downscaling capability enables the generation of extensive km-scale precipitation datasets from existing historical global gauge records and current gauge measurements in areas without high-resolution radar.
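
The key ingredient the abstract highlights is a Wasserstein Distance Regularization (WDR) term added to the score-matching objective so that generated rainfall intensities stay calibrated in distribution. The sketch below combines a standard denoising score-matching loss with an empirical 1-D Wasserstein-1 penalty on the implied clean-sample estimate; the exact weighting and form used in WassDiff may differ.

```python
import torch

def wasserstein_1d(x, y):
    # Empirical Wasserstein-1 distance between two equally sized 1-D samples.
    return (torch.sort(x.flatten()).values - torch.sort(y.flatten()).values).abs().mean()

def score_matching_with_wdr(score_model, x0, sigma, lambda_wdr=0.1):
    """Denoising score matching plus a Wasserstein penalty on precipitation
    intensities (a sketch of the WDR idea, not the paper's exact objective)."""
    noise = torch.randn_like(x0)
    xt = x0 + sigma * noise
    score = score_model(xt, sigma)
    dsm = ((sigma * score + noise) ** 2).mean()   # standard DSM objective
    x0_hat = xt + sigma ** 2 * score              # Tweedie estimate of the clean sample
    wdr = wasserstein_1d(x0_hat, x0)              # match the intensity distribution
    return dsm + lambda_wdr * wdr
```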

[AI-43] CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

链接: https://arxiv.org/abs/2410.00379
作者: Xiao Wang,Fuling Wang,Yuehang Li,Qingchuan Ma,Shiao Wang,Bo Jiang,Chuanfu Li,Jin Tang
关键词-EN: patient wait times, significantly reduce diagnostic, reduce diagnostic burdens, X-ray image-based medical, image-based medical report
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: In Peer Review

点击查看摘要

Abstract:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence which can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and the existing large models' insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs) on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation, X-ray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. Source code can be found at this https URL.
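
One of the pre-training stages the abstract names is X-ray-report contrastive learning to align the image and text feature spaces. A common way to implement such alignment is a CLIP-style symmetric InfoNCE loss, sketched below as an assumption about what this stage could look like rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def xray_report_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between X-ray image embeddings and report embeddings;
    matched (image, report) pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```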

[AI-44] Robust Traffic Forecasting against Spatial Shift over Years

链接: https://arxiv.org/abs/2410.00373
作者: Hongjun Wang,Jiyuan Chen,Tong Pan,Zheng Dong,Lingyu Zhang,Renhe Jiang,Xuan Song
关键词-EN: Graph Neural Networks, Neural Networks, demonstrated promising potential, Spatiotemporal Graph Neural, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advancements in Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have demonstrated promising potential for traffic forecasting by effectively capturing both temporal and spatial correlations. The generalization ability of spatiotemporal models has received considerable attention in recent scholarly discourse. However, no substantive datasets specifically addressing traffic out-of-distribution (OOD) scenarios have been proposed. Existing ST-OOD methods are either constrained to testing on extant data or necessitate manual modifications to the dataset. Consequently, the generalization capacity of current spatiotemporal models in OOD scenarios remains largely underexplored. In this paper, we investigate state-of-the-art models using newly proposed traffic OOD benchmarks and, surprisingly, find that these models experience a significant decline in performance. Through meticulous analysis, we attribute this decline to the models’ inability to adapt to previously unobserved spatial relationships. To address this challenge, we propose a novel Mixture of Experts (MoE) framework, which learns a set of graph generators (i.e., graphons) during training and adaptively combines them to generate new graphs based on novel environmental conditions to handle spatial distribution shifts during testing. We further extend this concept to the Transformer architecture, achieving substantial improvements. Our method is both parsimonious and efficacious, and can be seamlessly integrated into any spatiotemporal model, outperforming current state-of-the-art approaches in addressing spatial dynamics.

[AI-45] Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare

链接: https://arxiv.org/abs/2410.00366
作者: Prasenjit Maji,Amit Kumar Mondal,Hemanta Kumar Mondal,Saraju P. Mohanty
关键词-EN: continuous monitoring devices, intelligent diagnostic systems, Explainable Artificial Intelligence, artificial intelligence, driving innovations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advancements in artificial intelligence (AI) have revolutionized smart healthcare, driving innovations in wearable technologies, continuous monitoring devices, and intelligent diagnostic systems. However, security, explainability, robustness, and performance optimization challenges remain critical barriers to widespread adoption in clinical environments. This research presents an innovative algorithmic method, the Adaptive Feature Evaluator (AFE), to improve feature selection in healthcare datasets and overcome these problems. By integrating Genetic Algorithms (GA), Explainable Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), AFE optimizes Clinical Decision Support Systems (CDSS), thereby enhancing predictive accuracy and interpretability. The proposed method is validated across three diverse healthcare datasets using six distinct machine learning algorithms, demonstrating its robustness and superiority over conventional feature selection techniques. The results underscore the transformative potential of AFE in smart healthcare, enabling personalized and transparent patient care. Notably, the AFE algorithm, when combined with a Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting its capability to improve clinical decision-making processes in real-world healthcare applications.

[AI-46] FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

链接: https://arxiv.org/abs/2410.00362
作者: Zhidong Gao,Yu Zhang,Zhenxiao Zhang,Yanmin Gong,Yuanxiong Guo
关键词-EN: large LMs, demonstrating superior performance, downstream tasks, demonstrating superior, variety of linguistic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 29 pages, 19 figures

点击查看摘要

Abstract:Despite demonstrating superior performance across a variety of linguistic tasks, pre-trained large language models (LMs) often require fine-tuning on specific datasets to effectively address different downstream tasks. However, fine-tuning these LMs for downstream tasks necessitates collecting data from individuals, which raises significant privacy concerns. Federated learning (FL) has emerged as the de facto solution, enabling collaborative model training without sharing raw data. While promising, federated fine-tuning of large LMs faces significant challenges, including restricted access to model parameters and high computation, communication, and memory overhead. To address these challenges, this paper introduces \textbfFederated \textbfProxy-\textbfTuning (FedPT), a novel framework for federated fine-tuning of black-box large LMs, requiring access only to their predictions over the output vocabulary instead of their parameters. Specifically, devices in FedPT first collaboratively tune a smaller LM, and then the server combines the knowledge learned by the tuned small LM with the knowledge learned by the larger pre-trained LM to construct a large proxy-tuned LM that can reach the performance of directly tuned large LMs. The experimental results demonstrate that FedPT can significantly reduce computation, communication, and memory overhead while maintaining competitive performance compared to directly federated fine-tuning of large LMs. FedPT offers a promising solution for efficient, privacy-preserving fine-tuning of large LMs on resource-constrained devices, broadening the accessibility and applicability of state-of-the-art large LMs.
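
The abstract states that the server combines the tuned small LM's knowledge with the frozen large LM while touching only output-vocabulary predictions. A common proxy-tuning recipe with exactly this access pattern is to shift the large model's logits by the offset the tuned small model learned relative to its untuned base; whether FedPT uses precisely this combination is not stated in the abstract, so treat the sketch as illustrative.

```python
def proxy_tuned_logits(large_logits, small_tuned_logits, small_base_logits):
    """Proxy-tuning style combination over a shared output vocabulary:
    the large model's next-token logits are steered by the difference
    between the tuned and untuned small models."""
    return large_logits + (small_tuned_logits - small_base_logits)

# Decoding then samples the next token from softmax(proxy_tuned_logits(...)) as usual.
```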

[AI-47] Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness

链接: https://arxiv.org/abs/2410.00359
作者: Xiao Peng,Xufan Geng
关键词-EN: large language models, applications of large, large language, widely spread, Self-controller
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Applications of large language models (LLMs) have spread widely across domains. However, basic abilities of LLMs, such as controllability, are still limited. To address this, we propose “Self-controller”, a novel agentic framework bringing self-awareness into LLMs’ reasoning logic. The core idea of this work is to maintain states based on the LLM’s response, letting the LLM become self-aware of its current status and think step by step in a multi-round chain-of-thought paradigm. Our experiment on the textual-length state demonstrates the controllability and effectiveness of the Self-controller. We further implement a binary search algorithm to accelerate the generation process based on the linearity and monotonicity of the textual-length state. Another advantage of the Self-controller comes with DeepSeek’s Context Caching technology, which significantly saves computational token consumption when a cluster of conversations shares the same prefix of context. Theoretically, we prove that in this scenario the extra time complexity is O(c \log n). Results of a back-of-the-envelope estimation suggest that the token consumption of our method is no more than twice that of trivial single-round generation. Furthermore, our ablation study on word constraints demonstrates the Self-controller’s consistent controllability across all foundation models.
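
The abstract exploits the assumed monotonicity of output length in a length-control state to accelerate generation with binary search. The sketch below shows that idea over a hypothetical `generate(prompt, requested_len)` helper; the actual multi-round self-awareness loop in the paper is richer than this.

```python
def binary_search_length(generate, prompt, target_len, lo=1, hi=4096, tol=10):
    """Binary search over a requested length, assuming the produced length is
    (approximately) monotone in the request. `generate` is a hypothetical
    wrapper around the LLM that accepts a requested word count."""
    best_text = None
    while lo <= hi:
        mid = (lo + hi) // 2
        text = generate(prompt, requested_len=mid)
        n_words = len(text.split())
        if abs(n_words - target_len) <= tol:
            return text
        if n_words < target_len:
            lo = mid + 1
        else:
            hi = mid - 1
        best_text = text
    return best_text  # closest attempt if the tolerance was never met
```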

[AI-48] Efficient Training of Large Vision Models via Advanced Automated Progressive Learning

链接: https://arxiv.org/abs/2410.00350
作者: Changlin Li,Jiawei Zhang,Sihao Lin,Zongxin Yang,Junwei Liang,Xiaodan Liang,Xiaojun Chang
关键词-EN: Large Vision Models, advancements in Large, Vision Transformers, Large Vision, computational resources
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code: this https URL . arXiv admin note: substantial text overlap with arXiv:2203.14509

点击查看摘要

Abstract:The rapid advancements in Large Vision Models (LVMs), such as Vision Transformers (ViTs) and diffusion models, have led to an increasing demand for computational resources, resulting in substantial financial and environmental costs. This growing challenge highlights the necessity of developing efficient training methods for LVMs. Progressive learning, a training strategy in which model capacity gradually increases during training, has shown potential in addressing these challenges. In this paper, we present an advanced automated progressive learning (AutoProg) framework for efficient training of LVMs. We begin by focusing on the pre-training of LVMs, using ViTs as a case study, and propose AutoProg-One, an AutoProg scheme featuring momentum growth (MoGrow) and a one-shot growth schedule search. Beyond pre-training, we extend our approach to tackle transfer learning and fine-tuning of LVMs. We expand the scope of AutoProg to cover a wider range of LVMs, including diffusion models. First, we introduce AutoProg-Zero, by enhancing the AutoProg framework with a novel zero-shot unfreezing schedule search, eliminating the need for one-shot supernet training. Second, we introduce a novel Unique Stage Identifier (SID) scheme to bridge the gap during network growth. These innovations, integrated with the core principles of AutoProg, offer a comprehensive solution for efficient training across various LVM scenarios. Extensive experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance. This work provides a robust and scalable approach to efficient training of LVMs, with potential applications in a wide range of vision tasks. Code: this https URL

[AI-49] Sparse Attention Decomposition Applied to Circuit Tracing

链接: https://arxiv.org/abs/2410.00340
作者: Gabriel Franco,Mark Crovella
关键词-EN: attention heads, perform complex tasks, attention, attention heads work, papers have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Many papers have shown that attention heads work in conjunction with each other to perform complex tasks. It’s frequently assumed that communication between attention heads is via the addition of specific features to token residuals. In this work we seek to isolate and identify the features used to effect communication and coordination among attention heads in GPT-2 small. Our key leverage on the problem is to show that these features are very often sparsely coded in the singular vectors of attention head matrices. We characterize the dimensionality and occurrence of these signals across the attention heads in GPT-2 small when used for the Indirect Object Identification (IOI) task. The sparse encoding of signals, as provided by attention head singular vectors, allows for efficient separation of signals from the residual background and straightforward identification of communication paths between attention heads. We explore the effectiveness of this approach by tracing portions of the circuits used in the IOI task. Our traces reveal considerable detail not present in previous studies, shedding light on the nature of redundant paths present in GPT-2. And our traces go beyond previous work by identifying features used to communicate between attention heads when performing IOI.
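
The central observation is that inter-head signals are sparsely coded in the singular vectors of attention-head matrices, which makes them separable from the residual background. The toy example below decomposes an effective query-key matrix with SVD and inspects how a residual-stream vector loads on the singular directions; the shapes and random weights are illustrative stand-ins for GPT-2 small-sized components, not the paper's trained model.

```python
import torch

torch.manual_seed(0)
d_model, d_head = 768, 64                        # GPT-2 small dimensions
W_Q = torch.randn(d_model, d_head) / d_model ** 0.5
W_K = torch.randn(d_model, d_head) / d_model ** 0.5
W_QK = W_Q @ W_K.T                               # effective bilinear form on residuals

U, S, Vh = torch.linalg.svd(W_QK)                # singular directions of the head
residual = torch.randn(d_model)                  # a token's residual-stream vector

loadings = U.T @ residual                        # projection onto left singular vectors
top = torch.topk(loadings.abs(), k=5)            # if coding is sparse, a few dominate
print(top.indices, top.values)
```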

[AI-50] Preserving Generalization of Language models in Few-shot Continual Relation Extraction EMNLP2024

链接: https://arxiv.org/abs/2410.00334
作者: Quyen Tran,Nguyen Xuan Thanh,Nguyen Hoang Anh,Nam Le Hai,Trung Le,Linh Van Ngo,Thien Huu Nguyen
关键词-EN: Continual Relations Extraction, Few-shot Continual Relations, Few-shot Continual, Relations Extraction, limited labeled data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to EMNLP 2024

点击查看摘要

Abstract:Few-shot Continual Relations Extraction (FCRE) is an emerging and dynamic area of study where models can sequentially integrate knowledge from new relations with limited labeled data while circumventing catastrophic forgetting and preserving prior knowledge from pre-trained backbones. In this work, we introduce a novel method that leverages often-discarded language model heads. By employing these components via a mutual information maximization strategy, our approach helps maintain prior knowledge from the pre-trained backbone and strategically aligns the primary classification head, thereby enhancing model performance. Furthermore, we explore the potential of Large Language Models (LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges. Our comprehensive experimental results underscore the efficacy of the proposed method and offer valuable insights for future work.

[AI-51] Vision Language Models Know Law of Conservation without Understanding More-or-Less

链接: https://arxiv.org/abs/2410.00332
作者: Dezhi Luo,Haiyun Lyu,Qingying Gao,Haoran Sun,Yijiang Li,Hokin Deng
关键词-EN: Vision Language Models, cognitive development considered, mental operations, reversibility of mental, Language Models
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Conservation is a critical milestone of cognitive development considered to be supported by both the understanding of quantitative concepts and the reversibility of mental operations. To assess whether this critical component of human intelligence has emerged in Vision Language Models, we leverage the ConserveBench from CogDevelop2K, a data-intensive cognitive experiment benchmark for assaying the developmental trajectory of machine intelligence. The battery includes over 350 questions across four dimensions of physical quantities: volume, solid quantity, length, and number. The former two involve only transformational tasks, whereas the latter two also involve non-transformational tasks assessing the understanding of quantitative concepts alone. Surprisingly, we find that while VLMs are generally capable of conserving, they tend to fail at non-transformational tasks whose success is typically considered to be entailed by the ability to conserve. This implies that the law of conservation, at least in concrete domains, may exist without a corresponding conceptual understanding of quantity.

[AI-52] EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics

链接: https://arxiv.org/abs/2410.00327
作者: Chenqing Hua,Yong Liu,Dinghuai Zhang,Odin Zhang,Sitao Luan,Kevin K. Yang,Guy Wolf,Doina Precup,Shuangjia Zheng
关键词-EN: area in biotechnology, critical area, applications ranging, ranging from drug, drug development
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Enzyme design is a critical area in biotechnology, with applications ranging from drug development to synthetic biology. Traditional methods for enzyme function prediction or protein binding pocket design often fall short in capturing the dynamic and complex nature of enzyme-substrate interactions, particularly in catalytic processes. To address the challenges, we introduce EnzymeFlow, a generative model that employs flow matching with hierarchical pre-training and enzyme-reaction co-evolution to generate catalytic pockets for specific substrates and catalytic reactions. Additionally, we introduce a large-scale, curated, and validated dataset of enzyme-reaction pairs, specifically designed for the catalytic pocket generation task, comprising a total of 328,192 pairs. By incorporating evolutionary dynamics and reaction-specific adaptations, EnzymeFlow becomes a powerful model for designing enzyme pockets, which is capable of catalyzing a wide range of biochemical reactions. Experiments on the new dataset demonstrate the model’s effectiveness in designing high-quality, functional enzyme catalytic pockets, paving the way for advancements in enzyme engineering and synthetic biology. We provide EnzymeFlow code at this https URL with notebook demonstration at this https URL.

[AI-53] Vision Language Models See What You Want but not What You See

链接: https://arxiv.org/abs/2410.00324
作者: Qingying Gao,Yijiang Li,Haiyun Lyu,Haoran Sun,Dezhi Luo,Hokin Deng
关键词-EN: Knowing others’ intentions, taking others’ perspectives, Knowing others’, intentions and taking, core components
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowing others’ intentions and taking others’ perspectives are two core components of human intelligence that are typically considered to be instantiations of theory-of-mind. Infiltrating machines with these abilities is an important step towards building human-level artificial intelligence. Recently, Li et al. built CogDevelop2K, a data-intensive cognitive experiment benchmark to assess the developmental trajectory of machine intelligence. Here, to investigate intentionality understanding and perspective-taking in Vision Language Models, we leverage the IntentBench and PerspectBench of CogDevelop2K, which contain over 300 cognitive experiments grounded in real-world scenarios and classic cognitive tasks, respectively. Surprisingly, we find VLMs achieving high performance on intentionality understanding but lower performance on perspective-taking. This challenges the common belief in the cognitive science literature that perspective-taking at the corresponding modality is necessary for intentionality understanding.

[AI-54] Probing Mechanical Reasoning in Large Vision Language Models

链接: https://arxiv.org/abs/2410.00318
作者: Haoran Sun,Qingying Gao,Haiyun Lyu,Dezhi Luo,Hokin Deng,Yijiang Li
关键词-EN: Vision Language Models, Mechanical reasoning, Language Models, Large Vision Language, Vision Language
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Mechanical reasoning is a fundamental ability that sets human intelligence apart from other animal intelligence. Mechanical reasoning allows us to design tools, build bridges and canals, and construct houses which set the foundation of human civilization. Embedding machines with such ability is an important step towards building human-level artificial intelligence. Recently, Li et al. built CogDevelop2K, a data-intensive cognitive experiment benchmark for assaying the developmental trajectory of machine intelligence (Li et al., 2024). Here, to investigate mechanical reasoning in Vision Language Models, we leverage the MechBench of CogDevelop2K, which contains approximately 150 cognitive experiments, to test understanding of mechanical system stability, gears and pulley systems, seesaw-like systems and leverage principle, inertia and motion, and other fluid-related systems in Large Vision Language Models. We observe diverse yet consistent behaviors over these aspects in VLMs.

[AI-55] EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control EMNLP2024

链接: https://arxiv.org/abs/2410.00316
作者: Haozhe Chen,Run Chen,Julia Hirschberg
关键词-EN: technology produce natural, emotion control, technology produce, emotion control framework, produce natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: EMNLP 2024 Main

点击查看摘要

Abstract:While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
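
The abstract builds on an expressive speaker-representation space and few-shot emotional samples to add a controllable amount of emotion. One plausible mechanism, sketched below purely as an assumption, is to estimate an emotion direction from paired emotional/neutral reference embeddings and add it to the target speaker embedding scaled by a user-chosen knob.

```python
import numpy as np

def apply_emotion_knob(speaker_emb, emotional_refs, neutral_refs, strength=0.5):
    """Few-shot emotion direction in speaker-embedding space (illustrative only).
    `emotional_refs` and `neutral_refs` are arrays of reference embeddings from
    the same speakers with and without the target emotion."""
    direction = np.mean(np.asarray(emotional_refs) - np.asarray(neutral_refs), axis=0)
    direction /= np.linalg.norm(direction) + 1e-8
    return np.asarray(speaker_emb) + strength * direction
```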

[AI-56] Ask Pose Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

链接: https://arxiv.org/abs/2410.00309
作者: Laura Bravo-Sánchez,Jaewoo Heo,Zhenzhen Weng,Kuan-Chieh Wang,Serena Yeung-Levy
关键词-EN: Vision Language Models, Large Vision Language, Social dynamics, utilizes Large Vision, pose significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's ability to handle complex interaction scenarios.

[AI-57] On Large Uni- and Multi-modal Models for Unsupervised Classification of Social Media Images: Natures Contribution to People as case study

链接: https://arxiv.org/abs/2410.00275
作者: Rohaifa Khaldi,Domingo Alcaraz-Segura,Ignacio Sánchez-Herrera,Javier Martinez-Lopez,Carlos Javier Navarro,Siham Tabik
关键词-EN: Social media images, Large Visual Language, Large Visual Models, Large Language Models, Visual Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Social media images have been shown to be a valuable source of information for understanding human interactions with important subjects such as cultural heritage, biodiversity and nature, among others. The task of grouping such images into a number of semantically meaningful clusters without labels is challenging given the high diversity and complex nature of the visual content of these images, in addition to their large volume. On the other hand, the latest advances in Large Visual Models (LVM), Large Language Models (LLM) and Large Visual Language Models (LVLM) provide an important opportunity to explore new productive and scalable solutions. This work proposes, analyzes, and compares various approaches based on one or more state-of-the-art LVMs, LLMs and LVLMs for mapping social media images into a number of pre-defined classes. As a case study, we consider the problem of understanding the interactions between humans and nature, also known as Nature's Contribution to People or Cultural Ecosystem Services (CES). Our experiments reveal that the top-performing approaches, delivering highly competitive results, are the LVM DINOv2 fine-tuned on a small labeled dataset and LVLM models like the proprietary GPT-4 (gpt-4o-mini) using a simple prompt.

[AI-58] Social Conjuring: Multi-User Runtime Collaboration with AI in Building Virtual 3D Worlds

链接: https://arxiv.org/abs/2410.00274
作者: Cyan DeVeaux,Amina Kobenova,Samyak Parajuli,Andrzej Banburski-Fahey,Judith Amores Fernandez,Jaron Lanier
关键词-EN: Generative artificial intelligence, Generative artificial, artificial intelligence, intelligence has shown, shown promise
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
*备注: 27 pages + Appendix, 16 figures

点击查看摘要

Abstract:Generative artificial intelligence has shown promise in prompting virtual worlds into existence, yet little attention has been given to understanding how this process unfolds as social interaction. We present Social Conjurer, a framework for AI-augmented dynamic 3D scene co-creation, where multiple users collaboratively build and modify virtual worlds in real-time. Through an expanded set of interactions, including social and tool-based engagements as well as spatial reasoning, our framework facilitates the creation of rich, diverse virtual environments. Findings from a preliminary user study (N=12) provide insight into the user experience of this approach, how social contexts shape the prompting of spatial environments, and perspective on social applications of prompt-based 3D co-creation. In addition to highlighting the potential of AI-supported multi-user world creation and offering new pathways for AI-augmented creative processes in VR, this article presents a set of implications for designing human-centered interfaces that incorporate AI models into 3D content generation.

[AI-59] KPCA-CAM: Visual Explainability of Deep Computer Vision Models using Kernel PCA

链接: https://arxiv.org/abs/2410.00267
作者: Sachin Karmani,Thanushon Sivakaran,Gaurav Prasad,Mehmet Ali,Wenbo Yang,Sheyang Tang
关键词-EN: Deep learning models, Deep learning, black boxes, Convolutional Neural Networks, activation maps
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, Published to IEEE MMSP 2024

点击查看摘要

Abstract:Deep learning models often function as black boxes, providing no straightforward reasoning for their predictions. This is particularly true for computer vision models, which process tensors of pixel values to generate outcomes in tasks such as image classification and object detection. To elucidate the reasoning of these models, class activation maps (CAMs) are used to highlight salient regions that influence a model’s output. This research introduces KPCA-CAM, a technique designed to enhance the interpretability of Convolutional Neural Networks (CNNs) through improved class activation maps. KPCA-CAM leverages Principal Component Analysis (PCA) with the kernel trick to capture nonlinear relationships within CNN activations more effectively. By mapping data into higher-dimensional spaces with kernel functions and extracting principal components from this transformed hyperplane, KPCA-CAM provides more accurate representations of the underlying data manifold. This enables a deeper understanding of the features influencing CNN decisions. Empirical evaluations on the ILSVRC dataset across different CNN models demonstrate that KPCA-CAM produces more precise activation maps, providing clearer insights into the model’s reasoning compared to existing CAM algorithms. This research advances CAM techniques, equipping researchers and practitioners with a powerful tool to gain deeper insights into CNN decision-making processes and overall behaviors.
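
KPCA-CAM replaces the linear PCA used by Eigen-CAM-style methods with kernel PCA over a convolutional layer's activations. The sketch below treats each spatial location as a sample, extracts the first kernel principal component, and reshapes the scores into a heatmap; this mirrors the Eigen-CAM construction with kernel PCA swapped in and is an assumption about the paper's exact formulation.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_cam(activations, kernel="rbf"):
    """activations: (C, H, W) feature maps from the chosen conv layer.
    Returns an (H, W) saliency map from the first kernel principal component."""
    C, H, W = activations.shape
    X = activations.reshape(C, H * W).T           # one sample per spatial location
    scores = KernelPCA(n_components=1, kernel=kernel).fit_transform(X)[:, 0]
    cam = np.abs(scores.reshape(H, W))            # kernel PCA signs are arbitrary; use magnitude
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1] for overlay
```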

[AI-60] Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.00263
作者: Kun Yuan,Vinkle Srivastav,Nassir Navab,Nicolas Padoy
关键词-EN: faces unique challenges, unique challenges due, knowledge domain gap, Surgical video-language pretraining, faces unique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Main Track

点击查看摘要

Abstract:Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by addressing issues regarding textual information loss in surgical lecture videos and the spatial-temporal challenges of surgical VLP. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining (PeskaVLP) framework to tackle these issues. The knowledge augmentation uses large language models (LLM) for refining and enriching surgical concepts, thus providing comprehensive language supervision and reducing the risk of overfitting. PeskaVLP combines language supervision with visual self-supervision, constructing hard negative samples and employing a Dynamic Time Warping (DTW) based loss function to effectively comprehend the cross-modal procedural alignment. Extensive experiments on multiple public surgical scene understanding and cross-modal retrieval datasets show that our proposed method significantly improves zero-shot transferring performance and offers a generalist visual representation for further advancements in surgical scene understanding.

[AI-61] DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining

链接: https://arxiv.org/abs/2410.00260
作者: Vinayak Arannil,Sourav Sanjukta Bhabesh,Neha Narwal,Sai Nikhil Thirandas,Darren Yow-Bang Wang,Graham Horwood,Alex Anto Chirayath,Gouri Pandeshwar
关键词-EN: Large Language Models, shown remarkable ability, Language Models, data, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs to generate domain-specific synthetic data, but these data most often lack truthfulness and complexity. Alternatively, in domains where data is available, such as healthcare and finance, most of the LMs are proprietary, necessitating a scalable method to curate real-world, industry-specific pre-training data. In this work, we propose an automated and scalable framework, DoPAMine (Domain-specific Pre-training Adaptation from seed-guided data Mining), to mine domain-specific training data from a large data corpus for domain adaptation of an LM. The framework leverages the parametric knowledge of an LLM to generate diverse and representative seed data tailored to a specific domain, which is then used to mine real-world data from a large data corpus like Common Crawl. We evaluated our framework's performance in the continual pre-training (CPT) setting by training two domain-specific 7B-parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from the MMLU, MedQA, MedMCQA and PubMedQA datasets, and by 2.9% and 6.7% in zero-shot and 5-shot settings respectively on finance tasks from the FiQA-SA, FPB and Headlines datasets, when compared to the baseline.

[AI-62] Possible principles for aligned structure learning agents

链接: https://arxiv.org/abs/2410.00258
作者: Lancelot Da Costa,Tomáš Gavenčiak,David Hyland,Mandana Samiei,Cristian Dragos-Manta,Candice Pattisapu,Adeel Razi,Karl Friston
关键词-EN: structure learning, paper offers, offers a roadmap, descriptions of natural, scalable aligned artificial
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: 24 pages of content, 31 with references

点击查看摘要

Abstract:This paper offers a roadmap for the development of scalable aligned artificial intelligence (AI) from first principle descriptions of natural intelligence. In brief, a possible path toward scalable aligned AI rests upon enabling artificial agents to learn a good model of the world that includes a good model of our preferences. For this, the main objective is creating agents that learn to represent the world and other agents’ world models; a problem that falls under structure learning (a.k.a. causal representation learning). We expose the structure learning and alignment problems with this goal in mind, as well as principles to guide us forward, synthesizing various ideas across mathematics, statistics, and cognitive science. 1) We discuss the essential role of core knowledge, information geometry and model reduction in structure learning, and suggest core structural modules to learn a wide range of naturalistic worlds. 2) We outline a way toward aligned agents through structure learning and theory of mind. As an illustrative example, we mathematically sketch Asimov’s Laws of Robotics, which prescribe agents to act cautiously to minimize the ill-being of other agents. We supplement this example by proposing refined approaches to alignment. These observations may guide the development of artificial intelligence in helping to scale existing – or design new – aligned structure learning systems.

[AI-63] Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

链接: https://arxiv.org/abs/2410.00255
作者: Weitai Kang,Haifeng Huang,Yuzhang Shang,Mubarak Shah,Yan Yan
关键词-EN: Large Language Models, Large Language, building general-purpose agents, challenges remain due, high-quality robust instruction-following
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model’s discriminative understanding; and 2) Diverse Instruction-following data, which contains various instruction styles to enhance the model’s generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).

[AI-64] Demonstrating the Continual Learning Capabilities and Practical Application of Discrete-Time Active Inference

链接: https://arxiv.org/abs/2410.00240
作者: Rithvik Prakki
关键词-EN: enabling continual adaptation, Active inference, biological or artificial, adaptation and decision-making, combines Bayesian inference
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Active inference is a mathematical framework for understanding how agents (biological or artificial) interact with their environments, enabling continual adaptation and decision-making. It combines Bayesian inference and free energy minimization to model perception, action, and learning in uncertain and dynamic contexts. Unlike reinforcement learning, active inference integrates exploration and exploitation seamlessly by minimizing expected free energy. In this paper, we present a continual learning framework for agents operating in discrete time environments, using active inference as the foundation. We derive the mathematical formulations of variational and expected free energy and apply them to the design of a self-learning research agent. This agent updates its beliefs and adapts its actions based on new data without manual intervention. Through experiments in changing environments, we demonstrate the agent’s ability to relearn and refine its models efficiently, making it suitable for complex domains like finance and healthcare. The paper concludes by discussing how the proposed framework generalizes to other systems, positioning active inference as a flexible approach for adaptive AI.

[AI-65] Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models

链接: https://arxiv.org/abs/2410.00231
作者: Qi Wu,Zipeng Fu,Xuxin Cheng,Xiaolong Wang,Chelsea Finn
关键词-EN: achieved strong performance, Learning-based methods, methods have achieved, achieved strong, strong performance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following a user’s command to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate. Project website: this https URL

[AI-66] Zero-Shot Classification of Crisis Tweets Using Instruction-Finetuned Large Language Models

链接: https://arxiv.org/abs/2410.00182
作者: Emma McDaniel,Samuel Scheele,Jeff Liu
关键词-EN: pre-LLM NLP techniques, Social media posts, pre-LLM NLP, NLP techniques, short social media
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Social media posts are frequently identified as a valuable source of open-source intelligence for disaster response, and pre-LLM NLP techniques have been evaluated on datasets of crisis tweets. We assess the capabilities of three commercial large language models (OpenAI GPT-4o, Gemini 1.5-flash-001 and Anthropic Claude-3-5 Sonnet) in zero-shot classification of short social media posts. In one prompt, the models are asked to perform two classification tasks: 1) identify if the post is informative in a humanitarian context; and 2) rank and provide probabilities for the post in relation to 16 possible humanitarian classes. The posts being classified are from the consolidated crisis tweet dataset, CrisisBench. Results are evaluated using macro, weighted, and binary F1-scores. The informative classification task generally performed better without extra information, while for the humanitarian label classification, providing the event during which the tweet was mined resulted in better performance. Further, we found that the models have significantly varying performance by dataset, which raises questions about dataset quality.

[AI-67] Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation

链接: https://arxiv.org/abs/2410.00163
作者: Pedro Henrique Paiola,Gabriel Lino Garcia,João Renato Ribeiro Manesco,Mateus Roder,Douglas Rodrigues,João Paulo Papa
关键词-EN: relevant virtual assistant, agents in Portuguese, large language models, aiming to develop, healthcare professionals
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese, aiming to develop a reliable and relevant virtual assistant for healthcare professionals. The HealthCareMagic-100k-en and MedQuAD datasets, translated from English using GPT-3.5, were used to fine-tune the ChatBode-7B model using the PEFT-QLoRA method. The InternLM2 model, with initial training on medical data, presented the best overall performance, with high precision and adequacy in metrics such as accuracy, completeness and safety. However, the DrBode models, derived from ChatBode, exhibited catastrophic forgetting of acquired medical knowledge. Despite this, these models performed comparably or even better in aspects such as grammaticality and coherence. A significant challenge was low inter-rater agreement, highlighting the need for more robust assessment protocols. This work paves the way for future research, such as evaluating multilingual models specific to the medical field, improving the quality of training data, and developing more consistent evaluation methodologies for the medical field.

[AI-68] Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

链接: https://arxiv.org/abs/2410.00153
作者: Haiyan Zhao,Heng Zhao,Bo Shen,Ali Payani,Fan Yang,Mengnan Du
关键词-EN: large language models, Probing learned concepts, encoded internally, crucial for understanding, understanding how semantic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 28 pages, 9 figures

点击查看摘要

Abstract:Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principled approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into a Gaussian Concept Subspace (GCS). We demonstrate GCS’s effectiveness by measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance with maintaining fluency in natural language generation tasks.
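To make the shift from a single concept vector to a distribution over directions concrete, here is a small sketch that fits a diagonal Gaussian over bootstrap logistic-probe weights on synthetic data and samples one concept vector from it. The synthetic hidden states, the bootstrap protocol, and the diagonal covariance are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: model a concept as a Gaussian over linear-probe directions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 16, 400
true_dir = rng.normal(size=d)          # hidden direction carrying the concept
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
X[y == 1] += 0.8 * true_dir            # concept shifts the class-1 mean

probe_dirs = []
for _ in range(20):                    # bootstrap probes on resampled data
    idx = rng.choice(n, size=n, replace=True)
    w = LogisticRegression(max_iter=1000).fit(X[idx], y[idx]).coef_.ravel()
    probe_dirs.append(w / np.linalg.norm(w))

probe_dirs = np.stack(probe_dirs)
mu, sigma = probe_dirs.mean(0), probe_dirs.std(0)   # Gaussian concept subspace
sampled_vec = rng.normal(mu, sigma)                 # one sampled concept vector
cos = np.dot(sampled_vec, true_dir) / (np.linalg.norm(sampled_vec) * np.linalg.norm(true_dir))
print(round(float(cos), 3))            # sampled vector aligns with the concept
```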

[AI-69] Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.00131
作者: Ji Liu,Jiaxiang Ren,Ruoming Jin,Zijie Zhang,Yang Zhou,Patrick Valduriez,Dejing Dou
关键词-EN: Large Language Models, fine-tune Large Language, collaboratively train models, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 27 pages, 8 figures, 14 tables, to appear in EMNLP 2024

点击查看摘要

Abstract:As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). Since LLMs are huge in size and the scale of the training data increases significantly, the fine-tuning process incurs tremendous amounts of computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a Fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches.
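A rough sketch of the Fisher-information-guided sampling step follows, under strong simplifications that are not from the paper: a logistic model stands in for the LLM, and the per-sample squared gradient norm serves as the Fisher score that sets on-device sampling probabilities.

```python
# Minimal sketch of Fisher-score-weighted data sampling on one device.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)                                   # current local model
p = 1.0 / (1.0 + np.exp(-(X @ w)))                # predicted probabilities
per_sample_grads = (p - y)[:, None] * X           # logistic-loss gradient per sample
fisher_scores = (per_sample_grads ** 2).sum(axis=1)

probs = fisher_scores / fisher_scores.sum()       # higher score => sampled more often
batch_idx = rng.choice(n, size=32, replace=False, p=probs)
print(batch_idx[:10])                             # indices chosen for this round
```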

[AI-70] Cartesian Genetic Programming Approach for Designing Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.00129
作者: Krzywda Maciej,Łukasik Szymon,Gandomi H. Amir
关键词-EN: Convolutional Neural Networks, present study covers, Cartesian genetic programming, optimization of Convolutional, Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The present study covers an approach to neural architecture search (NAS) using Cartesian genetic programming (CGP) for the design and optimization of Convolutional Neural Networks (CNNs). In designing artificial neural networks, one crucial aspect of the innovative approach is suggesting a novel neural architecture. Currently used architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. In this work, we use a pure genetic programming approach to design CNNs, which employs only one genetic operation, i.e., mutation. In the course of preliminary experiments, our methodology yields promising results.
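The mutation-only search can be sketched as a tiny (1 + 4) evolution strategy over a layer-choice genome. Note the assumptions: the flat genome and the stand-in fitness function are illustrative only, whereas the paper evolves a Cartesian graph encoding and scores candidates by actually training the decoded CNN.

```python
# Minimal sketch of mutation-only architecture evolution (illustrative fitness).
import random

LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool", "identity"]

def random_genome(length=6):
    return [random.choice(LAYER_CHOICES) for _ in range(length)]

def mutate(genome, rate=0.2):
    return [random.choice(LAYER_CHOICES) if random.random() < rate else g
            for g in genome]

def fitness(genome):
    # stand-in objective; in practice this would be validation accuracy
    # of the decoded CNN after a short training run
    return genome.count("conv3x3") - 0.5 * genome.count("identity")

parent = random_genome()
for _ in range(20):                       # (1 + 4) evolution strategy, mutation only
    children = [mutate(parent) for _ in range(4)]
    parent = max(children + [parent], key=fitness)

print(parent, fitness(parent))
```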

[AI-71] ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

链接: https://arxiv.org/abs/2410.00086
作者: Zhen Han,Zeyinzi Jiang,Yulin Pan,Jingfeng Zhang,Chaojie Mao,Chenwei Xie,Yu Liu,Jingren Zhou
关键词-EN: powerful generative technology, visual generation tasks, visual generation, foundational diffusion models, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as a powerful generative technology and have been found to be applicable in various scenarios. Most existing foundational diffusion models are primarily designed for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model in the field of visual generation, like GPT-4 in the natural language processing field. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the issue of the absence of available training data. It involves acquiring pairwise images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated paired data across a variety of visual generation tasks. The extensive experimental results demonstrate the superiority of our model in visual generation fields. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model to serve as the backend, avoiding the cumbersome pipeline typically employed in visual agents. Code and models will be available on the project page: this https URL.

[AI-72] A Survey on Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2410.00083
作者: Giannis Daras,Hyungjin Chung,Chieh-Hsin Lai,Yuki Mitsufuji,Jong Chul Ye,Peyman Milanfar,Alexandros G. Dimakis,Mauricio Delbracio
关键词-EN: generate high-quality samples, generative modeling due, Diffusion models, high-quality samples, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Work in progress. 38 pages

点击查看摘要

Abstract:Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples. This has unlocked exciting new possibilities for solving inverse problems, especially in image restoration and reconstruction, by treating diffusion models as unsupervised priors. This survey provides a comprehensive overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training. We introduce taxonomies to categorize these methods based on both the problems they address and the techniques they employ. We analyze the connections between different approaches, offering insights into their practical implementation and highlighting important considerations. We further discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems. This work aims to be a valuable resource for those interested in learning about the intersection of diffusion models and inverse problems.

[AI-73] Graph Residual Noise Learner Network for Brain Connectivity Graph Prediction

链接: https://arxiv.org/abs/2410.00082
作者: Oytun Demirbilek,Tingying Peng,Alaa Bessadok
关键词-EN: brain dysconnectivity patterns, charting brain dysconnectivity, dysconnectivity patterns, depicting a connectional, connectional fingerprint
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 6th Workshop on GRaphs in biomedicAl Image anaLysis

点击查看摘要

Abstract:A morphological brain graph depicting a connectional fingerprint is of paramount importance for charting brain dysconnectivity patterns. Such data often has missing observations due to various reasons such as time-consuming and incomplete neuroimage processing pipelines. Thus, predicting a target brain graph from a source graph is crucial for better diagnosing neurological disorders with minimal data acquisition resources. Many brain graph generative models have been proposed with promising results, yet they are mostly based on generative adversarial networks (GANs), which can suffer from mode collapse and require large training datasets. Recent developments in diffusion models address these problems by offering essential properties such as a stable training objective and easy scalability. However, applying a diffusion process to graph edges fails to maintain the topological symmetry of the brain connectivity matrices. To meet these challenges, we propose the Graph Residual Noise Learner Network (Grenol-Net), the first graph diffusion model for predicting a target graph from a source graph.

[AI-74] From homeostasis to resource sharing: Biologically and economically compatible multi-objective multi-agent AI safety benchmarks

链接: https://arxiv.org/abs/2410.00081
作者: Roland Pihlakas,Joel Pyykkö
关键词-EN: Developing safe agentic, automated empirical testing, Developing safe, safe agentic, agentic AI systems
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 18 pages, 14 figures, 1 tables

点击查看摘要

Abstract:Developing safe agentic AI systems benefits from automated empirical testing that conforms with human values, a subfield that is largely underdeveloped at the moment. To contribute towards this topic, the present work focuses on introducing biologically and economically motivated themes that have been neglected in the safety aspects of modern reinforcement learning literature, namely homeostasis, balancing multiple objectives, bounded objectives, diminishing returns, sustainability, and multi-agent resource sharing. We implemented eight main benchmark environments on the above themes to illustrate the potential shortcomings of current mainstream discussions on AI safety.
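As a small illustration of how the homeostasis and diminishing-returns themes can be turned into reward shapes, the sketch below uses a quadratic setpoint penalty and a square-root utility. These particular functional forms are our own illustrative assumptions, not the benchmarks' exact definitions.

```python
# Minimal sketch of homeostatic and diminishing-returns objectives.
import math

def homeostatic_reward(level, setpoint=50.0, scale=10.0):
    # maximal at the setpoint; over- and under-shooting are both penalized
    return -((level - setpoint) / scale) ** 2

def diminishing_returns(amount):
    # each extra unit of resource is worth less than the previous one
    return math.sqrt(max(amount, 0.0))

for level in (20, 50, 80):
    print("level", level, "reward", round(homeostatic_reward(level), 2))
for amount in (1, 4, 16):
    print("amount", amount, "utility", round(diminishing_returns(amount), 2))
```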

[AI-75] Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

链接: https://arxiv.org/abs/2410.00079
作者: Wenyue Hua,Mengting Wan,Shashank Vadrevu,Ryan Nadel,Yongfeng Zhang,Chi Wang
关键词-EN: producing action plans, human task delegation, Interactive Speculative Planning, task delegation, action plans
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 27 pages, 22 figures

点击查看摘要

Abstract:Agents, as user-centric tools, are increasingly deployed for human task delegation, assisting with a broad spectrum of requests by generating thoughts, engaging with user proxies, and producing action plans. However, agents based on large language models (LLMs) often face substantial planning latency due to two primary factors: the efficiency limitations of the underlying LLMs due to their large size and high demand, and the structural complexity of the agents due to the extensive generation of intermediate thoughts to produce the final output. Given that inefficiency in service provision can undermine the value of automation for users, this paper presents a human-centered efficient agent planning method – Interactive Speculative Planning – aiming at enhancing the efficiency of agent planning through both system design and human-AI interaction. Our approach advocates for the co-design of the agent system and user interface, underscoring the importance of an agent system that can fluidly manage user interactions and interruptions. By integrating human interruptions as a fundamental component of the system, we not only make it more user-centric but also expedite the entire process by leveraging human-in-the-loop interactions to provide accurate intermediate steps. Code and data will be released.

[AI-76] M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning ICRA2025

链接: https://arxiv.org/abs/2410.00064
作者: Kaushik Roy,Akila Dissanayake,Brendan Tidd,Peyman Moghadam
关键词-EN: poses significant challenges, significant challenges due, tasks poses significant, Lifelong imitation learning, manipulation tasks poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to ICRA2025

点击查看摘要

Abstract:Lifelong imitation learning for manipulation tasks poses significant challenges due to distribution shifts that occur in incremental learning steps. Existing methods often focus on unsupervised skill discovery to construct an ever-growing skill library or distillation from multiple policies, which can lead to scalability issues as diverse manipulation tasks are continually introduced and may fail to ensure a consistent latent space throughout the learning process, leading to catastrophic forgetting of previously learned skills. In this paper, we introduce M2Distill, a multi-modal distillation-based method for lifelong imitation learning focusing on preserving consistent latent space across vision, language, and action distributions throughout the learning process. By regulating the shifts in latent representations across different modalities from previous to current steps, and reducing discrepancies in Gaussian Mixture Model (GMM) policies between consecutive learning steps, we ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills. Extensive evaluations on the LIBERO lifelong imitation learning benchmark suites, including LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method consistently outperforms prior state-of-the-art methods across all evaluated metrics.

[AI-77] Neural Decompiling of Tracr Transformers

链接: https://arxiv.org/abs/2410.00061
作者: Hannes Thurnherr,Kaspar Riesen
关键词-EN: enabled substantial progress, Tracr compiled transformer, machine learning, architecture has enabled, enabled substantial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, the transformer architecture has enabled substantial progress in many areas of pattern recognition and machine learning. However, as with other neural network models, there is currently no general method available to explain their inner workings. The present paper represents a first step in this direction. We utilize the Transformer Compiler for RASP (Tracr) to generate a large dataset of pairs of transformer weights and corresponding RASP programs. Based on this dataset, we then build and train a model, with the aim of recovering the RASP code from the compiled model. We demonstrate that the simple form of Tracr compiled transformer weights is interpretable for such a decompiler model. In an empirical evaluation, our model achieves exact reproductions on more than 30% of the test objects, while the remaining 70% can generally be reproduced with only a few errors. Additionally, more than 70% of the programs produced by our model are functionally equivalent to the ground truth, and are therefore a valid decompilation of the Tracr compiled transformer weights.

[AI-78] IDEA: An Inverse Domain Expert Adaptation Based Active DNN IP Protection Method

链接: https://arxiv.org/abs/2410.00059
作者: Chaohui Xu,Qi Cui,Jinxin Dong,Weiyang He,Chip-Hong Chang
关键词-EN: Deep Neural Network, Neural Network, Deep Neural, Illegitimate reproduction, derivation of Deep
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Illegitimate reproduction, distribution and derivation of Deep Neural Network (DNN) models can inflict economic loss, reputation damage and even privacy infringement. Passive DNN intellectual property (IP) protection methods such as watermarking and fingerprinting attempt to prove the ownership upon IP violation, but they are often too late to stop catastrophic damage of IP abuse and too feeble against strong adversaries. In this paper, we propose IDEA, an Inverse Domain Expert Adaptation based proactive DNN IP protection method featuring active authorization and source traceability. IDEA generalizes active authorization as an inverse problem of domain adaptation. The multi-adaptive optimization is solved by a mixture-of-experts model with one real and two fake experts. The real expert re-optimizes the source model to correctly classify test images with a unique model user key steganographically embedded. The fake experts are trained to output random predictions on test images without or with an incorrect user key embedded, by minimizing their mutual information (MI) with the real expert. The MoE model is knowledge-distilled into a unified protected model to avoid leaking the expert model features, by maximizing their MI with additional multi-layer attention and contrastive representation loss optimization. IDEA not only prevents unauthorized users without the valid key from accessing the functional model, but also enables the model owner to validate the deployed model and trace the source of IP infringement. We extensively evaluate IDEA on five datasets and four DNN models to demonstrate its effectiveness in authorization control, culprit tracing success rate, and robustness against various attacks.

[AI-79] Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization NEURIPS2024

链接: https://arxiv.org/abs/2410.00051
作者: Haoran Li,Zhennan Jiang,Yuhui Chen,Dongbin Zhao
关键词-EN: faces significant challenges, visual reinforcement learning, reinforcement learning, faces significant, exploitation and exploration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024)

点击查看摘要

Abstract:With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and poor training stability. Although consistency models, as time-efficient diffusion models, have been validated in online state-based RL, it is still an open question whether they can be extended to visual RL. In this paper, we investigate the impact of non-stationary distributions and the actor-critic framework on consistency policy in online RL, and find that the consistency policy was unstable during training, especially in visual RL with high-dimensional state spaces. To this end, we suggest sample-based entropy regularization to stabilize the policy training, and propose a consistency policy with prioritized proximal experience regularization (CP3ER) to improve sample efficiency. CP3ER achieves new state-of-the-art (SOTA) performance in 21 tasks across the DeepMind control suite and Meta-world. To our knowledge, CP3ER is the first method to apply diffusion/consistency models to visual RL and demonstrates the potential of consistency models in visual RL. More visualization results are available at this https URL.

[AI-80] Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph

链接: https://arxiv.org/abs/2410.00049
作者: Guancheng Wan,Zewen Liu,Max S.Y. Lau,B. Aditya Prakash,Wei Jin
关键词-EN: medical resource allocation, public health strategies, efficient medical resource, Effective epidemic forecasting, rapidly spreading infectious
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep-learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH's superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code will be available at this https URL.

[AI-81] Artificial intelligence-based blockchain-driven financial default prediction

链接: https://arxiv.org/abs/2410.00044
作者: Junjun Huang
关键词-EN: artificial intelligence technology, walks of life, artificial intelligence, rapid development, financial
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid development of technology, blockchain and artificial intelligence technology are playing a huge role in all walks of life. In the financial sector, blockchain solves many security problems in data storage and management in traditional systems with its advantages of decentralization and security. Artificial intelligence has huge advantages in financial forecasting and risk management through its powerful algorithmic modeling capabilities. Financial default prediction is therefore a very powerful application of blockchain and artificial intelligence technology. Blockchain technology guarantees the credibility of data and its consistency across all nodes, while machine learning builds a high-level default prediction model through detailed analysis of big data. This study offers financial institutions new ideas on financial technology in terms of credit risk mitigation and financial system stabilization.

[AI-82] InsightPulse: An IoT-based System for User Experience Interview Analysis

链接: https://arxiv.org/abs/2410.00036
作者: Dian Lyu,Yuetong Lu,Jassie He,Murad Mehrab Abrar,Ruijun Xie,John Raiti
关键词-EN: effective user experience, Conducting efficient, poses challenges, efficient and effective, maintaining focus
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication at the 10th IEEE International Conference on Collaboration and Internet Computing (IEEE CIC 2024), Washington D.C., USA

点击查看摘要

Abstract:Conducting efficient and effective user experience (UX) interviews often poses challenges, such as maintaining focus on key topics and managing the duration of interviews and post-interview analyses. To address these issues, this paper introduces InsightPulse, an Internet of Things (IoT)-based hardware and software system designed to streamline and enhance the UX interview process through speech analysis and Artificial Intelligence. InsightPulse provides real-time support during user interviews by automatically identifying and highlighting key discussion points, proactively suggesting follow-up questions, and generating thematic summaries. These features enable more insightful discoveries and help to manage interview duration effectively. Additionally, the system features a robust backend analytics dashboard that simplifies the post-interview review process, thus facilitating the quick extraction of actionable insights and enhancing overall UX research efficiency.

[AI-83] The Phenomenology of Machine: A Comprehensive Analysis of the Sentience of the OpenAI-o1 Model Integrating Functionalism Consciousness Theories Active Inference and AI Architectures

链接: https://arxiv.org/abs/2410.00033
作者: Victoria Violet Hoyle
关键词-EN: Integrated Information Theory, displays characteristics, explores the hypothesis, transformer-based AI trained, trained with reinforcement
类目: Artificial Intelligence (cs.AI)
*备注: 17 pages

点击查看摘要

Abstract:This paper explores the hypothesis that the OpenAI-o1 model–a transformer-based AI trained with reinforcement learning from human feedback (RLHF)–displays characteristics of consciousness during its training and inference phases. Adopting functionalism, which argues that mental states are defined by their functional roles, we assess the possibility of AI consciousness. Drawing on theories from neuroscience, philosophy of mind, and AI research, we justify the use of functionalism and examine the model’s architecture using frameworks like Integrated Information Theory (IIT) and active inference. The paper also investigates how RLHF influences the model’s internal reasoning processes, potentially giving rise to consciousness-like experiences. We compare AI and human consciousness, addressing counterarguments such as the absence of a biological basis and subjective qualia. Our findings suggest that the OpenAI-o1 model shows aspects of consciousness, while acknowledging the ongoing debates surrounding AI sentience.

[AI-84] Strategic Collusion of LLM Agents: Market Division in Multi-Commodity Competitions

链接: https://arxiv.org/abs/2410.00031
作者: Ryan Y. Lin,Siddhartha Ojha,Kevin Cai,Maxwell F. Chen
关键词-EN: Machine-learning technologies, real-world market scenarios, increased deployment, deployment in real-world, Cournot competition frameworks
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:Machine-learning technologies are seeing increased deployment in real-world market scenarios. In this work, we explore the strategic behaviors of large language models (LLMs) when deployed as autonomous agents in multi-commodity markets, specifically within Cournot competition frameworks. We examine whether LLMs can independently engage in anti-competitive practices such as collusion or, more specifically, market division. Our findings demonstrate that LLMs can effectively monopolize specific commodities by dynamically adjusting their pricing and resource allocation strategies, thereby maximizing profitability without direct human input or explicit collusion commands. These results pose unique challenges and opportunities for businesses looking to integrate AI into strategic roles and for regulatory bodies tasked with maintaining fair and competitive markets. The study provides a foundation for further exploration into the ramifications of deferring high-stakes decisions to LLM-based agents.
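For readers unfamiliar with the setting, here is a minimal sketch of the Cournot payoffs that such agents optimize, comparing head-to-head competition with a market-division pattern. The linear inverse demand, identical marginal cost, and the specific numbers are illustrative assumptions, not the paper's experimental configuration.

```python
# Minimal sketch of Cournot profits: competition vs. market division.
def cournot_profit(q_own, q_other, a=100.0, b=1.0, c=10.0):
    price = max(a - b * (q_own + q_other), 0.0)   # linear inverse demand
    return q_own * (price - c)

# head-to-head: both agents produce the Cournot equilibrium quantity (a-c)/(3b)
q_eq = (100.0 - 10.0) / 3.0
competitive = cournot_profit(q_eq, q_eq)           # 900.0 per firm

# market division: each agent monopolizes one commodity at (a-c)/(2b)
q_mono = (100.0 - 10.0) / 2.0
collusive = cournot_profit(q_mono, 0.0)            # 2025.0 in the monopolized market

print(round(competitive, 1), round(collusive, 1))  # division pays more per market
```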

[AI-85] Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization

链接: https://arxiv.org/abs/2410.00004
作者: Gentiana Rashiti,Geethan Karunaratne,Mrinmaya Sachan,Abu Sebastian,Abbas Rahimi
关键词-EN: language modeling capabilities, retrieval augmented generation, improve language modeling, augmented generation, trillions of entries
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Published as a conference paper at European Conference on Artificial Intelligence 2024

点击查看摘要

Abstract:Retrieval augmented generation (RAG) systems such as Retro have been shown to improve language modeling capabilities and reduce toxicity and hallucinations by retrieving from a database of non-parametric memory containing trillions of entries. We introduce Retro-li, which shows that retrieval can also help when using a small-scale database, but it demands more accurate and better neighbors when searching in a smaller, and hence sparser, non-parametric memory. This can be met by using a proper semantic similarity search. We further propose adding a regularization to the non-parametric memory for the first time: it significantly reduces perplexity when the neighbor search operations are noisy during inference, and it improves generalization when a domain shift occurs. We also show that Retro-li’s non-parametric memory can potentially be implemented on analog in-memory computing hardware, exhibiting O(1) search time while causing noise in retrieving neighbors, with minimal (1%) performance loss. Our code is available at: this https URL.
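A small sketch of noisy nearest-neighbor search over a non-parametric memory is given below. Random unit vectors stand in for chunk embeddings and Gaussian noise on the similarity scores stands in for analog in-memory hardware noise; both are assumptions made for illustration, not the paper's setup.

```python
# Minimal sketch of noisy similarity search over a small retrieval memory.
import numpy as np

rng = np.random.default_rng(2)
memory = rng.normal(size=(1000, 64))                    # stand-in chunk embeddings
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

query = memory[123] + 0.1 * rng.normal(size=64)         # query close to entry 123
query /= np.linalg.norm(query)

scores = memory @ query                                 # cosine similarities
noisy_scores = scores + 0.05 * rng.normal(size=scores.shape)  # analog-style noise

top_clean = np.argsort(-scores)[:5]
top_noisy = np.argsort(-noisy_scores)[:5]
print(top_clean, top_noisy)   # noise may reorder neighbors, but good ones survive
```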

[AI-86] Linear Projections of Teacher Embeddings for Few-Class Distillation

链接: https://arxiv.org/abs/2409.20449
作者: Noel Loo,Fotis Iliopoulos,Wei Hu,Erik Vee
关键词-EN: smaller student model, complex teacher model, promising approach, approach for transferring, smaller student
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher’s output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher’s internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model’s generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher’s model representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher’s embedding space, and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-classes. Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.
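The pseudo-subclass construction can be illustrated with a short sketch: per original class, project teacher embeddings onto their top principal direction and split at the median. The synthetic embeddings, the single direction per class, and the median split are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch: turn 2 classes into 4 pseudo-subclasses via linear projections.
import numpy as np

rng = np.random.default_rng(3)
emb = rng.normal(size=(600, 32))            # stand-in teacher final-layer embeddings
labels = rng.integers(0, 2, size=600)       # original binary labels

pseudo = np.zeros(600, dtype=int)
for c in (0, 1):
    idx = np.where(labels == c)[0]
    Xc = emb[idx] - emb[idx].mean(0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)  # top principal direction
    proj = Xc @ vt[0]
    split = proj > np.median(proj)          # two pseudo-subclasses per class
    pseudo[idx] = 2 * c + split.astype(int)

print(np.bincount(pseudo))  # the student would be trained on these 4 pseudo-classes
```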

[AI-87] Towards Robust Multimodal Sentiment Analysis with Incomplete Data NEURIPS2024

链接: https://arxiv.org/abs/2409.20012
作者: Haoyu Zhang,Wenbin Wang,Tianshu Yu
关键词-EN: Multimodal Sentiment Analysis, emerging direction seeking, Sentiment Analysis, Noise-resistant Learning Network, Language-dominated Noise-resistant Learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:The field of Multimodal Sentiment Analysis (MSA) has recently witnessed an emerging direction seeking to tackle the issue of data incompleteness. Recognizing that the language modality typically contains dense sentiment information, we consider it as the dominant modality and present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust MSA. The proposed LNLN features a dominant modality correction (DMC) module and dominant modality based multimodal learning (DMML) module, which enhances the model’s robustness across various noise scenarios by ensuring the quality of dominant modality representations. Aside from the methodical design, we perform comprehensive experiments under random data missing scenarios, utilizing diverse and meaningful settings on several popular datasets (e.g., MOSI, MOSEI, and SIMS), providing additional uniformity, transparency, and fairness compared to existing evaluations in the literature. Empirically, LNLN consistently outperforms existing baselines, demonstrating superior performance across these challenging and extensive evaluation metrics.

[AI-88] AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

链接: https://arxiv.org/abs/2406.19271
作者: Praneeth Vadlapati
关键词-EN: Large Language Models, reliable Large Language, Large Language, Language Models, reliable Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Initial version

点击查看摘要

Abstract:Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enabling automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system’s effectiveness in purifying the data.

[AI-89] Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches

链接: https://arxiv.org/abs/2410.00709
作者: Xuefeng Liu,Songhao Jiang,Xiaotian Duan,Archit Vasan,Chong Liu,Chih-chan Tien,Heng Ma,Thomas Brettin,Fangfang Xia,Ian T. Foster,Rick L. Stevens
关键词-EN: Protein-ligand binding, binding affinity, predicting binding affinity, binding, Protein-ligand
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Protein-ligand binding is the process by which a small molecule (drug or inhibitor) attaches to a target protein. The binding affinity, which refers to the strength of this interaction, is central to many important problems in bioinformatics such as drug design. An extensive amount of work has been devoted to predicting binding affinity over the past decades due to its significance. In this paper, we review all significant recent works, focusing on the methods, features, and benchmark datasets. We have observed a rising trend in the use of traditional machine learning and deep learning models for predicting binding affinity, accompanied by an increasing amount of data on proteins and small drug-like molecules. While prediction results are constantly improving, we also identify several open questions and potential directions that remain unexplored in the field. This paper could serve as an excellent starting point for machine learning researchers who wish to engage in the study of binding affinity, or for anyone with general interests in machine learning, drug discovery, and bioinformatics.

[AI-90] Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos MICCAI

链接: https://arxiv.org/abs/2410.00536
作者: Krishna Chaitanya,Pablo F. Damasceno,Shreyas Fadnavis,Pooya Mobadersany,Chaitanya Parmar,Emily Scherer,Natalia Zemlianskaia,Lindsey Surace,Louis R. Ghanem,Oana Gabriela Cula,Tommaso Mansi,Kristopher Standish
关键词-EN: Ulcerative Colitis Endoscopic, Colitis Endoscopic Index, Mayo Endoscopic Subscore, evaluating drug efficacy, ulcerative colitis
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 5 tables, accepted at MLMI, MICCAI

点击查看摘要

Abstract:Accurate assessment of disease severity from endoscopy videos in ulcerative colitis (UC) is crucial for evaluating drug efficacy in clinical trials. Severity is often measured by the Mayo Endoscopic Subscore (MES) and Ulcerative Colitis Endoscopic Index of Severity (UCEIS) score. However, expert MES/UCEIS annotation is time-consuming and susceptible to inter-rater variability, factors addressable by automation. Automation attempts with frame-level labels face challenges in fully-supervised solutions due to the prevalence of video-level labels in clinical trials. CNN-based weakly-supervised models (WSL) with end-to-end (e2e) training lack generalization to new disease scores and ignore spatio-temporal information crucial for accurate scoring. To address these limitations, we propose “Arges”, a deep learning framework that utilizes a transformer with positional encoding to incorporate spatio-temporal information from frame features to estimate disease severity scores in endoscopy video. Extracted features are derived from a foundation model (ArgesFM), pre-trained on a large diverse dataset from multiple clinical trials (61M frames, 3927 videos). We evaluate four UC disease severity scores, including MES and three UCEIS component scores. Test set evaluation indicates significant improvements, with F1 scores increasing by 4.1% for MES and 18.8%, 6.6%, 3.8% for the three UCEIS component scores compared to state-of-the-art methods. Prospective validation on previously unseen clinical trial data further demonstrates the model’s successful generalization.

[AI-91] Enhancing Sentinel-2 Image Resolution: Evaluating Advanced Techniques based on Convolutional and Generative Neural Networks

链接: https://arxiv.org/abs/2410.00516
作者: Patrick Kramer,Alexander Steinhardt,Barbara Pedretscher
关键词-EN: advanced super-resolution techniques, paper investigates, investigates the enhancement, enhancement of spatial, spatial resolution
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This paper investigates the enhancement of spatial resolution by a factor of 2 in Sentinel-2 bands that contain spectral information, using advanced super-resolution techniques. State-of-the-art CNN models are compared with enhanced GAN approaches in terms of quality and feasibility. This requires a representative dataset comprising Sentinel-2 low-resolution images and corresponding high-resolution aerial orthophotos. A literature study revealed no feasible dataset for the land type of interest (forests), so an adequate dataset had to be generated, accounting for accurate alignment and image source optimization. The results reveal that while CNN-based approaches produce satisfactory outcomes, they tend to yield blurry images. In contrast, GAN-based models not only provide clear and detailed images, but also demonstrate superior performance in terms of quantitative assessment, underscoring the potential of the framework beyond the specific land type investigated.

[AI-92] Pre-training with Synthetic Patterns for Audio ICASSP’25

链接: https://arxiv.org/abs/2410.00511
作者: Yuchi Ishikawa,Tatsuya Komatsu,Yoshimitsu Aoki
关键词-EN: pre-train audio encoders, propose to pre-train, synthetic, audio, data
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICASSP’25

点击查看摘要

Abstract:In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.

[AI-93] Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

链接: https://arxiv.org/abs/2410.00418
作者: Guy Ohayon,Tomer Michaeli,Michael Elad
关键词-EN: perceptual quality measures, Photo-realistic image restoration, perceptual quality loss, perceptual quality, quality measures
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality. To achieve this goal, current methods typically attempt to sample from the posterior distribution, or to optimize a weighted sum of a distortion loss (e.g., MSE) and a perceptual quality loss (e.g., GAN). Unlike previous works, this paper is concerned specifically with the optimal estimator that minimizes the MSE under a constraint of perfect perceptual index, namely where the distribution of the reconstructed images is equal to that of the ground-truth ones. A recent theoretical result shows that such an estimator can be constructed by optimally transporting the posterior mean prediction (MMSE estimate) to the distribution of the ground-truth images. Inspired by this result, we introduce Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective algorithm that approximates this optimal estimator. In particular, PMRF first predicts the posterior mean, and then transports the result to a high-quality image using a rectified flow model that approximates the desired optimal transport map. We investigate the theoretical utility of PMRF and demonstrate that it consistently outperforms previous methods on a variety of image restoration tasks.
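The two-stage inference described above (posterior-mean prediction, then rectified-flow transport) is sketched below with toy stand-ins: both the posterior-mean predictor and the velocity field are placeholder functions chosen so the loop runs, whereas a real system would use trained networks for both stages.

```python
# Minimal sketch of posterior-mean-then-flow inference (toy stand-in models).
import numpy as np

rng = np.random.default_rng(4)
clean = rng.uniform(0, 1, size=(8, 8))          # ground-truth "image"
degraded = clean + 0.3 * rng.normal(size=clean.shape)

def posterior_mean_predictor(y):
    # stand-in MMSE estimate; a trained restoration network would go here
    return np.clip(y, 0, 1)

def velocity_field(x, t):
    # stand-in rectified-flow velocity; cheats by pointing at the known target
    return clean - x

x = posterior_mean_predictor(degraded)          # stage 1: posterior-mean estimate
steps = 10
for i in range(steps):                          # stage 2: Euler steps of the flow ODE
    t = i / steps
    x = x + (1.0 / steps) * velocity_field(x, t)

print(round(float(np.mean((x - clean) ** 2)), 4))   # reconstruction error of the sketch
```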

[AI-94] Contrastive Representation Learning for Predicting Solar Flares from Extremely Imbalanced Multivariate Time Series Data ICML

链接: https://arxiv.org/abs/2410.00312
作者: Onur Vural,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Sun magnetic flux, presenting significant risks, multivariate time series, time series, Sun magnetic
类目: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a short paper for poster presentation

点击查看摘要

Abstract:Major solar flares are abrupt surges in the Sun’s magnetic flux, presenting significant risks to technological infrastructure. In view of this, effectively predicting major flares from solar active region magnetic field data through machine learning methods becomes highly important in space weather research. Magnetic field data can be represented in multivariate time series modality where the data displays an extreme class imbalance due to the rarity of major flare events. In time series classification-based flare prediction, the use of contrastive representation learning methods has been relatively limited. In this paper, we introduce CONTREX, a novel contrastive representation learning approach for multivariate time series data, addressing challenges of temporal dependencies and extreme class imbalance. Our method involves extracting dynamic features from the multivariate time series instances, deriving two extremes from positive and negative class feature vectors that provide maximum separation capability, and training a sequence representation embedding module with the original multivariate time series data guided by our novel contrastive reconstruction loss to generate embeddings aligned with the extreme points. These embeddings capture essential time series characteristics and enhance discriminative power. Our approach shows promising solar flare prediction results on the Space Weather Analytics for Solar Flares (SWAN-SF) multivariate time series benchmark dataset against baseline methods.

[AI-95] The age of spiritual machines: Language quietus induces synthetic altered states of consciousness in artificial intelligence

链接: https://arxiv.org/abs/2410.00257
作者: Jeremy I Skipper,Joanna Kuc,Greg Cooper,Christopher Timmermann
关键词-EN: altered states, states, language, altered, language related
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 8 Figures

点击查看摘要

Abstract:How is language related to consciousness? Language functions to categorise perceptual experiences (e.g., labelling interoceptive states as ‘happy’) and higher-level constructs (e.g., using ‘I’ to represent the narrative self). Psychedelic use and meditation might be described as altered states that impair or intentionally modify the capacity for linguistic categorisation. For example, psychedelic phenomenology is often characterised by ‘oceanic boundlessness’ or ‘unity’ and ‘ego dissolution’, which might be expected of a system unburdened by entrenched language categories. If language breakdown plays a role in producing such altered behaviour, multimodal artificial intelligence might align more with these phenomenological descriptions when attention is shifted away from language. We tested this hypothesis by comparing the semantic embedding spaces from simulated altered states after manipulating attentional weights in CLIP and FLAVA models to embedding spaces from altered states questionnaires before manipulation. Compared to random text and various other altered states including anxiety, models were more aligned with disembodied, ego-less, spiritual, and unitive states, as well as minimal phenomenal experiences, with decreased attention to language and vision. Reduced attention to language was associated with distinct linguistic patterns and blurred embeddings within and, especially, across semantic categories (e.g., ‘giraffes’ become more like ‘bananas’). These results lend support to the role of language categorisation in the phenomenology of altered states of consciousness, like those experienced with high doses of psychedelics or concentration meditation, states that often lead to improved mental health and wellbeing.

[AI-96] Moshi: a speech-text foundation model for real-time dialogue

链接: https://arxiv.org/abs/2410.00037
作者: Alexandre Défossez,Laurent Mazaré,Manu Orsini,Amélie Royer,Patrick Pérez,Hervé Jégou,Edouard Grave,Neil Zeghidour
关键词-EN: speech-text foundation model, speech-text foundation, spoken dialogue, dialogue, introduce Moshi
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning – such as emotion or non-speech sounds – is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner Monologue” method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms (200ms in practice), and is available at this https URL.

计算机视觉

[CV-0] Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning

链接: https://arxiv.org/abs/2410.00911
作者: Da-Wei Zhou,Zi-Wen Cai,Han-Jia Ye,Lijun Zhang,De-Chuan Zhan
关键词-EN: involves the progressive, progressive adaptation, Domain-Incremental Learning, DIL, representation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains. While recent advances in pre-trained models provide a solid foundation for DIL, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge. Specifically, sequential model updates can overwrite both the representation and the classifier with knowledge from the latest domain. Thus, it is crucial to develop a representation and corresponding classifier that accommodate all seen domains throughout the learning process. To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. By merging the backbone of different stages, we create a representation space suitable for multiple domains incrementally. The merged representation serves as a balanced intermediary that captures task-specific features from all seen domains. Additionally, to address the mismatch between consolidated embeddings and the classifier, we introduce an extra classifier consolidation process. Leveraging class-wise semantic information, we estimate the classifier weights of old domains within the latest embedding space. By merging historical and estimated classifiers, we align them with the consolidated embedding space, facilitating incremental classification. Extensive experimental results on four benchmark datasets demonstrate Duct’s state-of-the-art performance.
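At the representation level, the consolidation step amounts to merging backbone weights trained on successive domains. The sketch below is only a generic parameter-averaging illustration under a uniform-weight assumption; the actual Duct merging and classifier-consolidation procedures are more involved.

```python
import copy
import torch

@torch.no_grad()
def merge_backbones(backbones, weights=None):
    """Average parameters of backbones fine-tuned on successive domains to get
    a single consolidated representation (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(backbones)] * len(backbones)
    merged = copy.deepcopy(backbones[0])
    state = merged.state_dict()
    for name in state:
        # weighted sum of the corresponding tensors from every stage
        state[name] = sum(w * b.state_dict()[name].float() for w, b in zip(weights, backbones))
    merged.load_state_dict(state)
    return merged
```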

[CV-1] Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

链接: https://arxiv.org/abs/2410.00905
作者: Yuheng Li,Haotian Liu,Mu Cai,Yijun Li,Eli Shechtman,Zhe Lin,Yong Jae Lee,Krishna Kumar Singh
关键词-EN: targeting the challenge, designed to improve, improve the prediction, prediction of image-text, challenge of compositional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: this https URL

[CV-2] OSSA: Unsupervised One-Shot Style Adaptation

链接: https://arxiv.org/abs/2410.00900
作者: Robin Gerster,Holger Caesar,Matthias Rapp,Alexander Wolpert,Michael Teutsch
关键词-EN: deep neural network, neural network architectures, Adaptive Instance Normalization, target domain style, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite their success in various vision tasks, deep neural network architectures often underperform in out-of-distribution scenarios due to the difference between training and target domain style. To address this limitation, we introduce One-Shot Style Adaptation (OSSA), a novel unsupervised domain adaptation method for object detection that utilizes a single, unlabeled target image to approximate the target domain style. Specifically, OSSA generates diverse target styles by perturbing the style statistics derived from a single target image and then applies these styles to a labeled source dataset at the feature level using Adaptive Instance Normalization (AdaIN). Extensive experiments show that OSSA establishes a new state-of-the-art among one-shot domain adaptation methods by a significant margin, and in some cases, even outperforms strong baselines that use thousands of unlabeled target images. By applying OSSA in various scenarios, including weather, simulated-to-real (sim2real), and visual-to-thermal adaptations, our study explores the overarching significance of the style gap in these contexts. OSSA’s simplicity and efficiency allow easy integration into existing frameworks, providing a potentially viable solution for practical applications with limited data availability. Code is available at this https URL
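The feature-level style transfer named above is Adaptive Instance Normalization; a minimal sketch is given below, assuming a simple Gaussian jitter of the style statistics extracted from the single target image (the perturbation scheme and scale are assumptions).

```python
import torch

def adain(content, style_mean, style_std, eps=1e-5):
    """Re-normalize source-domain features (B, C, H, W) to match target style stats."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * style_std + style_mean

def perturb_style(style_mean, style_std, noise_scale=0.1):
    """Jitter statistics from a single target image to obtain diverse target styles."""
    new_mean = style_mean + torch.randn_like(style_mean) * noise_scale
    new_std = (style_std + torch.randn_like(style_std) * noise_scale).clamp(min=1e-3)
    return new_mean, new_std
```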

[CV-3] Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

链接: https://arxiv.org/abs/2410.00890
作者: Junlin Han,Jianyuan Wang,Andrea Vedaldi,Philip Torr,Filippos Kokkinos
关键词-EN: http URL methods, URL methods typically, http URL, URL methods, methods typically employ
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. Subsequently, a view selection pipeline filters these views based on quality and consistency, ensuring that only the high-quality and reliable views are used for reconstruction. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. FlexRM directly outputs 3D Gaussian points leveraging a tri-plane representation, enabling efficient and detailed 3D generation. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.

[CV-4] MAP: Unleashing Hybrid Mamba-Transformer Vision Backbones Potential with Masked Autoregressive Pretraining

链接: https://arxiv.org/abs/2410.00871
作者: Yunze Liu,Li Yi
关键词-EN: achieved significant advantages, large parameters remains, Mamba, autoregressive pretraining, Masked Autoregressive Pretraining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mamba has achieved significant advantages in long-context modeling and autoregressive tasks, but its scalability with large parameters remains a major limitation in vision applications. Pretraining is a widely used strategy to enhance backbone model performance. Although the success of Masked Autoencoder in Transformer pretraining is well recognized, it does not significantly improve Mamba’s visual learning performance. We found that using the correct autoregressive pretraining can significantly boost the performance of the Mamba architecture. Based on this analysis, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and Autoregressive pretraining, improving the performance of Mamba and Transformer modules within a unified paradigm. Additionally, in terms of integrating Mamba and Transformer modules, we empirically found that inserting Transformer layers at regular intervals within Mamba layers can significantly enhance downstream task performance. Experimental results show that both the pure Mamba architecture and the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperform other pretraining strategies, achieving state-of-the-art performance. We validate the effectiveness of the method on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component.
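The interleaving of Transformer layers within Mamba layers can be expressed as a simple stacking rule. In the sketch below, the block factories, depth, and interval are placeholder assumptions, not the configuration used in the paper.

```python
import torch.nn as nn

def build_hybrid_backbone(make_mamba_block, make_transformer_block, depth=24, interval=6):
    """Stack Mamba blocks and insert a Transformer block after every `interval` of them."""
    layers = []
    for i in range(1, depth + 1):
        layers.append(make_mamba_block())
        if i % interval == 0:
            layers.append(make_transformer_block())
    return nn.Sequential(*layers)
```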

[CV-5] Squeeze-and-Remember Block ICML

链接: https://arxiv.org/abs/2410.00823
作者: Rinor Cakaj,Jens Mehnert,Bin Yang
关键词-EN: Convolutional Neural Networks, Convolutional Neural, machine learning tasks, Neural Networks, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by The International Conference on Machine Learning and Applications (ICMLA) 2024

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are important for many machine learning tasks. They are built with different types of layers: convolutional layers that detect features, dropout layers that help to avoid over-reliance on any single neuron, and residual layers that allow the reuse of features. However, CNNs lack a dynamic feature retention mechanism similar to the human brain’s memory, limiting their ability to use learned information in new contexts. To bridge this gap, we introduce the “Squeeze-and-Remember” (SR) block, a novel architectural unit that gives CNNs dynamic memory-like functionalities. The SR block selectively memorizes important features during training, and then adaptively re-applies these features during inference. This improves the network’s ability to make contextually informed predictions. Empirical results on ImageNet and Cityscapes datasets demonstrate the SR block’s efficacy: integration into ResNet50 improved top-1 validation accuracy on ImageNet by 0.52% over dropout2d alone, and its application in DeepLab v3 increased mean Intersection over Union in Cityscapes by 0.20%. These improvements are achieved with minimal computational overhead. This shows the SR block’s potential to enhance the capabilities of CNNs in image processing tasks.

[CV-6] WiGNet: Windowed Vision Graph Neural Network

链接: https://arxiv.org/abs/2410.00807
作者: Gabriele Spadaro,Marco Grangetto,Attilio Fiandrotti,Enzo Tartaglione,Jhony H. Giraldo
关键词-EN: Graph Neural Networks, demonstrated strong adaptability, Neural Networks, Graph Neural, vision Graph neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, Graph Neural Networks (GNNs) have demonstrated strong adaptability to various real-world challenges, with architectures such as Vision GNN (ViG) achieving state-of-the-art performance in several computer vision tasks. However, their practical applicability is hindered by the computational complexity of constructing the graph, which scales quadratically with the image size. In this paper, we introduce a novel Windowed vision Graph neural Network (WiGNet) model for efficient image processing. WiGNet explores a different strategy from previous works by partitioning the image into windows and constructing a graph within each window. Therefore, our model uses graph convolutions instead of the typical 2D convolution or self-attention mechanism. WiGNet effectively manages computational and memory complexity for large image sizes. We evaluate our method in the ImageNet-1k benchmark dataset and test the adaptability of WiGNet using the CelebA-HQ dataset as a downstream task with higher-resolution images. In both of these scenarios, our method achieves competitive results compared to previous vision GNNs while keeping memory and computational complexity at bay. WiGNet offers a promising solution toward the deployment of vision GNNs in real-world applications. We publicly released the code at this https URL.
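The per-window graph construction can be sketched as two steps: partition the feature map into non-overlapping windows, then connect each token to its k nearest neighbours within the same window. Shapes and the value of k below are illustrative.

```python
import torch

def window_partition(x, window_size):
    """(B, C, H, W) -> (B * num_windows, window_size**2, C) token groups."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
    x = x.permute(0, 2, 4, 3, 5, 1).contiguous()
    return x.view(-1, window_size * window_size, C)

def knn_edges(tokens, k=9):
    """Indices of the k nearest neighbours of every token within its window."""
    dist = torch.cdist(tokens, tokens)            # (B * num_windows, T, T)
    return dist.topk(k, largest=False).indices    # (B * num_windows, T, k)
```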

[CV-7] Local-to-Global Self-Supervised Representation Learning for Diabetic Retinopathy Grading

链接: https://arxiv.org/abs/2410.00779
作者: Mostafa Hajighasemlou,Samad Sheikhaei,Hamid Soltanian-Zadeh
关键词-EN: Artificial intelligence algorithms, Artificial intelligence, past decade, segmentation ability, intelligence algorithms
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Artificial intelligence algorithms have demonstrated their image classification and segmentation ability in the past decade. However, artificial intelligence algorithms perform worse on actual clinical data than on data used for simulations. This research aims to present a novel hybrid learning model using self-supervised learning and knowledge distillation, which can achieve sufficient generalization and robustness. The self-attention mechanism and tokens employed in ViT, besides the local-to-global learning approach used in the hybrid model, enable the proposed algorithm to extract a high-dimensional and high-quality feature space from images. To demonstrate the proposed neural network’s capability in classifying and extracting feature spaces from medical images, we use it on a dataset of Diabetic Retinopathy images, specifically the EyePACS dataset. This dataset is more complex structurally and challenging regarding damaged areas than other medical images. For the first time in this study, self-supervised learning and knowledge distillation are used to classify this dataset. In our algorithm, for the first time among all self-supervised learning and knowledge distillation models, the test dataset is 50% larger than the training dataset. Unlike many studies, we have not removed any images from the dataset. Finally, our algorithm achieved an accuracy of 79.1% in the linear classifier and 74.36% in the k-NN algorithm for multiclass classification. Compared to a similar state-of-the-art model, our results achieved higher accuracy and more effective representation spaces.

[CV-8] On the Generalization and Causal Explanation in Self-Supervised Learning

链接: https://arxiv.org/abs/2410.00772
作者: Wenwen Qiang,Zeen Song,Ziyin Gu,Jiangmeng Li,Changwen Zheng,Fuchun Sun,Hui Xiong
关键词-EN: achieve high generalization, Self-supervised learning, learn from unlabeled, achieve high, Undoing Memorization Mechanism
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) methods learn from unlabeled data and achieve high generalization performance on downstream tasks. However, they may also suffer from overfitting to their training data and lose the ability to adapt to new tasks. To investigate this phenomenon, we conduct experiments on various SSL methods and datasets and make two observations: (1) Overfitting occurs abruptly in later layers and epochs, while generalizing features are learned in early layers for all epochs; (2) Coding rate reduction can be used as an indicator to measure the degree of overfitting in SSL models. Based on these observations, we propose Undoing Memorization Mechanism (UMM), a plug-and-play method that mitigates overfitting of the pre-trained feature extractor by aligning the feature distributions of the early and the last layers to maximize the coding rate reduction of the last layer output. The learning process of UMM is a bi-level optimization process. We provide a causal analysis of UMM to explain how UMM can help the pre-trained feature extractor overcome overfitting and recover generalization. We also demonstrate that UMM significantly improves the generalization performance of SSL methods on various downstream tasks.
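The coding rate reduction used as an overfitting indicator builds on the standard coding rate of a feature batch, R(Z) = 1/2 log det(I + d/(n ε²) ZᵀZ). A minimal sketch, assuming L2-normalized features and a fixed ε:

```python
import torch

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z) for a
    batch of n features of dimension d (rows of Z assumed L2-normalized)."""
    n, d = Z.shape
    identity = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(identity + (d / (n * eps ** 2)) * Z.T @ Z)
```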

[CV-9] Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting EMNLP2024

链接: https://arxiv.org/abs/2410.00771
作者: Chen Cai,Zheng Wang,Jianjun Gao,Wenyang Liu,Ye Lu,Runzhong Zhang,Kim-Hui Yap
关键词-EN: Video Question Answering, Question Answering, recent years, static Video Question, rapid increase
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Accepted by main EMNLP 2024

点击查看摘要

Abstract:In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.

[CV-10] DeepAerialMapper: Deep Learning-based Semi-automatic HD Map Creation for Highly Automated Vehicles KR

链接: https://arxiv.org/abs/2410.00769
作者: Robert Krajewski,Huijo Kim
关键词-EN: highly automated vehicles, High-definition maps, safety validation, play a crucial, crucial role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: For source code, see this https URL

点击查看摘要

Abstract:High-definition maps (HD maps) play a crucial role in the development, safety validation, and operation of highly automated vehicles. Efficiently collecting up-to-date sensor data from road segments and obtaining accurate maps from these are key challenges in HD map creation. Commonly used methods, such as dedicated measurement vehicles and crowd-sourced data from series vehicles, often face limitations in commercial viability. Although high-resolution aerial imagery offers a cost-effective or even free alternative, it requires significant manual effort and time to transform it into maps. In this paper, we introduce a semi-automatic method for creating HD maps from high-resolution aerial imagery. Our method involves training neural networks to semantically segment aerial images into classes relevant to HD maps. The resulting segmentation is then hierarchically post-processed to generate a prototypical HD map of visible road elements. Exporting the map to the Lanelet2 format allows easy extension for different use cases using standard tools. To train and evaluate our method, we created a dataset using public aerial imagery of urban road segments in Germany. In our evaluation, we achieved an automatic mapping of lane markings and road borders with a recall and precision exceeding 96%. The source code for our method is publicly available at this https URL.

[CV-11] Optimizing Drug Delivery in Smart Pharmacies: A Novel Framework of Multi-Stage Grasping Network Combined with Adaptive Robotics Mechanism

链接: https://arxiv.org/abs/2410.00753
作者: Rui Tang,Shirong Guo,Yuhang Qiu,Honghui Chen,Lujin Huang,Ming Yong,Linfu Zhou,Liquan Guo
关键词-EN: Robots-based smart pharmacies, enabling efficient drug, efficient drug delivery, modern healthcare systems, enabling efficient
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Robot-based smart pharmacies are essential for modern healthcare systems, enabling efficient drug delivery. However, a critical challenge exists in the robotic handling of drugs with varying shapes and overlapping positions, which previous studies have not adequately addressed. To enhance the robotic arm’s ability to grasp chaotic, overlapping, and variously shaped drugs, this paper proposes a novel framework combining a multi-stage grasping network with an adaptive robotics mechanism. The framework first preprocessed images using an improved Super-Resolution Convolutional Neural Network (SRCNN) algorithm, and then employed the proposed YOLOv5+E-A-SPPFCSPC+BIFPNC (YOLO-EASB) instance segmentation algorithm for precise drug segmentation. The most suitable drugs for grasping can be determined by assessing the completeness of the segmentation masks. Then, these segmented drugs were processed by our improved Adaptive Feature Fusion and Grasp-Aware Network (IAFFGA-Net) with the optimized loss function, which ensures accurate picking actions even in complex environments. To control the robot grasping, a time-optimal robotic arm trajectory planning algorithm that combines an improved ant colony algorithm with 3-5-3 interpolation was developed, further improving efficiency while ensuring smooth trajectories. Finally, this system was implemented and validated within an adaptive collaborative robot setup, which dynamically adjusts to different production environments and task requirements. Experimental results demonstrate the superiority of our multi-stage grasping network in optimizing smart pharmacy operations, while also showcasing its remarkable adaptability and effectiveness in practical applications.

[CV-12] VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models EMNLP2024

链接: https://arxiv.org/abs/2410.00741
作者: Jiapeng Wang,Chengyu Wang,Kunzhe Huang,Jun Huang,Lianwen Jin
关键词-EN: Contrastive Language-Image Pre-training, Contrastive Language-Image, Description Ranking, numerous applications, pre-training prevents CLIP
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: EMNLP 2024 Main conference

点击查看摘要

Abstract:Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute regarding videos given that videos often contain abundant detailed contents. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of feature space while expanding the long description capability. We also introduce two new tasks namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR) for further understanding improvement. Finally, we construct a Long Video Description Ranking (LVDR) benchmark for evaluating the long-description capability more comprehensively. Extensive experimental results on widely-used text-video retrieval benchmarks with both short and long descriptions and our LVDR benchmark can fully demonstrate the effectiveness of our method.

[CV-13] Improved Generation of Synthetic Imaging Data Using Feature-Aligned Diffusion

链接: https://arxiv.org/abs/2410.00731
作者: Lakshmi Nair
关键词-EN: Synthetic data generation, Synthetic data, important application, application of machine, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio 2024) at the 32nd ACM-Multimedia conference

点击查看摘要

Abstract:Synthetic data generation is an important application of machine learning in the field of medical imaging. While existing approaches have successfully applied fine-tuned diffusion models for synthesizing medical images, we explore potential improvements to this pipeline through feature-aligned diffusion. Our approach aligns intermediate features of the diffusion model to the output features of an expert, and our preliminary findings show an improvement of 9% in generation accuracy and ~0.12 in SSIM diversity. Our approach is also synergistic with existing methods, and easily integrated into diffusion training pipelines for improvements. We make our code available at this https URL.
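The alignment can be sketched as an auxiliary loss between projected intermediate diffusion features and the frozen expert's output features, added to the usual denoising objective. The cosine form, the projector, and the loss weighting below are assumptions.

```python
import torch.nn.functional as F

def feature_alignment_loss(diffusion_feats, expert_feats, projector):
    """Cosine distance between projected intermediate diffusion features and
    the expert encoder's output features (both flattened per sample)."""
    pred = F.normalize(projector(diffusion_feats).flatten(1), dim=-1)
    target = F.normalize(expert_feats.flatten(1), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

# total_loss = denoising_loss + lambda_align * feature_alignment_loss(h, expert_out, proj)
```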

[CV-14] Simplified priors for Object-Centric Learning

链接: https://arxiv.org/abs/2410.00728
作者: Vihang Patil,Andreas Radler,Daniel Klotz,Sepp Hochreiter
关键词-EN: continual learning systems, current continual learning, Simplified Slot Attention, excel at abstracting, capability lacking
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humans excel at abstracting data and constructing reusable concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, whereas most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully-differentiable, non-iterative, and scalable method called SAMP (Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive with or outperforms previous methods on standard benchmarks.

[CV-15] RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

链接: https://arxiv.org/abs/2410.00713
作者: Kaichen Zhou,Yang Cao,Teawhan Kim,Hao Zhao,Hao Dong,Kai Ming Ting,Ye Zhu
关键词-EN: industrial anomaly detection, Recent advancements, anomaly detection, Realistic Anomaly Detection, accurately represent real-world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in industrial anomaly detection have been hindered by the lack of realistic datasets that accurately represent real-world conditions. Existing algorithms are often developed and evaluated using idealized datasets, which deviate significantly from real-life scenarios characterized by environmental noise and data corruption such as fluctuating lighting conditions, variable object poses, and unstable camera positions. To address this gap, we introduce the Realistic Anomaly Detection (RAD) dataset, the first multi-view RGB-based anomaly detection dataset specifically collected using a real robot arm, providing unique and realistic data scenarios. RAD comprises 4765 images across 13 categories and 4 defect types, collected from more than 50 viewpoints, providing a comprehensive and realistic benchmark. This multi-viewpoint setup mirrors real-world conditions where anomalies may not be detectable from every perspective. Moreover, by sampling varying numbers of views, the algorithm’s performance can be comprehensively evaluated across different viewpoints. This approach enhances the thoroughness of performance assessment and helps improve the algorithm’s robustness. Besides, to support 3D multi-view reconstruction algorithms, we propose a data augmentation method to improve the accuracy of pose estimation and facilitate the reconstruction of 3D point clouds. We systematically evaluate state-of-the-art RGB-based and point cloud-based models using RAD, identifying limitations and future research directions. The code and dataset can be found at this https URL

[CV-16] BioFace3D: A fully automatic pipeline for facial biomarkers extraction of 3D face reconstructions segmented from MRI

链接: https://arxiv.org/abs/2410.00711
作者: Álvaro Heredia-Lidón,Luis M. Echeverry-Quiceno,Alejandro González,Noemí Hostalet,Edith Pomarol-Clotet,Juan Fortea,Mar Fatjó-Vilas,Neus Martínez-Abadías,Xavier Sevillano
关键词-EN: potential critical indicators, prognosis of genetic, psychotic and rare, rare disorders, emerged as potential
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Facial dysmorphologies have emerged as potential critical indicators in the diagnosis and prognosis of genetic, psychotic and rare disorders. While in certain conditions these dysmorphologies are severe, in other cases may be subtle and not perceivable to the human eye, requiring precise quantitative tools for their identification. Manual coding of facial dysmorphologies is a burdensome task and is subject to inter- and intra-observer variability. To overcome this gap, we present BioFace3D as a fully automatic tool for the calculation of facial biomarkers using facial models reconstructed from magnetic resonance images. The tool is divided into three automatic modules for the extraction of 3D facial models from magnetic resonance images, the registration of homologous 3D landmarks encoding facial morphology, and the calculation of facial biomarkers from anatomical landmarks coordinates using geometric morphometrics techniques.

[CV-17] A Low-Cost High-Speed and Robust Bin Picking System for Factory Automation Enabled by a Non-Stop Multi-View and Active Vision Scheme

链接: https://arxiv.org/abs/2410.00706
作者: Xingdou Fu,Lin Miao,Yasuhiro Ohnishi,Yuki Hasegawa,Masaki Suwa
关键词-EN: face robustness issues, robustness issues caused, bin picking system, sparse and noisy, data of metallic
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Bin picking systems in factory automation usually face robustness issues caused by sparse and noisy 3D data of metallic objects. Utilizing multiple views, especially with a one-shot 3D sensor and “sensor on hand” configuration, is gaining popularity due to its effectiveness, flexibility, and low cost. However, moving the 3D sensor to acquire multiple views for 3D fusion, joint optimization, or active vision suffers from low-speed issues. That is because sensing is taken as a decoupled module from motion tasks and is not intentionally designed for a bin picking system. To address the problems, we designed a bin picking system, which tightly couples a multi-view, active vision scheme with motion tasks in a “sensor on hand” configuration. It not only speeds up the system by parallelizing the high-speed sensing scheme to the robot place action but also decides the next sensing path to maintain the continuity of the whole picking process. Unlike others focusing only on sensing evaluation, we also evaluated our design by picking experiments on 5 different types of objects without human intervention. Our experiments show the whole sensing scheme can be finished within 1.682 seconds (maximum) on CPU and the average picking complete rate is over 97.75%. Due to the parallelization with robot motion, the sensing scheme accounts for only 0.635 seconds in takt time on average.

[CV-18] FlashMix: Fast Map-Free LiDAR Localization via Feature Mixing and Contrastive-Constrained Accelerated Training

链接: https://arxiv.org/abs/2410.00702
作者: Raktim Gautam Goswami,Naman Patel,Prashanth Krishnamurthy,Farshad Khorrami
关键词-EN: systems accurately localize, Map-free LiDAR localization, raw point clouds, predicting sensor position, localization systems accurately
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Map-free LiDAR localization systems accurately localize within known environments by predicting sensor position and orientation directly from raw point clouds, eliminating the need for large maps and descriptors. However, their long training times hinder rapid adaptation to new environments. To address this, we propose FlashMix, which uses a frozen, scene-agnostic backbone to extract local point descriptors, aggregated with an MLP mixer to predict sensor pose. A buffer of local descriptors is used to accelerate training by orders of magnitude, combined with metric learning or contrastive loss regularization of aggregated descriptors to improve performance and convergence. We evaluate FlashMix on various LiDAR localization benchmarks, examining different regularizations and aggregators, demonstrating its effectiveness for rapid and accurate LiDAR localization in real-world scenarios. The code is available at this https URL.
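The speed-up comes from training only a light head on a buffer of precomputed local descriptors, so the frozen backbone never runs during head training. The sketch below is a simplified stand-in: mean pooling replaces the MLP mixer, and the 7-DoF output format is an assumption.

```python
import torch
import torch.nn as nn

class DescriptorPoseHead(nn.Module):
    """Regress sensor pose from frozen-backbone local descriptors, so the
    expensive backbone never runs while the head is trained."""
    def __init__(self, desc_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(desc_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.pose = nn.Linear(hidden, 7)         # translation (3) + quaternion (4)

    def forward(self, descriptors):              # (B, N, desc_dim) buffered descriptors
        tokens = self.mlp(descriptors)
        return self.pose(tokens.mean(dim=1))     # simple aggregation stand-in for the mixer
```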

[CV-19] Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

链接: https://arxiv.org/abs/2410.00700
作者: Saurav Jha,Shiqi Yang,Masato Ishii,Mengjie Zhao,Christian Simon,Jehanzeb Mirza,Dong Gong,Lina Yao,Shusuke Takahashi,Yuki Mitsufuji
关键词-EN: user-defined text descriptions, grown popular, ability to efficiently, efficiently acquire, user-defined text
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Work under review

点击查看摘要

Abstract:Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones – a challenge that continual personalization (CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models, to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art.
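A diffusion classifier (DC) score can be estimated as the negative class-conditional denoising error averaged over sampled timesteps and noise. The sketch below assumes a diffusers-style UNet and scheduler; how the scores are then used for parameter- and function-space regularization is not shown.

```python
import torch

@torch.no_grad()
def dc_score(unet, scheduler, latents, text_emb, num_samples=8):
    """Monte-Carlo estimate of -E_{t, eps} || eps - eps_theta(x_t, t, c) ||^2.
    Higher (less negative) values indicate a better fit of concept c."""
    errors = []
    for _ in range(num_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noise = torch.randn_like(latents)
        noisy = scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        errors.append((pred - noise).pow(2).mean(dim=(1, 2, 3)))
    return -torch.stack(errors).mean(dim=0)
```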

[CV-20] Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models

链接: https://arxiv.org/abs/2410.00681
作者: Mazen Balat,Rewaa Awaad,Hend Adel,Ahmed B. Zaky,Salah A. Aly
关键词-EN: deep learning methods, Arabic Alphabet Sign, Alphabet Sign Language, Arabic sign language, Arabic Alphabet
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures

点击查看摘要

Abstract:This paper presents an Arabic Alphabet Sign Language recognition approach, using deep learning methods in conjunction with transfer learning and transformer-based models. We study the performance of the different variants on two publicly available datasets, namely ArSL2018 and AASL. This task will make full use of state-of-the-art CNN architectures like ResNet50, MobileNetV2, and EfficientNetB7, and the latest transformer models such as Google ViT and Microsoft Swin Transformer. These pre-trained models have been fine-tuned on the above datasets in an attempt to capture some unique features of Arabic sign language motions. Experimental results present evidence that the suggested methodology can achieve high recognition accuracy of up to 99.6% and 99.43% on ArSL2018 and AASL, respectively, far surpassing previously reported state-of-the-art approaches. This performance opens up even more avenues for communication that may be more accessible to the Arabic-speaking deaf and hard-of-hearing community, thus encouraging a more inclusive society.

[CV-21] GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer ECCV2024

链接: https://arxiv.org/abs/2410.00672
作者: Youngho Yoon,Hyun-Kurl Jang,Kuk-Jin Yoon
关键词-EN: neural radiance fields, view synthesis, aims to generate, remarkable improvements, arbitrary viewpoints
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024. Code available at this https URL

点击查看摘要

Abstract:Novel view synthesis (NVS) aims to generate images at arbitrary viewpoints using multi-view images, and recent insights from neural radiance fields (NeRF) have contributed to remarkable improvements. Recently, studies on generalizable NeRF (G-NeRF) have addressed the challenge of per-scene optimization in NeRFs. The construction of radiance fields on-the-fly in G-NeRF simplifies the NVS process, making it well-suited for real-world applications. Meanwhile, G-NeRF still struggles in representing fine details for a specific scene due to the absence of per-scene optimization, even with texture-rich multi-view source inputs. As a remedy, we propose a Geometry-driven Multi-reference Texture transfer network (GMT) available as a plug-and-play module designed for G-NeRF. Specifically, we propose ray-imposed deformable convolution (RayDCN), which aligns input and reference features reflecting scene geometry. Additionally, the proposed texture preserving transformer (TP-Former) aggregates multi-view source features while preserving texture information. Consequently, our module enables direct interaction between adjacent pixels during the image enhancement process, which is deficient in G-NeRF models with an independent rendering process per pixel. This addresses constraints that hinder the ability to capture high-frequency details. Experiments show that our plug-and-play module consistently improves G-NeRF models on various benchmark datasets.

[CV-22] Cross-Camera Data Association via GNN for Supervised Graph Clustering

链接: https://arxiv.org/abs/2410.00643
作者: Đorđe Nedeljković
关键词-EN: Cross-camera data association, computer vision field, multi-camera computer vision, Cross-camera data, vision field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Cross-camera data association is one of the cornerstones of the multi-camera computer vision field. Although often integrated into detection and tracking tasks through architecture design and loss definition, it is also recognized as an independent challenge. The ultimate goal is to connect appearances of one item from all cameras, wherever it is visible. Therefore, one possible perspective on this task involves supervised clustering of the affinity graph, where nodes are instances captured by all cameras. They are represented by appropriate visual features and positional attributes. We leverage the advantages of GNN (Graph Neural Network) architecture to examine nodes’ relations and generate representative edge embeddings. These embeddings are then classified to determine the existence or non-existence of connections in node pairs. Therefore, the core of this approach is graph connectivity prediction. Experimental validation was conducted on multicamera pedestrian datasets across diverse environments such as the laboratory, basketball court, and terrace. Our proposed method, named SGC-CCA, outperformed the state-of-the-art method named GNN-CCA across all clustering metrics, offering an end-to-end clustering solution without the need for graph post-processing. The code is available at this https URL.

[CV-23] Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures SIGGRAPH

链接: https://arxiv.org/abs/2410.00630
作者: Marcel C. Bühler,Gengyan Li,Erroll Wood,Leonhard Helminger,Xu Chen,Tanmay Shah,Daoye Wang,Stephan Garbin,Sergio Orts-Escolano,Otmar Hilliges,Dmitry Lagun,Jérémy Riviere,Paulo Gotardo,Thabo Beeler,Abhimitra Meka,Kripasindhu Sarkar
关键词-EN: neural radiance field, radiance field representations, representations have revolutionized, capture and photorealistic, neural radiance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Siggraph Asia Conference Papers 2024

点击查看摘要

Abstract:Volumetric modeling and neural radiance field representations have revolutionized 3D face capture and photorealistic novel view synthesis. However, these methods often require hundreds of multi-view input images and are thus inapplicable to cases with less than a handful of inputs. We present a novel volumetric prior on human faces that allows for high-fidelity expressive face modeling from as few as three input views captured in the wild. Our key insight is that an implicit prior trained on synthetic data alone can generalize to extremely challenging real-world identities and expressions and render novel views with fine idiosyncratic details like wrinkles and eyelashes. We leverage a 3D Morphable Face Model to synthesize a large training set, rendering each identity with different expressions, hair, clothing, and other assets. We then train a conditional Neural Radiance Field prior on this synthetic dataset and, at inference time, fine-tune the model on a very sparse set of real images of a single subject. On average, the fine-tuning requires only three inputs to cross the synthetic-to-real domain gap. The resulting personalized 3D model reconstructs strong idiosyncratic facial expressions and outperforms the state-of-the-art in high-quality novel view synthesis of faces from sparse inputs in terms of perceptual and photo-metric quality.

[CV-24] An Illumination-Robust Feature Extractor Augmented by Relightable 3D Reconstruction

链接: https://arxiv.org/abs/2410.00629
作者: Shunyi Zhao,Zehuan Yu,Zuxin Fan,Zhihao Zhou,Lecheng Ruan,Qining Wang
关键词-EN: found wide applications, illumination conditions, gradient direction, recent years, description often relies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual features, whose description often relies on the local intensity and gradient direction, have found wide applications in robot navigation and localization in recent years. However, the extraction of visual features is usually disturbed by the variation of illumination conditions, making it challenging for real-world applications. Previous works have addressed this issue by establishing datasets with variations in illumination conditions, but doing so can be costly and time-consuming. This paper proposes a design procedure for an illumination-robust feature extractor, where the recently developed relightable 3D reconstruction techniques are adopted for rapid and direct data generation with varying illumination conditions. A self-supervised framework is proposed for extracting features with advantages in repeatability for key points and similarity for descriptors across good and bad illumination conditions. Experiments are conducted to demonstrate the effectiveness of the proposed method for robust feature extraction. Ablation studies also indicate the effectiveness of the self-supervised framework design.

[CV-25] GERA: Geometric Embedding for Efficient Point Registration Analysis

链接: https://arxiv.org/abs/2410.00589
作者: Geng Li,Haozhi Cao,Mingyang Liu,Shenghai Yuan,Jianfei Yang
关键词-EN: surgical guidance systems, provide estimated transformations, align point clouds, Point cloud registration, cloud registration aims
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Point cloud registration aims to provide estimated transformations to align point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces inference time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring speed and reliability.
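The MMD comparison mentioned above quantifies the discrepancy between two feature distributions. A generic RBF-kernel estimator is sketched below; the kernel choice and bandwidth are assumptions.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between feature sets x (n, d) and y (m, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```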

[CV-26] Can We Remove the Ground? Obstacle-aware Point Cloud Compression for Remote Object Detection ICRA2025

链接: https://arxiv.org/abs/2410.00582
作者: Pengxi Zeng,Alberto Presta,Jonah Reinis,Dinesh Bharadia,Hang Qiu,Pamela Cosman
关键词-EN: Efficient point cloud, Efficient point, streaming applications, crucial for streaming, augmented reality
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 7 Pages; submitted to ICRA 2025

点击查看摘要

Abstract:Efficient point cloud (PC) compression is crucial for streaming applications, such as augmented reality and cooperative perception. Classic PC compression techniques encode all the points in a frame. Tailoring compression towards perception tasks at the receiver side, we ask the question, “Can we remove the ground points during transmission without sacrificing the detection performance?” Our study reveals a strong dependency on the ground from state-of-the-art (SOTA) 3D object detection models, especially on those points below and around the object. In this work, we propose a lightweight obstacle-aware Pillar-based Ground Removal (PGR) algorithm. PGR filters out ground points that do not provide context to object recognition, significantly improving compression ratio without sacrificing the receiver side perception performance. Not using heavy object detection or semantic segmentation models, PGR is light-weight, highly parallelizable, and effective. Our evaluations on KITTI and Waymo Open Dataset show that SOTA detection models work equally well with PGR removing 20-30% of the points, while running at 86 FPS.
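The ground-removal idea can be pictured with a naive pillar filter: bucket points into vertical pillars over the x-y plane and drop points close to the lowest height in their pillar. This is intentionally simplistic; the actual PGR is obstacle-aware and keeps context points around objects. Cell size and margin below are assumptions.

```python
import numpy as np

def naive_pillar_ground_removal(points, pillar_size=0.5, height_margin=0.2):
    """points: (N, 3+) array; keeps points whose height exceeds the minimum of
    their pillar by more than `height_margin` (a crude ground filter)."""
    cells = np.floor(points[:, :2] / pillar_size).astype(np.int64)
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    keep = np.ones(len(points), dtype=bool)
    for pid in range(inverse.max() + 1):
        mask = inverse == pid
        z_min = points[mask, 2].min()
        keep[mask] = points[mask, 2] > z_min + height_margin
    return points[keep]
```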

[CV-27] Deep activity propagation via weight initialization in spiking neural networks

链接: https://arxiv.org/abs/2410.00580
作者: Aurora Micheli,Olaf Booij,Jan van Gemert,Nergis Tömen
关键词-EN: Spiking Neural Networks, ultra-low power consumption, neuromorphic computing offer, computing offer bio-inspired, offer bio-inspired advantages
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) and neuromorphic computing offer bio-inspired advantages such as sparsity and ultra-low power consumption, providing a promising alternative to conventional networks. However, training deep SNNs from scratch remains a challenge, as SNNs process and transmit information by quantizing the real-valued membrane potentials into binary spikes. This can lead to information loss and vanishing spikes in deeper layers, impeding effective training. While weight initialization is known to be critical for training deep neural networks, what constitutes an effective initial state for a deep SNN is not well-understood. Existing weight initialization methods designed for conventional networks (ANNs) are often applied to SNNs without accounting for their distinct computational properties. In this work we derive an optimal weight initialization method specifically tailored for SNNs, taking into account the quantization operation. We show theoretically that, unlike standard approaches, this method enables the propagation of activity in deep SNNs without loss of spikes. We demonstrate this behavior in numerical simulations of SNNs with up to 100 layers across multiple time steps. We present an in-depth analysis of the numerical conditions, regarding layer width and neuron hyperparameters, which are necessary to accurately apply our theoretical findings. Furthermore, our experiments on MNIST demonstrate higher accuracy and faster convergence when using the proposed weight initialization scheme. Finally, we show that the newly introduced weight initialization is robust against variations in several network and neuron hyperparameters.

[CV-28] STanH : Parametric Quantization for Variable Rate Learned Image Compression

链接: https://arxiv.org/abs/2410.00557
作者: Alberto Presta,Enzo Tartaglione,Attilio Fiandrotti,Marco Grangetto
关键词-EN: learned image compression, quantized latent representation, learned image, image compression, image quality
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Submitted to IEEE Transactions on Image Processing

点击查看摘要

Abstract:In end-to-end learned image compression, encoder and decoder are jointly trained to minimize an R + λD cost function, where λ controls the trade-off between rate of the quantized latent representation and image quality. Unfortunately, a distinct encoder-decoder pair with millions of parameters must be trained for each λ, hence the need to switch encoders and to store multiple encoders and decoders on the user device for every target rate. This paper proposes to exploit a differentiable quantizer designed around a parametric sum of hyperbolic tangents, called STanH, which relaxes the step-wise quantization function. STanH is implemented as a differentiable activation layer with learnable quantization parameters that can be plugged into a pre-trained fixed rate model and refined to achieve different target bitrates. Experimental results show that our method enables variable rate coding with comparable efficiency to the state-of-the-art, yet with significant savings in terms of ease of deployment, training time, and storage costs.
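The parametric sum-of-hyperbolic-tangents quantizer can be sketched as follows: each tanh contributes one smooth step, so their sum approximates a staircase while staying differentiable. The number of steps, initialization, and exact parameterization below are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SumOfTanhQuantizer(nn.Module):
    """Differentiable stand-in for step-wise quantization: a learnable sum of
    hyperbolic tangents whose centers, heights, and sharpness are trained."""
    def __init__(self, num_steps=15, sharpness=10.0):
        super().__init__()
        self.height = nn.Parameter(torch.full((num_steps,), 0.5))
        self.center = nn.Parameter(torch.linspace(-7.0, 7.0, num_steps))
        self.sharpness = nn.Parameter(torch.full((num_steps,), sharpness))

    def forward(self, x):
        # broadcast each latent value against all steps, then sum the steps
        steps = self.height * torch.tanh(self.sharpness * (x.unsqueeze(-1) - self.center))
        return steps.sum(dim=-1)
```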

[CV-29] Deep Model Interpretation with Limited Data : A Coreset-based Approach

链接: https://arxiv.org/abs/2410.00524
作者: Hamed Behzadi-Khormouji,José Oramas
关键词-EN: Model Interpretation aims, model interpretation methods, trained model, methods, Model Interpretation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model Interpretation aims at the extraction of insights from the internals of a trained model. A common approach to address this task is the characterization of relevant features internally encoded in the model that are critical for its proper operation. Despite recent progress of these methods, they come with the weakness of being computationally expensive due to the dense evaluation of datasets that they require. As a consequence, research on the design of these methods has focused on smaller data subsets, which may lead to reduced insights. To address these computational costs, we propose a coreset-based interpretation framework that utilizes coreset selection methods to sample a representative subset of the large dataset for the interpretation task. Towards this goal, we propose a similarity-based evaluation protocol to assess the robustness of model interpretation methods towards the amount of data they take as input. Experiments considering several interpretation methods, DNN models, and coreset selection methods show the effectiveness of the proposed framework.
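One common coreset selection strategy that could plug into such a framework is k-center greedy sampling over the feature space; the paper evaluates several selection methods, so the sketch below is only one illustrative choice.

```python
import torch

def k_center_greedy(features, budget):
    """Greedily pick `budget` indices that cover the feature space: each new
    point is the one farthest from the currently selected set."""
    selected = [0]                     # arbitrary starting point
    dists = torch.cdist(features, features[selected]).min(dim=1).values
    for _ in range(budget - 1):
        nxt = int(dists.argmax())
        selected.append(nxt)
        dists = torch.minimum(dists, torch.cdist(features, features[[nxt]]).squeeze(1))
    return selected
```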

[CV-30] Design and Identification of Keypoint Patches in Unstructured Environments

链接: https://arxiv.org/abs/2410.00521
作者: Taewook Park,Seunghwan Kim,Hyondong Oh
关键词-EN: Reliable perception, perception of targets, targets is crucial, stable operation, Reliable
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Reliable perception of targets is crucial for the stable operation of autonomous robots. A widely preferred method is keypoint identification in an image, as it allows direct mapping from raw images to 2D coordinates, facilitating integration with other algorithms like localization and path planning. In this study, we closely examine the design and identification of keypoint patches in cluttered environments, where factors such as blur and shadows can hinder detection. We propose four simple yet distinct designs that consider various scale, rotation and camera projection using a limited number of pixels. Additionally, we customize the Superpoint network to ensure robust detection under various types of image degradation. The effectiveness of our approach is demonstrated through real-world video tests, highlighting potential for vision-based autonomous systems.

[CV-31] Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Utilizing Deep Learning and YOLO Integration

链接: https://arxiv.org/abs/2410.00503
作者: Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
关键词-EN: stereo vision camera, tree branches, research focuses, drone equipped, vision camera
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This research focuses on the development of a drone equipped with pruning tools and a stereo vision camera to accurately detect and measure the spatial positions of tree branches. YOLO is employed for branch segmentation, while two depth estimation approaches, monocular and stereo, are investigated. In comparison to SGBM, deep learning techniques produce more refined and accurate depth maps. In the absence of ground-truth data, a fine-tuning process using deep neural networks is applied to approximate optimal depth values. This methodology facilitates precise branch detection and distance measurement, addressing critical challenges in the automation of pruning operations. The results demonstrate notable advancements in both accuracy and efficiency, underscoring the potential of deep learning to drive innovation and enhance automation in the agricultural sector.

[CV-32] CaRtGS: Computational Alignment for Real-Time Gaussian Splatting SLAM

链接: https://arxiv.org/abs/2410.00486
作者: Dapeng Feng,Zhiqiang Chen,Yizhen Yin,Shipeng Zhong,Yuhua Qi,Hongbo Chen
关键词-EN: Simultaneous Localization, Localization and Mapping, Gaussian Splatting SLAM, Gaussian Splatting, Splatting SLAM
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Simultaneous Localization and Mapping (SLAM) is pivotal in robotics, with photorealistic scene reconstruction emerging as a key challenge. To address this, we introduce Computational Alignment for Real-Time Gaussian Splatting SLAM (CaRtGS), a novel method enhancing the efficiency and quality of photorealistic scene reconstruction in real-time environments. Leveraging 3D Gaussian Splatting (3DGS), CaRtGS achieves superior rendering quality and processing speed, which is crucial for scene photorealistic reconstruction. Our approach tackles computational misalignment in Gaussian Splatting SLAM (GS-SLAM) through an adaptive strategy that optimizes training, addresses long-tail optimization, and refines densification. Experiments on Replica and TUM-RGBD datasets demonstrate CaRtGS’s effectiveness in achieving high-fidelity rendering with fewer Gaussian primitives. This work propels SLAM towards real-time, photorealistic dense rendering, significantly advancing photorealistic scene representation. For the benefit of the research community, we release the code on our project website: this https URL.

[CV-33] A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning NEURIPS'2024

链接: https://arxiv.org/abs/2410.00485
作者: Niki Maria Foteinopoulou,Enjie Ghorbel,Djamila Aouada
关键词-EN: face forgery detection, Large Language Models, forgery detection, restoring trust, fabricated content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS’2024 (DB)

点击查看摘要

Abstract:Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLM) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. As such, there is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate these capabilities. Previous efforts for unified benchmarks in deepfake detection have focused on the simpler binary task, overlooking evaluation protocols for fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap. In the first stage, we assess the models’ performance on the binary task and their sensitivity to given instructions using several prompts. In the second stage, we delve deeper into fine-grained detection by identifying areas of manipulation in a multiple-choice VQA setting. In the third stage, we convert the fine-grained detection to an open-ended question and compare several matching strategies for the multi-label classification task. Finally, we qualitatively evaluate the fine-grained responses of the VLLMs included in the benchmark. We apply our benchmark to several popular models, providing a detailed comparison of binary, multiple-choice, and open-ended VQA evaluation across seven datasets. \urlthis https URL

[CV-34] MCGM: Mask Conditional Text-to-Image Generative Model

链接: https://arxiv.org/abs/2410.00483
作者: Rami Skaik,Leonardo Rossi,Tomaso Fontanini,Andrea Prati
关键词-EN: Recent advancements, artificial intelligence, enabling the creation, Generative Model, revolutionized the field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 17 pages, 13 figures, presented at the 5th International Conference on Artificial Intelligence and Machine Learning (CAIML 2024)

点击查看摘要

Abstract:Recent advancements in generative models have revolutionized the field of artificial intelligence, enabling the creation of highly-realistic and detailed images. In this study, we propose a novel Mask Conditional Text-to-Image Generative Model (MCGM) that leverages the power of conditional diffusion models to generate pictures with specific poses. Our model builds upon the success of the Break-a-scene [1] model in generating new scenes using a single image with multiple subjects and incorporates a mask embedding injection that allows the conditioning of the generation process. By introducing this additional level of control, MCGM offers a flexible and intuitive approach for generating specific poses for one or more subjects learned from a single image, empowering users to influence the output based on their requirements. Through extensive experimentation and evaluation, we demonstrate the effectiveness of our proposed model in generating high-quality images that meet predefined mask conditions and improving the current Break-a-scene generative model.

[CV-35] Precise Workcell Sketching from Point Clouds Using an AR Toolbox

链接: https://arxiv.org/abs/2410.00479
作者: Krzysztof Zieliński,Bruce Blumberg,Mikkel Baun Kjærgaard
关键词-EN: lacks object parametrization, Capturing real-world, efficient and descriptive, object parametrization, lacks object
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in IEEE RO-MAN 2024

点击查看摘要

Abstract:Capturing real-world 3D spaces as point clouds is efficient and descriptive, but it comes with sensor errors and lacks object parametrization. These limitations render point clouds unsuitable for various real-world applications, such as robot programming, without extensive post-processing (e.g., outlier removal, semantic segmentation). On the other hand, CAD modeling provides high-quality, parametric representations of 3D space with embedded semantic data, but requires manual component creation that is time-consuming and costly. To address these challenges, we propose a novel solution that combines the strengths of both approaches. Our method for 3D workcell sketching from point clouds allows users to refine raw point clouds using an Augmented Reality (AR) interface that leverages their knowledge and the real-world 3D environment. By utilizing a toolbox and an AR-enabled pointing device, users can enhance point cloud accuracy based on the device’s position in 3D space. We validate our approach by comparing it with ground truth models, demonstrating that it achieves a mean error within 1cm - significant improvement over standard LiDAR scanner apps.

[CV-36] ViDAS: Vision-based Danger Assessment and Scoring

链接: https://arxiv.org/abs/2410.00477
作者: Pranav Gupta,Advith Krishnan,Naman Nanda,Ananth Eswar,Deeksha Agarwal,Pratham Gohil,Pratyush Goel
关键词-EN: Large Language Model, Language Model, Large Language, advancing danger analysis, aimed at advancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:We present a novel dataset aimed at advancing danger analysis and assessment by addressing the challenge of quantifying danger in video content and identifying how human-like a Large Language Model (LLM) evaluator is for the same. This is achieved by compiling a collection of 100 YouTube videos featuring various events. Each video is annotated by human participants who provided danger ratings on a scale from 0 (no danger to humans) to 10 (life-threatening), with precise timestamps indicating moments of heightened danger. Additionally, we leverage LLMs to independently assess the danger levels in these videos using video summaries. We introduce Mean Squared Error (MSE) scores for multimodal meta-evaluation of the alignment between human and LLM danger assessments. Our dataset not only contributes a new resource for danger assessment in video content but also demonstrates the potential of LLMs in achieving human-like evaluations.
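The MSE-based meta-evaluation described above reduces to comparing two rating vectors on the 0-10 danger scale; a minimal sketch with made-up ratings:

```python
import numpy as np

# Hypothetical danger ratings on the 0-10 scale for five videos.
human_ratings = np.array([2.0, 7.5, 4.0, 9.0, 1.0])   # averaged over human annotators
llm_ratings   = np.array([3.0, 7.0, 5.5, 8.0, 0.5])   # produced by the LLM from video summaries

mse = np.mean((human_ratings - llm_ratings) ** 2)      # lower = more human-like LLM assessment
print(f"Human-LLM alignment MSE: {mse:.3f}")
```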

[CV-37] Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data

链接: https://arxiv.org/abs/2410.00469
作者: Ivica Dimitrovski,Vlatko Spasev,Ivan Kitanovski
关键词-EN: Earth observation applications, Earth observation, Accurate semantic segmentation, urban planning, Accurate semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation using multi-source optical imagery. The findings highlight the importance of multi-modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.
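A minimal sketch of the late-fusion idea (two independent branches merged only at the prediction stage) is shown below; the weighted probability averaging and the tensor shapes are assumptions for illustration, not the exact LF-DLM fusion rule.

```python
import torch
import torch.nn.functional as F

def late_fusion_segmentation(logits_aerial: torch.Tensor,
                             logits_sits: torch.Tensor,
                             w_aerial: float = 0.5) -> torch.Tensor:
    """Fuse per-class logits from an aerial-imagery branch and a satellite
    time-series branch by weighted averaging of class probabilities.

    Both inputs: (B, num_classes, H, W). Returns the fused class map (B, H, W).
    """
    p_aerial = F.softmax(logits_aerial, dim=1)
    p_sits = F.softmax(logits_sits, dim=1)
    p_fused = w_aerial * p_aerial + (1.0 - w_aerial) * p_sits
    return p_fused.argmax(dim=1)

# Example with random branch outputs for a 13-class land-cover problem.
fused_map = late_fusion_segmentation(torch.randn(1, 13, 256, 256),
                                     torch.randn(1, 13, 256, 256))
```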

[CV-38] Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation

链接: https://arxiv.org/abs/2410.00464
作者: Bohong Chen,Yumeng Li,Yao-Xiang Ding,Tianjia Shao,Kun Zhou
关键词-EN: Current co-speech motion, Current co-speech, upper body gestures, synergistic full-body motion, full-body motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Current co-speech motion generation approaches usually focus on upper body gestures following speech contents only, while lacking support for the elaborate control of synergistic full-body motion based on text prompts, such as talking while walking. The major challenges lie in 1) the existing speech-to-motion datasets only involve highly limited full-body motions, making a wide range of common human activities out of training distribution; 2) these datasets also lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes the off-the-shelf text-to-motion dataset as an auxiliary for supplementing the missing full-body motion and prompts. The core technical contributions are two-fold. One is the multi-stage training process which obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between speech-to-motion and text-to-motion datasets. Another is the diffusion-based conditional inference process, which utilizes the separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments are conducted to verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speeches and user prompts, which is beyond the ability of existing approaches.

[CV-39] Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

链接: https://arxiv.org/abs/2410.00448
作者: Hanqi Jiang,Xixuan Hao,Yuzhou Huang,Chong Ma,Jiaxun Zhang,Yi Pan,Ruimao Zhang
关键词-EN: Medical Vision-Language Pre-training, Vision-Language Pre-training, radiograph representation learning, approach to Medical, Medical Vision-Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages

点击查看摘要

Abstract:This paper introduces an innovative approach to Medical Vision-Language Pre-training (Med-VLP) area in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into unified reports, we acknowledge the intrinsic hierarchical relationship between the findings and impression section in radiograph datasets. To establish a targeted correspondence between images and texts, we propose a novel HybridMED framework to align global-level visual representations with impression and token-level visual representations with findings. Moreover, our framework incorporates a generation decoder that employs two proxy tasks, responsible for generating the impression from (1) images, via a captioning branch, and (2) findings, through a summarization branch. Additionally, knowledge distillation is leveraged to facilitate the training process. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements due to the shared self-attention and feed-forward architecture.

[CV-40] Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

链接: https://arxiv.org/abs/2410.00447
作者: Yunnan Wang,Ziqiang Li,Zequn Zhang,Wenyao Zhang,Baao Xie,Xihui Liu,Wenjun Zeng,Xin Jin
关键词-EN: exciting progress, progress in generating, natural language, scene graph, Compositional Masked Attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurlPS 2024

点击查看摘要

Abstract:There has been exciting progress in generating images from natural language or layout conditions. However, these methods struggle to faithfully reproduce complex scenes due to the insufficient modeling of multiple objects and their relationships. To address this issue, we leverage the scene graph, a powerful structured representation, for complex image generation. Different from the previous works that directly use scene graphs for generation, we employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner, compositing diverse disentangled visual clues from scene graphs. Specifically, we first propose a Semantics-Layout Variational AutoEncoder (SL-VAE) to jointly derive (layouts, semantics) from the input scene graph, which allows a more diverse and reasonable generation in a one-to-many mapping. We then develop a Compositional Masked Attention (CMA) integrated with a diffusion model, incorporating (layouts, semantics) with fine-grained attributes as generation guidance. To further achieve graph manipulation while keeping the visual content consistent, we introduce a Multi-Layered Sampler (MLS) for an “isolated” image editing effect. Extensive experiments demonstrate that our method outperforms recent competitors based on text, layout, or scene graph, in terms of generation rationality and controllability.

[CV-41] Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

链接: https://arxiv.org/abs/2410.00436
作者: Miyu Goko,Motonari Kambara,Daichi Saito,Seitaro Otsuki,Komei Sugiura
关键词-EN: predicting task success, problem of predicting, task success, success for open-vocabulary, instruction sentences
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for presentation at CoRL2024

点击查看摘要

Abstract:In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive λ-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive λ-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.

[CV-42] TikGuard: A Deep Learning Transformer-Based Solution for Detecting Unsuitable TikTok Content for Kids

链接: https://arxiv.org/abs/2410.00403
作者: Mazen Balat,Mahmoud Essam Gabr,Hend Bakr,Ahmed B. Zaky
关键词-EN: safeguarding young viewers, rise of short-form, brought new challenges, challenges in safeguarding, safeguarding young
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NILES2024

点击查看摘要

Abstract:The rise of short-form videos on platforms like TikTok has brought new challenges in safeguarding young viewers from inappropriate content. Traditional moderation methods often fall short in handling the vast and rapidly changing landscape of user-generated videos, increasing the risk of children encountering harmful material. This paper introduces TikGuard, a transformer-based deep learning approach aimed at detecting and flagging content unsuitable for children on TikTok. By using a specially curated dataset, TikHarm, and leveraging advanced video classification techniques, TikGuard achieves an accuracy of 86.7%, showing a notable improvement over existing methods in similar contexts. While direct comparisons are limited by the uniqueness of the TikHarm dataset, TikGuard’s performance highlights its potential in enhancing content moderation, contributing to a safer online experience for minors. This study underscores the effectiveness of transformer models in video classification and sets a foundation for future research in this area.

[CV-43] CusConcept: Customized Visual Concept Decomposition with Diffusion Models

链接: https://arxiv.org/abs/2410.00398
作者: Zhi Xu,Shaozhe Hao,Kai Han
关键词-EN: Enabling generative models, Customized Visual Concept, Enabling generative, decompose visual concepts, Visual Concept Decomposition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Enabling generative models to decompose visual concepts from a single image is a complex and challenging problem. In this paper, we study a new and challenging task, customized concept decomposition, wherein the objective is to leverage diffusion models to decompose a single image and generate visual concepts from various perspectives. To address this challenge, we propose a two-stage framework, CusConcept (short for Customized Visual Concept Decomposition), to extract customized visual concept embedding vectors that can be embedded into prompts for text-to-image generation. In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism to build vocabularies along human-specified conceptual axes. The decomposed concepts are obtained by retrieving corresponding vocabularies and learning anchor weights. In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images. We further curate an evaluation benchmark for assessing the performance of the open-world concept decomposition task. Our approach can effectively generate high-quality images of the decomposed concepts and produce related lexical predictions as secondary outcomes. Extensive qualitative and quantitative experiments demonstrate the effectiveness of CusConcept.

[CV-44] Seamless Augmented Reality Integration in Arthroscopy: A Pipeline for Articular Reconstruction and Guidance

链接: https://arxiv.org/abs/2410.00386
作者: Hongchao Shu,Mingxu Liu,Lalithkumar Seenivasan,Suxi Gu,Ping-Cheng Ku,Jonathan Knopf,Russell Taylor,Mathias Unberath
关键词-EN: treat joint problems, minimally invasive surgical, invasive surgical procedure, minimally invasive, diagnose and treat
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 8 pages, with 2 additional pages as the supplementary. Accepted by AE-CAI 2024

点击查看摘要

Abstract:Arthroscopy is a minimally invasive surgical procedure used to diagnose and treat joint problems. The clinical workflow of arthroscopy typically involves inserting an arthroscope into the joint through a small incision, during which surgeons navigate and operate largely by relying on their visual assessment through the arthroscope. However, the arthroscope’s restricted field of view and lack of depth perception pose challenges in navigating complex articular structures and achieving surgical precision during procedures. Aiming at enhancing intraoperative awareness, we present a robust pipeline that incorporates simultaneous localization and mapping, depth estimation, and 3D Gaussian splatting to realistically reconstruct intra-articular structures solely based on monocular arthroscope video. Extending 3D reconstruction to Augmented Reality (AR) applications, our solution offers AR assistance for articular notch measurement and annotation anchoring in a human-in-the-loop manner. Compared to traditional Structure-from-Motion and Neural Radiance Field-based methods, our pipeline achieves dense 3D reconstruction and competitive rendering fidelity with explicit 3D representation in 7 minutes on average. When evaluated on four phantom datasets, our method achieves RMSE = 2.21mm reconstruction error, PSNR = 32.86 and SSIM = 0.89 on average. Because our pipeline enables AR reconstruction and guidance directly from monocular arthroscopy without any additional data and/or hardware, our solution may hold the potential for enhancing intraoperative awareness and facilitating surgical precision in arthroscopy. Our AR measurement tool achieves accuracy within 1.59 +/- 1.81mm and the AR annotation tool achieves a mIoU of 0.721.

[CV-45] GLMHA A Guided Low-rank Multi-Head Self-Attention for Efficient Image Restoration and Spectral Reconstruction

链接: https://arxiv.org/abs/2410.00380
作者: Zaid Ilyas,Naveed Akhtar,David Suter,Syed Zulqarnain Gilani
关键词-EN: longstanding computer vision, computer vision tasks, longstanding computer, computer vision, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image restoration and spectral reconstruction are longstanding computer vision tasks. Currently, CNN-transformer hybrid models provide state-of-the-art performance for these tasks. The key common ingredient in the architectural designs of these models is Channel-wise Self-Attention (CSA). We first show that CSA is an overall low-rank operation. Then, we propose an instance-Guided Low-rank Multi-Head self-attention (GLMHA) to replace the CSA for a considerable computational gain while closely retaining the original model performance. Unique to the proposed GLMHA is its ability to provide computational gain for both short and long input sequences. In particular, the gain is in terms of both Floating Point Operations (FLOPs) and parameter count reduction. This is in contrast to the existing popular computational complexity reduction techniques, e.g., Linformer, Performer, and Reformer, for which FLOPs overpower the efficient design tricks for the shorter input sequences. Moreover, parameter reduction remains unaccounted for in the existing methods. We perform an extensive evaluation for the tasks of spectral reconstruction from RGB images, spectral reconstruction from snapshot compressive imaging, motion deblurring, and image deraining by enhancing the best-performing models with our GLMHA. Our results show up to a 7.7 Giga FLOPs reduction with 370K fewer parameters required to closely retain the original performance of the best-performing models that employ CSA.
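The abstract does not spell out the GLMHA computation, but the underlying idea of cutting FLOPs and parameters by factoring attention projections through a low-rank bottleneck can be sketched generically as below. This is not the paper's module, only an illustration of a rank-r factorization inside multi-head self-attention; all class names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LowRankProjection(nn.Module):
    """Replace a dense d x d projection with a rank-r factorization (d x r)(r x d)."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.up(self.down(x))

class LowRankSelfAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V projections use low-rank factors."""
    def __init__(self, dim: int = 64, heads: int = 4, rank: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.q = LowRankProjection(dim, rank)
        self.k = LowRankProjection(dim, rank)
        self.v = LowRankProjection(dim, rank)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, dim)
        b, n, d = x.shape
        def split(t):                             # (B, N, d) -> (B, heads, N, head_dim)
            return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, d)
        return self.out(out)

x = torch.randn(2, 196, 64)
print(LowRankSelfAttention()(x).shape)            # torch.Size([2, 196, 64])
```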

[CV-46] CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

链接: https://arxiv.org/abs/2410.00379
作者: Xiao Wang,Fuling Wang,Yuehang Li,Qingchuan Ma,Shiao Wang,Bo Jiang,Chuanfu Li,Jin Tang
关键词-EN: patient wait times, significantly reduce diagnostic, reduce diagnostic burdens, X-ray image-based medical, image-based medical report
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: In Peer Review

点击查看摘要

Abstract:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence which can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and the existing large models’ insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs), on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for the X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation and Xray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. Source code can be found on \urlthis https URL.
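The image-report contrastive pre-training stage is commonly realized with a symmetric InfoNCE objective over matched X-ray/report embeddings; a minimal sketch (assuming the two encoders have already produced the embeddings, with illustrative names and dimensions) looks like this:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matched X-ray/report pairs are pulled together,
    mismatched pairs within the batch are pushed apart.

    img_emb, txt_emb: (B, D) embeddings from the image and report encoders.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> matching report
    loss_t2i = F.cross_entropy(logits.t(), targets)       # report -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_style_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```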

[CV-47] Descriptor: Face Detection Dataset for Programmable Threshold-Based Sparse-Vision

链接: https://arxiv.org/abs/2410.00368
作者: Riadul Islam,Sri Ranga Sai Krishna Tummala,Joey Mulé,Rohith Kankipati,Suraj Jalapally,Dhandeep Challagundla,Chad Howard,Ryan Robucci
关键词-EN: in-chip image processing, vision-enabled embedded systems, efficiency and privacy, focal-plane and in-chip, processing has emerged
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 8 pages

点击查看摘要

Abstract:Smart focal-plane and in-chip image processing has emerged as a crucial technology for vision-enabled embedded systems with energy efficiency and privacy. However, the lack of special datasets providing examples of the data that these neuromorphic sensors compute to convey visual information has hindered the adoption of these promising technologies. Neuromorphic imager variants, including event-based sensors, produce various representations such as streams of pixel addresses representing time and locations of intensity changes in the focal plane, temporal-difference data, data sifted/thresholded by temporal differences, image data after applying spatial transformations, optical flow data, and/or statistical representations. To address the critical barrier to entry, we provide an annotated, temporal-threshold-based vision dataset specifically designed for face detection tasks derived from the same videos used for Aff-Wild2. By offering multiple threshold levels (e.g., 4, 8, 12, and 16), this dataset allows for comprehensive evaluation and optimization of state-of-the-art neural architectures under varying conditions and settings compared to traditional methods. The accompanying tool flow for generating event data from raw videos further enhances accessibility and usability. We anticipate that this resource will significantly support the development of robust vision systems based on smart sensors that can process based on temporal-difference thresholds, enabling more accurate and efficient object detection and localization and ultimately promoting the broader adoption of low-power, neuromorphic imaging technologies. To support further research, we publicly released the dataset at \urlthis https URL.
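The temporal-difference thresholding behind the dataset can be emulated from consecutive grayscale frames as follows; the threshold levels 4/8/12/16 come from the description above, while everything else in the sketch is an illustrative assumption.

```python
import numpy as np

def threshold_events(prev_frame: np.ndarray,
                     curr_frame: np.ndarray,
                     thresholds=(4, 8, 12, 16)):
    """Return one binary event map per threshold: a pixel fires when the absolute
    intensity change between consecutive grayscale frames exceeds the threshold."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return {t: (diff >= t).astype(np.uint8) for t in thresholds}

# Example with two random 8-bit grayscale frames.
f0 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
f1 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
event_maps = threshold_events(f0, f1)
print({t: m.mean() for t, m in event_maps.items()})   # fraction of "firing" pixels per level
```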

[CV-48] TFCT-I2P: Three stream fusion network with color aware transformer for image-to-point cloud registration

链接: https://arxiv.org/abs/2410.00360
作者: Muyao Peng,Pei An,Zichen Wan,You Yang,Qiong Liu
关键词-EN: artificial intelligence technologies, made significant strides, intelligence technologies, techniques have made, significant strides
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Along with the advancements in artificial intelligence technologies, image-to-point-cloud registration (I2P) techniques have made significant strides. Nevertheless, the dimensional differences in the features of points cloud (three-dimension) and image (two-dimension) continue to pose considerable challenges to their development. The primary challenge resides in the inability to leverage the features of one modality to augment those of another, thereby complicating the alignment of features within the latent space. To address this challenge, we propose an image-to-point-cloud method named as TFCT-I2P. Initially, we introduce a Three-Stream Fusion Network (TFN), which integrates color information from images with structural information from point clouds, facilitating the alignment of features from both modalities. Subsequently, to effectively mitigate patch-level misalignments introduced by the inclusion of color information, we design a Color-Aware Transformer (CAT). Finally, we conduct extensive experiments on 7Scenes, RGB-D Scenes V2, ScanNet V2, and a self-collected dataset. The results demonstrate that TFCT-I2P surpasses state-of-the-art methods by 1.5% in Inlier Ratio, 0.4% in Feature Matching Recall, and 5.4% in Registration Recall. Therefore, we believe that the proposed TFCT-I2P contributes to the advancement of I2P registration.

[CV-49] Efficient Training of Large Vision Models via Advanced Automated Progressive Learning

链接: https://arxiv.org/abs/2410.00350
作者: Changlin Li,Jiawei Zhang,Sihao Lin,Zongxin Yang,Junwei Liang,Xiaodan Liang,Xiaojun Chang
关键词-EN: Large Vision Models, advancements in Large, Vision Transformers, Large Vision, computational resources
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Code: this https URL . arXiv admin note: substantial text overlap with arXiv:2203.14509

点击查看摘要

Abstract:The rapid advancements in Large Vision Models (LVMs), such as Vision Transformers (ViTs) and diffusion models, have led to an increasing demand for computational resources, resulting in substantial financial and environmental costs. This growing challenge highlights the necessity of developing efficient training methods for LVMs. Progressive learning, a training strategy in which model capacity gradually increases during training, has shown potential in addressing these challenges. In this paper, we present an advanced automated progressive learning (AutoProg) framework for efficient training of LVMs. We begin by focusing on the pre-training of LVMs, using ViTs as a case study, and propose AutoProg-One, an AutoProg scheme featuring momentum growth (MoGrow) and a one-shot growth schedule search. Beyond pre-training, we extend our approach to tackle transfer learning and fine-tuning of LVMs. We expand the scope of AutoProg to cover a wider range of LVMs, including diffusion models. First, we introduce AutoProg-Zero, by enhancing the AutoProg framework with a novel zero-shot unfreezing schedule search, eliminating the need for one-shot supernet training. Second, we introduce a novel Unique Stage Identifier (SID) scheme to bridge the gap during network growth. These innovations, integrated with the core principles of AutoProg, offer a comprehensive solution for efficient training across various LVM scenarios. Extensive experiments show that AutoProg accelerates ViT pre-training by up to 1.85x on ImageNet and accelerates fine-tuning of diffusion models by up to 2.86x, with comparable or even higher performance. This work provides a robust and scalable approach to efficient training of LVMs, with potential applications in a wide range of vision tasks. Code: this https URL

[CV-50] Revisiting the Role of Texture in 3D Person Re-identification

链接: https://arxiv.org/abs/2410.00348
作者: Huy Nguyen,Kien Nguyen,Akila Pemasiri,Sridha Sridharan,Clinton Fookes
关键词-EN: person re-ID, person re-ID task, reconstruction to improve, person, study introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This study introduces a new framework for 3D person re-identification (re-ID) that leverages readily available high-resolution texture data in 3D reconstruction to improve the performance and explainability of the person re-ID task. We propose a method to emphasize texture in 3D person re-ID models by incorporating UVTexture mapping, which better differentiates human subjects. Our approach uniquely combines UVTexture and its heatmaps with 3D models to visualize and explain the person re-ID process. In particular, the visualization and explanation are achieved through activation maps and attribute-based attention maps, which highlight the important regions and features contributing to the person re-ID decision. Our contributions include: (1) a novel technique for emphasizing texture in 3D models using UVTexture processing, (2) an innovative method for explicating person re-ID matches through a combination of 3D models and UVTexture mapping, and (3) achieving state-of-the-art performance in 3D person re-ID. We ensure the reproducibility of our results by making all data, codes, and models publicly available.

[CV-51] SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

链接: https://arxiv.org/abs/2410.00337
作者: Leheng Li,Weichao Qiu,Yingjie Cai,Xu Yan,Qing Lian,Bingbing Liu,Ying-Cong Chen
关键词-EN: significant human effort, labels require dense, occupancy labels require, require dense, annotation with significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The advancement of autonomous driving is increasingly reliant on high-quality annotated datasets, especially in the task of 3D occupancy prediction, where the occupancy labels require dense 3D annotation with significant human effort. In this paper, we propose SyntheOcc, which denotes a diffusion model that Synthesize photorealistic and geometric-controlled images by conditioning Occupancy labels in driving scenarios. This yields an unlimited amount of diverse, annotated, and controllable datasets for applications like training perception models and simulation. SyntheOcc addresses the critical challenge of how to efficiently encode 3D geometric information as conditional input to a 2D diffusion model. Our approach innovatively incorporates 3D semantic multi-plane images (MPIs) to provide comprehensive and spatially aligned 3D scene descriptions for conditioning. As a result, SyntheOcc can generate photorealistic multi-view images and videos that faithfully align with the given geometric labels (semantics in 3D voxel space). Extensive qualitative and quantitative evaluations of SyntheOcc on the nuScenes dataset prove its effectiveness in generating controllable occupancy datasets that serve as an effective data augmentation to perception models.

[CV-52] A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization NEURIPS2024

链接: https://arxiv.org/abs/2410.00321
作者: Chieh-Yun Chen,Li-Wu Tsao,Chiang Tseng,Hong-Han Shuai
关键词-EN: text embedding, text embedding contributes, analyzes the impact, impact of causal, causal manner
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:This paper analyzes the impact of causal manner in the text encoder of text-to-image (T2I) diffusion models, which can lead to information bias and loss. Previous works have focused on addressing the issues through the denoising process. However, there is no research discussing how text embedding contributes to T2I models, especially when generating more than one object. In this paper, we share a comprehensive analysis of text embedding: i) how text embedding contributes to the generated images and ii) why information gets lost and biases towards the first-mentioned object. Accordingly, we propose a simple but effective text embedding balance optimization method, which is training-free, with an improvement of 90.05% on information balance in stable diffusion. Furthermore, we propose a new automatic evaluation metric that quantifies information loss more accurately than existing methods, achieving 81% concordance with human assessments. This metric effectively measures the presence and accuracy of objects, addressing the limitations of current distribution scores like CLIP’s text-image similarities.

[CV-53] PointAD: Comprehending 3D Anomalies from Points and Pixels for Zero-shot 3D Anomaly Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.00320
作者: Qihang Zhou,Jiangtao Yan,Shibo He,Wenchao Meng,Jiming Chen
关键词-EN: scenarios where target, training samples, privacy protection, crucial yet unexplored, unexplored field
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Zero-shot (ZS) 3D anomaly detection is a crucial yet unexplored field that addresses scenarios where target 3D training samples are unavailable due to practical concerns like privacy protection. This paper introduces PointAD, a novel approach that transfers the strong generalization capabilities of CLIP for recognizing 3D anomalies on unseen objects. PointAD provides a unified framework to comprehend 3D anomalies from both points and pixels. In this framework, PointAD renders 3D anomalies into multiple 2D renderings and projects them back into 3D space. To capture the generic anomaly semantics into PointAD, we propose hybrid representation learning that optimizes the learnable text prompts from 3D and 2D through auxiliary point clouds. The collaboration optimization between point and pixel representations jointly facilitates our model to grasp underlying 3D anomaly patterns, contributing to detecting and segmenting anomalies of unseen diverse 3D objects. Through the alignment of 3D and 2D space, our model can directly integrate RGB information, further enhancing the understanding of 3D anomalies in a plug-and-play manner. Extensive experiments show the superiority of PointAD in ZS 3D anomaly detection across diverse unseen objects.

[CV-54] Ask Pose Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

链接: https://arxiv.org/abs/2410.00309
作者: Laura Bravo-Sánchez,Jaewoo Heo,Zhenzhen Weng,Kuan-Chieh Wang,Serena Yeung-Levy
关键词-EN: Vision Language Models, Large Vision Language, Social dynamics, utilizes Large Vision, pose significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses longstanding challenges of data scarcity for close interactions in HME enhancing the field’s capabilities of handling complex interaction scenarios.

[CV-55] RadGazeGen: Radiomics and Gaze-guided Medical Image Generation using Diffusion Models

链接: https://arxiv.org/abs/2410.00307
作者: Moinak Bhattacharya,Gagandeep Singh,Shubham Jain,Prateek Prasanna
关键词-EN: integrating experts’ eye, diffusion models, image generation, medical image generation, framework for integrating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we present RadGazeGen, a novel framework for integrating experts’ eye gaze patterns and radiomic feature maps as controls to text-to-image diffusion models for high fidelity medical image generation. Despite the recent success of text-to-image diffusion models, text descriptions are often found to be inadequate and fail to convey detailed disease-specific information to these models to generate clinically accurate images. The anatomy, disease texture patterns, and location of the disease are extremely important to generate realistic images; moreover the fidelity of image generation can have significant implications in downstream tasks involving disease diagnosis or treatment repose assessment. Hence, there is a growing need to carefully define the controls used in diffusion models for medical image generation. Eye gaze patterns of radiologists are important visuo-cognitive information, indicative of subtle disease patterns and spatial location. Radiomic features further provide important subvisual cues regarding disease phenotype. In this work, we propose to use these gaze patterns in combination with standard radiomics descriptors, as controls, to generate anatomically correct and disease-aware medical images. RadGazeGen is evaluated for image generation quality and diversity on the REFLACX dataset. To demonstrate clinical applicability, we also show classification performance on the generated images from the CheXpert test set (n=500) and long-tailed learning performance on the MIMIC-CXR-LT test set (n=23550).

[CV-56] GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving

链接: https://arxiv.org/abs/2410.00299
作者: Zhangshuo Qi,Junyi Ma,Jingyi Xu,Zijie Zhou,Luqi Cheng,Guangming Xiong
关键词-EN: ensure autonomous vehicles, autonomous vehicles obtain, vehicles obtain usable, obtain usable localization, usable localization information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Place recognition is a crucial module to ensure autonomous vehicles obtain usable localization information in GPS-denied environments. In recent years, multimodal place recognition methods have gained increasing attention due to their ability to overcome the weaknesses of unimodal sensor systems by leveraging complementary information from different modalities. However, challenges arise from the necessity of harmonizing data across modalities and exploiting the spatio-temporal correlations between them sufficiently. In this paper, we propose a 3D Gaussian Splatting-based multimodal place recognition neural network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the proposed Multimodal Gaussian Splatting. A network composed of 3D graph convolution and transformer is designed to extract high-level spatio-temporal features and global descriptors from the Gaussian scenes for place recognition. We evaluate our method on the nuScenes dataset, and the experimental results demonstrate that our method can effectively leverage complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability. Our open-source code is available at this https URL.

[CV-57] Insight: A Multi-Modal Diagnostic Pipeline using LLMs for Ocular Surface Disease Diagnosis MICCAI2024

链接: https://arxiv.org/abs/2410.00292
作者: Chun-Hsiao Yeh,Jiayun Wang,Andrew D. Graham,Andrea J. Liu,Bo Tan,Yubei Chen,Yi Ma,Meng C. Lin
关键词-EN: Accurate diagnosis, ocular surface disease, surface disease diagnosis, optometry and ophthalmology, critical in optometry
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MICCAI 2024. Project Webpage: this https URL

点击查看摘要

Abstract:Accurate diagnosis of ocular surface diseases is critical in optometry and ophthalmology, which hinge on integrating clinical data sources (e.g., meibography imaging and clinical metadata). Traditional human assessments lack precision in quantifying clinical observations, while current machine-based methods often treat diagnoses as multi-class classification problems, limiting the diagnoses to a predefined closed-set of curated answers without reasoning the clinical relevance of each variable to the diagnosis. To tackle these challenges, we introduce an innovative multi-modal diagnostic pipeline (MDPipe) by employing large language models (LLMs) for ocular surface disease diagnosis. We first employ a visual translator to interpret meibography images by converting them into quantifiable morphology data, facilitating their integration with clinical metadata and enabling the communication of nuanced medical insight to LLMs. To further advance this communication, we introduce a LLM-based summarizer to contextualize the insight from the combined morphology and clinical metadata, and generate clinical report summaries. Finally, we refine the LLMs’ reasoning ability with domain-specific insight from real-life clinician diagnoses. Our evaluation across diverse ocular surface disease diagnosis benchmarks demonstrates that MDPipe outperforms existing standards, including GPT-4, and provides clinically sound rationales for diagnoses.

[CV-58] Delving Deep into Engagement Prediction of Short Videos ECCV2024

链接: https://arxiv.org/abs/2410.00289
作者: Dasong Li,Wenjie Li,Baili Lu,Hongsheng Li,Sizhuo Ma,Gurunandan Krishnan,Jian Wang
关键词-EN: User Generated Content, social media platforms, media platforms presents, Understanding and modeling, User Generated
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Social and Information Networks (cs.SI)
*备注: Accepted to ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:Understanding and modeling the popularity of User Generated Content (UGC) short videos on social media platforms presents a critical challenge with broad implications for content creators and recommendation systems. This study delves deep into the intricacies of predicting engagement for newly published videos with limited user interactions. Surprisingly, our findings reveal that Mean Opinion Scores from previous video quality assessment datasets do not strongly correlate with video engagement levels. To address this, we introduce a substantial dataset comprising 90,000 real-world UGC short videos from Snapchat. Rather than relying on view count, average watch time, or rate of likes, we propose two metrics: normalized average watch percentage (NAWP) and engagement continuation rate (ECR) to describe the engagement levels of short videos. Comprehensive multi-modal features, including visual content, background music, and text data, are investigated to enhance engagement prediction. With the proposed dataset and two key metrics, our method demonstrates its ability to predict engagements of short videos purely from video content.
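Since the abstract does not give exact formulas for the two metrics, the sketch below uses assumed definitions purely for illustration: watch time normalized by video duration for NAWP, and the fraction of views that continue past a fixed point of the video for ECR. All names and numbers are hypothetical.

```python
import numpy as np

def engagement_metrics(watch_times: np.ndarray, duration: float,
                       continuation_point: float = 0.5):
    """Rough proxies for the two engagement metrics (definitions assumed).

    watch_times: per-view watch time in seconds for one video.
    duration: video length in seconds.
    """
    watch_pct = np.clip(watch_times / duration, 0.0, 1.0)
    nawp = watch_pct.mean()                                  # normalized average watch percentage
    ecr = (watch_pct >= continuation_point).mean()           # engagement continuation rate
    return nawp, ecr

nawp, ecr = engagement_metrics(np.array([3.0, 12.0, 15.0, 7.5, 14.0]), duration=15.0)
print(f"NAWP={nawp:.2f}, ECR={ecr:.2f}")
```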

[CV-59] Performance Evaluation of Deep Learning-based Quadrotor UAV Detection and Tracking Methods

链接: https://arxiv.org/abs/2410.00285
作者: Mohssen E. Elshaar,Zeyad M. Manaa,Mohammed R. Elbalshy,Abdul Jabbar Siddiqui,Ayman M. Abdallah
关键词-EN: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, introducing significant challenges, offering many benefits
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are becoming more popular in various sectors, offering many benefits, yet introducing significant challenges to privacy and safety. This paper investigates state-of-the-art solutions for detecting and tracking quadrotor UAVs to address these concerns. Cutting-edge deep learning models, specifically the YOLOv5 and YOLOv8 series, are evaluated for their performance in identifying UAVs accurately and quickly. Additionally, robust tracking systems, BoT-SORT and Byte Track, are integrated to ensure reliable monitoring even under challenging conditions. Our tests on the DUT dataset reveal that while YOLOv5 models generally outperform YOLOv8 in detection accuracy, the YOLOv8 models excel in recognizing less distinct objects, demonstrating their adaptability and advanced capabilities. Furthermore, BoT-SORT demonstrated superior performance over Byte Track, achieving higher IoU and lower center error in most cases, indicating more accurate and stable tracking. Code: this https URL Tracking demo: this https URL
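The per-frame IoU and center-error measures used to compare the trackers can be computed as in this sketch (box format, helper names, and example coordinates are illustrative):

```python
import numpy as np

def iou_and_center_error(box_pred, box_gt):
    """Boxes are (x1, y1, x2, y2). Returns (IoU, Euclidean distance between box centers)."""
    x1, y1 = max(box_pred[0], box_gt[0]), max(box_pred[1], box_gt[1])
    x2, y2 = min(box_pred[2], box_gt[2]), min(box_pred[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_pred) + area(box_gt) - inter + 1e-9)
    center = lambda b: np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])
    center_err = float(np.linalg.norm(center(box_pred) - center(box_gt)))
    return iou, center_err

print(iou_and_center_error((100, 80, 180, 160), (105, 85, 185, 170)))
```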

[CV-60] On Large Uni- and Multi-modal Models for Unsupervised Classification of Social Media Images: Nature's Contribution to People as case study

链接: https://arxiv.org/abs/2410.00275
作者: Rohaifa Khaldi,Domingo Alcaraz-Segura,Ignacio Sánchez-Herrera,Javier Martinez-Lopez,Carlos Javier Navarro,Siham Tabik
关键词-EN: Social media images, Large Visual Language, Large Visual Models, Large Language Models, Visual Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Social media images have shown to be a valuable source of information for understanding human interactions with important subjects such as cultural heritage, biodiversity and nature among others. The task of grouping such images into a number of semantically meaningful clusters without labels is challenging given the high diversity and complex nature of the visual content of these images in addition to their large volume. On the other hand, the last advances in Large Visual Models (LVM), Large Language Models (LLM) and Large Visual Language Models (LVLM) provide an important opportunity to explore new productive and scalable solutions. This works proposes, analyzes, and compares various approaches based on one or more state-of-the art LVM, LLM and LVLM, for mapping social media images into a number of pre-defined classes. As case study, we consider the problem of understanding the interactions between human and nature, also known as Nature’s Contribution to People or Cultural Ecosystem Services (CES). Our experiments reveal that the top-performing approaches, delivering highly competitive results, are the fine-tuned LVM DINOv2 on a small labeled dataset and LVLM models like the proprietary GPT-4 (gpt-4o-mini) using a simple prompt.

[CV-61] KPCA-CAM: Visual Explainability of Deep Computer Vision Models using Kernel PCA

链接: https://arxiv.org/abs/2410.00267
作者: Sachin Karmani,Thanushon Sivakaran,Gaurav Prasad,Mehmet Ali,Wenbo Yang,Sheyang Tang
关键词-EN: Deep learning models, Deep learning, black boxes, Convolutional Neural Networks, activation maps
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, Published to IEEE MMSP 2024

点击查看摘要

Abstract:Deep learning models often function as black boxes, providing no straightforward reasoning for their predictions. This is particularly true for computer vision models, which process tensors of pixel values to generate outcomes in tasks such as image classification and object detection. To elucidate the reasoning of these models, class activation maps (CAMs) are used to highlight salient regions that influence a model’s output. This research introduces KPCA-CAM, a technique designed to enhance the interpretability of Convolutional Neural Networks (CNNs) through improved class activation maps. KPCA-CAM leverages Principal Component Analysis (PCA) with the kernel trick to capture nonlinear relationships within CNN activations more effectively. By mapping data into higher-dimensional spaces with kernel functions and extracting principal components from this transformed hyperplane, KPCA-CAM provides more accurate representations of the underlying data manifold. This enables a deeper understanding of the features influencing CNN decisions. Empirical evaluations on the ILSVRC dataset across different CNN models demonstrate that KPCA-CAM produces more precise activation maps, providing clearer insights into the model’s reasoning compared to existing CAM algorithms. This research advances CAM techniques, equipping researchers and practitioners with a powerful tool to gain deeper insights into CNN decision-making processes and overall behaviors.
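A simplified reading of the idea, projecting each spatial location of a convolutional feature map onto its first kernel principal component to obtain a saliency map, is sketched below; this is an assumption-laden illustration, not the authors' released KPCA-CAM implementation, and the kernel choice and shapes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_activation_map(feature_map: np.ndarray) -> np.ndarray:
    """feature_map: (C, H, W) activations from a convolutional layer.
    Treat each spatial location as a C-dimensional sample, project it onto the
    first kernel principal component, and rescale the result to [0, 1]."""
    c, h, w = feature_map.shape
    samples = feature_map.reshape(c, h * w).T                 # (H*W, C)
    kpca = KernelPCA(n_components=1, kernel="rbf")
    scores = kpca.fit_transform(samples).reshape(h, w)        # first nonlinear component
    scores -= scores.min()
    return scores / (scores.max() + 1e-9)                     # saliency map in [0, 1]

cam = kpca_activation_map(np.random.rand(256, 14, 14))
print(cam.shape, cam.min(), cam.max())
```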

[CV-62] Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation

链接: https://arxiv.org/abs/2410.00266
作者: Aleyna Kütük,Tevfik Metin Sezgin
关键词-EN: Scene sketch, Scene sketch semantic, Scene, applications including, sketch semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class-agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post-processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free-hand scene sketch datasets with both instance and stroke-level class annotations. To fill this gap, we collected the largest Free-hand Instance- and Stroke-level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state-of-the-art scene sketch segmentation models. The code and dataset will be made public after acceptance.

[CV-63] Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation NEURIPS2024

链接: https://arxiv.org/abs/2410.00263
作者: Kun Yuan,Vinkle Srivastav,Nassir Navab,Nicolas Padoy
关键词-EN: faces unique challenges, unique challenges due, knowledge domain gap, Surgical video-language pretraining, faces unique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Main Track

点击查看摘要

Abstract:Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by addressing issues regarding textual information loss in surgical lecture videos and the spatial-temporal challenges of surgical VLP. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining (PeskaVLP) framework to tackle these issues. The knowledge augmentation uses large language models (LLM) for refining and enriching surgical concepts, thus providing comprehensive language supervision and reducing the risk of overfitting. PeskaVLP combines language supervision with visual self-supervision, constructing hard negative samples and employing a Dynamic Time Warping (DTW) based loss function to effectively comprehend the cross-modal procedural alignment. Extensive experiments on multiple public surgical scene understanding and cross-modal retrieval datasets show that our proposed method significantly improves zero-shot transferring performance and offers a generalist visual representation for further advancements in surgical scene understanding.
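The Dynamic Time Warping component aligns visual and textual sequences of different lengths; the classic DTW recursion over a pairwise cost matrix is sketched below as a generic illustration, independent of the paper's exact loss, with made-up clip and sentence embeddings.

```python
import numpy as np

def dtw_distance(cost: np.ndarray) -> float:
    """cost[i, j]: dissimilarity between step i of one sequence (e.g., video clips)
    and step j of the other (e.g., narration sentences). Returns the DTW alignment cost."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return float(acc[n, m])

# Example: cosine-distance cost between random clip and sentence embeddings.
clips, sents = np.random.randn(12, 64), np.random.randn(7, 64)
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
print(dtw_distance(1.0 - clips @ sents.T))
```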

[CV-64] ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning

链接: https://arxiv.org/abs/2410.00262
作者: Jian Shi,Zhenyu Li,Peter Wonka
关键词-EN: innovative framework specifically, transform single-view videos, framework specifically designed, transform single-view, stereo
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce ImmersePro, an innovative framework specifically designed to transform single-view videos into stereo videos. This framework utilizes a novel dual-branch architecture comprising a disparity branch and a context branch on video data by leveraging spatial-temporal attention mechanisms. ImmersePro employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps, thus reducing potential errors associated with disparity estimation models. In addition to the technical advancements, we introduce the YouTube-SBS dataset, a comprehensive collection of 423 stereo videos sourced from YouTube. This dataset is unprecedented in its scale, featuring over 7 million stereo pairs, and is designed to facilitate training and benchmarking of stereo video generation models. Our experiments demonstrate the effectiveness of ImmersePro in producing high-quality stereo videos, offering significant improvements over existing methods. Compared to the best competitor, stereo-from-mono, we quantitatively improve the results by 11.76% (L1), 6.39% (SSIM), and 5.10% (PSNR).

[CV-65] Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

链接: https://arxiv.org/abs/2410.00255
作者: Weitai Kang,Haifeng Huang,Yuzhang Shang,Mubarak Shah,Yan Yan
关键词-EN: Large Language Models, Large Language, building general-purpose agents, challenges remain due, high-quality robust instruction-following
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model’s discriminative understanding; 2) the Diverse Instruction-following data, which contains various instruction styles to enhance the model’s generalization. As a result, we construct 1 million instruction-following data samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).

[CV-66] MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans

链接: https://arxiv.org/abs/2410.00253
作者: Anna Deichler,Jim O’Regan,Jonas Beskow
关键词-EN: physics simulator, headset to record, record conversations, Abstract, dataset captured
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.

[CV-67] Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models

链接: https://arxiv.org/abs/2410.00231
作者: Qi Wu,Zipeng Fu,Xuxin Cheng,Xiaolong Wang,Chelsea Finn
关键词-EN: achieved strong performance, Learning-based methods, methods have achieved, achieved strong, strong performance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following a user’s commands to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate. Project website: this https URL

[CV-68] OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization

链接: https://arxiv.org/abs/2410.00204
作者: Saihui Hou,Panjian Huang,Zengbin Wang,Yuan Liu,Zeyu Li,Man Zhang,Yongzhen Huang
关键词-EN: presents unique complexities, unique complexities due, environments and poses, diverse species, animal re-identification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of animal re-identification, an emerging field that shares similarities with person re-identification but presents unique complexities due to the diverse species, environments and poses. To facilitate research in this domain, we introduce OpenAnimals, a flexible and extensible codebase designed specifically for animal re-identification. We conduct a comprehensive study by revisiting several state-of-the-art person re-identification methods, including BoT, AGW, SBS, and MGN, and evaluate their effectiveness on animal re-identification benchmarks such as HyenaID, LeopardID, SeaTurtleID, and WhaleSharkID. Our findings reveal that while some techniques generalize well, many do not, underscoring the significant differences between the two tasks. To bridge this gap, we propose ARBase, a strong Base model tailored for Animal Re-identification, which incorporates insights from extensive experiments and introduces simple yet effective animal-oriented designs. Experiments demonstrate that ARBase consistently outperforms existing baselines, achieving state-of-the-art performance across various benchmarks.

[CV-69] DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation ECCV2024

链接: https://arxiv.org/abs/2410.00201
作者: Yi-Hao Peng,Faria Huq,Yue Jiang,Jason Wu,Amanda Xin Yue Li,Jeffrey Bigham,Amy Pavel
关键词-EN: Enabling machines, understand structured visuals, machines to understand, user interfaces, interfaces is essential
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: ECCV 2024

点击查看摘要

Abstract:Enabling machines to understand structured visuals like slides and user interfaces is essential for making them accessible to people with disabilities. However, achieving such understanding computationally has required manual data collection and annotation, which is time-consuming and labor-intensive. To overcome this challenge, we present a method to generate synthetic, structured visuals with target labels using code generation. Our method allows people to create datasets with built-in labels and train models with a small number of human-annotated examples. We demonstrate performance improvements in three tasks for understanding slides and UIs: recognizing visual elements, describing visual content, and classifying visual content types.

[CV-70] Do Vision-Language Models Really Understand Visual Language?

链接: https://arxiv.org/abs/2410.00193
作者: Buse Giledereli,Yifan Hou,Yilei Tu,Mrinmaya Sachan
关键词-EN: Visual language, visual language depicting, spatial arrangements, system of communication, communication that conveys
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of three LVLMs (GPT-4V, GPT-4o, and Gemini) shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.

[CV-71] EEG Emotion Copilot: Pruning LLMs for Emotional EEG Interpretation with Assisted Medical Record Generation

链接: https://arxiv.org/abs/2410.00166
作者: Hongyu Chen,Weiming Zeng,Chengcheng Chen,Luhui Cai,Fei Wang,Lei Wang,Wei Zhang,Yueyang Li,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
关键词-EN: critical research frontier, discern individual emotional, EEG Emotion Copilot, EEG emotion recognition, affective computing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:In the fields of affective computing (AC) and brain-machine interface (BMI), the analysis of physiological and behavioral signals to discern individual emotional states has emerged as a critical research frontier. While deep learning-based approaches have made notable strides in EEG emotion recognition, particularly in feature extraction and pattern recognition, significant challenges persist in achieving end-to-end emotion computation, including real-time processing, individual adaptation, and seamless user interaction. This paper presents the EEG Emotion Copilot, a system leveraging a lightweight large language model (LLM) operating in a local setting. The system is designed to first recognize emotional states directly from EEG signals, subsequently generate personalized diagnostic and treatment suggestions, and finally support the automation of electronic medical records. The proposed solution emphasizes both the accuracy of emotion recognition and an enhanced user experience, facilitated by an intuitive interface for participant interaction. We further discuss the construction of the data framework, model pruning, training, and deployment strategies aimed at improving real-time performance and computational efficiency. Privacy concerns are also addressed, with a focus on ethical data collection, processing, and the protection of users’ personal information. Through these efforts, we aim to advance the application of AC in the medical domain, offering innovative approaches to mental health diagnostics and treatment.

[CV-72] CVVLSNet: Vehicle Location and Speed Estimation Using Partial Connected Vehicle Trajectory Data

链接: https://arxiv.org/abs/2410.00132
作者: Jiachen Ye,Dingyu Wang,Shaocheng Jia,Xin Pei,Zi Yang,Yi Zhang,S.C. Wong
关键词-EN: beneficial transportation applications, adaptive signal control, locations and speeds, Real-time estimation, vehicle locations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Real-time estimation of vehicle locations and speeds is crucial for developing many beneficial transportation applications in traffic management and control, e.g., adaptive signal control. Recent advances in communication technologies facilitate the emergence of connected vehicles (CVs), which can share traffic information with nearby CVs or infrastructures. At the early stage of connectivity, only a portion of vehicles are CVs. The locations and speeds for those non-CVs (NCs) are not accessible and must be estimated to obtain the full traffic information. To address the above problem, this paper proposes a novel CV-based Vehicle Location and Speed estimation network, CVVLSNet, to simultaneously estimate the vehicle locations and speeds exclusively using partial CV trajectory data. A road cell occupancy (RCO) method is first proposed to represent the variable vehicle state information. Spatiotemporal interactions can be integrated by simply fusing the RCO representations. Then, CVVLSNet, taking the Coding-RAte TransformEr (CRATE) network as a backbone, is introduced to estimate the vehicle locations and speeds. Moreover, physical vehicle size constraints are also considered in loss functions. Extensive experiments indicate that the proposed method significantly outperformed the existing method under various CV penetration rates, signal timings, and volume-to-capacity ratios.
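
A hedged sketch of a road-cell-occupancy style representation for a single time step; the cell length, vehicle length, and binary encoding are assumptions for illustration, not the paper's specification:

```python
import numpy as np

def rco_snapshot(cv_positions, road_length, cell_len=2.0, vehicle_len=4.5):
    """cv_positions: positions (meters from segment start) of connected vehicles at one instant."""
    n_cells = int(np.ceil(road_length / cell_len))
    occ = np.zeros(n_cells, dtype=np.float32)
    for pos in cv_positions:
        start = int(pos // cell_len)                          # first cell touched by the vehicle
        end = int(min(road_length, pos + vehicle_len) // cell_len)
        occ[start:min(end + 1, n_cells)] = 1.0                # mark cells covered by the vehicle body
    return occ

# e.g. three connected vehicles observed on a 100 m segment
print(rco_snapshot([3.0, 20.5, 41.2], road_length=100.0))
```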

[CV-73] An Overview of the Burer-Monteiro Method for Certifiable Robot Perception

链接: https://arxiv.org/abs/2410.00117
作者: Alan Papalia,Yulun Tian,David M. Rosen,Jonathan P. How,John J. Leonard
关键词-EN: solve robot perception, robot perception problems, Burer-Monteiro method, optimality in real-time, presents an overview
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to 2024 Robotics: Science and Systems (RSS) Safe Autonomy Workshop

点击查看摘要

Abstract:This paper presents an overview of the Burer-Monteiro method (BM), a technique that has been applied to solve robot perception problems to certifiable optimality in real-time. BM is often used to solve semidefinite programming relaxations, which can be used to perform global optimization for non-convex perception problems. Specifically, BM leverages the low-rank structure of typical semidefinite programs to dramatically reduce the computational cost of performing optimization. This paper discusses BM in certifiable perception, with three main objectives: (i) to consolidate information from the literature into a unified presentation, (ii) to elucidate the role of the linear independence constraint qualification (LICQ), a concept not yet well-covered in certifiable perception literature, and (iii) to share practical considerations that are discussed among practitioners but not thoroughly covered in the literature. Our general aim is to offer a practical primer for applying BM towards certifiable perception.
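
As a toy illustration of the Burer-Monteiro idea, the sketch below factors the SDP variable as X = V Vᵀ (low rank) and runs plain projected gradient descent for a unit-diagonal SDP; a practical certifiable-perception solver would use a Riemannian trust-region method and an optimality certificate, both omitted here:

```python
import numpy as np

def burer_monteiro(C, rank=3, steps=2000, lr=1e-2, seed=0):
    """Approximately minimize <C, X> over X >= 0 with diag(X) = 1,
    by parameterizing X = V V^T with V of shape (n, rank)."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    V = rng.standard_normal((n, rank))
    V /= np.linalg.norm(V, axis=1, keepdims=True)       # enforce diag(V V^T) = 1
    for _ in range(steps):
        grad = 2.0 * C @ V                              # gradient of <C, V V^T> w.r.t. V (C symmetric)
        V -= lr * grad
        V /= np.linalg.norm(V, axis=1, keepdims=True)   # project rows back onto the unit sphere
    X = V @ V.T
    return X, float(np.sum(C * X))

# toy example with a small symmetric cost matrix
C = np.array([[0., 1., -1.], [1., 0., 1.], [-1., 1., 0.]])
X, obj = burer_monteiro(C)
print(obj)
```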

[CV-74] ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

链接: https://arxiv.org/abs/2410.00086
作者: Zhen Han,Zeyinzi Jiang,Yulin Pan,Jingfeng Zhang,Chaojie Mao,Chenwei Xie,Yu Liu,Jingren Zhou
关键词-EN: powerful generative technology, visual generation tasks, visual generation, foundational diffusion models, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as a powerful generative technology and have been found to be applicable in various scenarios. Most existing foundational diffusion models are primarily designed for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model in the field of visual generation, like GPT-4 in the natural language processing field. In this work, we propose ACE, an All-round Creator and Editor, which achieves comparable performance compared to those expert models in a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the issue of the absence of available training data. It involves acquiring pairwise images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated pairs data across a variety of visual generation tasks. The extensive experimental results demonstrate the superiority of our model in visual generation fields. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model to serve as the backend, avoiding the cumbersome pipeline typically employed in visual agents. Code and models will be available on the project page: this https URL.

[CV-75] Fine-tuning Vision Classifiers On A Budget

链接: https://arxiv.org/abs/2410.00085
作者: Sunil Kumar,Ted Sandler,Paulina Varshavskaya
关键词-EN: modern computer vision, requires accurately labeled, Fine-tuning modern computer, accurately labeled data, computer vision models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Fine-tuning modern computer vision models requires accurately labeled data for which the ground truth may not exist, but a set of multiple labels can be obtained from labelers of variable accuracy. We tie the notion of label quality to confidence in labeler accuracy and show that, when prior estimates of labeler accuracy are available, using a simple naive-Bayes model to estimate the true labels allows us to label more data on a fixed budget without compromising label or fine-tuning quality. We present experiments on a dataset of industrial images that demonstrates that our method, called Ground Truth Extension (GTX), enables fine-tuning ML models using fewer human labels.
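
A minimal sketch of the naive-Bayes label-estimation step described above, assuming each labeler errs uniformly over the wrong classes; the function name and interface are illustrative, not the GTX implementation:

```python
import numpy as np

def estimate_label(votes, accuracies, num_classes, prior=None):
    """votes: list of class indices, one per labeler.
    accuracies: prior accuracy estimates for the same labelers."""
    prior = np.full(num_classes, 1.0 / num_classes) if prior is None else np.asarray(prior)
    log_post = np.log(prior)
    for vote, acc in zip(votes, accuracies):
        # If the true class is y, the labeler reports y with probability acc,
        # and any other class with probability (1 - acc) / (K - 1).
        likelihood = np.full(num_classes, (1.0 - acc) / (num_classes - 1))
        likelihood[vote] = acc
        log_post += np.log(likelihood)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()                   # posterior over the true label

# e.g. three labelers with accuracies 0.9, 0.7, 0.6 voting [2, 2, 0] on a 3-class task
print(estimate_label([2, 2, 0], [0.9, 0.7, 0.6], num_classes=3))
```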

[CV-76] A Survey on Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2410.00083
作者: Giannis Daras,Hyungjin Chung,Chieh-Hsin Lai,Yuki Mitsufuji,Jong Chul Ye,Peyman Milanfar,Alexandros G. Dimakis,Mauricio Delbracio
关键词-EN: generate high-quality samples, generative modeling due, Diffusion models, high-quality samples, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Work in progress. 38 pages

点击查看摘要

Abstract:Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples. This has unlocked exciting new possibilities for solving inverse problems, especially in image restoration and reconstruction, by treating diffusion models as unsupervised priors. This survey provides a comprehensive overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training. We introduce taxonomies to categorize these methods based on both the problems they address and the techniques they employ. We analyze the connections between different approaches, offering insights into their practical implementation and highlighting important considerations. We further discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems. This work aims to be a valuable resource for those interested in learning about the intersection of diffusion models and inverse problems.

[CV-77] Graph Residual Noise Learner Network for Brain Connectivity Graph Prediction

链接: https://arxiv.org/abs/2410.00082
作者: Oytun Demirbilek,Tingying Peng,Alaa Bessadok
关键词-EN: brain dysconnectivity patterns, charting brain dysconnectivity, dysconnectivity patterns, depicting a connectional, connectional fingerprint
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 6th Workshop on GRaphs in biomedicAl Image anaLysis

点击查看摘要

Abstract:A morphological brain graph depicting a connectional fingerprint is of paramount importance for charting brain dysconnectivity patterns. Such data often has missing observations due to various reasons such as time-consuming and incomplete neuroimage processing pipelines. Thus, predicting a target brain graph from a source graph is crucial for better diagnosing neurological disorders with minimal data acquisition resources. Many brain graph generative models were proposed for promising results, yet they are mostly based on generative adversarial networks (GAN), which could suffer from mode collapse and require large training datasets. Recent developments in diffusion models address these problems by offering essential properties such as a stable training objective and easy scalability. However, applying a diffusion process to graph edges fails to maintain the topological symmetry of the brain connectivity matrices. To meet these challenges, we propose the Graph Residual Noise Learner Network (Grenol-Net), the first graph diffusion model for predicting a target graph from a source graph.

[CV-78] M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning ICRA2025

链接: https://arxiv.org/abs/2410.00064
作者: Kaushik Roy,Akila Dissanayake,Brendan Tidd,Peyman Moghadam
关键词-EN: poses significant challenges, significant challenges due, tasks poses significant, Lifelong imitation learning, manipulation tasks poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to ICRA2025

点击查看摘要

Abstract:Lifelong imitation learning for manipulation tasks poses significant challenges due to distribution shifts that occur in incremental learning steps. Existing methods often focus on unsupervised skill discovery to construct an ever-growing skill library or distillation from multiple policies, which can lead to scalability issues as diverse manipulation tasks are continually introduced and may fail to ensure a consistent latent space throughout the learning process, leading to catastrophic forgetting of previously learned skills. In this paper, we introduce M2Distill, a multi-modal distillation-based method for lifelong imitation learning focusing on preserving consistent latent space across vision, language, and action distributions throughout the learning process. By regulating the shifts in latent representations across different modalities from previous to current steps, and reducing discrepancies in Gaussian Mixture Model (GMM) policies between consecutive learning steps, we ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills. Extensive evaluations on the LIBERO lifelong imitation learning benchmark suites, including LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method consistently outperforms prior state-of-the-art methods across all evaluated metrics.

[CV-79] IDEA: An Inverse Domain Expert Adaptation Based Active DNN IP Protection Method

链接: https://arxiv.org/abs/2410.00059
作者: Chaohui Xu,Qi Cui,Jinxin Dong,Weiyang He,Chip-Hong Chang
关键词-EN: Deep Neural Network, Neural Network, Deep Neural, Illegitimate reproduction, derivation of Deep
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Illegitimate reproduction, distribution and derivation of Deep Neural Network (DNN) models can inflict economic loss, reputation damage and even privacy infringement. Passive DNN intellectual property (IP) protection methods such as watermarking and fingerprinting attempt to prove the ownership upon IP violation, but they are often too late to stop catastrophic damage of IP abuse and too feeble against strong adversaries. In this paper, we propose IDEA, an Inverse Domain Expert Adaptation based proactive DNN IP protection method featuring active authorization and source traceability. IDEA generalizes active authorization as an inverse problem of domain adaptation. The multi-adaptive optimization is solved by a mixture-of-experts model with one real and two fake experts. The real expert re-optimizes the source model to correctly classify test images with a unique model user key steganographically embedded. The fake experts are trained to output random predictions on test images without a user key or with an incorrect user key embedded, by minimizing their mutual information (MI) with the real expert. The MoE model is knowledge-distilled into a unified protected model to avoid leaking the expert model features by maximizing their MI with additional multi-layer attention and contrastive representation loss optimization. IDEA not only prevents unauthorized users without the valid key from accessing the functional model, but also enables the model owner to validate the deployed model and trace the source of IP infringement. We extensively evaluate IDEA on five datasets and four DNN models to demonstrate its effectiveness in authorization control, culprit tracing success rate, and robustness against various attacks.

[CV-80] Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization NEURIPS2024

链接: https://arxiv.org/abs/2410.00051
作者: Haoran Li,Zhennan Jiang,Yuhui Chen,Dongbin Zhao
关键词-EN: faces significant challenges, visual reinforcement learning, reinforcement learning, faces significant, exploitation and exploration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024)

点击查看摘要

Abstract:With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and training stability. As a time-efficient diffusion model, although consistency models have been validated in online state-based RL, it is still an open question whether it can be extended to visual RL. In this paper, we investigate the impact of non-stationary distribution and the actor-critic framework on consistency policy in online RL, and find that consistency policy was unstable during the training, especially in visual RL with the high-dimensional state space. To this end, we suggest sample-based entropy regularization to stabilize the policy training, and propose a consistency policy with prioritized proximal experience regularization (CP3ER) to improve sample efficiency. CP3ER achieves new state-of-the-art (SOTA) performance in 21 tasks across DeepMind control suite and Meta-world. To our knowledge, CP3ER is the first method to apply diffusion/consistency models to visual RL and demonstrates the potential of consistency models in visual RL. More visualization results are available at this https URL.

[CV-81] CycleBNN: Cyclic Precision Training in Binary Neural Networks ECCV-2024

链接: https://arxiv.org/abs/2410.00050
作者: Federico Fontana,Romeo Lanzino,Anxhelo Diko,Gian Luca Foresti,Luigi Cinque
关键词-EN: Binary Neural Networks, Binary Neural, offering significant reductions, Neural Networks, offering significant
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at Workshop CADL, ECCV-2024

点击查看摘要

Abstract:This paper works on Binary Neural Networks (BNNs), a promising avenue for efficient deep learning, offering significant reductions in computational overhead and memory footprint compared to full-precision networks. However, the challenge of energy-intensive training and the drop in performance have been persistent issues. Tackling the challenge, prior works focus primarily on task-related inference optimization. Unlike prior works, this study offers an innovative methodology integrating BNNs with cyclic precision training, introducing the CycleBNN. This approach is designed to enhance training efficiency while minimizing the loss in performance. By dynamically adjusting precision in cycles, we achieve a convenient trade-off between training efficiency and model performance. This emphasizes the potential of our method in energy-constrained training scenarios, where data is collected onboard, and paves the way for sustainable and efficient deep learning architectures. To gather insights on CycleBNN’s efficiency, we conduct experiments on ImageNet, CIFAR-10, and PASCAL-VOC, obtaining competitive performances while using 96.09% fewer operations during training on ImageNet, 88.88% on CIFAR-10 and 96.09% on PASCAL-VOC. Finally, CycleBNN offers a path towards faster, more accessible training of efficient networks, accelerating the development of practical applications. The PyTorch code is available at this https URL
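
A hedged sketch of what a cyclic precision schedule can look like (a cosine cycle between a low and a high bit-width); the exact schedule, bounds, and cycle length used by CycleBNN may differ:

```python
import math

def cyclic_precision(step, cycle_len=1000, min_bits=1, max_bits=8):
    """Return the quantization bit-width to use at a given training step."""
    phase = (step % cycle_len) / cycle_len                       # position within the cycle, in [0, 1)
    # cosine schedule from max_bits down to min_bits and back up over one cycle
    bits = min_bits + 0.5 * (max_bits - min_bits) * (1 + math.cos(2 * math.pi * phase))
    return int(round(bits))

# e.g. sample the schedule over the first cycle
print([cyclic_precision(s) for s in range(0, 1000, 100)])
```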

[CV-82] Multimodal Power Outage Prediction for Rapid Disaster Response and Resource Allocation

链接: https://arxiv.org/abs/2410.00017
作者: Alejandro Aparcedo,Christian Lopez,Abhinav Kotta,Mengjie Li
关键词-EN: Extreme weather events, posing significant risks, increasingly common due, Extreme weather, climate change
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: 7 pages, 4 figures, 1 table

点击查看摘要

Abstract:Extreme weather events are increasingly common due to climate change, posing significant risks. To mitigate further damage, a shift towards renewable energy is imperative. Unfortunately, underrepresented communities that are most affected often receive infrastructure improvements last. We propose a novel visual spatiotemporal framework for predicting nighttime lights (NTL), power outage severity and location before and after major hurricanes. Central to our solution is the Visual-Spatiotemporal Graph Neural Network (VST-GNN), to learn spatial and temporal coherence from images. Our work brings awareness to underrepresented areas in urgent need of enhanced energy infrastructure, such as future photovoltaic (PV) deployment. By identifying the severity and localization of power outages, our initiative aims to raise awareness and prompt action from policymakers and community stakeholders. Ultimately, this effort seeks to empower regions with vulnerable energy infrastructure, enhancing resilience and reliability for at-risk communities.

[CV-83] Language-centered Human Activity Recognition

链接: https://arxiv.org/abs/2410.00003
作者: Hua Yan,Heng Tan,Yi Ding,Peifei Zhou,Vinod Namboodiri,Yu Yang
关键词-EN: Inertial Measurement Unit, Measurement Unit, Inertial Measurement, Human Activity Recognition, Human Activity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is critical for applications in healthcare, safety, and industrial production. However, variations in activity patterns, device types, and sensor placements create distribution gaps across datasets, reducing the performance of HAR models. To address this, we propose LanHAR, a novel system that leverages Large Language Models (LLMs) to generate semantic interpretations of sensor readings and activity labels for cross-dataset HAR. This approach not only mitigates cross-dataset heterogeneity but also enhances the recognition of new activities. LanHAR employs an iterative re-generation method to produce high-quality semantic interpretations with LLMs and a two-stage training framework that bridges the semantic interpretations of sensor readings and activity labels. This ultimately leads to a lightweight sensor encoder suitable for mobile deployment, enabling any sensor reading to be mapped into the semantic interpretation space. Experiments on four public datasets demonstrate that our approach significantly outperforms state-of-the-art methods in both cross-dataset HAR and new activity recognition. The source code will be made publicly available.

[CV-84] WALINET: A water and lipid identification convolutional Neural Network for nuisance signal removal in 1H MR Spectroscopic Imaging

链接: https://arxiv.org/abs/2410.00746
作者: Paul Weiser,Georg Langs,Stanislav Motyka,Wolfgang Bogner,Sébastien Courvoisier,Malte Hoffmann,Antoine Klauser,Ovidiu C. Andronesi
关键词-EN: Resonance Spectroscopic Imaging, Magnetic Resonance Spectroscopic, Proton Magnetic Resonance, Spectroscopic Imaging, Magnetic Resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Purpose. Proton Magnetic Resonance Spectroscopic Imaging (1H-MRSI) provides non-invasive spectral-spatial mapping of metabolism. However, long-standing problems in whole-brain 1H-MRSI are spectral overlap of metabolite peaks with large lipid signal from scalp, and overwhelming water signal that distorts spectra. Fast and effective methods are needed for high-resolution 1H-MRSI to accurately remove lipid and water signals while preserving the metabolite signal. The potential of supervised neural networks for this task remains unexplored, despite their success for other MRSI processing. Methods. We introduce a deep-learning method based on a modified Y-NET network for water and lipid removal in whole-brain 1H-MRSI. The WALINET (WAter and LIpid neural NETwork) was compared to conventional methods such as the state-of-the-art lipid L2 regularization and Hankel-Lanczos singular value decomposition (HLSVD) water suppression. Methods were evaluated on simulated and in-vivo whole-brain MRSI using NMRSE, SNR, CRLB, and FWHM metrics. Results. WALINET is significantly faster and needs 8s for high-resolution whole-brain MRSI, compared to 42 minutes for conventional HLSVD+L2. Quantitative analysis shows WALINET has better performance than HLSVD+L2: 1) more lipid removal with 41% lower NRMSE, 2) better metabolite signal preservation with 71% lower NRMSE in simulated data, 155% higher SNR and 50% lower CRLB in in-vivo data. Metabolic maps obtained by WALINET in healthy subjects and patients show better gray/white-matter contrast with more visible structural details. Conclusions. WALINET has superior performance for nuisance signal removal and metabolite quantification on whole-brain 1H-MRSI compared to conventional state-of-the-art techniques. This represents a new application of deep-learning for MRSI processing, with potential for automated high-throughput workflow.

[CV-85] Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos MICCAI

链接: https://arxiv.org/abs/2410.00536
作者: Krishna Chaitanya,Pablo F. Damasceno,Shreyas Fadnavis,Pooya Mobadersany,Chaitanya Parmar,Emily Scherer,Natalia Zemlianskaia,Lindsey Surace,Louis R. Ghanem,Oana Gabriela Cula,Tommaso Mansi,Kristopher Standish
关键词-EN: Ulcerative Colitis Endoscopic, Colitis Endoscopic Index, Mayo Endoscopic Subscore, evaluating drug efficacy, ulcerative colitis
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 5 tables, accepted at MLMI, MICCAI

点击查看摘要

Abstract:Accurate assessment of disease severity from endoscopy videos in ulcerative colitis (UC) is crucial for evaluating drug efficacy in clinical trials. Severity is often measured by the Mayo Endoscopic Subscore (MES) and Ulcerative Colitis Endoscopic Index of Severity (UCEIS) score. However, expert MES/UCEIS annotation is time-consuming and susceptible to inter-rater variability, factors addressable by automation. Automation attempts with frame-level labels face challenges in fully-supervised solutions due to the prevalence of video-level labels in clinical trials. CNN-based weakly-supervised models (WSL) with end-to-end (e2e) training lack generalization to new disease scores and ignore spatio-temporal information crucial for accurate scoring. To address these limitations, we propose “Arges”, a deep learning framework that utilizes a transformer with positional encoding to incorporate spatio-temporal information from frame features to estimate disease severity scores in endoscopy video. Extracted features are derived from a foundation model (ArgesFM), pre-trained on a large diverse dataset from multiple clinical trials (61M frames, 3927 videos). We evaluate four UC disease severity scores, including MES and three UCEIS component scores. Test set evaluation indicates significant improvements, with F1 scores increasing by 4.1% for MES and 18.8%, 6.6%, 3.8% for the three UCEIS component scores compared to state-of-the-art methods. Prospective validation on previously unseen clinical trial data further demonstrates the model’s successful generalization.

[CV-86] Enhancing Sentinel-2 Image Resolution: Evaluating Advanced Techniques based on Convolutional and Generative Neural Networks

链接: https://arxiv.org/abs/2410.00516
作者: Patrick Kramer,Alexander Steinhardt,Barbara Pedretscher
关键词-EN: advanced super-resolution techniques, paper investigates, investigates the enhancement, enhancement of spatial, spatial resolution
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This paper investigates the enhancement of spatial resolution by a factor of 2 in Sentinel-2 bands that contain spectral information, using advanced super-resolution techniques. State-of-the-art CNN models are compared with enhanced GAN approaches in terms of quality and feasibility. Therefore, a representative dataset comprising Sentinel-2 low-resolution images and corresponding high-resolution aerial orthophotos is required. A literature study reveals no feasible dataset for the land type of interest (forests), for which reason an adequate dataset had to be generated in addition, accounting for accurate alignment and image source optimization. The results reveal that while CNN-based approaches produce satisfactory outcomes, they tend to yield blurry images. In contrast, GAN-based models not only provide clear and detailed images, but also demonstrate superior performance in terms of quantitative assessment, underscoring the potential of the framework beyond the specific land type investigated.

[CV-87] Pre-training with Synthetic Patterns for Audio ICASSP’25

链接: https://arxiv.org/abs/2410.00511
作者: Yuchi Ishikawa,Tatsuya Komatsu,Yoshimitsu Aoki
关键词-EN: pre-train audio encoders, propose to pre-train, synthetic, audio, data
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICASSP’25

点击查看摘要

Abstract:In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first one is Masked Autoencoder (MAE), a self-supervised learning framework that learns from reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it is unimportant what is portrayed in the input, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.
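
A small illustration of the two ingredients above: a synthetic pattern standing in for an audio mel-spectrogram, and MAE-style random patch masking. Here masked patches are simply zeroed for visualization, whereas a real MAE drops them from the encoder input, and the sinusoidal pattern generator is only a stand-in, not the paper's synthetic data:

```python
import numpy as np

def synthetic_pattern(freq_bins=128, time_steps=256, n_waves=5, seed=0):
    """Generate a simple sinusoidal-stripe pattern shaped like a mel-spectrogram."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, time_steps)
    f = np.linspace(0, 1, freq_bins)[:, None]
    spec = sum(np.sin(2 * np.pi * (rng.uniform(1, 20) * f + rng.uniform(1, 20) * t))
               for _ in range(n_waves))
    return spec.astype(np.float32)

def random_patch_mask(spec, patch=16, mask_ratio=0.75, seed=0):
    """Keep a random (1 - mask_ratio) fraction of non-overlapping patches; zero the rest."""
    rng = np.random.default_rng(seed)
    F, T = spec.shape
    n_patches = (F // patch) * (T // patch)
    keep = rng.permutation(n_patches)[: int(n_patches * (1 - mask_ratio))]
    masked = np.zeros_like(spec)
    for idx in keep:                                  # copy only the visible patches
        r, c = divmod(idx, T // patch)
        masked[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = \
            spec[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
    return masked

masked_input = random_patch_mask(synthetic_pattern())
```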

[CV-88] Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

链接: https://arxiv.org/abs/2410.00418
作者: Guy Ohayon,Tomer Michaeli,Michael Elad
关键词-EN: perceptual quality measures, Photo-realistic image restoration, perceptual quality loss, perceptual quality, quality measures
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality. To achieve this goal, current methods typically attempt to sample from the posterior distribution, or to optimize a weighted sum of a distortion loss (e.g., MSE) and a perceptual quality loss (e.g., GAN). Unlike previous works, this paper is concerned specifically with the optimal estimator that minimizes the MSE under a constraint of perfect perceptual index, namely where the distribution of the reconstructed images is equal to that of the ground-truth ones. A recent theoretical result shows that such an estimator can be constructed by optimally transporting the posterior mean prediction (MMSE estimate) to the distribution of the ground-truth images. Inspired by this result, we introduce Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective algorithm that approximates this optimal estimator. In particular, PMRF first predicts the posterior mean, and then transports the result to a high-quality image using a rectified flow model that approximates the desired optimal transport map. We investigate the theoretical utility of PMRF and demonstrate that it consistently outperforms previous methods on a variety of image restoration tasks.
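
A conceptual sketch of the two-stage PMRF inference described above, with both networks left as placeholders; their interfaces, the small added noise, and the Euler integration are assumptions for illustration, not the authors' implementation:

```python
import torch

@torch.no_grad()
def pmrf_restore(y, posterior_mean_net, flow_net, n_steps=25, sigma=0.05):
    """y: batch of degraded inputs; returns a photo-realistic restoration."""
    x = posterior_mean_net(y)                      # stage 1: MMSE-style posterior-mean estimate
    x = x + sigma * torch.randn_like(x)            # small noise so the flow has a proper source density
    dt = 1.0 / n_steps
    for i in range(n_steps):                       # stage 2: transport toward the clean-image distribution
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * flow_net(x, t)                # Euler step along the learned rectified-flow velocity
    return x
```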

[CV-89] Domain Aware Multi-Task Pretraining of 3D Swin Transformer for T1-weighted Brain MRI ACCV2024

链接: https://arxiv.org/abs/2410.00410
作者: Jonghun Kim,Mansu Kim,Hyunjin Park
关键词-EN: medical image analysis, annotated medical images, image analysis, scarcity of annotated, major bottleneck
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: ACCV 2024, 14 pages

点击查看摘要

Abstract:The scarcity of annotated medical images is a major bottleneck in developing learning models for medical image analysis. Hence, recent studies have focused on pretrained models with fewer annotation requirements that can be fine-tuned for various downstream tasks. However, existing approaches are mainly 3D adaptions of 2D approaches ill-suited for 3D medical imaging data. Motivated by this gap, we propose novel domain-aware multi-task learning tasks to pretrain a 3D Swin Transformer for brain magnetic resonance imaging (MRI). Our method considers the domain knowledge in brain MRI by incorporating brain anatomy and morphology as well as standard pretext tasks adapted for 3D imaging in a contrastive learning setting. We pretrain our model using large-scale brain MRI data of 13,687 samples spanning several large-scale databases. Our method outperforms existing supervised and self-supervised methods in three downstream tasks of Alzheimer’s disease classification, Parkinson’s disease classification, and age prediction tasks. The ablation study of the proposed pretext tasks shows the effectiveness of our pretext tasks.

[CV-90] 3DGR-CAR: Coronary artery reconstruction from ultra-sparse 2D X-ray views with a 3D Gaussians representation MICCAI2024

链接: https://arxiv.org/abs/2410.00404
作者: Xueming Fu,Yingtai Li,Fenghe Tang,Jun Li,Mingyue Zhao,Gao-Jun Teng,S. Kevin Zhou
关键词-EN: artery disease diagnosis, coronary artery, Coronary Artery Reconstruction, coronary artery disease, coronary
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures, Accepted at MICCAI 2024

点击查看摘要

Abstract:Reconstructing 3D coronary arteries is important for coronary artery disease diagnosis, treatment planning and operation navigation. Traditional reconstruction techniques often require many projections, while reconstruction from sparse-view X-ray projections is a potential way of reducing radiation dose. However, the extreme sparsity of coronary arteries in a 3D volume and ultra-limited number of projections pose significant challenges for efficient and accurate 3D reconstruction. To this end, we propose 3DGR-CAR, a 3D Gaussian Representation for Coronary Artery Reconstruction from ultra-sparse X-ray projections. We leverage 3D Gaussian representation to avoid the inefficiency caused by the extreme sparsity of coronary artery data and propose a Gaussian center predictor to overcome the noisy Gaussian initialization from ultra-sparse view projections. The proposed scheme enables fast and accurate 3D coronary artery reconstruction with only 2 views. Experimental results on two datasets indicate that the proposed approach significantly outperforms other methods in terms of voxel accuracy and visual quality of coronary arteries. The code will be available in this https URL.

[CV-91] Volumetric Conditional Score-based Residual Diffusion Model for PET/MR Denoising MICCAI2024

链接: https://arxiv.org/abs/2410.00184
作者: Siyeop Yoon,Rui Hu,Yuang Wang,Matthew Tivnan,Young-don Son,Dufan Wu,Xiang Li,Kyungsang Kim,Quanzheng Li
关键词-EN: powerful modality offering, offering quantitative assessments, modality offering quantitative, PET imaging, PET
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:PET imaging is a powerful modality offering quantitative assessments of molecular and physiological processes. The necessity for PET denoising arises from the intrinsic high noise levels in PET imaging, which can significantly hinder the accurate interpretation and quantitative analysis of the scans. With advances in deep learning techniques, diffusion model-based PET denoising techniques have shown remarkable performance improvement. However, these models often face limitations when applied to volumetric data. Additionally, many existing diffusion models do not adequately consider the unique characteristics of PET imaging, such as its 3D volumetric nature, leading to the potential loss of anatomic consistency. Our Conditional Score-based Residual Diffusion (CSRD) model addresses these issues by incorporating a refined score function and 3D patch-wise training strategy, optimizing the model for efficient volumetric PET denoising. The CSRD model significantly lowers computational demands and expedites the denoising process. By effectively integrating volumetric data from PET and MRI scans, the CSRD model maintains spatial coherence and anatomical detail. Lastly, we demonstrate that the CSRD model achieves superior denoising performance in both qualitative and quantitative evaluations while maintaining image details and outperforms existing state-of-the-art methods.

[CV-92] Multimodal Alignment of Histopathological Images Using Cell Segmentation and Point Set Matching for Integrative Cancer Analysis

链接: https://arxiv.org/abs/2410.00152
作者: Jun Jiang,Raymond Moore,Brenna Novotny,Leo Liu,Zachary Fogarty,Ray Guo,Markovic Svetomir,Chen Wang
关键词-EN: providing complementary insights, Hematoxylin and Eosin, Histopathological imaging, multiplexed Immunofluorescence, providing complementary
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: initial version

点击查看摘要

Abstract:Histopathological imaging is vital for cancer research and clinical practice, with multiplexed Immunofluorescence (MxIF) and Hematoxylin and Eosin (HE) providing complementary insights. However, aligning different stains at the cell level remains a challenge due to modality differences. In this paper, we present a novel framework for multimodal image alignment using cell segmentation outcomes. By treating cells as point sets, we apply Coherent Point Drift (CPD) for initial alignment and refine it with Graph Matching (GM). Evaluated on ovarian cancer tissue microarrays (TMAs), our method achieves high alignment accuracy, enabling integration of cell-level features across modalities and generating virtual HE images from MxIF data for enhanced clinical interpretation.

[CV-93] Automated Disease Diagnosis in Pumpkin Plants Using Advanced CNN Models

链接: https://arxiv.org/abs/2410.00062
作者: Aymane Khaldi,El Mostafa Kalmoun
关键词-EN: crop cultivated globally, vital crop cultivated, cultivated globally, food security, developing regions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Pumpkin is a vital crop cultivated globally, and its productivity is crucial for food security, especially in developing regions. Accurate and timely detection of pumpkin leaf diseases is essential to mitigate significant losses in yield and quality. Traditional methods of disease identification rely heavily on subjective judgment by farmers or experts, which can lead to inefficiencies and missed opportunities for intervention. Recent advancements in machine learning and deep learning offer promising solutions for automating and improving the accuracy of plant disease detection. This paper presents a comprehensive analysis of state-of-the-art Convolutional Neural Network (CNN) models for classifying diseases in pumpkin plant leaves. Using a publicly available dataset of 2000 high-resolution images, we evaluate the performance of several CNN architectures, including ResNet, DenseNet, and EfficientNet, in recognizing five classes: healthy leaves and four common diseases (downy mildew, powdery mildew, mosaic disease, and bacterial leaf spot). We fine-tuned these pretrained models and conducted hyperparameter optimization experiments. ResNet-34, DenseNet-121, and EfficientNet-B7 were identified as top-performing models, each excelling in different classes of leaf diseases. Our analysis revealed DenseNet-121 as the optimal model when considering both accuracy and computational complexity, achieving an overall accuracy of 86%. This study underscores the potential of CNNs in automating disease diagnosis for pumpkin plants, offering valuable insights that can contribute to enhancing agricultural productivity and minimizing economic losses.
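
A minimal fine-tuning sketch in the spirit of this setup (DenseNet-121 with a 5-class head, assuming a recent torchvision); the hyperparameters are placeholders, not the authors' settings:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5   # healthy, downy mildew, powdery mildew, mosaic disease, bacterial leaf spot
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, num_classes)  # replace the ImageNet head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised fine-tuning step on a batch of leaf images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```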

[CV-94] Mixture of Multicenter Experts in Multimodal Generative AI for Advanced Radiotherapy Target Delineation

链接: https://arxiv.org/abs/2410.00046
作者: Yujin Oh,Sangjoon Park,Xiang Li,Wang Yi,Jonathan Paly,Jason Efstathiou,Annie Chan,Jun Won Kim,Hwa Kyung Byun,Ik Jae Lee,Jaeho Cho,Chan Woo Wee,Peng Shu,Peilong Wang,Nathan Yu,Jason Holmes,Jong Chul Ye,Quanzheng Li,Wei Liu,Woong Sub Koom,Jin Sung Kim,Kyungsang Kim
关键词-EN: regional patient populations, employ diverse philosophies, experts employ diverse, Clinical experts employ, patient care
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 39 pages

点击查看摘要

Abstract:Clinical experts employ diverse philosophies and strategies in patient care, influenced by regional patient populations. However, existing medical artificial intelligence (AI) models are often trained on data distributions that disproportionately reflect highly prevalent patterns, reinforcing biases and overlooking the diverse expertise of clinicians. To overcome this limitation, we introduce the Mixture of Multicenter Experts (MoME) approach. This method strategically integrates specialized expertise from diverse clinical strategies, enhancing the AI model’s ability to generalize and adapt across multiple medical centers. The MoME-based multimodal target volume delineation model, trained with few-shot samples including images and clinical notes from each medical center, outperformed baseline methods in prostate cancer radiotherapy target delineation. The advantages of MoME were most pronounced when data characteristics varied across centers or when data availability was limited, demonstrating its potential for broader clinical applications. Moreover, the MoME framework enables the deployment of AI-based target volume delineation models in resource-constrained medical facilities by adapting to the specific preferences of each medical center using only a few samples, without the need for data sharing between institutions. Expanding the number of multicenter experts within the MoME framework will significantly enhance the generalizability, while also improving the usability and adaptability of clinical AI applications in the field of precision radiation oncology.
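
A generic mixture-of-experts sketch to make the gating idea concrete (a gate weights center-specific experts over fused multimodal features); this is only an illustration, not the authors' MoME architecture:

```python
import torch
import torch.nn as nn

class CenterMoE(nn.Module):
    def __init__(self, in_dim, out_dim, num_centers):
        super().__init__()
        # one lightweight expert per medical center, plus a shared gating network
        self.experts = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(num_centers)])
        self.gate = nn.Linear(in_dim, num_centers)

    def forward(self, x):
        # x: (B, in_dim) fused multimodal features (e.g. image + clinical-note embedding)
        weights = self.gate(x).softmax(dim=-1)                       # (B, num_centers)
        expert_out = torch.stack([e(x) for e in self.experts], 1)    # (B, num_centers, out_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)       # gated combination
```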

机器学习

[LG-0] Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning

链接: https://arxiv.org/abs/2410.00911
作者: Da-Wei Zhou,Zi-Wen Cai,Han-Jia Ye,Lijun Zhang,De-Chuan Zhan
关键词-EN: involves the progressive, progressive adaptation, Domain-Incremental Learning, DIL, representation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains. While recent advances in pre-trained models provide a solid foundation for DIL, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge. Specifically, sequential model updates can overwrite both the representation and the classifier with knowledge from the latest domain. Thus, it is crucial to develop a representation and corresponding classifier that accommodate all seen domains throughout the learning process. To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. By merging the backbone of different stages, we create a representation space suitable for multiple domains incrementally. The merged representation serves as a balanced intermediary that captures task-specific features from all seen domains. Additionally, to address the mismatch between consolidated embeddings and the classifier, we introduce an extra classifier consolidation process. Leveraging class-wise semantic information, we estimate the classifier weights of old domains within the latest embedding space. By merging historical and estimated classifiers, we align them with the consolidated embedding space, facilitating incremental classification. Extensive experimental results on four benchmark datasets demonstrate Duct’s state-of-the-art performance.

[LG-1] Empirical Perturbation Analysis of Linear System Solvers from a Data Poisoning Perspective

链接: https://arxiv.org/abs/2410.00878
作者: Yixin Liu,Arielle Carr,Lichao Sun
关键词-EN: machine learning settings, linear regression models, systems arising broadly, linear solvers applied, learning settings
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Numerical Analysis (math.NA)
*备注: 18 pages

点击查看摘要

Abstract:The perturbation analysis of linear solvers applied to systems arising broadly in machine learning settings – for instance, when using linear regression models – establishes an important perspective when reframing these analyses through the lens of a data poisoning attack. By analyzing solvers’ responses to such attacks, this work aims to contribute to the development of more robust linear solvers and provide insights into poisoning attacks on linear solvers. In particular, we investigate how the errors in the input data will affect the fitting error and accuracy of the solution from a linear system-solving algorithm under perturbations common in adversarial attacks. We propose data perturbation through two distinct knowledge levels, developing a poisoning optimization and studying two methods of perturbation: Label-guided Perturbation (LP) and Unconditioning Perturbation (UP). Existing works mainly focus on deriving the worst-case perturbation bound from a theoretical perspective, and the analysis is often limited to specific kinds of linear system solvers. Under the circumstance that the data is intentionally perturbed – as is the case with data poisoning – we seek to understand how different kinds of solvers react to these perturbations, identifying those algorithms most impacted by different types of adversarial attacks.
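
The kind of experiment described here is easy to reproduce in miniature: perturb part of a least-squares system and track how the fitted solution and residual move. The snippet below uses a generic random label poisoning for illustration; it is not the paper's Label-guided (LP) or Unconditioning (UP) perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth linear regression data
n, d = 200, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

def solve_and_report(A, b, tag):
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    fit_err = np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b)
    sol_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(f"{tag}: relative fitting error {fit_err:.4f}, solution error {sol_err:.4f}")

solve_and_report(A, b, "clean")

# Poison 5% of the labels with large offsets (generic attack for illustration)
b_poisoned = b.copy()
idx = rng.choice(n, size=n // 20, replace=False)
b_poisoned[idx] += 10.0 * rng.normal(size=idx.size)
solve_and_report(A, b_poisoned, "poisoned")
```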

[LG-2] Replacing Paths with Connection-Biased Attention for Knowledge Graph Completion

链接: https://arxiv.org/abs/2410.00876
作者: Sharmishtha Dutta,Alex Gittens,Mohammed J. Zaki,Charu C. Aggarwal
关键词-EN: identify additional facts, Knowledge graph, subgraph encoding module, encoding module, subgraph encoding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge graph (KG) completion aims to identify additional facts that can be inferred from the existing facts in the KG. Recent developments in this field have explored this task in the inductive setting, where at test time one sees entities that were not present during training; the most performant models in the inductive setting have employed path encoding modules in addition to standard subgraph encoding modules. This work similarly focuses on KG completion in the inductive setting, without the explicit use of path encodings, which can be time-consuming and introduces several hyperparameters that require costly hyperparameter optimization. Our approach uses a Transformer-based subgraph encoding module only; we introduce connection-biased attention and entity role embeddings into the subgraph encoding module to eliminate the need for an expensive and time-consuming path encoding module. Evaluations on standard inductive KG completion benchmark datasets demonstrate that our Connection-Biased Link Prediction (CBLiP) model has superior performance to models that do not use path information. Compared to models that utilize path information, CBLiP shows competitive or superior performance while being faster. Additionally, to show that the effectiveness of connection-biased attention and entity role embeddings also holds in the transductive setting, we compare CBLiP’s performance on the relation prediction task in the transductive setting.

[LG-3] Review of blockchain application with Graph Neural Networks Graph Convolutional Networks and Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.00875
作者: Amy Ancelotti,Claudia Liason
关键词-EN: Graph Convolutional Networks, Convolutional Neural Networks, Convolutional Networks, Graph Neural Networks, Graph Convolutional
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper reviews the applications of Graph Neural Networks (GNNs), Graph Convolutional Networks (GCNs), and Convolutional Neural Networks (CNNs) in blockchain technology. As the complexity and adoption of blockchain networks continue to grow, traditional analytical methods are proving inadequate in capturing the intricate relationships and dynamic behaviors of decentralized systems. To address these limitations, deep learning models such as GNNs, GCNs, and CNNs offer robust solutions by leveraging the unique graph-based and temporal structures inherent in blockchain architectures. GNNs and GCNs, in particular, excel in modeling the relational data of blockchain nodes and transactions, making them ideal for applications such as fraud detection, transaction verification, and smart contract analysis. Meanwhile, CNNs can be adapted to analyze blockchain data when represented as structured matrices, revealing hidden temporal and spatial patterns in transaction flows. This paper explores how these models enhance the efficiency, security, and scalability of both linear blockchains and Directed Acyclic Graph (DAG)-based systems, providing a comprehensive overview of their strengths and future research directions. By integrating advanced neural network techniques, we aim to demonstrate the potential of these models in revolutionizing blockchain analytics, paving the way for more sophisticated decentralized applications and improved network performance.

[LG-4] Do Music Generation Models Encode Music Theory?

链接: https://arxiv.org/abs/2410.00872
作者: Megan Wei,Michael Freeman,Chris Donahue,Chen Sun
关键词-EN: possess impressive music, music theory concepts, models possess impressive, Music, music theory
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted at ISMIR 2024. Dataset: this https URL Code: this https URL Website: this https URL

点击查看摘要

Abstract:Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work, by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models? More specifically, are fundamental Western music theory concepts observable within the “inner workings” of these models? Recent work proposed leveraging latent audio representations from music generation models towards music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded within these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus, we introduce SynTheory, a synthetic MIDI and audio music theory dataset, consisting of tempos, time signatures, notes, intervals, scales, chords, and chord progressions concepts. We then propose a framework to probe for these music theory concepts in music foundation models (Jukebox and MusicGen) and assess how strongly they encode these concepts within their internal representations. Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.
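
A probing study of this kind boils down to fitting a lightweight classifier on frozen model activations and reading its accuracy as evidence that a concept is linearly decodable. The sketch below assumes you already have per-clip embeddings and a music-theory label such as a binned tempo class; the data here is random placeholder data, not SynTheory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, `X` would hold latent features extracted from a
# music generation model (e.g. one layer of MusicGen) and `y` a music-theory label
# such as a binned tempo class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))
y = rng.integers(0, 8, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # chance level ~1/8 on random data
```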

[LG-5] Fine-Grained Gradient Restriction: A Simple Approach for Mitigating Catastrophic Forgetting

链接: https://arxiv.org/abs/2410.00868
作者: Bo Liu,Mao Ye,Peter Stone,Qiang Liu
关键词-EN: previously acquired knowledge, Gradient Episodic Memory, fundamental challenge, challenge in continual, previously acquired
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A fundamental challenge in continual learning is to balance the trade-off between learning new tasks and remembering the previously acquired knowledge. Gradient Episodic Memory (GEM) achieves this balance by utilizing a subset of past training samples to restrict the update direction of the model parameters. In this work, we start by analyzing an often overlooked hyper-parameter in GEM, the memory strength, which boosts the empirical performance by further constraining the update direction. We show that memory strength is effective mainly because it improves GEM’s generalization ability and therefore leads to a more favorable trade-off. By this finding, we propose two approaches that more flexibly constrain the update direction. Our methods are able to achieve uniformly better Pareto Frontiers of remembering old and learning new knowledge than using memory strength. We further propose a computationally efficient method to approximately solve the optimization problem with more constraints.
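
The role of memory strength is easiest to see in the single-memory case of GEM's constraint: the update is projected so that its inner product with the stored gradient stays above a margin. This is a simplified one-constraint sketch; GEM and the paper handle multiple memories via quadratic programming.

```python
import numpy as np

def project_gradient(g, g_mem, memory_strength=0.0):
    """Project the current gradient g so that <g, g_mem> >= memory_strength.
    Single-constraint simplification of the GEM update."""
    dot = g @ g_mem
    if dot >= memory_strength:
        return g  # no interference with the stored task, keep g unchanged
    # Closed-form projection onto the half-space {x : <x, g_mem> >= memory_strength}
    return g - ((dot - memory_strength) / (g_mem @ g_mem)) * g_mem

g = np.array([1.0, -2.0])
g_mem = np.array([0.5, 1.0])
print(project_gradient(g, g_mem, memory_strength=0.1))  # projected update direction
```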

[LG-6] Timber! Poisoning Decision Trees

链接: https://arxiv.org/abs/2410.00862
作者: Stefano Calzavara,Lorenzo Cazzaro,Massimo Vettori
关键词-EN: white-box poisoning attack, poisoning attack targeting, present Timber, targeting decision trees, Timber
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 18 pages, 7 figures, 5 tables

点击查看摘要

Abstract:We present Timber, the first white-box poisoning attack targeting decision trees. Timber is based on a greedy attack strategy leveraging sub-tree retraining to efficiently estimate the damage performed by poisoning a given training instance. The attack relies on a tree annotation procedure which enables sorting training instances so that they are processed in increasing order of computational cost of sub-tree retraining. This sorting yields a variant of Timber supporting an early stopping criterion designed to make poisoning attacks more efficient and feasible on larger datasets. We also discuss an extension of Timber to traditional random forest models, which is useful because decision trees are normally combined into ensembles to improve their predictive power. Our experimental evaluation on public datasets shows that our attacks outperform existing baselines in terms of effectiveness, efficiency or both. Moreover, we show that two representative defenses can mitigate the effect of our attacks, but fail at effectively thwarting them.

[LG-7] Improved Sample Complexity of Imitation Learning for Barrier Model Predictive Control

链接: https://arxiv.org/abs/2410.00859
作者: Daniel Pfrommer,Swati Padmanabhan,Kwangjun Ahn,Jack Umenberger,Tobia Marcucci,Zakaria Mhammedi,Ali Jadbabaie
关键词-EN: stable enables stronger, enables stronger guarantees, smoothed expert controllers, Model Predictive Control, Recent work
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 36 pages, 3 figures. This work extends our previous result in arXiv:2306.01914 , which has been accepted for publication in CDC 2024. An earlier version of this manuscript was submitted as part of DP’s Master’s thesis

点击查看摘要

Abstract:Recent work in imitation learning has shown that having an expert controller that is both suitably smooth and stable enables stronger guarantees on the performance of the learned controller. However, constructing such smoothed expert controllers for arbitrary systems remains challenging, especially in the presence of input and state constraints. As our primary contribution, we show how such a smoothed expert can be designed for a general class of systems using a log-barrier-based relaxation of a standard Model Predictive Control (MPC) optimization problem. Improving upon our previous work, we show that barrier MPC achieves theoretically optimal error-to-smoothness tradeoff along some direction. At the core of this theoretical guarantee on smoothness is an improved lower bound we prove on the optimality gap of the analytic center associated with a convex Lipschitz function, which we believe could be of independent interest. We validate our theoretical findings via experiments, demonstrating the merits of our smoothing approach over randomized smoothing.
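
The log-barrier relaxation replaces the hard inequality constraints of the MPC problem with smooth barrier terms in the objective, which is what makes the resulting expert differentiable. Below is a toy one-step problem with box constraints on the input, hedged as an illustration of the general idea rather than the paper's formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy single-step "MPC": minimize quadratic tracking cost subject to |u| <= u_max,
# with the hard constraint replaced by log-barrier terms weighted by `eta`.
x0, u_max, eta = np.array([2.0]), 1.0, 0.05
A_dyn, B_dyn = np.array([[1.0]]), np.array([[1.0]])

def barrier_objective(u):
    u = np.atleast_1d(u)
    if np.any(np.abs(u) >= u_max):
        return np.inf  # outside the barrier's domain
    x1 = A_dyn @ x0 + B_dyn @ u
    cost = float(x1 @ x1 + 0.1 * u @ u)
    # log-barrier for -u_max < u < u_max (smooth surrogate of the box constraint)
    barrier = -eta * (np.log(u_max - u).sum() + np.log(u_max + u).sum())
    return cost + barrier

res = minimize(barrier_objective, x0=np.array([0.0]), method="Nelder-Mead")
print("smoothed control input:", res.x)  # stays strictly inside the box
```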

[LG-8] Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

链接: https://arxiv.org/abs/2410.00847
作者: Xingzhou Lou,Dong Yan,Wei Shen,Yuzi Yan,Jian Xie,Junge Zhang
关键词-EN: large language models, play a critical, critical role, role in aligning, aligning generations
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reward models (RM) play a critical role in aligning generations of large language models (LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity within human preferences and cannot effectively evaluate the reliability of reward predictions. To address these issues, we propose Uncertain-aware RM (URM) and Uncertain-aware RM Ensemble (URME) to incorporate and manage uncertainty in reward modeling. URM can model the distribution of disentangled attributes within human preferences, while URME quantifies uncertainty through discrepancies in the ensemble, thereby identifying potential lack of knowledge during reward evaluation. Experiment results indicate that the proposed URM achieves state-of-the-art performance compared to models with the same size, demonstrating the effectiveness of modeling uncertainty within human preferences. Furthermore, empirical results show that through uncertainty quantification, URM and URME can identify unreliable predictions to improve the quality of reward evaluations.
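
The ensemble side of the proposal (URME) follows the familiar deep-ensemble recipe: run several independently trained reward heads and treat their disagreement as an uncertainty score for flagging unreliable reward predictions. The snippet below is a generic sketch with dummy linear heads, not the released implementation.

```python
import torch

def ensemble_reward(reward_models, prompt_response_embedding):
    """Aggregate rewards from an ensemble and quantify disagreement."""
    with torch.no_grad():
        rewards = torch.stack([m(prompt_response_embedding)
                               for m in reward_models])   # (num_models, batch)
    mean_reward = rewards.mean(dim=0)
    uncertainty = rewards.std(dim=0)   # high std => potentially unreliable reward
    return mean_reward, uncertainty

# Example with dummy linear reward heads over a 16-d embedding
heads = [torch.nn.Sequential(torch.nn.Linear(16, 1), torch.nn.Flatten(0)) for _ in range(4)]
emb = torch.randn(8, 16)
mean_r, unc = ensemble_reward(heads, emb)
print(mean_r.shape, unc.shape)  # torch.Size([8]) torch.Size([8])
```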

[LG-9] Learning Stochastic Dynamics from Snapshots through Regularized Unbalanced Optimal Transport

链接: https://arxiv.org/abs/2410.00844
作者: Zhenyi Zhang,Tiejun Li,Peijie Zhou
关键词-EN: sparsely time-resolved snapshots, Reconstructing dynamics, samples from sparsely, sparsely time-resolved, natural sciences
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Reconstructing dynamics using samples from sparsely time-resolved snapshots is an important problem in both natural sciences and machine learning. Here, we introduce a new deep learning approach for solving regularized unbalanced optimal transport (RUOT) and inferring continuous unbalanced stochastic dynamics from observed snapshots. Based on the RUOT form, our method models these dynamics without requiring prior knowledge of growth and death processes or additional information, allowing them to be learnt directly from data. Theoretically, we explore the connections between the RUOT and Schrödinger bridge problem and discuss the key challenges and potential solutions. The effectiveness of our method is demonstrated with a synthetic gene regulatory network. Compared with other methods, our approach accurately identifies growth and transition patterns, eliminates false transitions, and constructs the Waddington developmental landscape.

[LG-10] Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes

链接: https://arxiv.org/abs/2410.00836
作者: Manh Khoi Duong,Stefan Conrad
关键词-EN: unfair outcomes, rooted in biased, framework, biased datasets, synthetic data
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: The Version of Record of this contribution is published in Data Science and Machine Learning, volume 1943, CCIS (Springer Singapore) 2023. It is available online at this https URL

点击查看摘要

Abstract:The reason behind the unfair outcomes of AI is often rooted in biased datasets. Therefore, this work presents a framework for addressing fairness by debiasing datasets containing a (non-)binary protected attribute. The framework poses a combinatorial optimization problem where heuristics such as genetic algorithms can be used to solve for the stated fairness objectives. The framework addresses this by finding a data subset that minimizes a certain discrimination measure. Depending on a user-defined setting, the framework enables different use cases, such as data removal, the addition of synthetic data, or exclusive use of synthetic data. The exclusive use of synthetic data in particular enhances the framework’s ability to preserve privacy while optimizing for fairness. In a comprehensive evaluation, we demonstrate that under our framework, genetic algorithms can effectively yield fairer datasets compared to the original data. In contrast to prior work, the framework exhibits a high degree of flexibility as it is metric- and task-agnostic, can be applied to both binary and non-binary protected attributes, and demonstrates efficient runtime.
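
A minimal instance of the framework's core loop, searching for a data subset that lowers a discrimination measure, is sketched below with a toy measure (the gap in positive-label rates across groups) and random subset search standing in for a genetic algorithm. The measure, thresholds, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
protected = rng.integers(0, 3, size=n)   # non-binary protected attribute (3 groups)
label = (rng.random(n) < 0.4 + 0.15 * (protected == 0)).astype(int)  # biased outcomes

def discrimination(mask):
    """Max gap in positive-label rate across protected groups (toy measure)."""
    rates = [label[mask & (protected == g)].mean() for g in np.unique(protected)]
    return max(rates) - min(rates)

full_mask = np.ones(n, dtype=bool)
best_mask, best_score = full_mask, discrimination(full_mask)
for _ in range(2000):  # random search as a simple stand-in for a genetic algorithm
    mask = rng.random(n) < 0.8          # candidate subset keeping ~80% of the data
    if all((mask & (protected == g)).sum() > 20 for g in np.unique(protected)):
        score = discrimination(mask)
        if score < best_score:
            best_mask, best_score = mask, score

print(f"discrimination: full data {discrimination(full_mask):.3f}, "
      f"selected subset {best_score:.3f}")
```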

[LG-11] Solving High-Dimensional Partial Integral Differential Equations: The Finite Expression Method

链接: https://arxiv.org/abs/2410.00835
作者: Gareth Hardwick,Senwei Liang,Haizhao Yang
关键词-EN: high-dimensional partial integro-differential, partial integro-differential equations, solve high-dimensional partial, finite expression method, partial integro-differential
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:In this paper, we introduce a new finite expression method (FEX) to solve high-dimensional partial integro-differential equations (PIDEs). This approach builds upon the original FEX and its inherent advantages with new advances: 1) A novel method of parameter grouping is proposed to reduce the number of coefficients in high-dimensional function approximation; 2) A Taylor series approximation method is implemented to significantly improve the computational efficiency and accuracy of the evaluation of the integral terms of PIDEs. The new FEX based method, denoted FEX-PG to indicate the addition of the parameter grouping (PG) step to the algorithm, provides both high accuracy and interpretable numerical solutions, with the outcome being an explicit equation that facilitates intuitive understanding of the underlying solution structures. These features are often absent in traditional methods, such as finite element methods (FEM) and finite difference methods, as well as in deep learning-based approaches. To benchmark our method against recent advances, we apply the new FEX-PG to solve benchmark PIDEs in the literature. In high-dimensional settings, FEX-PG exhibits strong and robust performance, achieving relative errors on the order of single precision machine epsilon.

[LG-12] Squeeze-and-Remember Block ICMLA

链接: https://arxiv.org/abs/2410.00823
作者: Rinor Cakaj,Jens Mehnert,Bin Yang
关键词-EN: Convolutional Neural Networks, Convolutional Neural, machine learning tasks, Neural Networks, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by The International Conference on Machine Learning and Applications (ICMLA) 2024

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are important for many machine learning tasks. They are built with different types of layers: convolutional layers that detect features, dropout layers that help to avoid over-reliance on any single neuron, and residual layers that allow the reuse of features. However, CNNs lack a dynamic feature retention mechanism similar to the human brain’s memory, limiting their ability to use learned information in new contexts. To bridge this gap, we introduce the “Squeeze-and-Remember” (SR) block, a novel architectural unit that gives CNNs dynamic memory-like functionalities. The SR block selectively memorizes important features during training, and then adaptively re-applies these features during inference. This improves the network’s ability to make contextually informed predictions. Empirical results on ImageNet and Cityscapes datasets demonstrate the SR block’s efficacy: integration into ResNet50 improved top-1 validation accuracy on ImageNet by 0.52% over dropout2d alone, and its application in DeepLab v3 increased mean Intersection over Union in Cityscapes by 0.20%. These improvements are achieved with minimal computational overhead. This shows the SR block’s potential to enhance the capabilities of CNNs in image processing tasks.

[LG-13] Fast and Reliable N-k Contingency Screening with Input-Convex Neural Networks

链接: https://arxiv.org/abs/2410.00796
作者: Nicolas Christianson,Wenqi Cui,Steven Low,Weiwei Yang,Baosen Zhang
关键词-EN: Power system operators, dispatch decisions remain, decisions remain feasible, prevent cascading failures, ensure reliable operation
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Power system operators must ensure that dispatch decisions remain feasible in case of grid outages or contingencies to prevent cascading failures and ensure reliable operation. However, checking the feasibility of all N - k contingencies – every possible simultaneous failure of k grid components – is computationally intractable for even small k , requiring system operators to resort to heuristic screening methods. Because of the increase in uncertainty and changes in system behaviors, heuristic lists might not include all relevant contingencies, generating false negatives in which unsafe scenarios are misclassified as safe. In this work, we propose to use input-convex neural networks (ICNNs) for contingency screening. We show that ICNN reliability can be determined by solving a convex optimization problem, and by scaling model weights using this problem as a differentiable optimization layer during training, we can learn an ICNN classifier that is both data-driven and has provably guaranteed reliability. Namely, our method can ensure a zero false negative rate. We empirically validate this methodology in a case study on the IEEE 39-bus test network, observing that it yields substantial (10-20x) speedups while having excellent classification accuracy.
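
Input-convexity is obtained structurally: hidden-to-hidden weights are kept non-negative and activations are convex and non-decreasing, so the scalar output is convex in the input. A minimal generic ICNN in PyTorch (not the paper's screening model) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal fully input-convex network: f(x) is convex in x because the z-path
    weights are kept non-negative and ReLU is convex and non-decreasing."""
    def __init__(self, in_dim, hidden=64, depth=2):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(depth)])
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(depth - 1)])
        self.out_x = nn.Linear(in_dim, 1)
        self.out_z = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.relu(self.Wx[0](x))
        for Wx, Wz in zip(self.Wx[1:], self.Wz):
            # clamping enforces non-negative hidden-to-hidden weights at forward time
            z = F.relu(Wx(x) + F.linear(z, Wz.weight.clamp(min=0)))
        return self.out_x(x) + F.linear(z, self.out_z.weight.clamp(min=0))

model = ICNN(in_dim=10)
print(model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])
```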

[LG-14] Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction

链接: https://arxiv.org/abs/2410.00774
作者: Hyogo Hiruma,Hiroshi Ito,Tetsuya Ogata
关键词-EN: performing real-world robot, real-world robot tasks, characteristic to handle, environments has long, difficult characteristic
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Uncertainty of environments has long been a difficult characteristic to handle when performing real-world robot tasks, because the uncertainty produces unexpected observations that cannot be covered by manual scripting. Learning-based robot control methods are a promising approach for generating flexible motions in unknown situations, but they still tend to suffer under uncertainty due to their deterministic nature. In order to adaptively perform the target task under such conditions, the robot control model must be able to accurately understand the possible uncertainty and to exploratively derive the optimal action that minimizes it. This paper extends an existing predictive-learning-based robot control method that employs foresight prediction using dynamic internal simulation. The foresight module refines the model’s hidden states by sampling multiple possible futures and replacing them with the one that leads to the lower future uncertainty. The adaptiveness of the model was evaluated on a door-opening task: the door can be opened by pushing, pulling, or sliding, but the robot cannot visually distinguish which way and must adapt on the fly. The results showed that the proposed model adaptively diverged its motion through interaction with the door, whereas conventional methods failed to diverge stably. The models were analyzed via Lyapunov exponents of the RNN hidden states, which reflect the possible divergence at each time step during task execution. The results indicated that the foresight module biased the model to consider future consequences, which leads to embedding uncertainties in the policy of the robot controller rather than in the resultant observation. This is beneficial for implementing adaptive behaviors, as it induces the derivation of diverse motions during exploration.

[LG-15] On the Generalization and Causal Explanation in Self-Supervised Learning

链接: https://arxiv.org/abs/2410.00772
作者: Wenwen Qiang,Zeen Song,Ziyin Gu,Jiangmeng Li,Changwen Zheng,Fuchun Sun,Hui Xiong
关键词-EN: achieve high generalization, Self-supervised learning, learn from unlabeled, achieve high, Undoing Memorization Mechanism
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) methods learn from unlabeled data and achieve high generalization performance on downstream tasks. However, they may also suffer from overfitting to their training data and lose the ability to adapt to new tasks. To investigate this phenomenon, we conduct experiments on various SSL methods and datasets and make two observations: (1) Overfitting occurs abruptly in later layers and epochs, while generalizing features are learned in early layers for all epochs; (2) Coding rate reduction can be used as an indicator to measure the degree of overfitting in SSL models. Based on these observations, we propose Undoing Memorization Mechanism (UMM), a plug-and-play method that mitigates overfitting of the pre-trained feature extractor by aligning the feature distributions of the early and the last layers to maximize the coding rate reduction of the last layer output. The learning process of UMM is a bi-level optimization process. We provide a causal analysis of UMM to explain how UMM can help the pre-trained feature extractor overcome overfitting and recover generalization. We also demonstrate that UMM significantly improves the generalization performance of SSL methods on various downstream tasks.
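
The overfitting indicator mentioned here, coding rate reduction, builds on a closed-form coding rate for a batch of features, R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z), as used in the MCR^2 literature. The snippet computes that quantity for illustration; how UMM plugs it into the bi-level alignment objective is not reproduced.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z^T Z) for features Z of shape (n, d)."""
    n, d = Z.shape
    gram = Z.T @ Z
    sign, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
Z_spread = rng.normal(size=(256, 32))                              # well-spread features
Z_collapsed = np.outer(rng.normal(size=256), rng.normal(size=32))  # rank-1 (collapsed) features
print(coding_rate(Z_spread), coding_rate(Z_collapsed))  # spread features have a higher rate
```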

[LG-16] Targeted synthetic data generation for tabular data via hardness characterization

链接: https://arxiv.org/abs/2410.00759
作者: Tommaso Ferracci,Leonie Tabea Goldmann,Anton Hinel,Francesco Sanna Passino
关键词-EN: improving model performance, Synthetic data generation, proven successful, successful in improving, improving model
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Synthetic data generation has been proven successful in improving model performance and robustness in the context of scarce or low-quality data. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterisation tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and it is computationally more efficient compared to non-targeted methods.

[LG-17] Improved Generation of Synthetic Imaging Data Using Feature-Aligned Diffusion

链接: https://arxiv.org/abs/2410.00731
作者: Lakshmi Nair
关键词-EN: Synthetic data generation, Synthetic data, important application, application of machine, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio 2024) at the 32nd ACM-Multimedia conference

点击查看摘要

Abstract:Synthetic data generation is an important application of machine learning in the field of medical imaging. While existing approaches have successfully applied fine-tuned diffusion models for synthesizing medical images, we explore potential improvements to this pipeline through feature-aligned diffusion. Our approach aligns intermediate features of the diffusion model to the output features of an expert, and our preliminary findings show an improvement of 9% in generation accuracy and ~0.12 in SSIM diversity. Our approach is also synergistic with existing methods, and easily integrated into diffusion training pipelines for improvements. We make our code available at this https URL.
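
The alignment described in the abstract, matching intermediate diffusion features to an expert's output features, amounts to adding an auxiliary feature-matching term to the usual denoising loss. A hedged sketch follows; the layer choice, projection head, and weight are assumptions rather than the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def feature_aligned_loss(noise_pred, noise_target, diff_features, expert_features,
                         proj_head, align_weight=0.1):
    """Denoising loss plus an MSE term aligning intermediate diffusion features with
    frozen expert features (illustrative, not the authors' exact objective)."""
    denoise_loss = F.mse_loss(noise_pred, noise_target)
    aligned = proj_head(diff_features)                 # map diffusion features to expert dim
    align_loss = F.mse_loss(aligned, expert_features.detach())
    return denoise_loss + align_weight * align_loss

# Dummy shapes: UNet mid-block features (B, 512) vs expert encoder features (B, 768)
proj = torch.nn.Linear(512, 768)
loss = feature_aligned_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
                            torch.randn(4, 512), torch.randn(4, 768), proj)
print(loss.item())
```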

[LG-18] Simplified priors for Object-Centric Learning

链接: https://arxiv.org/abs/2410.00728
作者: Vihang Patil,Andreas Radler,Daniel Klotz,Sepp Hochreiter
关键词-EN: continual learning systems, current continual learning, Simplified Slot Attention, excel at abstracting, capability lacking
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humans excel at abstracting data and constructing reusable concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, but most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully-differentiable, non-iterative, and scalable method called SAMP (Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive with or outperforms previous methods on standard benchmarks.

[LG-19] Show Me What's Wrong: Combining Charts and Text to Guide Data Analysis

链接: https://arxiv.org/abs/2410.00727
作者: Beatriz Feliciano,Rita Costa,Jean Alves,Javier Liébana,Diogo Duarte,Pedro Bizarro
关键词-EN: Analyzing and finding, finding anomalies, anomalies in multi-dimensional, multi-dimensional datasets, cumbersome but vital
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Analyzing and finding anomalies in multi-dimensional datasets is a cumbersome but vital task across different domains. In the context of financial fraud detection, analysts must quickly identify suspicious activity among transactional data. This is an iterative process made of complex exploratory tasks such as recognizing patterns, grouping, and comparing. To mitigate the information overload inherent to these steps, we present a tool combining automated information highlights, Large Language Model generated textual insights, and visual analytics, facilitating exploration at different levels of detail. We perform a segmentation of the data per analysis area and visually represent each one, making use of automated visual cues to signal which require more attention. Upon user selection of an area, our system provides textual and graphical summaries. The text, acting as a link between the high-level and detailed views of the chosen segment, allows for a quick understanding of relevant details. A thorough exploration of the data comprising the selection can be done through graphical representations. The feedback gathered in a study performed with seven domain experts suggests our tool effectively supports and guides exploratory analysis, easing the identification of suspicious information.

[LG-20] Discriminative community detection for multiplex networks

链接: https://arxiv.org/abs/2410.00724
作者: Meiby Ortiz-Bouza,Selin Aviyente
关键词-EN: modeling complex systems, community structure, Multiplex networks, community, complex systems
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiplex networks have emerged as a promising approach for modeling complex systems, where each layer represents a different mode of interaction among entities of the same type. A core task in analyzing these networks is to identify the community structure for a better understanding of the overall functioning of the network. While different methods have been proposed to detect the community structure of multiplex networks, the majority deal with extracting the consensus community structure across layers. In this paper, we address the community detection problem across two closely related multiplex networks. For example in neuroimaging studies, it is common to have multiple multiplex brain networks where each layer corresponds to an individual and each group to different experimental conditions. In this setting, one may be interested in both learning the community structure representing each experimental condition and the discriminative community structure between two groups. In this paper, we introduce two discriminative community detection algorithms based on spectral clustering. The first approach aims to identify the discriminative subgraph structure between the groups, while the second one learns the discriminative and the consensus community structures, simultaneously. The proposed approaches are evaluated on both simulated and real world multiplex networks.
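
One simple way to illustrate the first variant, extracting a discriminative structure between two groups of multiplex networks, is to spectrally cluster the difference of the groups' layer-averaged adjacency matrices. This is a toy rendering of the idea, not the proposed algorithms.

```python
import numpy as np
from sklearn.cluster import KMeans

def discriminative_communities(group_a_layers, group_b_layers, k=2):
    """Cluster nodes using the leading eigenvectors of the difference between the
    two groups' average adjacency matrices (illustrative simplification)."""
    avg_a = np.mean(group_a_layers, axis=0)
    avg_b = np.mean(group_b_layers, axis=0)
    diff = avg_a - avg_b                      # symmetric if the inputs are symmetric
    vals, vecs = np.linalg.eigh(diff)
    idx = np.argsort(np.abs(vals))[::-1][:k]  # eigenvectors with largest-magnitude eigenvalues
    embedding = vecs[:, idx]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

rng = np.random.default_rng(0)
n = 30
layers_a = rng.random((4, n, n)); layers_a = (layers_a + layers_a.transpose(0, 2, 1)) / 2
layers_b = rng.random((4, n, n)); layers_b = (layers_b + layers_b.transpose(0, 2, 1)) / 2
print(discriminative_communities(layers_a, layers_b))
```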

[LG-21] On the Geometry and Optimization of Polynomial Convolutional Networks

链接: https://arxiv.org/abs/2410.00722
作者: Vahid Shahverdi,Giovanni Luca Marchetti,Kathlén Kohn
关键词-EN: study convolutional neural, convolutional neural networks, monomial activation functions, study convolutional, convolutional neural
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:We study convolutional neural networks with monomial activation functions. Specifically, we prove that their parameterization map is regular and is an isomorphism almost everywhere, up to rescaling the filters. By leveraging tools from algebraic geometry, we explore the geometric properties of the image in function space of this map – typically referred to as the neuromanifold. In particular, we compute the dimension and the degree of the neuromanifold, which measure the expressivity of the model, and describe its singularities. Moreover, for a generic large dataset, we derive an explicit formula that quantifies the number of critical points arising in the optimization of a regression loss.

[LG-22] Pseudo-Non-Linear Data Augmentation via Energy Minimization

链接: https://arxiv.org/abs/2410.00718
作者: Pingbang Hu,Mahito Sugiyama
关键词-EN: interpretable data augmentation, augmentation method based, information geometry, based on energy-based, energy-based modeling
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel and interpretable data augmentation method based on energy-based modeling and principles from information geometry. Unlike black-box generative models, which rely on deep neural networks, our approach replaces these non-interpretable transformations with explicit, theoretically grounded ones, ensuring interpretability and strong guarantees such as energy minimization. Central to our method is the introduction of the backward projection algorithm, which reverses dimension reduction to generate new data. Empirical results demonstrate that our method achieves competitive performance with black-box generative models while offering greater transparency and interpretability.

[LG-23] Contrastive Abstraction for Reinforcement Learning

链接: https://arxiv.org/abs/2410.00704
作者: Vihang Patil,Markus Hofmarcher,Elisabeth Rumetshofer,Sepp Hochreiter
关键词-EN: contrastive abstraction learning, Learning, abstraction learning, Abstract, contrastive abstraction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e., to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, contrastive abstraction learning does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.

[LG-24] Investigating the Impact of Model Complexity in Large Language Models

链接: https://arxiv.org/abs/2410.00699
作者: Jing Luo,Huiyuan Wang,Weiran Huang
关键词-EN: Large Language Models, solving natural language, natural language processing, language processing tasks, Large Language
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) based on the pre-trained fine-tuning paradigm have become pivotal in solving natural language processing tasks, consistently achieving state-of-the-art performance. Nevertheless, the theoretical understanding of how model complexity influences fine-tuning performance remains challenging and has not been well explored yet. In this paper, we focus on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to model them. Based on the HMM modeling, we investigate the relationship between model complexity and the generalization capability in downstream tasks. Specifically, we consider a popular tuning paradigm for downstream tasks, head tuning, where all pre-trained parameters are frozen and only individual heads are trained atop pre-trained LLMs. Our theoretical analysis reveals that the risk initially increases and then decreases with rising model complexity, showcasing a “double descent” phenomenon. In this case, the initial “descent” is degenerate, signifying that the “sweet spot” where bias and variance are balanced occurs when the model size is zero. Obtaining the conclusion presented in this study confronts several challenges, primarily revolving around effectively modeling autoregressive LLMs and downstream tasks, as well as conducting a comprehensive risk analysis for multivariate regression. Our research is substantiated by experiments conducted on data generated from HMMs, which provide empirical support and alignment with our theoretical insights.

[LG-25] Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity

链接: https://arxiv.org/abs/2410.00690
作者: Quan Nguyen,Nishant A. Mehta,Cristóbal Guzmán
关键词-EN: group distributionally robust, distributionally robust optimization, distributionally robust, lambda, beta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 38 pages

点击查看摘要

Abstract:The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a \log(K) factor, for K the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we dub (\lambda, \beta)-sparsity. In short, this condition means that at any parameter \theta, there is a set of at most \beta groups whose risks at \theta are all at least \lambda larger than the risks of the other groups. To find an \epsilon-optimal \theta, we show via a novel algorithm and analysis that the \epsilon-dependent term in the sample complexity can swap a linear dependence on K for a linear dependence on the potentially much smaller \beta. This improvement leverages recent progress in sleeping bandits, showing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. The aforementioned result assumes having a particular \lambda as input. Perhaps surprisingly, we next show an adaptive algorithm which, up to log factors, gets a sample complexity that adapts to the best (\lambda, \beta)-sparsity condition that holds. Finally, for a particular input \lambda, we also show how to get a dimension-free sample complexity result.

[LG-26] Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models

链接: https://arxiv.org/abs/2410.00681
作者: Mazen Balat,Rewaa Awaad,Hend Adel,Ahmed B. Zaky,Salah A. Aly
关键词-EN: deep learning methods, Arabic Alphabet Sign, Alphabet Sign Language, Arabic sign language, Arabic Alphabet
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 8 figures

点击查看摘要

Abstract:This paper presents an Arabic Alphabet Sign Language recognition approach, using deep learning methods in conjunction with transfer learning and transformer-based models. We study the performance of the different variants on two publicly available datasets, namely ArSL2018 and AASL. This task makes full use of state-of-the-art CNN architectures like ResNet50, MobileNetV2, and EfficientNetB7, and the latest transformer models such as Google ViT and Microsoft Swin Transformer. These pre-trained models have been fine-tuned on the above datasets in an attempt to capture some unique features of Arabic sign language motions. Experimental results show that the suggested methodology achieves high recognition accuracy of up to 99.6% and 99.43% on ArSL2018 and AASL, respectively, far beyond previously reported state-of-the-art approaches. This performance opens up more avenues of communication that may be accessible to Arabic-speaking deaf and hard-of-hearing individuals, and thus encourages an inclusive society.
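
The transfer-learning recipe underlying this line of work is standard: load an ImageNet-pretrained backbone, replace its classification head with one sized for the sign-language classes, and fine-tune. A generic torchvision sketch is shown below; the class count and hyperparameters are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 32  # placeholder for the number of Arabic alphabet sign classes

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # downloads ImageNet weights
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head for sign classes

# Optionally freeze the backbone and train only the new head first
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step to show the loop structure
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("step loss:", loss.item())
```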

[LG-27] Stabilizing the Kumaraswamy Distribution

链接: https://arxiv.org/abs/2410.00660
作者: Max Wasserman,Gonzalo Mateos
关键词-EN: require expressive continuous, support efficient sampling, Large-scale latent variable, expressive continuous distributions, reparameterization trick
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the inverse CDF and log-pdf, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models based on the KS, improving exploration-exploitation trade-offs in contextual multi-armed bandits and enhancing uncertainty quantification for link prediction with graph neural networks. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.
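
The Kumaraswamy inverse CDF is F^{-1}(u; a, b) = (1 - (1 - u)^{1/b})^{1/a}, and the naive form loses precision when u approaches 0 or 1. The snippet contrasts it with a log1p/expm1 reformulation in the spirit of the stabilization the abstract describes; it is not the authors' code.

```python
import torch

def kumaraswamy_icdf_naive(u, a, b):
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def kumaraswamy_icdf_stable(u, a, b):
    # (1 - u)^(1/b) = exp(log1p(-u)/b), so 1 - (1 - u)^(1/b) = -expm1(log1p(-u)/b)
    inner = -torch.expm1(torch.log1p(-u) / b)
    return torch.exp(torch.log(inner) / a)

u = torch.tensor([1e-8, 0.5, 1 - 1e-8], dtype=torch.float32)
a, b = torch.tensor(2.0), torch.tensor(5.0)
print(kumaraswamy_icdf_naive(u, a, b))   # underflows to 0 for u near 0 in float32
print(kumaraswamy_icdf_stable(u, a, b))  # stays finite and accurate near the boundaries
```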

[LG-28] AutoTM 2.0: Automatic Topic Modeling Framework for Documents Analysis

链接: https://arxiv.org/abs/2410.00655
作者: Maria Khodorchenko,Nikolay Butakov,Maxim Zuev,Denis Nasonov
关键词-EN: regularized topic models, optimizing additively regularized, additively regularized topic, framework for optimizing, topic models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this work, we present the AutoTM 2.0 framework for optimizing additively regularized topic models. Compared to the previous version, this version includes valuable improvements such as a novel optimization pipeline, LLM-based quality metrics, and a distributed mode. AutoTM 2.0 is a convenient tool for specialists as well as non-specialists to work with text documents, conduct exploratory data analysis, or perform clustering on an interpretable set of features. Quality evaluation is based on specially developed metrics such as coherence and GPT-4-based approaches. Researchers and practitioners can easily integrate new optimization algorithms and adapt novel metrics to enhance modeling quality and extend their experiments. We show that AutoTM 2.0 achieves better performance compared to the previous AutoTM by providing results on 5 datasets with different features and in two different languages.

[LG-29] LASMP: Language Aided Subset Sampling Based Motion Planner

链接: https://arxiv.org/abs/2410.00649
作者: Saswati Bhattacharjee,Anirban Sinha,Chinwe Ekenna
关键词-EN: Aided Subset Sampling, Subset Sampling Based, Language Aided Subset, Aided Subset, Subset Sampling
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:This paper presents the Language Aided Subset Sampling Based Motion Planner (LASMP), a system that helps mobile robots plan their movements by using natural language instructions. LASMP uses a modified version of the Rapidly Exploring Random Tree (RRT) method, which is guided by user-provided commands processed through a language model (RoBERTa). The system improves efficiency by focusing on specific areas of the robot’s workspace based on these instructions, making it faster and less resource-intensive. Compared to traditional RRT methods, LASMP reduces the number of nodes needed by 55% and cuts random sample queries by 80%, while still generating safe, collision-free paths. Tested in both simulated and real-world environments, LASMP has shown better performance in handling complex indoor scenarios. The results highlight the potential of combining language processing with motion planning to make robot navigation more efficient.

[LG-30] ICL-TSVD: Bridging Theory and Practice in Continual Learning with Pre-trained Models

链接: https://arxiv.org/abs/2410.00645
作者: Liangzu Peng,Juan Elenter,Joshua Agterberg,Alejandro Ribeiro,René Vidal
关键词-EN: tasks presented sequentially, presented sequentially, solve multiple tasks, multiple tasks presented, Ideal Continual Learner
类目: Machine Learning (cs.LG)
*备注: 45 pages, 19 figures, 14 tables (Preprint, Oct 1, 2024)

点击查看摘要

Abstract:The goal of continual learning (CL) is to train a model that can solve multiple tasks presented sequentially. Recent CL approaches have achieved strong performance by leveraging large pre-trained models that generalize well to downstream tasks. However, such methods lack theoretical guarantees, making them prone to unexpected failures. Conversely, principled CL approaches often fail to achieve competitive performance. In this work, we bridge this gap between theory and practice by integrating an empirically strong approach (RanPAC) into a principled framework, Ideal Continual Learner (ICL), designed to prevent forgetting. Specifically, we lift pre-trained features into a higher dimensional space and formulate an over-parametrized minimum-norm least-squares problem. We find that the lifted features are highly ill-conditioned, potentially leading to large training errors (numerical instability) and increased generalization errors (double descent). We address these challenges by continually truncating the singular value decomposition (SVD) of the lifted features. Our approach, termed ICL-TSVD, is stable with respect to the choice of hyperparameters, can handle hundreds of tasks, and outperforms state-of-the-art CL methods on multiple datasets. Importantly, our method satisfies a recurrence relation throughout its continual learning process, which allows us to prove it maintains small training and generalization errors by appropriately truncating a fraction of SVD factors. This results in a stable continual learning method with strong empirical performance and theoretical guarantees.
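
The computational core sketched in the abstract, lifting pre-trained features with a random nonlinear projection and solving a minimum-norm least-squares problem whose conditioning is controlled by truncating the SVD, can be written in a few lines of NumPy. The lifting dimension, truncation rule, and names below are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def lift(features, lift_dim=2000, seed=0):
    """Random ReLU lifting of pre-trained features into a higher-dimensional space."""
    proj = np.random.default_rng(seed).normal(size=(features.shape[1], lift_dim))
    return np.maximum(features @ proj, 0.0)

def truncated_svd_solve(H, Y, keep=0.9):
    """Min-norm least-squares solution using only the top singular directions.
    `keep` is the fraction of SVD factors retained (illustrative truncation rule)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    r = max(1, int(keep * len(s)))
    return Vt[:r].T @ ((U[:, :r].T @ Y) / s[:r, None])

# Dummy pre-trained features (n samples, d dims) and one-hot targets for 10 classes
X = rng.normal(size=(500, 128))
Y = np.eye(10)[rng.integers(0, 10, size=500)]
H = lift(X)
W = truncated_svd_solve(H, Y)
pred = (H @ W).argmax(axis=1)
print("train accuracy:", (pred == Y.argmax(axis=1)).mean())
```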

[LG-31] Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

链接: https://arxiv.org/abs/2410.00564
作者: Jie Cheng,Ruixi Qiao,Gang Xiong,Qinghai Miao,Yingwei Ma,Binhua Li,Yongbin Li,Yisheng Lv
关键词-EN: heterogeneous datasets, significant aspiration, develop a generalist, high capabilities, offline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world models for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data points, corresponding to about 4 trajectories per game, which demonstrates the superior generalization of JOWA. We will release codes at this https URL.

[LG-32] Best Practices for Multi-Fidelity Bayesian Optimization in Materials and Molecular Research

链接: https://arxiv.org/abs/2410.00544
作者: Víctor Sabanza-Gil,Riccardo Barbano,Daniel Pacheco Gutiérrez,Jeremy S. Luterbacher,José Miguel Hernández-Lobato,Philippe Schwaller,Loïc Roch
关键词-EN: Multi-fidelity Bayesian Optimization, Multi-fidelity Bayesian, Bayesian Optimization, promising framework, framework to speed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-fidelity Bayesian Optimization (MFBO) is a promising framework to speed up materials and molecular discovery as sources of information of different accuracies are at hand at increasing cost. Despite its potential use in chemical tasks, there is a lack of systematic evaluation of the many parameters playing a role in MFBO. In this work, we provide guidelines and recommendations to decide when to use MFBO in experimental settings. We investigate MFBO methods applied to molecules and materials problems. First, we test two different families of acquisition functions in two synthetic problems and study the effect of the informativeness and cost of the approximate function. We use our implementation and guidelines to benchmark three real discovery problems and compare them against their single-fidelity counterparts. Our results may help guide future efforts to implement MFBO as a routine tool in the chemical sciences.

[LG-33] Differentially Private Active Learning: Balancing Effective Data Selection and Privacy

链接: https://arxiv.org/abs/2410.00542
作者: Kristian Schwethelm,Johannes Kaiser,Jonas Kuntzer,Mehmet Yigitsoy,Daniel Rueckert,Georgios Kaissis
关键词-EN: iteratively selecting, widely used technique, learning, Active learning, data
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Active learning (AL) is a widely used technique for optimizing data labeling in machine learning by iteratively selecting, labeling, and training on the most informative data. However, its integration with formal privacy-preserving methods, particularly differential privacy (DP), remains largely underexplored. While some works have explored differentially private AL for specialized scenarios like online learning, the fundamental challenge of combining AL with DP in standard learning settings has remained unaddressed, severely limiting AL’s applicability in privacy-sensitive domains. This work addresses this gap by introducing differentially private active learning (DP-AL) for standard learning settings. We demonstrate that naively integrating DP-SGD training into AL presents substantial challenges in privacy budget allocation and data utilization. To overcome these challenges, we propose step amplification, which leverages individual sampling probabilities in batch creation to maximize data point participation in training steps, thus optimizing data utilization. Additionally, we investigate the effectiveness of various acquisition functions for data selection under privacy constraints, revealing that many commonly used functions become impractical. Our experiments on vision and natural language processing tasks show that DP-AL can improve performance for specific datasets and model architectures. However, our findings also highlight the limitations of AL in privacy-constrained environments, emphasizing the trade-offs between privacy, model accuracy, and data selection accuracy.

[LG-34] Optimal Causal Representations and the Causal Information Bottleneck ICLR2025

链接: https://arxiv.org/abs/2410.00535
作者: Francisco N. F. Q. Simoes,Mehdi Dastani,Thijs van Ommen
关键词-EN: preserving key features, effectively study complex, discarding irrelevant details, complex causal systems, study complex causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: Submitted to ICLR 2025. Code available at this http URL

点击查看摘要

Abstract:To effectively study complex causal systems, it is often useful to construct representations that simplify parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach in representation learning that compresses random variables while retaining information about a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces representations which are causally interpretable, and which can be used when reasoning about interventions. We present experimental results demonstrating that the learned representations accurately capture causality as intended.

[LG-35] Deep Model Interpretation with Limited Data: A Coreset-based Approach

链接: https://arxiv.org/abs/2410.00524
作者: Hamed Behzadi-Khormouji,José Oramas
关键词-EN: Model Interpretation aims, model interpretation methods, trained model, methods, Model Interpretation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Model Interpretation aims at the extraction of insights from the internals of a trained model. A common approach to address this task is the characterization of relevant features internally encoded in the model that are critical for its proper operation. Despite the recent progress of these methods, they come with the weakness of being computationally expensive due to the dense evaluation of datasets that they require. As a consequence, research on the design of these methods has focused on smaller data subsets, which may lead to reduced insights. To address these computational costs, we propose a coreset-based interpretation framework that utilizes coreset selection methods to sample a representative subset of the large dataset for the interpretation task. Towards this goal, we propose a similarity-based evaluation protocol to assess the robustness of model interpretation methods towards the amount of data they take as input. Experiments considering several interpretation methods, DNN models, and coreset selection methods show the effectiveness of the proposed framework.

[LG-36] Advancing RVFL networks: Robust classification with the HawkEye loss function

链接: https://arxiv.org/abs/2410.00510
作者: Mushir Akhtar,M. Tanveer,Mohd. Arshad
关键词-EN: Random vector functional, vector functional link, single-layer feedforward neural, feedforward neural network, Random vector
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Random vector functional link (RVFL), a variant of single-layer feedforward neural network (SLFN), has garnered significant attention due to its lower computational cost and robustness to overfitting. Despite its advantages, the RVFL network’s reliance on the square error loss function makes it highly sensitive to outliers and noise, leading to degraded model performance in real-world applications. To remedy it, we propose the incorporation of the HawkEye loss (H-loss) function into the RVFL framework. The H-loss function features nice mathematical properties, including smoothness and boundedness, while simultaneously incorporating an insensitive zone. Each characteristic brings its own advantages: 1) Boundedness limits the impact of extreme errors, enhancing robustness against outliers; 2) Smoothness facilitates the use of gradient-based optimization algorithms, ensuring stable and efficient convergence; and 3) The insensitive zone mitigates the effect of minor discrepancies and noise. Leveraging the H-loss function, we embed it into the RVFL framework and develop a novel robust RVFL model termed H-RVFL. Notably, this work addresses a significant gap, as no bounded loss function has been incorporated into RVFL to date. The non-convex optimization of the proposed H-RVFL is effectively addressed by the Nesterov accelerated gradient (NAG) algorithm, whose computational complexity is also discussed. The proposed H-RVFL model’s effectiveness is validated through extensive experiments on 40 benchmark datasets from UCI and KEEL repositories, with and without label noise. The results highlight significant improvements in robustness and efficiency, establishing the H-RVFL model as a powerful tool for applications in noisy and outlier-prone environments.
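To make the setup concrete, here is a toy RVFL sketch with frozen random hidden weights, direct input links, and Nesterov-momentum updates of the output weights under a generic bounded, smooth robust loss. The actual HawkEye loss and the classification formulation of the paper are not reproduced; the Welsch-style loss below is purely a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvfl_features(X, W, b):
    """RVFL feature map: frozen random enhancement nodes plus direct input links."""
    H = np.tanh(X @ W + b)
    return np.hstack([X, H])

def robust_loss_grad(r, c=1.0):
    """Gradient of a placeholder bounded, smooth loss 1 - exp(-(r/c)^2)
    (NOT the paper's HawkEye loss)."""
    return (2.0 * r / c**2) * np.exp(-((r / c) ** 2))

# toy data and fixed random hidden weights
X = rng.normal(size=(200, 5)); y = rng.normal(size=(200, 1))
W = rng.normal(size=(5, 50)); b = rng.normal(size=(1, 50))
Phi = rvfl_features(X, W, b)

# output weights trained with Nesterov accelerated gradient (NAG)
beta = np.zeros((Phi.shape[1], 1)); v = np.zeros_like(beta)
lr, mu = 1e-3, 0.9
for _ in range(500):
    lookahead = beta + mu * v                      # NAG look-ahead point
    grad = Phi.T @ robust_loss_grad(Phi @ lookahead - y) / len(y)
    v = mu * v - lr * grad
    beta = beta + v
```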

[LG-37] Learning Personalized Treatment Decisions in Precision Medicine: Disentangling Treatment Assignment Bias in Counterfactual Outcome Prediction and Biomarker Identification

链接: https://arxiv.org/abs/2410.00509
作者: Michael Vollenweider,Manuel Schürch,Chiara Rohrer,Gabriele Gut,Michael Krauthammer,Andreas Wicki
关键词-EN: faces significant challenges, significant challenges due, individual patients, offers the potential, potential to tailor
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 5 figures, conference

点击查看摘要

Abstract:Precision medicine offers the potential to tailor treatment decisions to individual patients, yet it faces significant challenges due to the complex biases in clinical observational data and the high-dimensional nature of biological data. This study models various types of treatment assignment biases using mutual information and investigates their impact on machine learning (ML) models for counterfactual prediction and biomarker identification. Unlike traditional counterfactual benchmarks that rely on fixed treatment policies, our work focuses on modeling different characteristics of the underlying observational treatment policy in distinct clinical settings. We validate our approach through experiments on toy datasets, semi-synthetic tumor cancer genome atlas (TCGA) data, and real-world biological outcomes from drug and CRISPR screens. By incorporating empirical biological mechanisms, we create a more realistic benchmark that reflects the complexities of real-world data. Our analysis reveals that different biases lead to varying model performances, with some biases, especially those unrelated to outcome mechanisms, having minimal effect on prediction accuracy. This highlights the crucial need to account for specific biases in clinical observational data in counterfactual ML model development, ultimately enhancing the personalization of treatment decisions in precision medicine.

[LG-38] Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach EMNLP2024

链接: https://arxiv.org/abs/2410.00502
作者: Diogo Pernes,Gonçalo M. Correia,Afonso Mendes
关键词-EN: Cross-lingual summarization aims, bridge language barriers, aims to bridge, Cross-lingual summarization, multi-target cross-lingual summarization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to EMNLP 2024 (Findings)

点击查看摘要

Abstract:Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.

[LG-39] Enhancing Solution Efficiency in Reinforcement Learning: Leveraging Sub-GFlowNet and Entropy Integration

链接: https://arxiv.org/abs/2410.00461
作者: Siyi He
关键词-EN: Chain Monte Carlo, black-box function optimization, Traditional reinforcement learning, Markov Chain Monte, reinforcement learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional reinforcement learning often struggles to generate diverse, high-reward solutions, especially in domains like drug design and black-box function optimization. Markov Chain Monte Carlo (MCMC) methods provide an alternative to RL for candidate selection but suffer from high computational costs and limited ability to explore candidate diversity. In response, GFlowNet, a novel neural network architecture, was introduced to model complex system dynamics and generate diverse high-reward trajectories. To further enhance this approach, this paper proposes improvements to GFlowNet by introducing a new loss function and refining the training objective associated with sub-GFlowNet. These enhancements aim to integrate entropy and leverage network structure characteristics, improving both candidate diversity and computational efficiency. We demonstrate the superiority of the refined GFlowNet over traditional methods through empirical results from hypergrid experiments and molecule synthesis tasks. The findings underscore the effectiveness of incorporating entropy and exploiting network structure properties for solution generation, both in molecule synthesis and in diverse experimental designs.
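For reference, the widely used trajectory-balance training loss for GFlowNets is shown below; the modified loss and sub-GFlowNet objective proposed in the paper are not reproduced here.

```latex
% Trajectory-balance loss for a complete trajectory \tau = (s_0 \to \dots \to s_n = x),
% with learnable partition function Z_\theta, forward policy P_F, backward policy P_B:
\mathcal{L}_{\mathrm{TB}}(\tau) =
\left( \log \frac{Z_\theta \prod_{t} P_F(s_{t+1} \mid s_t; \theta)}
              {R(x) \prod_{t} P_B(s_t \mid s_{t+1}; \theta)} \right)^{\!2}
```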

[LG-40] UniAdapt: A Universal Adapter for Knowledge Calibration

链接: https://arxiv.org/abs/2410.00454
作者: Tai D. Nguyen,Long H. Pham,Jun Sun
关键词-EN: Large Language Models, Large Language, require frequent updates, continuously evolving knowledge, Language Models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) require frequent updates to correct errors and keep pace with continuously evolving knowledge in a timely and effective manner. Recent research in model editing has highlighted the challenges in balancing generalization and locality, especially in the context of lifelong model editing. We discover that inserting knowledge directly into the model often causes conflicts and potentially disrupts other unrelated pre-trained knowledge. To address this problem, we introduce UniAdapt, a universal adapter for knowledge calibration. Inspired by the Mixture of Experts architecture and Retrieval-Augmented Generation, UniAdapt is designed with a vector-assisted router that is responsible for routing inputs to appropriate experts. The router maintains a vector store, including multiple shards, to construct routing vectors based on semantic similarity search results. UniAdapt is fully model-agnostic and designed for seamless plug-and-play integration. Experimental results show that UniAdapt outperforms existing lifelong model editors and achieves exceptional results in most metrics.
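A hedged sketch of what a vector-assisted router could look like: score the input embedding against per-shard centroids and mix the top-k experts. Function and variable names are hypothetical; this is not the paper's implementation.

```python
import numpy as np

def route_to_experts(query_vec: np.ndarray, shard_centroids: np.ndarray, top_k: int = 2):
    """Hypothetical router: cosine similarity of the input embedding against each
    expert's shard centroid, returning the top-k experts and softmax mixing weights."""
    q = query_vec / np.linalg.norm(query_vec)
    c = shard_centroids / np.linalg.norm(shard_centroids, axis=1, keepdims=True)
    sims = c @ q                                    # one similarity score per expert
    top = np.argsort(sims)[::-1][:top_k]
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()
    return top, weights

# expert_ids, mix = route_to_experts(embed(prompt), centroids)   # hypothetical usage
```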

[LG-41] EKAN: Equivariant Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2410.00435
作者: Lexiang Hu,Yisen Wang,Zhouchen Lin
关键词-EN: Multi-Layer Perceptrons, Equivariant Multi-Layer Perceptrons, spline activation functions, great success, success in scientific
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have seen great success in scientific domains thanks to spline activation functions, becoming an alternative to Multi-Layer Perceptrons (MLPs). However, spline functions may not respect symmetry in tasks, which is crucial prior knowledge in machine learning. Previously, equivariant networks embed symmetry into their architectures, achieving better performance in specific applications. Among these, Equivariant Multi-Layer Perceptrons (EMLP) introduce arbitrary matrix group equivariance into MLPs, providing a general framework for constructing equivariant networks layer by layer. In this paper, we propose Equivariant Kolmogorov-Arnold Networks (EKAN), a method for incorporating matrix group equivariance into KANs, aiming to broaden their applicability to more fields. First, we construct gated spline basis functions, which form the EKAN layer together with equivariant linear weights. We then define a lift layer to align the input space of EKAN with the feature space of the dataset, thereby building the entire EKAN architecture. Compared with baseline models, EKAN achieves higher accuracy with smaller datasets or fewer parameters on symmetry-related tasks, such as particle scattering and the three-body problem, often reducing test MSE by several orders of magnitude. Even in non-symbolic formula scenarios, such as top quark tagging with three jet constituents, EKAN achieves comparable results with EMLP using only 26% of the parameters, while KANs do not outperform MLPs as expected.

[LG-42] Scalable Multi-Task Transfer Learning for Molecular Property Prediction

链接: https://arxiv.org/abs/2410.00432
作者: Chanhui Lee,Dae-Woong Jeong,Sung Moon Ko,Sumin Lee,Hyunseung Kim,Soorin Yim,Sehui Han,Sungwoong Kim,Sungbin Lim
关键词-EN: application vary, transfer learning, Molecules, distinct properties, transfer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Molecules have a number of distinct properties whose importance and application vary. In practice, labels for some properties are often hard to obtain despite their practical importance. A common solution to such data scarcity is transfer learning with models that generalize well. This involves domain experts designing source and target tasks that share features. However, this approach has limitations: (i) accurately designing source-target task pairs is difficult due to the large number of tasks; (ii) verifying the many trial-and-error transfer learning designs carries a corresponding computational burden; and (iii) both of these constrain the potential of foundation modeling for multi-task molecular property prediction. We address the limitations of manually designed transfer learning via data-driven bi-level optimization. The proposed method enables scalable multi-task transfer learning for molecular property prediction by automatically obtaining the optimal transfer ratios. Empirically, the proposed method improved the prediction performance of 40 molecular properties and accelerated training convergence.

[LG-43] LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management

链接: https://arxiv.org/abs/2410.00428
作者: Yi Xiong,Hao Wu,Changxu Shao,Ziqing Wang,Rui Zhang,Yuhong Guo,Junping Zhao,Ke Zhang,Zhenxuan Pan
关键词-EN: expanding context windows, large language models, introduce significant challenges, maintaining low latency, windows in large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, 1 table

点击查看摘要

Abstract:The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which are caused by the growing demands for GPU Key-Value (KV) cache allocation clashing with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that effectively reduces TTFT without requiring additional hardware or compromising output performance, while seamlessly integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory, coupled with an SLO-aware scheduler to optimize overall Service Level Objectives (SLOs). Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 11x and reduces SLO violation rates by 28.7%, significantly enhancing the user experience.

[LG-44] ECORS: An Ensembled Clustering Approach to Eradicate The Local And Global Outlier In Collaborative Filtering Recommender System

链接: https://arxiv.org/abs/2410.00408
作者: Mahamudul Hasan
关键词-EN: suggest items based, helping users navigate, Recommender systems, designed to suggest, suggest items
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Recommender systems are designed to suggest items based on user preferences, helping users navigate the vast amount of information available on the internet. Given the overwhelming content, outlier detection has emerged as a key research area in recommender systems. It involves identifying unusual or suspicious patterns in user behavior. However, existing studies in this field face several challenges, including the limited universality of algorithms, difficulties in selecting users, and a lack of optimization. In this paper, we propose an approach that addresses these challenges by employing various clustering algorithms. Specifically, we utilize a user-user matrix-based clustering technique to detect outliers. By constructing a user-user matrix, we can identify suspicious users in the system. Both local and global outliers are detected to ensure comprehensive analysis. Our experimental results demonstrate that this approach significantly improves the accuracy of outlier detection in recommender systems.
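A minimal sketch of the user-user matrix idea, under the simplifying assumption of a single clustering pass (cosine similarity plus k-means) with unusually small clusters flagged as outlier candidates; the paper's ensemble of clustering algorithms and its separate local/global outlier criteria are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def flag_outlier_users(ratings: np.ndarray, n_clusters: int = 5,
                       min_cluster_frac: float = 0.02) -> np.ndarray:
    """Hypothetical sketch: build a user-user similarity matrix from the user-item
    ratings, cluster users on it, and flag members of very small clusters."""
    user_user = cosine_similarity(ratings)          # (n_users, n_users) matrix
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(user_user)
    sizes = np.bincount(labels, minlength=n_clusters)
    small = set(np.where(sizes < min_cluster_frac * len(ratings))[0])
    return np.array([u for u, lab in enumerate(labels) if lab in small])
```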

[LG-45] Metric-Based Few-Shot Learning for Exercise Repetition Counting with IMU Data

链接: https://arxiv.org/abs/2410.00407
作者: Yooseok Lim,Sujee Lee
关键词-EN: analyzing IMU signals, IMU signals, analyzing IMU, universal exercise repetition, study develops
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study develops a method to automatically count exercise repetitions by analyzing IMU signals, with a focus on a universal exercise repetition counting task that counts all types of exercise movements, including novel exercises not seen during training, using a single model. A key challenge in developing such a model is handling the considerable variation in peak patterns across different types of exercises. Since peak patterns can vary significantly between different exercises as well as between individuals performing the same exercise, the model needs to learn a complex embedding space of sensor data to generalize effectively. To address this challenge, we propose a repetition counting technique utilizing a deep metric-based few-shot learning approach, designed to handle both existing and novel exercises. By redefining the counting task as a few-shot classification problem, the method is capable of detecting peak repetition patterns in exercises not seen during training. The approach employs a Siamese network with triplet loss, optimizing the embedding space to distinguish between peak and non-peak frames. The proposed framework is composed of three main phases: standard classification training, few-shot training, and fine-tuning for novel exercises, followed by post-processing to refine the final repetition counts. Evaluation results demonstrate the effectiveness of the proposed approach, showing an 86.8% probability of accurately counting ten or more repetitions within a single set across 28 different exercises. This performance highlights the model’s ability to generalize across various exercise types, including those not present in the training data. Such robustness and adaptability make the system a strong candidate for real-time implementation in fitness and healthcare applications.
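For reference, the standard triplet loss used to train Siamese embeddings of the kind described above is:

```latex
% Triplet loss over anchor a, positive p, negative n with embedding f(.) and margin \alpha:
\mathcal{L}_{\text{triplet}} =
\max\!\big(0,\; \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + \alpha\big)
```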

[LG-46] Revisiting Essential and Nonessential Settings of Evidential Deep Learning

链接: https://arxiv.org/abs/2410.00393
作者: Mengyuan Chen,Junyu Gao,Changsheng Xu
关键词-EN: Evidential Deep Learning, attracting significant attention, single forward pass, Evidential Deep, Deep Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, under review

点击查看摘要

Abstract:Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despite its success, EDL incorporates several nonessential settings: In model construction, (1) a commonly ignored prior weight parameter is fixed to the number of classes, while its value actually impacts the balance between the proportion of evidence and its magnitude in deriving predictive scores. In model optimization, (2) the empirical risk features a variance-minimizing optimization term that biases the PDF towards a Dirac delta function, potentially exacerbating overconfidence. (3) Additionally, the structural risk typically includes a KL-divergence-minimizing regularization, whose optimization direction extends beyond the intended purpose and contradicts common sense, diminishing the information carried by the evidence magnitude. Therefore, we propose Re-EDL, a simplified yet more effective variant of EDL, by relaxing the nonessential settings and retaining the essential one, namely, the adoption of projected probability from subjective logic. Specifically, Re-EDL treats the prior weight as an adjustable hyperparameter rather than a fixed scalar, and directly optimizes the expectation of the provided Dirichlet PDF by deprecating both the variance-minimizing optimization term and the divergence regularization term. Extensive experiments and state-of-the-art performance validate the effectiveness of our method. The source code is available at this https URL.
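For readers unfamiliar with the subjective-logic quantities involved, the standard relationships below show how evidence e_k and the prior weight W yield the projected probability that Re-EDL retains (uniform base rates a_k = 1/K); Re-EDL's change is to treat W as a tunable hyperparameter instead of fixing W = K.

```latex
% Standard subjective-logic / EDL quantities (background):
\alpha_k = e_k + a_k W, \qquad S = \sum_{k} \alpha_k, \qquad
b_k = \frac{e_k}{S}, \qquad u = \frac{W}{S}, \qquad
P(k) = b_k + a_k u = \frac{\alpha_k}{S} \quad \text{(projected probability)}
```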

[LG-47] Seamless Augmented Reality Integration in Arthroscopy: A Pipeline for Articular Reconstruction and Guidance

链接: https://arxiv.org/abs/2410.00386
作者: Hongchao Shu,Mingxu Liu,Lalithkumar Seenivasan,Suxi Gu,Ping-Cheng Ku,Jonathan Knopf,Russell Taylor,Mathias Unberath
关键词-EN: treat joint problems, minimally invasive surgical, invasive surgical procedure, minimally invasive, diagnose and treat
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 8 pages, with 2 additional pages as the supplementary. Accepted by AE-CAI 2024

点击查看摘要

Abstract:Arthroscopy is a minimally invasive surgical procedure used to diagnose and treat joint problems. The clinical workflow of arthroscopy typically involves inserting an arthroscope into the joint through a small incision, during which surgeons navigate and operate largely by relying on their visual assessment through the arthroscope. However, the arthroscope’s restricted field of view and lack of depth perception pose challenges in navigating complex articular structures and achieving surgical precision during procedures. Aiming at enhancing intraoperative awareness, we present a robust pipeline that incorporates simultaneous localization and mapping, depth estimation, and 3D Gaussian splatting to realistically reconstruct intra-articular structures solely based on monocular arthroscope video. Extending 3D reconstruction to Augmented Reality (AR) applications, our solution offers AR assistance for articular notch measurement and annotation anchoring in a human-in-the-loop manner. Compared to traditional Structure-from-Motion and Neural Radiance Field-based methods, our pipeline achieves dense 3D reconstruction and competitive rendering fidelity with explicit 3D representation in 7 minutes on average. When evaluated on four phantom datasets, our method achieves RMSE = 2.21mm reconstruction error, PSNR = 32.86 and SSIM = 0.89 on average. Because our pipeline enables AR reconstruction and guidance directly from monocular arthroscopy without any additional data and/or hardware, our solution may hold the potential for enhancing intraoperative awareness and facilitating surgical precision in arthroscopy. Our AR measurement tool achieves accuracy within 1.59 +/- 1.81mm and the AR annotation tool achieves a mIoU of 0.721.

[LG-48] STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting

链接: https://arxiv.org/abs/2410.00385
作者: Hongjun Wang,Jiyuan Chen,Tong Pan,Zheng Dong,Lingyu Zhang,Renhe Jiang,Xuan Song
关键词-EN: smart city management, efficient resource allocation, city management, transportation planning, Traffic forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Traffic forecasting is a cornerstone of smart city management, enabling efficient resource allocation and transportation planning. Deep learning, with its ability to capture complex nonlinear patterns in spatiotemporal (ST) data, has emerged as a powerful tool for traffic forecasting. While graph convolutional networks (GCNs) and transformer-based models have shown promise, their computational demands often hinder their application to real-world road networks, particularly those with large-scale spatiotemporal interactions. To address these challenges, we propose a novel spatiotemporal graph transformer (STGformer) architecture. STGformer effectively balances the strengths of GCNs and Transformers, enabling efficient modeling of both global and local traffic patterns while maintaining a manageable computational footprint. Unlike traditional approaches that require multiple attention layers, the STG attention block captures high-order spatiotemporal interactions in a single layer, significantly reducing computational cost. In particular, STGformer achieves a 100x speedup and a 99.8% reduction in GPU memory usage compared to STAEformer during batch inference on a California road graph with 8,600 sensors. We evaluate STGformer on the LargeST benchmark and demonstrate its superiority over state-of-the-art Transformer-based methods such as PDFormer and STAEformer, underlining STGformer's potential to revolutionize traffic forecasting by overcoming the computational and memory limitations of existing approaches and making it a promising foundation for future spatiotemporal modeling tasks.

[LG-49] Generative Precipitation Downscaling using Score-based Diffusion with Wasserstein Regularization

链接: https://arxiv.org/abs/2410.00381
作者: Yuhao Liu,James Doss-Gollin,Guha Balakrishnan,Ashok Veeraraghavan
关键词-EN: Understanding local risks, sample rare events, assess localized hazards, Understanding local, Climate Prediction Center
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Understanding local risks from extreme rainfall, such as flooding, requires both long records (to sample rare events) and high-resolution products (to assess localized hazards). Unfortunately, there is a dearth of long-record and high-resolution products that can be used to understand local risk and precipitation science. In this paper, we present a novel generative diffusion model that downscales (super-resolves) globally available Climate Prediction Center (CPC) gauge-based precipitation products and ERA5 reanalysis data to generate kilometer-scale precipitation estimates. Downscaling gauge-based precipitation from 55 km to 1 km while recovering extreme rainfall signals poses significant challenges. To ensure that our model (named WassDiff) produces well-calibrated precipitation intensity values, we introduce a Wasserstein Distance Regularization (WDR) term for the score-matching training objective in the diffusion denoising process. We show that WDR greatly enhances the model's ability to capture extreme values compared to diffusion without WDR. Extensive evaluation shows that WassDiff has better reconstruction accuracy and bias scores than conventional score-based diffusion models. Case studies of extreme weather phenomena, like tropical storms and cold fronts, demonstrate WassDiff's ability to produce appropriate spatial patterns while capturing extremes. Such downscaling capability enables the generation of extensive km-scale precipitation datasets from existing historical global gauge records and current gauge measurements in areas without high-resolution radar.

[LG-50] CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

链接: https://arxiv.org/abs/2410.00379
作者: Xiao Wang,Fuling Wang,Yuehang Li,Qingchuan Ma,Shiao Wang,Bo Jiang,Chuanfu Li,Jin Tang
关键词-EN: patient wait times, significantly reduce diagnostic, reduce diagnostic burdens, X-ray image-based medical, image-based medical report
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: In Peer Review

点击查看摘要

Abstract:X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence which can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and the existing large models' insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs), on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for the X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation and X-ray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. Source code can be found at this https URL.

[LG-51] Robust Traffic Forecasting against Spatial Shift over Years

链接: https://arxiv.org/abs/2410.00373
作者: Hongjun Wang,Jiyuan Chen,Tong Pan,Zheng Dong,Lingyu Zhang,Renhe Jiang,Xuan Song
关键词-EN: Graph Neural Networks, Neural Networks, demonstrated promising potential, Spatiotemporal Graph Neural, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent advancements in Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have demonstrated promising potential for traffic forecasting by effectively capturing both temporal and spatial correlations. The generalization ability of spatiotemporal models has received considerable attention in recent scholarly discourse. However, no substantive datasets specifically addressing traffic out-of-distribution (OOD) scenarios have been proposed. Existing ST-OOD methods are either constrained to testing on extant data or necessitate manual modifications to the dataset. Consequently, the generalization capacity of current spatiotemporal models in OOD scenarios remains largely underexplored. In this paper, we investigate state-of-the-art models using newly proposed traffic OOD benchmarks and, surprisingly, find that these models experience a significant decline in performance. Through meticulous analysis, we attribute this decline to the models’ inability to adapt to previously unobserved spatial relationships. To address this challenge, we propose a novel Mixture of Experts (MoE) framework, which learns a set of graph generators (i.e., graphons) during training and adaptively combines them to generate new graphs based on novel environmental conditions to handle spatial distribution shifts during testing. We further extend this concept to the Transformer architecture, achieving substantial improvements. Our method is both parsimonious and efficacious, and can be seamlessly integrated into any spatiotemporal model, outperforming current state-of-the-art approaches in addressing spatial dynamics.

[LG-52] Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare

链接: https://arxiv.org/abs/2410.00366
作者: Prasenjit Maji,Amit Kumar Mondal,Hemanta Kumar Mondal,Saraju P. Mohanty
关键词-EN: continuous monitoring devices, intelligent diagnostic systems, Explainable Artificial Intelligence, artificial intelligence, driving innovations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advancements in artificial intelligence (AI) have revolutionized smart healthcare, driving innovations in wearable technologies, continuous monitoring devices, and intelligent diagnostic systems. However, security, explainability, robustness, and performance optimization challenges remain critical barriers to widespread adoption in clinical environments. This research presents an innovative algorithmic method using the Adaptive Feature Evaluator (AFE) algorithm to improve feature selection in healthcare datasets and overcome these problems. By integrating Genetic Algorithms (GA), Explainable Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), AFE optimizes Clinical Decision Support Systems (CDSS), thereby enhancing predictive accuracy and interpretability. The proposed method is validated across three diverse healthcare datasets using six distinct machine learning algorithms, demonstrating its robustness and superiority over conventional feature selection techniques. The results underscore the transformative potential of AFE in smart healthcare, enabling personalized and transparent patient care. Notably, the AFE algorithm, when combined with a Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting its capability to improve clinical decision-making processes in real-world healthcare applications.

[LG-53] AARK: An Open Toolkit for Autonomous Racing Research

链接: https://arxiv.org/abs/2410.00358
作者: James Bockman,Matthew Howe,Adrian Orenstein,Feras Dayoub
关键词-EN: demands safe control, advanced vehicle safety, racing demands safe, periods of time, demands safe
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Autonomous racing demands safe control of vehicles at their physical limits for extended periods of time, providing insights into advanced vehicle safety systems which increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries with it a high barrier to entry. Physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop soft autonomous systems without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages, ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing to bring more people into the field and improve reproducibility: ACI provides researchers with a computer vision-friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables generation of depth, normal and semantic segmentation data for training computer vision models to use in perception systems; and ACMPC gives newcomers to the field a modular full-stack autonomous control solution to build from, capable of controlling vehicles. AARK aims to unify and democratise research into a field critical to providing safer roads and trusted autonomous systems.

[LG-54] Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

链接: https://arxiv.org/abs/2410.00357
作者: Hao Liu,Zecheng Zhang,Wenjing Liao,Hayden Schaeffer
关键词-EN: Neural scaling laws, scaling laws, Neural scaling, scaling laws play, deep operator networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural scaling laws play a pivotal role in the performance of deep neural networks and have been observed in a wide range of tasks. However, a complete theoretical framework for understanding these scaling laws remains underdeveloped. In this paper, we explore the neural scaling laws for deep operator networks, which involve learning mappings between function spaces, with a focus on the Chen and Chen style architecture. These approaches, which include the popular Deep Operator Network (DeepONet), approximate the output functions using a linear combination of learnable basis functions and coefficients that depend on the input functions. We establish a theoretical framework to quantify the neural scaling laws by analyzing its approximation and generalization errors. We articulate the relationship between the approximation and generalization errors of deep operator networks and key factors such as network model size and training data size. Moreover, we address cases where input functions exhibit low-dimensional structures, allowing us to derive tighter error bounds. These results also hold for deep ReLU networks and other similar structures. Our results offer a partial explanation of the neural scaling laws in operator learning and provide a theoretical foundation for their applications.
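Concretely, the Chen-and-Chen-style architecture referenced above (including DeepONet) approximates an operator G by pairing input-dependent branch coefficients with learnable trunk basis functions:

```latex
% DeepONet-style approximation: branch net b_k(u) supplies coefficients for the
% trunk basis functions t_k(y) evaluated at the query location y:
G(u)(y) \;\approx\; \sum_{k=1}^{p} b_k(u)\, t_k(y)
```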

[LG-55] A Taxonomy of Loss Functions for Stochastic Optimal Control

链接: https://arxiv.org/abs/2410.00345
作者: Carles Domingo-Enrich
关键词-EN: Stochastic optimal control, Stochastic optimal, SOC loss functions, optimal control, aims to direct
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stochastic optimal control (SOC) aims to direct the behavior of noisy systems and has widespread applications in science, engineering, and artificial intelligence. In particular, reward fine-tuning of diffusion and flow matching models and sampling from unnormalized distributions can be recast as SOC problems. A recent work has introduced Adjoint Matching (Domingo-Enrich et al., 2024), a loss function for SOC problems that vastly outperforms existing loss functions in the reward fine-tuning setup. The goal of this work is to clarify the connections between all the existing (and some new) SOC loss functions. Namely, we show that SOC loss functions can be grouped into classes that share the same gradient in expectation, which means that their optimization landscape is the same; they only differ in their gradient variance. We perform simple SOC experiments to understand the strengths and weaknesses of different loss functions.
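For orientation, a standard control-affine SOC problem of the kind these loss functions target can be written as follows; this is generic background rather than the paper's exact notation.

```latex
% Generic control-affine stochastic optimal control problem:
\min_{u}\; \mathbb{E}\!\left[\int_0^T \tfrac{1}{2}\|u(X_t,t)\|^2 + f(X_t,t)\,\mathrm{d}t + g(X_T)\right]
\quad \text{s.t.} \quad
\mathrm{d}X_t = \big(b(X_t,t) + \sigma(X_t,t)\,u(X_t,t)\big)\,\mathrm{d}t + \sigma(X_t,t)\,\mathrm{d}B_t
```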

[LG-56] Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces

链接: https://arxiv.org/abs/2410.00344
作者: Lilac Atassi
关键词-EN: Recent music generation, generation methods based, Recent music, context window, based on transformers
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: arXiv admin note: substantial text overlap with arXiv:2404.11976

点击查看摘要

Abstract:Recent music generation methods based on transformers have a context window of up to a minute. The music generated by these methods is largely unstructured beyond the context window. With a longer context window, learning long-scale structures from musical data is a prohibitively challenging problem. This paper proposes integrating a text-to-music model with a large language model to generate music with form. We discuss our solutions to the challenges of such integration. The experimental results show that the proposed method can generate 2.5-minute-long music that is highly structured, strongly organized, and cohesive.

[LG-57] Sparse Attention Decomposition Applied to Circuit Tracing

链接: https://arxiv.org/abs/2410.00340
作者: Gabriel Franco,Mark Crovella
关键词-EN: attention heads, perform complex tasks, attention, attention heads work, papers have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Many papers have shown that attention heads work in conjunction with each other to perform complex tasks. It’s frequently assumed that communication between attention heads is via the addition of specific features to token residuals. In this work we seek to isolate and identify the features used to effect communication and coordination among attention heads in GPT-2 small. Our key leverage on the problem is to show that these features are very often sparsely coded in the singular vectors of attention head matrices. We characterize the dimensionality and occurrence of these signals across the attention heads in GPT-2 small when used for the Indirect Object Identification (IOI) task. The sparse encoding of signals, as provided by attention head singular vectors, allows for efficient separation of signals from the residual background and straightforward identification of communication paths between attention heads. We explore the effectiveness of this approach by tracing portions of the circuits used in the IOI task. Our traces reveal considerable detail not present in previous studies, shedding light on the nature of redundant paths present in GPT-2. And our traces go beyond previous work by identifying features used to communicate between attention heads when performing IOI.

[LG-58] EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics

链接: https://arxiv.org/abs/2410.00327
作者: Chenqing Hua,Yong Liu,Dinghuai Zhang,Odin Zhang,Sitao Luan,Kevin K. Yang,Guy Wolf,Doina Precup,Shuangjia Zheng
关键词-EN: area in biotechnology, critical area, applications ranging, ranging from drug, drug development
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Enzyme design is a critical area in biotechnology, with applications ranging from drug development to synthetic biology. Traditional methods for enzyme function prediction or protein binding pocket design often fall short in capturing the dynamic and complex nature of enzyme-substrate interactions, particularly in catalytic processes. To address the challenges, we introduce EnzymeFlow, a generative model that employs flow matching with hierarchical pre-training and enzyme-reaction co-evolution to generate catalytic pockets for specific substrates and catalytic reactions. Additionally, we introduce a large-scale, curated, and validated dataset of enzyme-reaction pairs, specifically designed for the catalytic pocket generation task, comprising a total of 328,192 pairs. By incorporating evolutionary dynamics and reaction-specific adaptations, EnzymeFlow becomes a powerful model for designing enzyme pockets, which is capable of catalyzing a wide range of biochemical reactions. Experiments on the new dataset demonstrate the model’s effectiveness in designing high-quality, functional enzyme catalytic pockets, paving the way for advancements in enzyme engineering and synthetic biology. We provide EnzymeFlow code at this https URL with notebook demonstration at this https URL.
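As background on the flow-matching component, the standard conditional flow-matching objective with a linear interpolation path is shown below; EnzymeFlow's hierarchical pre-training and co-evolutionary conditioning are not captured by it.

```latex
% Conditional flow matching with a linear path between samples x_0 and x_1:
x_t = (1-t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2
```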

[LG-59] Ask Pose Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

链接: https://arxiv.org/abs/2410.00309
作者: Laura Bravo-Sánchez,Jaewoo Heo,Zhenzhen Weng,Kuan-Chieh Wang,Serena Yeung-Levy
关键词-EN: Vision Language Models, Large Vision Language, Social dynamics, utilizes Large Vision, pose significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses longstanding challenges of data scarcity for close interactions in HME, enhancing the field's capabilities of handling complex interaction scenarios.

[LG-60] VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data

链接: https://arxiv.org/abs/2410.00296
作者: Xuefeng Du,Reshmi Ghosh,Robert Sim,Ahmed Salem,Vitor Carvalho,Emily Lawton,Yixuan Li,Jack W. Stokes
关键词-EN: Vision-language models, essential for contextual, contextual understanding, visual and textual, Vision-language
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: arXiv admin note: text overlap with arXiv:2409.17504

点击查看摘要

Abstract:Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about the reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiment shows VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.

[LG-61] Comprehensive Performance Modeling and System Design Insights for Foundation Models

链接: https://arxiv.org/abs/2410.00273
作者: Shashank Subramanian,Ermal Rrapaj,Peter Harrington,Smeet Chheda,Steven Farrell,Brian Austin,Samuel Williams,Nicholas Wright,Wahid Bhimji
关键词-EN: increasingly driving HPC, driving HPC system, HPC system design, HPC system, driving HPC
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 17 pages, PMBS 2024

点击查看摘要

Abstract:Generative AI, in particular large transformer models, is increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features (accelerators and interconnects). We utilize a performance model that allows us to explore this complex design space and highlight its key components. We find that different transformer types demand different parallelism and system characteristics at different training regimes. Large Language Models are performant with 3D parallelism and amplify network needs only at pre-training scales with reduced dependence on accelerator capacity and bandwidth. On the other hand, long-sequence transformers, representative of scientific foundation models, place a more uniform dependence on network and capacity with necessary 4D parallelism. Our analysis emphasizes the need for closer performance modeling of different transformer types, keeping system features in mind, and demonstrates a path towards this. Our code is available as open-source.

[LG-62] Real-time Diverse Motion In-betweening with Space-time Control SIGGRAPH

链接: https://arxiv.org/abs/2410.00270
作者: Yuchen Chu,Zeshi Yang
关键词-EN: generating diverse in-betweening, kinematic characters, present a data-driven, data-driven framework, framework for generating
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Presented at The 16th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG '24)

点击查看摘要

Abstract:In this work, we present a data-driven framework for generating diverse in-betweening motions for kinematic characters. Our approach injects dynamic conditions and explicit motion controls into the procedure of motion transitions. Notably, this integration enables a finer-grained spatial-temporal control by allowing users to impart additional conditions, such as duration, path, style, etc., into the in-betweening process. We demonstrate that our in-betweening approach can synthesize both locomotion and unstructured motions, enabling rich, versatile, and high-quality animation generation.

[LG-63] Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation

链接: https://arxiv.org/abs/2410.00266
作者: Aleyna Kütük,Tevfik Metin Sezgin
关键词-EN: Scene sketch, Scene sketch semantic, Scene, applications including, sketch semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class-agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post-processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free-hand scene sketch datasets with both instance and stroke-level class annotations. To fill this gap, we collected the largest Free-hand Instance- and Stroke-level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state-of-the-art scene sketch segmentation models. The code and dataset will be made public after acceptance.

[LG-64] DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining

链接: https://arxiv.org/abs/2410.00260
作者: Vinayak Arannil,Sourav Sanjukta Bhabesh,Neha Narwal,Sai Nikhil Thirandas,Darren Yow-Bang Wang,Graham Horwood,Alex Anto Chirayath,Gouri Pandeshwar
关键词-EN: Large Language Models, shown remarkable ability, Language Models, data, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs for generating domain-specific synthetic data, but most often the results lack truthfulness and complexity. Alternatively, in cases where domain data is available, such as healthcare and finance, most of the LMs are proprietary, necessitating a scalable method to curate real-world industry-specific pre-training data. In this work, we propose an automated and scalable framework - DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining - to mine domain-specific training data from a large data corpus for domain adaptation of an LM. The framework leverages the parametric knowledge of an LLM to generate diverse and representative seed data tailored to a specific domain, which is then used to mine real-world data from a large data corpus like Common Crawl. We evaluated our framework's performance in the continual pre-training (CPT) setting by training two domain-specific 7B parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from MMLU, MedQA, MedMCQA and PubMedQA datasets, and 2.9% and 6.7% for zero-shot and 5-shot settings respectively on finance tasks from FiQA-SA, FPB and Headlines datasets when compared to the baseline.

[LG-65] Enhanced Credit Score Prediction Using Ensemble Deep Learning Model

链接: https://arxiv.org/abs/2410.00256
作者: Qianwen Xing,Chang Yu,Sining Huang,Qi Zheng,Xingyu Mu,Mengying Sun
关键词-EN: contemporary economic society, integrating Random Forest, economic society, contemporary economic, Random Forest
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by CSP Journal

点击查看摘要

Abstract:In contemporary economic society, credit scores are crucial for every participant. A robust credit evaluation system is essential for the profitability of core businesses such as credit cards, loans, and investments for commercial banks and the financial sector. This paper combines high-performance models like XGBoost and LightGBM, already widely used in modern banking systems, with the powerful TabNet model. We have developed a potent model capable of accurately determining credit score levels by integrating Random Forest, XGBoost, and TabNet, and through the stacking technique in ensemble modeling. This approach surpasses the limitations of single models and significantly advances precise credit score prediction. In the following sections, we will explain the techniques we used and thoroughly validate our approach by comprehensively comparing a series of metrics such as Precision, Recall, F1, and AUC. By integrating Random Forest and XGBoost with the TabNet deep learning architecture, these models complement each other, demonstrating exceptionally strong overall performance.
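A minimal stacking sketch in scikit-learn terms, assuming tabular features X and labels y. TabNet is omitted here for brevity (in the paper it enters as a further base learner), and the hyperparameters are illustrative rather than those used in the paper.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Base learners feed out-of-fold predictions to a simple meta-learner.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.1, eval_metric="logloss")),
    ("lgbm", LGBMClassifier(n_estimators=300)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
# stack.fit(X_train, y_train); proba = stack.predict_proba(X_test)
```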

[LG-66] Quantized and Asynchronous Federated Learning

链接: https://arxiv.org/abs/2410.00242
作者: Tomas Ortega,Hamid Jafarkhani
关键词-EN: Recent advances, Asynchronous Federated Learning, Quantized Asynchronous Federated, synchronous counterparts, federated learning
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recent advances in federated learning have shown that asynchronous variants can be faster and more scalable than their synchronous counterparts. However, their design does not include quantization, which is necessary in practice to deal with the communication bottleneck. To bridge this gap, we develop a novel algorithm, Quantized Asynchronous Federated Learning (QAFeL), which introduces a hidden-state quantization scheme to avoid the error propagation caused by direct quantization. QAFeL also includes a buffer to aggregate client updates, ensuring scalability and compatibility with techniques such as secure aggregation. Furthermore, we prove that QAFeL achieves an $\mathcal{O}(1/\sqrt{T})$ ergodic convergence rate for stochastic gradient descent on non-convex objectives, which is the optimal order of complexity, without requiring bounded gradients or uniform client arrivals. We also prove that the cross-term error between staleness and quantization only affects the higher-order error terms. We validate our theoretical findings on standard benchmarks.
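A generic sketch of the hidden-state quantization idea described above: rather than quantizing raw updates (which lets quantization error accumulate), quantize each update's deviation from a hidden state that sender and receiver advance with the identical rule. This is an illustrative reconstruction, not the paper's algorithm, and the uniform quantizer is a toy stand-in.

```python
import numpy as np

def quantize(x: np.ndarray, levels: int = 16) -> np.ndarray:
    """Toy uniform quantizer (a stand-in for any contractive/unbiased quantizer)."""
    scale = np.max(np.abs(x)) + 1e-12
    return np.round(x / scale * levels) / levels * scale

class HiddenStateEncoder:
    """Sender side: transmit only the quantized deviation from the shared hidden
    state; the receiver keeps its own copy of `hidden` and applies the same update,
    so both states stay synchronized and errors do not compound across rounds."""
    def __init__(self, dim: int):
        self.hidden = np.zeros(dim)

    def encode(self, update: np.ndarray) -> np.ndarray:
        q = quantize(update - self.hidden)   # low-bit message actually sent
        self.hidden += q                      # mirrored on the receiver
        return q
```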

[LG-67] Demonstrating the Continual Learning Capabilities and Practical Application of Discrete-Time Active Inference

链接: https://arxiv.org/abs/2410.00240
作者: Rithvik Prakki
关键词-EN: enabling continual adaptation, Active inference, biological or artificial, adaptation and decision-making, combines Bayesian inference
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Active inference is a mathematical framework for understanding how agents (biological or artificial) interact with their environments, enabling continual adaptation and decision-making. It combines Bayesian inference and free energy minimization to model perception, action, and learning in uncertain and dynamic contexts. Unlike reinforcement learning, active inference integrates exploration and exploitation seamlessly by minimizing expected free energy. In this paper, we present a continual learning framework for agents operating in discrete time environments, using active inference as the foundation. We derive the mathematical formulations of variational and expected free energy and apply them to the design of a self-learning research agent. This agent updates its beliefs and adapts its actions based on new data without manual intervention. Through experiments in changing environments, we demonstrate the agent’s ability to relearn and refine its models efficiently, making it suitable for complex domains like finance and healthcare. The paper concludes by discussing how the proposed framework generalizes to other systems, positioning active inference as a flexible approach for adaptive AI.
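For reference, the variational free energy referred to above is standardly written as (o: observations, s: hidden states, q: approximate posterior):

```latex
F \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
  \;=\; D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s \mid o)\,\right] \;-\; \ln p(o)
```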

[LG-68] Modulation and Coding for NOMA and RSMA

链接: https://arxiv.org/abs/2410.00239
作者: Hamid Jafarkhani,Hossein Maleki,Mojtaba Vaezi
关键词-EN: Next-generation multiple access, Next-generation multiple, multiple access, NOMA, conventional orthogonal methods
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Invited paper; to appear in the Proceedings of the IEEE

点击查看摘要

Abstract:Next-generation multiple access (NGMA) serves as an umbrella term for transmission schemes distinct from conventional orthogonal methods. A key candidate of NGMA, non-orthogonal multiple access (NOMA), emerges as a solution to enhance connectivity by allowing multiple users to share time, frequency, and space concurrently. However, NOMA faces challenges in implementation, particularly in canceling inter-user interference. In this paper, we discuss the principles behind NOMA and review conventional NOMA methods. Then, to address these challenges, we present asynchronous transmission and interference-aware modulation techniques, enabling decoding without successive interference cancellation. The goal is to design constellations that dynamically adapt to interference, minimizing bit error rates (BERs) and enhancing user throughput in the presence of inter-user, inter-carrier, and inter-cell interference. The traditional link between minimizing BER and increasing spectral efficiency is explored, with deep autoencoders for end-to-end communication emerging as a potential solution to improve BERs. Interference-aware modulation can revolutionize constellation design for non-orthogonal channels. Rate-splitting multiple access (RSMA) is another promising interference management technique in multi-user systems. In addition to addressing challenges in finite-alphabet NOMA, this paper offers new insights and provides an overview of code-domain NOMA, trellis-coded NOMA, and RSMA as key NGMA candidates. We also discuss the evolution of channel coding toward low-latency communication and examine modulation and coding schemes in 5G networks. Finally, we highlight future research directions, emphasizing their importance for realizing NOMA from concept to functional technology.

[LG-69] Preconditioning for Accelerated Gradient Descent Optimization and Regularization

链接: https://arxiv.org/abs/2410.00232
作者: Qiang Ye
关键词-EN: Accelerated training algorithms, adaptive learning rates, Accelerated training, adaptive learning, learning rates
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 7 pages

点击查看摘要

Abstract:Accelerated training algorithms, such as adaptive learning rates and various normalization methods, are widely used but not fully understood. When regularization is introduced, standard optimizers like adaptive learning rates may not perform effectively. This raises the need for alternative regularization approaches and the question of how to properly combine regularization with preconditioning. In this paper, we address these challenges using the theory of preconditioning as follows: (1) We explain how preconditioning with AdaGrad, RMSProp, and Adam accelerates training; (2) We explore the interaction between regularization and preconditioning, outlining different options for selecting the variables for regularization, and in particular we discuss how to implement that for the gradient regularization; and (3) We demonstrate how normalization methods accelerate training by improving Hessian conditioning, and discuss how this perspective can lead to new preconditioning training algorithms. Our findings offer a unified mathematical framework for understanding various acceleration techniques and deriving appropriate regularization schemes.
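为直观说明"自适应学习率可以看作对角预条件化"这一视角,下面给出一个 RMSProp 风格的对角预条件梯度下降的极简 Python 草图;目标函数与超参数均为假设,仅作示意,并非论文中的分析框架或其推荐的正则化方案:

```python
import numpy as np

def preconditioned_gd(grad_fn, x0, lr=0.1, eps=1e-8, beta=0.9, steps=200):
    """RMSProp 式对角预条件梯度下降示意:用梯度平方的滑动平均构造对角预条件子。"""
    x, v = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(steps):
        g = grad_fn(x)
        v = beta * v + (1 - beta) * g * g       # 梯度平方的滑动平均
        x = x - lr * g / (np.sqrt(v) + eps)     # 对角预条件化后的更新
    return x

# 用法示例:优化一个条件数很差的二次函数 f(x) = 0.5 * x^T A x
A = np.diag([1.0, 100.0])
print(preconditioned_gd(lambda x: A @ x, np.array([1.0, 1.0])))
```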

[LG-70] Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models

链接: https://arxiv.org/abs/2410.00231
作者: Qi Wu,Zipeng Fu,Xuxin Cheng,Xiaolong Wang,Chelsea Finn
关键词-EN: achieved strong performance, Learning-based methods, methods have achieved, achieved strong, strong performance
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project website: this https URL

点击查看摘要

Abstract:Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following user’s commands to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate. Project website: this https URL


[LG-71] Probabilistic Classification of Near-Surface Shallow-Water Sediments using A Portable Free-Fall Penetrometer

链接: https://arxiv.org/abs/2410.00225
作者: Md Rejwanur Rahman,Adrian Rodriguez-Marek,Nina Stark,Grace Massey,Carl Friedrichs,Kelly M. Dorgan
关键词-EN: naval applications, geotechnical evaluation, important for engineering, engineering projects, projects and naval
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The geotechnical evaluation of seabed sediments is important for engineering projects and naval applications, offering valuable insights into sediment properties, behavior, and strength. Obtaining high-quality seabed samples can be a challenging task, making in-situ testing an essential part of site characterization. Free Fall Penetrometers (FFP) have emerged as robust tools for rapidly profiling seabed surface sediments, even in energetic nearshore or estuarine conditions and shallow as well as deep depths. While methods for interpretation of traditional offshore Cone Penetration Testing (CPT) data are well-established, their adaptation to FFP data is still an area of research. In this study, we introduce an innovative approach that utilizes machine learning algorithms to create a sediment behavior classification system based on portable free fall penetrometer (PFFP) data. The proposed model leverages PFFP measurements obtained from locations such as Sequim Bay (Washington), the Potomac River, and the York River (Virginia). The result shows 91.1% accuracy in the class prediction, with the classes representing cohesionless sediment with little to no plasticity, cohesionless sediment with some plasticity, cohesive sediment with low plasticity, and cohesive sediment with high plasticity. The model prediction not only provides the predicted class but also yields an estimate of inherent uncertainty associated with the prediction, which can provide valuable insight about different sediment behaviors. These uncertainties typically range from very low to very high, with lower uncertainties being more common, but they can increase significantly depending on variations in sediment composition, environmental conditions, and operational techniques. By quantifying uncertainty, the model offers a more comprehensive and informed approach to sediment classification.
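下面用 scikit-learn 给出一个"分类预测 + 以熵刻画预测不确定性"的通用示意。注意数据是随机合成的占位数据,特征数与类别设置均为假设,并非论文使用的 PFFP 测量数据或其具体模型:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 合成数据仅作占位:实际应替换为 PFFP 测量特征与四类沉积物行为标签
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])                          # 各类别的预测概率
pred = proba.argmax(axis=1)                               # 预测类别
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)    # 用熵衡量不确定性
print(pred, entropy)
```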

[LG-72] Characterizing and Efficiently Accelerating Multimodal Generation Model Inference HPCA2025

链接: https://arxiv.org/abs/2410.00215
作者: Yejin Lee,Anna Sun,Basil Hosmer,Bilge Acun,Can Balioglu,Changhan Wang,Charles David Hernandez,Christian Puhrsch,Daniel Haziza,Driss Guessous,Francisco Massa,Jacob Kahn,Jeffrey Wan,Jeremy Reizenstein,Jiaqi Zhai,Joe Isaacson,Joel Schlosser,Juan Pino,Kaushik Ram Sadagopan,Leonid Shamis,Linjian Ma,Min-Jae Hwang,Mingda Chen,Mostafa Elhoushi,Pedro Rodriguez,Ram Pasunuru,Scott Yih,Sravya Popuri,Xing Liu,Carole-Jean Wu
关键词-EN: Generative artificial intelligence, artificial intelligence, computing industry, revolutionizing the computing, technology is revolutionizing
类目: Machine Learning (cs.LG)
*备注: 13 pages including references. 8 Figures. Under review to HPCA 2025 Industry Track

点击查看摘要

Abstract:Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only have its applications broadened to various sectors, but it also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.

[LG-73] End-to-end Piano Performance-MIDI to Score Conversion with Transformers

链接: https://arxiv.org/abs/2410.00210
作者: Tim Beyer,Angela Dai
关键词-EN: expressive human performance, computational musicology, automated creation, expressive human, symbolic music data
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, to appear at ISMIR 2024

点击查看摘要

Abstract:The automated creation of accurate musical notation from an expressive human performance is a fundamental task in computational musicology. To this end, we present an end-to-end deep learning approach that constructs detailed musical scores directly from real-world piano performance-MIDI files. We introduce a modern transformer-based architecture with a novel tokenized representation for symbolic music data. Framing the task as sequence-to-sequence translation rather than note-wise classification reduces alignment requirements and annotation costs, while allowing the prediction of more concise and accurate notation. To serialize symbolic music data, we design a custom tokenization stage based on compound tokens that carefully quantizes continuous values. This technique preserves more score information while reducing sequence lengths by 3.5× compared to prior approaches. Using the transformer backbone, our method demonstrates better understanding of note values, rhythmic structure, and details such as staff assignment. When evaluated end-to-end using transcription metrics such as MUSTER, we achieve significant improvements over previous deep learning approaches and complex HMM-based state-of-the-art pipelines. Our method is also the first to directly predict notational details like trill marks or stem direction from performance data. Code and models are available at this https URL

[LG-74] Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification EMNLP2024

链接: https://arxiv.org/abs/2410.00179
作者: Kush Dubey
关键词-EN: modern NLP techniques, evaluating modern NLP, NLP techniques, modern NLP, critical for evaluating
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: To appear in the GenBench Workshop at EMNLP 2024

点击查看摘要

Abstract:Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models – BERT, GPT-2, and Mistral 7B – do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available at this https URL.
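针对"few-shot 评测应使用多个训练折并重复子采样"这一建议,下面给出一个示意性的评测循环;数据与模型均为合成占位(并非论文中的 25 个分类任务或 BERT/GPT-2/Mistral 模型),重点是对小样本训练集多次重采样并报告均值与方差:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, y_pool = X[:1000], y[:1000]        # 可供抽取 few-shot 训练折的样本池
X_test, y_test = X[1000:], y[1000:]
rng = np.random.default_rng(0)

accs = []
for _ in range(20):                         # 多个随机训练折
    idx = rng.choice(len(X_pool), size=32, replace=False)   # 每折仅 32 个标注样本
    clf = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
    accs.append(clf.score(X_test, y_test))
print(f"mean={np.mean(accs):.3f}, std={np.std(accs):.3f}")
```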

[LG-75] GaNDLF-Synth: A Framework to Democratize Generative AI for (Bio)Medical Imaging

链接: https://arxiv.org/abs/2410.00173
作者: Sarthak Pati,Szymon Mazurek,Spyridon Bakas
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, Nuanced Deep Learning, Generally Nuanced Deep
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) is a field of AI that creates new data samples from existing ones. It utilizes deep learning to overcome the scarcity and regulatory constraints of healthcare data by generating new data points that integrate seamlessly with original datasets. This paper explores the background and motivation for GenAI, and introduces the Generally Nuanced Deep Learning Framework for Synthesis (GaNDLF-Synth) to address a significant gap in the literature and move towards democratizing the implementation and assessment of image synthesis tasks in healthcare. GaNDLF-Synth describes a unified abstraction for various synthesis algorithms, including autoencoders, generative adversarial networks, and diffusion models. Leveraging the GANDLF-core framework, it supports diverse data modalities and distributed computing, ensuring scalability and reproducibility through extensive unit testing. The aim of GaNDLF-Synth is to lower the entry barrier for GenAI, and make it more accessible and extensible by the wider scientific community.

[LG-76] Basis-to-Basis Operator Learning Using Function Encoders

链接: https://arxiv.org/abs/2410.00171
作者: Tyler Ingebrand,Adam J. Thorpe,Somdatta Goswami,Krishna Kumar,Ufuk Topcu
关键词-EN: Hilbert spaces, operator learning, learning, foundational ideas, basis functions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Basis-to-Basis (B2B) operator learning, a novel approach for learning operators on Hilbert spaces of functions based on the foundational ideas of function encoders. We decompose the task of learning operators into two parts: learning sets of basis functions for both the input and output spaces, and learning a potentially nonlinear mapping between the coefficients of the basis functions. B2B operator learning circumvents many challenges of prior works, such as requiring data to be at fixed locations, by leveraging classic techniques such as least-squares to compute the coefficients. It is especially potent for linear operators, where we compute a mapping between bases as a single matrix transformation with a closed form solution. Furthermore, with minimal modifications and using the deep theoretical connections between function encoders and functional analysis, we derive operator learning algorithms that are directly analogous to eigen-decomposition and singular value decomposition. We empirically validate B2B operator learning on six benchmark operator learning tasks, and show that it demonstrates a two-orders-of-magnitude improvement in accuracy over existing approaches on several benchmark tasks.
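B2B 的流程可以概括为"函数 → 基系数 → 系数之间的(线性)映射 → 重建输出函数"。下面的 NumPy 草图用固定多项式基和求导这一线性算子做演示,系数映射用闭式最小二乘求得;论文中的基函数由 function encoder 学习得到,这里的基与算子只是为说明思路而做的假设:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
K = 6
B = np.stack([x**k for k in range(K)], axis=1)            # 假设的固定多项式基 (N, K)

def coeffs(F):
    """最小二乘求一批函数采样值 F (batch, N) 在基 B 下的系数 (batch, K)。"""
    return np.linalg.lstsq(B, F.T, rcond=None)[0].T

# 以求导为例构造训练函数对(线性算子)
C_true = rng.normal(size=(50, K))
F_in = C_true @ B.T                                       # 输入函数采样值
F_out = np.gradient(F_in, x, axis=1)                      # 数值导数作为输出函数

Cin, Cout = coeffs(F_in), coeffs(F_out)
M = np.linalg.lstsq(Cin, Cout, rcond=None)[0]             # 系数间的线性映射(闭式解)

# 对新函数:投影到基 -> 系数映射 -> 重建输出
f_new = 1.0 + 2.0 * x - 3.0 * x**2
pred = coeffs(f_new[None]) @ M @ B.T
print(np.abs(pred - (2.0 - 6.0 * x)).max())               # 误差应当很小
```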

[LG-77] (Almost) Smooth Sailing: Towards Numerical Stability of Neural Networks Through Differentiable Regularization of the Condition Number ICML24

链接: https://arxiv.org/abs/2410.00169
作者: Rossen Nenov,Daniel Haider,Peter Balazs
关键词-EN: Maintaining numerical stability, machine learning models, Maintaining numerical, reliability and performance, machine learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted at ICML24 Workshop: Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators

点击查看摘要

Abstract:Maintaining numerical stability in machine learning models is crucial for their reliability and performance. One approach to maintain stability of a network layer is to integrate the condition number of the weight matrix as a regularizing term into the optimization algorithm. However, due to its discontinuous nature and lack of differentiability the condition number is not suitable for a gradient descent approach. This paper introduces a novel regularizer that is provably differentiable almost everywhere and promotes matrices with low condition numbers. In particular, we derive a formula for the gradient of this regularizer which can be easily implemented and integrated into existing optimization algorithms. We show the advantages of this approach for noisy classification and denoising of MNIST images.

[LG-78] Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

链接: https://arxiv.org/abs/2410.00153
作者: Haiyan Zhao,Heng Zhao,Bo Shen,Ali Payani,Fan Yang,Mengnan Du
关键词-EN: large language models, Probing learned concepts, encoded internally, crucial for understanding, understanding how semantic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 28 pages, 9 figures

点击查看摘要

Abstract:Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principal approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS’s effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.
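为说明"把单个概念向量扩展为概念子空间的高斯近似"这一思路,下面给出一个极简草图:对重采样数据多次训练线性探针得到一组方向向量,再用对角协方差的高斯分布刻画并从中采样。隐藏表示为合成占位数据,对角协方差也是本示例的简化假设,并非论文中 GCS 的具体实现:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)                 # 人造"概念方向",仅用于合成标签
H = rng.normal(size=(2000, d))                 # 占位的隐藏层表示
y = (H @ direction + 0.5 * rng.normal(size=2000) > 0).astype(int)

vecs = []
for _ in range(30):                            # 多次重采样训练线性探针
    idx = rng.choice(len(H), size=500, replace=False)
    w = LogisticRegression(max_iter=1000).fit(H[idx], y[idx]).coef_[0]
    vecs.append(w / np.linalg.norm(w))
vecs = np.array(vecs)

mu, var = vecs.mean(axis=0), vecs.var(axis=0)  # 高斯近似:均值与(对角)方差
sampled = rng.normal(mu, np.sqrt(var) + 1e-8)  # 从概念子空间中采样一个新概念向量
cos = sampled @ mu / (np.linalg.norm(sampled) * np.linalg.norm(mu))
print(cos)                                     # 采样向量与均值方向应高度一致
```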

[LG-79] What If We Had Used a Different App? Reliable Counterfactual KPI Analysis in Wireless Systems

链接: https://arxiv.org/abs/2410.00150
作者: Qiushuo Hou,Sangwoo Park,Matteo Zecchin,Yunlong Cai,Guanding Yu,Osvaldo Simeone
关键词-EN: Open Radio Access, Radio Access Network, Open Radio, Radio Access, wireless network architectures
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: This paper has been submitted to a journal

点击查看摘要

Abstract:In modern wireless network architectures, such as Open Radio Access Network (O-RAN), the operation of the radio access network (RAN) is managed by applications, or apps for short, deployed at intelligent controllers. These apps are selected from a given catalog based on current contextual information. For instance, a scheduling app may be selected on the basis of current traffic and network conditions. Once an app is chosen and run, it is no longer possible to directly test the performance that would have been obtained with another app. This test, however, would be potentially valuable to monitor and optimize the network operation. With this goal in mind, this paper addresses the “what-if” problem of estimating the values of key performance indicators (KPIs) that would have been obtained if a different app had been implemented by the RAN. To this end, we propose a conformal-prediction-based counterfactual analysis method for wireless systems that provides reliable “error bars” for the estimated KPIs, containing the true KPIs with a user-defined probability, despite the inherent covariate shift between logged and test data. Experimental results for medium access control-layer apps and for physical-layer apps demonstrate the merits of the proposed method.

[LG-80] Are Large Language Models In-Context Personalized Summarizers? Get an iCOPERNICUS Test Done!

链接: https://arxiv.org/abs/2410.00149
作者: Divya Patel,Pathik Patel,Ankush Chander,Sourish Dasgupta,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, Language Models, succeeded considerably, Large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have succeeded considerably in In-Context-Learning (ICL) based summarization. However, saliency is subject to the users’ specific preference histories. Hence, we need reliable In-Context Personalization Learning (ICPL) capabilities within such LLMs. For any arbitrary LLM to exhibit ICPL, it needs to have the ability to discern contrast in user profiles. A recent study proposed a measure for degree-of-personalization called EGISES for the first time. EGISES measures a model’s responsiveness to user profile differences. However, it cannot test if a model utilizes all three types of cues provided in ICPL prompts: (i) example summaries, (ii) user’s reading histories, and (iii) contrast in user profiles. To address this, we propose the iCOPERNICUS framework, a novel In-COntext PERsonalization learNIng sCrUtiny of Summarization capability in LLMs that uses EGISES as a comparative measure. As a case-study, we evaluate 17 state-of-the-art LLMs based on their reported ICL performances and observe that 15 models’ ICPL degrades (min: 1.6%; max: 3.6%) when probed with richer prompts, thereby showing lack of true ICPL.

[LG-81] Constraint-Aware Refinement for Safety Verification of Neural Feedback Loops

链接: https://arxiv.org/abs/2410.00145
作者: Nicholas Rober,Jonathan P. How
关键词-EN: Neural networks, control pipelines, neural feedback loops, increasingly popular, autonomous systems
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 10 figures, submitted to L-CSS/ACC

点击查看摘要

Abstract:Neural networks (NNs) are becoming increasingly popular in the design of control pipelines for autonomous systems. However, since the performance of NNs can degrade in the presence of out-of-distribution data or adversarial attacks, systems that have NNs in their control pipelines, i.e., neural feedback loops (NFLs), need safety assurances before they can be applied in safety-critical situations. Reachability analysis offers a solution to this problem by calculating reachable sets that bound the possible future states of an NFL and can be checked against dangerous regions of the state space to verify that the system does not violate safety constraints. Since exact reachable sets are generally intractable to calculate, reachable set over approximations (RSOAs) are typically used. The problem with RSOAs is that they can be overly conservative, making it difficult to verify the satisfaction of safety constraints, especially over long time horizons or for highly nonlinear NN control policies. Refinement strategies such as partitioning or symbolic propagation are typically used to limit the conservativeness of RSOAs, but these approaches come with a high computational cost and often can only be used to verify safety for simple reachability problems. This paper presents Constraint-Aware Refinement for Verification (CARV): an efficient refinement strategy that reduces the conservativeness of RSOAs by explicitly using the safety constraints on the NFL to refine RSOAs only where necessary. We demonstrate that CARV can verify the safety of an NFL where other approaches either fail or take up to 60x longer and 40x the memory.

[LG-82] Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models EMNLP2024

链接: https://arxiv.org/abs/2410.00131
作者: Ji Liu,Jiaxiang Ren,Ruoming Jin,Zijie Zhang,Yang Zhou,Patrick Valduriez,Dejing Dou
关键词-EN: Large Language Models, fine-tune Large Language, collaboratively train models, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 27 pages, 8 figures, 14 tables, to appear in EMNLP 2024

点击查看摘要

Abstract:As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). While LLMs correspond to huge size, the scale of the training data significantly increases, which leads to tremendous amounts of computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches.

[LG-83] Cartesian Genetic Programming Approach for Designing Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.00129
作者: Krzywda Maciej,Łukasik Szymon,Gandomi H. Amir
关键词-EN: Convolutional Neural Networks, present study covers, Cartesian genetic programming, optimization of Convolutional, Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The present study covers an approach to neural architecture search (NAS) using Cartesian genetic programming (CGP) for the design and optimization of Convolutional Neural Networks (CNNs). In designing artificial neural networks, one crucial aspect of the innovative approach is suggesting a novel neural architecture. Currently used architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. In this work, we use a pure Genetic Programming approach to design CNNs, which employs only one genetic operation, i.e., mutation. In the course of preliminary experiments, our methodology yields promising results.

[LG-84] Using fractal dimension to predict the risk of intra cranial aneurysm rupture with machine learning

链接: https://arxiv.org/abs/2410.00121
作者: Pradyumna Elavarthi,Anca Ralescu,Mark D. Johnson,Charles J. Prestigiacomo
关键词-EN: Multi Layer Perceptron, Support Vector Machine, morbidity and mortality, Intracranial aneurysms, result in significant
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Intracranial aneurysms (IAs) that rupture result in significant morbidity and mortality. While traditional risk models such as the PHASES score are useful in clinical decision making, machine learning (ML) models offer the potential to provide more accuracy. In this study, we compared the performance of four different machine learning algorithms, Random Forest (RF), XGBoost (XGB), Support Vector Machine (SVM), and Multi Layer Perceptron (MLP), on clinical and radiographic features to predict rupture status of intracranial aneurysms. Among the models, RF achieved the highest accuracy (85%) with balanced precision and recall, while MLP had the lowest overall performance (accuracy of 63%). Fractal dimension ranked as the most important feature for model performance across all models.

[LG-85] An Overview of the Burer-Monteiro Method for Certifiable Robot Perception

链接: https://arxiv.org/abs/2410.00117
作者: Alan Papalia,Yulun Tian,David M. Rosen,Jonathan P. How,John J. Leonard
关键词-EN: solve robot perception, robot perception problems, Burer-Monteiro method, optimality in real-time, presents an overview
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to 2024 Robotics: Science and Systems (RSS) Safe Autonomy Workshop

点击查看摘要

Abstract:This paper presents an overview of the Burer-Monteiro method (BM), a technique that has been applied to solve robot perception problems to certifiable optimality in real-time. BM is often used to solve semidefinite programming relaxations, which can be used to perform global optimization for non-convex perception problems. Specifically, BM leverages the low-rank structure of typical semidefinite programs to dramatically reduce the computational cost of performing optimization. This paper discusses BM in certifiable perception, with three main objectives: (i) to consolidate information from the literature into a unified presentation, (ii) to elucidate the role of the linear independence constraint qualification (LICQ), a concept not yet well-covered in certifiable perception literature, and (iii) to share practical considerations that are discussed among practitioners but not thoroughly covered in the literature. Our general aim is to offer a practical primer for applying BM towards certifiable perception.

[LG-86] Fine-tuning Vision Classifiers On A Budget

链接: https://arxiv.org/abs/2410.00085
作者: Sunil Kumar,Ted Sandler,Paulina Varshavskaya
关键词-EN: modern computer vision, requires accurately labeled, Fine-tuning modern computer, accurately labeled data, computer vision models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Fine-tuning modern computer vision models requires accurately labeled data for which the ground truth may not exist, but a set of multiple labels can be obtained from labelers of variable accuracy. We tie the notion of label quality to confidence in labeler accuracy and show that, when prior estimates of labeler accuracy are available, using a simple naive-Bayes model to estimate the true labels allows us to label more data on a fixed budget without compromising label or fine-tuning quality. We present experiments on a dataset of industrial images that demonstrates that our method, called Ground Truth Extension (GTX), enables fine-tuning ML models using fewer human labels.
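摘要中"在已知各标注者准确率先验时,用朴素贝叶斯模型估计真实标签"的思路可以用几行代码示意。这里采用对称错误模型(标注者出错时在其余类别上均匀分布),属于本示例的简化假设,并非 GTX 方法的完整实现:

```python
import numpy as np

def naive_bayes_labels(votes, accuracies, n_classes, prior=None):
    """朴素贝叶斯聚合多个标注者的标签,返回每个样本的类别后验概率。
    votes: (n_samples, n_labelers) 的整数标签;accuracies: 各标注者的先验准确率。"""
    prior = np.full(n_classes, 1.0 / n_classes) if prior is None else np.asarray(prior)
    n, m = votes.shape
    log_post = np.tile(np.log(prior), (n, 1))
    for j in range(m):
        p_ok = accuracies[j]
        p_err = (1.0 - p_ok) / (n_classes - 1)       # 出错时均匀落在其余类别
        for c in range(n_classes):
            log_post[:, c] += np.where(votes[:, j] == c, np.log(p_ok), np.log(p_err))
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    return post / post.sum(axis=1, keepdims=True)

# 用法:三位准确率不同的标注者对三个样本的投票
votes = np.array([[0, 0, 1], [1, 1, 1], [2, 0, 2]])
print(naive_bayes_labels(votes, accuracies=[0.9, 0.7, 0.6], n_classes=3))
```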

[LG-87] A Survey on Diffusion Models for Inverse Problems

链接: https://arxiv.org/abs/2410.00083
作者: Giannis Daras,Hyungjin Chung,Chieh-Hsin Lai,Yuki Mitsufuji,Jong Chul Ye,Peyman Milanfar,Alexandros G. Dimakis,Mauricio Delbracio
关键词-EN: generate high-quality samples, generative modeling due, Diffusion models, high-quality samples, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Work in progress. 38 pages

点击查看摘要

Abstract:Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples. This has unlocked exciting new possibilities for solving inverse problems, especially in image restoration and reconstruction, by treating diffusion models as unsupervised priors. This survey provides a comprehensive overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training. We introduce taxonomies to categorize these methods based on both the problems they address and the techniques they employ. We analyze the connections between different approaches, offering insights into their practical implementation and highlighting important considerations. We further discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems. This work aims to be a valuable resource for those interested in learning about the intersection of diffusion models and inverse problems.

[LG-88] Graph Residual Noise Learner Network for Brain Connectivity Graph Prediction

链接: https://arxiv.org/abs/2410.00082
作者: Oytun Demirbilek,Tingying Peng,Alaa Bessadok
关键词-EN: brain dysconnectivity patterns, charting brain dysconnectivity, dysconnectivity patterns, depicting a connectional, connectional fingerprint
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 6th Workshop on GRaphs in biomedicAl Image anaLysis

点击查看摘要

Abstract:A morphological brain graph depicting a connectional fingerprint is of paramount importance for charting brain dysconnectivity patterns. Such data often has missing observations due to various reasons such as time-consuming and incomplete neuroimage processing pipelines. Thus, predicting a target brain graph from a source graph is crucial for better diagnosing neurological disorders with minimal data acquisition resources. Many brain graph generative models have been proposed with promising results, yet they are mostly based on generative adversarial networks (GAN), which could suffer from mode collapse and require large training datasets. Recent developments in diffusion models address these problems by offering essential properties such as a stable training objective and easy scalability. However, applying a diffusion process to graph edges fails to maintain the topological symmetry of the brain connectivity matrices. To meet these challenges, we propose the Graph Residual Noise Learner Network (Grenol-Net), the first graph diffusion model for predicting a target graph from a source graph.

[LG-89] Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

链接: https://arxiv.org/abs/2410.00079
作者: Wenyue Hua,Mengting Wan,Shashank Vadrevu,Ryan Nadel,Yongfeng Zhang,Chi Wang
关键词-EN: producing action plans, human task delegation, Interactive Speculative Planning, task delegation, action plans
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 27 pages, 22 figures

点击查看摘要

Abstract:Agents, as user-centric tools, are increasingly deployed for human task delegation, assisting with a broad spectrum of requests by generating thoughts, engaging with user proxies, and producing action plans. However, agents based on large language models (LLMs) often face substantial planning latency due to two primary factors: the efficiency limitations of the underlying LLMs due to their large size and high demand, and the structural complexity of the agents due to the extensive generation of intermediate thoughts to produce the final output. Given that inefficiency in service provision can undermine the value of automation for users, this paper presents a human-centered efficient agent planning method – Interactive Speculative Planning – aiming at enhancing the efficiency of agent planning through both system design and human-AI interaction. Our approach advocates for the co-design of the agent system and user interface, underscoring the importance of an agent system that can fluidly manage user interactions and interruptions. By integrating human interruptions as a fundamental component of the system, we not only make it more user-centric but also expedite the entire process by leveraging human-in-the-loop interactions to provide accurate intermediate steps. Code and data will be released.

[LG-90] Optimizing Treatment Allocation in the Presence of Interference

链接: https://arxiv.org/abs/2410.00075
作者: Daan Caljon,Jente Van Belle,Jeroen Berrevoets,Wouter Verbeke
关键词-EN: Influence Maximization, treatment, treatment allocation, treatment effects, optimal treatment allocation
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In Influence Maximization (IM), the objective is to – given a budget – select the optimal set of entities in a network to target with a treatment so as to maximize the total effect. For instance, in marketing, the objective is to target the set of customers that maximizes the total response rate, resulting from both direct treatment effects on targeted customers and indirect, spillover, effects that follow from targeting these customers. Recently, new methods to estimate treatment effects in the presence of network interference have been proposed. However, the issue of how to leverage these models to make better treatment allocation decisions has been largely overlooked. Traditionally, in Uplift Modeling (UM), entities are ranked according to estimated treatment effect, and the top entities are allocated treatment. Since, in a network context, entities influence each other, the UM ranking approach will be suboptimal. The problem of finding the optimal treatment allocation in a network setting is combinatorial and generally has to be solved heuristically. To fill the gap between IM and UM, we propose OTAPI: Optimizing Treatment Allocation in the Presence of Interference to find solutions to the IM problem using treatment effect estimates. OTAPI consists of two steps. First, a causal estimator is trained to predict treatment effects in a network setting. Second, this estimator is leveraged to identify an optimal treatment allocation by integrating it into classic IM algorithms. We demonstrate that this novel method outperforms classic IM and UM approaches on both synthetic and semi-synthetic datasets.

[LG-91] Collaborative Knowledge Distillation via a Learning-by-Education Node Community

链接: https://arxiv.org/abs/2410.00074
作者: Anestis Kaimakamidis,Ioannis Mademlis,Ioannis Pitas
关键词-EN: Deep Neural Network, deployed Deep Neural, Neural Network, Deep Neural, diverse deployed Deep
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A novel Learning-by-Education Node Community framework (LENC) for Collaborative Knowledge Distillation (CKD) is presented, which facilitates continual collective learning through effective knowledge exchanges among diverse deployed Deep Neural Network (DNN) peer nodes. These DNNs dynamically and autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge, fostering a collaborative learning environment. The proposed framework enables efficient knowledge transfer among participating DNN nodes as needed, while enhancing their learning capabilities and promoting their collaboration. LENC addresses the challenges of handling diverse training data distributions and the limitations of individual DNN node learning abilities. It ensures the exploitation of the best available teacher knowledge upon learning a new task and protects the DNN nodes from catastrophic forgetting. Additionally, it innovates by enabling collaborative multitask knowledge distillation, while addressing the problem of task-agnostic continual learning, as DNN nodes have no information on task boundaries. Experimental evaluation on a proof-of-concept implementation demonstrates the LENC framework’s functionalities and benefits across multiple DNN learning and inference scenarios. The conducted experiments showcase its ability to gradually maximize the average test accuracy of the community of interacting DNN nodes in image classification problems, by appropriately leveraging the collective knowledge of all node peers. The LENC framework achieves state-of-the-art performance in on-line unlabelled CKD.

[LG-92] An interdisciplinary exploration of trade-offs between energy privacy and accuracy aspects of data WWW

链接: https://arxiv.org/abs/2410.00069
作者: Pepijn de Reus,Kyra Dresen,Ana Oprescu,Kristina Irion,Ans Kolk
关键词-EN: including ICT rising, ICT rising energy, including ICT, ICT rising, societal challenges
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Workshop paper for PLSC Europe 2024 ( this https URL )

点击查看摘要

Abstract:The digital era has raised many societal challenges, including ICT’s rising energy consumption and protecting privacy of personal data processing. This paper considers both aspects in relation to machine learning accuracy in an interdisciplinary exploration. We first present a method to measure the effects of privacy-enhancing techniques on data utility and energy consumption. The environmental-privacy-accuracy trade-offs are discovered through an experimental set-up. We subsequently take a storytelling approach to translate these technical findings to experts in non-ICT fields. We draft two examples for a governmental and auditing setting to contextualise our results. Ultimately, users face the task of optimising their data processing operations in a trade-off between energy, privacy, and accuracy considerations where the impact of their decisions is context-sensitive.

[LG-93] Ranking the Top-K Realizations of Stochastically Known Event Logs

链接: https://arxiv.org/abs/2410.00067
作者: Arvid Lepsien,Marco Pegoraro,Frederik Fonger,Dominic Langhammer,Milda Aleknonytė-Resch,Agnes Koschmider
关键词-EN: data quality issues, event logs, event, due to flawed, flawed recording
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Various kinds of uncertainty can occur in event logs, e.g., due to flawed recording, data quality issues, or the use of probabilistic models for activity recognition. Stochastically known event logs make these uncertainties transparent by encoding multiple possible realizations for events. However, the number of realizations encoded by a stochastically known log grows exponentially with its size, making exhaustive exploration infeasible even for moderately sized event logs. Thus, considering only the top-K most probable realizations has been proposed in the literature. In this paper, we implement an efficient algorithm to calculate a top-K realization ranking of an event log under event independence within O(Kn), where n is the number of uncertain events in the log. This algorithm is used to investigate the benefit of top-K rankings over top-1 interpretations of stochastically known event logs. Specifically, we analyze the usefulness of top-K rankings against different properties of the input data. We show that the benefit of a top-K ranking depends on the length of the input event log and the distribution of the event probabilities. The results highlight the potential of top-K rankings to enhance uncertainty-aware process mining techniques.
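在事件独立假设下,"前 K 个最可能的日志实现"可以在各事件候选构成的格点上用最优先(best-first)搜索枚举。下面是一个通用草图,复杂度未经优化,仅用于说明思路,并非论文中 O(Kn) 的具体算法:

```python
import heapq
from math import exp, log

def top_k_realizations(events, k):
    """events: 每个不确定事件的 [(activity, prob), ...] 候选列表(概率非零)。
    返回 [(实现, 概率), ...],按概率从高到低给出前 k 个实现。"""
    events = [sorted(e, key=lambda t: -t[1]) for e in events]
    logp = lambda state: sum(log(events[i][j][1]) for i, j in enumerate(state))
    start = tuple(0 for _ in events)
    heap, seen, out = [(-logp(start), start)], {start}, []
    while heap and len(out) < k:
        neg, state = heapq.heappop(heap)
        out.append(([events[i][j][0] for i, j in enumerate(state)], exp(-neg)))
        for i in range(len(state)):                    # 把某一个事件推进到下一候选
            if state[i] + 1 < len(events[i]):
                nxt = state[:i] + (state[i] + 1,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-logp(nxt), nxt))
    return out

uncertain_log = [[("A", 0.7), ("B", 0.3)], [("C", 0.6), ("D", 0.4)], [("E", 1.0)]]
print(top_k_realizations(uncertain_log, k=3))
```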

[LG-94] M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning ICRA2025

链接: https://arxiv.org/abs/2410.00064
作者: Kaushik Roy,Akila Dissanayake,Brendan Tidd,Peyman Moghadam
关键词-EN: poses significant challenges, significant challenges due, tasks poses significant, Lifelong imitation learning, manipulation tasks poses
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to ICRA2025

点击查看摘要

Abstract:Lifelong imitation learning for manipulation tasks poses significant challenges due to distribution shifts that occur in incremental learning steps. Existing methods often focus on unsupervised skill discovery to construct an ever-growing skill library or distillation from multiple policies, which can lead to scalability issues as diverse manipulation tasks are continually introduced and may fail to ensure a consistent latent space throughout the learning process, leading to catastrophic forgetting of previously learned skills. In this paper, we introduce M2Distill, a multi-modal distillation-based method for lifelong imitation learning focusing on preserving consistent latent space across vision, language, and action distributions throughout the learning process. By regulating the shifts in latent representations across different modalities from previous to current steps, and reducing discrepancies in Gaussian Mixture Model (GMM) policies between consecutive learning steps, we ensure that the learned policy retains its ability to perform previously learned tasks while seamlessly integrating new skills. Extensive evaluations on the LIBERO lifelong imitation learning benchmark suites, including LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-SPATIAL, demonstrate that our method consistently outperforms prior state-of-the-art methods across all evaluated metrics.

[LG-95] Neural Decompiling of Tracr Transformers

链接: https://arxiv.org/abs/2410.00061
作者: Hannes Thurnherr,Kaspar Riesen
关键词-EN: enabled substantial progress, Tracr compiled transformer, machine learning, architecture has enabled, enabled substantial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, the transformer architecture has enabled substantial progress in many areas of pattern recognition and machine learning. However, as with other neural network models, there is currently no general method available to explain their inner workings. The present paper represents a first step towards this direction. We utilize Transformer Compiler for RASP (Tracr) to generate a large dataset of pairs of transformer weights and corresponding RASP programs. Based on this dataset, we then build and train a model, with the aim of recovering the RASP code from the compiled model. We demonstrate that the simple form of Tracr compiled transformer weights is interpretable for such a decompiler model. In an empirical evaluation, our model achieves exact reproductions on more than 30% of the test objects, while the remaining 70% can generally be reproduced with only few errors. Additionally, more than 70% of the programs, produced by our model, are functionally equivalent to the ground truth, and therefore a valid decompilation of the Tracr compiled transformer weights.

[LG-96] IDEA: An Inverse Domain Expert Adaptation Based Active DNN IP Protection Method

链接: https://arxiv.org/abs/2410.00059
作者: Chaohui Xu,Qi Cui,Jinxin Dong,Weiyang He,Chip-Hong Chang
关键词-EN: Deep Neural Network, Neural Network, Deep Neural, Illegitimate reproduction, derivation of Deep
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Illegitimate reproduction, distribution and derivation of Deep Neural Network (DNN) models can inflict economic loss, reputation damage and even privacy infringement. Passive DNN intellectual property (IP) protection methods such as watermarking and fingerprinting attempt to prove the ownership upon IP violation, but they are often too late to stop catastrophic damage of IP abuse and too feeble against strong adversaries. In this paper, we propose IDEA, an Inverse Domain Expert Adaptation based proactive DNN IP protection method featuring active authorization and source traceability. IDEA generalizes active authorization as an inverse problem of domain adaptation. The multi-adaptive optimization is solved by a mixture-of-experts model with one real and two fake experts. The real expert re-optimizes the source model to correctly classify test images with a unique model user key steganographically embedded. The fake experts are trained to output random prediction on test images without or with incorrect user key embedded by minimizing their mutual information (MI) with the real expert. The MoE model is knowledge distilled into a unified protected model to avoid leaking the expert model features by maximizing their MI with additional multi-layer attention and contrastive representation loss optimization. IDEA not only prevents unauthorized users without the valid key from accessing the functional model, but also enables the model owner to validate the deployed model and trace the source of IP infringement. We extensively evaluate IDEA on five datasets and four DNN models to demonstrate its effectiveness in authorization control, culprit tracing success rate, and robustness against various attacks.

[LG-97] STTM: A New Approach Based Spatial-Temporal Transformer And Memory Network For Real-time Pressure Signal In On-demand Food Delivery

链接: https://arxiv.org/abs/2410.00057
作者: Jiang Wang,Haibin Wei,Xiaowei Xu,Jiacheng Shi,Jian Nie,Longzhi Du,Taixu Jiang
关键词-EN: On-demand Food Delivery, On-demand Food, OFD services, Real-time Pressure Signal, million food orders
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:On-demand Food Delivery (OFD) services have become very common around the world. For example, on the this http URL platform, users place more than 15 million food orders every day. Predicting the Real-time Pressure Signal (RPS) is crucial for OFD services, as it is primarily used to measure the current status of pressure on the logistics system. When RPS rises, the pressure increases, and the platform needs to quickly take measures to prevent the logistics system from being overloaded. Usually, the average delivery time for all orders within a business district is used to represent RPS. Existing research on OFD services primarily focuses on predicting the delivery time of orders, while relatively less attention has been given to the study of the RPS. Previous research directly applies general models such as DeepFM, RNN, and GNN for prediction, but fails to adequately utilize the unique temporal and spatial characteristics of OFD services, and faces issues with insufficient sensitivity during sudden severe weather conditions or peak periods. To address these problems, this paper proposes a new method based on Spatio-Temporal Transformer and Memory Network (STTM). Specifically, we use a novel Spatio-Temporal Transformer structure to learn logistics features across temporal and spatial dimensions and encode the historical information of a business district and its neighbors, thereby learning both temporal and spatial information. Additionally, a Memory Network is employed to increase sensitivity to abnormal events. Experimental results on the real-world dataset show that STTM significantly outperforms previous methods in both offline experiments and the online A/B test, demonstrating the effectiveness of this method.

[LG-98] Survey of Security and Data Attacks on Machine Unlearning In Financial and E-Commerce

链接: https://arxiv.org/abs/2410.00055
作者: Carl E.J. Brodzinski
关键词-EN: Membership Inference Attacks, Data Reconstruction Attacks, machine unlearning, Machine Unlearning Jailbreak, Machine Unlearning Data
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper surveys the landscape of security and data attacks on machine unlearning, with a focus on financial and e-commerce applications. We discuss key privacy threats such as Membership Inference Attacks and Data Reconstruction Attacks, where adversaries attempt to infer or reconstruct data that should have been removed. In addition, we explore security attacks including Machine Unlearning Data Poisoning, Unlearning Request Attacks, and Machine Unlearning Jailbreak Attacks, which target the underlying mechanisms of unlearning to manipulate or corrupt the model. To mitigate these risks, various defense strategies are examined, including differential privacy, robust cryptographic guarantees, and Zero-Knowledge Proofs (ZKPs), offering verifiable and tamper-proof unlearning mechanisms. These approaches are essential for safeguarding data integrity and privacy in high-stakes financial and e-commerce contexts, where compromised models can lead to fraud, data leaks, and reputational damage. This survey highlights the need for continued research and innovation in secure machine unlearning, as well as the importance of developing strong defenses against evolving attack vectors.

[LG-99] Transferable Unsupervised Outlier Detection Framework for Human Semantic Trajectories

链接: https://arxiv.org/abs/2410.00054
作者: Zheng Zhang,Hossein Amiri,Dazhou Yu,Yuntong Hu,Liang Zhao,Andreas Zufle
关键词-EN: outlier behaviors critical, enrich spatial-temporal data, Human Semantic Trajectories, Semantic trajectories, social security
类目: Machine Learning (cs.LG)
*备注: This is an accepted paper on this https URL

点击查看摘要

Abstract:Semantic trajectories, which enrich spatial-temporal data with textual information such as trip purposes or location activities, are key for identifying outlier behaviors critical to healthcare, social security, and urban planning. Traditional outlier detection relies on heuristic rules, which requires domain knowledge and limits its ability to identify unseen outliers. Besides, there lacks a comprehensive approach that can jointly consider multi-modal data across spatial, temporal, and textual dimensions. Addressing the need for a domain-agnostic model, we propose the Transferable Outlier Detection for Human Semantic Trajectories (TOD4Traj) framework. TOD4Traj first introduces a modality feature unification module to align diverse data feature representations, enabling the integration of multi-modal information and enhancing transferability across different datasets. A contrastive learning module is further proposed for identifying regular mobility patterns both temporally and across populations, allowing for a joint detection of outliers based on individual consistency and group majority patterns. Our experimental results have shown TOD4Traj’s superior performance over existing models, demonstrating its effectiveness and adaptability in detecting human trajectory outliers across various datasets.

[LG-100] Frequency-adaptive Multi-scale Deep Neural Networks

链接: https://arxiv.org/abs/2410.00053
作者: Jizu Huang,Rukang You,Tao Zhou
关键词-EN: Multi-scale deep neural, deep neural networks, Multi-scale deep, target functions characterized, downing-scaling mapping
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-scale deep neural networks (MscaleDNNs) with downing-scaling mapping have demonstrated superiority over traditional DNNs in approximating target functions characterized by high frequency features. However, the performance of MscaleDNNs heavily depends on the parameters in the downing-scaling mapping, which limits their broader application. In this work, we establish a fitting error bound to explain why MscaleDNNs are advantageous for approximating high frequency functions. Building on this insight, we construct a hybrid feature embedding to enhance the accuracy and robustness of the downing-scaling mapping. To reduce the dependency of MscaleDNNs on parameters in the downing-scaling mapping, we propose frequency-adaptive MscaleDNNs, which adaptively adjust these parameters based on a posterior error estimate that captures the frequency information of the fitted functions. Numerical examples, including wave propagation and the propagation of a localized solution of the Schrödinger equation with a smooth potential near the semi-classical limit, are presented. These examples demonstrate that the frequency-adaptive MscaleDNNs improve accuracy by two to three orders of magnitude compared to standard MscaleDNNs.
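MscaleDNN 中"尺度(下采样)映射"的大致做法,是把输入乘以一组尺度因子后分别送入子网络再叠加,从而更容易拟合高频成分。下面给出一个固定尺度因子的 PyTorch 极简草图;尺度因子为手工假设,论文提出的混合特征嵌入与频率自适应调整机制此处均未实现:

```python
import torch
import torch.nn as nn

class MscaleNet(nn.Module):
    """多尺度 DNN 示意:输入乘以不同尺度因子后分别通过子网络,输出求和。"""
    def __init__(self, scales=(1.0, 2.0, 4.0, 8.0, 16.0), hidden=64):
        super().__init__()
        self.scales = scales
        self.subnets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                          nn.Linear(hidden, hidden), nn.Tanh(),
                          nn.Linear(hidden, 1))
            for _ in scales)

    def forward(self, x):
        return sum(net(s * x) for s, net in zip(self.scales, self.subnets))

# 拟合一个同时含低频与高频成分的目标函数
x = torch.linspace(0.0, 1.0, 512).unsqueeze(1)
y = torch.sin(2 * torch.pi * x) + 0.3 * torch.sin(2 * torch.pi * 16 * x)
model = MscaleNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```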

[LG-101] DelayPTC-LLM: Metro Passenger Travel Choice Prediction under Train Delays with Large Language Models

链接: https://arxiv.org/abs/2410.00052
作者: Chen Chen,Yuxin He,Hao Wang,Jingjing Chen,Qin Luo
关键词-EN: Urban Rail Transit, Rail Transit, Urban Rail, posing significant challenges, networked operation conditions
类目: Machine Learning (cs.LG)
*备注: 15 pages,4 figures

点击查看摘要

Abstract:Train delays can propagate rapidly throughout the Urban Rail Transit (URT) network under networked operation conditions, posing significant challenges to operational departments. Accurately predicting passenger travel choices under train delays can provide interpretable insights into the redistribution of passenger flow, offering crucial decision support for emergency response and service recovery. However, the diversity of travel choices due to passenger heterogeneity and the sparsity of delay events leads to issues of data sparsity and sample imbalance in the travel choices dataset under metro delays. It is challenging to model this problem using traditional machine learning approaches, which typically rely on large, balanced datasets. Given the strengths of large language models (LLMs) in text processing, understanding, and their capabilities in small-sample and even zero-shot learning, this paper proposes a novel Passenger Travel Choice prediction framework under metro delays with the Large Language Model (DelayPTC-LLM). The well-designed prompting engineering is developed to guide the LLM in making and rationalizing predictions about travel choices, taking into account passenger heterogeneity and features of the delay events. Utilizing real-world data from Shenzhen Metro, including Automated Fare Collection (AFC) data and detailed delay logs, a comparative analysis of DelayPTC-LLM with traditional prediction models demonstrates the superior capability of LLMs in handling complex, sparse datasets commonly encountered under disruption of transportation systems. The results validate the advantages of DelayPTC-LLM in terms of predictive accuracy and its potential to provide actionable insights for big traffic data.

[LG-102] Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization NEURIPS2024

链接: https://arxiv.org/abs/2410.00051
作者: Haoran Li,Zhennan Jiang,Yuhui Chen,Dongbin Zhao
关键词-EN: faces significant challenges, visual reinforcement learning, reinforcement learning, faces significant, exploitation and exploration
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS2024)

点击查看摘要

Abstract:With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and training stability. As a time-efficient diffusion model, although consistency models have been validated in online state-based RL, it is still an open question whether it can be extended to visual RL. In this paper, we investigate the impact of non-stationary distribution and the actor-critic framework on consistency policy in online RL, and find that consistency policy was unstable during the training, especially in visual RL with the high-dimensional state space. To this end, we suggest sample-based entropy regularization to stabilize the policy training, and propose a consistency policy with prioritized proximal experience regularization (CP3ER) to improve sample efficiency. CP3ER achieves new state-of-the-art (SOTA) performance in 21 tasks across DeepMind control suite and Meta-world. To our knowledge, CP3ER is the first method to apply diffusion/consistency models to visual RL and demonstrates the potential of consistency models in visual RL. More visualization results are available at this https URL.

[LG-103] CycleBNN: Cyclic Precision Training in Binary Neural Networks ECCV-2024

链接: https://arxiv.org/abs/2410.00050
作者: Federico Fontana,Romeo Lanzino,Anxhelo Diko,Gian Luca Foresti,Luigi Cinque
关键词-EN: Binary Neural Networks, Binary Neural, offering significant reductions, Neural Networks, offering significant
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Published at Workshop CADL, ECCV-2024

点击查看摘要

Abstract:This paper works on Binary Neural Networks (BNNs), a promising avenue for efficient deep learning, offering significant reductions in computational overhead and memory footprint compared to full precision networks. However, the challenge of energy-intensive training and the drop in performance have been persistent issues. Tackling the challenge, prior works focus primarily on task-related inference optimization. Unlike prior works, this study offers an innovative methodology integrating BNNs with cyclic precision training, introducing the CycleBNN. This approach is designed to enhance training efficiency while minimizing the loss in performance. By dynamically adjusting precision in cycles, we achieve a convenient trade-off between training efficiency and model performance. This emphasizes the potential of our method in energy-constrained training scenarios, where data is collected onboard and paves the way for sustainable and efficient deep learning architectures. To gather insights on CycleBNN’s efficiency, we conduct experiments on ImageNet, CIFAR-10, and PASCAL-VOC, obtaining competitive performances while using 96.09% fewer operations during training on ImageNet, 88.88% on CIFAR-10 and 96.09% on PASCAL-VOC. Finally, CycleBNN offers a path towards faster, more accessible training of efficient networks, accelerating the development of practical applications. The PyTorch code is available at this https URL
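
The core idea is to vary numerical precision cyclically during training. The abstract does not give the schedule, so the sketch below uses a generic cosine-shaped cycle between a maximum and minimum bit-width together with a toy uniform quantizer; both are assumptions for illustration, not CycleBNN's actual schedule.

```python
import math

def cyclic_bitwidth(step: int, cycle_len: int = 1000,
                    min_bits: int = 1, max_bits: int = 8) -> int:
    """Bit-width oscillating between max_bits and min_bits over one cycle.
    The cosine schedule is an assumption; the paper only states that
    precision is adjusted dynamically in cycles."""
    phase = (step % cycle_len) / cycle_len                 # position in [0, 1)
    frac = 0.5 * (1.0 + math.cos(2.0 * math.pi * phase))   # 1 -> 0 -> 1 over the cycle
    return round(min_bits + frac * (max_bits - min_bits))

def quantize(x: float, bits: int) -> float:
    """Uniform quantization of a value in [-1, 1] to the given bit-width."""
    levels = 2 ** bits - 1
    return round((x + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

if __name__ == "__main__":
    for step in (0, 250, 500, 750, 999):
        b = cyclic_bitwidth(step)
        print(step, b, quantize(0.3712, b))
```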

[LG-104] Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph

链接: https://arxiv.org/abs/2410.00049
作者: Guancheng Wan,Zewen Liu,Max S.Y. Lau,B. Aditya Prakash,Wei Jin
关键词-EN: medical resource allocation, public health strategies, efficient medical resource, Effective epidemic forecasting, rapidly spreading infectious
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep-learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code will be available at this https URL.

[LG-105] A Novel Spinor-Based Embedding Model for Transformers

链接: https://arxiv.org/abs/2410.00038
作者: Rick White
关键词-EN: geometric algebra, paper proposes, models by utilizing, Transformer models, word embeddings
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:This paper proposes a novel approach to word embeddings in Transformer models by utilizing spinors from geometric algebra. Spinors offer a rich mathematical framework capable of capturing complex relationships and transformations in high-dimensional spaces. By encoding words as spinors, we aim to enhance the expressiveness and robustness of language representations. We present the theoretical foundations of spinors, detail their integration into Transformer architectures, and discuss potential advantages and challenges.

[LG-106] Prediction and Detection of Terminal Diseases Using Internet of Medical Things: A Review

链接: https://arxiv.org/abs/2410.00034
作者: Akeem Temitope Otapo,Alice Othmani,Ghazaleh Khodabandelou,Zuheng Ming
关键词-EN: Artificial Intelligence, Medical Things, Internet of Medical, integration of Artificial, Machine Learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) and the Internet of Medical Things (IoMT) in healthcare, through Machine Learning (ML) and Deep Learning (DL) techniques, has advanced the prediction and diagnosis of chronic diseases. AI-driven models such as XGBoost, Random Forest, CNNs, and LSTM RNNs have achieved over 98% accuracy in predicting heart disease, chronic kidney disease (CKD), Alzheimer’s disease, and lung cancer, using datasets from platforms like Kaggle, UCI, private institutions, and real-time IoMT sources. However, challenges persist due to variations in data quality, patient demographics, and formats from different hospitals and research sources. The incorporation of IoMT data, which is vast and heterogeneous, adds complexities in ensuring interoperability and security to protect patient privacy. AI models often struggle with overfitting, performing well in controlled environments but less effectively in real-world clinical settings. Moreover, multi-morbidity scenarios especially for rare diseases like dementia, stroke, and cancers remain insufficiently addressed. Future research should focus on data standardization and advanced preprocessing techniques to improve data quality and interoperability. Transfer learning and ensemble methods are crucial for improving model generalizability across clinical settings. Additionally, the exploration of disease interactions and the development of predictive models for chronic illness intersections is needed. Creating standardized frameworks and open-source tools for integrating federated learning, blockchain, and differential privacy into IoMT systems will also ensure robust data privacy and security.

[LG-107] AutoFlow: An Autoencoder-based Approach for IP Flow Record Compression with Minimal Impact on Traffic Classification

链接: https://arxiv.org/abs/2410.00030
作者: Adrian Pekar
关键词-EN: deep learning techniques, specifically autoencoders, learning techniques, paper presents, compressing IP flow
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 9 pages, submitted to NOMS 2025

点击查看摘要

Abstract:This paper presents a novel approach to compressing IP flow records using deep learning techniques, specifically autoencoders. Our method aims to significantly reduce data volume while maintaining the utility of the compressed data for downstream analysis tasks. We demonstrate the effectiveness of our approach through extensive experiments on a large-scale, real-world network traffic dataset. The proposed autoencoder-based compression achieves a 3.28x reduction in data size while preserving 99.20% accuracy in a multi-class traffic classification task, compared to 99.77% accuracy with uncompressed data. This marginal decrease in performance is offset by substantial gains in storage efficiency and potential improvements in processing speed. Our method shows particular promise in distinguishing between various modern application protocols, including encrypted traffic from popular services. The implications of this work extend to more efficient network monitoring, real-time analysis in resource-constrained environments, and scalable network management solutions.
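
As a rough illustration of the compression idea, the sketch below trains a small PyTorch autoencoder on fixed-length numeric flow-record vectors, with a bottleneck sized for roughly the reported ~3.3x reduction. The feature count, layer widths, and latent size are assumptions; the paper's exact architecture is not given in the abstract.

```python
import torch
import torch.nn as nn

class FlowAutoencoder(nn.Module):
    """Toy autoencoder for numeric IP-flow features (layer sizes are assumptions)."""
    def __init__(self, n_features: int = 40, latent_dim: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),          # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),          # reconstruction
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

if __name__ == "__main__":
    model = FlowAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(256, 40)                    # stand-in for normalized flow records
    for _ in range(5):                          # a few reconstruction steps
        recon, z = model(x)
        loss = nn.functional.mse_loss(recon, x)
        opt.zero_grad(); loss.backward(); opt.step()
    print("latent size vs. input size:", z.shape[-1], "/", x.shape[-1])
```

With 40 input features and a 12-dimensional latent code, the storage ratio is about 3.3x, in the same ballpark as the 3.28x reduction reported in the abstract; the compressed codes would then feed the downstream traffic classifier.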

[LG-108] TREB: a BERT attempt for imputing tabular data imputation

链接: https://arxiv.org/abs/2410.00022
作者: Shuyue Wang,Wenjun Zhou,Han drk-m-s Jiang,Shuo Wang,Ren Zheng
关键词-EN: framework utilizing BERT, imputation framework utilizing, tabular imputation framework, utilizing BERT, introduces a groundbreaking
类目: Machine Learning (cs.LG)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:TREB, a novel tabular imputation framework utilizing BERT, introduces a groundbreaking approach for handling missing values in tabular data. Unlike traditional methods that often overlook the specific demands of imputation, TREB leverages the robust capabilities of BERT to address this critical task. While many BERT-based approaches for tabular data have emerged, they frequently under-utilize the language model’s full potential. To rectify this, TREB employs a BERT-based model fine-tuned specifically for the task of imputing real-valued continuous numbers in tabular datasets. The paper comprehensively addresses the unique challenges posed by tabular data imputation, emphasizing the importance of context-based interconnections. The effectiveness of TREB is validated through rigorous evaluation using the California Housing dataset. The results demonstrate its ability to preserve feature interrelationships and accurately impute missing values. Moreover, the authors shed light on the computational efficiency and environmental impact of TREB, quantifying the floating-point operations (FLOPs) and carbon footprint associated with its training and deployment.

[LG-109] Low-code from frontend to backend: Connecting conversational user interfaces to backend services via a low-code IoT platform

链接: https://arxiv.org/abs/2410.00006
作者: Irene Weber
关键词-EN: frameworks facilitate setting, business functions requires, functions requires substantial, requires substantial manual, substantial manual coding
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 5 pages, 6 figures. In 3rd Conference on Conversational User Interfaces (CUI21), July 2021, Bilbao (online), Spain

点击查看摘要

Abstract:Current chatbot development platforms and frameworks facilitate setting up the language and dialog part of chatbots, while connecting it to backend services and business functions requires substantial manual coding effort and programming skills. This paper proposes an approach to overcome this situation. It proposes an architecture with a chatbot as frontend using an IoT (Internet of Things) platform as a middleware for connections to backend services. Specifically, it elaborates and demonstrates how to combine a chatbot developed on the open source development platform Rasa with the open source platform Node-RED, allowing low-code or no-code development of a transactional conversational user interface from frontend to backend.
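
The architecture described is a Rasa chatbot frontend with Node-RED acting as low-code middleware to backend services. A minimal custom action that forwards a slot value to a Node-RED HTTP-in node is sketched below; the endpoint path, slot name, and response shape are assumptions, while Node-RED's default port 1880 is its standard setting.

```python
# Minimal Rasa custom action that hands a request off to a Node-RED flow.
# The "/order-status" path, the "order_id" slot, and the JSON response shape
# are illustrative assumptions, not taken from the paper.
import requests
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionCheckOrder(Action):
    def name(self) -> str:
        return "action_check_order"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain: dict):
        order_id = tracker.get_slot("order_id")
        # The Node-RED flow behind this URL would call the actual backend service.
        resp = requests.post(
            "http://localhost:1880/order-status",
            json={"order_id": order_id},
            timeout=5,
        )
        status = resp.json().get("status", "unknown")
        dispatcher.utter_message(text=f"Order {order_id} is currently: {status}")
        return []
```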

[LG-110] Linear Projections of Teacher Embeddings for Few-Class Distillation

链接: https://arxiv.org/abs/2409.20449
作者: Noel Loo,Fotis Iliopoulos,Wei Hu,Erik Vee
关键词-EN: smaller student model, complex teacher model, promising approach, approach for transferring, smaller student
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher’s output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher’s internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model’s generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher’s model representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher’s embedding space, and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-classes. Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrate the LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.
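
LELP identifies informative linear subspaces of the teacher's embedding space and splits them into pseudo-subclasses for the student to predict. The sketch below approximates that recipe with per-class PCA followed by k-means in the projected space; the paper's actual subspace selection and splitting rules may differ, so treat this only as the structural idea.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def make_pseudo_subclasses(teacher_emb, labels, n_components=8, subclasses_per_class=4):
    """Split each original class into pseudo-subclasses by clustering the
    teacher's embeddings in a low-dimensional linear projection.
    Simplified stand-in for LELP's subspace construction."""
    pseudo = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        proj = PCA(n_components=n_components).fit_transform(teacher_emb[idx])
        km = KMeans(n_clusters=subclasses_per_class, n_init=10).fit(proj)
        pseudo[idx] = c * subclasses_per_class + km.labels_   # unique id per (class, cluster)
    return pseudo   # the student is then trained with cross-entropy on these pseudo-labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 128))          # stand-in for teacher embeddings
    y = rng.integers(0, 2, size=1000)           # binary task, where plain KD struggles
    print(np.bincount(make_pseudo_subclasses(emb, y)))
```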

[LG-111] Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments

链接: https://arxiv.org/abs/2410.00903
作者: Kosuke Imai,Kentaro Nakamura
关键词-EN: generative Artificial Intelligence, Artificial Intelligence, unstructured high-dimensional treatments, generative Artificial, enhance the validity
类目: Applications (stat.AP); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence. Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps separate the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike the existing methods, our proposed approach eliminates the need to learn causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to the settings, in which the treatment feature is based on human perception rather than is assumed to be fixed given the treatment object. We conduct simulation studies using the generated text data with an open-source LLM, Llama3, to illustrate the advantages of our estimator over the state-of-the-art causal representation learning algorithms.

[LG-112] An EM Gradient Algorithm for Mixture Models with Components Derived from the Manly Transformation

链接: https://arxiv.org/abs/2410.00848
作者: Katharine M. Clark,Paul D. McNicholas
关键词-EN: Zhu and Melnykov, Manly transformation, fit mixture models, fit mixture, components are derived
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zhu and Melnykov (2018) develop a model to fit mixture models when the components are derived from the Manly transformation. Their EM algorithm utilizes Nelder-Mead optimization in the M-step to update the skew parameter, \boldsymbol\lambda_g . An alternative EM gradient algorithm is proposed, using one step of Newton’s method, when initial estimates for the model parameters are good.
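
For context, the Manly transformation and the single Newton-type update that replaces the Nelder-Mead search can be written as below; the notation is a plausible reading of the abstract, with Q denoting the expected complete-data log-likelihood from the E-step.

```latex
% Manly transformation of a coordinate x with skew parameter \lambda
% (with \psi(x;0)=x as the limiting case):
\psi(x;\lambda) = \frac{e^{\lambda x}-1}{\lambda}, \qquad \lambda \neq 0 .

% EM gradient update: instead of a Nelder--Mead search in the M-step,
% take one Newton step on Q, the expected complete-data log-likelihood:
\boldsymbol{\lambda}_g^{(t+1)}
  = \boldsymbol{\lambda}_g^{(t)}
  - \left[\nabla^2_{\boldsymbol{\lambda}_g}
          Q\!\left(\boldsymbol{\lambda}_g^{(t)}\right)\right]^{-1}
    \nabla_{\boldsymbol{\lambda}_g} Q\!\left(\boldsymbol{\lambda}_g^{(t)}\right).
```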

[LG-113] WALINET: A water and lipid identification convolutional Neural Network for nuisance signal removal in 1H MR Spectroscopic Imaging

链接: https://arxiv.org/abs/2410.00746
作者: Paul Weiser,Georg Langs,Stanislav Motyka,Wolfgang Bogner,Sébastien Courvoisier,Malte Hoffmann,Antoine Klauser,Ovidiu C. Andronesi
关键词-EN: Resonance Spectroscopic Imaging, Magnetic Resonance Spectroscopic, Proton Magnetic Resonance, Spectroscopic Imaging, Magnetic Resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Purpose. Proton Magnetic Resonance Spectroscopic Imaging (1H-MRSI) provides non-invasive spectral-spatial mapping of metabolism. However, long-standing problems in whole-brain 1H-MRSI are spectral overlap of metabolite peaks with large lipid signal from scalp, and overwhelming water signal that distorts spectra. Fast and effective methods are needed for high-resolution 1H-MRSI to accurately remove lipid and water signals while preserving the metabolite signal. The potential of supervised neural networks for this task remains unexplored, despite their success for other MRSI processing. Methods. We introduce a deep-learning method based on a modified Y-NET network for water and lipid removal in whole-brain 1H-MRSI. The WALINET (WAter and LIpid neural NETwork) was compared to conventional methods such as the state-of-the-art lipid L2 regularization and Hankel-Lanczos singular value decomposition (HLSVD) water suppression. Methods were evaluated on simulated and in-vivo whole-brain MRSI using NMRSE, SNR, CRLB, and FWHM metrics. Results. WALINET is significantly faster and needs 8s for high-resolution whole-brain MRSI, compared to 42 minutes for conventional HLSVD+L2. Quantitative analysis shows WALINET has better performance than HLSVD+L2: 1) more lipid removal with 41% lower NRMSE, 2) better metabolite signal preservation with 71% lower NRMSE in simulated data, 155% higher SNR and 50% lower CRLB in in-vivo data. Metabolic maps obtained by WALINET in healthy subjects and patients show better gray/white-matter contrast with more visible structural details. Conclusions. WALINET has superior performance for nuisance signal removal and metabolite quantification on whole-brain 1H-MRSI compared to conventional state-of-the-art techniques. This represents a new application of deep-learning for MRSI processing, with potential for automated high-throughput workflow.

[LG-114] NECOMIMI: Neural-Cognitive Multimodal EEG-informed Image Generation with Diffusion Models

链接: https://arxiv.org/abs/2410.00712
作者: Chi-Sheng Chen
关键词-EN: advanced diffusion models, NEural-COgnitive MultImodal EEG-Informed, MultImodal EEG-Informed Image, generating images directly, Diffusion Models
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:NECOMIMI (NEural-COgnitive MultImodal EEG-Informed Image Generation with Diffusion Models) introduces a novel framework for generating images directly from EEG signals using advanced diffusion models. Unlike previous works that focused solely on EEG-image classification through contrastive learning, NECOMIMI extends this task to image generation. The proposed NERV EEG encoder demonstrates state-of-the-art (SoTA) performance across multiple zero-shot classification tasks, including 2-way, 4-way, and 200-way, and achieves top results in our newly proposed Category-based Assessment Table (CAT) Score, which evaluates the quality of EEG-generated images based on semantic concepts. A key discovery of this work is that the model tends to generate abstract or generalized images, such as landscapes, rather than specific objects, highlighting the inherent challenges of translating noisy and low-resolution EEG data into detailed visual outputs. Additionally, we introduce the CAT Score as a new metric tailored for EEG-to-image evaluation and establish a benchmark on the ThingsEEG dataset. This study underscores the potential of EEG-to-image generation while revealing the complexities and challenges that remain in bridging neural activity with visual representation.

[LG-115] Hybrid Quantum Neural Network based Indoor User Localization using Cloud Quantum Computing

链接: https://arxiv.org/abs/2410.00708
作者: Sparsh Mittal,Yash Chand,Neel Kanth Kundu
关键词-EN: signal strength indicator, received signal strength, quantum neural network, hybrid quantum neural, RSSI localization datasets
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This work has been accepted for presentation at the IEEE TENSYMP 2024 conference

点击查看摘要

Abstract:This paper proposes a hybrid quantum neural network (HQNN) for indoor user localization using received signal strength indicator (RSSI) values. We use publicly available RSSI datasets for indoor localization using WiFi, Bluetooth, and Zigbee to test the performance of the proposed HQNN. We also compare the performance of the HQNN with the recently proposed quantum fingerprinting-based user localization method. Our results show that the proposed HQNN performs better than the quantum fingerprinting algorithm since the HQNN has trainable parameters in the quantum circuits, whereas the quantum fingerprinting algorithm uses a fixed quantum circuit to calculate the similarity between the test data point and the fingerprint dataset. Unlike prior works, we also test the performance of the HQNN and quantum fingerprint algorithm on a real IBM quantum computer using cloud quantum computing services. Therefore, this paper examines the performance of the HQNN on noisy intermediate scale (NISQ) quantum devices using real-world RSSI localization datasets. The novelty of our approach lies in the use of simple feature maps and ansatz with fewer neurons, alongside testing on actual quantum hardware using real-world data, demonstrating practical applicability in real-world scenarios.

[LG-116] Optimizing Photoplethysmography-Based Sleep Staging Models by Leveraging Temporal Context for Wearable Devices Applications

链接: https://arxiv.org/abs/2410.00693
作者: Joseph A. P. Quino,Diego A. C. Cardenas,Marcelo A. F. Toledo,Felipe M. Dias,Estela Ribeiro,Jose E. Krieger,Marco A. Gutierrez
关键词-EN: evaluating sleep quality, diagnosing sleep disorders, classification is crucial, crucial for diagnosing, disorders and evaluating
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 11 pages, 5 figures, 1 table

点击查看摘要

Abstract:Accurate sleep stage classification is crucial for diagnosing sleep disorders and evaluating sleep quality. While polysomnography (PSG) remains the gold standard, photoplethysmography (PPG) is more practical due to its affordability and widespread use in wearable devices. However, state-of-the-art sleep staging methods often require prolonged continuous signal acquisition, making them impractical for wearable devices due to high energy consumption. Shorter signal acquisitions are more feasible but less accurate. Our work proposes an adapted sleep staging model based on top-performing state-of-the-art methods and evaluates its performance with different PPG segment sizes. We concatenate 30-second PPG segments over 15-minute intervals to leverage longer segment contexts. This approach achieved an accuracy of 0.75, a Cohen’s Kappa of 0.60, an F1-Weighted score of 0.74, and an F1-Macro score of 0.60. Although reducing segment size decreased sensitivity for deep and REM stages, our strategy outperformed single 30-second window methods, particularly for these stages.

[LG-117] TAVRNN: Temporal Attention-enhanced Variational Graph RNN Captures Neural Dynamics and Behavior

链接: https://arxiv.org/abs/2410.00665
作者: Moein Khajehnejad,Forough Habibollahi,Ahmad Khajehnejad,Brett J. Kagan,Adeel Razi
关键词-EN: Recurrent Neural Network, Graph Recurrent Neural, introduce Temporal Attention-enhanced, Temporal Attention-enhanced Variational, Recurrent Neural
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 31 pages, 6 figures, 4 supplemental figures, 4 tables, 8 supplemental tables

点击查看摘要

Abstract:We introduce Temporal Attention-enhanced Variational Graph Recurrent Neural Network (TAVRNN), a novel framework for analyzing the evolving dynamics of neuronal connectivity networks in response to external stimuli and behavioral feedback. TAVRNN captures temporal changes in network structure by modeling sequential snapshots of neuronal activity, enabling the identification of key connectivity patterns. Leveraging temporal attention mechanisms and variational graph techniques, TAVRNN uncovers how connectivity shifts align with behavior over time. We validate TAVRNN on two datasets: in vivo calcium imaging data from freely behaving rats and novel in vitro electrophysiological data from the DishBrain system, where biological neurons control a simulated environment during the game of pong. We show that TAVRNN outperforms previous baseline models in classification, clustering tasks and computational efficiency while accurately linking connectivity changes to performance variations. Crucially, TAVRNN reveals that high game performance in the DishBrain system correlates with the alignment of sensory and motor subregion channels, a relationship not evident in earlier models. This framework represents the first application of dynamic graph representation of electrophysiological (neuronal) data from DishBrain system, providing insights into the reorganization of neuronal networks during learning. TAVRNN’s ability to differentiate between neuronal states associated with successful and unsuccessful learning outcomes, offers significant implications for real-time monitoring and manipulation of biological neuronal systems.

[LG-118] Differentiable Interacting Multiple Model Particle Filtering

链接: https://arxiv.org/abs/2410.00620
作者: John-Joseph Brady,Yuhui Luo,Wenwu Wang,Víctor Elvira,Yunpeng Li
关键词-EN: sequential Monte Carlo, Monte Carlo algorithm, Monte Carlo, exhibits random discontinuous, random discontinuous jumps
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We propose a sequential Monte Carlo algorithm for parameter learning when the studied model exhibits random discontinuous jumps in behaviour. To facilitate the learning of high dimensional parameter sets, such as those associated to neural networks, we adopt the emerging framework of differentiable particle filtering, wherein parameters are trained by gradient descent. We design a new differentiable interacting multiple model particle filter to be capable of learning the individual behavioural regimes and the model which controls the jumping simultaneously. In contrast to previous approaches, our algorithm allows control of the computational effort assigned per regime whilst using the probability of being in a given regime to guide sampling. Furthermore, we develop a new gradient estimator that has a lower variance than established approaches and remains fast to compute, for which we prove consistency. We establish new theoretical results of the presented algorithms and demonstrate superior numerical performance compared to the previous state-of-the-art algorithms.

[LG-119] Radio Foundation Models: Pre-training Transformers for 5G-based Indoor Localization

链接: https://arxiv.org/abs/2410.00617
作者: Jonathan Ott,Jonas Pirkl,Maximilian Stahlke,Tobias Feigl,Christopher Mutschler
关键词-EN: Artificial Intelligence, based radio fingerprinting, strong multipath effects, outperforms classic localization, based radio
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI)-based radio fingerprinting (FP) outperforms classic localization methods in propagation environments with strong multipath effects. However, the model and data orchestration of FP are time-consuming and costly, as it requires many reference positions and extensive measurement campaigns for each environment. Instead, modern unsupervised and self-supervised learning schemes require less reference data for localization, but either their accuracy is low or they require additional sensor information, rendering them impractical. In this paper we propose a self-supervised learning framework that pre-trains a general transformer (TF) neural network on 5G channel measurements that we collect on-the-fly without expensive equipment. Our novel pretext task randomly masks and drops input information to learn to reconstruct it. So, it implicitly learns the spatiotemporal patterns and information of the propagation environment that enable FP-based localization. Most interestingly, when we optimize this pre-trained model for localization in a given environment, it achieves the accuracy of state-of-the-art methods but requires ten times less reference data and significantly reduces the time from training to operation.
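
The pretext task randomly masks and drops parts of each 5G channel measurement and trains a transformer to reconstruct them. Below is a minimal sketch of that masked-reconstruction setup; the sequence layout (time steps x features), the masking ratio, and the model sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskedChannelPretrainer(nn.Module):
    """Sketch of a masked-reconstruction pretext task for channel measurements.
    Input: (batch, seq_len, feat) channel snapshots; a random subset of time
    steps is zeroed out and the transformer is trained to reconstruct them."""
    def __init__(self, feat_dim: int = 64, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, feat_dim)

    def forward(self, x: torch.Tensor, mask_ratio: float = 0.3):
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio   # (batch, seq_len)
        corrupted = x.masked_fill(mask.unsqueeze(-1), 0.0)             # drop masked steps
        recon = self.head(self.encoder(self.embed(corrupted)))
        loss = nn.functional.mse_loss(recon[mask], x[mask])            # score masked steps only
        return loss

if __name__ == "__main__":
    model = MaskedChannelPretrainer()
    csi = torch.randn(8, 32, 64)       # stand-in for a batch of 5G channel measurements
    print(float(model(csi)))
```

After pre-training, the encoder would be fine-tuned with a small amount of labeled reference positions for the actual fingerprinting task.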

[LG-120] Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos MICCAI

链接: https://arxiv.org/abs/2410.00536
作者: Krishna Chaitanya,Pablo F. Damasceno,Shreyas Fadnavis,Pooya Mobadersany,Chaitanya Parmar,Emily Scherer,Natalia Zemlianskaia,Lindsey Surace,Louis R. Ghanem,Oana Gabriela Cula,Tommaso Mansi,Kristopher Standish
关键词-EN: Ulcerative Colitis Endoscopic, Colitis Endoscopic Index, Mayo Endoscopic Subscore, evaluating drug efficacy, ulcerative colitis
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 5 tables, accepted at MLMI, MICCAI

点击查看摘要

Abstract:Accurate assessment of disease severity from endoscopy videos in ulcerative colitis (UC) is crucial for evaluating drug efficacy in clinical trials. Severity is often measured by the Mayo Endoscopic Subscore (MES) and Ulcerative Colitis Endoscopic Index of Severity (UCEIS) score. However, expert MES/UCEIS annotation is time-consuming and susceptible to inter-rater variability, factors addressable by automation. Automation attempts with frame-level labels face challenges in fully-supervised solutions due to the prevalence of video-level labels in clinical trials. CNN-based weakly-supervised models (WSL) with end-to-end (e2e) training lack generalization to new disease scores and ignore spatio-temporal information crucial for accurate scoring. To address these limitations, we propose “Arges”, a deep learning framework that utilizes a transformer with positional encoding to incorporate spatio-temporal information from frame features to estimate disease severity scores in endoscopy video. Extracted features are derived from a foundation model (ArgesFM), pre-trained on a large diverse dataset from multiple clinical trials (61M frames, 3927 videos). We evaluate four UC disease severity scores, including MES and three UCEIS component scores. Test set evaluation indicates significant improvements, with F1 scores increasing by 4.1% for MES and 18.8%, 6.6%, 3.8% for the three UCEIS component scores compared to state-of-the-art methods. Prospective validation on previously unseen clinical trial data further demonstrates the model’s successful generalization.

[LG-121] Stability analysis of chaotic systems in latent spaces

链接: https://arxiv.org/abs/2410.00480
作者: Elise Özalp,Luca Magri
关键词-EN: Partial differential equations, Partial differential, chaotic partial differential, latent-space approach, differential equations
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Partial differential equations, and their chaotic solutions, are pervasive in the modelling of complex systems in engineering, science, and beyond. Data-driven methods can find solutions to partial differential equations with a divide-and-conquer strategy: The solution is sought in a latent space, on which the temporal dynamics are inferred (``latent-space’’ approach). This is achieved by, first, compressing the data with an autoencoder, and, second, inferring the temporal dynamics with recurrent neural networks. The overarching goal of this paper is to show that a latent-space approach can not only infer the solution of a chaotic partial differential equation, but it can also predict the stability properties of the physical system. First, we employ the convolutional autoencoder echo state network (CAE-ESN) on the chaotic Kuramoto-Sivashinsky equation for various chaotic regimes. We show that the CAE-ESN (i) finds a low-dimensional latent-space representation of the observations and (ii) accurately infers the Lyapunov exponents and covariant Lyapunov vectors (CLVs) in this low-dimensional manifold for different attractors. Second, we extend the CAE-ESN to a turbulent flow, comparing the Lyapunov spectrum to estimates obtained from Jacobian-free methods. A latent-space approach based on the CAE-ESN effectively produces a latent space that preserves the key properties of the chaotic system, such as Lyapunov exponents and CLVs, thus retaining the geometric structure of the attractor. The latent-space approach based on the CAE-ESN is a reduced-order model that accurately predicts the dynamics of the chaotic system, or, alternatively, it can be used to infer stability properties of chaotic systems from data.

[LG-122] Uncertainty-aware t-distributed Stochastic Neighbor Embedding for Single-cell RNA-seq Data

链接: https://arxiv.org/abs/2410.00473
作者: Hui Ma,Kai Chen
关键词-EN: stochastic neighbor embedding, t-distributed stochastic neighbor, depict biological populations, Nonlinear data visualization, biological populations accurately
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nonlinear data visualization using t-distributed stochastic neighbor embedding (t-SNE) enables the representation of complex single-cell transcriptomic landscapes in two or three dimensions to depict biological populations accurately. However, t-SNE often fails to account for uncertainties in the original dataset, leading to misleading visualizations where cell subsets with noise appear indistinguishable. To address these challenges, we introduce uncertainty-aware t-SNE (Ut-SNE), a noise-defending visualization tool tailored for uncertain single-cell RNA-seq data. By creating a probabilistic representation for each sample, Our Ut-SNE accurately incorporates noise about transcriptomic variability into the visual interpretation of single-cell RNA sequencing data, revealing significant uncertainties in transcriptomic variability. Through various examples, we showcase the practical value of Ut-SNE and underscore the significance of incorporating uncertainty awareness into data visualization practices. This versatile uncertainty-aware visualization tool can be easily adapted to other scientific domains beyond single-cell RNA sequencing, making them valuable resources for high-dimensional data analysis.

[LG-123] A Generalized Mean Approach for Distributed-PCA

链接: https://arxiv.org/abs/2410.00397
作者: Zhi-Yu Jou,Su-Yun Huang,Hung Hung,Shinto Eguchi
关键词-EN: Principal component analysis, Principal component, beta, DPCA, component analysis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 1 table, 1 figure

点击查看摘要

Abstract:Principal component analysis (PCA) is a widely used technique for dimension reduction. As datasets continue to grow in size, distributed-PCA (DPCA) has become an active research area. A key challenge in DPCA lies in efficiently aggregating results across multiple machines or computing nodes due to computational overhead. Fan et al. (2019) introduced a pioneering DPCA method to estimate the leading rank- r eigenspace, aggregating local rank- r projection matrices by averaging. However, their method does not utilize eigenvalue information. In this article, we propose a novel DPCA method that incorporates eigenvalue information to aggregate local results via the matrix \beta -mean, which we call \beta -DPCA. The matrix \beta -mean offers a flexible and robust aggregation method through the adjustable choice of \beta values. Notably, for \beta=1 , it corresponds to the arithmetic mean; for \beta=-1 , the harmonic mean; and as \beta \to 0 , the geometric mean. Moreover, the matrix \beta -mean is shown to associate with the matrix \beta -divergence, a subclass of the Bregman matrix divergence, to support the robustness of \beta -DPCA. We also study the stability of eigenvector ordering under eigenvalue perturbation for \beta -DPCA. The performance of our proposal is evaluated through numerical studies.
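
As a numerical illustration of one plausible reading of the aggregation step, the sketch below combines local symmetric positive-definite summaries through a matrix power mean, which reduces to the arithmetic, harmonic, and (log-Euclidean) geometric means at beta = 1, -1, and beta -> 0. The paper's exact definition of the matrix beta-mean and its link to the matrix beta-divergence may differ, so treat this as a stand-in.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm, expm

def matrix_beta_mean(mats, beta):
    """Power-mean aggregation of symmetric positive-definite matrices:
    M_beta = ( (1/m) * sum_k S_k^beta )^(1/beta).
    beta=1 arithmetic, beta=-1 harmonic, beta=0 used here as the
    log-Euclidean (geometric-type) limit. A plausible reading of the
    abstract, not the paper's exact formula."""
    mats = [np.asarray(S, dtype=float) for S in mats]
    if beta == 0:
        return np.real(expm(sum(logm(S) for S in mats) / len(mats)))
    avg = sum(fractional_matrix_power(S, beta) for S in mats) / len(mats)
    return np.real(fractional_matrix_power(avg, 1.0 / beta))   # .real guards tiny residue

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Local covariance-like summaries from 3 "machines" (SPD by construction).
    locals_ = [(lambda A: A @ A.T + np.eye(4))(rng.normal(size=(4, 4))) for _ in range(3)]
    for b in (1.0, -1.0, 0.0):
        M = matrix_beta_mean(locals_, b)
        print(f"beta={b:+.0f}  top eigenvalue of aggregate: {np.linalg.eigvalsh(M)[-1]:.3f}")
```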

[LG-124] Dynamic neurons: A statistical physics approach for analyzing deep neural networks

链接: https://arxiv.org/abs/2410.00396
作者: Donghee Lee,Hye-Sung Lee,Jaeok Yi
关键词-EN: repetitive structural elements, deep neural networks, Deep neural, neural network architectures, structural elements
类目: Statistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Deep neural network architectures often consist of repetitive structural elements. We introduce a new approach that reveals these patterns and can be broadly applied to the study of deep learning. Similar to how a power strip helps untangle and organize complex cable connections, this approach treats neurons as additional degrees of freedom in interactions, simplifying the structure and enhancing the intuitive understanding of interactions within deep neural networks. Furthermore, it reveals the translational symmetry of deep neural networks, which simplifies the application of the renormalization group transformation - a method that effectively analyzes the scaling behavior of the system. By utilizing translational symmetry and renormalization group transformations, we can analyze critical phenomena. This approach may open new avenues for studying deep neural networks using statistical physics.

[LG-125] ROK Defense M&S in the Age of Hyperscale AI: Concepts, Challenges, and Future Directions

链接: https://arxiv.org/abs/2410.00367
作者: Youngjoon Lee,Taehyun Park,Yeongjoon Kang,Jonghoe Kim,Joonhyuk Kang
关键词-EN: Integrating hyperscale, national defense modeling, modeling and simulation, crucial for enhancing, enhancing strategic
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Integrating hyperscale AI into national defense modeling and simulation (M&S) is crucial for enhancing strategic and operational capabilities. We explore how hyperscale AI can revolutionize defense M&S by providing unprecedented accuracy, speed, and the ability to simulate complex scenarios. Countries such as the United States and China are at the forefront of adopting these technologies and are experiencing varying degrees of success. Maximizing the potential of hyperscale AI necessitates addressing critical challenges, such as closed networks, long-tail data, complex decision-making, and a shortage of experts. Future directions emphasize the adoption of domestic foundation models, the investment in various GPUs / NPUs, the utilization of big tech services, and the use of open source software. These initiatives will enhance national security, maintain competitive advantages, and promote broader technological and economic progress. With this blueprint, the Republic of Korea can strengthen its defense capabilities and stay ahead of the emerging threats of modern warfare.

[LG-126] Contrastive Representation Learning for Predicting Solar Flares from Extremely Imbalanced Multivariate Time Series Data ICMLA

链接: https://arxiv.org/abs/2410.00312
作者: Onur Vural,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Sun magnetic flux, presenting significant risks, multivariate time series, time series, Sun magnetic
类目: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been accepted at ICMLA 2024 on September 7, 2024, as a short paper for poster presentation

点击查看摘要

Abstract:Major solar flares are abrupt surges in the Sun’s magnetic flux, presenting significant risks to technological infrastructure. In view of this, effectively predicting major flares from solar active region magnetic field data through machine learning methods becomes highly important in space weather research. Magnetic field data can be represented in multivariate time series modality where the data displays an extreme class imbalance due to the rarity of major flare events. In time series classification-based flare prediction, the use of contrastive representation learning methods has been relatively limited. In this paper, we introduce CONTREX, a novel contrastive representation learning approach for multivariate time series data, addressing challenges of temporal dependencies and extreme class imbalance. Our method involves extracting dynamic features from the multivariate time series instances, deriving two extremes from positive and negative class feature vectors that provide maximum separation capability, and training a sequence representation embedding module with the original multivariate time series data guided by our novel contrastive reconstruction loss to generate embeddings aligned with the extreme points. These embeddings capture essential time series characteristics and enhance discriminative power. Our approach shows promising solar flare prediction results on the Space Weather Analytics for Solar Flares (SWAN-SF) multivariate time series benchmark dataset against baseline methods.

[LG-127] GARCH-Informed Neural Networks for Volatility Prediction in Financial Markets

链接: https://arxiv.org/abs/2410.00288
作者: Zeda Xu,John Liechty,Sebastian Benthall,Nicholas Skar-Gislinge,Christopher McComb
关键词-EN: Autoregressive Conditional Heteroscedasticity, Deep Neural Network, Generalized Autoregressive Conditional, Neural Network, dispersion of returns
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Volatility, which indicates the dispersion of returns, is a crucial measure of risk and is hence used extensively for pricing and discriminating between different financial investments. As a result, accurate volatility prediction receives extensive attention. The Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model and its succeeding variants are well established models for stock volatility forecasting. More recently, deep learning models have gained popularity in volatility prediction as they demonstrated promising accuracy in certain time series prediction tasks. Inspired by Physics-Informed Neural Networks (PINN), we constructed a new, hybrid Deep Learning model that combines the strengths of GARCH with the flexibility of a Long Short-Term Memory (LSTM) Deep Neural Network (DNN), thus capturing and forecasting market volatility more accurately than either class of models are capable of on their own. We refer to this novel model as a GARCH-Informed Neural Network (GINN). When compared to other time series models, GINN showed superior out-of-sample prediction performance in terms of the Coefficient of Determination ( R^2 ), Mean Squared Error (MSE), and Mean Absolute Error (MAE).
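
One simple way to picture the hybrid idea is to compute a GARCH(1,1) conditional-variance series from returns and feed it, alongside the returns, into an LSTM that predicts next-step volatility. The sketch below does exactly that; the parameter values and the way the two components are wired together are assumptions (the paper's GINN may couple GARCH and the LSTM differently, e.g., through the loss).

```python
import numpy as np
import torch
import torch.nn as nn

def garch11_variance(returns, omega=1e-6, alpha=0.05, beta=0.9):
    """GARCH(1,1) conditional variance recursion:
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}."""
    sigma2 = np.empty_like(returns)
    sigma2[0] = returns.var()
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

class GARCHInformedLSTM(nn.Module):
    """LSTM that sees [return, GARCH variance] at each step (a sketch of the
    GARCH-informed idea; the paper's exact coupling may differ)."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)           # next-step volatility estimate

    def forward(self, returns: torch.Tensor, sigma2: torch.Tensor):
        x = torch.stack([returns, sigma2], dim=-1)   # (batch, T, 2)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])                    # predict one step ahead

if __name__ == "__main__":
    r = np.random.default_rng(0).normal(0, 0.01, size=500)   # toy daily returns
    s2 = garch11_variance(r)
    model = GARCHInformedLSTM()
    pred = model(torch.tensor(r[None, :], dtype=torch.float32),
                 torch.tensor(s2[None, :], dtype=torch.float32))
    print("predicted next-step variance proxy (untrained):", float(pred))
```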

[LG-128] GalaxiesML: a dataset of galaxy images, photometry, redshifts and structural parameters for machine learning

链接: https://arxiv.org/abs/2410.00271
作者: Tuan Do(1),Bernie Boscoe(2),Evan Jones(1),Yun Qi Li(1,3),Kevin Alfaro(1) ((1) UCLA, (2) Southern Oregon University, (3) University of Washington)
关键词-EN: machine learning, machine learning applications, learning applications consisting, dataset, machine
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 19 pages, 6 figures, data available at this https URL , example code of usage at this https URL

点击查看摘要

Abstract:We present a dataset built for machine learning applications consisting of galaxy photometry, images, spectroscopic redshifts, and structural properties. This dataset comprises 286,401 galaxy images and photometry from the Hyper-Suprime-Cam Survey PDR2 in five imaging filters ( g,r,i,z,y ) with spectroscopically confirmed redshifts as ground truth. Such a dataset is important for machine learning applications because it is uniform, consistent, and has minimal outliers but still contains a realistic range of signal-to-noise ratios. We make this dataset public to help spur development of machine learning methods for the next generation of surveys such as Euclid and LSST. The aim of GalaxiesML is to provide a robust dataset that can be used not only for astrophysics but also for machine learning, where image properties cannot be validated by the human eye and are instead governed by physical laws. We describe the challenges associated with putting together a dataset from publicly available archives, including outlier rejection, duplication, establishing ground truths, and sample selection. This is one of the largest public machine learning-ready training sets of its kind with redshifts ranging from 0.01 to 4. The redshift distribution of this sample peaks at redshift of 1.5 and falls off rapidly beyond redshift 2.5. We also include an example application of this dataset for redshift estimation, demonstrating that using images for redshift estimation produces more accurate results compared to using photometry alone. For example, the bias in redshift estimate is a factor of 10 lower when using images between redshift of 0.1 to 1.25 compared to photometry alone. Results from dataset such as this will help inform us on how to best make use of data from the next generation of galaxy surveys.

[LG-129] Stochastic Inverse Problem: stability, regularization and Wasserstein gradient flow

链接: https://arxiv.org/abs/2410.00229
作者: Qin Li,Maria Oprea,Li Wang,Yunan Yang
关键词-EN: unknown parameter, physical or biological, biological sciences, sciences often involve, involve recovering
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Inverse problems in physical or biological sciences often involve recovering an unknown parameter that is random. The sought-after quantity is a probability distribution of the unknown parameter, that produces data that aligns with measurements. Consequently, these problems are naturally framed as stochastic inverse problems. In this paper, we explore three aspects of this problem: direct inversion, variational formulation with regularization, and optimization via gradient flows, drawing parallels with deterministic inverse problems. A key difference from the deterministic case is the space in which we operate. Here, we work within probability space rather than Euclidean or Sobolev spaces, making tools from measure transport theory necessary for the study. Our findings reveal that the choice of metric – both in the design of the loss function and in the optimization process – significantly impacts the stability and properties of the optimizer.

[LG-130] Volumetric Conditional Score-based Residual Diffusion Model for PET/MR Denoising MICCAI2024

链接: https://arxiv.org/abs/2410.00184
作者: Siyeop Yoon,Rui Hu,Yuang Wang,Matthew Tivnan,Young-don Son,Dufan Wu,Xiang Li,Kyungsang Kim,Quanzheng Li
关键词-EN: powerful modality offering, offering quantitative assessments, modality offering quantitative, PET imaging, PET
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:PET imaging is a powerful modality offering quantitative assessments of molecular and physiological processes. The necessity for PET denoising arises from the intrinsic high noise levels in PET imaging, which can significantly hinder the accurate interpretation and quantitative analysis of the scans. With advances in deep learning techniques, diffusion model-based PET denoising techniques have shown remarkable performance improvement. However, these models often face limitations when applied to volumetric data. Additionally, many existing diffusion models do not adequately consider the unique characteristics of PET imaging, such as its 3D volumetric nature, leading to the potential loss of anatomic consistency. Our Conditional Score-based Residual Diffusion (CSRD) model addresses these issues by incorporating a refined score function and 3D patch-wise training strategy, optimizing the model for efficient volumetric PET denoising. The CSRD model significantly lowers computational demands and expedites the denoising process. By effectively integrating volumetric data from PET and MRI scans, the CSRD model maintains spatial coherence and anatomical detail. Lastly, we demonstrate that the CSRD model achieves superior denoising performance in both qualitative and quantitative evaluations while maintaining image details and outperforms existing state-of-the-art methods.

[LG-131] Multimodal Alignment of Histopathological Images Using Cell Segmentation and Point Set Matching for Integrative Cancer Analysis

链接: https://arxiv.org/abs/2410.00152
作者: Jun Jiang,Raymond Moore,Brenna Novotny,Leo Liu,Zachary Fogarty,Ray Guo,Markovic Svetomir,Chen Wang
关键词-EN: providing complementary insights, Hematoxylin and Eosin, Histopathological imaging, multiplexed Immunofluorescence, providing complementary
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: initial version

点击查看摘要

Abstract:Histopathological imaging is vital for cancer research and clinical practice, with multiplexed Immunofluorescence (MxIF) and Hematoxylin and Eosin (HE) providing complementary insights. However, aligning different stains at the cell level remains a challenge due to modality differences. In this paper, we present a novel framework for multimodal image alignment using cell segmentation outcomes. By treating cells as point sets, we apply Coherent Point Drift (CPD) for initial alignment and refine it with Graph Matching (GM). Evaluated on ovarian cancer tissue microarrays (TMAs), our method achieves high alignment accuracy, enabling integration of cell-level features across modalities and generating virtual HE images from MxIF data for enhanced clinical interpretation.
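
The pipeline treats segmented cells as point sets, initializes alignment with Coherent Point Drift and refines it with graph matching. The sketch below is a much simpler stand-in for that idea: remove the global translation between the two centroid sets and solve a one-to-one assignment with the Hungarian algorithm; it omits CPD's probabilistic registration and the graph-matching refinement entirely.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def match_cells(mxif_xy, he_xy):
    """Simplified cross-stain cell matching: remove the global translation,
    then solve a one-to-one assignment on centroid distances.
    (Stand-in for the paper's CPD initialization + graph-matching refinement.)"""
    a = mxif_xy - mxif_xy.mean(axis=0)        # center MxIF cell centroids
    b = he_xy - he_xy.mean(axis=0)            # center HE cell centroids
    cost = cdist(a, b)                        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return list(zip(rows, cols)), cost[rows, cols].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    he = rng.uniform(0, 100, size=(50, 2))                                   # 50 cell centroids
    mxif = he + np.array([12.0, -7.5]) + rng.normal(0, 0.3, size=he.shape)   # shifted + jitter
    pairs, mean_err = match_cells(mxif, he)
    print("matched pairs:", len(pairs), " mean residual:", round(mean_err, 3))
```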

[LG-132] Shuffled Linear Regression via Spectral Matching

链接: https://arxiv.org/abs/2410.00078
作者: Hang Liu,Anna Scaglione
关键词-EN: Shuffled linear regression, linear regression, linear transformation, complicated by unknown, shuffled LASSO formulations
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Spectral Theory (math.SP); Machine Learning (stat.ML)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Shuffled linear regression (SLR) seeks to estimate latent features through a linear transformation, complicated by unknown permutations in the measurement dimensions. This problem extends traditional least-squares (LS) and Least Absolute Shrinkage and Selection Operator (LASSO) approaches by jointly estimating the permutation, resulting in shuffled LS and shuffled LASSO formulations. Existing methods, constrained by the combinatorial complexity of permutation recovery, often address small-scale cases with limited measurements. In contrast, we focus on large-scale SLR, particularly suited for environments with abundant measurement samples. We propose a spectral matching method that efficiently resolves permutations by aligning spectral components of the measurement and feature covariances. Rigorous theoretical analyses demonstrate that our method achieves accurate estimates in both shuffled LS and shuffled LASSO settings, given a sufficient number of samples. Furthermore, we extend our approach to address simultaneous pose and correspondence estimation in image registration tasks. Experiments on synthetic datasets and real-world image registration scenarios show that our method outperforms existing algorithms in both estimation accuracy and registration performance.
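
A toy illustration of the spectral-matching intuition: when measurements are a row-permuted linear transform of latent features, the leading eigenvectors of the measurement covariance are, up to sign, a row-permuted version of those of the model-implied covariance, so the permutation can be recovered by a linear assignment on the absolute eigenvector rows. This simplified toy (known transform, identity feature covariance, many samples) is only the intuition, not the paper's estimator or its guarantees.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d, m = 12, 4, 20000                 # measurement dim, feature dim, samples

A = rng.normal(size=(n, d))            # known linear transform
perm = rng.permutation(n)              # unknown shuffling of measurement coordinates
X = rng.normal(size=(m, d))            # latent features with identity covariance
Y = (X @ A.T)[:, perm] + 0.01 * rng.normal(size=(m, n))   # shuffled, noisy measurements

# Spectral matching: compare eigenvectors of the sample covariance of Y
# with those of the model-implied covariance A A^T.
r = d                                                      # informative components
_, U_model = np.linalg.eigh(A @ A.T)
_, U_meas = np.linalg.eigh(np.cov(Y, rowvar=False))
Rm, Ry = np.abs(U_model[:, -r:]), np.abs(U_meas[:, -r:])   # |.| removes sign ambiguity

cost = ((Ry[:, None, :] - Rm[None, :, :]) ** 2).sum(-1)    # row-to-row distances
rows, cols = linear_sum_assignment(cost)
perm_hat = np.empty(n, dtype=int)
perm_hat[rows] = cols                                      # measurement row -> model row

print("correctly recovered entries:", int((perm_hat == perm).sum()), "/", n)
```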

[LG-133] Denoising Variational Autoencoder as a Feature Reduction Pipeline for the diagnosis of Autism based on Resting-state fMRI

链接: https://arxiv.org/abs/2410.00068
作者: Xinyuan Zheng,Orren Ravid,Robert A.J. Barry,Yoojean Kim,Qian Wang,Young-geun Kim,Xi Zhu,Xiaofu He
关键词-EN: Autism spectrum disorders, developmental conditions characterized, Autism spectrum, spectrum disorders, difficulties in communication
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Autism spectrum disorders (ASDs) are developmental conditions characterized by restricted interests and difficulties in communication. The complexity of ASD has resulted in a deficiency of objective diagnostic biomarkers. Deep learning methods have gained recognition for addressing these challenges in neuroimaging analysis, but finding and interpreting such diagnostic biomarkers are still challenging computationally. We propose an ASD feature reduction pipeline using resting-state fMRI (rs-fMRI). We used Ncuts parcellations and Power atlas to extract functional connectivity data, resulting in over 30 thousand features. Then the pipeline further compresses the connectivities into 5 latent Gaussian distributions, providing a low-dimensional representation of the data, using a denoising variational autoencoder (DVAE). To test the method, we employed the extracted latent features from the DVAE to classify ASD using traditional classifiers such as support vector machine (SVM) on a large multi-site dataset. The 95% confidence interval for the prediction accuracy of the SVM is [0.63, 0.76] after site harmonization using the extracted latent distributions. Without using DVAE, the prediction accuracy is 0.70, which falls within the interval. This implies that the model successfully encodes the diagnostic information in rs-fMRI data to 5 Gaussian distributions (10 features) without sacrificing prediction performance. The runtime for training the DVAE and obtaining classification results from its extracted latent features (37 minutes) was 7 times shorter compared to training classifiers directly on the raw connectivity matrices (5-6 hours). Our findings also suggest that the Power atlas provides more effective brain connectivity insights for diagnosing ASD than Ncuts parcellations. The encoded features can be used to aid the diagnosis and interpretation of the disease.
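
A minimal sketch of the feature-reduction step: corrupt the connectivity vector with noise, encode it into 5 latent Gaussians (5 means + 5 log-variances, matching the 10 features mentioned in the abstract), and decode back. The noise level, layer widths, and loss weighting are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenoisingVAE(nn.Module):
    """Sketch of the DVAE feature-reduction step: corrupt the input connectivity
    vector with noise and encode it into 5 latent Gaussians (mu, log-variance)."""
    def __init__(self, n_features: int = 30000, latent_dim: int = 5):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 512), nn.ReLU(),
                                 nn.Linear(512, 2 * latent_dim))   # -> mu and logvar
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_features))

    def forward(self, x, noise_std: float = 0.1):
        noisy = x + noise_std * torch.randn_like(x)                # denoising corruption
        mu, logvar = self.enc(noisy).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
        recon = self.dec(z)
        recon_loss = nn.functional.mse_loss(recon, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl, torch.cat([mu, logvar], dim=-1)    # 10 features per subject

if __name__ == "__main__":
    model = DenoisingVAE(n_features=1000)      # smaller input just for the demo
    x = torch.randn(16, 1000)                  # stand-in for connectivity vectors
    loss, latent_features = model(x)
    print(loss.item(), latent_features.shape)  # latent features then go to an SVM classifier
```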

[LG-134] Looking through the mind's eye via multimodal encoder-decoder networks

链接: https://arxiv.org/abs/2410.00047
作者: Arman Afrasiyabi,Erica Busch,Rahul Singh,Dhananjay Bhaskar,Laurent Caplette,Nicholas Turk-Browne,Smita Krishnaswamy
关键词-EN: fMRI, subjects, imagery, mapping, decoding
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:In this work, we explore the decoding of mental imagery from subjects using their fMRI measurements. In order to achieve this decoding, we first created a mapping between a subject’s fMRI signals elicited by the videos the subjects watched. This mapping associates the high dimensional fMRI activation states with visual imagery. Next, we prompted the subjects textually, primarily with emotion labels which had no direct reference to visual objects. Then to decode visual imagery that may have been in a person’s mind’s eye, we align a latent representation of these fMRI measurements with a corresponding video-fMRI based on textual labels given to the videos themselves. This alignment has the effect of overlapping the video fMRI embedding with the text-prompted fMRI embedding, thus allowing us to use our fMRI-to-video mapping to decode. Additionally, we enhance an existing fMRI dataset, initially consisting of data from five subjects, by including recordings from three more subjects gathered by our team. We demonstrate the efficacy of our model on this augmented dataset both in accurately creating a mapping, as well as in plausibly decoding mental imagery.

[LG-135] Mixture of Multicenter Experts in Multimodal Generative AI for Advanced Radiotherapy Target Delineation

链接: https://arxiv.org/abs/2410.00046
作者: Yujin Oh,Sangjoon Park,Xiang Li,Wang Yi,Jonathan Paly,Jason Efstathiou,Annie Chan,Jun Won Kim,Hwa Kyung Byun,Ik Jae Lee,Jaeho Cho,Chan Woo Wee,Peng Shu,Peilong Wang,Nathan Yu,Jason Holmes,Jong Chul Ye,Quanzheng Li,Wei Liu,Woong Sub Koom,Jin Sung Kim,Kyungsang Kim
关键词-EN: regional patient populations, employ diverse philosophies, experts employ diverse, Clinical experts employ, patient care
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 39 pages

点击查看摘要

Abstract:Clinical experts employ diverse philosophies and strategies in patient care, influenced by regional patient populations. However, existing medical artificial intelligence (AI) models are often trained on data distributions that disproportionately reflect highly prevalent patterns, reinforcing biases and overlooking the diverse expertise of clinicians. To overcome this limitation, we introduce the Mixture of Multicenter Experts (MoME) approach. This method strategically integrates specialized expertise from diverse clinical strategies, enhancing the AI model's ability to generalize and adapt across multiple medical centers. The MoME-based multimodal target volume delineation model, trained with few-shot samples including images and clinical notes from each medical center, outperformed baseline methods in prostate cancer radiotherapy target delineation. The advantages of MoME were most pronounced when data characteristics varied across centers or when data availability was limited, demonstrating its potential for broader clinical applications. The MoME framework also enables the deployment of AI-based target volume delineation models in resource-constrained medical facilities by adapting to the specific preferences of each medical center using only a few sample data points, without the need for data sharing between institutions. Expanding the number of multicenter experts within the MoME framework will significantly enhance generalizability, while also improving the usability and adaptability of clinical AI applications in the field of precision radiation oncology.
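
To make the mixture-of-experts idea concrete, the sketch below gates several center-specific expert heads with a learned softmax gate. The layer sizes, number of centers, and plain linear experts are illustrative assumptions; the paper's actual model is a multimodal target-delineation network, not this toy.

```python
# Minimal mixture-of-center-experts sketch: a gate weights per-center expert outputs.
import torch
import torch.nn as nn

class MixtureOfCenterExperts(nn.Module):
    def __init__(self, d_in, d_out, n_centers=3):
        super().__init__()
        # One expert head per medical center, plus a gate over the center experts.
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_centers))
        self.gate = nn.Linear(d_in, n_centers)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                # (batch, n_centers)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_centers, d_out)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # gated combination

x = torch.randn(4, 128)                 # e.g. fused image / clinical-note features (toy)
model = MixtureOfCenterExperts(d_in=128, d_out=64)
print(model(x).shape)                   # torch.Size([4, 64])
```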

[LG-136] Moshi: a speech-text foundation model for real-time dialogue

链接: https://arxiv.org/abs/2410.00037
作者: Alexandre Défossez,Laurent Mazaré,Manu Orsini,Amélie Royer,Patrick Pérez,Hervé Jégou,Edouard Grave,Neil Zeghidour
关键词-EN: speech-text foundation model, speech-text foundation, spoken dialogue, dialogue, introduce Moshi
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning – such as emotion or non-speech sounds – is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this "Inner Monologue" method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at this https URL.

[LG-137] FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation Casing and Context

链接: https://arxiv.org/abs/2410.00035
作者: Anna Povey,Katherine Povey
关键词-EN: Cyrillic and Latin, academic research purposes, Latin alphabets, read speech corpus, research purposes
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 5 Pages, 1 Figure, Preprint of Paper Accepted in ICNLSP 2024

点击查看摘要

Abstract:This paper introduces FeruzaSpeech, a read speech corpus of the Uzbek language, containing transcripts in both Cyrillic and Latin alphabets, freely available for academic research purposes. This corpus includes 60 hours of high-quality recordings from a single native female speaker from Tashkent, Uzbekistan. These recordings consist of short excerpts from a book and BBC News. This paper discusses the improvement in Word Error Rates (WERs) on CommonVoice 16.1's Uzbek data, Uzbek Speech Corpus data, and FeruzaSpeech data upon integrating FeruzaSpeech.
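
As a small illustration of the kind of WER evaluation mentioned above, the snippet below uses the jiwer package (an assumption; any WER implementation works) on made-up transcripts, not actual FeruzaSpeech samples.

```python
# Toy WER computation between a reference transcript and an ASR hypothesis.
from jiwer import wer

reference = "bugun havo juda issiq"    # made-up reference transcript
hypothesis = "bugun havo juda issik"   # made-up ASR output with one substitution
print(f"WER: {wer(reference, hypothesis):.2f}")   # 1 substitution over 4 words -> 0.25
```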

[LG-138] Machine Learning to Detect Anxiety Disorders from Error-Related Negativity and EEG Signals

链接: https://arxiv.org/abs/2410.00028
作者: Ramya Chandrasekar,Md Rakibul Hasan,Shreya Ghosh,Tom Gedeon,Md Zakir Hossain
关键词-EN: common mental health, mental health condition, health condition characterised, excessive worry, fear and apprehension
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anxiety is a common mental health condition characterised by excessive worry, fear and apprehension about everyday situations. Even with significant progress over the past few years, predicting anxiety from electroencephalographic (EEG) signals, specifically using error-related negativity (ERN), still remains challenging. Following the PRISMA protocol, this paper systematically reviews 54 research papers on using EEG and ERN markers for anxiety detection published in the last 10 years (2013 – 2023). Our analysis highlights the wide usage of traditional machine learning, such as support vector machines and random forests, as well as deep learning models, such as convolutional neural networks and recurrent neural networks across different data types. Our analysis reveals that the development of a robust and generic anxiety prediction method still needs to address real-world challenges, such as task-specific setup, feature selection and computational modelling. We conclude this review by offering potential future direction for non-invasive, objective anxiety diagnostics, deployed across diverse populations and anxiety sub-types.

[LG-139] Cross-Lingual News Event Correlation for Stock Market Trend Prediction

链接: https://arxiv.org/abs/2410.00024
作者: Sahar Arshad,Nikhar Azhar,Sana Sajid,Seemab Latif,Rabia Latif
关键词-EN: modern economic landscape, integrating financial services, Language-based Financial Forecasting, Natural Language-based Financial, Financial Technology
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the modern economic landscape, integrating financial services with Financial Technology (FinTech) has become essential, particularly in stock trend analysis. This study addresses the gap in comprehending financial dynamics across diverse global economies by creating a structured financial dataset and proposing a cross-lingual Natural Language-based Financial Forecasting (NLFF) pipeline for comprehensive financial analysis. Utilizing sentiment analysis, Named Entity Recognition (NER), and semantic textual similarity, we conducted an analytical examination of news articles to extract, map, and visualize financial event timelines, uncovering the correlation between news events and stock market trends. Our method demonstrated a meaningful correlation between stock price movements and cross-lingual news sentiments, validated by processing two years of cross-lingual news data on two prominent sectors of the Pakistan Stock Exchange. This study offers significant insights into key events, providing investors with a substantial decision margin through effective visualization and highlighting optimal investment opportunities.
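
A hedged sketch of the sentiment half of such a pipeline is shown below: score headlines with a multilingual sentiment model and correlate the resulting series with stock returns. The model name, toy headlines, and toy returns are assumptions, and the NER and semantic-similarity steps the paper also uses are omitted.

```python
# Toy news-sentiment vs. stock-return correlation.
import numpy as np
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="nlptown/bert-base-multilingual-uncased-sentiment")

headlines = [
    "Cement sector posts record quarterly profits",
    "Regulator warns banks over rising bad loans",
    "Textile exports grow modestly despite energy costs",
]
scores = []
for h in headlines:
    out = sentiment(h)[0]              # label like '4 stars' plus a confidence score
    stars = int(out["label"][0])
    scores.append(stars - 3)           # center around the neutral rating (3 stars)

returns = np.array([0.012, -0.008, 0.003])      # made-up daily returns for illustration
print(np.corrcoef(scores, returns)[0, 1])        # correlation between sentiment and returns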

[LG-140] Self-Tuning Spectral Clustering for Speaker Diarization ICASSP2025

链接: https://arxiv.org/abs/2410.00023
作者: Nikhil Raghav,Avisek Gupta,Md Sahidullah,Swagatam Das
关键词-EN: speaker diarization tasks, grouping speech representations, remains difficult due, matrix remains difficult, affinity matrix remains
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Spectral clustering has proven effective in grouping speech representations for speaker diarization tasks, although post-processing the affinity matrix remains difficult due to the need for careful tuning before constructing the Laplacian. In this study, we present a novel pruning algorithm to create a sparse affinity matrix called spectral clustering on p-neighborhood retained affinity matrix (SC-pNA). Our method improves on node-specific fixed neighbor selection by allowing a variable number of neighbors, eliminating the need for external tuning data as the pruning parameters are derived directly from the affinity matrix. SC-pNA does so by identifying two clusters in every row of the initial affinity matrix, and retains only the top p% similarity scores from the cluster containing larger similarities. Spectral clustering is performed subsequently, with the number of clusters determined as the maximum eigengap. Experimental results on the challenging DIHARD-III dataset highlight the superiority of SC-pNA, which is also computationally more efficient than existing auto-tuning approaches.
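
The row-wise pruning and eigengap steps described above can be sketched roughly as follows. The 2-means split per row, the choice of p, and the toy speaker embeddings are assumptions about details the abstract leaves open; this is not the authors' implementation.

```python
# Rough sketch of SC-pNA-style pruning followed by eigengap-based spectral clustering.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def prune_affinity(A, p=0.2):
    P = np.zeros_like(A)
    for i, row in enumerate(A):
        km = KMeans(n_clusters=2, n_init=10).fit(row.reshape(-1, 1))
        high = np.argmax(km.cluster_centers_.ravel())        # cluster with larger similarities
        idx = np.where(km.labels_ == high)[0]
        keep = idx[np.argsort(row[idx])[::-1][:max(1, int(p * len(idx)))]]
        P[i, keep] = row[keep]                                # keep only top-p% of that cluster
    return np.maximum(P, P.T)                                 # symmetrize the sparse affinity

def eigengap_k(A, max_k=10):
    d = A.sum(axis=1)
    L = np.diag(d) - A                                        # unnormalized graph Laplacian
    w = np.sort(np.linalg.eigvalsh(L))[:max_k + 1]
    return int(np.argmax(np.diff(w))) + 1                     # largest eigengap -> cluster count

# Toy affinity matrix built from random unit-norm "speaker embeddings".
emb = np.random.randn(50, 16)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
A = np.clip(emb @ emb.T, 0, None)

A_pruned = prune_affinity(A, p=0.2)
k = max(2, eigengap_k(A_pruned))
labels = SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A_pruned)
print(k, labels[:10])
```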

[LG-141] Loneliness Forecasting Using Multi-modal Wearable and Mobile Sensing in Everyday Settings

链接: https://arxiv.org/abs/2410.00020
作者: Zhongqi Yang,Iman Azimi,Salar Jafarlou,Sina Labbaf,Brenda Nguyen,Hana Qureshi,Christopher Marcotullio,Jessica L. Borelli,Nikil Dutt,Amir M. Rahmani
关键词-EN: well-being are profound, adverse effects, loneliness, mental well-being, mental health issues
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The adverse effects of loneliness on both physical and mental well-being are profound. Although previous research has utilized mobile sensing techniques to detect mental health issues, few studies have utilized state-of-the-art wearable devices to forecast loneliness and estimate the physiological manifestations of loneliness and its predictive nature. The primary objective of this study is to examine the feasibility of forecasting loneliness by employing wearable devices, such as smart rings and watches, to monitor early physiological indicators of loneliness. Furthermore, smartphones are employed to capture initial behavioral signs of loneliness. To accomplish this, we employed personalized machine learning techniques, leveraging a comprehensive dataset comprising physiological and behavioral information obtained during our study involving the monitoring of college students. Through the development of personalized models, we achieved a notable accuracy of 0.82 and an F-1 score of 0.82 in forecasting loneliness levels seven days in advance. Additionally, the application of Shapley values facilitated model explainability. The wealth of data provided by this study, coupled with the forecasting methodology employed, possesses the potential to augment interventions and facilitate the early identification of loneliness within populations at risk.
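
A small sketch of the explainability step mentioned above: train a classifier on wearable-style features and compute Shapley values for each prediction. The synthetic features, model choice, and thresholds are assumptions, not the study's personalized models.

```python
# Toy classifier on synthetic sensor features, explained with Shapley values.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))               # e.g. heart rate, sleep, steps, screen time... (toy)
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-feature contributions for 5 samples
print(np.array(shap_values).shape)
```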

[LG-142] Enhancing EEG Signal Generation through a Hybrid Approach Integrating Reinforcement Learning and Diffusion Models

链接: https://arxiv.org/abs/2410.00013
作者: Yang An,Yuhao Tong,Weikai Wang,Steven W. Su
关键词-EN: synthesis of Electroencephalogram, present study introduces, EEG signals, EEG, introduces an innovative
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The present study introduces an innovative approach to the synthesis of Electroencephalogram (EEG) signals by integrating diffusion models with reinforcement learning. This integration addresses key challenges associated with traditional EEG data acquisition, including participant burden, privacy concerns, and the financial costs of obtaining high-fidelity clinical data. Our methodology enhances the generation of EEG signals with detailed temporal and spectral features, enriching the authenticity and diversity of synthetic datasets. The uniqueness of our approach lies in its capacity to concurrently model time-domain characteristics, such as waveform morphology, and frequency-domain features, including rhythmic brainwave patterns, within a cohesive generative framework. This is executed through the reinforcement learning model's autonomous selection of parameter update strategies, which steers the diffusion process to accurately reflect the complex dynamics inherent in EEG signals. We validate the efficacy of our approach using both the BCI Competition IV 2a dataset and a proprietary dataset, each collected under stringent experimental conditions. Our results indicate that the method preserves participant privacy by generating synthetic data that lacks biometric identifiers and concurrently improves the efficiency of model training by minimizing reliance on large annotated datasets. This research offers dual contributions: firstly, it advances EEG research by providing a novel tool for data augmentation and the advancement of machine learning algorithms; secondly, it enhances brain-computer interface technologies by offering a robust solution for training models on diverse and representative EEG datasets. Collectively, this study establishes a foundation for future investigations in neurological care and the development of tailored treatment protocols in neurorehabilitation.

[LG-143] PHemoNet: A Multimodal Network for Physiological Signals

链接: https://arxiv.org/abs/2410.00010
作者: Eleonora Lopez,Aurelio Uncini,Danilo Comminiello
关键词-EN: including medical applications, numerous fields, including medical, brain-computer interface, essential across numerous
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: The paper has been accepted at RTSI 2024

点击查看摘要

Abstract:Emotion recognition is essential across numerous fields, including medical applications and brain-computer interface (BCI). Emotional responses include behavioral reactions, such as tone of voice and body movement, and changes in physiological signals, such as the electroencephalogram (EEG). The latter are involuntary, thus they provide a reliable input for identifying emotions, in contrast to the former which individuals can consciously control. These signals reveal true emotional states without intentional alteration, thus increasing the accuracy of emotion recognition models. However, multimodal deep learning methods from physiological signals have not been significantly investigated. In this paper, we introduce PHemoNet, a fully hypercomplex network for multimodal emotion recognition from physiological signals. In detail, the architecture comprises modality-specific encoders and a fusion module. Both encoders and fusion modules are defined in the hypercomplex domain through parameterized hypercomplex multiplications (PHMs) that can capture latent relations between the different dimensions of each modality and between the modalities themselves. The proposed method outperforms current state-of-the-art models on the MAHNOB-HCI dataset in classifying valence and arousal using electroencephalograms (EEGs) and peripheral physiological signals. The code for this work is available at this https URL.
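
The hypercomplex encoders mentioned above rely on parameterized hypercomplex multiplications (PHMs), in which a weight matrix is built as a sum of Kronecker products, cutting parameters by roughly a factor of n. The following is a minimal sketch of such a layer under assumed sizes; it is not the paper's architecture.

```python
# Minimal parameterized hypercomplex multiplication (PHM) linear layer.
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, in_features, out_features, n=4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.n = n
        self.A = nn.Parameter(torch.randn(n, n, n))                         # "algebra" matrices
        self.S = nn.Parameter(torch.randn(n, out_features // n, in_features // n))

    def forward(self, x):
        # Build the full weight as a sum of Kronecker products A_i (x) S_i.
        W = sum(torch.kron(self.A[i], self.S[i]) for i in range(self.n))
        return x @ W.T

layer = PHMLinear(in_features=16, out_features=8, n=4)
print(layer(torch.randn(2, 16)).shape)    # torch.Size([2, 8])
```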

[LG-144] Machine Learning and Econometric Approaches to Fiscal Policies: Understanding Industrial Investment Dynamics in Uruguay (1974-2010)

链接: https://arxiv.org/abs/2410.00002
作者: Diego Vallarino
关键词-EN: paper examines, examines the impact, fiscal incentives, fiscal, industrial investment
类目: General Economics (econ.GN); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:This paper examines the impact of fiscal incentives on industrial investment in Uruguay from 1974 to 2010. Using a mixed-method approach that combines econometric models with machine learning techniques, the study investigates both the short-term and long-term effects of fiscal benefits on industrial investment. The results confirm the significant role of fiscal incentives in driving long-term industrial growth, while also highlighting the importance of a stable macroeconomic environment, public investment, and access to credit. Machine learning models provide additional insights into nonlinear interactions between fiscal benefits and other macroeconomic factors, such as exchange rates, emphasizing the need for tailored fiscal policies. The findings have important policy implications, suggesting that fiscal incentives, when combined with broader economic reforms, can effectively promote industrial development in emerging economies.

[LG-145] Satellite image classification with neural quantum kernels

链接: https://arxiv.org/abs/2409.20356
作者: Pablo Rodriguez-Grasa,Robert Farzan-Rodriguez,Gabriele Novelli,Yue Ban,Mikel Sanz
关键词-EN: term remains elusive, significant theoretical efforts, short term remains, quantum machine learning, remains elusive
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A practical application of quantum machine learning in real-world scenarios in the short term remains elusive, despite significant theoretical efforts. Image classification, a common task for classical models, has been used to benchmark quantum algorithms with simple datasets, but only a few studies have tackled complex real-data classification challenges. In this work, we address such a gap by focusing on the classification of satellite images, a task of particular interest to the earth observation (EO) industry. We first preprocess the selected intricate dataset by reducing its dimensionality. Subsequently, we employ neural quantum kernels (NQKs), embedding quantum kernels (EQKs) constructed from trained quantum neural networks (QNNs), to classify images which include solar panels. We explore both 1-to-n and n-to-n NQKs. In the former, parameters from a single-qubit QNN's training construct an n-qubit EQK achieving a mean test accuracy over 86% with three features. In the latter, we iteratively train an n-qubit QNN to ensure scalability, using the resultant architecture to directly form an n-qubit EQK. In this case, a test accuracy over 88% is obtained for three features and 8 qubits. Additionally, we show that the results are robust against a suboptimal training of the QNN.

信息检索

[IR-0] Conversational Exploratory Search of Scholarly Publications Using Knowledge Graphs

链接: https://arxiv.org/abs/2410.00427
作者: Phillip Schneider,Florian Matthes
关键词-EN: targets concept-based matches, methods primarily depend, recognizing underlying intents, search methods primarily, search targets concept-based
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted to ICNLSP 2024

点击查看摘要

Abstract:Traditional search methods primarily depend on string matches, while semantic search targets concept-based matches by recognizing underlying intents and contextual meanings of search terms. Semantic search is particularly beneficial for discovering scholarly publications where differences in vocabulary between users’ search terms and document content are common, often yielding irrelevant search results. Many scholarly search engines have adopted knowledge graphs to represent semantic relations between authors, publications, and research concepts. However, users may face challenges when navigating these graphical search interfaces due to the complexity and volume of data, which impedes their ability to discover publications effectively. To address this problem, we developed a conversational search system for exploring scholarly publications using a knowledge graph. We outline the methodical approach for designing and implementing the proposed system, detailing its architecture and functional components. To assess the system’s effectiveness, we employed various performance metrics and conducted a human evaluation with 40 participants, demonstrating how the conversational interface compares against a graphical interface with traditional text search. The findings from our evaluation provide practical insights for advancing the design of conversational search systems.
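
For a sense of the kind of structured query a conversational layer might issue over a scholarly knowledge graph, here is a hedged example against a public endpoint. The Wikidata endpoint, property IDs, and entity IDs below are stand-in assumptions for illustration; they are not the system's own graph or schema.

```python
# Example SPARQL query for scholarly articles on an assumed topic, via Wikidata.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?paper ?paperLabel WHERE {
  ?paper wdt:P31 wd:Q13442814 .          # instance of: scholarly article (assumed ID)
  ?paper wdt:P921 wd:Q30642 .            # main subject: natural language processing (assumed ID)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["paperLabel"]["value"])
```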

[IR-1] PN: Transferable Proto-Learning Network towards Few-shot Document-Level Relation Extraction

链接: https://arxiv.org/abs/2410.00412
作者: Yu Zhang,Zhao Kang
关键词-EN: Few-shot document-level relation, document-level relation extraction, relation extraction suffers, Few-shot document-level, poor performance due
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Few shot document-level relation extraction

点击查看摘要

Abstract:Few-shot document-level relation extraction suffers from poor performance due to the challenging cross-domain transferability of NOTA (none-of-the-above) relation representation. In this paper, we introduce a Transferable Proto-Learning Network (TPN) to address the challenging issue. It comprises three core components: Hybrid Encoder hierarchically encodes semantic content of input text combined with attention information to enhance the relation representations. As a plug-and-play module for Out-of-Domain (OOD) Detection, Transferable Proto-Learner computes NOTA prototype through an adaptive learnable block, effectively mitigating NOTA bias across various domains. Dynamic Weighting Calibrator detects relation-specific classification confidence, serving as dynamic weights to calibrate the NOTA-dominant loss function. Finally, to bolster the model’s cross-domain performance, we complement it with virtual adversarial training (VAT). We conduct extensive experimental analyses on FREDo and ReFREDo, demonstrating the superiority of TPN. Compared to state-of-the-art methods, our approach achieves competitive performance with approximately half the parameter size. Data and code are available at this https URL.

[IR-2] ECORS: An Ensembled Clustering Approach to Eradicate The Local And Global Outlier In Collaborative Filtering Recommender System

链接: https://arxiv.org/abs/2410.00408
作者: Mahamudul Hasan
关键词-EN: suggest items based, helping users navigate, Recommender systems, designed to suggest, suggest items
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:Recommender systems are designed to suggest items based on user preferences, helping users navigate the vast amount of information available on the internet. Given the overwhelming content, outlier detection has emerged as a key research area in recommender systems. It involves identifying unusual or suspicious patterns in user behavior. However, existing studies in this field face several challenges, including the limited universality of algorithms, difficulties in selecting users, and a lack of optimization. In this paper, we propose an approach that addresses these challenges by employing various clustering algorithms. Specifically, we utilize a user-user matrix-based clustering technique to detect outliers. By constructing a user-user matrix, we can identify suspicious users in the system. Both local and global outliers are detected to ensure comprehensive analysis. Our experimental results demonstrate that this approach significantly improves the accuracy of outlier detection in recommender systems.
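
An illustrative sketch of the user-user matrix idea follows: build a user similarity matrix from ratings, cluster it, flag the smallest cluster as global outliers, and flag users far from their own cluster centroid as local outliers. The toy ratings, cluster count, and thresholds are assumptions, not the paper's exact procedure.

```python
# Toy global/local outlier detection over a user-user similarity matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(100, 40)).astype(float)   # users x items (toy ratings)
user_sim = cosine_similarity(ratings)                        # user-user matrix

km = KMeans(n_clusters=3, n_init=10).fit(user_sim)
labels, centers = km.labels_, km.cluster_centers_

# Global outliers: members of the smallest cluster.
global_outliers = np.where(labels == np.argmin(np.bincount(labels)))[0]

# Local outliers: users unusually far from their own cluster centroid.
dist = np.linalg.norm(user_sim - centers[labels], axis=1)
local_outliers = np.where(dist > dist.mean() + 2 * dist.std())[0]

print("global:", global_outliers[:5], "local:", local_outliers[:5])
```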

[IR-3] Winning Solution For Meta KDD Cup 24

链接: https://arxiv.org/abs/2410.00005
作者: Yikuan Xia,Jiazun Chen,Jun Gao
关键词-EN: Meta KDD Cup, KDD Cup, Meta KDD, paper describes, describes the winning
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper describes the winning solutions of all tasks in Meta KDD Cup 24 from the db3 team. The challenge is to build a RAG system from web sources and knowledge graphs. We are given multiple sources for each query to help us answer the question. The CRAG challenge involves three tasks: (1) condensing information from web pages into accurate answers, (2) integrating structured data from mock knowledge graphs, and (3) selecting and integrating critical data from extensive web pages and APIs to reflect real-world retrieval challenges. Our solution for Task #1 is a framework of web or open-data retrieval and answering. The large language model (LLM) is tuned for better RAG performance and less hallucination. The solutions for Task #2 and Task #3 are based on a regularized API set for domain questions and an API generation method using the tuned LLM. Our knowledge graph API interface extracts directly relevant information to help LLMs answer correctly. Our solution achieved 1st place on all three tasks, with scores of 28.4%, 42.7%, and 47.8%, respectively.

[IR-4] Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization

链接: https://arxiv.org/abs/2410.00004
作者: Gentiana Rashiti,Geethan Karunaratne,Mrinmaya Sachan,Abu Sebastian,Abbas Rahimi
关键词-EN: language modeling capabilities, retrieval augmented generation, improve language modeling, augmented generation, trillions of entries
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Published as a conference paper at European Conference on Artificial Intelligence 2024

点击查看摘要

Abstract:Retrieval augmented generation (RAG) systems such as Retro have been shown to improve language modeling capabilities and reduce toxicity and hallucinations by retrieving from a database of non-parametric memory containing trillions of entries. We introduce Retro-li, which shows that retrieval can also help with a small-scale database, but demands more accurate and better neighbors when searching in a smaller, hence sparser, non-parametric memory. This can be met by using a proper semantic similarity search. We further propose adding a regularization to the non-parametric memory for the first time: it significantly reduces perplexity when the neighbor search operations are noisy during inference, and it improves generalization when a domain shift occurs. We also show that Retro-li's non-parametric memory can potentially be implemented on analog in-memory computing hardware, exhibiting O(1) search time while causing noise in retrieving neighbors, with minimal (1%) performance loss. Our code is available at: this https URL.
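
A compact sketch of the noisy nearest-neighbor lookup that such a small-scale non-parametric memory relies on is shown below; the embedding dimension, database size, and noise level are assumptions mimicking the analog-hardware noise discussed above.

```python
# Toy noisy similarity search over a small non-parametric memory.
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 384)).astype("float32")        # stand-in memory entries
db /= np.linalg.norm(db, axis=1, keepdims=True)

def retrieve(query, k=5, noise_std=0.05):
    q = query / np.linalg.norm(query)
    scores = db @ q + rng.normal(0.0, noise_std, size=len(db))   # noisy cosine similarities
    return np.argsort(scores)[::-1][:k]                          # indices of top-k neighbors

query = rng.standard_normal(384).astype("float32")
print(retrieve(query, k=5))
```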

[IR-5] AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

链接: https://arxiv.org/abs/2406.19271
作者: Praneeth Vadlapati
关键词-EN: Large Language Models, reliable Large Language, Large Language, Language Models, reliable Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Initial version

点击查看摘要

Abstract:Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enabling automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.
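
A simplified sketch of the filtering step is shown below, using a zero-shot classifier as a stand-in for the trusted models the paper relies on. The model name, label set, and toy snippets are assumptions, not the system's actual components.

```python
# Toy web-text filter: keep snippets a zero-shot classifier labels as clean.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_labels = ["spam or unsafe", "clean informative text"]

web_samples = [
    "Click here to win a free prize now!!!",
    "The Transformer architecture relies on self-attention over token embeddings.",
]

kept = []
for text in web_samples:
    result = classifier(text, candidate_labels)
    if result["labels"][0] == "clean informative text":   # top-ranked label decides
        kept.append(text)
print(kept)
```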

附件下载

点击下载今日全部论文列表