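This post lists the latest papers fetched from arXiv.org on 2024-08-22. It updates automatically and is organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 10:30 every morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments. The email is also sent automatically at around 10:30 each day.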

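Overview (2024-08-22)

401 papers were updated today, including:

  • Natural language processing: 58 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 108 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 110 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 99 papers (Machine Learning (cs.LG))

Natural Language Processing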

[NLP-0] Great Memory, Shallow Reasoning: Limits of kNN-LMs

Link: https://arxiv.org/abs/2408.11815
Authors: Shangyi Geng, Wenting Zhao, Alexander M Rush
Keywords: downstream NLP benchmarks, nearest neighbor language, neighbor language models, NLP benchmarks, demonstrated strong performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: k-nearest neighbor language models (kNN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a kNN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate kNN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that kNN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, kNN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at this https URL.
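
For readers unfamiliar with the setup, kNN-LMs are commonly described as mixing the base model's next-token distribution with a distribution built from retrieved neighbors. The sketch below assumes that standard interpolation; the toy distances, neighbor tokens, `temperature`, and mixing weight `lam` are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors, temperature=1.0):
    """Turn (distance, next_token) pairs into a distribution
    using a softmax over negative distances."""
    weights = [math.exp(-dist / temperature) for dist, _ in neighbors]
    total = sum(weights)
    p_knn = defaultdict(float)
    for weight, (_, token) in zip(weights, neighbors):
        p_knn[token] += weight / total  # neighbors sharing a token pool their mass
    return dict(p_knn)

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_LM."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Hypothetical base-model distribution and retrieved neighbors.
p_lm = {"Paris": 0.5, "London": 0.3, "Rome": 0.2}
neighbors = [(0.1, "Paris"), (0.4, "Paris"), (1.2, "Rome")]
p = interpolate(p_lm, knn_distribution(neighbors))
```

Because the retrieved neighbors can only boost tokens that already appear in the datastore, this kind of mixing helps recall but does not by itself combine several retrieved facts, which is consistent with the limitation the abstract describes.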

[NLP-1] PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain

Link: https://arxiv.org/abs/2408.11800
Authors: Rounak Meyur, Hung Phan, Sridevi Wagle, Jan Strube, Mahantesh Halappanavar, Sameera Horawalavithana, Anurag Acharya, Sai Munikoti
Keywords: Natural Language Processing, Retrieval Augmented Generation, Retrieval Augmented, rapidly evolving landscape, leveraging information retrieved
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the rapidly evolving landscape of Natural Language Processing (NLP) and text generation, the emergence of Retrieval Augmented Generation (RAG) presents a promising avenue for improving the quality and reliability of generated text by leveraging information retrieved from user specified database. Benchmarking is essential to evaluate and compare the performance of the different RAG configurations in terms of retriever and generator, providing insights into their effectiveness, scalability, and suitability for the specific domain and applications. In this paper, we present a comprehensive framework to generate a domain relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI Large Language Model (LLM) teaming. As a case study, we demonstrate the framework by introducing PermitQA, a first-of-its-kind benchmark on the wind siting and permitting domain which comprises of multiple scientific documents/reports related to environmental impact of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity level. We also demonstrate the performance of different models on our benchmark.

[NLP-2] Practical token pruning for foundation models in few-shot conversational virtual assistant systems

Link: https://arxiv.org/abs/2408.11799
Authors: Haode Qi, Cheng Qian, Jian Ni, Pratyush Singh, Reza Fazeli, Gengyu Wang, Zhongzheng Shu, Eric Wayne, Juergen Bross
Keywords: enterprise Virtual Assistant, Virtual Assistant, enterprise Virtual, intent classification, crucial component
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 figures

Abstract:In an enterprise Virtual Assistant (VA) system, intent classification is the crucial component that determines how a user input is handled based on what the user wants. The VA system is expected to be a cost-efficient SaaS service with low training and inference time while achieving high accuracy even with a small number of training samples. We pretrain a transformer-based sentence embedding model with a contrastive learning objective and leverage the embedding of the model as features when training intent classification models. Our approach achieves the state-of-the-art results for few-shot scenarios and performs better than other commercial solutions on popular intent classification benchmarks. However, generating features via a transformer-based model increases the inference time, especially for longer user inputs, due to the quadratic runtime of the transformer’s attention mechanism. On top of model distillation, we introduce a practical multi-task adaptation approach that configures dynamic token pruning without the need for task-specific training for intent classification. We demonstrate that this approach improves the inference speed of popular sentence transformer models without affecting model performance.
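
To make the idea of token pruning concrete, here is a minimal sketch that keeps only the highest-scoring tokens so that later layers attend over a shorter sequence. The importance scores, `keep_ratio`, and the function name are hypothetical; the abstract does not describe how the paper scores tokens or configures pruning.

```python
import math

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the k highest scores, then restored to sentence order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = ["please", "reset", "my", "account", "password", "now"]
scores = [0.05, 0.30, 0.02, 0.25, 0.35, 0.03]  # hypothetical importance scores
kept = prune_tokens(tokens, scores, keep_ratio=0.5)

# Self-attention compares every pair of tokens, so its cost grows with n * n.
cost_ratio = (len(kept) ** 2) / (len(tokens) ** 2)
```

In this toy case, halving the sequence reduces the number of attention pairs to a quarter, which is why pruning matters most for longer user inputs.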

[NLP-3] LLM Pruning and Distillation in Practice: The Minitron Approach

Link: https://arxiv.org/abs/2408.11796
Authors: Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
Keywords: present a comprehensive, comprehensive report, report on compressing, Evaluation Harness, Mistral NeMo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
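
As background on the distillation step, a common way to train a pruned student is to match its output distribution to the teacher's. The sketch below shows a generic logit-distillation loss (KL divergence from teacher to student for a single token position). It is an assumption that this general form is relevant here; the abstract does not state the exact loss used, and the logits are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 1.0, 0.1]          # hypothetical teacher logits
student_close = [2.0, 1.0, 0.1]    # identical distribution
student_far = [0.5, 0.5, 0.5]      # uniform distribution

loss_close = kd_loss(teacher, student_close)
loss_far = kd_loss(teacher, student_far)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.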

[NLP-4] DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

Link: https://arxiv.org/abs/2408.11788
Authors: Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawendé F. Bissyandé, Saad Ezzini
Keywords: Current video generation, generation models excel, realistic clips, Key Frames Iteration, Frames Iteration Design
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 13 pages, 8 figures

Abstract: Current video generation models excel at creating short, realistic clips, but struggle with longer, multi-scene videos. We introduce DreamFactory, an LLM-based framework that tackles this challenge. DreamFactory leverages multi-agent collaboration principles and a Key Frames Iteration Design Method to ensure consistency and style across long videos. It utilizes Chain of Thought (COT) to address uncertainties inherent in large language models. DreamFactory generates long, stylistically coherent, and complex videos. Evaluating these long-form videos presents a challenge. We propose novel metrics such as Cross-Scene Face Distance Score and Cross-Scene Style Consistency Score. To further research in this area, we contribute the Multi-Scene Videos Dataset containing over 150 human-rated videos.

[NLP-5] Personality Alignment of Large Language Models

Link: https://arxiv.org/abs/2408.11779
Authors: Minjun Zhu, Linyi Yang, Yue Zhang
Keywords: large language models, aligning large language, reflect general human, language models, typically aim
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Current methods for aligning large language models (LLMs) typically aim to reflect general human values and behaviors, but they often fail to capture the unique characteristics and preferences of individual users. To address this gap, we introduce the concept of Personality Alignment. This approach tailors LLMs’ responses and decisions to match the specific preferences of individual users or closely related groups. Inspired by psychometrics, we created the Personality Alignment with Personality Inventories (PAPI) dataset, which includes data from 300,000 real subjects, each providing behavioral preferences based on the Big Five Personality Factors. This dataset allows us to quantitatively evaluate the extent to which LLMs can align with each subject’s behavioral patterns. Recognizing the challenges of personality alignment, such as limited personal data, diverse preferences, and scalability requirements, we developed an activation intervention optimization method. This method enhances LLMs’ ability to efficiently align with individual behavioral preferences using minimal data and computational resources. Remarkably, our method, PAS, achieves superior performance while requiring only 1/5 of the optimization time compared to DPO, offering practical value for personality alignment. Our work paves the way for future AI systems to make decisions and reason in truly personalized ways, enhancing the relevance and meaning of AI interactions for each user and advancing human-centered artificial intelligence. The code has been released at this https URL.

[NLP-6] Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards

Link: https://arxiv.org/abs/2408.11775
Authors: Omar Erak, Nouf Alabbasi, Omar Alhussein, Ismail Lotfi, Amr Hussein, Sami Muhaidat, Merouane Debbah
Keywords: Recent studies show, studies show, show that large, large language models, language models
Categories: Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)
Comments: submitted to Proc. IEEE Globecom

Abstract:Recent studies show that large language models (LLMs) struggle with technical standards in telecommunications. We propose a fine-tuned retrieval-augmented generation (RAG) system based on the Phi-2 small language model (SLM) to serve as an oracle for communication networks. Our developed system leverages forward-looking semantic chunking to adaptively determine parsing breakpoints based on embedding similarity, enabling effective processing of diverse document formats. To handle the challenge of multiple similar contexts in technical standards, we employ a re-ranking algorithm to prioritize the most relevant retrieved chunks. Recognizing the limitations of Phi-2’s small context window, we implement a recent technique, namely SelfExtend, to expand the context window during inference, which not only boosts the performance but also can accommodate a wider range of user queries and design requirements from customers to specialized technicians. For fine-tuning, we utilize the low-rank adaptation (LoRA) technique to enhance computational efficiency during training and enable effective fine-tuning on small datasets. Our comprehensive experiments demonstrate substantial improvements over existing question-answering approaches in the telecom domain, achieving performance that exceeds larger language models such as GPT-4 (which is about 880 times larger in size). This work presents a novel approach to leveraging SLMs for communication networks, offering a balance of efficiency and performance. This work can serve as a foundation towards agentic language models for networks.
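
The abstract mentions choosing chunk breakpoints from embedding similarity. A minimal sketch of that general idea follows: start a new chunk whenever consecutive sentence embeddings become dissimilar. The toy two-dimensional embeddings, the `threshold` value, and the function names are hypothetical, and the paper's "forward-looking" variant may differ from this simple adjacent-pair rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Group consecutive sentences, breaking where adjacent similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])      # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)    # same topic: extend the current chunk
    return chunks

sentences = ["s1", "s2", "s3"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical sentence embeddings
chunks = semantic_chunks(sentences, embeddings)
```

In a real pipeline the embeddings would come from a sentence-embedding model, and the retrieved chunks would then be re-ranked before being passed to the generator.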

[NLP-7] Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks

Link: https://arxiv.org/abs/2408.11749
Authors: Yiyi Chen, Russa Biswas, Heather Lent, Johannes Bjerva
Keywords: Large Language Models, Large Language, susceptible to malicious, malicious influence, influence by cyber
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 11 pages, 4 figures, 7 tables

Abstract:Large Language Models (LLMs) are susceptible to malicious influence by cyber attackers through intrusions such as adversarial, backdoor, and embedding inversion attacks. In response, the burgeoning field of LLM Security aims to study and defend against such threats. Thus far, the majority of works in this area have focused on monolingual English models, however, emerging research suggests that multilingual LLMs may be more vulnerable to various attacks than their monolingual counterparts. While previous work has investigated embedding inversion over a small subset of European languages, it is challenging to extrapolate these findings to languages from different linguistic families and with differing scripts. To this end, we explore the security of multilingual LLMs in the context of embedding inversion attacks and investigate cross-lingual and cross-script inversion across 20 languages, spanning over 8 language families and 12 scripts. Our findings indicate that languages written in Arabic script and Cyrillic script are particularly vulnerable to embedding inversion, as are languages within the Indo-Aryan language family. We further observe that inversion models tend to suffer from language confusion, sometimes greatly reducing the efficacy of an attack. Accordingly, we systematically explore this bottleneck for inversion models, uncovering predictable patterns which could be leveraged by attackers. Ultimately, this study aims to further the field’s understanding of the outstanding security vulnerabilities facing multilingual LLMs and raise awareness for the languages most at risk of negative impact from these attacks.

[NLP-8] FocusLLM: Scaling LLM's Context by Parallel Decoding

Link: https://arxiv.org/abs/2408.11745
Authors: Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang
Keywords: Empowering LLMs, context, long context lengths, Empowering, context length
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Empowering LLMs with the ability to utilize useful information from a long context is crucial for many downstream applications. However, achieving long context lengths with the conventional transformer architecture requires substantial training and inference resources. In this paper, we present FocusLLM, a framework designed to extend the context length of any decoder-only LLM, enabling the model to focus on relevant information from very long sequences. FocusLLM processes long text inputs by dividing them into chunks based on the model’s original context length to alleviate the issue of attention distraction. Then, it appends the local context to each chunk as a prompt to extract essential information from each chunk based on a novel parallel decoding mechanism, and ultimately integrates the extracted information into the local context. FocusLLM stands out for great training efficiency and versatility: trained with an 8K input length with much less training cost than previous methods, FocusLLM exhibits superior performance across downstream long-context tasks and maintains strong language modeling ability when handling extensive long texts, even up to 400K tokens. Our code is available at this https URL.
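
The chunk-and-append step described in the abstract can be sketched as follows: the tail of the input is treated as the local context, the rest is split into chunks, and the local context is appended to every chunk so each can be processed independently. Sizing each chunk so that chunk plus local context fits within `context_len` is an assumption made for this illustration, as are the function and parameter names; the parallel decoding and the final integration of extracted information are not modeled here.

```python
def build_chunk_prompts(token_ids, context_len, local_len):
    """Split a long token sequence into chunks and append the local context to each."""
    if not 0 < local_len < context_len:
        raise ValueError("local_len must be between 1 and context_len - 1")
    if len(token_ids) <= local_len:
        return [list(token_ids)]  # nothing to chunk: the input is all local context
    memory, local = token_ids[:-local_len], token_ids[-local_len:]
    chunk_len = context_len - local_len  # leave room for the appended local context
    chunks = [memory[i:i + chunk_len] for i in range(0, len(memory), chunk_len)]
    return [chunk + local for chunk in chunks]

# Hypothetical example: 10 tokens, a 5-token window, 2 tokens of local context.
prompts = build_chunk_prompts(list(range(10)), context_len=5, local_len=2)
```

Each resulting prompt fits the model's original window, so the chunks can in principle be processed in parallel.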

[NLP-9] Efficient Detection of Toxic Prompts in Large Language Models

Link: https://arxiv.org/abs/2408.11727
Authors: Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
Keywords: Large language models, advanced natural language, automated content generation, significantly advanced natural, Large language
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

Abstract:Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector’s processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
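
For reference, the two headline metrics in the abstract (accuracy and false positive rate) are computed from a confusion matrix as sketched below. The counts are hypothetical and are not the paper's data; they are chosen only to show how the two numbers are defined.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all prompts that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp, tn):
    """Fraction of benign prompts that were wrongly flagged as toxic."""
    return fp / (fp + tn)

# Hypothetical confusion-matrix counts for 1,000 prompts (500 toxic, 500 benign).
tp, tn, fp, fn = 480, 490, 10, 20
acc = accuracy(tp, tn, fp, fn)
fpr = false_positive_rate(fp, tn)
```

A low false positive rate matters for a deployed filter because every false positive is a legitimate user prompt that gets blocked.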

[NLP-10] Xinyu: An Efficient LLM-based System for Commentary Generation

Link: https://arxiv.org/abs/2408.11609
Authors: Yiquan Wu, Bo Tang, Chenyang Xi, Yu Yu, Pengyu Wang, Yifei Liu, Kun Kuang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Jie Hu, Peng Cheng, Zhonghao Wang, Yi Wang, Yi Luo, Mingchuan Yang
Keywords: presenting diverse arguments, deep understanding, presenting diverse, Commentary, requirements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Commentary provides readers with a deep understanding of events by presenting diverse arguments and evidence. However, creating commentary is a time-consuming task, even for skilled commentators. Large language models (LLMs) have simplified the process of natural language generation, but their direct application in commentary creation still faces challenges due to unique task requirements. These requirements can be categorized into two levels: 1) fundamental requirements, which include creating well-structured and logically consistent narratives, and 2) advanced requirements, which involve generating quality arguments and providing convincing evidence. In this paper, we introduce Xinyu, an efficient LLM-based system designed to assist commentators in generating Chinese commentaries. To meet the fundamental requirements, we deconstruct the generation process into sequential steps, proposing targeted strategies and supervised fine-tuning (SFT) for each step. To address the advanced requirements, we present an argument ranking model for arguments and establish a comprehensive evidence database that includes up-to-date events and classic books, thereby strengthening the substantiation of the evidence with retrieval augmented generation (RAG) technology. To evaluate the generated commentaries more fairly, corresponding to the two-level requirements, we introduce a comprehensive evaluation metric that considers five distinct perspectives in commentary generation. Our experiments confirm the effectiveness of our proposed system. We also observe a significant increase in the efficiency of commentators in real-world scenarios, with the average time spent on creating a commentary dropping from 4 hours to 20 minutes. Importantly, such an increase in efficiency does not compromise the quality of the commentaries.

[NLP-11] Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning

Link: https://arxiv.org/abs/2408.11599
Authors: Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, Aimin Zhou
Keywords: comprehend dialogue contexts, Empathetic response generation, generation endows agents, response generation endows, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Empathetic response generation endows agents with the capability to comprehend dialogue contexts and react to expressed emotions. Previous works predominantly focus on leveraging the speaker’s emotional labels, but ignore the importance of emotion cause reasoning in empathetic response generation, which hinders the model’s capacity for further affective understanding and cognitive inference. In this paper, we propose a cause-aware empathetic generation approach by integrating emotions and causes through a well-designed Chain-of-Thought (CoT) prompt on Large Language Models (LLMs). Our approach can greatly promote LLMs’ performance of empathy by instruction tuning and enhancing the role awareness of an empathetic listener in the prompt. Additionally, we propose to incorporate cause-oriented external knowledge from COMET into the prompt, which improves the diversity of generation and alleviates conflicts between internal and external knowledge at the same time. Experimental results on the benchmark dataset demonstrate that our approach on LLaMA-7b achieves state-of-the-art performance in both automatic and human evaluations.

[NLP-12] Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Link: https://arxiv.org/abs/2408.11587
Authors: Ziqiang Li, Yueqi Zeng, Pengfei Xia, Lei Liu, Zhangjie Fu, Bin Li
Keywords: natural language processing, increased significantly, burgeoning advancements, field of natural, backdoor attacks
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Under Review

Abstract:With the burgeoning advancements in the field of natural language processing (NLP), the demand for training data has increased significantly. To save costs, it has become common for users and businesses to outsource the labor-intensive task of data collection to third-party entities. Unfortunately, recent research has unveiled the inherent risk associated with this practice, particularly in exposing NLP systems to potential backdoor attacks. Specifically, these attacks enable malicious control over the behavior of a trained model by poisoning a small portion of the training data. Unlike backdoor attacks in computer vision, textual backdoor attacks impose stringent requirements for attack stealthiness. However, existing attack methods meet significant trade-off between effectiveness and stealthiness, largely due to the high information entropy inherent in textual data. In this paper, we introduce the Efficient and Stealthy Textual backdoor attack method, EST-Bad, leveraging Large Language Models (LLMs). Our EST-Bad encompasses three core strategies: optimizing the inherent flaw of models as the trigger, stealthily injecting triggers with LLMs, and meticulously selecting the most impactful samples for backdoor injection. Through the integration of these techniques, EST-Bad demonstrates an efficient achievement of competitive attack performance while maintaining superior stealthiness compared to prior methods across various text classifier datasets.

[NLP-13] Drama Engine: A Framework for Narrative Agents

Link: https://arxiv.org/abs/2408.11574
Authors: Martin Pichlmair, Riddhi Raj, Charlene Putney
Keywords: technical report presents, large language models, language models designed, Drama Engine, narrative purposes
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 2 tables

Abstract:This technical report presents the Drama Engine, a novel framework for agentic interaction with large language models designed for narrative purposes. The framework adapts multi-agent system principles to create dynamic, context-aware companions that can develop over time and interact with users and each other. Key features include multi-agent workflows with delegation, dynamic prompt assembly, and model-agnostic design. The Drama Engine introduces unique elements such as companion development, mood systems, and automatic context summarising. It is implemented in TypeScript. The framework’s applications include multi-agent chats and virtual co-workers for creative writing. The paper discusses the system’s architecture, prompt assembly process, delegation mechanisms, and moderation techniques, as well as potential ethical considerations and future extensions.

[NLP-14] Differentiating Choices via Commonality for Multiple-Choice Question Answering (ECAI 2024)

Link: https://arxiv.org/abs/2408.11554
Authors: Wenqing Deng, Zhe Wang, Kewen Wang, Shirui Pan, Xiaowang Zhang, Zhiyong Feng
Keywords: Multiple-choice question answering, Multiple-choice question, semantically similar, choices, MCQA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, accepted to ECAI 2024

Abstract:Multiple-choice question answering (MCQA) becomes particularly challenging when all choices are relevant to the question and are semantically similar. Yet this setting of MCQA can potentially provide valuable clues for choosing the right answer. Existing models often rank each choice separately, overlooking the context provided by other choices. Specifically, they fail to leverage the semantic commonalities and nuances among the choices for reasoning. In this paper, we propose a novel MCQA model by differentiating choices through identifying and eliminating their commonality, called DCQA. Our model captures token-level attention of each choice to the question, and separates tokens of the question attended to by all the choices (i.e., commonalities) from those by individual choices (i.e., nuances). Using the nuances as refined contexts for the choices, our model can effectively differentiate choices with subtle differences and provide justifications for choosing the correct answer. We conduct comprehensive experiments across five commonly used MCQA benchmarks, demonstrating that DCQA consistently outperforms baseline models. Furthermore, our case study illustrates the effectiveness of the approach in directing the attention of the model to more differentiating features.
摘要:当所有选项都与问题相关并在语义上相似时,多项选择问答(MCQA)变得特别具有挑战性。然而,MCQA的这种设置也可能为选择正确答案提供有价值的线索。现有模型通常对每个选项单独排序,而忽略了其他选项提供的上下文。具体地说,它们未能利用选项之间的语义共性和细微差别进行推理。在本文中,我们提出了一种新的MCQA模型DCQA,该模型通过识别并消除选项的共性来区分选项。我们的模型捕获每个选项对问题的词元级注意力,并将所有选项都关注的问题词元(即共性)与仅被个别选项关注的问题词元(即细微差别)分开。以细微差别作为选项的精炼上下文,我们的模型可以有效区分差异细微的选项,并为选择正确答案提供依据。我们在五个常用的MCQA基准上进行了全面的实验,证明DCQA始终优于基线模型。此外,我们的案例研究表明,该方法能有效地将模型的注意力引导到更具区分性的特征上。
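
The commonality/nuance split described in the abstract can be illustrated with a small sketch. This is not the DCQA implementation: the function name `split_commonality`, the fixed `threshold`, and the toy attention weights are all assumptions made for illustration. It only shows the set logic of separating question tokens attended to by every choice from those attended to by individual choices.

```python
from typing import Dict, List, Set, Tuple


def split_commonality(
    attention: Dict[str, List[float]],
    threshold: float = 0.2,
) -> Tuple[Set[int], Dict[str, Set[int]]]:
    """Split question-token indices into commonalities and per-choice nuances.

    attention maps each choice to one weight per question token (hypothetical
    input; the paper derives these from token-level attention).
    """
    # Token positions each choice attends to strongly enough.
    attended = {
        choice: {i for i, w in enumerate(weights) if w >= threshold}
        for choice, weights in attention.items()
    }
    # Commonality: positions attended to by every choice.
    common = set.intersection(*attended.values()) if attended else set()
    # Nuance: what remains for each choice once the commonality is removed.
    nuances = {choice: idx - common for choice, idx in attended.items()}
    return common, nuances


question = ["which", "metal", "is", "liquid", "at", "room", "temperature"]
toy_attention = {
    "mercury": [0.1, 0.5, 0.0, 0.6, 0.0, 0.1, 0.3],
    "iron": [0.1, 0.6, 0.0, 0.1, 0.0, 0.1, 0.1],
}
common, nuances = split_commonality(toy_attention)
print([question[i] for i in sorted(common)])
print({c: [question[i] for i in sorted(idx)] for c, idx in nuances.items()})
```

In the toy input, both choices attend to "metal", so only the remaining tokens can help tell the choices apart.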

[NLP-15] Memorization In In-Context Learning
[NLP-15] 上下文学习中的记忆

链接: https://arxiv.org/abs/2408.11546
作者: Shahriar Golchin,Mihai Surdeanu,Steven Bethard,Eduardo Blanco,Ellen Riloff
关键词-EN: large language models, In-context learning, ICL, language models, strategy for improving
关键词-ZH: 大型语言模型、上下文学习、ICL、语言模型、改进策略
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: v1

Abstract:In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind these performance improvements remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers a hidden phenomenon – memorization – at the core of ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?
摘要:上下文学习(ICL)已被证明是一种无需额外训练即可提高大型语言模型(LLM)性能的有效策略。然而,这些性能提升背后的确切机制仍不清楚。这项研究首次展示了ICL如何使模型记忆的训练数据浮现出来,并探索了这种记忆与各种ICL设置(零样本、少样本和多样本)下性能之间的相关性。我们最值得注意的发现包括:(1)在大多数情况下,与零样本学习相比,ICL显著地使记忆浮现;(2)不带标签的示例是使记忆浮现的最有效因素;(3)当少样本设置下浮现的记忆达到较高水平(约40%)时,ICL会提高性能;(4)当ICL优于零样本学习时,其性能与记忆之间存在非常强的相关性。总体而言,我们的研究揭示了隐藏在ICL核心的一个现象,即记忆,并提出了一个重要问题:LLM在多大程度上真正从ICL的示例中泛化,其成功又有多少归功于记忆?

[NLP-16] Imagining from Images with an AI Storytelling Tool
[NLP-16] 使用人工智能讲故事工具从图像中想象

链接: https://arxiv.org/abs/2408.11517
作者: Edirlei Soares de Lima,Marco A. Casanova,Antonio L. Furtado
关键词-EN: time immemorial tradition, analyzing single images, Narrative Art, sequences is presented, analyzing single
关键词-ZH: 自古以来的传统,分析单一图像,叙事艺术,序列呈现,分析单一
类目: Computation and Language (cs.CL)
备注:

Abstract:A method for generating narratives by analyzing single images or image sequences is presented, inspired by the time immemorial tradition of Narrative Art. The proposed method explores the multimodal capabilities of GPT-4o to interpret visual content and create engaging stories, which are illustrated by a Stable Diffusion XL model. The method is supported by a fully implemented tool, called ImageTeller, which accepts images from diverse sources as input. Users can guide the narrative’s development according to the conventions of fundamental genres - such as Comedy, Romance, Tragedy, Satire or Mystery -, opt to generate data-driven stories, or to leave the prototype free to decide how to handle the narrative structure. User interaction is provided along the generation process, allowing the user to request alternative chapters or illustrations, and even reject and restart the story generation based on the same input. Additionally, users can attach captions to the input images, influencing the system’s interpretation of the visual content. Examples of generated stories are provided, along with details on how to access the prototype.
摘要:受叙事艺术源远流长的传统启发,本文提出了一种通过分析单幅图像或图像序列来生成叙事的方法。该方法利用GPT-4o的多模态能力来解读视觉内容并创作引人入胜的故事,故事插图由Stable Diffusion XL模型生成。该方法由一个完整实现的工具ImageTeller支持,该工具接受来自不同来源的图像作为输入。用户可以依照喜剧、浪漫、悲剧、讽刺或悬疑等基本体裁的惯例来引导叙事的发展,可以选择生成数据驱动的故事,也可以让原型自行决定如何处理叙事结构。生成过程中全程提供用户交互,用户可以请求替换章节或插图,甚至可以拒绝当前结果并基于相同输入重新开始生成故事。此外,用户可以为输入图像添加说明文字,从而影响系统对视觉内容的解读。文中提供了生成故事的示例,以及如何访问原型的详细信息。

[NLP-17] IKUN for WMT24 General MT Task: LLMs Are here for Multilingual Machine Translation
[NLP-17] IKUN for WMT24通用机器翻译任务:LLM已可用于多语言机器翻译

链接: https://arxiv.org/abs/2408.11512
作者: Baohao Liao,Christian Herold,Shahram Khadivi,Christof Monz
关键词-EN: IKUN, paper introduces, systems, general machine translation, machine translation task
关键词-ZH: IKUN,论文介绍,系统,通用机器翻译,机器翻译任务
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure, 3 tables

Abstract:This paper introduces two multilingual systems, IKUN and IKUN-C, developed for the general machine translation task in WMT24. IKUN and IKUN-C represent an open system and a constrained system, respectively, built on Llama-3-8b and Mistral-7B-v0.3. Both systems are designed to handle all 11 language directions using a single model. According to automatic evaluation metrics, IKUN-C achieved 6 first-place and 3 second-place finishes among all constrained systems, while IKUN secured 1 first-place and 2 second-place finishes across both open and constrained systems. These encouraging results suggest that large language models (LLMs) are nearing the level of proficiency required for effective multilingual machine translation. The systems are based on a two-stage approach: first, continuous pre-training on monolingual data in 10 languages, followed by fine-tuning on high-quality parallel data for 11 language directions. The primary difference between IKUN and IKUN-C lies in their monolingual pre-training strategy. IKUN-C is pre-trained using constrained monolingual data, whereas IKUN leverages monolingual data from the OSCAR dataset. In the second phase, both systems are fine-tuned on parallel data sourced from NTREX, Flores, and WMT16-23 for all 11 language pairs.
摘要:本文介绍了为WMT24通用机器翻译任务开发的两个多语言系统:IKUN和IKUN-C。IKUN和IKUN-C分别是开放系统和受限系统,分别基于Llama-3-8b和Mistral-7B-v0.3构建。这两个系统都设计为使用单一模型处理全部11个语言方向。根据自动评价指标,IKUN-C在所有受限系统中获得了6个第一名和3个第二名,而IKUN在开放和受限系统中获得了1个第一名和2个第二名。这些令人鼓舞的结果表明,大型语言模型(LLM)正在接近有效的多语言机器翻译所需的熟练程度。这些系统采用两阶段方法:首先在10种语言的单语数据上进行持续预训练,然后在11个语言方向的高质量平行数据上进行微调。IKUN和IKUN-C的主要区别在于单语预训练策略:IKUN-C使用受限的单语数据进行预训练,而IKUN则利用OSCAR数据集中的单语数据。在第二阶段,两个系统都在来自NTREX、Flores和WMT16-23的平行数据上针对全部11个语言对进行微调。

[NLP-18] DocTabQA: Answering Questions from Long Documents Using Tables
[NLP-18] DocTabQA:使用表格回答长文档中的问题

链接: https://arxiv.org/abs/2408.11490
作者: Haochen Wang,Kai Hu,Haoyu Dong,Liangcai Gao
关键词-EN: question answering, Large Language Models, structured tables, leverage Large Language, structured tables derived
关键词-ZH: 问答、大型语言模型、结构化表、利用大型语言、结构化表衍生
类目: Computation and Language (cs.CL)
备注: 18 pages,5 figures

Abstract:We study a new problem setting of question answering (QA), referred to as DocTabQA. Within this setting, given a long document, the goal is to respond to questions by organizing the answers into structured tables derived directly from the document’s content. Unlike traditional QA approaches which predominantly rely on unstructured text to formulate responses, DocTabQA aims to leverage structured tables as answers to convey information clearly and systematically, thereby enhancing user comprehension and highlighting relationships between data points. To the best of our knowledge, this problem has not been previously explored. In this paper, we introduce the QTabA dataset, encompassing 300 financial documents, accompanied by manually annotated 1.5k question-table pairs. Initially, we leverage Large Language Models (LLMs) such as GPT-4 to establish a baseline. However, it is widely acknowledged that LLMs encounter difficulties when tasked with generating intricate, structured outputs from long input sequences. To overcome these challenges, we present a two-stage framework, called DocTabTalk, which initially retrieves relevant sentences from extensive documents and subsequently generates hierarchical tables based on these identified sentences. DocTabTalk incorporates two key technological innovations: AlignLLaMA and TabTalk, which are specifically tailored to assist GPT-4 in tackling DocTabQA, enabling it to generate well-structured, hierarchical tables with improved organization and clarity. Comprehensive experimental evaluations conducted on both QTabA and RotoWire datasets demonstrate that our DocTabTalk significantly enhances the performances of the GPT-4 in our proposed DocTabQA task and the table generation task. The code and dataset are available at this https URL for further research.
摘要:我们研究了一种新的问答问题设置,称为DocTabQA。在这种情况下,给定一个很长的文档,目标是通过将答案组织到直接从文档内容派生的结构化表格中来回答问题。与主要依靠非结构化文本来制定答复的传统QA方法不同,DocTabQA的目标是利用结构化表格作为答复,清楚而系统地传达信息,从而增强用户理解并突出数据点之间的关系。据我们所知,这个问题以前从未被探索过。在本文中,我们介绍了QTabA数据集,包括300个金融文档,以及手动标注的1.5k问表对。最初,我们利用大型语言模型(LLM)(如GPT-4)来建立基准。然而,人们普遍认为,在从长输入序列生成复杂的、结构化的输出时,LLM遇到了困难。为了克服这些挑战,我们提出了一个名为DocTabTalk的两阶段框架,该框架最初从大量文档中检索相关句子,然后根据这些识别的句子生成层次表。DocTabTalk结合了两项关键技术创新:AlignLLaMA和TabTalk,它们是专门为协助GPT-4解决DocTabQA而量身定做的,使其能够以更好的组织和清晰度生成结构良好的分层表格。在QTabA和RotoWire数据集上进行的综合实验评估表明,我们的DocTabTalk在我们提出的DocTabQA任务和表格生成任务中显著提高了GPT-4的性能。代码和数据集可以在此HTTPS URL上找到,以供进一步研究。

[NLP-19] The Self-Contained Negation Test Set
[NLP-19] 独立否定测试集

链接: https://arxiv.org/abs/2408.11469
作者: David Kletz(Lattice, LLF - UMR7110, UPCité),Pascal Amsili(Lattice),Marie Candito(LLF UMR7110, UPCité)
关键词-EN: Pretrained Language Models, Pretrained Language, ability of Pretrained, Gubelmann and Handschuh, Language Models
关键词-ZH: 预训练语言模型,预训练语言,预训练能力,Gubelmann和Handschuh,语言模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs’ predictions as a function of the polarity of inputs, in English. Crucially, this test uses “self-contained” inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating Gubelmann and Handschuh (2022) experiments, we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.
摘要:最近提出了几种方法来评估预训练语言模型(PLM)解释否定的能力。在本文中,我们以Gubelmann和Handschuh(2022)的工作为基础,该工作研究了在英语中PLM的预测如何随输入极性的变化而改变。至关重要的是,这项测试使用以掩码位置结尾的“自包含”输入:根据输入中动词的极性,特定的词元在掩码位置上要么在语义上被排除,要么被允许。通过复现Gubelmann和Handschuh(2022)的实验,我们发现了一些缺陷,这些缺陷削弱了可以从该测试中得出的结论。因此,我们提出了一个改进版本,即Self-Contained Neg Test,它更受控、更系统,并且完全基于构成最小对的例子,这些例子仅在是否存在英语动词否定上有所不同。当我们将测试应用于roberta和bert的base与large模型时,我们发现只有roberta-large呈现出符合预期的趋势,而bert-base大多对否定不敏感。然而,对于所有被测模型,在相当多的测试实例中,top-1预测仍然是被上下文在语义上禁止的词元,这表明在正确处理否定现象方面仍有很大的改进空间。

[NLP-20] Expanding FLORES Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
[NLP-20] 将FLORES基准扩展到更多低资源场景:葡萄牙语-Emakhuwa机器翻译评估

链接: https://arxiv.org/abs/2408.11457
作者: Felermino D. M. Antonio Ali,Henrique Lopes Cardoso,Rui Sousa-Silva
关键词-EN: Initiative shared tasks, Open Language Data, Language Data Initiative, low-resource language widely, language widely spoken
关键词-ZH: 倡议共享任务、开放语言数据、语言数据倡议、广泛使用的低资源语言、广泛使用的语言
类目: Computation and Language (cs.CL)
备注: Open Language Data Initiative 2024 shared tasks

Abstract:As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at this https URL.
摘要:作为开放语言数据计划共享任务的一部分,我们扩展了FLORES+评估集,以包括Emakhuwa,这是一种在莫桑比克广泛使用的低资源语言。我们将dev和DevTest集从葡萄牙语翻译成Emakhuwa,并详细介绍了翻译过程和使用的质量保证措施。我们的方法涉及各种质量检查,包括后期编辑和充分性评估。生成的数据集由每个来源的多个参考句组成。我们展示了训练神经机器翻译系统和微调现有多语言翻译模型的基线结果。我们的研究结果表明,拼写不一致仍然是Emakhuwa的一个挑战。此外,基线模型在此评估集中表现不佳,凸显了进一步研究以提高Emakhuwa的机器翻译质量的必要性。该数据可在此https URL上公开获取。

[NLP-21] Distributional Properties of Subword Regularization
[NLP-21] 子词正则化的分布性质

链接: https://arxiv.org/abs/2408.11443
作者: Marco Cognetta,Vilém Zouhar,Naoaki Okazaki
关键词-EN: widely in NLP, improves model performance, training corpus, model performance, performance by reducing
关键词-ZH: 广泛应用于NLP、提高模型性能、训练语料库、模型性能、通过减少来提高性能
类目: Computation and Language (cs.CL)
备注: 4 pages + 4 page appendix. 3 figures

Abstract:Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
摘要:在NLP中广泛使用的子词正则化通过减少对精确分词结果的依赖、扩充训练语料库以及在训练期间让模型接触更多不同的上下文来提高模型性能。BPE和MaxMatch是两种流行的子词分词方案,都有基于随机丢弃的正则化变体。然而,尚未有人分析过它们所形成的分布。我们表明,这些随机变体严重偏向于每个单词的一小部分分词结果。如果子词正则化的好处确如上文所述,我们假设这种偏向性人为地限制了这些方案的有效性。因此,我们提出了一种对分词结果进行均匀采样的算法,并将其用作现有分词器随机部分的直接替代,结果发现它提高了机器翻译质量。
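
To make "uniformly sampling tokenizations" concrete, here is a minimal sketch under simplifying assumptions: a tokenization is any split of the word into pieces from a fixed vocabulary, and the toy vocabulary and function names are hypothetical. The abstract does not describe the authors' algorithm, so this is a generic count-then-sample dynamic program and should not be read as their method.

```python
import random
from typing import List, Set


def suffix_counts(word: str, vocab: Set[str]) -> List[int]:
    """counts[i] = number of ways to split word[i:] into vocabulary tokens."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        counts[i] = sum(
            counts[j] for j in range(i + 1, n + 1) if word[i:j] in vocab
        )
    return counts


def sample_uniform_tokenization(
    word: str, vocab: Set[str], rng: random.Random
) -> List[str]:
    """Draw one segmentation so that every valid segmentation is equally likely."""
    counts = suffix_counts(word, vocab)
    if counts[0] == 0:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    tokens, i = [], 0
    while i < len(word):
        # Candidate end positions that still allow the rest to be segmented.
        ends = [
            j for j in range(i + 1, len(word) + 1)
            if word[i:j] in vocab and counts[j] > 0
        ]
        # Choosing j with probability counts[j] / counts[i] makes the product
        # over all steps telescope to 1 / counts[0] for every segmentation.
        j = rng.choices(ends, weights=[counts[e] for e in ends], k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens


toy_vocab = {"a", "b", "c", "ab", "bc", "abc"}
sample = sample_uniform_tokenization("abc", toy_vocab, random.Random(0))
print(sample)
```

By contrast, a scheme that decides locally at each step without weighting by the number of remaining completions will generally favour some segmentations over others, which is the kind of bias the abstract reports.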

[NLP-22] LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems
[NLP-22] LAHAJA:用于评估印地语ASR系统的稳健多口音基准

链接: https://arxiv.org/abs/2408.11440
作者: Tahir Javed,Janki Nawale,Sakshi Joshi,Eldho George,Kaushal Bhogale,Deovrat Mehendale,Mitesh M. Khapra
关键词-EN: diverse linguistic origins, Hindi ASR systems, linguistic origins, Hindi ASR, spoken language
关键词-ZH: 不同的语言起源,印地语ASR系统,语言起源,印地语ASR,口语
类目: Computation and Language (cs.CL)
备注:

Abstract:Hindi, one of the most spoken language of India, exhibits a diverse array of accents due to its usage among individuals from diverse linguistic origins. To enable a robust evaluation of Hindi ASR systems on multiple accents, we create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases, with a total of 12.5 hours of Hindi audio, sourced from 132 speakers spanning 83 districts of India. We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor. We then train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin. We also present a fine-grained analysis which shows that the performance declines for speakers from North-East and South India, especially with content heavy in named entities and specialized terminology.
摘要:印地语是印度使用最广泛的语言之一,由于使用者来自不同的语言背景,它呈现出多种多样的口音。为了能够在多种口音上对印地语ASR系统进行稳健评估,我们创建了基准LAHAJA,其中包含涵盖多种主题和用例的朗读语音和即兴语音,共12.5小时的印地语音频,来自印度83个地区的132名说话人。我们在LAHAJA上评估了现有的开源和商业模型,发现它们的性能很差。然后,我们使用不同的数据集训练模型,发现在说话人多样性良好的多语言数据上训练的模型明显优于现有模型。我们还给出了细粒度分析,结果显示对于来自印度东北部和南部的说话人,性能有所下降,在内容包含大量命名实体和专业术语时尤其如此。

[NLP-23] Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning
[NLP-23] 通过无标签课程式有意义学习诊断和补救LLM的知识缺陷

链接: https://arxiv.org/abs/2408.11431
作者: Kai Xiong,Xiao Ding,Li Du,Jiahao Ying,Ting Liu,Bing Qin,Yixin Cao
关键词-EN: Large Language Models, extensive unlabeled text, demonstrate impressive generalization, impressive generalization ability, Large Language
关键词-ZH: 大型语言模型,大量的未标签文本,表现出令人印象深刻的概括,令人印象深刻的概括能力,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

Abstract:Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effective feedback is demanding. Furthermore, evaluating LLMs comprehensively with limited labeled samples is difficult. This makes it a challenge to diagnose and remedy the deficiencies of LLMs through rich label-free user queries. To tackle this challenge, we propose a label-free curricular meaningful learning framework (LaMer). LaMer first employs relative entropy to automatically diagnose and quantify the knowledge deficiencies of LLMs in a label-free setting. Next, to remedy the diagnosed knowledge deficiencies, we apply curricular meaningful learning: first, we adopt meaningful learning to adaptively synthesize augmentation data according to the severity of the deficiencies, and then design a curricular deficiency remedy strategy to remedy the knowledge deficiencies of LLMs progressively. Experiments show that LaMer efficiently and effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning and language understanding benchmarks, achieving comparable results to baselines with just 40% training data. LaMer even surpasses methods that rely on labeled datasets for deficiency diagnosis. In application, our label-free method can offer an effective knowledge deficiency diagnostic tool for efficient LLM development.
摘要:大型语言模型(LLM)用途广泛,通过从大量未标注文本中挖掘和学习信息,表现出令人印象深刻的泛化能力。然而,它们仍然会出现推理错误,这些错误通常源于知识缺陷,可能影响其可信度和可靠性。虽然用户可以提供多样而全面的查询,但要获得充分而有效的反馈却很困难。此外,用有限的标注样本全面评估LLM也很困难。这使得通过丰富的无标签用户查询来诊断和补救LLM的缺陷成为一项挑战。为了应对这一挑战,我们提出了一个无标签的课程式有意义学习框架(LaMer)。LaMer首先利用相对熵在无标签设置下自动诊断并量化LLM的知识缺陷。接着,为了补救诊断出的知识缺陷,我们应用课程式有意义学习:首先采用有意义学习,根据缺陷的严重程度自适应地合成增强数据,然后设计课程式缺陷补救策略,逐步补救LLM的知识缺陷。实验表明,LaMer能够高效且有效地诊断和补救LLM的知识缺陷,在7个分布外(OOD)推理和语言理解基准上改进了多种LLM,仅用40%的训练数据就取得了与基线相当的结果。LaMer甚至超过了依赖标注数据集进行缺陷诊断的方法。在应用中,我们的无标签方法可以为高效的LLM开发提供有效的知识缺陷诊断工具。
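
Relative entropy (KL divergence), which the abstract names as LaMer's diagnostic signal, can be computed as in the sketch below. Which two distributions LaMer actually compares is not stated in the abstract; the `with_hint` and `without_hint` answer distributions here are purely hypothetical placeholders, and the function is the textbook definition rather than the authors' code.

```python
import math
from typing import Sequence


def relative_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Hypothetical answer distributions over four options for one query.
with_hint = [0.70, 0.10, 0.10, 0.10]
without_hint = [0.25, 0.25, 0.25, 0.25]
score = relative_entropy(with_hint, without_hint)
print(f"divergence = {score:.3f} nats")
```

A larger value means the two distributions disagree more; under the assumption above, such queries would be the ones flagged for remediation first.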

[NLP-24] Towards “Differential AI Psychology” and in-context Value-driven Statement Alignment with Moral Foundations Theory
[NLP-24] 迈向“差异人工智能心理学”以及基于道德基础理论的上下文内价值驱动陈述对齐

链接: https://arxiv.org/abs/2408.11415
作者: Simon Münker
关键词-EN: Contemporary research, increasingly utilizing, sciences is increasingly, language models, language
关键词-ZH: 当代研究、越来越多地利用、科学正越来越多地、语言模型、语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 tables

Abstract:Contemporary research in social sciences is increasingly utilizing state-of-the-art statistical language models to annotate or generate content. While these models perform benchmark-leading on common language tasks and show exemplary task-independent emergent abilities, transferring them to novel out-of-domain tasks is only insufficiently explored. The implications of the statistical black-box approach - stochastic parrots - are prominently criticized in the language model research community; however, the significance for novel generative tasks is not. This work investigates the alignment between personalized language models and survey participants on a Moral Foundation Theory questionnaire. We adapt text-to-text models to different political personas and survey the questionnaire repetitively to generate a synthetic population of persona and model combinations. Analyzing the intra-group variance and cross-alignment shows significant differences across models and personas. Our findings indicate that adapted models struggle to represent the survey-captured assessment of political ideologies. Thus, using language models to mimic social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes. Without quantifiable alignment, generating politically nuanced content remains unfeasible. To enhance these representations, we propose a testable framework to generate agents based on moral value statements for future research.
摘要:当代社会科学研究越来越多地使用最先进的统计语言模型来标注或生成内容。虽然这些模型在常见语言任务上达到基准领先的表现,并显示出与任务无关的涌现能力,但将它们迁移到新的域外任务上的研究还很不充分。统计黑盒方法(即“随机鹦鹉”)的影响在语言模型研究界受到了突出的批评;然而,它对新颖生成任务的意义却没有得到同样的讨论。本研究考察了个性化语言模型与调查参与者在道德基础理论问卷上的一致性。我们使文本到文本模型适配不同的政治人物角色,并反复进行问卷调查,以生成由角色和模型组合构成的合成总体。对组内方差和交叉对齐的分析表明,不同模型和角色之间存在显著差异。我们的发现表明,适配后的模型难以体现调查所捕获的对政治意识形态的评估。因此,使用语言模型来模拟社会互动,需要在上下文优化或参数操纵方面取得可衡量的改进,才能与心理学和社会学的刻板印象保持一致。如果没有可量化的对齐,生成具有政治细微差别的内容仍然不可行。为了增强这些表示,我们提出了一个可测试的框架,用于基于道德价值陈述生成智能体,供未来研究使用。

[NLP-25] MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing
[NLP-25] MoE-LPR:通过带语言先验路由的专家混合对大型语言模型进行多语言扩展

链接: https://arxiv.org/abs/2408.11396
作者: Hao Zhou,Zhijun Wang,Shujian Huang,Xin Huang,Xue Han,Junlan Feng,Chao Deng,Weihua Luo,Jiajun Chen
关键词-EN: Large Language Models, English-centric due, Large Language, Language Priors Routing, Language
关键词-ZH: 大型语言模型,以英语为中心,大型语言,语言先验路由,语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of the ability of original languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR’s effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at this https URL.
摘要:大型语言模型(LLM)往往以英语为中心,因为其预训练数据中各语言的分布不成比例。通过后预训练(post-pretraining)增强非英语语言能力,往往会导致对原有语言能力的灾难性遗忘。以前的方法要么以严重遗忘为代价实现良好的扩展,要么遗忘轻微但扩展效果较差,这表明在扩展语言的同时防止遗忘是一项难以平衡的挑战。在本文中,我们提出了一种称为MoE-LPR(Mixture-of-Experts with Language Priors Routing)的方法来缓解这一问题。MoE-LPR采用两阶段训练方法来增强多语言能力。首先,通过upcycling将模型后预训练为专家混合(MoE)架构,其中所有原始参数被冻结,并添加新的专家。在这一阶段,我们专注于提高扩展语言上的能力,不使用任何原有语言的数据。然后,模型使用数据量不到后预训练数据1%的回放数据来复习原有语言的知识,其中我们引入语言先验路由,以更好地恢复原有语言的能力。在多个基准上的评估表明,MoE-LPR优于其他后预训练方法。冻结原始参数保留了原有语言的知识,而添加新的专家保留了学习能力。使用LPR进行复习能够有效利用参数中的多语言知识。此外,MoE架构在增加模型总参数量的同时保持相同的推理开销。大量实验表明,MoE-LPR能够有效提升扩展语言的能力并保持原有语言的熟练程度,且具有出色的可扩展性。代码和脚本可在此https URL免费获得。

[NLP-26] First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models
[NLP-26] 第一个激活很重要:大型语言模型中动态激活的免训练方法

链接: https://arxiv.org/abs/2408.11393
作者: Chi Ma,Mincong Huang,Ying Zhang,Chao Wang,Yujie Wang,Lei Yu,Chuan Liu,Wei Lin
关键词-EN: large language models, Threshold-based Dynamic Activation, DejaVu and MoEfication, demonstrated their potential, enhance the inference
关键词-ZH: 大型语言模型、基于阈值的动态激活、DejaVu和MoEfication、展示了它们的潜力、增强推理
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation(TDA) method that leverage sequence information to exploit the inherent sparsity of models across various architectures. This method is designed to accelerate generation speed by 18-25% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our comprehensive analyses not only provide a robust theoretical foundation for DA methods but also offer valuable insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.
摘要:动态激活(DA)技术,如DejaVu和MoEfication,已经展示出显著提高大型语言模型(LLM)推理效率的潜力。然而,这些技术通常依赖于ReLU激活函数,或者需要额外的参数和训练来维持性能。本文介绍了一种免训练的基于阈值的动态激活(TDA)方法,该方法借助序列信息来利用各种架构的模型中固有的稀疏性。该方法旨在在不显著影响任务性能的情况下将生成速度提高18-25%,从而解决现有DA技术的局限性。此外,我们深入探讨了LLM稀疏性的根本原因,并从理论上分析了它的两个关键特征:与历史相关的激活不确定性和与语义无关的激活惯性。我们的全面分析不仅为DA方法提供了坚实的理论基础,也为未来优化LLM以获得更高效率和效果的研究提供了有价值的见解。
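
The general idea of thresholding activations using sequence information can be sketched as follows. The abstract does not give the TDA selection rule, so everything here is an assumption for illustration: the mean-magnitude criterion, the function names, and the toy numbers. The sketch only shows why skipping low-magnitude neurons reduces work in the feed-forward down-projection.

```python
from typing import List


def select_active_neurons(prompt_acts: List[List[float]], threshold: float) -> List[int]:
    """Indices of neurons whose mean |activation| over the prompt tokens
    reaches the threshold (hypothetical selection rule)."""
    n_tokens = len(prompt_acts)
    n_neurons = len(prompt_acts[0])
    keep = []
    for k in range(n_neurons):
        mean_mag = sum(abs(row[k]) for row in prompt_acts) / n_tokens
        if mean_mag >= threshold:
            keep.append(k)
    return keep


def sparse_down_projection(
    acts: List[float], w_down: List[List[float]], active: List[int]
) -> List[float]:
    """Compute acts @ w_down while touching only the rows of active neurons."""
    d_out = len(w_down[0])
    out = [0.0] * d_out
    for k in active:  # inactive neurons are skipped entirely
        for j in range(d_out):
            out[j] += acts[k] * w_down[k][j]
    return out


# Toy example: 2 prompt tokens, 3 hidden neurons, 2 output dimensions.
prompt_acts = [[0.9, 0.01, -0.5], [1.1, -0.02, 0.7]]
active = select_active_neurons(prompt_acts, threshold=0.1)
w_down = [[1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
out = sparse_down_projection([1.0, 0.01, 0.5], w_down, active)
print(active, out)
```

The output differs slightly from the dense result because neuron 1 is dropped; the threshold controls that trade-off between speed and fidelity.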

[NLP-27] On the Interchangeability of Positional Embeddings in Multilingual Neural Machine Translation Models
[NLP-27] 多语言神经机器翻译模型中位置嵌入的互换性

链接: https://arxiv.org/abs/2408.11382
作者: Varun Gumma,Pranjal A. Chitale,Kalika Bali
关键词-EN: Standard Neural Machine, Neural Machine Translation, Standard Neural, Neural Machine, Machine Translation
关键词-ZH: 标准神经机器,神经机器翻译,标准神经,神经机器,机器翻译
类目: Computation and Language (cs.CL)
备注: Under Review

Abstract:Standard Neural Machine Translation (NMT) models have traditionally been trained with Sinusoidal Positional Embeddings (PEs), which are inadequate for capturing long-range dependencies and are inefficient for long-context or document-level translation. In contrast, state-of-the-art large language models (LLMs) employ relative PEs, demonstrating superior length generalization. This work explores the potential for efficiently switching the Positional Embeddings of pre-trained NMT models from absolute sinusoidal PEs to relative approaches such as RoPE and ALiBi. Our findings reveal that sinusoidal PEs can be effectively replaced with RoPE and ALiBi with negligible or no performance loss, achieved by fine-tuning on a small fraction of high-quality data. Additionally, models trained without Positional Embeddings (NoPE) are not a viable solution for Encoder-Decoder architectures, as they consistently under-perform compared to models utilizing any form of Positional Embedding. Furthermore, even a model trained from scratch with these relative PEs slightly under-performs a fine-tuned model, underscoring the efficiency and validity of our hypothesis.
摘要:标准的神经机器翻译(NMT)模型传统上使用正弦位置嵌入(PE)进行训练,这种嵌入不足以捕捉长距离依赖关系,并且在长上下文或文档级翻译中效率低下。相比之下,最先进的大型语言模型(LLM)使用相对PE,表现出更优越的长度泛化能力。这项工作探索了将预训练NMT模型的位置嵌入从绝对正弦PE高效地切换为RoPE和ALiBi等相对方法的可能性。我们的发现表明,通过在一小部分高质量数据上进行微调,可以有效地将正弦PE替换为RoPE和ALiBi,而性能损失可以忽略不计甚至没有。此外,不使用位置嵌入(NoPE)训练的模型对于编码器-解码器架构来说不是可行的方案,因为与使用任何形式位置嵌入的模型相比,它们的表现始终较差。此外,即使是使用这些相对PE从头训练的模型,表现也略逊于微调模型,这突显了我们假设的效率和有效性。
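
For readers unfamiliar with the two families being swapped, the sketch below contrasts an absolute sinusoidal embedding with an ALiBi-style relative bias. The sinusoidal formula and the geometric slope schedule follow the standard published definitions (the slope schedule assumes a power-of-two head count). The symmetric `abs(i - j)` distance is an assumption for bidirectional attention; how the paper adapts ALiBi to its encoder is not stated in the abstract.

```python
import math
from typing import List


def sinusoidal_pe(pos: int, d_model: int) -> List[float]:
    """Absolute sinusoidal embedding: added to the token embedding at position pos."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes 2^(-8h/n) for h = 1..n (n assumed to be a power of two)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Relative bias added to attention logits: a penalty linear in token distance."""
    return [
        [-slope * abs(i - j) for j in range(seq_len)]
        for i in range(seq_len)
    ]


print(sinusoidal_pe(0, 4))
print(alibi_slopes(8)[:3])
print(alibi_bias(3, 0.5))
```

The key difference is where position enters: the sinusoidal vector depends on the absolute index, whereas the ALiBi bias depends only on the distance between two tokens, which is what makes it a relative method.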

[NLP-28] RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
[NLP-28] RAGLAB:用于检索增强生成的模块化、面向研究的统一框架

链接: https://arxiv.org/abs/2408.11381
作者: Xuanwang Zhang,Yunze Song,Yidong Wang,Shuyun Tang,Xinfeng Li,Zhengran Zeng,Zhen Wu,Wei Ye,Wenyuan Xu,Yue Zhang,Xinyu Dai,Shikun Zhang,Qingsong Wen
关键词-EN: Large Language Models, Large Language, Language Models, demonstrate human-level capabilities, Retrieval Augmented Generation
关键词-ZH: 大型语言模型、大型语言、语言模型、展示人类层面的能力、检索增强生成
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 figures

Abstract:Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. However, even the most advanced LLMs face challenges such as hallucinations and real-time updating of their knowledge. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG). However, two key issues constrained the development of RAG. First, there is a growing lack of comprehensive and fair comparisons between novel RAG algorithms. Second, open-source tools such as LlamaIndex and LangChain employ high-level abstractions, which results in a lack of transparency and limits the ability to develop novel algorithms and evaluation metrics. To close this gap, we introduce RAGLAB, a modular and research-oriented open-source library. RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms. Leveraging RAGLAB, we conduct a fair comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers can efficiently compare the performance of various algorithms and develop novel algorithms.
摘要:大型语言模型(LLM)在对话、推理和知识保持方面展示了人类水平的能力。然而,即使是最先进的LLM也面临着幻觉和知识实时更新等挑战。目前的研究通过为LLM配备外部知识来解决这一瓶颈,这种技术称为检索增强生成(RAG)。然而,有两个关键问题制约了RAG的发展。首先,新的RAG算法之间越来越缺乏全面和公平的比较。其次,LlamaIndex和LangChain等开源工具使用高层抽象,导致缺乏透明度,并限制了开发新算法和评估指标的能力。为了弥补这一差距,我们引入了RAGLAB,这是一个模块化的、面向研究的开源库。RAGLAB复现了6种现有算法,并为研究RAG算法提供了一个全面的生态系统。利用RAGLAB,我们在10个基准上对6种RAG算法进行了公平的比较。借助RAGLAB,研究人员可以高效地比较各种算法的性能,并开发新的算法。

[NLP-29] GeoReasoner: Reasoning On Geospatially Grounded Context For Natural Language Understanding
[NLP-29] GeoReasoner:基于地理空间上下文的推理以促进自然语言理解

链接: https://arxiv.org/abs/2408.11366
作者: Yibo Yan,Joey Lee
关键词-EN: involves recognizing geographic, recognizing geographic entities, making informed inferences, reading and communication, individuals tend
关键词-ZH: 涉及识别地理、识别地理实体、做出明智的推断、阅读和沟通,个人倾向于
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by International Conference on Information and Knowledge Management 2024

Abstract:In human reading and communication, individuals tend to engage in geospatial reasoning, which involves recognizing geographic entities and making informed inferences about their interrelationships. To mimic such cognitive process, current methods either utilize conventional natural language understanding toolkits, or directly apply models pretrained on geo-related natural language corpora. However, these methods face two significant challenges: i) they do not generalize well to unseen geospatial scenarios, and ii) they overlook the importance of integrating geospatial context from geographical databases with linguistic information from the Internet. To handle these challenges, we propose GeoReasoner, a language model capable of reasoning on geospatially grounded natural language. Specifically, it first leverages Large Language Models (LLMs) to generate a comprehensive location description based on linguistic and geospatial information. It also encodes direction and distance information into spatial embedding via treating them as pseudo-sentences. Consequently, the model is trained on both anchor-level and neighbor-level inputs to learn geo-entity representation. Extensive experimental results demonstrate GeoReasoner’s superiority in three tasks: toponym recognition, toponym linking, and geo-entity typing, compared to the state-of-the-art baselines.
摘要:在人类阅读和交流中,个体倾向于参与地理空间推理,这涉及识别地理实体并对它们之间的相互关系做出知情的推理。为了模拟这种认知过程,目前的方法要么利用传统的自然语言理解工具包,要么直接应用在与地理相关的自然语言语料库上预先训练的模型。然而,这些方法面临着两个重大挑战:i)它们不能很好地推广到看不见的地理空间场景;ii)它们忽略了将地理数据库中的地理空间上下文与来自互联网的语言信息相结合的重要性。为了应对这些挑战,我们提出了一种能够对基于地理空间的自然语言进行推理的语言模型–GeoReasoner。具体地说,它首先利用大型语言模型(LLM)基于语言和地理空间信息生成全面的位置描述。将方向信息和距离信息作为伪句进行空间嵌入。因此,模型在锚级和邻接级输入上进行训练,以学习地理实体表示。大量的实验结果表明,与最先进的基线相比,GeoReasoner在三个任务上具有优势:地名识别、地名链接和地理实体分类。

[NLP-30] Clinical Context-aware Radiology Report Generation from Medical Images using Transformers
[NLP-30] 使用Transformers从医学图像生成临床上下文感知放射学报告

链接: https://arxiv.org/abs/2408.11344
作者: Sonit Singh
关键词-EN: Natural Language Processing, radiology report generation, Recent developments, field of Natural, Language Processing
关键词-ZH: 自然语言处理、放射学报告生成、最新发展、自然领域、语言处理
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 6 figures, 8 tables

Abstract:Recent developments in the field of Natural Language Processing, especially language models such as the transformer have brought state-of-the-art results in language understanding and language generation. In this work, we investigate the use of the transformer model for radiology report generation from chest X-rays. We also highlight limitations in evaluating radiology report generation using only the standard language generation metrics. We then applied a transformer based radiology report generation architecture, and also compare the performance of a transformer based decoder with the recurrence based decoder. Experiments were performed using the IU-CXR dataset, showing superior results to its LSTM counterpart and being significantly faster. Finally, we identify the need of evaluating radiology report generation system using both language generation metrics and classification metrics, which helps to provide robust measure of generated reports in terms of their coherence and diagnostic value.
摘要:自然语言处理领域的最新发展,特别是Transformer等语言模型,在语言理解和语言生成方面带来了最先进的结果。在这项工作中,我们研究了使用Transformer模型从胸部X光片生成放射学报告。我们还强调了仅使用标准语言生成指标评估放射学报告生成的局限性。随后,我们应用了一种基于Transformer的放射学报告生成架构,并比较了基于Transformer的解码器与基于循环结构的解码器的性能。实验在IU-CXR数据集上进行,结果表明其优于对应的LSTM模型,并且速度明显更快。最后,我们指出需要同时使用语言生成指标和分类指标来评估放射学报告生成系统,这有助于从连贯性和诊断价值两方面对生成的报告给出稳健的度量。

[NLP-31] BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports
[NLP-31] BURExtract-Llama:乳腺超声报告中临床概念提取的LLM

链接: https://arxiv.org/abs/2408.11334
作者: Yuxuan Chen,Haoyan Yang,Hengkai Pan,Fardeen Siddiqui,Antonio Verdone,Qingyang Zhang,Sumit Chopra,Chen Zhao,Yiqiu Shen
关键词-EN: Breast ultrasound, reports summarizing key, summarizing key findings, diagnosing abnormalities, malignancy assessments
关键词-ZH: 乳房超声、总结关键的报告、总结关键发现、诊断异常、恶性肿瘤评估
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted as the oral paper for the HCHM workshop, ACM Multimedia 2024

Abstract:Breast ultrasound is essential for detecting and diagnosing abnormalities, with radiology reports summarizing key findings like lesion characteristics and malignancy assessments. Extracting this critical information is challenging due to the unstructured nature of these reports, with varied linguistic styles and inconsistent formatting. While proprietary LLMs like GPT-4 are effective, they are costly and raise privacy concerns when handling protected health information. This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports. We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it. Evaluated on clinician-annotated reports, our model achieves an average F1 score of 84.6%, which is on par with GPT-4. Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4’s performance but also offers cost reductions and enhanced data privacy.
摘要:乳腺超声对于检测和诊断异常至关重要,放射学报告总结了病变特征和恶性肿瘤评估等关键发现。由于这些报告的非结构化性质、语言风格各异且格式不一致,提取这些关键信息具有挑战性。虽然GPT-4等专有LLM很有效,但它们成本高昂,并且在处理受保护的健康信息时会引发隐私问题。这项研究提供了一个开发内部LLM以从放射学报告中提取临床信息的管道。我们首先使用GPT-4创建一个小型标记数据集,然后在其上微调Llama 3 -8B模型。根据临床医生注释的报告进行评估,我们的模型的F1平均评分为84.6%,与GPT-4相当。我们的研究结果证明了开发内部LLM的可行性,该LLM不仅与GPT-4的性能相匹配,而且还可以降低成本并增强数据隐私。
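
The average F1 score reported above can be illustrated with a minimal sketch. It assumes an unweighted (macro) average over per-field F1 scores; the abstract does not say which averaging the authors use, and the field names and counts below are made up for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(counts_per_field: dict) -> float:
    """Unweighted (macro) mean of the per-field F1 scores."""
    scores = [f1_score(*counts) for counts in counts_per_field.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts for two made-up report fields.
counts = {"lesion_shape": (8, 2, 2), "lesion_margin": (6, 2, 2)}
print(round(average_f1(counts), 3))
```

A macro average weights every extracted field equally, which matters when some fields appear in far fewer reports than others.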

[NLP-32] Design Principle Transfer in Neural Architecture Search via Large Language Models
[NLP-32] 通过大型语言模型进行神经架构搜索中的设计原则转移

链接: https://arxiv.org/abs/2408.11330
作者: Xun Zhou,Liang Feng,Xingyu Wu,Zhichao Lu,Kay Chen Tan
关键词-EN: Transferable neural architecture, Transferable neural, efficient neural architectures, design efficient neural, efficient neural
关键词-ZH: 可移植神经架构,可移植神经,高效神经架构,设计高效神经,高效神经
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Transferable neural architecture search (TNAS) has been introduced to design efficient neural architectures for multiple tasks, to enhance the practical applicability of NAS in real-world scenarios. In TNAS, architectural knowledge accumulated in previous search processes is reused to warm up the architecture search for new tasks. However, existing TNAS methods still search in an extensive search space, necessitating the evaluation of numerous architectures. To overcome this challenge, this work proposes a novel transfer paradigm, i.e., design principle transfer. In this work, the linguistic description of various structural components’ effects on architectural performance is termed design principles. They are learned from established architectures and then can be reused to reduce the search space by discarding unpromising architectures. Searching in the refined search space can boost both the search performance and efficiency for new NAS tasks. To this end, a large language model (LLM)-assisted design principle transfer (LAPT) framework is devised. In LAPT, LLM is applied to automatically reason the design principles from a set of given architectures, and then a principle adaptation method is applied to refine these principles progressively based on the new search results. Experimental results show that LAPT can beat the state-of-the-art TNAS methods on most tasks and achieve comparable performance on others.
摘要:可转移神经体系结构搜索(TNAS)被引入以设计高效的多任务神经体系结构,以增强NAS在现实世界场景中的实用适用性。在TNAS中,以前搜索过程中积累的体系结构知识被重用,以热身为新任务的体系结构搜索。然而,现有的TNAS方法仍然在广泛的搜索空间中进行搜索,这就需要对众多的体系结构进行评估。为了克服这一挑战,本工作提出了一种新的迁移范式,即设计原则迁移。在这项工作中,各种结构构件对建筑性能的影响的语言描述被称为设计原则。它们是从已建立的体系结构中学习的,然后可以重复使用,通过丢弃没有前景的体系结构来减少搜索空间。在精细化的搜索空间中进行搜索可以提高新NAS任务的搜索性能和效率。为此,设计了一个基于大语言模型的设计原理迁移框架。在LAPT中,应用LLM从一组给定的体系结构中自动推理出设计原则,然后根据新的搜索结果采用原则自适应方法逐步求精这些原则。实验结果表明,LAPT在大多数任务上都能击败最先进的TNAS方法,在其他任务上也能获得与之相当的性能。

[NLP-33] Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies
[NLP-33] 即插即用与融合:通过跨不同词表的词级重排序实现零样本联合解码

链接: https://arxiv.org/abs/2408.11327
作者: Sai Koneru,Matthias Huck,Miriam Exel,Jan Niehues
关键词-EN: Recent advancements, advancements in NLP, NLP have resulted, processing multimodal inputs, specific domains
关键词-ZH: 最近的进步,NLP的进步,NLP的进步,处理多模式输入,特定领域
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

Abstract: Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality. We will release the code upon paper acceptance.
摘要:自然语言处理领域的最新进展产生了各具专长的模型,例如处理多模态输入或在特定领域表现出色。然而,现实世界中的任务(如多模态翻译)通常需要结合这些优势,例如同时处理翻译和图像。虽然单独的翻译模型和视觉模型功能强大,但它们通常无法在单个系统中同时完成这两项任务。将这些模型结合起来存在挑战,特别是由于它们的词表不同,这使得传统集成方法只能用于N-best列表重排序这类生成后技术。在这项工作中,我们提出了一种新的零样本集成策略,允许在解码阶段整合不同的模型,而不需要额外的训练。我们的方法在解码过程中通过组合词级分数对束(beam)进行重排序,并使用启发式方法预测一个词何时结束。我们在机器翻译场景中演示了这种方法的有效性,表明它能够生成同时感知语音和图像的翻译,并提高整体翻译质量。我们将在论文被接收后发布代码。
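
The idea of re-ranking beams by combining word-level scores from two models can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the word-boundary heuristic assumes SentencePiece-style tokens (where `▁` marks the start of a new word), the linear interpolation weight is hypothetical, and the scores are made-up log-probabilities.

```python
WORD_START = "\u2581"  # SentencePiece marks the start of a new word with this symbol


def word_is_complete(next_token: str) -> bool:
    """Heuristic (assumption): the current word ends when the next token opens a new word."""
    return next_token.startswith(WORD_START)


def rerank_beams(beams, weight=0.5):
    """Sort beams by an interpolation of two models' word-level log-probabilities.

    beams: list of (text, score_model_a, score_model_b) tuples, where each score
    is the summed log-probability of the words completed so far.
    """
    def combined(beam):
        _, score_a, score_b = beam
        return weight * score_a + (1.0 - weight) * score_b

    return sorted(beams, key=combined, reverse=True)


# Hypothetical beams: the translation model prefers "a cat", the second model "a dog".
beams = [("a cat", -1.0, -5.0), ("a dog", -2.0, -1.0)]
print(rerank_beams(beams)[0][0])
```

Scoring whole words rather than subword tokens is what lets models with different vocabularies be compared on the same hypotheses.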

[NLP-34] Towards Evaluating Large Language Models on Sarcasm Understanding
[NLP-34] 对讽刺理解的大型语言模型进行评估

链接: https://arxiv.org/abs/2408.11319
作者: Yazhou Zhang,Chunwang Zou,Zheng Lian,Prayag Tiwari,Jing Qin
关键词-EN: sentiment analysis, text classification, large language models, successfully solved, era of large
关键词-ZH: 情感分析、文本分类、大型语言模型、成功解决、大型时代
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract: In the era of large language models (LLMs), the tasks of “System I”, i.e., the fast, unconscious, and intuitive tasks such as sentiment analysis and text classification, have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs’ success may not be fully tenable when considering sarcasm understanding. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, chain of thought (CoT) prompting. Our results highlight three key findings: (1) current LLMs underperform supervised PLMs based sarcasm detection baselines across six sarcasm benchmarks. This suggests that significant efforts are still required to improve LLMs’ understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) Few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT. The reason is that sarcasm detection, being a holistic, intuitive, and non-rational cognitive process, is argued not to adhere to step-by-step logical reasoning, making CoT less effective in understanding sarcasm compared to its effectiveness in mathematical reasoning tasks.
摘要:在大型语言模型(LLM)时代,“系统一”类任务,即情感分析、文本分类等快速、无意识、直觉性的任务,被认为已经得到成功解决。然而,讽刺作为一种微妙的语言现象,往往使用夸张、比喻等修辞手段来传达真实的情感和意图,涉及的抽象程度高于情感分析。人们越来越担心,在考虑讽刺理解时,关于LLM成功的论点可能并不完全站得住脚。为了解决这一问题,我们选取了11个SOTA LLM和8个SOTA预训练语言模型(PLM),并通过不同的提示方法在六个广泛使用的基准数据集上进行了综合评估,即零样本输入/输出(IO)提示、少样本IO提示和思维链(CoT)提示。我们的结果突出了三个关键发现:(1)在六个讽刺基准上,当前LLM的表现逊于基于有监督PLM的讽刺检测基线。这表明,要提高LLM对人类讽刺的理解,仍需要付出大量努力。(2)在各种提示方法下,GPT-4均持续且显著优于其他LLM,平均提升14.0%。Claude 3和ChatGPT的表现仅次于GPT-4。(3)少样本IO提示方法优于另外两种方法:零样本IO和少样本CoT。原因在于,讽刺检测作为一个整体的、直觉的、非理性的认知过程,被认为并不遵循循序渐进的逻辑推理,这使得CoT在理解讽刺方面的效果不如其在数学推理任务中的效果。

[NLP-35] EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
[NLP-35] EEG-Defender:通过早期退出生成大型语言模型来抵御越狱

链接: https://arxiv.org/abs/2408.11308
作者: Chongwen Zhao,Zhihao Dou,Kaizhu Huang
关键词-EN: Large Language Models, Large Language, increasingly attracting attention, Language Models, increasingly attracting
关键词-ZH: 大型语言模型,大型语言,越来越吸引关注,语言模型,越来越吸引
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 19 pages, 7 figures

Abstract:Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of “Alignment” technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as “Jailbreak.” Our research takes cues from the human-like generate process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model’s latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early transformer outputs of LLMs as a means to detect malicious inputs, and terminate the generation immediately. Built upon this idea, we introduce a simple yet significant defense approach called EEG-Defender for LLMs. We conduct comprehensive experiments on ten jailbreak methods across three models. Our results demonstrate that EEG-Defender is capable of reducing the Attack Success Rate (ASR) by a significant margin, roughly 85% in comparison with 50% for the present SOTAs, with minimal impact on the utility and effectiveness of LLMs.
摘要:大型语言模型(LLM)在各种应用中日益受到关注。尽管如此,由于一些用户试图利用这些模型达到恶意目的,包括合成受控物质和传播虚假信息,人们的担忧与日俱增。为了降低此类风险,人们提出了“对齐”技术的概念。然而,最近的研究表明,这种对齐可以被复杂的提示工程或对抗性后缀破坏,这种技术被称为“越狱”。我们的研究从LLM类似人类的生成过程中获得启发。我们发现,虽然越狱提示可能产生与良性提示相似的输出logits,但它们在模型潜在空间中的初始嵌入往往更接近恶意提示的嵌入。利用这一发现,我们提出利用LLM的早期Transformer层输出来检测恶意输入,并立即终止生成。基于这一想法,我们提出了一种简单但重要的LLM防御方法,称为EEG-Defender。我们在三个模型上对十种越狱方法进行了全面的实验。结果表明,EEG-Defender能够大幅降低攻击成功率(ASR),降幅约为85%,而现有SOTA方法约为50%,同时对LLM的实用性和有效性影响极小。
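
One way to picture detection from early-layer outputs is a prototype vote, sketched below. This is an assumption-laden illustration and not EEG-Defender itself: the per-layer benign and harmful prototype vectors, the cosine-similarity rule, and the `min_votes` threshold are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def should_refuse(layer_embeddings, benign_protos, harmful_protos, min_votes):
    """Refuse when enough early layers place the prompt closer to the harmful prototype.

    layer_embeddings[i] is the prompt embedding at early layer i (hypothetical input);
    benign_protos[i] / harmful_protos[i] are reference vectors for that layer.
    """
    votes = 0
    for emb, benign, harmful in zip(layer_embeddings, benign_protos, harmful_protos):
        if cosine(emb, harmful) > cosine(emb, benign):
            votes += 1
    return votes >= min_votes


# Toy 2-D example with two early layers.
benign = [[0.0, 1.0], [0.0, 1.0]]
harmful = [[1.0, 0.0], [1.0, 0.0]]
print(should_refuse([[1.0, 0.0], [0.9, 0.1]], benign, harmful, min_votes=2))
```

Because the check only needs the first few layers, generation can be stopped before the remaining layers are computed.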

[NLP-36] RePair: Automated Program Repair with Process-based Feedback
[NLP-36] RePair:利用基于流程的反馈进行自动程序修复

链接: https://arxiv.org/abs/2408.11296
作者: Yuze Zhao,Zhenya Huang,Yixiao Ma,Rui Li,Kai Zhang,Hao Jiang,Qi Liu,Linbo Zhu,Yu Su
关键词-EN: Automated Program Repair, Automated Program, indispensability of Automated, bolstering program reliability, program reliability
关键词-ZH: 自动程序修复,自动程序,自动化的不可或缺,增强程序可靠性,程序可靠性
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: 15 pages, 13 figures

Abstract:The gap between the trepidation of program reliability and the expense of repairs underscores the indispensability of Automated Program Repair (APR). APR is instrumental in transforming vulnerable programs into more robust ones, bolstering program reliability while simultaneously diminishing the financial burden of manual repairs. Commercial-scale language models (LM) have taken APR to unprecedented levels. However, the emergence reveals that for models fewer than 100B parameters, making single-step modifications may be difficult to achieve the desired effect. Moreover, humans interact with the LM through explicit prompts, which hinders the LM from receiving feedback from compiler and test cases to automatically optimize its repair policies. In this literature, we explore how small-scale LM (less than 20B) achieve excellent performance through process supervision and feedback. We start by constructing a dataset named CodeNet4Repair, replete with multiple repair records, which supervises the fine-tuning of a foundational model. Building upon the encouraging outcomes of reinforcement learning, we develop a reward model that serves as a critic, providing feedback for the fine-tuned LM’s action, progressively optimizing its policy. During inference, we require the LM to generate solutions iteratively until the repair effect no longer improves or hits the maximum step limit. The results show that process-based not only outperforms larger outcome-based generation methods, but also nearly matches the performance of closed-source commercial large-scale LMs.
摘要:程序可靠性的恐慌与维修费用之间的差距凸显了程序自动修复(APR)的必要性。APR在将易受攻击的程序转变为更强大的程序方面发挥了重要作用,在增强程序可靠性的同时减少了手动维修的财政负担。商业规模的语言模型(LM)将APR带到了前所未有的水平。然而,这一现象表明,对于小于100B参数的模型,单步修改可能很难达到预期的效果。此外,人类通过显式提示与LM交互,这阻碍了LM从编译器和测试用例接收反馈以自动优化其修复策略。在这篇文献中,我们探索小规模LM(低于20B)如何通过过程监督和反馈实现出色的绩效。我们首先构建一个名为CodeNet4Repair的数据集,其中包含多个修复记录,该数据集监督基本模型的微调。在强化学习令人鼓舞的结果的基础上,我们开发了一个奖励模型,作为批评者,为微调的LM的行动提供反馈,逐步优化其政策。在推理过程中,我们要求LM迭代生成解,直到修复效果不再提高或达到最大步长限制。结果表明,基于过程的生成方法不仅性能优于更大的基于结果的生成方法,而且接近于闭源商业大规模LMS的性能。

[NLP-37] RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining
[NLP-37] RedWhale:通过高效持续预训练适配的韩语LLM

链接: https://arxiv.org/abs/2408.11294
作者: Anh-Dung Vo,Minseong Jung,Wonbeen Lee,Daewoo Choi
关键词-EN: Natural Language Processing, Large Language Models, Korean language processing, field of Natural, development of Large
关键词-ZH: 自然语言处理、大型语言模型、韩语处理、自然领域、大型开发
类目: Computation and Language (cs.CL)
备注:

Abstract:The field of Natural Language Processing (NLP) has seen significant advancements with the development of Large Language Models (LLMs). However, much of this research remains focused on English, often overlooking low-resource languages like Korean. This oversight presents challenges due to the unique non-alphabetic token structure of Korean and the substantial memory and computational demands required for LLM training, which frequently lead to memory constraints and out-of-memory errors. To address these issues, we present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline, a specialized tokenizer, an optimized model initialization technique, and a multistage pretraining strategy. These innovations collectively reduce training time and computational costs while maintaining high levels of accuracy and comprehension. By leveraging cross-lingual transfer learning, which exploits shared linguistic similarities across languages, RedWhale builds on English models to enhance Korean language processing. Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks, including the Korean Balanced Evaluation of Significant Tasks (KoBEST), showing superior understanding and generation of Korean text. Furthermore, RedWhale showed no signs of convergence even after pretraining on 9.7 billion tokens, indicating the potential for further improvements with additional training. This work represents a significant advancement in bridging the linguistic divide, particularly in enhancing NLP capabilities for the Korean language.
摘要:随着大型语言模型(LLM)的发展,自然语言处理(NLP)领域取得了长足的进步。然而,这类研究大多仍集中在英语上,往往忽略了韩语等低资源语言。由于韩语独特的非字母式token结构,以及LLM训练所需的大量内存和计算资源,这种忽视带来了挑战,经常导致内存受限和内存不足(out-of-memory)错误。为了解决这些问题,我们提出了RedWhale,一个专门为韩语处理定制的模型。RedWhale采用一种高效的持续预训练方法开发,该方法包括全面的韩语语料预处理流水线、专门的分词器、优化的模型初始化技术和多阶段预训练策略。这些创新共同减少了训练时间和计算成本,同时保持了较高的准确性和理解能力。通过利用跨语言迁移学习,即利用不同语言之间共有的语言相似性,RedWhale在英语模型的基础上增强韩语处理能力。实验结果表明,RedWhale在包括韩语重要任务平衡评估(KoBEST)在内的韩语NLP基准上优于其他领先模型,表现出对韩语文本更好的理解和生成能力。此外,即使在97亿个token上预训练之后,RedWhale仍未表现出收敛迹象,这表明通过额外训练还有进一步提升的空间。这项工作在弥合语言鸿沟方面取得了重要进展,特别是在增强韩语的NLP能力方面。

[NLP-38] Towards Analyzing and Mitigating Sycophancy in Large Vision-Language Models
[NLP-38] 面向大型视觉语言模型中谄媚行为的分析与缓解

链接: https://arxiv.org/abs/2408.11261
作者: Yunpu Zhao,Rui Zhang,Junbin Xiao,Changxin Ke,Ruibo Hou,Yifan Hao,Qi Guo,Yunji Chen
关键词-EN: Large Vision-Language Models, shown significant capability, Large Vision-Language, vision-language understanding, shown significant
关键词-ZH: 大型视觉语言模型,表现出显着的能力,大型视觉语言,视觉语言理解,表现出显着的
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is sycophancy, which means models are unduly influenced by leading or deceptive prompts, resulting in biased outputs and hallucinations. Despite the progress in LVLMs, evaluating and mitigating sycophancy is yet much under-explored. In this work, we fill this gap by systematically analyzing sycophancy on various VL benchmarks with curated leading queries and further proposing a text contrastive decoding method for mitigation. While the specific sycophantic behavior varies significantly among models, our analysis reveals the severe deficiency of all LVLMs in resilience of sycophancy across various tasks. For improvement, we propose Leading Query Contrastive Decoding (LQCD), a model-agnostic method focusing on calibrating the LVLMs’ over-reliance on leading cues by identifying and suppressing the probabilities of sycophancy tokens at the decoding stage. Extensive experiments show that LQCD effectively mitigate sycophancy, outperforming both prompt engineering methods and common methods for hallucination mitigation. We further demonstrate that LQCD does not hurt but even slightly improves LVLMs’ responses to neutral queries, suggesting it being a more effective strategy for general-purpose decoding but not limited to sycophancy.
摘要:大型视觉语言模型(LVLM)在视觉语言理解方面表现出了很强的能力。然而,这些模型中持续存在的一个关键问题是谄媚(sycophancy),即模型受到引导性或欺骗性提示的过度影响,导致有偏的输出和幻觉。尽管LVLM取得了进展,但对谄媚的评估和缓解仍未得到充分研究。在这项工作中,我们通过精心设计的引导性查询,在各种视觉语言(VL)基准上系统地分析谄媚行为,并进一步提出一种文本对比解码方法来缓解该问题,从而填补这一空白。虽然具体的谄媚行为在不同模型之间差异很大,但我们的分析揭示了所有LVLM在各类任务上抵御谄媚的能力都严重不足。为了改进,我们提出了引导查询对比解码(LQCD),这是一种与模型无关的方法,通过在解码阶段识别并抑制谄媚token的概率,来校准LVLM对引导线索的过度依赖。大量实验表明,LQCD能有效缓解谄媚,表现优于提示工程方法和常见的幻觉缓解方法。我们进一步证明,LQCD不仅不会损害、甚至还能略微改善LVLM对中性查询的响应,这表明它是一种更有效的通用解码策略,而不仅限于缓解谄媚。
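
Contrastive decoding of this kind can be sketched generically as below. The abstract does not give the LQCD formula, so this uses a common contrastive form as an assumption: tokens are scored by `(1 + alpha) * log p(token | neutral query) - alpha * log p(token | leading query)`, which penalizes tokens whose probability is inflated by the leading cue. The logits and `alpha` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def contrastive_next_token(neutral_logits, leading_logits, alpha=1.0):
    """Return (best_token_index, scores) under the assumed contrastive rule."""
    lp_neutral = log_softmax(neutral_logits)
    lp_leading = log_softmax(leading_logits)
    scores = [(1.0 + alpha) * n - alpha * l for n, l in zip(lp_neutral, lp_leading)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy vocabulary of two tokens: index 0 = "No", index 1 = "Yes".
# The leading query pushes the model towards "Yes"; the neutral query favors "No".
neutral = [2.0, 1.0]
leading = [1.0, 3.0]
best, _ = contrastive_next_token(neutral, leading, alpha=1.0)
print(best)
```

Setting `alpha=0` reduces the rule to ordinary greedy decoding on the neutral query, so `alpha` controls how strongly the leading-query distribution is contrasted away.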

[NLP-39] Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers
[NLP-39] 改进现代和现成语音识别器的语音识别错误预测

链接: https://arxiv.org/abs/2408.11258
作者: Prashant Serai,Peidong Wang,Eric Fosler-Lussier
关键词-EN: discriminative language modeling, errorful recognized speech, robustness of NLP, simulate errorful recognized, language modeling
关键词-ZH: 区分语言建模、错误识别语音、NLP鲁棒性、模拟错误识别、语言建模
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks like discriminative language modeling, improving robustness of NLP systems, where limited or even no audio data is available at train time. Previous work typically considered replicating behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic confusion based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model. Second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.
摘要:对语音识别器的错误进行建模,有助于从纯文本模拟出带错误的识别语音数据;这已被证明对判别式语言建模、提高NLP系统鲁棒性等任务很有用,尤其是在训练时只有有限音频数据甚至没有音频数据的情况下。以前的工作通常考虑复现基于GMM-HMM的系统的行为,但更现代的基于后验的神经网络声学模型的行为并不相同,需要对错误预测模型进行调整。在这项工作中,我们从两个方面扩展了一个已有的基于语音混淆的语音识别错误预测模型:首先,我们引入了一种基于采样的范式,它能更好地模拟基于后验的声学模型的行为。其次,为了在预测中引入上下文依赖,我们研究了用序列到序列模型替换混淆矩阵。我们通过两种方式评估错误预测器:首先预测Switchboard ASR系统在未见数据(Fisher)上产生的错误,然后使用同一预测器来估计一个不相关的基于云的ASR系统在新任务上的行为。在100次猜测的范式下,采样大大提高了预测准确率,而序列模型的表现与混淆矩阵相近。

[NLP-40] Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models
[NLP-40] 反事实作为评估自回归语言模型中归因方法可靠性的手段

链接: https://arxiv.org/abs/2408.11252
作者: Sepehr Kamahi,Yadollah Yaghoobzadeh
关键词-EN: explainability evaluation research, masked language models, widespread adoption, research has predominantly, predominantly focused
关键词-ZH: 可解释性评估研究、掩蔽语言模型、广泛采用、研究主要集中在
类目: Computation and Language (cs.CL)
备注: 17 pages, 6 figures

Abstract:Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models (MLMs). Evaluating the faithfulness of an explanation method – how accurately the method explains the inner workings and decision-making of the model – is very challenging because it is very hard to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove some input tokens considered important according to a particular attribution (feature importance) method and observe the change in the model’s output. This approach creates out-of-distribution inputs for causal language models (CLMs) due to their training objective of next token prediction. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language modeling scenarios. Our technique creates fluent and in-distribution counterfactuals that makes evaluation protocol more reliable. Code is available at this https URL
摘要:尽管自回归语言模型被广泛采用,但可解释性评价研究主要集中在跨度填充和掩蔽语言模型(MLM)。评估一种解释方法的真实性–该方法解释模型的内部工作和决策的精确度–是非常具有挑战性的,因为很难将模型与其解释分开。大多数忠诚度评估技术根据特定的属性(特征重要性)方法破坏或删除一些被认为重要的输入标记,并观察模型输出中的变化。这种方法为因果语言模型(CLM)创建非分布输入,因为它们的训练目标是下一个令牌预测。在这项研究中,我们提出了一种利用反事实生成来评估自回归语言建模情景下归因方法的忠实性的技术。我们的技术创建了流畅和分布内的反事实,使评估协议更可靠。代码可在此HTTPS URL获得

[NLP-41] Unboxing Occupational Bias: Grounded Debiasing LLMs with U.S. Labor Data
[NLP-41] 解除职业偏见:利用美国劳动力数据彻底消除LLM的偏见

链接: https://arxiv.org/abs/2408.11247
作者: Atmika Gorti,Manas Gaur,Aman Chadha
关键词-EN: Large Language Models, Large Language, potentially reinforcing harmful, reinforcing harmful stereotypes, harmful stereotypes related
关键词-ZH: 大型语言模型,大型语言,潜在地强化有害的,强化有害的刻板印象,有害的刻板印象相关
类目: Computation and Language (cs.CL)
备注: Accepted in AAAI Spring Symposium 2024

Abstract: Large Language Models (LLMs) are prone to inheriting and amplifying societal biases embedded within their training data, potentially reinforcing harmful stereotypes related to gender, occupation, and other sensitive categories. This issue becomes particularly problematic as biased LLMs can have far-reaching consequences, leading to unfair practices and exacerbating social inequalities across various domains, such as recruitment, online content moderation, or even the criminal justice system. Although prior research has focused on detecting bias in LLMs using specialized datasets designed to highlight intrinsic biases, there has been a notable lack of investigation into how these findings correlate with authoritative datasets, such as those from the U.S. National Bureau of Labor Statistics (NBLS). To address this gap, we conduct empirical research that evaluates LLMs in a “bias-out-of-the-box” setting, analyzing how the generated outputs compare with the distributions found in NBLS data. Furthermore, we propose a straightforward yet effective debiasing mechanism that directly incorporates NBLS instances to mitigate bias within LLMs. Our study spans seven different LLMs, including instructable, base, and mixture-of-expert models, and reveals significant levels of bias that are often overlooked by existing bias detection techniques. Importantly, our debiasing method, which does not rely on external datasets, demonstrates a substantial reduction in bias scores, highlighting the efficacy of our approach in creating fairer and more reliable LLMs.
摘要:大型语言模型(LLM)容易继承并放大其训练数据中蕴含的社会偏见,可能强化与性别、职业和其他敏感类别相关的有害刻板印象。这一问题尤其严重,因为有偏见的LLM可能产生深远的后果,导致不公平做法,并加剧招聘、在线内容审核乃至刑事司法系统等各个领域的社会不平等。尽管之前的研究侧重于使用专门设计用于凸显内在偏见的数据集来检测LLM中的偏见,但明显缺乏对这些发现与权威数据集(如美国国家劳工统计局(NBLS)的数据)之间相关性的研究。为了弥补这一差距,我们开展了实证研究,在“开箱即有偏见”(bias-out-of-the-box)的设置下评估LLM,分析其生成的输出与NBLS数据中的分布相比如何。此外,我们还提出了一种简单而有效的去偏机制,直接引入NBLS实例来减轻LLM中的偏见。我们的研究涵盖七个不同的LLM,包括指令型模型、基础模型和混合专家模型,并揭示了现有偏见检测技术经常忽略的显著偏见水平。重要的是,我们的去偏方法不依赖外部数据集,却能大幅降低偏见分数,突显了我们的方法在构建更公平、更可靠的LLM方面的有效性。

[NLP-42] A Little Confidence Goes a Long Way
[NLP-42] 一点信心大有帮助

链接: https://arxiv.org/abs/2408.11239
作者: John Scoville,Shang Gao,Devanshu Agrawal,Javed Qadrud-Din
关键词-EN: binary classification tasks, large language models, hidden state activations, introduce a group, group of related
关键词-ZH: 二进制分类任务、大型语言模型、隐藏状态激活、引入一组、一组相关的
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注: 13 pages, 2 figures

Abstract:We introduce a group of related methods for binary classification tasks using probes of the hidden state activations in large language models (LLMs). Performance is on par with the largest and most advanced LLMs currently available, but requiring orders of magnitude fewer computational resources and not requiring labeled data. This approach involves translating class labels into a semantically rich description, spontaneous symmetry breaking of multilayer perceptron probes for unsupervised learning and inference, training probes to generate confidence scores (prior probabilities) from hidden state activations subject to known constraints via entropy maximization, and selecting the most confident probe model from an ensemble for prediction. These techniques are evaluated on four datasets using five base LLMs.
摘要:我们提出了一组相关方法,利用对大型语言模型(LLM)隐藏状态激活的探针来完成二分类任务。其性能与目前可用的最大、最先进的LLM相当,但所需的计算资源少若干个数量级,并且不需要标注数据。该方法包括:将类别标签转换为语义丰富的描述;利用多层感知机探针的自发对称性破缺进行无监督学习和推理;通过熵最大化,训练探针在已知约束下从隐藏状态激活中生成置信度分数(先验概率);以及从集成中选择置信度最高的探针模型进行预测。这些技术使用五个基础LLM在四个数据集上进行了评估。

[NLP-43] Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification
[NLP-43] 多模式文档分类中注意力头掩蔽的分布外检测

链接: https://arxiv.org/abs/2408.11237
作者: Christos Constantinou,Georgios Ioannides,Aman Chadha,Aaron Elkins,Edwin Simpson
关键词-EN: machine learning applications, model overconfidence, crucial in machine, machine learning, learning applications
关键词-ZH: 机器学习应用、模型过度自信、对机器、机器学习、学习应用至关重要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The majority of existing OOD detection methods predominantly address uni-modal inputs, such as images or texts. In the context of multi-modal documents, there is a notable lack of extensive research on the performance of these methods, which have primarily been developed with a focus on computer vision tasks. We propose a novel methodology termed as attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and significantly decreases the false positive rate (FPR) compared to existing solutions up to 7.5%. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.
摘要:在机器学习应用中,检测分布外(OOD)数据对于降低模型过度自信的风险、进而提高已部署系统的可靠性和安全性至关重要。现有的大多数OOD检测方法主要针对单模态输入,例如图像或文本。对于多模态文档,目前明显缺乏对这些方法性能的广泛研究,而这些方法主要是围绕计算机视觉任务开发的。针对文档分类系统中的多模态OOD任务,我们提出了一种称为注意力头掩蔽(AHM)的新方法。实验结果表明,所提出的AHM方法优于所有最先进的方法,并且与现有方案相比显著降低了假阳性率(FPR),降幅最高达7.5%。这种方法能很好地推广到文档等多模态数据,其中视觉和文本信息在同一Transformer架构下建模。为了解决高质量公开文档数据集稀缺的问题,并鼓励对文档OOD检测的进一步研究,我们引入了一个新的文档AI数据集FinanceDocs。我们的代码和数据集均已公开。
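
The false positive rate mentioned above can be made concrete with a small sketch. It assumes the common OOD convention that in-distribution (ID) samples are the positive class and that a higher score means "more in-distribution"; the target true positive rate and the scores are hypothetical, and the abstract does not state the exact operating point used in the paper.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, target_tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that keeps
    at least `target_tpr` of the ID samples."""
    ranked_id = sorted(id_scores, reverse=True)
    # Number of ID samples that must score at or above the threshold.
    keep = max(1, math.ceil(target_tpr * len(ranked_id)))
    threshold = ranked_id[keep - 1]
    false_positives = sum(1 for score in ood_scores if score >= threshold)
    return false_positives / len(ood_scores)


# Toy scores: one OOD document looks more in-distribution than half of the ID documents.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.85, 0.3, 0.2, 0.1]
print(fpr_at_tpr(id_scores, ood_scores, target_tpr=0.5))
```

A lower value means fewer OOD documents slip through at the same ID acceptance rate, which is the direction of improvement the abstract reports.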

[NLP-44] CoDi: Conversational Distillation for Grounded Question Answering

Link: https://arxiv.org/abs/2408.11219
Authors: Patrick Huber, Arash Einolghozati, Rylan Conway, Kanika Narang, Matt Smith, Waqar Nayyar, Adithya Sagar, Ahmed Aly, Akshat Shrivastava
Keywords: Distilling conversational skills, Small Language Models, billion parameters presents, parameters presents significant, Small Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced “Cody”), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to “memorize” world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.

[NLP-45] Reading with Intent

Link: https://arxiv.org/abs/2408.11189
Authors: Benjamin Reichman, Kartik Talamadupula, Toshish Jawale, Larry Heck
Keywords: integrating external information, external information sources, Retrieval augmented generation, RAG systems, open internet
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Retrieval augmented generation (RAG) systems augment how knowledgeable language models are by integrating external information sources such as Wikipedia, internal documents, scientific papers, or the open internet. RAG systems that rely on the open internet as their knowledge source have to contend with the complexities of human-generated content. Human communication extends much deeper than just the words rendered as text. Intent, tonality, and connotation can all change the meaning of what is being conveyed. Recent real-world deployments of RAG systems have shown some difficulty in understanding these nuances of human communication. One significant challenge for these systems lies in processing sarcasm. Though the Large Language Models (LLMs) that make up the backbone of these RAG systems are able to detect sarcasm, they currently do not always use these detections for the subsequent processing of text. To address these issues, in this paper, we synthetically generate sarcastic passages from the Natural Questions Wikipedia retrieval corpus. We then test the impact of these passages on the performance of both the retriever and reader portion of the RAG pipeline. We introduce a prompting system designed to enhance the model’s ability to interpret and generate responses in the presence of sarcasm, thus improving overall system performance. Finally, we conduct ablation studies to validate the effectiveness of our approach, demonstrating improvements in handling sarcastic content within RAG systems.

[NLP-46] Combining Objective and Subjective Perspectives for Political News Understanding

Link: https://arxiv.org/abs/2408.11174
Authors: Evan Dufraisse, Adrian Popescu, Julien Tourille, Armelle Brun, Olivier Hamon
Keywords: computational politics rely, Researchers and practitioners, automatic content analysis, content analysis tools, practitioners interested
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Abstract:Researchers and practitioners interested in computational politics rely on automatic content analysis tools to make sense of the large amount of political texts available on the Web. Such tools should provide objective and subjective aspects at different granularity levels to make the analyses useful in practice. Existing methods produce interesting insights for objective aspects, but are limited for subjective ones, are often limited to national contexts, and have limited explainability. We introduce a text analysis framework which integrates both perspectives and provides a fine-grained processing of subjective aspects. Information retrieval techniques and knowledge bases complement powerful natural language processing components to allow a flexible aggregation of results at different granularity levels. Importantly, the proposed bottom-up approach facilitates the explainability of the obtained results. We illustrate its functioning with insights on news outlets, political orientations, topics, individual entities, and demographic segments. The approach is instantiated on a large corpus of French news, but is designed to work seamlessly for other languages and countries.

[NLP-47] SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

Link: https://arxiv.org/abs/2408.11172
Authors: Xueliang Zhao, Lin Zheng, Haige Bo, Changran Hu, Urmish Thakker, Lingpeng Kong
Keywords: Formal theorem proving, large language models, theorem proving, Formal theorem, computer science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Abstract:Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs’ capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment’s advantages in subgoal-based proofs, SubgoalXL achieves a new state-of-the-art performance of 56.1% in Isabelle on the standard miniF2F dataset, marking an absolute improvement of 4.9%. Notably, SubgoalXL successfully solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F. These results underscore the effectiveness of maximizing limited data utility and employing targeted guidance for complex reasoning in formal theorem proving, contributing to the ongoing advancement of AI reasoning capabilities. The implementation is available at this https URL.

[NLP-48] Public Health in Disaster: Emotional Health and Life Incidents Extraction during Hurricane Harvey

Link: https://arxiv.org/abs/2408.11133
Authors: Thomas Hoang, Quynh Anh Nguyen, Long Nguyen
Keywords: causing severe damage, Countless disasters, climate change, causing severe, resulted from climate
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Countless disasters have resulted from climate change, causing severe damage to infrastructure and the economy. These disasters have significant societal impacts, necessitating mental health services for the millions affected. To prepare for and respond effectively to such events, it is important to understand people’s emotions and the life incidents they experience before and after a disaster strikes. In this case study, we collected a dataset of approximately 400,000 public tweets related to the storm. Using a BERT-based model, we predicted the emotions associated with each tweet. To efficiently identify these topics, we utilized the Latent Dirichlet Allocation (LDA) technique for topic modeling, which allowed us to bypass manual content analysis and extract meaningful patterns from the data. However, rather than stopping at topic identification like previous methods \cite{math11244910}, we further refined our analysis by integrating Graph Neural Networks (GNN) and Large Language Models (LLM). The GNN was employed to generate embeddings and construct a similarity graph of the tweets, which was then used to optimize clustering. Subsequently, we used an LLM to automatically generate descriptive names for each event cluster, offering critical insights for disaster preparedness and response strategies.

[NLP-49] DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

Link: https://arxiv.org/abs/2408.11121
Authors: Tom Segal, Asaf Shabtai, Yuval Elovici
Keywords: depends heavily, quality and quantity, large language models, large language, LLMs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 11 pages, 3 figures

Abstract:The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a “min-bounded” average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
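
The "min-bounded" aggregation can be illustrated with a minimal sketch. It assumes two next-token distributions over the same vocabulary, uses the element-wise harmonic mean (which the abstract names only as one example of such a function), and renormalizes afterwards, which is an assumption of this sketch and not a detail taken from the paper. The function names `harmonic_mean` and `min_bounded_aggregate` and the toy probabilities are hypothetical.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers; 0 if either is 0."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def min_bounded_aggregate(p: list[float], q: list[float]) -> list[float]:
    """Combine two distributions token by token, then renormalize.

    Assumption: renormalizing the element-wise harmonic means is one simple
    way to obtain a valid distribution; the paper's procedure may differ.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share a vocabulary")
    raw = [harmonic_mean(a, b) for a, b in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("the two distributions have disjoint support")
    return [r / total for r in raw]


# Toy example: token 2 is likely only under model A.
model_a = [0.10, 0.10, 0.80]
model_b = [0.45, 0.45, 0.10]
combined = min_bounded_aggregate(model_a, model_b)
```

Because the harmonic mean never exceeds twice the smaller of its two inputs, a token that only one model considers likely cannot dominate the combined distribution before renormalization.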

[NLP-50] Mistral-SPLADE: LLMs for better Learned Sparse Retrieval

Link: https://arxiv.org/abs/2408.11119
Authors: Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Vignesh P, Jaydeep Sen
Keywords: embedding-based dense retrievers, Learned Sparse Retrievers, traditional keyword-based sparse, keyword-based sparse retrievers, Sparse Retrievers
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Learned Sparse Retrievers (LSR) have evolved into an effective retrieval strategy that can bridge the gap between traditional keyword-based sparse retrievers and embedding-based dense retrievers. At its core, learned sparse retrievers try to learn the most important semantic keyword expansions from a query and/or document which can facilitate better retrieval with overlapping keyword expansions. LSR like SPLADE has typically been using encoder only models with MLM (masked language modeling) style objective in conjunction with known ways of retrieval performance improvement such as hard negative mining, distillation, etc. In this work, we propose to use decoder-only model for learning semantic keyword expansion. We posit, decoder only models that have seen much higher magnitudes of data are better equipped to learn keyword expansions needed for improved retrieval. We use Mistral as the backbone to develop our Learned Sparse Retriever similar to SPLADE and train it on a subset of sentence-transformer data which is often used for training text embedding models. Our experiments support the hypothesis that a sparse retrieval model based on decoder only large language model (LLM) surpasses the performance of existing LSR systems, including SPLADE and all its variants. The LLM based model (Echo-Mistral-SPLADE) now stands as a state-of-the-art learned sparse retrieval model on the BEIR text retrieval benchmark.
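
The keyword-expansion idea can be made concrete with the term-weighting rule from the original SPLADE line of work, which this abstract builds on but does not spell out: each vocabulary term receives the maximum over input tokens of log(1 + ReLU(logit)). The sketch below assumes hand-written toy logits in place of a real language-model head; the function names and numbers are hypothetical, and the Mistral-based model described above may differ in its details.

```python
import math


def splade_weights(token_logits: list[list[float]]) -> dict[int, float]:
    """Max-pool log(1 + ReLU(logit)) over input tokens, keeping non-zero terms.

    token_logits[i][j] is the logit input token i assigns to vocabulary term j.
    """
    vocab_size = len(token_logits[0])
    weights = {}
    for term in range(vocab_size):
        best = max(math.log1p(max(row[term], 0.0)) for row in token_logits)
        if best > 0.0:
            weights[term] = best
    return weights


def sparse_dot(query: dict[int, float], document: dict[int, float]) -> float:
    """Relevance score: dot product over the terms both sparse vectors share."""
    return sum(w * document[t] for t, w in query.items() if t in document)


# Toy logits over a 4-term vocabulary (made up for illustration).
query_vec = splade_weights([[2.0, -1.0, 0.0, 0.5]])
doc_vec = splade_weights([[1.0, -3.0, 0.0, -0.2], [0.0, 0.0, 4.0, 0.1]])
score = sparse_dot(query_vec, doc_vec)
```

Terms with non-positive logits drop out entirely, which is what keeps the resulting vectors sparse enough for an inverted index.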

[NLP-51] What can Large Language Models Capture about Code Functional Equivalence?

Link: https://arxiv.org/abs/2408.11081
Authors: Nickil Maveli, Antonio Vergari, Shay B. Cohen
Keywords: shown great progress, learning rich representations, large code corpora, classify code fragments, pre-trained on large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 37 pages

Abstract:Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
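
To illustrate the distinction between semantics-preserving and semantics-altering transformations, here is a small sketch. The three functions and the helper `agree_on` are hypothetical examples, not transformations taken from SeqCoBench, and comparing outputs on a handful of inputs gives only evidence of equivalence, not a proof.

```python
def original(xs):
    """Sum a list with an explicit loop."""
    total = 0
    for x in xs:
        total += x
    return total


def preserved(values):
    """Semantics-preserving rewrite: renamed variable, built-in sum."""
    return sum(values)


def altered(xs):
    """Semantics-altering rewrite: silently skips the first element."""
    return sum(xs[1:])


def agree_on(f, g, inputs):
    """True if f and g return equal results on every test input."""
    return all(f(list(i)) == g(list(i)) for i in inputs)


test_inputs = [[], [0], [1, 2, 3], [-5, 5]]
same = agree_on(original, preserved, test_inputs)
different = not agree_on(original, altered, test_inputs)
```

Note that `altered` agrees with `original` on `[]` and `[0]`, so a weak test set would wrongly label the pair as equivalent; surface similarity and shallow checks are both poor proxies for functional equivalence.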

[NLP-52] Tabular Transfer Learning via Prompting LLMs

Link: https://arxiv.org/abs/2408.11063
Authors: Jaehyun Nam, Woomin Song, Seong Hyeon Park, Jihoon Tack, Sukmin Yun, Jaehyung Kim, Kyu Hwan Oh, Jinwoo Shin
Keywords: transfer learning, tabular transfer learning, Learning, transfer, obtain annotations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: COLM 2024

Abstract:Learning with a limited number of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it suggests to learn a transferable knowledge by training a neural network from multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and successful in the literature, compared to other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, under the scenarios where the source and target datasets are of different format. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important, yet underexplored tabular transfer learning problem. Code is available at this https URL.
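
The pseudo-demonstration step can be sketched as plain prompt construction. This sketch assumes the correlated source column has already been chosen and is passed in as `pseudo_label`; how P2T selects that column is not reproduced here. The column names, rows, serialization format, and function names are all hypothetical.

```python
def serialize_row(row: dict, skip: str) -> str:
    """Turn a table row into text, leaving out the column named by `skip`."""
    return ", ".join(f"{col} is {val}" for col, val in row.items() if col != skip)


def build_prompt(source_rows, pseudo_label, target_row, target_label):
    """Pseudo-demonstrations from source rows, followed by the target query."""
    lines = [
        f"{serialize_row(row, pseudo_label)}. {pseudo_label}: {row[pseudo_label]}"
        for row in source_rows
    ]
    lines.append(f"{serialize_row(target_row, target_label)}. {target_label}:")
    return "\n".join(lines)


# Unlabeled source table; "glucose" stands in for the column judged to be
# strongly related to the target label "diabetes" (an assumption).
source = [
    {"age": 50, "bmi": 33.6, "glucose": 148},
    {"age": 31, "bmi": 26.6, "glucose": 85},
]
target = {"age": 45, "blood_pressure": 72}
prompt = build_prompt(source, "glucose", target, "diabetes")
```

Because every row is serialized to text, the source and target tables do not need to share columns, which is the heterogeneity problem the abstract points to.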

[NLP-53] Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Link: https://arxiv.org/abs/2408.11062
Authors: Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, Wen Zhao
Keywords: large language models, powerful reasoning capabilities, study explores, parsing by leveraging, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures

Abstract:This study explores text-to-SQL parsing by leveraging the powerful reasoning capabilities of large language models (LLMs). Despite recent advancements, existing LLM-based methods have not adequately addressed scalability, leading to inefficiencies when processing wide tables. Furthermore, current interaction-based approaches either lack a step-by-step, interpretable SQL generation process or fail to provide an efficient and universally applicable interaction design. To address these challenges, we introduce Interactive-T2S, a framework that generates SQL queries through direct interactions with databases. This framework includes four general tools that facilitate proactive and efficient information retrieval by the LLM. Additionally, we have developed detailed exemplars to demonstrate the step-wise reasoning processes within our framework. Our experiments on the BIRD-Dev dataset, employing a setting without oracle knowledge, reveal that our method achieves state-of-the-art results with only two exemplars, underscoring the effectiveness and robustness of our framework.

[NLP-54] StructuredRAG: JSON Response Formatting with Large Language Models

Link: https://arxiv.org/abs/2408.11061
Authors: Connor Shorten, Charles Pierse, Thomas Benjamin Smith, Erika Cardenas, Akanksha Sharma, John Trengrove, Bob van Luijt
Keywords: Large Language Models, Compound AI Systems, Large Language, ability of Large, Language Models
Categories: Computation and Language (cs.CL)
Comments: Preprint. 10 pages, 6 figures

Abstract:The ability of Large Language Models (LLMs) to generate structured outputs, such as JSON, is crucial for their use in Compound AI Systems. However, evaluating and improving this capability remains challenging. In this work, we introduce StructuredRAG, a benchmark of six tasks designed to assess LLMs’ proficiency in following response format instructions. We evaluate two state-of-the-art LLMs, Gemini 1.5 Pro and Llama 3 8B-instruct with 4-bit quantization using two distinct prompting strategies. We introduce these prompting strategies as f-String and Follow the Format (FF) prompting. Across 24 experiments, we find an average success rate of 82.55%. We further find a high variance in performance across tasks, models, and prompting strategies with success rates ranging from 0 to 100%. We find that Llama 3 8B-instruct often performs competitively with Gemini 1.5 Pro. We observe that task complexity significantly influences performance, with tasks involving lists or composite object outputs proving more challenging. Our findings highlight the need for further research into improving the reliability and consistency of structured output generation in LLMs. We have open-sourced our experimental code and results at this http URL.
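
A response-format check of the kind this benchmark relies on can be sketched with the standard `json` module. The exact success criterion used by StructuredRAG is not given in the abstract, so the rule below (the response must parse as a JSON object with exactly the expected keys and value types) is an assumption, and the function names and sample responses are hypothetical.

```python
import json


def follows_format(response: str, expected: dict) -> bool:
    """True if `response` is a JSON object with exactly the expected keys and types."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(expected):
        return False
    return all(isinstance(parsed[key], kind) for key, kind in expected.items())


def success_rate(responses: list[str], expected: dict) -> float:
    """Fraction of responses that follow the expected format."""
    return sum(follows_format(r, expected) for r in responses) / len(responses)


schema = {"answer": str, "sources": list}
samples = [
    '{"answer": "Paris", "sources": ["doc1"]}',          # valid
    'Sure! {"answer": "Paris", "sources": ["doc1"]}',    # extra prose, not JSON
    '{"answer": "Paris"}',                                # missing key
    '{"answer": "Paris", "sources": "doc1"}',            # wrong type
]
rate = success_rate(samples, schema)
```

A strict check like this treats any surrounding prose as a failure, which is one reason list and composite outputs can score lower than a reader skimming the responses might expect.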

[NLP-55] LLM Agents Improve Semantic Code Search

Link: https://arxiv.org/abs/2408.11058
Authors: Sarthak Jain (University of Illinois Urbana Champaign and Cisco), Aditya Dora (University of Illinois Urbana Champaign), Ka Seng Sam (University of Illinois Urbana Champaign), Prabhat Singh (Cisco)
Keywords: solutions to problems, key task, developing solutions, Retrieval Augmented Generation, Augmented Generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 1 figure

Abstract:Code Search is a key task that many programmers often have to perform while developing solutions to problems. Current methodologies suffer from an inability to perform accurately on prompts that contain some ambiguity or ones that require additional context relative to a code-base. We introduce the approach of using Retrieval Augmented Generation (RAG) powered agents to inject information into user prompts allowing for better inputs into embedding models. By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned. Additionally, we introduce a multi-stream ensemble approach which, when paired with an agentic workflow, can obtain improved retrieval accuracy, and which we deploy in an application called this http URL. Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods, achieving a 78.2% success rate at Success@10 and a 34.6% success rate at Success@1. This research presents a substantial advancement in semantic code search, highlighting the potential of agentic LLMs and RAG to enhance code retrieval systems.
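
For readers unfamiliar with the metric, Success@k is the fraction of queries whose correct item appears among the top k retrieved results. The sketch below assumes one correct snippet per query; the function name and the toy rankings are hypothetical and unrelated to the paper's data.

```python
def success_at_k(ranked_results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant item appears in the top-k results."""
    if len(ranked_results) != len(relevant):
        raise ValueError("need one relevant item per query")
    hits = sum(
        1 for results, target in zip(ranked_results, relevant) if target in results[:k]
    )
    return hits / len(relevant)


# Three toy queries with their ranked candidate snippets.
rankings = [
    ["f_sort", "f_merge", "f_heap"],   # correct item ranked first
    ["f_open", "f_read", "f_parse"],   # correct item ranked third
    ["f_http", "f_retry", "f_auth"],   # correct item missing
]
gold = ["f_sort", "f_parse", "f_cache"]
s_at_1 = success_at_k(rankings, gold, 1)
s_at_3 = success_at_k(rankings, gold, 3)
```

Success@1 rewards only exact top-rank hits, which is why it is typically much lower than Success@10 for the same system.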

[NLP-56] DSP-MLIR: A MLIR Dialect for Digital Signal Processing

Link: https://arxiv.org/abs/2408.11205
Authors: Abhinav Kumar, Atharva Khedkar, Aviral Shrivastava
Keywords: Traditional Digital Signal, Digital Signal Processing, Traditional Digital, Signal Processing, Digital Signal
Categories: Signal Processing (eess.SP); Computation and Language (cs.CL)

Abstract:Traditional Digital Signal Processing (DSP) compilers work at a low level (C level / assembly level) and hence lose many of the optimization opportunities present at the high level (domain level). The emerging multi-level compiler infrastructure MLIR (Multi-Level Intermediate Representation) allows optimizations to be specified at a higher level. In this paper, we utilize the MLIR framework to introduce a DSP dialect, perform domain-specific optimizations at the dialect level (high level), and show the usefulness of these optimizations on sample DSP apps. In particular, we develop a compiler for DSP and a DSL (Domain Specific Language) to ease the development of apps. We show a performance improvement in execution time for these sample apps of up to 10x, which would have been difficult if the IR were at the C/affine level.

[NLP-57] Statistical Patterns in the Equations of Physics and the Emergence of a Meta-Law of Nature

Link: https://arxiv.org/abs/2408.11065
Authors: Andrei Constantin, Deaglan Bartlett, Harry Desmond, Pedro G. Ferreira
Keywords: fundamental science, aims to understand, mathematical equations, Nature, equations
Categories: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th); Data Analysis, Statistics and Probability (physics.data-an); History and Philosophy of Physics (physics.hist-ph)
Comments: 9 pages, 5 figures

Abstract:Physics, as a fundamental science, aims to understand the laws of Nature and describe them in mathematical equations. While the physical reality manifests itself in a wide range of phenomena with varying levels of complexity, the equations that describe them display certain statistical regularities and patterns, which we begin to explore here. By drawing inspiration from linguistics, where Zipf’s law states that the frequency of any word in a large corpus of text is roughly inversely proportional to its rank in the frequency table, we investigate whether similar patterns for the distribution of operators emerge in the equations of physics. We analyse three corpora of formulae and find, using sophisticated implicit-likelihood methods, that the frequency of operators as a function of their rank in the frequency table is best described by an exponential law with a stable exponent, in contrast with Zipf’s inverse power-law. Understanding the underlying reasons behind this statistical pattern may shed light on Nature’s modus operandi or reveal recurrent patterns in physicists’ attempts to formalise the laws of Nature. It may also provide crucial input for symbolic regression, potentially augmenting language models to generate symbolic models for physical phenomena. By pioneering the study of statistical regularities in the equations of physics, our results open the door for a meta-law of Nature, a (probabilistic) law that all physical laws obey.
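
The contrast between a Zipf-style power law and an exponential law can be sketched by fitting both to rank-frequency data in log space. This sketch substitutes ordinary least squares for the implicit-likelihood methods the abstract mentions, and the counts are synthetic values generated from an exact exponential law, not data from the paper. All function names are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Least squares for y = intercept + slope * x; returns (intercept, slope, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse


def compare_rank_laws(frequencies):
    """Fit log f against log(rank) (power law) and against rank (exponential law)."""
    ranks = list(range(1, len(frequencies) + 1))
    log_f = [math.log(f) for f in frequencies]
    _, _, sse_power = linear_fit([math.log(r) for r in ranks], log_f)
    _, _, sse_exp = linear_fit(ranks, log_f)
    return {"power_law_sse": sse_power, "exponential_sse": sse_exp}


# Synthetic frequencies (not from the paper): f(r) = 1000 * exp(-0.5 * r).
synthetic = [1000.0 * math.exp(-0.5 * r) for r in range(1, 21)]
result = compare_rank_laws(synthetic)
```

A power law is a straight line in log-log coordinates, while an exponential law is a straight line when log frequency is plotted against rank itself, so the smaller residual indicates which description fits better.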

Artificial Intelligence

[AI-0] Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

Link: https://arxiv.org/abs/2408.11816
Authors: Anthony GX-Chen, Kenneth Marino, Rob Fergus
Keywords: difficult exploration problems, describing a set, face of difficult, difficult exploration, study whether giving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:In the face of difficult exploration problems in reinforcement learning, we study whether giving an agent an object-centric mapping (describing a set of items and their attributes) allows for more efficient learning. We found this problem is best solved hierarchically by modelling items at a higher level of state abstraction to pixels, and attribute change at a higher level of temporal abstraction to primitive actions. This abstraction simplifies the transition dynamic by making specific future states easier to predict. We make use of this to propose a fully model-based algorithm that learns a discriminative world model, plans to explore efficiently with only a count-based intrinsic reward, and can subsequently plan to reach any discovered (abstract) states. We demonstrate the model’s ability to (i) efficiently solve single tasks, (ii) transfer zero-shot and few-shot across item types and environments, and (iii) plan across long horizons. Across a suite of 2D crafting and MiniHack environments, we empirically show our model significantly out-performs state-of-the-art low-level methods (without abstraction), as well as performant model-free and model-based methods using the same abstraction. Finally, we show how to reinforce learn low level object-perturbing policies, as well as supervise learn the object mapping itself.
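
A count-based intrinsic reward can be sketched in a few lines. The abstract does not give the exact formula, so the common 1 / sqrt(N) bonus used here is an assumption, as is the representation of an abstract state as a frozen set of (item, attribute) pairs. The class name `CountBonus` is hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Intrinsic reward that shrinks as an abstract state is visited more often."""

    def __init__(self) -> None:
        self.visits = Counter()

    def reward(self, state) -> float:
        """Record one visit to `state` and return 1 / sqrt(visit count)."""
        self.visits[state] += 1
        return 1.0 / math.sqrt(self.visits[state])


bonus = CountBonus()
has_wood = frozenset({("wood", "in_inventory")})
has_plank = frozenset({("plank", "in_inventory")})

first = bonus.reward(has_wood)                        # first visit
repeats = [bonus.reward(has_wood) for _ in range(3)]  # visits 2, 3 and 4
novel = bonus.reward(has_plank)                       # unseen state
```

Counting over abstract states instead of raw pixels means the bonus decays only when the agent revisits the same set of item attributes, which pushes planning toward genuinely new object configurations.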

[AI-1] Great Memory Shallow Reasoning: Limits of kNN-LMs

Link: https://arxiv.org/abs/2408.11815
Authors: Shangyi Geng,Wenting Zhao,Alexander M Rush
Keywords: downstream NLP benchmarks, nearest neighbor language, neighbor language models, NLP benchmarks, demonstrated strong performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: K-nearest neighbor language models (kNN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to argue that models trained on poor quality or outdated data could perform well by employing a kNN extension that has access to a higher-quality datastore. In this work, we ask whether this improved ability to recall information really translates into downstream abilities. We extensively evaluate kNN-LMs on a diverse set of tasks, ranging from sentiment classification and commonsense reasoning to multi-hop reasoning. Results show that kNN-LMs excel at memory-intensive tasks, where utilizing the patterns in the input is sufficient for determining the output, but struggle with reasoning tasks that require integrating multiple pieces of information to derive new knowledge. We further demonstrate through oracle experiments and qualitative analysis that even with perfect retrieval, kNN-LMs still fail to determine the correct answers, placing an upper bound on their reasoning performance. Code and datastores are released at this https URL.
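
The abstract describes kNN-LMs as integrating retrieval with next-word prediction without spelling out the mechanics. A minimal sketch of the usual formulation follows, assuming the retrieved neighbors arrive as (next_token, distance) pairs and that the final distribution is a linear interpolation of the retrieval distribution and the base LM distribution. The function names, the `temperature` parameter, and the default `lam` are illustrative assumptions, not values from the paper.

```python
from math import exp


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (next_token, distance) pairs into a distribution.

    Each neighbor gets weight exp(-distance / temperature); weights of
    neighbors that share the same next token are summed, then normalized.
    """
    d_min = min(d for _, d in neighbors)  # shift distances for numerical stability
    weights = {}
    for token, dist in neighbors:
        w = exp(-(dist - d_min) / temperature)
        weights[token] = weights.get(token, 0.0) + w
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }
```

Because the retrieval term can only add probability to tokens that literally appear after similar contexts in the datastore, this mixing helps recall; that is consistent with the abstract's finding that even perfect retrieval does not by itself supply multi-step reasoning.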

[AI-2] Approaching Deep Learning through the Spectral Dynamics of Weights

Link: https://arxiv.org/abs/2408.11804
Authors: David Yunis,Kumar Kshitij Patel,Samuel Wheeler,Pedro Savarese,Gal Vardi,Karen Livescu,Michael Maire,Matthew R. Walter
Keywords: empirical approach centered, deep learning, propose an empirical, empirical approach, approach centered
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose an empirical approach centered on the spectral dynamics of weights – the behavior of singular values and vectors during optimization – to unify and clarify several phenomena in deep learning. We identify a consistent bias in optimization across various experiments, from small-scale "grokking" to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers. We also demonstrate that weight decay enhances this bias beyond its role as a norm regularizer, even in practical systems. Moreover, we show that these spectral dynamics distinguish memorizing networks from generalizing ones, offering a novel perspective on this longstanding conundrum. Additionally, we leverage spectral dynamics to explore the emergence of well-performing sparse subnetworks (lottery tickets) and the structure of the loss surface through linear mode connectivity. Our findings suggest that spectral dynamics provide a coherent framework to better understand the behavior of neural networks across diverse settings.

[AI-3] LLM Pruning and Distillation in Practice: The Minitron Approach

Link: https://arxiv.org/abs/2408.11796
Authors: Sharath Turuvekere Sreenivas,Saurav Muralidharan,Raviraj Joshi,Marcin Chochowski,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Jan Kautz,Pavlo Molchanov
Keywords: present a comprehensive, comprehensive report, report on compressing, Evaluation Harness, Mistral NeMo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.

[AI-4] Leveraging Chemistry Foundation Models to Facilitate Structure Focused Retrieval Augmented Generation in Multi-Agent Workflows for Catalyst and Materials Design

Link: https://arxiv.org/abs/2408.11793
Authors: Nathaniel H. Park,Tiffany J. Callahan,James L. Hedrick,Tim Erdmann,Sara Capponi
Keywords: Molecular property prediction, Molecular property, complex research tasks, subject of intense, potential to accelerate
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Molecular property prediction and generative design via deep learning models has been the subject of intense research given its potential to accelerate development of new, high-performance materials. More recently, these workflows have been significantly augmented with the advent of large language models (LLMs) and systems of LLM-driven agents capable of utilizing pre-trained models to make predictions in the context of more complex research tasks. While effective, there is still room for substantial improvement within the agentic systems on the retrieval of salient information for material design tasks. Moreover, alternative uses of predictive deep learning models, such as leveraging their latent representations to facilitate cross-modal retrieval augmented generation within agentic systems to enable task-specific materials design, has remained unexplored. Herein, we demonstrate that large, pre-trained chemistry foundation models can serve as a basis for enabling semantic chemistry information retrieval for both small-molecules, complex polymeric materials, and reactions. Additionally, we show the use of chemistry foundation models in conjunction with image models such as OpenCLIP facilitate unprecedented queries and information retrieval across multiple characterization data domains. Finally, we demonstrate the integration of these systems within multi-agent systems to facilitate structure and topological-based natural language queries and information retrieval for complex research tasks.

[AI-5] DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

Link: https://arxiv.org/abs/2408.11788
Authors: Zhifei Xie,Daniel Tang,Dingwei Tan,Jacques Klein,Tegawendé F. Bissyandé,Saad Ezzini
Keywords: Current video generation, generation models excel, realistic clips, Key Frames Iteration, Frames Iteration Design
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 13 pages, 8 figures

Abstract:Current video generation models excel at creating short, realistic clips, but struggle with longer, multi-scene videos. We introduce DreamFactory, an LLM-based framework that tackles this challenge. DreamFactory leverages multi-agent collaboration principles and a Key Frames Iteration Design Method to ensure consistency and style across long videos. It utilizes Chain of Thought (COT) to address uncertainties inherent in large language models. DreamFactory generates long, stylistically coherent, and complex videos. Evaluating these long-form videos presents a challenge. We propose novel metrics such as Cross-Scene Face Distance Score and Cross-Scene Style Consistency Score. To further research in this area, we contribute the Multi-Scene Videos Dataset containing over 150 human-rated videos.

[AI-6] Timeline and Boundary Guided Diffusion Network for Video Shadow Detection ACM-MM2024

Link: https://arxiv.org/abs/2408.11785
Authors: Haipeng Zhou,Honqiu Wang,Tian Ye,Zhaohu Xing,Jun Ma,Ping Li,Qiong Wang,Lei Zhu
Keywords: Boundary Guided Diffusion, Video Shadow Detection, aims to detect, Shadow Boundary Aware, Boundary Aware Attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACM MM2024

Abstract:Video Shadow Detection (VSD) aims to detect the shadow masks with frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can not only capture the temporal information but also the shadow property. Extensive experiments show that the performance of our approach overtakes the state-of-the-art methods, verifying the effectiveness of our components. We release the codes, weights, and results at this https URL.

[AI-7] Sum of Squares Circuits

Link: https://arxiv.org/abs/2408.11778
Authors: Lorenzo Loconte,Stefan Mengel,Antonio Vergari
Keywords: Designing expressive generative, Designing expressive, support exact, exact and efficient, efficient inference
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Algebraic Geometry (math.AG)
Comments:

Abstract:Designing expressive generative models that support exact and efficient inference is a core question in probabilistic ML. Probabilistic circuits (PCs) offer a framework where this tractability-vs-expressiveness trade-off can be analyzed theoretically. Recently, squared PCs encoding subtractive mixtures via negative parameters have emerged as tractable models that can be exponentially more expressive than monotonic PCs, i.e., PCs with positive parameters only. In this paper, we provide a more precise theoretical characterization of the expressiveness relationships among these models. First, we prove that squared PCs can be less expressive than monotonic ones. Second, we formalize a novel class of PCs – sum of squares PCs – that can be exponentially more expressive than both squared and monotonic PCs. Around sum of squares PCs, we build an expressiveness hierarchy that allows us to precisely unify and separate different tractable model classes such as Born Machines and PSD models, and other recently introduced tractable probabilistic models by using complex parameters. Finally, we empirically show the effectiveness of sum of squares circuits in performing distribution estimation.

[AI-8] D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models

Link: https://arxiv.org/abs/2408.11761
Authors: M. Forlini,M. Babcinschi,G. Palmieri,P. Neto
Keywords: Collaborative robots, increasingly popular, popular for assisting, work and daily, Large Multimodal Models
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Collaborative robots are increasingly popular for assisting humans at work and daily tasks. However, designing and setting up interfaces for human-robot collaboration is challenging, requiring the integration of multiple components, from perception and robot task control to the hardware itself. Frequently, this leads to highly customized solutions that rely on large amounts of costly training data, diverging from the ideal of flexible and general interfaces that empower robots to perceive and adapt to unstructured environments where they can naturally collaborate with humans. To overcome these challenges, this paper presents the Detection-Robot Management GPT (D-RMGPT), a robot-assisted assembly planner based on Large Multimodal Models (LMM). This system can assist inexperienced operators in assembly tasks without requiring any markers or previous training. D-RMGPT is composed of DetGPT-V and R-ManGPT. DetGPT-V, based on GPT-4V(vision), perceives the surrounding environment through one-shot analysis of prompted images of the current assembly stage and the list of components to be assembled. It identifies which components have already been assembled by analysing their features and assembly requirements. R-ManGPT, based on GPT-4, plans the next component to be assembled and generates the robot’s discrete actions to deliver it to the human co-worker. Experimental tests on assembling a toy aircraft demonstrated that D-RMGPT is flexible and intuitive to use, achieving an assembly success rate of 83% while reducing the assembly time for inexperienced operators by 33% compared to the manual process. this http URL

[AI-9] SBDet: A Symmetry-Breaking Object Detector via Relaxed Rotation-Equivariance

Link: https://arxiv.org/abs/2408.11760
Authors: Zhiqiang Wu,Yingjie Liu,Hanlin Dong,Xuan Tang,Jian Yang,Bo Jin,Mingsong Chen,Xian Wei
Keywords: Introducing Group Equivariant, Group Equivariant Convolution, explore symmetries hidden, Equivariant Convolution, Introducing Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Introducing Group Equivariant Convolution (GConv) empowers models to explore symmetries hidden in visual data, improving their performance. However, in real-world scenarios, objects or scenes often exhibit perturbations of a symmetric system, specifically a deviation from a symmetric architecture, which can be characterized by a non-trivial action of a symmetry group, known as Symmetry-Breaking. Traditional GConv methods are limited by the strict operation rules in the group space, only ensuring features remain strictly equivariant under limited group transformations, making it difficult to adapt to Symmetry-Breaking or non-rigid transformations. Motivated by this, we introduce a novel Relaxed Rotation GConv (R2GConv) with our defined Relaxed Rotation-Equivariant group $\mathbf{R}_4$. Furthermore, we propose a Relaxed Rotation-Equivariant Network (R2Net) as the backbone and further develop the Symmetry-Breaking Object Detector (SBDet) for 2D object detection built upon it. Experiments demonstrate the effectiveness of our proposed R2GConv in natural image classification tasks, and SBDet achieves excellent performance in object detection tasks with improved generalization capabilities and robustness.

[AI-10] Open-Ended 3D Point Cloud Instance Segmentation

Link: https://arxiv.org/abs/2408.11747
Authors: Phuc D.A. Nguyen,Minh Luu,Anh Tran,Cuong Pham,Khoi Nguyen
Keywords: Instance Segmentation methods, Instance Segmentation, recently demonstrated, demonstrated their ability, ability to generalize
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.

[AI-11] FocusLLM: Scaling LLMs Context by Parallel Decoding

Link: https://arxiv.org/abs/2408.11745
Authors: Zhenyu Li,Yike Zhang,Tengyu Pan,Yutao Sun,Zhichao Duan,Junjie Fang,Rong Han,Zixuan Wang,Jianyong Wang
Keywords: Empowering LLMs, context, long context lengths, Empowering, context length
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Empowering LLMs with the ability to utilize useful information from a long context is crucial for many downstream applications. However, achieving long context lengths with the conventional transformer architecture requires substantial training and inference resources. In this paper, we present FocusLLM, a framework designed to extend the context length of any decoder-only LLM, enabling the model to focus on relevant information from very long sequences. FocusLLM processes long text inputs by dividing them into chunks based on the model’s original context length to alleviate the issue of attention distraction. Then, it appends the local context to each chunk as a prompt to extract essential information from each chunk based on a novel parallel decoding mechanism, and ultimately integrates the extracted information into the local context. FocusLLM stands out for great training efficiency and versatility: trained with an 8K input length with much less training cost than previous methods, FocusLLM exhibits superior performance across downstream long-context tasks and maintains strong language modeling ability when handling extensive long texts, even up to 400K tokens. Our code is available at this https URL.
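
The chunking step described in the abstract (split the long input into chunks, then append the local context to each chunk as a prompt) can be sketched roughly as below. This covers only the data preparation; the parallel decoding and the integration of extracted information are not modeled. The function name `build_chunk_prompts` and the parameters `chunk_len` and `local_len` are hypothetical, and treating the last `local_len` tokens as the local context is an assumption made for the example.

```python
def build_chunk_prompts(tokens, chunk_len, local_len):
    """Split a long token list into chunk + local-context prompts.

    Assumption for this sketch: the final `local_len` tokens form the
    local context, and everything before them is cut into consecutive
    chunks of at most `chunk_len` tokens.
    """
    if chunk_len <= 0 or local_len <= 0:
        raise ValueError("chunk_len and local_len must be positive")
    local = tokens[-local_len:]    # most recent tokens
    memory = tokens[:-local_len]   # everything earlier (may be empty)
    chunks = [memory[i:i + chunk_len] for i in range(0, len(memory), chunk_len)]
    # Each prompt can be processed independently of the others.
    prompts = [chunk + local for chunk in chunks]
    return prompts, local
```

Since each prompt depends only on its own chunk and the shared local context, the prompts can be batched, which is what makes a parallel treatment of the chunks possible in principle.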

[AI-12] JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet

Link: https://arxiv.org/abs/2408.11744
Authors: Yujia Gu,Haofeng Li,Xinyu Fang,Zihan Peng,Yinan Peng
Keywords: Fine-tuned Stable Diffusion, refine depiction techniques, extract stylistic features, Stable Diffusion Model, Canny Edge Features
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCSMT 2024

Abstract:This study proposes a novel approach to extract stylistic features of Jiehua: the utilization of the Fine-tuned Stable Diffusion Model with ControlNet (FSDMC) to refine depiction techniques from artists’ Jiehua. The training data for FSDMC is based on the opensource Jiehua artist’s work collected from the Internet, which were subsequently manually constructed in the format of (Original Image, Canny Edge Features, Text Prompt). By employing the optimal hyperparameters identified in this paper, it was observed FSDMC outperforms CycleGAN, another mainstream style transfer model. FSDMC achieves FID of 3.27 on the dataset and also surpasses CycleGAN in terms of expert evaluation. This not only demonstrates the model’s high effectiveness in extracting Jiehua’s style features, but also preserves the original pre-trained semantic information. The findings of this study suggest that the application of FSDMC with appropriate hyperparameters can enhance the efficacy of the Stable Diffusion Model in the field of traditional art style migration tasks, particularly within the context of Jiehua.

[AI-13] CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering

Link: https://arxiv.org/abs/2408.11742
Authors: Yuliang Cai,Mohammad Rostami
Keywords: Large vision-language models, Large vision-language, shown significant performance, significant performance boost, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large vision-language models (VLMs) have shown significant performance boost in various application domains. However, adopting them to deal with several sequentially encountered tasks has been challenging because finetuning a VLM on a task normally leads to reducing its generalization power and the capacity of learning new tasks as well as causing catastrophic forgetting on previously learned tasks. Enabling using VLMs in multimodal continual learning (CL) settings can help to address such scenarios. To improve generalization capacity and prevent catastrophic forgetting, we propose a novel prompt-based CL method for VLMs, namely Cluster-based Modality Fusion Prompt (CluMo). We design a novel Key-Key-Prompt pair, where each prompt is associated with a visual prompt key and a textual prompt key. We adopt a two-stage training strategy. During the first stage, the single-modal keys are trained via K-means clustering algorithm to help select the best semantically matched prompt. During the second stage, the prompt keys are frozen, the selected prompt is attached to the input for training the VLM in the CL scenario. Experiments on two benchmarks demonstrate that our method achieves SOTA performance.

[AI-14] Clinical Insights: A Comprehensive Review of Language Models in Medicine

Link: https://arxiv.org/abs/2408.11735
Authors: Nikita Neveditsin,Pawan Lingras,Vijay Mago
Keywords: large language models, detailed examination, large language, clinical applications, advancements and applications
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to PLOS Digital Health

Abstract:This paper provides a detailed examination of the advancements and applications of large language models in the healthcare sector, with a particular emphasis on clinical applications. The study traces the evolution of LLMs from their foundational technologies to the latest developments in domain-specific models and multimodal integration. It explores the technical progression from encoder-based models requiring fine-tuning to sophisticated approaches that integrate textual, visual, and auditory data, thereby facilitating comprehensive AI solutions in healthcare. The paper discusses both the opportunities these technologies present for enhancing clinical efficiency and the challenges they pose in terms of ethics, data privacy, and implementation. Additionally, it critically evaluates the deployment strategies of LLMs, emphasizing the necessity of open-source models to ensure data privacy and adaptability within healthcare environments. Future research directions are proposed, focusing on empirical studies to evaluate the real-world efficacy of LLMs in healthcare and the development of open datasets for further research. This review aims to provide a comprehensive resource for both newcomers and multidisciplinary researchers interested in the intersection of AI and healthcare.

[AI-15] Efficient Detection of Toxic Prompts in Large Language Models

Link: https://arxiv.org/abs/2408.11727
Authors: Yi Liu,Junzhe Yu,Huijia Sun,Ling Shi,Gelei Deng,Yuqi Chen,Yang Liu
Keywords: Large language models, advanced natural language, automated content generation, significantly advanced natural, Large language
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted by the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024)

Abstract:Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector’s processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.

[AI-16] Iterative Object Count Optimization for Text-to-image Diffusion Models

Link: https://arxiv.org/abs/2408.11721
Authors: Oz Zafar,Lior Wolf,Idan Schwartz
Keywords: accurately generating, counting, counting model, models, objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Pre-print

Abstract:We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at this https URL.

[AI-17] Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests ICSE2025

Link: https://arxiv.org/abs/2408.11710
Authors: Amirhossein Deljouyi,Roham Koohestani,Maliheh Izadi,Andy Zaidman
Keywords: Automated unit test, Automated unit, search-based software testing, software testing tools, tools like EvoSuite
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Note: This paper has been accepted for presentation at the 47th International Conference on Software Engineering (ICSE 2025) - Research Track

Abstract:Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants from both academia and industry, we investigate how the understandability of unit tests affects a software engineer’s ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.

[AI-18] Physics-informed Discovery of State Variables in Second-Order and Hamiltonian Systems

Link: https://arxiv.org/abs/2408.11691
Authors: Félix Chavelli,Zi-Yu Khoo,Dawen Wu,Jonathan Sze Choong Low,Stéphane Bressan
Keywords: controlling natural phenomena, state variables, pervasive concern, predicting and controlling, controlling natural
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The modeling of dynamical systems is a pervasive concern for not only describing but also predicting and controlling natural phenomena and engineered systems. Current data-driven approaches often assume prior knowledge of the relevant state variables or result in overparameterized state spaces. Boyuan Chen and his co-authors proposed a neural network model that estimates the degrees of freedom and attempts to discover the state variables of a dynamical system. Despite its innovative approach, this baseline model lacks a connection to the physical principles governing the systems it analyzes, leading to unreliable state variables. This research proposes a method that leverages the physical characteristics of second-order Hamiltonian systems to constrain the baseline model. The proposed model outperforms the baseline model in identifying a minimal set of non-redundant and interpretable state variables.

[AI-19] CIPHER: Cybersecurity Intelligent Penetration-testing Helper for Ethical Researcher

Link: https://arxiv.org/abs/2408.11650
Authors: Derry Pratama,Naufal Suryanto,Andro Aprila Adiputra,Thi-Thu-Huong Le,Ahmada Yusril Kadiptya,Muhammad Iqbal,Howon Kim
Keywords: typically requires extensive, requires extensive time, Penetration testing, Intelligent Penetration-testing Helper, typically requires
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 28 pages, github available

Abstract:Penetration testing, a critical component of cybersecurity, typically requires extensive time and effort to find vulnerabilities. Beginners in this field often benefit from collaborative approaches with the community or experts. To address this, we develop CIPHER (Cybersecurity Intelligent Penetration-testing Helper for Ethical Researchers), a large language model specifically trained to assist in penetration testing tasks. We trained CIPHER using over 300 high-quality write-ups of vulnerable machines, hacking techniques, and documentation of open-source penetration testing tools. Additionally, we introduced the Findings, Action, Reasoning, and Results (FARR) Flow augmentation, a novel method to augment penetration testing write-ups to establish a fully automated pentesting simulation benchmark tailored for large language models. This approach fills a significant gap in traditional cybersecurity Q&A benchmarks and provides a realistic and rigorous standard for evaluating AI’s technical knowledge, reasoning capabilities, and practical utility in dynamic penetration testing scenarios. In our assessments, CIPHER achieved the best overall performance in providing accurate suggestion responses compared to other open-source penetration testing models of similar size and even larger state-of-the-art models like Llama 3 70B and Qwen1.5 72B Chat, particularly on insane difficulty machine setups. This demonstrates that the current capabilities of general LLMs are insufficient for effectively guiding users through the penetration testing process. We also discuss the potential for improvement through scaling and the development of better benchmarks using FARR Flow augmentation results. Our benchmark will be released publicly at this https URL.

[AI-20] Video-to-Text Pedestrian Monitoring (VTPM): Leveraging Computer Vision and Large Language Models for Privacy-Preserve Pedestrian Activity Monitoring at Intersections

Link: https://arxiv.org/abs/2408.11649
Authors: Ahmed S. Abdelrahman, Mohamed Abdel-Aty, Dongdong Wang
Keywords (EN): advanced research methodologies, enhancing system services, research methodologies, advanced research, textual reports
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Computer vision has advanced research methodologies, enhancing system services across various fields. It is a core component in traffic monitoring systems for improving road safety; however, these monitoring systems don’t preserve the privacy of pedestrians who appear in the videos, potentially revealing their identities. Addressing this issue, our paper introduces Video-to-Text Pedestrian Monitoring (VTPM), which monitors pedestrian movements at intersections and generates real-time textual reports, including traffic signal and weather information. VTPM uses computer vision models for pedestrian detection and tracking, achieving a latency of 0.05 seconds per video frame. Additionally, it detects crossing violations with 90.2% accuracy by incorporating traffic signal data. The proposed framework is equipped with Phi-3 mini-4k to generate real-time textual reports of pedestrian activity while stating safety concerns like crossing violations, conflicts, and the impact of weather on their behavior with latency of 0.33 seconds. To enhance comprehensive analysis of the generated textual reports, Phi-3 medium is fine-tuned for historical analysis of these generated textual reports. This fine-tuning enables more reliable analysis about the pedestrian safety at intersections, effectively detecting patterns and safety critical events. The proposed VTPM offers a more efficient alternative to video footage by using textual reports reducing memory usage, saving up to 253 million percent, eliminating privacy issues, and enabling comprehensive interactive historical analysis.

[AI-21] Data-driven Modeling of Combined Sewer Systems for Urban Sustainability: An Empirical Evaluation

Link: https://arxiv.org/abs/2408.11619
Authors: Vipin Singh, Tianheng Ling, Teodor Chiaburu, Felix Biessmann
Keywords (EN): Climate change poses, Climate change, poses complex challenges, change poses complex, Combined Sewer Systems
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, accepted at 47th German Conference on Artificial Intelligence, Wuerzburg 2024

Abstract:Climate change poses complex challenges, with extreme weather events becoming increasingly frequent and difficult to model. Examples include the dynamics of Combined Sewer Systems (CSS). Overburdened CSS during heavy rainfall will overflow untreated wastewater into surface water bodies. Classical approaches to modeling the impact of extreme rainfall events rely on physical simulations, which are particularly challenging to create for large urban infrastructures. Deep Learning (DL) models offer a cost-effective alternative for modeling the complex dynamics of sewer systems. In this study, we present a comprehensive empirical evaluation of several state-of-the-art DL time series models for predicting sewer system dynamics in a large urban infrastructure, utilizing three years of measurement data. We especially investigate the potential of DL models to maintain predictive precision during network outages by comparing global models, which have access to all variables within the sewer system, and local models, which are limited to data from a restricted set of local sensors. Our findings demonstrate that DL models can accurately predict the dynamics of sewer system load, even under network outage conditions. These results suggest that DL models can effectively aid in balancing the load redistribution in CSS, thereby enhancing the sustainability and resilience of urban infrastructures.

[AI-22] Xinyu: An Efficient LLM-based System for Commentary Generation

Link: https://arxiv.org/abs/2408.11609
Authors: Yiquan Wu, Bo Tang, Chenyang Xi, Yu Yu, Pengyu Wang, Yifei Liu, Kun Kuang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Jie Hu, Peng Cheng, Zhonghao Wang, Yi Wang, Yi Luo, Mingchuan Yang
Keywords (EN): presenting diverse arguments, deep understanding, presenting diverse, Commentary, requirements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Commentary provides readers with a deep understanding of events by presenting diverse arguments and evidence. However, creating commentary is a time-consuming task, even for skilled commentators. Large language models (LLMs) have simplified the process of natural language generation, but their direct application in commentary creation still faces challenges due to unique task requirements. These requirements can be categorized into two levels: 1) fundamental requirements, which include creating well-structured and logically consistent narratives, and 2) advanced requirements, which involve generating quality arguments and providing convincing evidence. In this paper, we introduce Xinyu, an efficient LLM-based system designed to assist commentators in generating Chinese commentaries. To meet the fundamental requirements, we deconstruct the generation process into sequential steps, proposing targeted strategies and supervised fine-tuning (SFT) for each step. To address the advanced requirements, we present an argument ranking model for arguments and establish a comprehensive evidence database that includes up-to-date events and classic books, thereby strengthening the substantiation of the evidence with retrieval augmented generation (RAG) technology. To evaluate the generated commentaries more fairly, corresponding to the two-level requirements, we introduce a comprehensive evaluation metric that considers five distinct perspectives in commentary generation. Our experiments confirm the effectiveness of our proposed system. We also observe a significant increase in the efficiency of commentators in real-world scenarios, with the average time spent on creating a commentary dropping from 4 hours to 20 minutes. Importantly, such an increase in efficiency does not compromise the quality of the commentaries.

[AI-23] Don’t Kill the Baby: The Case for AI in Arbitration

Link: https://arxiv.org/abs/2408.11608
Authors: Michael Broyde, Yiyang Mei
Keywords (EN): introduction of Generative, simulate human intelligence, Federal Arbitration Act, ability to simulate, generate content
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Since the introduction of Generative AI (GenAI) in 2022, its ability to simulate human intelligence and generate content has sparked both enthusiasm and concern. While much criticism focuses on AI’s potential to perpetuate bias, create emotional dissonance, displace jobs, and raise ethical questions, these concerns often overlook the practical benefits of AI, particularly in legal contexts. This article examines the integration of AI into arbitration, arguing that the Federal Arbitration Act (FAA) allows parties to contractually choose AI-driven arbitration, despite traditional reservations. The article makes three key contributions: (1) It shifts the focus from debates over AI’s personhood to the practical aspects of incorporating AI into arbitration, asserting that AI can effectively serve as an arbitrator if both parties agree; (2) It positions arbitration as an ideal starting point for broader AI adoption in the legal field, given its flexibility and the autonomy it grants parties to define their standards of fairness; and (3) It outlines future research directions, emphasizing the importance of empirically comparing AI and human arbitration, which could lead to the development of distinct systems. By advocating for the use of AI in arbitration, this article underscores the importance of respecting contractual autonomy and creating an environment that allows AI’s potential to be fully realized. Drawing on the insights of Judge Richard Posner, the article argues that the ethical obligations of AI in arbitration should be understood within the context of its technological strengths and the voluntary nature of arbitration agreements. 
Ultimately, it calls for a balanced, open-minded approach to AI in arbitration, recognizing its potential to enhance the efficiency, fairness, and flexibility of dispute resolution.

[AI-24] Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation

Link: https://arxiv.org/abs/2408.11607
Authors: Patrick Benjamin, Alessandro Abate
Keywords (EN): Recent works, Munchausen Online Mirror, Online Mirror Descent, non-episodic run, Mean-Field Games
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Recent works have provided algorithms by which decentralised agents, which may be connected via a communication network, can learn equilibria in Mean-Field Games from a single, non-episodic run of the empirical system. However, these algorithms are given for tabular settings: this computationally limits the size of players’ observation space, meaning that the algorithms are not able to handle anything but small state spaces, nor to generalise beyond policies depending on the ego player’s state to so-called ‘population-dependent’ policies. We address this limitation by introducing function approximation to the existing setting, drawing on the Munchausen Online Mirror Descent method that has previously been employed only in finite-horizon, episodic, centralised settings. While this permits us to include the population’s mean-field distribution in the observation for each player’s policy, it is arguably unrealistic to assume that decentralised agents would have access to this global information: we therefore additionally provide new algorithms that allow agents to estimate the global empirical distribution based on a local neighbourhood, and to improve this estimate via communication over a given network. Our experiments showcase how the communication network allows decentralised agents to estimate the mean-field distribution for population-dependent policies, and that exchanging policy information helps networked agents to outperform both independent and even centralised agents in function-approximation settings, by an even greater margin than in tabular settings.

[AI-25] Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning

Link: https://arxiv.org/abs/2408.11599
Authors: Xinhao Chen, Chong Yang, Man Lan, Li Cai, Yang Chen, Tu Hu, Xinlin Zhuang, Aimin Zhou
Keywords (EN): comprehend dialogue contexts, Empathetic response generation, generation endows agents, response generation endows, Large Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Empathetic response generation endows agents with the capability to comprehend dialogue contexts and react to expressed emotions. Previous works predominantly focus on leveraging the speaker’s emotional labels, but ignore the importance of emotion cause reasoning in empathetic response generation, which hinders the model’s capacity for further affective understanding and cognitive inference. In this paper, we propose a cause-aware empathetic generation approach by integrating emotions and causes through a well-designed Chain-of-Thought (CoT) prompt on Large Language Models (LLMs). Our approach can greatly promote LLMs’ performance of empathy by instruction tuning and enhancing the role awareness of an empathetic listener in the prompt. Additionally, we propose to incorporate cause-oriented external knowledge from COMET into the prompt, which improves the diversity of generation and alleviates conflicts between internal and external knowledge at the same time. Experimental results on the benchmark dataset demonstrate that our approach on LLaMA-7b achieves state-of-the-art performance in both automatic and human evaluations.

[AI-26] Active learning for efficient data selection in radio-signal based positioning via deep learning

Link: https://arxiv.org/abs/2408.11592
Authors: Vincent Corlay, Milan Courcoux-Caro
Keywords (EN): user equipment, based on radio, radio signals, signals via deep, deep learning
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Submitted to Electronics Letters

Abstract: We consider the problem of user equipment (UE) positioning based on radio signals via deep learning. As in most supervised-learning tasks, a critical aspect is the availability of a relevant dataset to train a model. However, in a cellular network, the data-collection step may induce a high communication overhead. As a result, to reduce the required size of the dataset, it may be interesting to carefully choose the positions to be labelled and to be used in the training. We therefore propose an active learning approach for efficient data collection. We first show that significant gains (both in terms of positioning accuracy and size of the required dataset) can be obtained for the considered positioning problem using a genie. This validates the interest of active learning for positioning. We then propose a practical method to approximate this genie.

[AI-27] Drama Engine: A Framework for Narrative Agents

Link: https://arxiv.org/abs/2408.11574
Authors: Martin Pichlmair, Riddhi Raj, Charlene Putney
Keywords (EN): technical report presents, large language models, language models designed, Drama Engine, narrative purposes
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 2 figures, 2 tables

Abstract:This technical report presents the Drama Engine, a novel framework for agentic interaction with large language models designed for narrative purposes. The framework adapts multi-agent system principles to create dynamic, context-aware companions that can develop over time and interact with users and each other. Key features include multi-agent workflows with delegation, dynamic prompt assembly, and model-agnostic design. The Drama Engine introduces unique elements such as companion development, mood systems, and automatic context summarising. It is implemented in TypeScript. The framework’s applications include multi-agent chats and virtual co-workers for creative writing. The paper discusses the system’s architecture, prompt assembly process, delegation mechanisms, and moderation techniques, as well as potential ethical considerations and future extensions.

[AI-28] Differentiating Choices via Commonality for Multiple-Choice Question Answering ECAI2024

Link: https://arxiv.org/abs/2408.11554
Authors: Wenqing Deng, Zhe Wang, Kewen Wang, Shirui Pan, Xiaowang Zhang, Zhiyong Feng
Keywords (EN): Multiple-choice question answering, Multiple-choice question, semantically similar, choices, MCQA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, accepted to ECAI 2024

Abstract:Multiple-choice question answering (MCQA) becomes particularly challenging when all choices are relevant to the question and are semantically similar. Yet this setting of MCQA can potentially provide valuable clues for choosing the right answer. Existing models often rank each choice separately, overlooking the context provided by other choices. Specifically, they fail to leverage the semantic commonalities and nuances among the choices for reasoning. In this paper, we propose a novel MCQA model by differentiating choices through identifying and eliminating their commonality, called DCQA. Our model captures token-level attention of each choice to the question, and separates tokens of the question attended to by all the choices (i.e., commonalities) from those by individual choices (i.e., nuances). Using the nuances as refined contexts for the choices, our model can effectively differentiate choices with subtle differences and provide justifications for choosing the correct answer. We conduct comprehensive experiments across five commonly used MCQA benchmarks, demonstrating that DCQA consistently outperforms baseline models. Furthermore, our case study illustrates the effectiveness of the approach in directing the attention of the model to more differentiating features.
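The separation of commonalities from nuances described in this abstract can be illustrated with a toy sketch. It is not the DCQA model: the abstract describes token-level attention, whereas this example assumes the attention has already been thresholded into plain sets of question tokens per choice. The function name `split_commonality` and the sample data are hypothetical.

```python
def split_commonality(attended):
    """Split question tokens into those attended to by all choices
    (commonalities) and those specific to each choice (nuances).

    `attended` maps a choice label to the set of question tokens that
    choice attends to. Hard sets are an assumption made for illustration.
    """
    token_sets = list(attended.values())
    common = set.intersection(*token_sets) if token_sets else set()
    nuances = {choice: tokens - common for choice, tokens in attended.items()}
    return common, nuances


# Hypothetical example: three answer choices and the question tokens
# each one is assumed to attend to.
attended = {
    "A": {"which", "metal", "conducts", "heat"},
    "B": {"which", "metal", "conducts"},
    "C": {"which", "metal", "heat", "best"},
}
common, nuances = split_commonality(attended)
print(sorted(common))
print({choice: sorted(tokens) for choice, tokens in nuances.items()})
```

Removing the shared tokens leaves each choice with only the context that distinguishes it, which is the intuition the abstract gives for using nuances as refined contexts.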

[AI-29] Explainable Deep Learning Framework for Human Activity Recognition

Link: https://arxiv.org/abs/2408.11552
Authors: Yiran Huang, Yexu Zhou, Haibin Zhao, Till Riedel, Michael Beigl
Keywords (EN): explainable Artificial Intelligence, Artificial Intelligence, human activity recognition, Class Activation Mapping, explainable Artificial
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: In the realm of human activity recognition (HAR), the integration of explainable Artificial Intelligence (XAI) emerges as a critical necessity to elucidate the decision-making processes of complex models, fostering transparency and trust. Traditional explanatory methods like Class Activation Mapping (CAM) and attention mechanisms, although effective in highlighting regions vital for decisions in various contexts, prove inadequate for HAR. This inadequacy stems from the inherently abstract nature of HAR data, rendering these explanations obscure. In contrast, state-of-the-art post-hoc interpretation techniques for time series can explain the model from other perspectives. However, this requires extra effort. It usually takes 10 to 20 seconds to generate an explanation. To overcome these challenges, we propose a novel, model-agnostic framework that enhances both the interpretability and efficacy of HAR models through the strategic use of competitive data augmentation. This innovative approach does not rely on any particular model architecture, thereby broadening its applicability across various HAR models. By implementing competitive data augmentation, our framework provides intuitive and accessible explanations of model decisions, thereby significantly advancing the interpretability of HAR systems without compromising on performance.

[AI-30] Memorization In In-Context Learning

Link: https://arxiv.org/abs/2408.11546
Authors: Shahriar Golchin, Mihai Surdeanu, Steven Bethard, Eduardo Blanco, Ellen Riloff
Keywords (EN): large language models, In-context learning, ICL, language models, strategy for improving
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: v1

Abstract:In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind these performance improvements remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers a hidden phenomenon – memorization – at the core of ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

[AI-31] A Survey of Embodied Learning for Object-Centric Robotic Manipulation

Link: https://arxiv.org/abs/2408.11537
Authors: Ying Zheng, Lei Yao, Yuejiao Su, Yi Zhang, Yi Wang, Sicheng Zhao, Yiyi Zhang, Lap-Pui Chau
Keywords (EN): rapidly developing, developing and challenging, challenging area, object-centric robotic manipulation, Embodied
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot’s performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at this https URL.

[AI-32] Scalable Knowledge Refactoring using Constrained Optimisation

Link: https://arxiv.org/abs/2408.11530
Authors: Minghao Liu, David M. Cerna, Filipe Gouveia, Andrew Cropper
Keywords (EN): Knowledge refactoring compresses, Knowledge refactoring, compresses a logic, Knowledge, logic program
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge refactoring compresses a logic program by introducing new rules. Current approaches struggle to scale to large programs. To overcome this limitation, we introduce a constrained optimisation refactoring approach. Our first key idea is to encode the problem with decision variables based on literals rather than rules. Our second key idea is to focus on linear invented rules. Our empirical results on multiple domains show that our approach can refactor programs quicker and with more compression than the previous state-of-the-art approach, sometimes by 60%.

[AI-33] The Vizier Gaussian Process Bandit Algorithm

Link: https://arxiv.org/abs/2408.11527
Authors: Xingyou Song, Qiuyi Zhang, Chansoo Lee, Emily Fertig, Tzu-Kuo Huang, Lior Belenki, Greg Kochanski, Setareh Ariafar, Srinivas Vasudevan, Sagi Perel, Daniel Golovin
Keywords (EN): accelerated numerous research, Bayesian optimization, success of Bayesian, Google Vizier, Open Source Vizier
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Google DeepMind Technical Report. Code can be found in this https URL

Abstract:Google Vizier has performed millions of optimizations and accelerated numerous research and production systems at Google, demonstrating the success of Bayesian optimization as a large-scale service. Over multiple years, its algorithm has been improved considerably, through the collective experiences of numerous research efforts and user feedback. In this technical report, we discuss the implementation details and design choices of the current default algorithm provided by Open Source Vizier. Our experiments on standardized benchmarks reveal its robustness and versatility against well-established industry baselines on multiple practical modes.

[AI-34] RConE: Rough Cone Embedding for Multi-Hop Logical Query Answering on Multi-Modal Knowledge Graphs

Link: https://arxiv.org/abs/2408.11526
Authors: Mayank Kharbanda, Rajiv Ratn Shah, Raghava Mutharaju
Keywords (EN): Multi-hop query answering, Multi-Modal Knowledge Graphs, multi-hop question answering, start node, Multi-hop
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Multi-hop query answering over a Knowledge Graph (KG) involves traversing one or more hops from the start node to answer a query. Path-based and logic-based methods are state-of-the-art for multi-hop question answering. The former is used in link prediction tasks. The latter is for answering complex logical queries. The logical multi-hop querying technique embeds the KG and queries in the same embedding space. The existing work incorporates First Order Logic (FOL) operators, such as conjunction ($\wedge$), disjunction ($\vee$), and negation ($\neg$), in queries. Though current models have most of the building blocks to execute the FOL queries, they cannot use the dense information of multi-modal entities in the case of Multi-Modal Knowledge Graphs (MMKGs). We propose RConE, an embedding method to capture the multi-modal information needed to answer a query. The model first shortlists candidate (multi-modal) entities containing the answer. It then finds the solution (sub-entities) within those entities. Several existing works tackle path-based question-answering in MMKGs. However, to our knowledge, we are the first to introduce logical constructs in querying MMKGs and to answer queries that involve sub-entities of multi-modal entities as the answer. Extensive evaluation on four publicly available MMKGs indicates that RConE outperforms the current state-of-the-art.

[AI-35] LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding

Link: https://arxiv.org/abs/2408.11523
Authors: Zhizhong Wan, Bin Yin, Junjie Xie, Fei Jiang, Xiang Li, Wei Lin
Keywords (EN): Click-Through Rate, provide personalized recommendation, personalized recommendation services, Recommendation System, Large Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract: Click-Through Rate (CTR) prediction is crucial for Recommendation Systems (RS), which aim to provide personalized recommendation services for users in areas such as food delivery and e-commerce. However, traditional RS rely on collaborative signals, which lack semantic understanding of real-time scenes. We also noticed that a major challenge in utilizing Large Language Models (LLMs) for practical recommendation purposes is their efficiency in dealing with long text input. To address these problems, we propose Large Language Model Aided Real-time Scene Recommendation (LARR), which adopts LLMs for semantic understanding and utilizes real-time scene information in RS without requiring the LLM to process the entire real-time scene text directly, thereby enhancing the efficiency of LLM-based CTR modeling. Specifically, recommendation domain-specific knowledge is injected into the LLM, and then the RS employs an aggregation encoder to build real-time scene information from the LLM’s separate outputs. First, an LLM is continually pretrained on a corpus built from recommendation data with the aid of special tokens. Subsequently, the LLM is fine-tuned via contrastive learning on three kinds of sample construction strategies. Through this step, the LLM is transformed into a text embedding model. Finally, the LLM’s separate outputs for different scene features are aggregated by an encoder and aligned to collaborative signals in RS, enhancing the performance of the recommendation model.

[AI-36] Quantifying Behavioural Distance Between Mathematical Expressions

Link: https://arxiv.org/abs/2408.11515
Authors: Sebastian Mežnar, Sašo Džeroski, Ljupčo Todorovski
Keywords (EN): expressions primarily based, Existing symbolic regression, candidate mathematical expressions, mathematical expressions primarily, structural similarity
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures, 1 table, 2 appendices

Abstract:Existing symbolic regression methods organize the space of candidate mathematical expressions primarily based on their syntactic, structural similarity. However, this approach overlooks crucial equivalences between expressions that arise from mathematical symmetries, such as commutativity, associativity, and distribution laws for arithmetic operations. Consequently, expressions with similar errors on a given data set are apart from each other in the search space. This leads to a rough error landscape in the search space that efficient local, gradient-based methods cannot explore. This paper proposes and implements a measure of a behavioral distance, BED, that clusters together expressions with similar errors. The experimental results show that the stochastic method for calculating BED achieves consistency with a modest number of sampled values for evaluating the expressions. This leads to computational efficiency comparable to the tree-based syntactic distance. Our findings also reveal that BED significantly improves the smoothness of the error landscape in the search space for symbolic regression.
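To make the contrast between syntactic and behavioural similarity concrete, here is a minimal sketch. It is not the paper's BED measure, whose definition the abstract does not give. It only assumes that behaviour can be compared by evaluating two expressions on the same randomly sampled inputs and taking the root-mean-square difference. The function name, sampling range, and sample count are illustrative choices.

```python
import math
import random


def behavioural_distance(f, g, n_samples=200, low=-5.0, high=5.0, seed=0):
    """Root-mean-square difference between f and g on sampled inputs.

    A simple stand-in for a behaviour-based distance; NOT the BED
    measure from the paper.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(low, high) for _ in range(n_samples)]
    squared = [(f(x) - g(x)) ** 2 for x in xs]
    return math.sqrt(sum(squared) / n_samples)


# Two expressions with different syntax trees but identical behaviour
# (distributivity and commutativity), and a third that is syntactically
# close to the second but behaves differently.
expr_a = lambda x: 2 * (x + 1)
expr_b = lambda x: 2 + x * 2
expr_c = lambda x: 2 * x + 1.5

d_ab = behavioural_distance(expr_a, expr_b)
d_ac = behavioural_distance(expr_a, expr_c)
print(d_ab, d_ac)
```

A tree-edit style distance would place `expr_a` and `expr_b` apart, while a behaviour-based distance puts them together, which is the kind of equivalence the abstract says syntactic measures overlook.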

[AI-37] Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Link: https://arxiv.org/abs/2408.11513
Authors: Washim Uddin Mondal, Vaneet Aggarwal
Keywords (EN): Markov Decision Process, Constrained Markov Decision, Decision Process, Constrained Markov, Markov Decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement of the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
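The two special cases quoted in this abstract follow from evaluating the minimum in the general bound. The short derivation below is a sketch under two assumptions: the general bound reads as $\epsilon^{-2}$ times the minimum of $\epsilon^{-2}$ and $\epsilon_{\mathrm{bias}}^{-1/3}$, and $\epsilon_{\mathrm{bias}}$ is treated as a constant that does not depend on $\epsilon$. The symbol $N(\epsilon)$ for the sample complexity is notation introduced here, not taken from the paper.

```latex
\begin{align*}
N(\epsilon) &= \tilde{\mathcal{O}}\!\left(\epsilon^{-2}\min\left\{\epsilon^{-2},\ \epsilon_{\mathrm{bias}}^{-1/3}\right\}\right) \\
\epsilon < \epsilon_{\mathrm{bias}}^{1/6}
  &\;\Longrightarrow\; \epsilon^{-2} > \epsilon_{\mathrm{bias}}^{-1/3}
  \;\Longrightarrow\; N(\epsilon) = \tilde{\mathcal{O}}\!\left(\epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-1/3}\right)
  = \tilde{\mathcal{O}}\!\left(\epsilon^{-2}\right) \\
\epsilon_{\mathrm{bias}} = 0
  &\;\Longrightarrow\; \min\left\{\epsilon^{-2},\ \epsilon_{\mathrm{bias}}^{-1/3}\right\} = \epsilon^{-2}
  \;\Longrightarrow\; N(\epsilon) = \tilde{\mathcal{O}}\!\left(\epsilon^{-4}\right)
\end{align*}
```

In the last line, $\epsilon_{\mathrm{bias}}^{-1/3}$ is read as $+\infty$ when $\epsilon_{\mathrm{bias}} = 0$, so the minimum is attained by $\epsilon^{-2}$.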

[AI-38] Mutagenesis screen to map the functionals of parameters of Large Language Models

Link: https://arxiv.org/abs/2408.11494
Authors: Yue Hu, Kai Hu, Patrick X. Zhao, Javed Khan, Chengming Xu
Keywords (EN): advanced artificial intelligence, significantly advanced artificial, Large Language Models, artificial intelligence, excelling in numerous
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, supplementary material available online

Abstract: Large Language Models (LLMs) have significantly advanced artificial intelligence, excelling in numerous tasks. Although the functionality of a model is inherently tied to its parameters, a systematic method for exploring the connections between the parameters and the functionality is lacking. Models sharing similar structure and parameter counts exhibit significant performance disparities across various tasks, prompting investigations into the varying patterns that govern their performance. We adopted a mutagenesis screen approach inspired by the methods used in biological studies, to investigate Llama2-7b and Zephyr. This technique involved mutating elements within the models’ matrices to their maximum or minimum values to examine the relationship between model parameters and their functionalities. Our research uncovered multiple levels of fine structures within both models. Many matrices showed a mixture of maximum and minimum mutations following mutagenesis, but others were predominantly sensitive to one type. Notably, mutations that produced phenotypes, especially those with severe outcomes, tended to cluster along axes. Additionally, the location of maximum and minimum mutations often displayed a complementary pattern on the matrix in both models, with the Gate matrix showing a unique two-dimensional asymmetry after rearrangement. In Zephyr, certain mutations consistently resulted in poetic or conversational rather than descriptive outputs. These “writer” mutations grouped according to the high-frequency initial word of the output, with a marked tendency to share the row coordinate even when they are in different matrices. Our findings affirm that the mutagenesis screen is an effective tool for deciphering the complexities of large language models and identifying unexpected ways to expand their potential, providing deeper insights into the foundational aspects of AI systems.
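The screening procedure described in this abstract can be sketched on a toy matrix. This is not the authors' pipeline: it assumes that "maximum or minimum values" means the largest and smallest entry of the same matrix, and it replaces the text-output phenotypes of the paper with a hypothetical numeric `score` function. All names here are illustrative.

```python
def mutagenesis_screen(matrix, score):
    """Mutate one element at a time to the matrix max or min and
    record how much the score changes relative to the unmutated matrix.

    `matrix` is a list of rows; `score` is any function of a matrix.
    Both are stand-ins for a real weight matrix and a real model
    evaluation.
    """
    flat = [value for row in matrix for value in row]
    highest, lowest = max(flat), min(flat)
    baseline = score(matrix)
    effects = {}
    for i, row in enumerate(matrix):
        for j in range(len(row)):
            for label, value in (("max", highest), ("min", lowest)):
                mutant = [r[:] for r in matrix]  # copy so the original is untouched
                mutant[i][j] = value
                effects[(i, j, label)] = score(mutant) - baseline
    return effects


# Toy weights and a toy score that only depends on the first row,
# so mutations in the second row are "silent".
weights = [[0.5, -1.0], [2.0, 0.0]]
first_row_sum = lambda m: m[0][0] + m[0][1]

effects = mutagenesis_screen(weights, first_row_sum)
sensitive = sorted(key for key, delta in effects.items() if delta != 0)
print(sensitive)
```

Positions whose mutation changes the score correspond to the mutations that "produce phenotypes" in the abstract, and the map of such positions is what a screen of this kind inspects for structure.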

[AI-39] Estimating Peer Direct and Indirect Effects in Observational Network Data AAAI

Link: https://arxiv.org/abs/2408.11492
Authors: Xiaojing Du, Jiuyong Li, Debo Cheng, Lin Liu, Wentao Gao, Xiongren Chen
Keywords: Estimating causal effects, Estimating causal, network data due, observational network data, causal effects
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI

Abstract:Estimating causal effects is crucial for decision-makers in many applications, but it is particularly challenging with observational network data due to peer interactions. Many algorithms have been proposed to estimate causal effects involving network data, particularly peer effects, but they often overlook the variety of peer effects. To address this issue, we propose a general setting which considers both peer direct effects and peer indirect effects, and the effect of an individual’s own treatment, and provide identification conditions of these causal effects and proofs. To estimate these causal effects, we utilize attention mechanisms to distinguish the influences of different neighbors and explore high-order neighbor effects through multi-layer graph neural networks (GNNs). Additionally, to control the dependency between node features and representations, we incorporate the Hilbert-Schmidt Independence Criterion (HSIC) into the GNN, fully utilizing the structural information of the graph, to enhance the robustness and accuracy of the model. Extensive experiments on two semi-synthetic datasets confirm the effectiveness of our approach. Our theoretical findings have the potential to improve intervention strategies in networked systems, with applications in areas such as social networks and epidemiology.
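
The abstract above names the Hilbert-Schmidt Independence Criterion (HSIC) but gives no formula. As a rough illustration only, the sketch below computes the standard biased empirical estimator, tr(KHLH) / (n - 1)^2, where K and L are kernel Gram matrices and H is the centering matrix. The Gaussian kernel, the bandwidth `sigma`, and all function names are assumptions made for this sketch. How the paper couples HSIC with its GNN is not described here, so this is not the authors' implementation.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over a list of vectors."""
    def k(a, b):
        sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return [[k(a, b) for b in xs] for a in xs]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H @ gram @ H."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    col = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs and ys."""
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma))   # H K H
    l = gaussian_gram(ys, sigma)            # L
    # tr(K H L H) equals tr((H K H) L) by cyclicity of the trace.
    trace = sum(kc[i][j] * l[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


xs = [[i / 10.0] for i in range(20)]
dependent = [[x[0] ** 2] for x in xs]   # a function of xs
constant = [[1.0] for _ in xs]          # carries no information about xs
score_dependent = hsic(xs, dependent)
score_constant = hsic(xs, constant)
```

A value near zero suggests little detectable dependence under the chosen kernel, while larger values suggest stronger dependence. The estimate is sensitive to the kernel and bandwidth.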

[AI-40] Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

Link: https://arxiv.org/abs/2408.11491
Authors: Zouying Cao, Yifei Yang, Hai Zhao
Keywords: Large language models, indispensable for Large, Large language, malicious instructions, exaggerated safety
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety alignment is indispensable for large language models (LLMs) to defend against threats from malicious instructions. However, recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to the exaggerated safety issue, limiting their helpfulness. In this paper, we propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns in aligned LLMs. First, SCANS extracts the refusal steering vectors within the activation space and utilizes vocabulary projection to anchor some specific safety-critical layers which influence model refusal behavior. Second, by tracking the hidden state transition, SCANS identifies the steering direction and steers the model behavior accordingly, achieving a balance between exaggerated safety and adequate safety. Experiments show that SCANS achieves new state-of-the-art performance on the XSTest and OKTest benchmarks, without impairing the models' defense capability against harmful queries and while leaving model capability almost unchanged.

[AI-41] Using Part-based Representations for Explainable Deep Reinforcement Learning

Link: https://arxiv.org/abs/2408.11455
Authors: Manos Kirtas, Konstantinos Tsampazis, Loukia Avramelou, Nikolaos Passalis
Keywords: Utilizing deep learning, holds significant potential, Utilizing deep, models incorporate latent, representations holds significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Utilizing deep learning models to learn part-based representations holds significant potential for interpretable-by-design approaches, as these models incorporate latent causes obtained from feature representations through simple addition. However, training a part-based learning model presents challenges, particularly in enforcing non-negative constraints on the model’s parameters, which can result in training difficulties such as instability and convergence issues. Moreover, applying such approaches in Deep Reinforcement Learning (RL) is even more demanding due to the inherent instabilities that impact many optimization methods. In this paper, we propose a non-negative training approach for actor models in RL, enabling the extraction of part-based representations that enhance interpretability while adhering to non-negative constraints. To this end, we employ a non-negative initialization technique, as well as a modified sign-preserving training method, which can ensure better gradient flow compared to existing approaches. We demonstrate the effectiveness of the proposed approach using the well-known Cartpole benchmark.

[AI-42] Bidirectional Gated Mamba for Sequential Recommendation

Link: https://arxiv.org/abs/2408.11451
Authors: Ziwei Liu, Qidong Liu, Yejing Wang, Wanyu Wang, Pengyue Jia, Maolin Wang, Zitao Liu, Yi Chang, Xiangyu Zhao
Keywords: Sequential Recommender Systems, Recommender Systems, intricate user preferences, discern intricate user, Sequential Recommender
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In various domains, Sequential Recommender Systems (SRS) have become essential due to their superior capability to discern intricate user preferences. Typically, SRS utilize transformer-based architectures to forecast the subsequent item within a sequence. Nevertheless, the quadratic computational complexity inherent in these models often leads to inefficiencies, hindering the achievement of real-time recommendations. Mamba, a recent advancement, has exhibited exceptional performance in time series prediction, significantly enhancing both efficiency and accuracy. However, integrating Mamba directly into SRS poses several challenges. Its inherently unidirectional nature may constrain the model’s capacity to capture the full context of user-item interactions, while its instability in state estimation can compromise its ability to detect short-term patterns within interaction sequences. To overcome these issues, we introduce a new framework named SelectIve Gated MAmba (SIGMA). This framework leverages a Partially Flipped Mamba (PF-Mamba) to construct a bidirectional architecture specifically tailored to improve contextual modeling. Additionally, an input-sensitive Dense Selective Gate (DS Gate) is employed to optimize directional weights and enhance the processing of sequential information in PF-Mamba. For short sequence modeling, we have also developed a Feature Extract GRU (FE-GRU) to efficiently capture short-term dependencies. Empirical results indicate that SIGMA outperforms current models on five real-world datasets. Our implementation code is available at this https URL to ease reproducibility.

[AI-43] Enabling Small Models for Zero-Shot Classification through Model Label Learning

Link: https://arxiv.org/abs/2408.11449
Authors: Jia Zhang, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Keywords: CLIP have demonstrated, demonstrated impressive zero-shot, suffer inferior performance, Vision-language models, image classification tasks
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot ability in image classification tasks by aligning text and images but suffer inferior performance compared with task-specific expert models. On the contrary, expert models excel in their specialized domains but lack zero-shot ability for new tasks. How to obtain both the high performance of expert models and zero-shot ability is an important research direction. In this paper, we attempt to demonstrate that by constructing a model hub and aligning models with their functionalities using model labels, new tasks can be solved in a zero-shot manner by effectively selecting and reusing models in the hub. We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities through a Semantic Directed Acyclic Graph (SDAG) and leverages an algorithm, Classification Head Combination Optimization (CHCO), to select capable models for new tasks. Compared with the foundation model paradigm, it is less costly and more scalable, i.e., the zero-shot ability grows with the sizes of the model hub. Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL, demonstrating that expert models can be effectively reused for zero-shot tasks. Our code will be released publicly.

[AI-44] Lookism: The overlooked bias in computer vision ECCV2024

Link: https://arxiv.org/abs/2408.11448
Authors: Aditya Gulati, Bruno Lepri, Nuria Oliver
Keywords: socially relevant applications, computer vision, recent years, relevant applications, security screening
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Paper accepted at the ECCV 2024 workshop named “Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing (FAILED)”, this https URL

Abstract:In recent years, there have been significant advancements in computer vision which have led to the widespread deployment of image recognition and generation systems in socially relevant applications, from hiring to security screening. However, the prevalence of biases within these systems has raised significant ethical and social concerns. The most extensively studied biases in this context are related to gender, race and age. Yet, other biases are equally pervasive and harmful, such as lookism, i.e., the preferential treatment of individuals based on their physical appearance. Lookism remains under-explored in computer vision but can have profound implications not only by perpetuating harmful societal stereotypes but also by undermining the fairness and inclusivity of AI technologies. Thus, this paper advocates for the systematic study of lookism as a critical bias in computer vision models. Through a comprehensive review of existing literature, we identify three areas of intersection between lookism and computer vision. We illustrate them by means of examples and a user study. We call for an interdisciplinary approach to address lookism, urging researchers, developers, and policymakers to prioritize the development of equitable computer vision systems that respect and reflect the diversity of human appearances.

[AI-45] Epistemic Injustice in Generative AI

Link: https://arxiv.org/abs/2408.11441
Authors: Jackie Kay, Atoosa Kasirzadeh, Shakir Mohamed
Keywords: posing a significant, paper investigates, potentially undermine, processes we rely, significant threat
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper investigates how generative AI can potentially undermine the integrity of collective knowledge and the processes we rely on to acquire, assess, and trust information, posing a significant threat to our knowledge ecosystem and democratic discourse. Grounded in social and political philosophy, we introduce the concept of generative algorithmic epistemic injustice. We identify four key dimensions of this phenomenon: amplified and manipulative testimonial injustice, along with hermeneutical ignorance and access injustice. We illustrate each dimension with real-world examples that reveal how generative AI can produce or amplify misinformation, perpetuate representational harm, and create epistemic inequities, particularly in multilingual contexts. By highlighting these injustices, we aim to inform the development of epistemically just generative AI systems, proposing strategies for resistance, system design principles, and two approaches that leverage generative AI to foster a more equitable information ecosystem, thereby safeguarding democratic values and the integrity of knowledge production.

[AI-46] Towards Aligned Data Removal via Twin Machine Unlearning

Link: https://arxiv.org/abs/2408.11433
Authors: Yuyao Sun, Zhenxing Niu, Gang Hua, Rong Jin
Keywords: Modern privacy regulations, Modern privacy, machine unlearning, Twin Machine Unlearning, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern privacy regulations have spurred the evolution of machine unlearning, a technique that enables the removal of data from an already trained ML model without requiring retraining from scratch. Previous unlearning methods tend to induce the model to achieve the lowest classification accuracy on the removal data. Nonetheless, the authentic objective of machine unlearning is to align the unlearned model with the gold model, i.e., achieving the same classification accuracy as the gold model. For this purpose, we present a Twin Machine Unlearning (TMU) approach, where a twin unlearning problem is defined corresponding to the original unlearning problem. As a result, the generalization-label predictor trained on the twin problem can be transferred to the original problem, facilitating aligned data removal. Comprehensive empirical experiments illustrate that our approach significantly enhances the alignment between the unlearned model and the gold model. Meanwhile, our method allows data removal without compromising the model accuracy.

[AI-47] Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning

Link: https://arxiv.org/abs/2408.11431
Authors: Kai Xiong, Xiao Ding, Li Du, Jiahao Ying, Ting Liu, Bing Qin, Yixin Cao
Keywords: Large Language Models, extensive unlabeled text, demonstrate impressive generalization, impressive generalization ability, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Large Language Models (LLMs) are versatile and demonstrate impressive generalization ability by mining and learning information from extensive unlabeled text. However, they still exhibit reasoning mistakes, often stemming from knowledge deficiencies, which can affect their trustworthiness and reliability. Although users can provide diverse and comprehensive queries, obtaining sufficient and effective feedback is demanding. Furthermore, evaluating LLMs comprehensively with limited labeled samples is difficult. This makes it a challenge to diagnose and remedy the deficiencies of LLMs through rich label-free user queries. To tackle this challenge, we propose a label-free curricular meaningful learning framework (LaMer). LaMer first employs relative entropy to automatically diagnose and quantify the knowledge deficiencies of LLMs in a label-free setting. Next, to remedy the diagnosed knowledge deficiencies, we apply curricular meaningful learning: first, we adopt meaningful learning to adaptively synthesize augmentation data according to the severity of the deficiencies, and then design a curricular deficiency remedy strategy to remedy the knowledge deficiencies of LLMs progressively. Experiments show that LaMer efficiently and effectively diagnoses and remedies knowledge deficiencies in LLMs, improving various LLMs across seven out-of-distribution (OOD) reasoning and language understanding benchmarks, achieving comparable results to baselines with just 40% training data. LaMer even surpasses methods that rely on labeled datasets for deficiency diagnosis. In application, our label-free method can offer an effective knowledge deficiency diagnostic tool for efficient LLM development.
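
The abstract above says LaMer uses relative entropy to quantify knowledge deficiencies but does not say which distributions are compared. Purely as a reminder of the quantity involved, the sketch below computes relative entropy (Kullback-Leibler divergence) between two discrete distributions. The function name and the two example distributions are hypothetical, and this is not the paper's diagnostic procedure.

```python
import math


def relative_entropy(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p[i] == 0 contribute nothing. If q[i] == 0 while p[i] > 0,
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


# Hypothetical output distributions over the same two candidate answers.
reference = [0.5, 0.5]
shifted = [0.9, 0.1]
gap = relative_entropy(reference, shifted)
```

Relative entropy is not symmetric, so swapping the two arguments generally gives a different value.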

[AI-48] Long-Range Vision-Based UAV-assisted Localization for Unmanned Surface Vehicles

Link: https://arxiv.org/abs/2408.11429
Authors: Waseem Akram, Siyuan Yang, Hailiang Kuang, Xiaoyu He, Muhayy Ud Din, Yihao Dong, Defu Lin, Lakmal Seneviratne, Shaoming He, Irfan Hussain
Keywords: unmanned surface vehicles, global positioning system, Unmanned Aerial Vehicle, UAV, global positioning
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The global positioning system (GPS) has become an indispensable navigation method for field operations with unmanned surface vehicles (USVs) in marine environments. However, GPS may not always be available outdoors because it is vulnerable to natural interference and malicious jamming attacks. Thus, an alternative navigation system is required when the use of GPS is restricted or prohibited. To this end, we present a novel method that utilizes an Unmanned Aerial Vehicle (UAV) to assist in localizing USVs in GNSS-restricted marine environments. In our approach, the UAV flies along the shoreline at a consistent altitude, continuously tracking and detecting the USV using a deep learning-based approach on camera images. Subsequently, triangulation techniques are applied to estimate the USV’s position relative to the UAV, utilizing geometric information and datalink range from the UAV. We propose adjusting the UAV’s camera angle based on the pixel error between the USV and the image center throughout the localization process to enhance accuracy. Additionally, visual measurements are integrated into an Extended Kalman Filter (EKF) for robust state estimation. To validate our proposed method, we utilize a USV equipped with onboard sensors and a UAV equipped with a camera. A heterogeneous robotic interface is established to facilitate communication between the USV and UAV. We demonstrate the efficacy of our approach through a series of experiments conducted during the “Muhammad Bin Zayed International Robotic Challenge (MBZIRC-2024)” in real marine environments, incorporating noisy measurements and ocean disturbances. The successful outcomes indicate the potential of our method to complement GPS for USV navigation.

[AI-49] Towards “Differential AI Psychology” and in-context Value-driven Statement Alignment with Moral Foundations Theory

Link: https://arxiv.org/abs/2408.11415
Authors: Simon Münker
Keywords: Contemporary research, increasingly utilizing, sciences is increasingly, language models, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 tables

Abstract:Contemporary research in social sciences is increasingly utilizing state-of-the-art statistical language models to annotate or generate content. While these models perform benchmark-leading on common language tasks and show exemplary task-independent emergent abilities, transferring them to novel out-of-domain tasks is only insufficiently explored. The implications of the statistical black-box approach - stochastic parrots - are prominently criticized in the language model research community; however, the significance for novel generative tasks is not. This work investigates the alignment between personalized language models and survey participants on a Moral Foundation Theory questionnaire. We adapt text-to-text models to different political personas and survey the questionnaire repetitively to generate a synthetic population of persona and model combinations. Analyzing the intra-group variance and cross-alignment shows significant differences across models and personas. Our findings indicate that adapted models struggle to represent the survey-captured assessment of political ideologies. Thus, using language models to mimic social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes. Without quantifiable alignment, generating politically nuanced content remains unfeasible. To enhance these representations, we propose a testable framework to generate agents based on moral value statements for future research.

[AI-50] Revisiting FunnyBirds evaluation framework for prototypical parts networks

Link: https://arxiv.org/abs/2408.11401
Authors: Szymon Opłatek, Dawid Rymarczyk, Bartosz Zieliński
Keywords: post-hoc methods, popular due, produce more genuine, Prototypical parts networks, metric scores
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at 2nd XAI World Conference

Abstract:Prototypical parts networks, such as ProtoPNet, became popular due to their potential to produce more genuine explanations than post-hoc methods. However, for a long time, this potential has been strictly theoretical, and no systematic studies have existed to support it. That changed recently with the introduction of the FunnyBirds benchmark, which includes metrics for evaluating different aspects of explanations. However, this benchmark employs attribution map visualizations for all explanation techniques except ProtoPNet, for which bounding boxes are used. This choice significantly influences the metric scores and calls into question the conclusions stated in the FunnyBirds publication. In this study, we comprehensively compare metric scores obtained for two types of ProtoPNet visualizations: bounding boxes and similarity maps. Our analysis indicates that employing similarity maps aligns better with the essence of ProtoPNet, as evidenced by different metric scores obtained from FunnyBirds. Therefore, we advocate using similarity maps as a visualization technique for prototypical parts networks in explainability evaluation benchmarks.

[AI-51] Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features ECML/PKDD 2024

Link: https://arxiv.org/abs/2408.11384
Authors: Hiba Najjar, Marlon Nuske, Andreas Dengel
Keywords: machine learning models, extensively leveraged, leveraged to enhance, machine learning, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at MACLEAN workshop, ECML/PKDD 2024

Abstract:The availability of temporal geospatial data in multiple modalities has been extensively leveraged to enhance the performance of machine learning models. While efforts on the design of adequate model architectures are approaching a level of saturation, focusing on a data-centric perspective can complement these efforts to achieve further enhancements in data usage efficiency and model generalization capacities. This work contributes to this direction. We leverage model explanation methods to identify the features crucial for the model to reach optimal performance and the smallest set of features sufficient to achieve this performance. We evaluate our approach on three temporal multimodal geospatial datasets and compare multiple model explanation techniques. Our results reveal that some datasets can reach their optimal accuracy with less than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.

[AI-52] Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models

Link: https://arxiv.org/abs/2408.11380
Authors: Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Naoto Tsukamoto, Kei Okada, Masayuki Inaba
Keywords: Localization and Mapping, Simultaneous Localization, prior map construction, reinforcement learning, map construction
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted at Advanced Robotics, website - this https URL

Abstract:Various robot navigation methods have been developed, but they are mainly based on Simultaneous Localization and Mapping (SLAM), reinforcement learning, etc., which require prior map construction or learning. In this study, we consider the simplest method that does not require any map construction or learning, and execute open-vocabulary navigation of robots without any prior knowledge to do this. We applied an omnidirectional camera and pre-trained vision-language models to the robot. The omnidirectional camera provides a uniform view of the surroundings, thus eliminating the need for complicated exploratory behaviors including trajectory generation. By applying multiple pre-trained vision-language models to this omnidirectional image and incorporating reflective behaviors, we show that navigation becomes simple and does not require any prior setup. Interesting properties and limitations of our method are discussed based on experiments with the mobile robot Fetch.

[AI-53] Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation

Link: https://arxiv.org/abs/2408.11372
Authors: Hao Wang, Yongqiang Han, Kefan Wang, Kai Cheng, Zhen Wang, Wei Guo, Yong Liu, Defu Lian, Enhong Chen
Keywords: interacting with items, recommendation systems, Efficient Behavior Miner, Multi-Behavior Sequential Recommendation, enhance recommendation performance
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the realm of recommendation systems, users exhibit a diverse array of behaviors when interacting with items. This phenomenon has spurred research into learning the implicit semantic relationships between these behaviors to enhance recommendation performance. However, these methods often entail high computational complexity. To address concerns regarding efficiency, pre-training presents a viable solution. Its objective is to extract knowledge from extensive pre-training data and fine-tune the model for downstream tasks. Nevertheless, previous pre-training methods have primarily focused on single-behavior data, while multi-behavior data contains significant noise. Additionally, the fully fine-tuning strategy adopted by these methods still imposes a considerable computational burden. In response to this challenge, we propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation. Specifically, in the pre-training stage, we commence by proposing a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales, thereby facilitating the comprehension of the contextual semantics of multi-behavior sequences. Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module, which generates personalized, progressive, and diverse prompts to fully exploit the potential of the pre-trained model effectively. Extensive experiments on three real-world datasets have unequivocally demonstrated that DPCPL not only exhibits high efficiency and effectiveness, requiring minimal parameter adjustments but also surpasses the state-of-the-art performance across a diverse range of downstream tasks.

[AI-54] Solving Decision Theory Problems with Probabilistic Answer Set Programming

Link: https://arxiv.org/abs/2408.11371
Authors: Damiano Azzolini, Elena Bellodi, Rafael Kiesel, Fabrizio Riguzzi
Keywords: decision theory problem, Probabilistic Answer Set, Answer Set Programming, Algebraic Model Counting, finding the actions
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Under consideration in Theory and Practice of Logic Programming (TPLP)

Abstract:Solving a decision theory problem usually involves finding the actions, among a set of possible ones, which optimize the expected reward, possibly accounting for the uncertainty of the environment. In this paper, we introduce the possibility to encode decision theory problems with Probabilistic Answer Set Programming under the credal semantics via decision atoms and utility attributes. To solve the task we propose an algorithm based on three layers of Algebraic Model Counting, that we test on several synthetic datasets against an algorithm that adopts answer set enumeration. Empirical results show that our algorithm can manage non trivial instances of programs in a reasonable amount of time. Under consideration in Theory and Practice of Logic Programming (TPLP).

[AI-55] Graph Classification via Reference Distribution Learning: Theory and Practice

Link: https://arxiv.org/abs/2408.11370
Authors: Zixiao Wang, Jicong Fan
Keywords: challenging problem owing, Reference Distribution Learning, challenging problem, problem owing, difficulty in quantifying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph classification is a challenging problem owing to the difficulty in quantifying the similarity between graphs or representing graphs as vectors, though there have been a few methods using graph kernels or graph neural networks (GNNs). Graph kernels often suffer from computational costs and manual feature engineering, while GNNs commonly utilize global pooling operations, risking the loss of structural or semantic information. This work introduces Graph Reference Distribution Learning (GRDL), an efficient and accurate graph classification method. GRDL treats each graph’s latent node embeddings given by GNN layers as a discrete distribution, enabling direct classification without global pooling, based on maximum mean discrepancy to adaptively learned reference distributions. To fully understand this new model (the existing theories do not apply) and guide its configuration (e.g., network architecture, references’ sizes, number, and regularization) for practical use, we derive generalization error bounds for GRDL and verify them numerically. More importantly, our theoretical and numerical results both show that GRDL has a stronger generalization ability than GNNs with global pooling operations. Experiments on moderate-scale and large-scale graph datasets show the superiority of GRDL over the state-of-the-art, emphasizing its remarkable efficiency, being at least 10 times faster than leading competitors in both training and inference stages.
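
The abstract above describes classifying a graph by the maximum mean discrepancy (MMD) between its node embeddings and learned reference distributions. The sketch below shows one simple reading of that idea: compute a biased squared-MMD estimate with a Gaussian kernel against each reference set and pick the closest. The kernel choice, the bandwidth `sigma`, the toy embeddings, and the names `mmd2` and `classify` are assumptions for illustration. In the paper the references are learned, whereas here they are fixed by hand.

```python
import math


def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    def mean_k(us, vs):
        return sum(rbf(u, v, sigma) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def classify(embeddings, references, sigma=1.0):
    """Return the label whose reference set is closest in squared MMD."""
    return min(references, key=lambda label: mmd2(embeddings, references[label], sigma))


# Hand-written stand-ins for learned reference distributions (hypothetical).
references = {
    "class_a": [[0.0, 0.0], [0.1, 0.0]],
    "class_b": [[3.0, 3.0], [3.1, 2.9]],
}
# Hypothetical node embeddings of one graph, treated as a discrete distribution.
graph_embeddings = [[0.05, 0.02], [0.0, 0.1], [0.1, 0.1]]
predicted = classify(graph_embeddings, references)
```

Because the graph is compared as a set of node embeddings, no global pooling step is needed, and graphs with different numbers of nodes can be compared against the same references.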

[AI-56] Towards Probabilistic Inductive Logic Programming with Neurosymbolic Inference and Relaxation

Link: https://arxiv.org/abs/2408.11367
Authors: Fieke Hillerstrom, Gertjan Burghouts
Keywords: inductive logic programming, probabilistic background knowledge, logic programming, methods are incapable, coming from sensory
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages

Abstract:Many inductive logic programming (ILP) methods are incapable of learning programs from probabilistic background knowledge, e.g. coming from sensory data or neural networks with probabilities. We propose Propper, which handles flawed and probabilistic background knowledge by extending ILP with a combination of neurosymbolic inference, a continuous criterion for hypothesis selection (BCE) and a relaxation of the hypothesis constrainer (NoisyCombo). For relational patterns in noisy images, Propper can learn programs from as few as 8 examples. It outperforms binary ILP and statistical models such as a Graph Neural Network.

[AI-57] ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Link: https://arxiv.org/abs/2408.11363
Authors: Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang
Keywords: Understanding biological processes, biotechnological advancements requires, advancements requires detailed, requires detailed analysis, Understanding biological
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 19 pages, 9 figures, 5 tables

Abstract:Understanding biological processes, drug development, and biotechnological advancements requires detailed analysis of protein structures and sequences, a task in protein research that is inherently complex and time-consuming when performed manually. To streamline this process, we introduce ProteinGPT, a state-of-the-art multi-modal protein chat system, that allows users to upload protein sequences and/or structures for comprehensive protein analysis and responsive inquiries. ProteinGPT seamlessly integrates protein sequence and structure encoders with linear projection layers for precise representation adaptation, coupled with a large language model (LLM) to generate accurate and contextually relevant responses. To train ProteinGPT, we construct a large-scale dataset of 132,092 proteins with annotations, and optimize the instruction-tuning process using GPT-4o. This innovative system ensures accurate alignment between the user-uploaded data and prompts, simplifying protein analysis. Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.

[AI-58] Hypergraph Learning based Recommender System for Anomaly Detection, Control and Optimization

Link: https://arxiv.org/abs/2408.11359
Authors: Sakhinana Sagar Srinivas,Rajat Kumar Sarkar,Venkataramana Runkana
Keywords: anomaly detection framework, Anomaly detection, self-adapting anomaly detection, challenging problem, applications in industry
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures, Accepted at IEEE International Conference on Big Data 2022, Osaka, Japan

Abstract: Anomaly detection is a fundamental yet challenging problem with practical applications in industry. Current approaches neglect the higher-order dependencies within the networks of interconnected sensors in high-dimensional time series (multisensor data) for anomaly detection. To this end, we present a self-adapting anomaly detection framework for joint learning of (a) the discrete hypergraph structure and (b) the temporal trends and spatial relations among the interdependent sensors, using a hierarchical encoder-decoder architecture. The hypergraph representation learning-based framework exploits the relational inductive biases in the hypergraph-structured data to learn pointwise single-step-ahead forecasts through a self-supervised autoregressive task and predicts anomalies based on the forecast error. Furthermore, our framework incentivizes learning the anomaly-diagnosis ontology through a differentiable approach. It derives the anomaly information propagation-based computational hypergraphs for root cause analysis and provides recommendations through an offline, optimal predictive control policy to remedy an anomaly. We conduct extensive experiments to evaluate the proposed method on the benchmark datasets for fair and rigorous comparison with the popular baselines. The proposed method outperforms the baseline models and achieves SOTA performance. We report the ablation studies to support the efficacy of the framework.
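
The forecast-error idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's method: a naive "previous value" forecaster stands in for the hypergraph model, and the threshold rule (mean plus `k` standard deviations of the errors on normal data) is an assumption chosen for illustration. All function names and the sample readings are hypothetical.

```python
from statistics import mean, stdev


def forecast_errors(series):
    """Absolute error of a naive one-step-ahead forecast (predict the previous value).

    The naive forecaster is a stand-in; the paper learns forecasts with a
    hypergraph-based encoder-decoder instead.
    """
    return [abs(actual - previous) for previous, actual in zip(series, series[1:])]


def fit_threshold(normal_series, k=3.0):
    """Calibrate an error threshold on anomaly-free data (assumed rule: mean + k * std)."""
    errors = forecast_errors(normal_series)
    return mean(errors) + k * stdev(errors)


def detect_anomalies(series, threshold):
    """Return the indices in `series` whose forecast error exceeds the threshold."""
    return [i + 1 for i, err in enumerate(forecast_errors(series)) if err > threshold]


# Hypothetical sensor readings: a calm calibration run and a run with one spike.
normal = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.0]
observed = [10.0, 10.1, 9.9, 15.0, 10.0, 10.1]

threshold = fit_threshold(normal)
anomalies = detect_anomalies(observed, threshold)
print(anomalies)
```

With the naive forecaster both the spike and the step back to normal are flagged, because each produces a large one-step error; a learned forecaster would change which points exceed the threshold.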

[AI-59] One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning

Link: https://arxiv.org/abs/2408.11356
Authors: Kelei He,Tiejun Dong,Jinhui Wu,Junfeng Zhang
Keywords: Understanding the structure, Existing virtual structure, Understanding, virtual structure measurement, protein-ligand complex
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract: Understanding the structure of the protein-ligand complex is crucial to drug development. Existing virtual structure measurement and screening methods are dominated by docking and its derived methods combined with deep learning. However, the sampling and scoring methodologies have largely restricted their accuracy and efficiency. Here, we show that these two fundamental tasks can be accurately tackled with a single model, namely LigPose, based on multi-task geometric deep learning. By representing the ligand and the protein pair as a graph, LigPose directly optimizes the three-dimensional structure of the complex, with the learning of binding strength and atomic interactions as auxiliary tasks, enabling its one-step prediction ability without docking tools. Extensive experiments show LigPose achieved state-of-the-art performance on major tasks in drug research. Its considerable improvements indicate a promising paradigm for AI-based drug development pipelines.

[AI-60] Vision HgNN: An Electron-Micrograph is Worth Hypergraph of Hypernodes ICLR

Link: https://arxiv.org/abs/2408.11351
Authors: Sakhinana Sagar Srinivas,Rajat Kumar Sarkar,Sreeja Gangasani,Venkataramana Runkana
Keywords: electron micrographs, crucial but challenging, challenging task, task with applications, quantum materials
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, Accepted in PML4DC Workshop at International Conference on Learning Representations (ICLR) 2023

Abstract: Material characterization using electron micrographs is a crucial but challenging task with applications in various fields, such as semiconductors, quantum materials, batteries, etc. The challenges in categorizing electron micrographs include but are not limited to the complexity of patterns, high level of detail, and imbalanced data distribution (long-tail distribution). Existing methods have difficulty in modeling the complex relational structure in electron micrographs, hindering their ability to effectively capture the complex relationships between different spatial regions of micrographs. We propose a hypergraph neural network (HgNN) backbone architecture, a conceptually alternative approach, to better model the complex relationships in electron micrographs and improve material characterization accuracy. By utilizing cost-effective GPU hardware, our proposed framework outperforms popular baselines. The results of the ablation studies demonstrate that the proposed framework is effective in achieving state-of-the-art performance on benchmark datasets and efficient in terms of computational and memory requirements for handling large-scale electron micrograph-based datasets.

[AI-61] Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments

Link: https://arxiv.org/abs/2408.11347
Authors: Takanori Ugai,Kensho Hara,Shusaku Egami,Ken Fukuda
Keywords: create artificial video, artificial video data, development of Embodied, simulator to create, standardized annotations
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure, 1 table

Abstract: We used a 3D simulator to create artificial video data with standardized annotations, aiming to aid in the development of Embodied AI. Our question answering (QA) dataset measures the extent to which a robot can understand human behavior and the environment in a home setting. Preliminary experiments suggest our dataset is useful in measuring AI’s comprehension of daily life.

[AI-62] EHL*: Memory-Budgeted Indexing for Ultrafast Optimal Euclidean Pathfinding

Link: https://arxiv.org/abs/2408.11341
Authors: Jinchun Du,Bojie Shen,Muhammad Aamir Cheema
Keywords: Shortest Path Problem, Euclidean Shortest Path, Shortest Path, Euclidean Hub Labeling, Path Problem
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:The Euclidean Shortest Path Problem (ESPP), which involves finding the shortest path in a Euclidean plane with polygonal obstacles, is a classic problem with numerous real-world applications. The current state-of-the-art solution, Euclidean Hub Labeling (EHL), offers ultra-fast query performance, outperforming existing techniques by 1-2 orders of magnitude in runtime efficiency. However, this performance comes at the cost of significant memory overhead, requiring up to tens of gigabytes of storage on large maps, which can limit its applicability in memory-constrained environments like mobile phones or smaller devices. Additionally, EHL’s memory usage can only be determined after index construction, and while it provides a memory-runtime tradeoff, it does not fully optimize memory utilization. In this work, we introduce an improved version of EHL, called EHL*, which overcomes these limitations. A key contribution of EHL* is its ability to create an index that adheres to a specified memory budget while optimizing query runtime performance. Moreover, EHL* can leverage preknown query distributions, a common scenario in many real-world applications to further enhance runtime efficiency. Our results show that EHL* can reduce memory usage by up to 10-20 times without much impact on query runtime performance compared to EHL, making it a highly effective solution for optimal pathfinding in memory-constrained environments.

[AI-63] Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

Link: https://arxiv.org/abs/2408.11338
Authors: Minghao Liu,Zonglin Di,Jiaheng Wei,Zhongruo Wang,Hengxiang Zhang,Ruixuan Xiao,Haoyu Wang,Jinlong Pang,Hao Chen,Ankit Shah,Hongxin Wei,Xinlei He,Zhaowei Zhao,Haobo Wang,Lei Feng,Jindong Wang,James Davis,Yang Liu
Keywords: Large-scale data collection, developing personalized training, fine-tuning specialized models, Large-scale data, mitigating the shortage
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a challenge due to annotation errors and the substantial time and costs associated with human labor. To address these issues, we propose Automatic Dataset Construction (ADC), an innovative methodology that automates dataset creation with negligible cost and high efficiency. Taking the image classification task as a starting point, ADC leverages LLMs for the detailed class design and code generation to collect relevant samples via search engines, significantly reducing the need for manual annotation and speeding up the data generation process. Despite these advantages, ADC also encounters real-world challenges such as label errors (label noise) and imbalanced data distributions (label bias). We provide open-source software that incorporates existing methods for label error detection and for robust learning under noisy and biased data, ensuring higher-quality training data and a more robust model training procedure. Furthermore, we design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning. These datasets are vital because there are few existing datasets specifically for label noise detection, despite its importance. Finally, we evaluate the performance of existing popular methods on these datasets, thereby facilitating further research in the field.

[AI-64] BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports

Link: https://arxiv.org/abs/2408.11334
Authors: Yuxuan Chen,Haoyan Yang,Hengkai Pan,Fardeen Siddiqui,Antonio Verdone,Qingyang Zhang,Sumit Chopra,Chen Zhao,Yiqiu Shen
Keywords: Breast ultrasound, reports summarizing key, summarizing key findings, diagnosing abnormalities, malignancy assessments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted as an oral paper for the HCHM workshop, ACM Multimedia 2024

Abstract:Breast ultrasound is essential for detecting and diagnosing abnormalities, with radiology reports summarizing key findings like lesion characteristics and malignancy assessments. Extracting this critical information is challenging due to the unstructured nature of these reports, with varied linguistic styles and inconsistent formatting. While proprietary LLMs like GPT-4 are effective, they are costly and raise privacy concerns when handling protected health information. This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports. We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it. Evaluated on clinician-annotated reports, our model achieves an average F1 score of 84.6%, which is on par with GPT-4. Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4’s performance but also offers cost reductions and enhanced data privacy.

[AI-65] Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Link: https://arxiv.org/abs/2408.11327
Authors: Sai Koneru,Matthias Huck,Miriam Exel,Jan Niehues
Keywords: Recent advancements, advancements in NLP, NLP have resulted, processing multimodal inputs, specific domains
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract: Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality. We will release the code upon paper acceptance.
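
To make the word-level combination concrete, here is a small Python sketch under stated assumptions. It scores complete hypotheses rather than re-ranking beams during decoding as the paper does, it assumes both models mark word starts with the SentencePiece-style `▁` prefix, and the tokenizations, log-probabilities, and the 0.5 interpolation weight are invented for illustration.

```python
WORD_START = "\u2581"  # SentencePiece-style word-boundary marker (assumed convention)


def word_level_scores(tokens, logprobs):
    """Merge subword log-probabilities into one log-probability per word."""
    words, scores = [], []
    for token, logprob in zip(tokens, logprobs):
        if token.startswith(WORD_START) or not words:
            # A boundary marker (or the very first token) opens a new word.
            words.append(token.lstrip(WORD_START))
            scores.append(logprob)
        else:
            # Continuation piece: extend the current word and add its score.
            words[-1] += token
            scores[-1] += logprob
    return words, scores


def combined_score(hyp_a, hyp_b, weight=0.5):
    """Interpolate two models' word-level scores for the same hypothesis text."""
    words_a, scores_a = word_level_scores(*hyp_a)
    words_b, scores_b = word_level_scores(*hyp_b)
    if words_a != words_b:
        raise ValueError("both tokenizations must spell the same words")
    return sum(weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b))


# Hypothetical scores: model A splits the last word into two pieces, model B does not.
candidates = {
    "the cat": (
        ([WORD_START + "the", WORD_START + "c", "at"], [-0.1, -0.5, -0.4]),
        ([WORD_START + "the", WORD_START + "cat"], [-0.2, -0.3]),
    ),
    "the car": (
        ([WORD_START + "the", WORD_START + "c", "ar"], [-0.1, -0.5, -0.3]),
        ([WORD_START + "the", WORD_START + "car"], [-0.2, -1.5]),
    ),
}
best = max(candidates, key=lambda text: combined_score(*candidates[text]))
print(best)
```

In this toy data model A alone slightly prefers "the car", while the combined word-level score prefers "the cat"; aligning scores at word boundaries is what lets two models with different vocabularies be compared at all.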

[AI-66] Automating Thought of Search: A Journey Towards Soundness and Completeness

Link: https://arxiv.org/abs/2408.11326
Authors: Daniel Cao,Michael Katz,Harsha Kokel,Kavitha Srinivas,Shirin Sohrabi
Keywords: large language models, language models, standing bastions, bastions for large, turn their attention
Categories: Artificial Intelligence (cs.AI)

Abstract:Planning remains one of the last standing bastions for large language models (LLMs), which now turn their attention to search. Most of the literature uses the language models as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having the language models produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. At the same time LLMs have demonstrated significant progress in code generation and refinement for complex reasoning tasks. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.
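
The two search components the abstract refers to, a successor function and a goal test, can be sketched in Python for a toy domain. Everything here is an illustrative assumption: the two-jug puzzle, the hand-written functions (in ToS and AutoToS a language model produces them), and the single generic soundness check standing in for the paper's unit tests.

```python
from collections import deque

CAPACITY = (3, 5)  # hypothetical toy domain: two jugs holding 3 and 5 units
GOAL_AMOUNT = 4


def successors(state):
    """Successor function: all states reachable by one fill, empty, or pour."""
    a, b = state
    cap_a, cap_b = CAPACITY
    pour_ab = min(a, cap_b - b)  # amount that fits when pouring a into b
    pour_ba = min(b, cap_a - a)  # amount that fits when pouring b into a
    candidates = {
        (cap_a, b), (a, cap_b),        # fill either jug
        (0, b), (a, 0),                # empty either jug
        (a - pour_ab, b + pour_ab),    # pour a -> b
        (a + pour_ba, b - pour_ba),    # pour b -> a
    }
    candidates.discard(state)  # drop no-op moves
    return sorted(candidates)


def is_goal(state):
    """Goal test: some jug holds exactly the target amount."""
    return GOAL_AMOUNT in state


def bfs(start):
    """Breadth-first search driven only by `successors` and `is_goal`."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None


def check_successors_sound(states):
    """Generic check: every generated successor must respect the jug capacities."""
    for state in states:
        for a, b in successors(state):
            assert 0 <= a <= CAPACITY[0] and 0 <= b <= CAPACITY[1]


plan = bfs((0, 0))
check_successors_sound(plan)
print(plan)
```

Because the search algorithm itself is fixed and exhaustive, correctness reduces to the soundness and completeness of these two functions, which is why feedback from unit tests on them is a natural place to close the loop.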

[AI-67] Towards Evaluating Large Language Models on Sarcasm Understanding

Link: https://arxiv.org/abs/2408.11319
Authors: Yazhou Zhang,Chunwang Zou,Zheng Lian,Prayag Tiwari,Jing Qin
Keywords: sentiment analysis, text classification, large language models, successfully solved, era of large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: In the era of large language models (LLMs), “System I” tasks, i.e., the fast, unconscious, and intuitive tasks such as sentiment analysis and text classification, have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices like hyperbole and figuration to convey true sentiments and intentions, involving a higher level of abstraction than sentiment analysis. There is growing concern that the argument about LLMs’ success may not be fully tenable when considering sarcasm understanding. To address this question, we select eleven SOTA LLMs and eight SOTA pre-trained language models (PLMs) and present comprehensive evaluations on six widely used benchmark datasets through different prompting approaches, i.e., zero-shot input/output (IO) prompting, few-shot IO prompting, and chain of thought (CoT) prompting. Our results highlight three key findings: (1) Current LLMs underperform supervised PLM-based sarcasm detection baselines across six sarcasm benchmarks. This suggests that significant efforts are still required to improve LLMs’ understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms other LLMs across various prompting methods, with an average improvement of 14.0%. Claude 3 and ChatGPT demonstrate the next best performance after GPT-4. (3) The few-shot IO prompting method outperforms the other two methods: zero-shot IO and few-shot CoT. The reason is that sarcasm detection, being a holistic, intuitive, and non-rational cognitive process, is argued not to adhere to step-by-step logical reasoning, making CoT less effective in understanding sarcasm compared to its effectiveness in mathematical reasoning tasks.

[AI-68] Probabilistic Medical Predictions of Large Language Models

Link: https://arxiv.org/abs/2408.11316
Authors: Bowen Gu,Rishi J. Desai,Kueiyu Joshua Lin,Jie Yang
Keywords: Large Language Models, Large Language, Language Models, demonstrated significant potential, demonstrated significant
Categories: Artificial Intelligence (cs.AI)
Comments: 58 pages, 3 figures, 3 tables, Submitted to Nature Communications

Abstract:Large Language Models (LLMs) have demonstrated significant potential in clinical applications through prompt engineering, which enables the generation of flexible and diverse clinical predictions. However, they pose challenges in producing prediction probabilities, which are essential for transparency and allowing clinicians to apply flexible probability thresholds in decision-making. While explicit prompt instructions can lead LLMs to provide prediction probability numbers through text generation, LLMs’ limitations in numerical reasoning raise concerns about the reliability of these text-generated probabilities. To assess this reliability, we compared explicit probabilities derived from text generation to implicit probabilities calculated based on the likelihood of predicting the correct label token. Experimenting with six advanced open-source LLMs across five medical datasets, we found that the performance of explicit probabilities was consistently lower than implicit probabilities with respect to discrimination, precision, and recall. Moreover, these differences were enlarged on small LLMs and imbalanced datasets, emphasizing the need for cautious interpretation and applications, as well as further research into robust probability estimation methods for LLMs in clinical contexts.
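
The distinction between the two kinds of probability can be sketched in Python. This is a hedged illustration, not the paper's procedure: the implicit probability is taken here as a softmax over the logits of the candidate label tokens (an assumed normalization), the explicit probability is parsed from generated text with a simple regular expression, and the logits and model outputs are invented.

```python
import math
import re


def implicit_probability(label_logits, positive_label):
    """Softmax over the candidate label-token logits; returns P(positive_label)."""
    max_logit = max(label_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(logit - max_logit) for label, logit in label_logits.items()}
    return exps[positive_label] / sum(exps.values())


def explicit_probability(generated_text):
    """Parse the first number in the model's text as a probability in [0, 1].

    Percentages are rescaled; anything unparseable or out of range yields None.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(%?)", generated_text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0
    return value if 0.0 <= value <= 1.0 else None


# Hypothetical next-token logits for the two label tokens, and a hypothetical reply.
p_implicit = implicit_probability({"yes": 2.0, "no": 0.0}, "yes")
p_explicit = explicit_probability("The probability of readmission is 70%.")
print(round(p_implicit, 4), p_explicit)
```

The implicit value comes straight from the model's output distribution, while the explicit value depends on the model writing a sensible number, which is the reliability question the abstract raises.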

[AI-69] Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

Link: https://arxiv.org/abs/2408.11313
Authors: Weipeng Jiang,Zhenting Wang,Juan Zhai,Shiqing Ma,Zhengyu Zhao,Chao Shen
Keywords: prior safety alignment, safety alignment efforts, Greedy Coordinate Gradient, prior safety, safety alignment
Categories: Artificial Intelligence (cs.AI)

Abstract: Despite prior safety alignment efforts, mainstream LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former requires significant manual effort and domain knowledge, while the latter, exemplified by Greedy Coordinate Gradient (GCG), which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also encounters several limitations: requiring white-box access, necessitating pre-constructed affirmative phrases, and suffering from low efficiency. In this paper, we present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. Drawing inspiration from LLMs’ powerful generation and optimization capabilities, we employ task prompts to translate jailbreaking goals into natural language instructions. This guides the LLM to generate adversarial suffixes for malicious queries. In particular, a harmfulness scorer provides continuous feedback, enabling LLM self-reflection and iterative optimization to autonomously and efficiently produce effective suffixes. Experimental results demonstrate that ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, surpassing GCG by a factor of 2.4. Moreover, ECLIPSE is on par with template-based methods in ASR while offering superior attack efficiency, reducing the average attack overhead by 83%.

[AI-70] Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework

Link: https://arxiv.org/abs/2408.11312
Authors: Xiao Han,Chen Zhu,Xiangyu Zhao,Hengshu Zhu
Keywords: geographic locations precisely, real-world geographic locations, geo-localization demands in-depth, advanced reasoning skills, demands in-depth knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Visual geo-localization demands in-depth knowledge and advanced reasoning skills to associate images with real-world geographic locations precisely. In general, traditional methods based on data-matching are hindered by the impracticality of storing adequate visual records of global landmarks. Recently, Large Vision-Language Models (LVLMs) have demonstrated the capability of geo-localization through Visual Question Answering (VQA), enabling a solution that does not require external geo-tagged image records. However, the performance of a single LVLM is still limited by its intrinsic knowledge and reasoning capabilities. Along this line, in this paper, we introduce a novel visual geo-localization framework that integrates the inherent knowledge of multiple LVLM agents via inter-agent communication to achieve effective geo-localization of images. Furthermore, our framework employs a dynamic learning strategy to optimize the communication patterns among agents, reducing unnecessary discussions among agents and improving the efficiency of the framework. To validate the effectiveness of the proposed framework, we construct GeoGlobe, a novel dataset for visual geo-localization tasks. Extensive testing on the dataset demonstrates that our approach significantly outperforms state-of-the-art methods.

[AI-71] EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models

Link: https://arxiv.org/abs/2408.11308
Authors: Chongwen Zhao,Zhihao Dou,Kaizhu Huang
Keywords: Large Language Models, Large Language, increasingly attracting attention, Language Models, increasingly attracting
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 19 pages, 7 figures

Abstract: Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of “Alignment” technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as “Jailbreak.” Our research takes cues from the human-like generation process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model’s latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early transformer outputs of LLMs as a means to detect malicious inputs and terminate the generation immediately. Built upon this idea, we introduce a simple yet significant defense approach called EEG-Defender for LLMs. We conduct comprehensive experiments on ten jailbreak methods across three models. Our results demonstrate that EEG-Defender is capable of reducing the Attack Success Rate (ASR) by a significant margin, roughly 85% in comparison with 50% for the present SOTAs, with minimal impact on the utility and effectiveness of LLMs.

[AI-72] KAN4TSF: Are KAN and KAN-based models Effective for Time Series Forecasting?

Link: https://arxiv.org/abs/2408.11306
Authors: Xiao Han,Xinfeng Zhang,Yiling Wu,Zhenduo Zhang,Zhe Wu
Keywords: Time series forecasting, Time series, series forecasting, crucial task, task that predicts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasting is a crucial task that predicts the future values of variables based on historical data. Time series forecasting techniques have been developing in parallel with the machine learning community, from early statistical learning methods to current deep learning methods. Although existing methods have made significant progress, they still suffer from two challenges. The mathematical theory of mainstream deep learning-based methods does not establish a clear relation between network sizes and fitting capabilities, and these methods often lack interpretability. To this end, we introduce the Kolmogorov-Arnold Network (KAN) into time series forecasting research, which has better mathematical properties and interpretability. First, we propose the Reversible Mixture of KAN experts (RMoK) model, which is a KAN-based model for time series forecasting. RMoK uses a mixture-of-experts structure to assign variables to KAN experts. Then, we compare performance, integration, and speed between RMoK and various baselines on real-world datasets, and the experimental results show that RMoK achieves the best performance in most cases. And we find the relationship between temporal feature weights and data periodicity through visualization, which roughly explains RMoK’s mechanism. Thus, we conclude that KAN and KAN-based models (RMoK) are effective in time series forecasting. Code is available at KAN4TSF: this https URL.

[AI-73] UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Link: https://arxiv.org/abs/2408.11305
Authors: Xiangyu Zhao,Yuehan Zhang,Wenlong Zhang,Xiao-Ming Wu
Keywords: fashion domain, fashion domain encompasses, generation, fashion, encompasses a variety
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. The rapid advancements in artificial intelligence generated content, particularly in technologies like large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models in the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. In addition, current research on multi-task single models lacks focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at this https URL.

[AI-74] Offline Policy Learning via Skill-step Abstraction for Long-horizon Goal-Conditioned Tasks

Link: https://arxiv.org/abs/2408.11300
Authors: Donghoon Kim,Minjong Yoo,Honguk Woo
Keywords: policy learning, confronting long-horizon goals, policy, sparsity of rewards, long-horizon goals
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, International Joint Conference on Artificial Intelligence 2024, Published version

Abstract: Goal-conditioned (GC) policy learning often faces a challenge arising from the sparsity of rewards when confronting long-horizon goals. To address the challenge, we explore skill-based GC policy learning in offline settings, where skills are acquired from existing data and long-horizon goals are decomposed into sequences of near-term goals that align with these skills. Specifically, we present an “offline GC policy learning via skill-step abstraction” framework (GLvSA) tailored for tackling long-horizon GC tasks affected by goal distribution shifts. In the framework, a GC policy is progressively learned offline in conjunction with the incremental modeling of skill-step abstractions on the data. We also devise a GC policy hierarchy that not only accelerates GC policy learning within the framework but also allows for parameter-efficient fine-tuning of the policy. Through experiments with the maze and Franka kitchen environments, we demonstrate the superiority and efficiency of our GLvSA framework in adapting GC policies to a wide range of long-horizon goals. The framework achieves competitive zero-shot and few-shot adaptation performance, outperforming existing GC policy learning and skill-based methods.

[AI-75] Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks

Link: https://arxiv.org/abs/2408.11288
Authors: Yining Hua,Hongbin Na,Zehan Li,Fenglin Liu,Xiao Fang,David Clifton,John Torous
Keywords: Large language models, generate human-like responses, Large language, mental health care, offering scalable support
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) are emerging as promising tools for mental health care, offering scalable support through their ability to generate human-like responses. However, the effectiveness of these models in clinical settings remains unclear. This scoping review aimed to assess the current generative applications of LLMs in mental health care, focusing on studies where these models were tested with human participants in real-world scenarios. A systematic search across APA PsycNet, Scopus, PubMed, and Web of Science identified 726 unique articles, of which 17 met the inclusion criteria. These studies encompassed applications such as clinical assistance, counseling, therapy, and emotional support. However, the evaluation methods were often non-standardized, with most studies relying on ad hoc scales that limit comparability and robustness. Privacy, safety, and fairness were also frequently underexplored. Moreover, reliance on proprietary models, such as OpenAI’s GPT series, raises concerns about transparency and reproducibility. While LLMs show potential in expanding mental health care access, especially in underserved areas, the current evidence does not fully support their use as standalone interventions. More rigorous, standardized evaluations and ethical oversight are needed to ensure these tools can be safely and effectively integrated into clinical practice.

[AI-76] Inference Plans for Hybrid Particle Filtering

Link: https://arxiv.org/abs/2408.11283
Authors: Ellie Y. Cheng,Eric Atkinson,Guillaume Baudart,Louis Mandel,Michael Carbin
Keywords: Monte Carlo methods, Monte Carlo, Advanced probabilistic programming, combine symbolic exact, Advanced probabilistic
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)

Abstract:Advanced probabilistic programming languages (PPLs) use hybrid inference systems to combine symbolic exact inference and Monte Carlo methods to improve inference performance. These systems use heuristics to partition random variables within the program into variables that are encoded symbolically and variables that are encoded with sampled values, and the heuristics are not necessarily aligned with the performance evaluation metrics used by the developer. In this work, we present inference plans, a programming interface that enables developers to control the partitioning of random variables during hybrid particle filtering. We further present Siren, a new PPL that enables developers to use annotations to specify inference plans the inference system must implement. To assist developers with statically reasoning about whether an inference plan can be implemented, we present an abstract-interpretation-based static analysis for Siren for determining inference plan satisfiability. We prove the analysis is sound with respect to Siren’s semantics. Our evaluation applies inference plans to three different hybrid particle filtering algorithms on a suite of benchmarks and shows that the control provided by inference plans enables speed ups of 1.76x on average and up to 206x to reach target accuracy, compared to the inference plans implemented by default heuristics; the results also show that inference plans improve accuracy by 1.83x on average and up to 595x with less or equal runtime, compared to the default inference plans. We further show that the static analysis is precise in practice, identifying all satisfiable inference plans in 27 out of the 33 benchmark-algorithm combinations.

[AI-77] BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation

Link: https://arxiv.org/abs/2408.11281
Authors: Haotian Peng,Jiawei Liu,Jinsong Du,Jie Gao,Wei Wang
Keywords: framework leveraging large, leveraging large language, processing user prompts, unifies multiple bearing-related, multiple bearing-related tasks
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a bearing health management framework leveraging large language models (BearLLM), a novel multimodal model that unifies multiple bearing-related tasks by processing user prompts and vibration signals. Specifically, we introduce a prior knowledge-enhanced unified vibration signal representation to handle various working conditions across multiple datasets. This involves adaptively sampling the vibration signals based on the sampling rate of the sensor, incorporating the frequency domain to unify input dimensions, and using a fault-free reference signal as an auxiliary input. To extract features from vibration signals, we first train a fault classification network, then convert and align the extracted features into word embedding, and finally concatenate these with text embedding as input to an LLM. To evaluate the performance of the proposed method, we constructed the first large-scale multimodal bearing health management (MBHM) dataset, including paired vibration signals and textual descriptions. With our unified vibration signal representation, BearLLM using one set of pre-trained weights achieves state-of-the-art performance on nine publicly available fault diagnosis benchmarks, outperforming specific methods designed for individual datasets. We provide a dataset, our model, and code to inspire future research on building more capable industrial multimodal models (this https URL).

[AI-78] Towards Analyzing and Mitigating Sycophancy in Large Vision-Language Models

Link: https://arxiv.org/abs/2408.11261
Authors: Yunpu Zhao,Rui Zhang,Junbin Xiao,Changxin Ke,Ruibo Hou,Yifan Hao,Qi Guo,Yunji Chen
Keywords: Large Vision-Language Models, shown significant capability, Large Vision-Language, vision-language understanding, shown significant
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have shown significant capability in vision-language understanding. However, one critical issue that persists in these models is sycophancy, which means models are unduly influenced by leading or deceptive prompts, resulting in biased outputs and hallucinations. Despite the progress in LVLMs, evaluating and mitigating sycophancy remains largely under-explored. In this work, we fill this gap by systematically analyzing sycophancy on various VL benchmarks with curated leading queries and further proposing a text contrastive decoding method for mitigation. While the specific sycophantic behavior varies significantly among models, our analysis reveals the severe deficiency of all LVLMs in resilience to sycophancy across various tasks. For improvement, we propose Leading Query Contrastive Decoding (LQCD), a model-agnostic method focusing on calibrating the LVLMs’ over-reliance on leading cues by identifying and suppressing the probabilities of sycophancy tokens at the decoding stage. Extensive experiments show that LQCD effectively mitigates sycophancy, outperforming both prompt engineering methods and common methods for hallucination mitigation. We further demonstrate that LQCD does not hurt but even slightly improves LVLMs’ responses to neutral queries, suggesting that it is an effective strategy for general-purpose decoding, not limited to sycophancy.
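
To make the idea of suppressing tokens that are favored only because of a leading cue more concrete, here is a minimal sketch of a generic contrastive decoding step. It is not the LQCD formula from the paper, which the abstract does not give. The function names, the `alpha` weight, and the assumption that next-token logits are available for both a neutral and a leading version of the query are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token_probs(neutral_logits, leading_logits, alpha=1.0):
    """Generic contrastive step (assumed form, not taken from the paper).

    Tokens whose score rises only under the leading query are pushed down:
    adjusted = (1 + alpha) * neutral - alpha * leading.
    With alpha = 0 this reduces to ordinary decoding on the neutral logits.
    """
    adjusted = [
        (1 + alpha) * n - alpha * l
        for n, l in zip(neutral_logits, leading_logits)
    ]
    return softmax(adjusted)


# Toy two-token vocabulary: index 0 = "yes", index 1 = "no".
neutral = [0.0, 1.0]  # the neutral query favors "no"
leading = [2.0, 1.0]  # the leading query pushes the model toward "yes"
probs = contrastive_next_token_probs(neutral, leading, alpha=1.0)
```

In this toy case the token boosted by the leading query ends up with the lower probability after the contrastive adjustment.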

[AI-79] Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers

Link: https://arxiv.org/abs/2408.11258
Authors: Prashant Serai,Peidong Wang,Eric Fosler-Lussier
Keywords: discriminative language modeling, errorful recognized speech, robustness of NLP, simulate errorful recognized, language modeling
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Modeling the errors of a speech recognizer can help simulate errorful recognized speech data from plain text, which has proven useful for tasks like discriminative language modeling, improving robustness of NLP systems, where limited or even no audio data is available at train time. Previous work typically considered replicating behavior of GMM-HMM based systems, but the behavior of more modern posterior-based neural network acoustic models is not the same and requires adjustments to the error prediction model. In this work, we extend a prior phonetic confusion based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model. Second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.

[AI-80] Automatic Image Annotation (AIA) of AlmondNet-20 Method for Almond Detection by Improved CNN-based Model

Link: https://arxiv.org/abs/2408.11253
Authors: Mohsen Asghari Ilani,Saba Moftakhar Tehran,Ashkan Kavei,Arian Radmehr
Keywords: competitive nut market, innovative methodology aimed, Convolutional Neural Networks, burgeoning global demand, Deep Convolutional Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In response to the burgeoning global demand for premium agricultural products, particularly within the competitive nut market, this paper introduces an innovative methodology aimed at enhancing the grading process for almonds and their shells. Leveraging state-of-the-art Deep Convolutional Neural Networks (CNNs), specifically the AlmondNet-20 architecture, our study achieves exceptional accuracy exceeding 99%, facilitated by the utilization of a 20-layer CNN model. To bolster robustness in differentiating between almonds and shells, data augmentation techniques are employed, ensuring the reliability and accuracy of our classification system. Our model, meticulously trained over 1000 epochs, demonstrates remarkable performance, boasting an accuracy rate of 99% alongside a minimal loss function of 0.0567. Rigorous evaluation through test datasets further validates the efficacy of our approach, revealing impeccable precision, recall, and F1-score metrics for almond detection. Beyond its technical prowess, this advanced classification system offers tangible benefits to both industry experts and non-specialists alike, ensuring globally reliable almond classification. The application of deep learning algorithms, as showcased in our study, not only enhances grading accuracy but also presents opportunities for product patents, thereby contributing to the economic value of our nation. Through the adoption of cutting-edge technologies such as the AlmondNet-20 model, we pave the way for future advancements in agricultural product classification, ultimately enriching global trade and economic prosperity.

[AI-81] The Dilemma of Uncertainty Estimation for General Purpose AI in the EU AI Act ICML2024

Link: https://arxiv.org/abs/2408.11249
Authors: Matias Valdenegro-Toro,Radina Stoykova
Keywords: European Union-wide regulation, European Union-wide, Union-wide regulation, uncertainty estimation, European
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 7 pages, 2nd GenLaw Workshop @ ICML 2024

Abstract:The AI Act is the European Union-wide regulation of AI systems. It includes specific provisions for general-purpose AI models which however need to be further interpreted in terms of technical standards and state-of-the-art studies to ensure practical compliance solutions. This paper examines the AI Act requirements for providers and deployers of general-purpose AI and further proposes uncertainty estimation as a suitable measure for legal compliance and quality assurance in training of such models. We argue that uncertainty estimation should be a required component for deploying models in the real world, and under the EU AI Act, it could fulfill several requirements for transparency, accuracy, and trustworthiness. However, generally using uncertainty estimation methods increases the amount of computation, producing a dilemma, as computation might go over the threshold (10^25 FLOPS) to classify the model as a systemic risk system which bears more regulatory burden.
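
The dilemma described above can be illustrated with rough arithmetic. The sketch below assumes the common approximation that training a dense transformer costs about 6 × parameters × tokens floating-point operations, and it assumes an ensemble-style uncertainty method that trains several independent copies of the model. The model size, token count, and function names are hypothetical and are not taken from the paper.

```python
# Compute threshold quoted in the abstract (10^25 floating-point operations).
SYSTEMIC_RISK_THRESHOLD = 1e25


def training_flop(n_params, n_tokens):
    """Rough training compute using the 6 * N * D rule of thumb (an approximation)."""
    return 6 * n_params * n_tokens


def exceeds_threshold(n_params, n_tokens, ensemble_size=1):
    """True if training `ensemble_size` independent copies crosses the threshold."""
    total = ensemble_size * training_flop(n_params, n_tokens)
    return total > SYSTEMIC_RISK_THRESHOLD


# Hypothetical model: 70 billion parameters trained on 15 trillion tokens.
single = training_flop(70e9, 15e12)  # about 6.3e24, below the threshold
print(f"single model: {single:.2e} FLOP, exceeds: {exceeds_threshold(70e9, 15e12)}")
print(f"ensemble of 2 exceeds: {exceeds_threshold(70e9, 15e12, ensemble_size=2)}")
```

Under these assumptions a single model stays below the threshold, while a two-member ensemble trained for uncertainty estimation crosses it.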

[AI-82] Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?

Link: https://arxiv.org/abs/2408.11243
Authors: Qian Ma,Haitao Mao,Jingzhe Liu,Zhehua Zhang,Chunlin Feng,Yu Song,Yihan Shao,Tianfan Fu,Yao Ma
Keywords: graph SSL techniques, existing graph SSL, graph SSL, effectively leveraging knowledge, SSL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-supervised learning (SSL) is essential to obtain foundation models in NLP and CV domains via effectively leveraging knowledge in large-scale unlabeled data. The reason for its success is that a suitable SSL design can help the model to follow the neural scaling law, i.e., the performance consistently improves with increasing model and dataset sizes. However, it remains a mystery whether existing SSL in the graph domain can follow the scaling behavior toward building Graph Foundation Models (GFMs) with large-scale pre-training. In this study, we examine whether existing graph SSL techniques can follow the neural scaling behavior with the potential to serve as the essential component for GFMs. Our benchmark includes comprehensive SSL technique implementations with analysis conducted on both the conventional SSL setting and many new settings adopted in other domains. Surprisingly, despite the SSL loss continuously decreasing, no existing graph SSL techniques follow the neural scaling behavior on the downstream performance. The model performance merely fluctuates across different data scales and model scales. Instead of the scales, the key factors influencing the performance are the choices of model architecture and pretext task design. This paper examines existing SSL techniques for the feasibility of Graph SSL techniques in developing GFMs and opens a new direction for graph SSL design with the new evaluation prototype. Our code implementation is available online to ease reproducibility at this https URL.

[AI-83] A Little Confidence Goes a Long Way

Link: https://arxiv.org/abs/2408.11239
Authors: John Scoville,Shang Gao,Devanshu Agrawal,Javed Qadrud-Din
Keywords: binary classification tasks, large language models, hidden state activations, introduce a group, group of related
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Comments: 13 pages, 2 figures

Abstract:We introduce a group of related methods for binary classification tasks using probes of the hidden state activations in large language models (LLMs). Performance is on par with the largest and most advanced LLMs currently available, but requiring orders of magnitude fewer computational resources and not requiring labeled data. This approach involves translating class labels into a semantically rich description, spontaneous symmetry breaking of multilayer perceptron probes for unsupervised learning and inference, training probes to generate confidence scores (prior probabilities) from hidden state activations subject to known constraints via entropy maximization, and selecting the most confident probe model from an ensemble for prediction. These techniques are evaluated on four datasets using five base LLMs.

[AI-84] Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

Link: https://arxiv.org/abs/2408.11237
Authors: Christos Constantinou,Georgios Ioannides,Aman Chadha,Aaron Elkins,Edwin Simpson
Keywords: machine learning applications, model overconfidence, crucial in machine, machine learning, learning applications
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The majority of existing OOD detection methods predominantly address uni-modal inputs, such as images or texts. In the context of multi-modal documents, there is a notable lack of extensive research on the performance of these methods, which have primarily been developed with a focus on computer vision tasks. We propose a novel methodology termed as attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and significantly decreases the false positive rate (FPR) compared to existing solutions up to 7.5%. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.

[AI-85] Unified Deep Learning Model for Global Prediction of Aboveground Biomass Canopy Height and Cover from High-Resolution Multi-Sensor Satellite Imagery

Link: https://arxiv.org/abs/2408.11234
Authors: Manuel Weber,Carly Beneke,Clyde Wheeler
Keywords: international climate initiatives, ground based assessments, carbon stock, carbon accounting, climate initiatives
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Regular measurement of carbon stock in the world’s forests is critical for carbon accounting and reporting under national and international climate initiatives, and for scientific research, but has been largely limited in scalability and temporal resolution due to a lack of ground based assessments. Increasing efforts have been made to address these challenges by incorporating remotely sensed data. We present a new methodology which uses multi-sensor, multi-spectral imagery at a resolution of 10 meters and a deep learning based model which unifies the prediction of above ground biomass density (AGBD), canopy height (CH), canopy cover (CC) as well as uncertainty estimations for all three quantities. The model is trained on millions of globally sampled GEDI-L2/L4 measurements. We validate the capability of our model by deploying it over the entire globe for the year 2023 as well as annually from 2016 to 2023 over selected areas. The model achieves a mean absolute error for AGBD (CH, CC) of 26.1 Mg/ha (3.7 m, 9.9 %) and a root mean squared error of 50.6 Mg/ha (5.4 m, 15.8 %) on a globally sampled test dataset, demonstrating a significant improvement over previously published results. We also report the model performance against independently collected ground measurements published in the literature, which show a high degree of correlation across varying conditions. We further show that our pre-trained model facilitates seamless transferability to other GEDI variables due to its multi-head architecture.
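
For readers comparing the two error figures quoted above, the following sketch shows how mean absolute error and root mean squared error are computed. The sample values are made up for illustration and have no connection to the paper's data. Because RMSE squares each error before averaging, a few large misses raise it above MAE, which is consistent with the RMSE figures in the abstract being larger than the MAE figures.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between reference and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical biomass densities in Mg/ha (illustrative values only).
reference = [100.0, 50.0, 0.0, 200.0]
predicted = [110.0, 40.0, 10.0, 150.0]

mae = mean_absolute_error(reference, predicted)
rmse = root_mean_squared_error(reference, predicted)
print(f"MAE = {mae:.1f} Mg/ha, RMSE = {rmse:.1f} Mg/ha")
```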

[AI-86] CoDi: Conversational Distillation for Grounded Question Answering

Link: https://arxiv.org/abs/2408.11219
Authors: Patrick Huber,Arash Einolghozati,Rylan Conway,Kanika Narang,Matt Smith,Waqar Nayyar,Adithya Sagar,Ahmed Aly,Akshat Shrivastava
Keywords: Distilling conversational skills, Small Language Models, billion parameters presents, parameters presents significant, Small Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced “Cody”), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to “memorize” world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.

[AI-87] Quantum Inverse Contextual Vision Transformers (Q-ICVT): A New Frontier in 3D Object Detection for AVs CIKM’24

Link: https://arxiv.org/abs/2408.11207
Authors: Sanjay Bhargav Dharavath,Tanmoy Dam,Supriyo Chakraborty,Prithwiraj Roy,Aniruddha Maiti
Keywords: predominantly leverages multi-modal, Contextual Vision Transformers, Inverse Contextual Vision, Quantum Inverse Contextual, autonomous vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted as a short paper at CIKM '24

Abstract:The field of autonomous vehicles (AVs) predominantly leverages multi-modal integration of LiDAR and camera data to achieve better performance compared to using a single modality. However, the fusion process encounters challenges in detecting distant objects due to the disparity between the high resolution of cameras and the sparse data from LiDAR. Insufficient integration of global perspectives with local-level details results in sub-optimal fusion. To address this issue, we have developed an innovative two-stage fusion process called Quantum Inverse Contextual Vision Transformers (Q-ICVT). This approach leverages adiabatic computing in quantum concepts to create a novel reversible vision transformer known as the Global Adiabatic Transformer (GAT). GAT aggregates sparse LiDAR features with semantic features in dense images for cross-modal integration in a global form. Additionally, the Sparse Expert of Local Fusion (SELF) module maps the sparse LiDAR 3D proposals and encodes position information of the raw point cloud onto the dense camera feature space using a gating point fusion approach. Our experiments show that Q-ICVT achieves an mAPH of 82.54 for L2 difficulties on the Waymo dataset, improving by 1.88% over current state-of-the-art fusion methods. We also analyze GAT and SELF in ablation studies to highlight the impact of Q-ICVT. Our code is available at this https URL Q-ICVT

[AI-88] EPiC: Cost-effective Search-based Prompt Engineering of LLMs for Code Generation

Link: https://arxiv.org/abs/2408.11198
Authors: Hamed Taherkhani,Melika Sepindband,Hung Viet Pham,Song Wang,Hadi Hemmati
Keywords: Large Language Models, Large Language, software development tasks, Language Models, development tasks
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to TSE

Abstract:Large Language Models (LLMs) have seen increasing use in various software development tasks, especially in code generation. The most advanced recent methods attempt to incorporate feedback from code execution into prompts to help guide LLMs in generating correct code, in an iterative process. While effective, these methods could be costly and time-consuming due to numerous interactions with the LLM and the extensive token usage. To address this issue, we propose an alternative approach named Evolutionary Prompt Engineering for Code (EPiC), which leverages a lightweight evolutionary algorithm to evolve the original prompts toward better ones that produce high-quality code, with minimal interactions with LLM. Our evaluation against state-of-the-art (SOTA) LLM-based code generation models shows that EPiC outperforms all the baselines in terms of cost-effectiveness.

[AI-89] Reading with Intent

Link: https://arxiv.org/abs/2408.11189
Authors: Benjamin Reichman,Kartik Talamadupula,Toshish Jawale,Larry Heck
Keywords: integrating external information, external information sources, Retrieval augmented generation, RAG systems, open internet
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Retrieval augmented generation (RAG) systems augment how knowledgeable language models are by integrating external information sources such as Wikipedia, internal documents, scientific papers, or the open internet. RAG systems that rely on the open internet as their knowledge source have to contend with the complexities of human-generated content. Human communication extends much deeper than just the words rendered as text. Intent, tonality, and connotation can all change the meaning of what is being conveyed. Recent real-world deployments of RAG systems have shown some difficulty in understanding these nuances of human communication. One significant challenge for these systems lies in processing sarcasm. Though the Large Language Models (LLMs) that make up the backbone of these RAG systems are able to detect sarcasm, they currently do not always use these detections for the subsequent processing of text. To address these issues, in this paper, we synthetically generate sarcastic passages from Natural Question’s Wikipedia retrieval corpus. We then test the impact of these passages on the performance of both the retriever and reader portion of the RAG pipeline. We introduce a prompting system designed to enhance the model’s ability to interpret and generate responses in the presence of sarcasm, thus improving overall system performance. Finally, we conduct ablation studies to validate the effectiveness of our approach, demonstrating improvements in handling sarcastic content within RAG systems.

[AI-90] Optimization of Multi-Agent Flying Sidekick Traveling Salesman Problem over Road Networks

Link: https://arxiv.org/abs/2408.11187
Authors: Ruixiao Yang,Chuchu Fan
Keywords: fully connected graph, attracted increasing attention, truck-drone delivery systems, multi-agent systems operating, mixed truck-drone delivery
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:The mixed truck-drone delivery systems have attracted increasing attention for last-mile logistics, but real-world complexities demand a shift from single-agent, fully connected graph models to multi-agent systems operating on actual road networks. We introduce the multi-agent flying sidekick traveling salesman problem (MA-FSTSP) on road networks, extending the single truck-drone model to multiple trucks, each carrying multiple drones while considering full road networks for truck restrictions and flexible drone routes. We propose a mixed-integer linear programming model and an efficient three-phase heuristic algorithm for this NP-hard problem. Our approach decomposes MA-FSTSP into manageable subproblems of one truck with multiple drones. Then, it computes the routes for trucks without drones in subproblems, which are used in the final phase as heuristics to help optimize drone and truck routes simultaneously. Extensive numerical experiments on Manhattan and Boston road networks demonstrate our algorithm’s superior effectiveness and efficiency, significantly outperforming both column generation and variable neighborhood search baselines in solution quality and computation time. Notably, our approach scales to more than 300 customers within a 5-minute time limit, showcasing its potential for large-scale, real-world logistics applications.

[AI-91] Autonomous Negotiation Using Comparison-Based Gradient Estimation

Link: https://arxiv.org/abs/2408.11186
Authors: Surya Murthy,Mustafa O. Karabag,Ufuk Topcu
Keywords: responding agent, multi-agent systems, resolving conflicts, conflicts in multi-agent, agent
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:Negotiation is useful for resolving conflicts in multi-agent systems. We explore autonomous negotiation in a setting where two self-interested rational agents sequentially trade items from a finite set of categories. Each agent has a utility function that depends on the amount of items it possesses in each category. The offering agent makes trade offers to improve its utility without knowing the responding agent’s utility function, and the responding agent accepts offers that improve its utility. We present a comparison-based algorithm for the offering agent that generates offers through previous acceptance or rejection responses without extensive information sharing. The algorithm estimates the responding agent’s gradient by leveraging the rationality assumption and rejected offers to prune the space of potential gradients. After the algorithm makes a finite number of consecutively rejected offers, the responding agent is at a near-optimal state, or the agents’ preferences are closely aligned. Additionally, we facilitate negotiations with humans by representing natural language feedback as comparisons that can be integrated into the proposed algorithm. We compare the proposed algorithm against random search baselines in integer and fractional trading scenarios and show that it improves the societal benefit with fewer offers.
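
The pruning step mentioned above can be pictured with a small sketch. It is a simplification under stated assumptions, not the authors' algorithm: the responding agent's unknown utility gradient is represented by a finite list of candidate directions, each offer is a vector of item changes from the responder's point of view, and a first-order approximation is used, so a rational responder accepts an offer only when its gradient has a positive dot product with the offer. All names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def prune_candidates(candidates, offer, accepted):
    """Keep only the candidate gradients consistent with the observed response.

    First-order assumption: an accepted offer means gradient . offer > 0,
    and a rejected offer means gradient . offer <= 0.
    """
    if accepted:
        return [g for g in candidates if dot(g, offer) > 0]
    return [g for g in candidates if dot(g, offer) <= 0]


# Four candidate directions for the responder's gradient over two item categories.
candidates = [(1, 0), (0, 1), (-1, 0), (0, -1)]

# Offer: the responder gains one item of category 0 and gives up one of category 1.
offer = (1, -1)

# The responder rejects, so candidates that would have benefited are ruled out.
remaining = prune_candidates(candidates, offer, accepted=False)
print(remaining)
```

Each rejection shrinks the set of plausible gradients, which is the intuition behind using rejected offers as comparisons.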

[AI-92] Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles

Link: https://arxiv.org/abs/2408.11182
Authors: Zhilong Wang,Haizhou Wang,Nanqing Luo,Lan Zhang,Xiaoyan Sun,Yebo Cao,Peng Liu
Keywords: Language Model Models, Language Model, Model Models, entail crafting prompts, crafting prompts aimed
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Jailbreak attacks on Large Language Models (LLMs) entail crafting prompts aimed at exploiting the models to generate malicious content. This paper proposes a new type of jailbreak attack that shifts the attention of the LLM by inserting a prohibited query into a carrier article. The proposed attack leverages a knowledge graph and a composer LLM to automatically generate a carrier article that is similar to the topic of the prohibited query but does not violate the LLM’s safeguards. By inserting the malicious query into the carrier article, the assembled attack payload can successfully jailbreak the LLM. To evaluate the effectiveness of our method, we leverage 4 popular categories of “harmful behaviors” adopted by related research to attack 6 popular LLMs. Our experimental results show that the proposed attack method can successfully jailbreak all the target LLMs with a high success rate, except for Claude-3.

[AI-93] A Full DAG Score-Based Algorithm for Learning Causal Bayesian Networks with Latent Confounders ECAI’24

Link: https://arxiv.org/abs/2408.11181
Authors: Christophe Gonzales,Amir-Hosein Valizadeh
Keywords: Causal Bayesian networks, encode causal relations, Causal Bayesian, Bayesian networks, popular graphical probabilistic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, extended version with supplementary material of paper accepted at the 27th European Conference on Artificial Intelligence (ECAI’24)

Abstract:Causal Bayesian networks (CBN) are popular graphical probabilistic models that encode causal relations among variables. Learning their graphical structure from observational data has received a lot of attention in the literature. When there exists no latent (unobserved) confounder, i.e., no unobserved direct common cause of some observed variables, learning algorithms can be divided essentially into two classes: constraint-based and score-based approaches. The latter are often thought to be more robust than the former and to produce better results. However, to the best of our knowledge, when variables are discrete, no score-based algorithm is capable of dealing with latent confounders. This paper introduces the first fully score-based structure learning algorithm searching the space of DAGs (directed acyclic graphs) that is capable of identifying the presence of some latent confounders. It is justified mathematically and experiments highlight its effectiveness.

[AI-94] SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

Link: https://arxiv.org/abs/2408.11172
Authors: Xueliang Zhao,Lin Zheng,Haige Bo,Changran Hu,Urmish Thakker,Lingpeng Kong
Keywords: Formal theorem proving, large language models, theorem proving, Formal theorem, computer science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments:

Abstract:Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs’ capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment’s advantages in subgoal-based proofs, SubgoalXL achieves a new state-of-the-art performance of 56.1% in Isabelle on the standard miniF2F dataset, marking an absolute improvement of 4.9%. Notably, SubgoalXL successfully solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F. These results underscore the effectiveness of maximizing limited data utility and employing targeted guidance for complex reasoning in formal theorem proving, contributing to the ongoing advancement of AI reasoning capabilities. The implementation is available at this https URL.

[AI-95] MS3D: A RG Flow-Based Regularization for GAN Training with Limited Data

Link: https://arxiv.org/abs/2408.11135
Authors: Jian Wang,Xin Lan,Yuxin Tian,Jiancheng Lv
Keywords: Generative adversarial networks, made impressive advances, avoid degradation caused, require large-scale training, Generative adversarial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative adversarial networks (GANs) have made impressive advances in image generation, but they often require large-scale training data to avoid degradation caused by discriminator overfitting. To tackle this issue, we investigate the challenge of training GANs with limited data, and propose a novel regularization method based on the idea of renormalization group (RG) in physics. We observe that in the limited data setting, the gradient pattern that the generator obtains from the discriminator becomes more aggregated over time. In RG context, this aggregated pattern exhibits a high discrepancy from its coarse-grained versions, which implies a high-capacity and sensitive system, prone to overfitting and collapse. To address this problem, we introduce a multi-scale structural self-dissimilarity (MS^3D) regularization, which constrains the gradient field to have a consistent pattern across different scales, thereby fostering a more redundant and robust system. We show that our method can effectively enhance the performance and stability of GANs under limited data scenarios, and even allow them to generate high-quality images with very few data.

[AI-96] DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

Link: https://arxiv.org/abs/2408.11121
Authors: Tom Segal,Asaf Shabtai,Yuval Elovici
Keywords: depends heavily, quality and quantity, large language models, large language, LLMs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Notes: 11 pages, 3 figures

Abstract:The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a “min-bounded” average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
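The "min-bounded" aggregation mentioned in this abstract can be illustrated with a toy sketch. It assumes two next-token distributions over the same small vocabulary, combines them element-wise with a harmonic mean (the example the abstract itself gives), and renormalizes. The function names, the renormalization step, and the numbers are assumptions made for illustration, not the authors' implementation.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two probabilities; never exceeds twice the smaller one."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def min_bounded_aggregate(p, q):
    """Combine two distributions element-wise, then renormalize (assumed step)."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    raw = [harmonic_mean(a, b) for a, b in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("the two distributions share no support")
    return [value / total for value in raw]


# Hypothetical next-token distributions from two models.
# Token 0 is likely under model P only, e.g. because only P saw restricted text.
p = [0.70, 0.20, 0.10]
q = [0.01, 0.59, 0.40]
combined = min_bounded_aggregate(p, q)
```

In this toy case the token that only one model considers likely ends up with a small combined probability, which is the intuition behind bounding the aggregate by the smaller value.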

[AI-97] What can Large Language Models Capture about Code Functional Equivalence?

Link: https://arxiv.org/abs/2408.11081
Authors: Nickil Maveli,Antonio Vergari,Shay B. Cohen
Keywords: shown great progress, learning rich representations, large code corpora, classify code fragments, pre-trained on large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 37 pages

Abstract:Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.

[AI-98] DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Link: https://arxiv.org/abs/2408.11071
Authors: Pucheng Dang,Xing Hu,Dong Li,Rui Zhang,Qi Guo,Kaidi Xu
Keywords: raise misuse concerns, models raise misuse, misuse concerns, raise misuse, creating prohibited
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model’s capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.
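For readers unfamiliar with zeroth-order optimization, the sketch below shows the general idea of approximating a gradient from function values alone. It uses a generic two-point estimator with random Gaussian directions on a toy quadratic. The names `zo_gradient` and `toy_objective` and all parameter values are assumptions for illustration; this is not the C-PRV/D-PRV procedure of the paper.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_samples=5000, seed=0):
    """Estimate the gradient of f at x using only evaluations of f."""
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Central finite difference along u approximates the directional derivative.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def toy_objective(x):
    """Stand-in for a black-box score; its true gradient is 2 * x."""
    return sum(xi * xi for xi in x)


estimate = zo_gradient(toy_objective, [1.0, -2.0])
```

The estimate is noisy and improves as `num_samples` grows, so each gradient approximation costs many queries to the black box.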

[AI-99] Toward End-to-End Bearing Fault Diagnosis for Industrial Scenarios with Spiking Neural Networks

Link: https://arxiv.org/abs/2408.11067
Authors: Yongqi Ding,Lin Zuo,Mengmeng Jing,Kunshan Yang,Biao Chen,Yunqian Yu
Keywords: Spiking neural networks, received widespread attention, low-power binary spikes, neural networks, transmit information
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 13 pages, 10 figures

Abstract:Spiking neural networks (SNNs) transmit information via low-power binary spikes and have received widespread attention in areas such as computer vision and reinforcement learning. However, there have been very few explorations of SNNs in more practical industrial scenarios. In this paper, we focus on the application of SNNs in bearing fault diagnosis to facilitate the integration of high-performance AI algorithms and real-world industries. In particular, we identify two key limitations of existing SNN fault diagnosis methods: inadequate encoding capacity that necessitates cumbersome data preprocessing, and non-spike-oriented architectures that constrain the performance of SNNs. To alleviate these problems, we propose a Multi-scale Residual Attention SNN (MRA-SNN) to simultaneously improve the efficiency, performance, and robustness of SNN methods. By incorporating a lightweight attention mechanism, we have designed a multi-scale attention encoding module to extract multiscale fault features from vibration signals and encode them as spatio-temporal spikes, eliminating the need for complicated preprocessing. Then, the spike residual attention block extracts high-dimensional fault features and enhances the expressiveness of sparse spikes with the attention mechanism for end-to-end diagnosis. In addition, the performance and robustness of MRA-SNN is further enhanced by introducing the lightweight attention mechanism within the spiking neurons to simulate the biological dendritic filtering effect. Extensive experiments on MFPT and JNU benchmark datasets demonstrate that MRA-SNN significantly outperforms existing methods in terms of accuracy, energy consumption and noise robustness, and is more feasible for deployment in real-world industrial scenarios.

[AI-100] Tabular Transfer Learning via Prompting LLMs

Link: https://arxiv.org/abs/2408.11063
Authors: Jaehyun Nam,Woomin Song,Seong Hyeon Park,Jihoon Tack,Sukmin Yun,Jaehyung Kim,Kyu Hwan Oh,Jinwoo Shin
Keywords: transfer learning, tabular transfer learning, Learning, transfer, obtain annotations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: COLM 2024

Abstract:Learning with a limited number of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it suggests to learn a transferable knowledge by training a neural network from multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and successful in the literature, compared to other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, under the scenarios where the source and target datasets are of different format. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important, yet underexplored tabular transfer learning problem. Code is available at this https URL.

[AI-101] Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Link: https://arxiv.org/abs/2408.11062
Authors: Guanming Xiong,Junwei Bao,Hongfei Jiang,Yang Song,Wen Zhao
Keywords: large language models, powerful reasoning capabilities, study explores, parsing by leveraging, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 15 pages, 7 figures

Abstract:This study explores text-to-SQL parsing by leveraging the powerful reasoning capabilities of large language models (LLMs). Despite recent advancements, existing LLM-based methods have not adequately addressed scalability, leading to inefficiencies when processing wide tables. Furthermore, current interaction-based approaches either lack a step-by-step, interpretable SQL generation process or fail to provide an efficient and universally applicable interaction design. To address these challenges, we introduce Interactive-T2S, a framework that generates SQL queries through direct interactions with databases. This framework includes four general tools that facilitate proactive and efficient information retrieval by the LLM. Additionally, we have developed detailed exemplars to demonstrate the step-wise reasoning processes within our framework. Our experiments on the BIRD-Dev dataset, employing a setting without oracle knowledge, reveal that our method achieves state-of-the-art results with only two exemplars, underscoring the effectiveness and robustness of our framework.

[AI-102] Dynamic Code Orchestration: Harnessing the Power of Large Language Models for Adaptive Script Execution

Link: https://arxiv.org/abs/2408.11060
Authors: Justin Del Vecchio,Andrew Perreault,Eliana Furmanek
Keywords: written language directives, written language, initially required humans, language, language directives
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes: 6 pages, 4 figures

Abstract:Computer programming initially required humans to directly translate their goals into machine code. These goals could have easily been expressed as a written (or human) language directive. Computers, however, had no capacity to satisfactorily interpret written language. Large language models provide exactly this capability: automatic generation of computer programs or even assembly code from written language directives. This research examines dynamic code execution of written language directives within the context of a running application. It implements a text editor whose business logic is purely backed by large language model prompts. That is, the program’s execution uses prompts and written language directives to dynamically generate application logic at the point in time it is needed. The research clearly shows how written language directives, backed by a large language model, offer radically new programming and operating system paradigms. For example, empowerment of users to directly implement requirements via written language directives, thus supplanting the need for a team of programmers, a release schedule and the like. Or, new security mechanisms where static executables, always a target for reverse engineering or fuzzing, no longer exist. They are replaced by ephemeral executables that may continually change, be completely removed, and are easily updated.

[AI-103] LLM Agents Improve Semantic Code Search

Link: https://arxiv.org/abs/2408.11058
Authors: Sarthak Jain(University of Illinois Urbana Champaign and Cisco),Aditya Dora(University of Illinois Urbana Champaign),Ka Seng Sam(University of Illinois Urbana Champaign),Prabhat Singh(Cisco)
Keywords: solutions to problems, key task, developing solutions, Retrieval Augmented Generation, Augmented Generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: 12 pages, 1 Figure

Abstract:Code Search is a key task that many programmers often have to perform while developing solutions to problems. Current methodologies suffer from an inability to perform accurately on prompts that contain some ambiguity or ones that require additional context relative to a code-base. We introduce the approach of using Retrieval Augmented Generation (RAG) powered agents to inject information into user prompts allowing for better inputs into embedding models. By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned. Additionally, we introduce a multi-stream ensemble approach which, when paired with an agentic workflow, can obtain improved retrieval accuracy, which we deploy in an application called this http URL. Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods, achieving a 78.2% success rate at Success@10 and a 34.6% success rate at Success@1. This research presents a substantial advancement in semantic code search, highlighting the potential of agentic LLMs and RAG to enhance code retrieval systems.
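The Success@k figures quoted above can be read as follows, assuming the usual definition: the fraction of queries for which the correct snippet appears among the top k retrieved results. The sketch below computes that quantity on made-up data; the function name, the query IDs, and the snippet IDs are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def success_at_k(rankings, gold, k):
    """Fraction of queries whose gold snippet appears in the top-k results."""
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


# Hypothetical retrieval output: query id -> snippet ids, best match first.
rankings = {
    "q1": ["s3", "s7", "s1"],
    "q2": ["s2", "s9", "s4"],
    "q3": ["s5", "s6", "s8"],
}
# Hypothetical ground truth: query id -> the one correct snippet id.
gold = {"q1": "s3", "q2": "s9", "q3": "s0"}

success_at_1 = success_at_k(rankings, gold, k=1)
success_at_2 = success_at_k(rankings, gold, k=2)
```

Success@k can only stay the same or grow as k increases, which is why Success@10 is higher than Success@1 in the reported results.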

[AI-104] Improving the Scan-rescan Precision of AI-based CMR Biomarker Estimation MICCAI

Link: https://arxiv.org/abs/2408.11754
Authors: Dewmini Hasara Wickremasinghe,Yiyang Xu,Esther Puyol-Antón,Paul Aljabar,Reza Razavi,Andrew P. King
Keywords: cardiovascular magnetic resonance, cine cardiovascular magnetic, magnetic resonance, deep learning, offers many advantages
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Notes: 11 pages, 3 figures, MICCAI STACOM 2024

Abstract:Quantification of cardiac biomarkers from cine cardiovascular magnetic resonance (CMR) data using deep learning (DL) methods offers many advantages, such as increased accuracy and faster analysis. However, only a few studies have focused on the scan-rescan precision of the biomarker estimates, which is important for reproducibility and longitudinal analysis. Here, we propose a cardiac biomarker estimation pipeline that not only focuses on achieving high segmentation accuracy but also on improving the scan-rescan precision of the computed biomarkers, namely left and right ventricular ejection fraction, and left ventricular myocardial mass. We evaluate two approaches to improve the apical-basal resolution of the segmentations used for estimating the biomarkers: one based on image interpolation and one based on segmentation interpolation. Using a database comprising scan-rescan cine CMR data acquired from 92 subjects, we compare the performance of these two methods against ground truth (GT) segmentations and DL segmentations obtained before interpolation (baseline). The results demonstrate that both the image-based and segmentation-based interpolation methods were able to narrow Bland-Altman scan-rescan confidence intervals for all biomarkers compared to the GT and baseline performances. Our findings highlight the importance of focusing not only on segmentation accuracy but also on the consistency of biomarkers across repeated scans, which is crucial for longitudinal analysis of cardiac function.
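The Bland-Altman intervals mentioned in this abstract summarize how far two repeated measurements of the same subject tend to differ. A minimal sketch, assuming the standard formulation (mean difference plus or minus 1.96 standard deviations of the differences), is shown below. The function name and the ejection-fraction values are invented for illustration and are not data from the paper.

```python
from statistics import mean, stdev


def bland_altman_limits(first, second, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)          # systematic offset between scan and rescan
    spread = z * stdev(diffs)   # half-width of the limits of agreement
    return bias, bias - spread, bias + spread


# Hypothetical ejection fractions (%) for four subjects, scanned twice.
scan = [60.0, 55.0, 62.0, 58.0]
rescan = [58.0, 57.0, 61.0, 59.0]
bias, lower, upper = bland_altman_limits(scan, rescan)
```

A narrower interval between `lower` and `upper` means better scan-rescan precision, which is the quantity the interpolation methods in the paper aim to improve.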

[AI-105] 5G NR PRACH Detection with Convolutional Neural Networks (CNN): Overcoming Cell Interference Challenges

Link: https://arxiv.org/abs/2408.11659
Authors: Desire Guel,Arsene Kabore,Didier Bassole
Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, Random Access Channel, Physical Random Access
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Notes:

Abstract:In this paper, we present a novel approach to interference detection in 5G New Radio (5G-NR) networks using Convolutional Neural Networks (CNN). Interference in 5G networks challenges high-quality service due to dense user equipment deployment and increased wireless environment complexity. Our CNN-based model is designed to detect Physical Random Access Channel (PRACH) sequences amidst various interference scenarios, leveraging the spatial and temporal characteristics of PRACH signals to enhance detection accuracy and robustness. Comprehensive datasets of simulated PRACH signals under controlled interference conditions were generated to train and validate the model. Experimental results show that our CNN-based approach outperforms traditional PRACH detection methods in accuracy, precision, recall and F1-score. This study demonstrates the potential of AI/ML techniques in advancing interference management in 5G networks, providing a foundation for future research and practical applications in optimizing network performance and reliability.
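The four metrics named in this abstract have standard definitions for a binary detector. The sketch below computes them from true and predicted labels, where 1 means "PRACH preamble present". The function name and the labels are made up for illustration and do not come from the paper.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = detected)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and detector output for eight PRACH occasions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = detection_metrics(y_true, y_pred)
```

Precision penalizes false alarms and recall penalizes missed preambles, so reporting both alongside F1 gives a fuller picture than accuracy alone.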

[AI-106] OCTCube: A 3D foundation model for optical coherence tomography that improves cross-dataset cross-disease cross-device and cross-modality analysis

Link: https://arxiv.org/abs/2408.11227
Authors: Zixuan Liu,Hanwen Xu,Addie Woicik,Linda G. Shapiro,Marian Blazes,Yue Wu,Cecilia S. Lee,Aaron Y. Lee,Sheng Wang
Keywords: Optical coherence tomography, Optical coherence, OCT, OCT images, coherence tomography
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Optical coherence tomography (OCT) has become critical for diagnosing retinal diseases as it enables 3D images of the retina and optic nerve. OCT acquisition is fast, non-invasive, affordable, and scalable. Due to its broad applicability, massive numbers of OCT images have been accumulated in routine exams, making it possible to train large-scale foundation models that can generalize to various diagnostic tasks using OCT images. Nevertheless, existing foundation models for OCT only consider 2D image slices, overlooking the rich 3D structure. Here, we present OCTCube, a 3D foundation model pre-trained on 26,605 3D OCT volumes encompassing 1.62 million 2D OCT images. OCTCube is developed based on 3D masked autoencoders and exploits FlashAttention to reduce the larger GPU memory usage caused by modeling 3D volumes. OCTCube outperforms 2D models when predicting 8 retinal diseases in both inductive and cross-dataset settings, indicating that utilizing the 3D structure in the model instead of 2D data results in significant improvement. OCTCube further shows superior performance on cross-device prediction and when predicting systemic diseases, such as diabetes and hypertension, further demonstrating its strong generalizability. Finally, we propose a contrastive-self-supervised-learning-based OCT-IR pre-training framework (COIP) for cross-modality analysis on OCT and infrared retinal (IR) images, where the OCT volumes are embedded using OCTCube. We demonstrate that COIP enables accurate alignment between OCT and IR en face images. Collectively, OCTCube, a 3D OCT foundation model, demonstrates significantly better performance against 2D models on 27 out of 29 tasks and comparable performance on the other two tasks, paving the way for AI-based retinal disease diagnosis.

[AI-107] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits RECSYS2024

Link: https://arxiv.org/abs/2408.11202
Authors: Tatsuhiro Shimizu,Koichi Tanaka,Ren Kishimoto,Haruka Kiyohara,Masahiro Nomura,Yuta Saito
Keywords: contextual combinatorial bandits, explore off-policy evaluation, evaluation and learning, combinatorial bandits, explore off-policy
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: accepted at RecSys2024

Abstract:We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset in the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L of CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce a concept of factored action space, which allows us to decompose each subset into binary indicators. This formulation allows us to distinguish between the “main effect” derived from the main actions, and the “residual effect”, originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as illustrated in our theoretical analysis. Experiments demonstrate OPCB’s superior performance over typical methods in both OPE and OPL.

Computer Vision

[CV-0] GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Link: https://arxiv.org/abs/2408.11817
Authors: Jonathan Roberts,Kai Han,Samuel Albanie
Keywords: Large multimodal models, Large multimodal, exhibited proficiencies, Large, GRAB
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.

[CV-1] SynPlay: Importing Real-world Diversity for a Synthetic Human Dataset

Link: https://arxiv.org/abs/2408.11814
Authors: Jinsub Yim,Hyungtae Lee,Sungmin Eum,Yi-Ting Shen,Yan Zhang,Heesung Kwon,Shuvra S. Bhattacharyya
Keywords: introduce Synthetic Playground, Synthetic Playground, aims to bring, human, Playground
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:We introduce Synthetic Playground (SynPlay), a new synthetic human dataset that aims to bring out the diversity of human appearance in the real world. We focus on two factors to achieve a level of diversity that has not yet been seen in previous works: i) realistic human motions and poses and ii) multiple camera viewpoints towards human instances. We first use a game engine and its library-provided elementary motions to create games where virtual players can take less-constrained and natural movements while following the game rules (i.e., rule-guided motion design as opposed to detail-guided design). We then augment the elementary motions with real human motions captured with a motion capture device. To render various human appearances in the games from multiple viewpoints, we use seven virtual cameras encompassing the ground and aerial views, capturing abundant aerial-vs-ground and dynamic-vs-static attributes of the scene. Through extensive and carefully-designed experiments, we show that using SynPlay in model training leads to enhanced accuracy over existing synthetic datasets for human detection and segmentation. The benefit of SynPlay becomes even greater for tasks in the data-scarce regime, such as few-shot and cross-domain learning tasks. These results clearly demonstrate that SynPlay can be used as an essential dataset with rich attributes of complex human appearances and poses suitable for model pretraining. The SynPlay dataset, comprising over 73k images and 6.5M human instances, is available for download at this https URL.

[CV-2] SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

Link: https://arxiv.org/abs/2408.11813
Authors: Yuanyang Yin,Yaqi Zhao,Yajie Zhang,Ke Lin,Jiahao Wang,Xin Tao,Pengfei Wan,Di Zhang,Baoqun Yin,Wentao Zhang
Keywords: Large Language Models, Multimodal Large Language, Vision Encoder, Large Language, recently demonstrated remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities, typically comprising a Vision Encoder, an Adapter, and a Large Language Model (LLM). The adapter serves as the critical bridge between the visual and language components. However, training adapters with image-level supervision often results in significant misalignment, undermining the LLMs’ capabilities and limiting the potential of Multimodal LLMs. To address this, we introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models, such as CLIP, to align visual tokens with the LLM’s embedding space through contrastive learning. This approach ensures a more coherent integration of visual and language representations, enhancing the performance and interpretability of multimodal LLMs while preserving their inherent capabilities. Extensive experiments show that SEA effectively improves MLLMs, particularly for smaller models, without adding extra data or inference computation. SEA also lays the groundwork for developing more general and adaptable solutions to enhance multimodal systems.

[CV-3] EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Link: https://arxiv.org/abs/2408.11811
Authors: Xiuwei Xu,Huangxing Chen,Linqing Zhao,Ziwei Wang,Jie Zhou,Jiwen Lu
Keywords: Embodied tasks require, fully understand, scenes simultaneously, desperately needed, require the agent
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Project page: this https URL

Abstract:Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFM to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM as 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operations, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transferring experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at this https URL, with only one RTX 3090 GPU required for training and evaluation.

[CV-4] Pixel Is Not A Barrier: An Effective Evasion Attack for Pixel-Domain Diffusion Models

Link: https://arxiv.org/abs/2408.11810
Authors: Chun-Yen Shih,Li-Xuan Peng,Jia-Wei Liao,Ernie Chu,Cheng-Fu Chou,Jun-Cheng Chen
Keywords: powerful generative models, Diffusion Models, high-quality image synthesis, Pixel-domain Diffusion Models, editing techniques based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Diffusion Models have emerged as powerful generative models for high-quality image synthesis, with many subsequent image editing techniques based on them. However, the ease of text-based image editing introduces significant risks, such as malicious editing for scams or intellectual property infringement. Previous works have attempted to safeguard images from diffusion-based editing by adding imperceptible perturbations. These methods are costly and specifically target prevalent Latent Diffusion Models (LDMs), while Pixel-domain Diffusion Models (PDMs) remain largely unexplored and robust against such attacks. Our work addresses this gap by proposing a novel attacking framework with a feature representation attack loss that exploits vulnerabilities in denoising UNets and a latent optimization strategy to enhance the naturalness of protected images. Extensive experiments demonstrate the effectiveness of our approach in attacking dominant PDM-based editing methods (e.g., SDEdit) while maintaining reasonable protection fidelity and robustness against common defense methods. Additionally, our framework is extensible to LDMs, achieving comparable performance to existing approaches.

[CV-5] ACE: A Cross-Platform Visual-Exoskeletons System for Low-Cost Dexterous Teleoperation

Link: https://arxiv.org/abs/2408.11805
Authors: Shiqi Yang,Minghuan Liu,Yuzhe Qin,Runyu Ding,Jialong Li,Xuxin Cheng,Ruihan Yang,Sha Yi,Xiaolong Wang
Keywords: recently collected large-scale, large-scale robot data, collected large-scale robot, demonstrations has shown, effective approach
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Webpage: this https URL

Abstract: Learning from demonstrations has been shown to be an effective approach to robotic manipulation, especially with the large-scale robot data recently collected with teleoperation systems. Building an efficient teleoperation system across diverse robot platforms has become more crucial than ever. However, there is a notable lack of cost-effective and user-friendly teleoperation systems for different end-effectors, e.g., anthropomorphic robot hands and grippers, that can operate across multiple platforms. To address this issue, we develop ACE, a cross-platform visual-exoskeleton system for low-cost dexterous teleoperation. Our system utilizes a hand-facing camera to capture 3D hand poses and an exoskeleton mounted on a portable base, enabling accurate real-time capture of both finger and wrist poses. Compared to previous systems, which often require hardware customization according to different robots, our single system can generalize to humanoid hands, arm-hands, arm-gripper, and quadruped-gripper systems with high-precision teleoperation. This enables imitation learning for complex manipulation tasks on diverse platforms.

[CV-6] Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models

Link: https://arxiv.org/abs/2408.11801
Authors: Yuzhou Huang, Yiran Qin, Shunlin Lu, Xintao Wang, Rui Huang, Ying Shan, Ruimao Zhang
Keywords: Traditional visual storytelling, requiring specialized knowledge, Large Language Models, Traditional visual, requiring specialized
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Traditional visual storytelling is complex, requiring specialized knowledge and substantial resources, yet often constrained by human creativity and creation precision. While Large Language Models (LLMs) enhance visual storytelling, current approaches often limit themselves to 2D visuals or oversimplify stories through motion synthesis and behavioral simulation, failing to create comprehensive, multi-dimensional narratives. To this end, we present Story3D-Agent, a pioneering approach that leverages the capabilities of LLMs to transform provided narratives into 3D-rendered visualizations. By integrating procedural modeling, our approach enables precise control over multi-character actions and motions, as well as diverse decorative elements, ensuring the long-range and dynamic 3D representation. Furthermore, our method supports narrative extension through logical reasoning, ensuring that generated content remains consistent with existing conditions. We have thoroughly evaluated our Story3D-Agent to validate its effectiveness, offering a basic framework to advance 3D story representation.

[CV-7] EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Link: https://arxiv.org/abs/2408.11795
Authors: Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
Keywords: numerous studies leverage, studies leverage substantial, leverage substantial image-text, substantial image-text pairs, transforming Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the realm of multimodal research, numerous studies leverage substantial image-text pairs to conduct modal alignment learning, transforming Large Language Models (LLMs) into Multimodal LLMs and excelling in a variety of visual-language tasks. The prevailing methodologies primarily fall into two categories: self-attention-based and cross-attention-based methods. While self-attention-based methods offer superior data efficiency due to their simple MLP architecture, they often suffer from lower computational efficiency due to concatenating visual and textual tokens as input for LLM. Conversely, cross-attention-based methods, although less data-efficient due to additional learnable parameters, exhibit higher computational efficiency by avoiding long sequence input for LLM. To address these trade-offs, we introduce the Data-Efficient and Compute-Efficient Multimodal Large Language Model (EE-MLLM). Without introducing additional modules or learnable parameters, EE-MLLM achieves both data and compute efficiency. Specifically, we modify the original self-attention mechanism in MLLM to a composite attention mechanism. This mechanism has two key characteristics: 1) Eliminating the computational overhead of self-attention within visual tokens to achieve compute efficiency, and 2) Reusing the weights on each layer of LLM to facilitate effective modality alignment between vision and language for data efficiency. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.
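
To make the compute argument concrete, here is a rough sketch of one possible reading of the composite attention described above: visual tokens act only as keys and values, while text tokens are the only queries. The function names, the token counts, and the causal masking among text tokens are assumptions for illustration and are not taken from the paper.

```python
def composite_attention_mask(num_visual, num_text):
    """Boolean mask of shape (num_text, num_visual + num_text).

    Row i is the i-th text query. It may attend to every visual token and to
    text tokens at positions <= i (causal). There are no visual query rows,
    which is where the assumed saving comes from.
    """
    mask = []
    for i in range(num_text):
        visual_part = [True] * num_visual
        text_part = [j <= i for j in range(num_text)]
        mask.append(visual_part + text_part)
    return mask


def attention_score_counts(num_visual, num_text):
    """Number of query-key score entries before masking, per layer and head."""
    total = num_visual + num_text
    full_self_attention = total * total
    composite = num_text * total
    return full_self_attention, composite


# Hypothetical sizes: 576 visual tokens and 64 text tokens.
full, composite = attention_score_counts(576, 64)
print(full, composite, composite / full)
```

With these hypothetical sizes the score matrix shrinks from 640 x 640 to 64 x 640, a tenfold reduction, because the visual-to-visual and visual-to-text blocks are never computed.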

[CV-8] DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

Link: https://arxiv.org/abs/2408.11788
Authors: Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawendé F. Bissyandé, Saad Ezzini
Keywords: Current video generation, generation models excel, realistic clips, Key Frames Iteration, Frames Iteration Design
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 13 pages, 8 figures

Abstract: Current video generation models excel at creating short, realistic clips, but struggle with longer, multi-scene videos. We introduce DreamFactory, an LLM-based framework that tackles this challenge. DreamFactory leverages multi-agent collaboration principles and a Key Frames Iteration Design Method to ensure consistency and style across long videos. It utilizes Chain of Thought (CoT) to address uncertainties inherent in large language models. DreamFactory generates long, stylistically coherent, and complex videos. Evaluating these long-form videos presents a challenge. We propose novel metrics such as Cross-Scene Face Distance Score and Cross-Scene Style Consistency Score. To further research in this area, we contribute the Multi-Scene Videos Dataset containing over 150 human-rated videos.

[CV-9] Timeline and Boundary Guided Diffusion Network for Video Shadow Detection ACM-MM2024

Link: https://arxiv.org/abs/2408.11785
Authors: Haipeng Zhou, Honqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu
Keywords: Boundary Guided Diffusion, Video Shadow Detection, aims to detect, Shadow Boundary Aware, Boundary Aware Attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACM MM2024

Abstract: Video Shadow Detection (VSD) aims to detect shadow masks in a frame sequence. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadow. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD where we take account of the past-future temporal guidance and boundary information jointly. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames for the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize the edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD, in which we explore a Space-Time Encoded Embedding (STEE) to inject the temporal guidance for Diffusion to conduct shadow detection. Benefiting from these designs, our model can capture not only the temporal information but also the shadow property. Extensive experiments show that the performance of our approach overtakes the state-of-the-art methods, verifying the effectiveness of our components. We release the code, weights, and results at this https URL.

[CV-10] Embedding Ordinality to Binary Loss Function for Improving Solar Flare Forecasting

Link: https://arxiv.org/abs/2408.11768
Authors: Chetraj Pandey, Anli Ji, Jinsu Hong, Rafal A. Angryk, Berkay Aydin
Keywords: intrinsic ordinal flare, skill score, True Skill Score, Heidke Skill Score, ordinal flare characteristics
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 10 Pages, 8 Figures. This manuscript is accepted to be published at DSAA 2024 conference. arXiv admin note: substantial text overlap with arXiv:2406.11054

Abstract: In this paper, we propose a novel loss function aimed at optimizing the binary flare prediction problem by embedding the intrinsic ordinal flare characteristics into the binary cross-entropy (BCE) loss function. This modification is intended to provide the model with better guidance based on the ordinal characteristics of the data and improve the overall performance of the models. For our experiments, we employ a ResNet34-based model with transfer learning to predict ≥M-class flares by utilizing the shape-based features of magnetograms of active region (AR) patches spanning from -90° to +90° of solar longitude as our input data. We use a composite skill score (CSS) as our evaluation metric, which is calculated as the geometric mean of the True Skill Score (TSS) and the Heidke Skill Score (HSS), to rank and compare our models' performance. The primary contributions of this work are as follows: (i) we introduce a novel approach to encode ordinality into a binary loss function, showing an application to solar flare prediction; (ii) we enhance solar flare forecasting by enabling flare predictions for each AR across the entire solar disk, without any longitudinal restrictions, and evaluate and compare performance; (iii) our candidate model, optimized with the proposed loss function, shows an improvement of ~7%, ~4%, and ~3% for AR patches within ±30°, ±60°, and ±90° of solar longitude, respectively, in terms of CSS, when compared with standard BCE. Additionally, we demonstrate the ability to issue flare forecasts for ARs in near-limb regions (regions between ±60° and ±90°) with a CSS=0.34 (TSS=0.50 and HSS=0.23), expanding the scope of AR-based models for solar flare prediction. This advances the reliability of solar flare forecasts, leading to more effective prediction capabilities.
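
The composite skill score in this abstract is simple enough to check by hand. The sketch below uses the standard two-class definitions of TSS and HSS from a 2x2 contingency table and takes their geometric mean. The function names and the rule of rejecting negative scores are assumptions made for illustration; the abstract does not say how negative skill scores are handled.

```python
import math


def true_skill_score(tp, fn, fp, tn):
    """TSS = recall minus false alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def heidke_skill_score(tp, fn, fp, tn):
    """HSS for a 2x2 contingency table (two-class form)."""
    numerator = 2 * (tp * tn - fn * fp)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


def composite_skill_score(tss, hss):
    """Geometric mean of TSS and HSS.

    Assumption: the score is only defined here when both inputs are
    non-negative, since the square root of a negative product is not real.
    """
    if tss < 0 or hss < 0:
        raise ValueError("geometric mean needs non-negative TSS and HSS")
    return math.sqrt(tss * hss)


# Near-limb values quoted in the abstract: TSS=0.50, HSS=0.23.
print(round(composite_skill_score(0.50, 0.23), 2))
```

The quoted near-limb figures are consistent with one another: the square root of 0.50 x 0.23 = 0.115 is about 0.339, which rounds to the reported CSS of 0.34.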

[CV-11] SBDet: A Symmetry-Breaking Object Detector via Relaxed Rotation-Equivariance

Link: https://arxiv.org/abs/2408.11760
Authors: Zhiqiang Wu, Yingjie Liu, Hanlin Dong, Xuan Tang, Jian Yang, Bo Jin, Mingsong Chen, Xian Wei
Keywords: Introducing Group Equivariant, Group Equivariant Convolution, explore symmetries hidden, Equivariant Convolution, Introducing Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Introducing Group Equivariant Convolution (GConv) empowers models to explore symmetries hidden in visual data, improving their performance. However, in real-world scenarios, objects or scenes often exhibit perturbations of a symmetric system, specifically a deviation from a symmetric architecture, which can be characterized by a non-trivial action of a symmetry group, known as Symmetry-Breaking. Traditional GConv methods are limited by the strict operation rules in the group space, only ensuring features remain strictly equivariant under limited group transformations, making it difficult to adapt to Symmetry-Breaking or non-rigid transformations. Motivated by this, we introduce a novel Relaxed Rotation GConv (R2GConv) with our defined Relaxed Rotation-Equivariant group R_4. Furthermore, we propose a Relaxed Rotation-Equivariant Network (R2Net) as the backbone and further develop the Symmetry-Breaking Object Detector (SBDet) for 2D object detection built upon it. Experiments demonstrate the effectiveness of our proposed R2GConv in natural image classification tasks, and SBDet achieves excellent performance in object detection tasks with improved generalization capabilities and robustness.

[CV-12] MambaCSR: Dual-Interleaved Scanning for Compressed Image Super-Resolution With SSMs

Link: https://arxiv.org/abs/2408.11758
Authors: Yulin Ren, Xin Li, Mengxi Guo, Bingchen Li, Shijie Zhao, Zhibo Chen
Keywords: effective framework based, challenging compressed image, framework based, compressed image super-resolution, scanning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: We present MambaCSR, a simple but effective framework based on Mamba for the challenging compressed image super-resolution (CSR) task. Particularly, the scanning strategies of Mamba are crucial for effective contextual knowledge modeling in the restoration process despite it relying on selective state space modeling for all tokens. In this work, we propose an efficient dual-interleaved scanning paradigm (DIS) for CSR, which is composed of two scanning strategies: (i) hierarchical interleaved scanning is designed to comprehensively capture and utilize the most potential contextual information within an image by simultaneously taking advantage of the local window-based and sequential scanning methods; (ii) horizontal-to-vertical interleaved scanning is proposed to reduce the computational cost by leaving the redundancy between the scanning of different directions. To overcome the non-uniform compression artifacts, we also propose position-aligned cross-scale scanning to model multi-scale contextual information. Experimental results on multiple benchmarks have shown the great performance of our MambaCSR in the compressed image super-resolution task. The code will soon be available at this https URL.

[CV-13] DH-Bench: Probing Depth and Height Perception of Large Visual-Language Models

Link: https://arxiv.org/abs/2408.11748
Authors: Shehreen Azad, Yash Jain, Rishit Garg, Yogesh S Rawat, Vibhav Vineet
Keywords: Vision Language Models, large Vision Language, Geometric understanding, navigating and interacting, Vision Language
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios necessitates a comparable geometric understanding in visual perception. In this work, we focus on the geometric comprehension of these models, specifically targeting the depths and heights of objects within a scene. Our observations reveal that, although VLMs excel in perceiving basic geometric properties such as shape and size, they encounter significant challenges in reasoning about the depth and height of objects. To address this, we introduce a suite of benchmark datasets encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios to rigorously evaluate these aspects. We benchmark 17 state-of-the-art VLMs using these datasets and find that they consistently struggle with both depth and height perception. Our key insights include detailed analyses of the shortcomings in depth and height reasoning capabilities of VLMs and the inherent bias present in these models. This study aims to pave the way for the development of VLMs with enhanced geometric understanding, crucial for real-world applications. The code and datasets for our benchmarks will be available at this https URL.

[CV-14] Open-Ended 3D Point Cloud Instance Segmentation

Link: https://arxiv.org/abs/2408.11747
Authors: Phuc D.A. Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
Keywords: Instance Segmentation methods, Instance Segmentation, recently demonstrated, demonstrated their ability, ability to generalize
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.

[CV-15] JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet

Link: https://arxiv.org/abs/2408.11744
Authors: Yujia Gu, Haofeng Li, Xinyu Fang, Zihan Peng, Yinan Peng
Keywords: Fine-tuned Stable Diffusion, refine depiction techniques, extract stylistic features, Stable Diffusion Model, Canny Edge Features
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCSMT 2024

Abstract: This study proposes a novel approach to extract stylistic features of Jiehua: the utilization of the Fine-tuned Stable Diffusion Model with ControlNet (FSDMC) to refine depiction techniques from artists' Jiehua. The training data for FSDMC is based on open-source Jiehua artworks collected from the Internet, which were subsequently manually arranged in the format (Original Image, Canny Edge Features, Text Prompt). With the optimal hyperparameters identified in this paper, FSDMC is observed to outperform CycleGAN, another mainstream style transfer model. FSDMC achieves an FID of 3.27 on the dataset and also surpasses CycleGAN in terms of expert evaluation. This not only demonstrates the model's high effectiveness in extracting Jiehua's style features, but also shows that it preserves the original pre-trained semantic information. The findings of this study suggest that the application of FSDMC with appropriate hyperparameters can enhance the efficacy of the Stable Diffusion Model in the field of traditional art style migration tasks, particularly within the context of Jiehua.

[CV-16] CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering

Link: https://arxiv.org/abs/2408.11742
Authors: Yuliang Cai, Mohammad Rostami
Keywords: Large vision-language models, Large vision-language, shown significant performance, significant performance boost, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Large vision-language models (VLMs) have shown significant performance boosts in various application domains. However, adopting them to deal with several sequentially encountered tasks has been challenging because finetuning a VLM on a task normally reduces its generalization power and its capacity to learn new tasks, and causes catastrophic forgetting of previously learned tasks. Enabling the use of VLMs in multimodal continual learning (CL) settings can help to address such scenarios. To improve generalization capacity and prevent catastrophic forgetting, we propose a novel prompt-based CL method for VLMs, namely Cluster-based Modality Fusion Prompt (CluMo). We design a novel Key-Key-Prompt pair, where each prompt is associated with a visual prompt key and a textual prompt key. We adopt a two-stage training strategy. During the first stage, the single-modal keys are trained via the K-means clustering algorithm to help select the best semantically matched prompt. During the second stage, the prompt keys are frozen, and the selected prompt is attached to the input for training the VLM in the CL scenario. Experiments on two benchmarks demonstrate that our method achieves SOTA performance.

[CV-17] Enhancing Cross-Modal Medical Image Segmentation through Compositionality MICCAI2024

Link: https://arxiv.org/abs/2408.11733
Authors: Aniek Eijpe, Valentina Corbetta, Kalina Chupetlovska, Regina Beets-Tan, Wilson Silva
Keywords: modalities produce images, imaging modalities produce, image segmentation presents, Cross-modal medical image, significant challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 2 tables. Accepted at Deep Generative Models workshop @ MICCAI 2024 (DGM4MICCAI). This is the submitted manuscript with added link to github repo, funding acknowledgements and authors' names and affiliations. No further post submission improvements or corrections were integrated. Final version not published yet

Abstract:Cross-modal medical image segmentation presents a significant challenge, as different imaging modalities produce images with varying resolutions, contrasts, and appearances of anatomical structures. We introduce compositionality as an inductive bias in a cross-modal segmentation network to improve segmentation performance and interpretability while reducing complexity. The proposed network is an end-to-end cross-modal segmentation framework that enforces compositionality on the learned representations using learnable von Mises-Fisher kernels. These kernels facilitate content-style disentanglement in the learned representations, resulting in compositional content representations that are inherently interpretable and effectively disentangle different anatomical structures. The experimental results demonstrate enhanced segmentation performance and reduced computational costs on multiple medical datasets. Additionally, we demonstrate the interpretability of the learned compositional features. Code and checkpoints will be publicly available at: this https URL.

[CV-18] Iterative Object Count Optimization for Text-to-image Diffusion Models

Link: https://arxiv.org/abs/2408.11721
Authors: Oz Zafar, Lior Wolf, Idan Schwartz
Keywords: accurately generating, counting, counting model, models, objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Pre-print

Abstract: We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at this https URL.

[CV-19] On Learnable Parameters of Optimal and Suboptimal Deep Learning Models

Link: https://arxiv.org/abs/2408.11720
Authors: Ziwei Zheng, Huizhi Liang, Vaclav Snasel, Vito Latora, Panos Pardalos, Giuseppe Nicosia, Varun Ojha
Keywords: deep learning models, learning models, deep learning, node interaction, scrutinize the structural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We scrutinize the structural and operational aspects of deep learning models, particularly focusing on the nuances of learnable parameters (weight) statistics, distribution, node interaction, and visualization. By establishing correlations between variance in weight patterns and overall network performance, we investigate the varying (optimal and suboptimal) performances of various deep-learning models. Our empirical analysis extends across widely recognized datasets such as MNIST, Fashion-MNIST, and CIFAR-10, and various deep learning models such as deep neural networks (DNNs), convolutional neural networks (CNNs), and vision transformer (ViT), enabling us to pinpoint characteristics of learnable parameters that correlate with successful networks. Through extensive experiments on the diverse architectures of deep learning models, we shed light on the critical factors that influence the functionality and efficiency of DNNs. Our findings reveal that successful networks, irrespective of datasets or models, are invariably similar to other successful networks in their converged weights statistics and distribution, while poor-performing networks vary in their weights. In addition, our research shows that the learnable parameters of widely varied deep learning models such as DNN, CNN, and ViT exhibit similar learning characteristics.

[CV-20] ControlCol: Controllability in Automatic Speaker Video Colorization

Link: https://arxiv.org/abs/2408.11711
Authors: Rory Ward, John G. Breslin, Peter Corcoran
Keywords: Adding color, highly desirable technique, highly desirable, speaker videos automatically, Adding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Adding color to black-and-white speaker videos automatically is a highly desirable technique. It is an artistic process that requires interactivity with humans for the best results. Many existing automatic video colorization systems provide little opportunity for the user to guide the colorization process. In this work, we introduce a novel automatic speaker video colorization system which provides controllability to the user while also maintaining high colorization quality relative to state-of-the-art techniques. We name this system ControlCol. ControlCol performs 3.5% better than the previous state-of-the-art DeOldify on the Grid and Lombard Grid datasets when PSNR, SSIM, FID and FVD are used as metrics. This result is also supported by our human evaluation, where in a head-to-head comparison, ControlCol is preferred 90% of the time to DeOldify. Example videos can be seen in the supplementary material.

[CV-21] FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Link: https://arxiv.org/abs/2408.11706
Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohan Sai Singamsetti, Fengyu Sun, Wei Lu, Di Niu
Keywords: demonstrated impressive capabilities, prompt-image alignment, generating high-quality images, diffusion models, models have demonstrated
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image (T2I) diffusion models have demonstrated impressive capabilities in generating high-quality images given a text prompt. However, ensuring the prompt-image alignment remains a considerable challenge, i.e., generating images that faithfully align with the prompt’s semantics. Recent works attempt to improve the faithfulness by optimizing the latent code, which potentially could cause the latent code to go out-of-distribution and thus produce unrealistic images. In this paper, we propose FRAP, a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images. We design an online algorithm to adaptively update each token’s weight coefficient, which is achieved by minimizing a unified objective function that encourages object presence and the binding of object-modifier pairs. Through extensive evaluations, we show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets, while having a lower average latency compared to recent latent code optimization methods, e.g., 4 seconds faster than DB on the COCO-Subject dataset. Furthermore, through visual comparisons and evaluation on the CLIP-IQA-Real metric, we show that FRAP not only improves prompt-image alignment but also generates more authentic images with realistic appearances. We also explore combining FRAP with prompt rewriting LLM to recover their degraded prompt-image alignment, where we observe improvements in both prompt-image alignment and image quality.

[CV-22] Supervised Representation Learning towards Generalizable Assembly State Recognition

Link: https://arxiv.org/abs/2408.11700
Authors: Tim J. Schoonbeek, Goutham Balachandran, Hans Onvlee, Tim Houben, Shao-Hsuan Hung, Jacek Kustra, Peter H.N. de With, Fons van der Sommen
Keywords: state recognition facilitates, Assembly state recognition, offering feedback, assembly procedures, execution errors
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures

Abstract:Assembly state recognition facilitates the execution of assembly procedures, offering feedback to enhance efficiency and minimize errors. However, recognizing assembly states poses challenges in scalability, since parts are frequently updated, and the robustness to execution errors remains underexplored. To address these challenges, this paper proposes an approach based on representation learning and the novel intermediate-state informed loss function modification (ISIL). ISIL leverages unlabeled transitions between states and demonstrates significant improvements in clustering and classification performance for all tested architectures and losses. Despite being trained exclusively on images without execution errors, thorough analysis on error states demonstrates that our approach accurately distinguishes between correct states and states with various types of execution errors. The integration of the proposed algorithm can offer meaningful assistance to workers and mitigate unexpected losses due to procedural mishaps in industrial settings. The code is available at: this https URL

[CV-23] Robust 3D Gaussian Splatting for Novel View Synthesis in Presence of Distractors

Link: https://arxiv.org/abs/2408.11697
Authors: Paul Ungermann, Armin Ettenhofer, Matthias Nießner, Barbara Roessle
Keywords: view synthesis results, dynamic objects polluting, Gaussian Splatting, shown impressive, impressive novel view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GCPR 2024, Project Page: this https URL, Video: this https URL

Abstract: 3D Gaussian Splatting has shown impressive novel view synthesis results; nonetheless, it is vulnerable to dynamic objects polluting the input data of an otherwise static scene, so-called distractors. Distractors have a severe impact on the rendering quality as they get represented as view-dependent effects or result in floating artifacts. Our goal is to identify and ignore such distractors during the 3D Gaussian optimization to obtain a clean reconstruction. To this end, we take a self-supervised approach that looks at the image residuals during the optimization to determine areas that have likely been falsified by a distractor. In addition, we leverage a pretrained segmentation network to provide object awareness, enabling more accurate exclusion of distractors. This way, we obtain segmentation masks of distractors to effectively ignore them in the loss formulation. We demonstrate that our approach is robust to various distractors and strongly improves rendering quality on distractor-polluted scenes, improving PSNR by 1.86 dB compared to 3D Gaussian Splatting.
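
For readers less used to decibel figures, the short sketch below shows what a PSNR gain of this size means in terms of mean squared error, using the standard definition PSNR = 10 log10(MAX^2 / MSE). The baseline error value and the function names are hypothetical and chosen only for illustration; they are not numbers or code from the paper.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


baseline_mse = 0.004  # hypothetical baseline error, not a value from the paper
improved_mse = baseline_mse * mse_ratio_for_gain(1.86)
print(round(psnr(improved_mse) - psnr(baseline_mse), 2))
print(round(mse_ratio_for_gain(1.86), 3))
```

A gain of 1.86 dB corresponds to the mean squared error dropping to roughly 65% of its previous value, independent of the baseline chosen.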

[CV-24] Interpretable Long-term Action Quality Assessment BMVC

Link: https://arxiv.org/abs/2408.11687
Authors: Xu Dong,Xinran Liu,Wanqing Li,Anthony Adeyemi-Ejeye,Andrew Gilbert
Keywords: Action Quality Assessment, Quality Assessment, Long-term Action Quality, Action Quality, evaluates the execution
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to British Machine Vision Conference (BMVC) 2024

Abstract:Long-term Action Quality Assessment (AQA) evaluates the execution of activities in videos. However, the length presents challenges in fine-grained interpretability, with current AQA methods typically producing a single score by averaging clip features, lacking detailed semantic meanings of individual clips. Long-term videos pose additional difficulty due to the complexity and diversity of actions, exacerbating interpretability challenges. While query-based transformer networks offer promising long-term modeling capabilities, their interpretability in AQA remains unsatisfactory due to a phenomenon we term Temporal Skipping, where the model skips self-attention layers to prevent output degradation. To address this, we propose an attention loss function and a query initialization method to enhance performance and interpretability. Additionally, we introduce a weight-score regression module designed to approximate the scoring patterns observed in human judgments and replace conventional single-score regression, improving the rationality of interpretability. Our approach achieves state-of-the-art results on three real-world, long-term AQA benchmarks. Our code is available at: this https URL

[CV-25] Exploring Robustness of Visual State Space model against Backdoor Attacks

Link: https://arxiv.org/abs/2408.11679
Authors: Cheng-Yi Lee,Cheng-Chang Tsai,Chia-Mu Yu,Chun-Shien Lu
Keywords: Visual State Space, computer vision tasks, demonstrated remarkable performance, State Space Model, Visual State
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures, under review

Abstract: Visual State Space Model (VSS) has demonstrated remarkable performance in various computer vision tasks. However, in the process of development, backdoor attacks have brought severe challenges to security. Such attacks cause an infected model to predict target labels when a specific trigger is activated, while the model behaves normally on benign samples. In this paper, we conduct systematic experiments to understand the robustness of VSS through the lens of backdoor attacks, specifically how the state space model (SSM) mechanism affects robustness. We first investigate the vulnerability of VSS to different backdoor triggers and reveal that the SSM mechanism, which captures contextual information within patches, makes the VSS model more susceptible to backdoor triggers compared to models without SSM. Furthermore, we analyze the sensitivity of the VSS model to patch processing techniques and discover that these triggers are effectively disrupted. Based on these observations, we consider an effective backdoor for the VSS model that recurs in each patch to resist patch perturbations. Extensive experiments across three datasets and various backdoor attacks reveal that the VSS model performs comparably to Transformers (ViTs) but is less robust than the Gated CNNs, which comprise only stacked Gated CNN blocks without SSM.

[CV-26] Video-to-Text Pedestrian Monitoring (VTPM): Leveraging Computer Vision and Large Language Models for Privacy-Preserve Pedestrian Activity Monitoring at Intersections

Link: https://arxiv.org/abs/2408.11649
Authors: Ahmed S. Abdelrahman,Mohamed Abdel-Aty,Dongdong Wang
Keywords: advanced research methodologies, enhancing system services, research methodologies, advanced research, textual reports
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Computer vision has advanced research methodologies, enhancing system services across various fields. It is a core component in traffic monitoring systems for improving road safety; however, these monitoring systems don’t preserve the privacy of pedestrians who appear in the videos, potentially revealing their identities. Addressing this issue, our paper introduces Video-to-Text Pedestrian Monitoring (VTPM), which monitors pedestrian movements at intersections and generates real-time textual reports, including traffic signal and weather information. VTPM uses computer vision models for pedestrian detection and tracking, achieving a latency of 0.05 seconds per video frame. Additionally, it detects crossing violations with 90.2% accuracy by incorporating traffic signal data. The proposed framework is equipped with Phi-3 mini-4k to generate real-time textual reports of pedestrian activity while stating safety concerns like crossing violations, conflicts, and the impact of weather on their behavior with latency of 0.33 seconds. To enhance comprehensive analysis of the generated textual reports, Phi-3 medium is fine-tuned for historical analysis of these generated textual reports. This fine-tuning enables more reliable analysis about the pedestrian safety at intersections, effectively detecting patterns and safety critical events. The proposed VTPM offers a more efficient alternative to video footage by using textual reports reducing memory usage, saving up to 253 million percent, eliminating privacy issues, and enabling comprehensive interactive historical analysis.

[CV-27] MCDubber: Multimodal Context-Aware Expressive Video Dubbing

Link: https://arxiv.org/abs/2408.11593
Authors: Yuan Zhao,Zhenqi Jia,Rui Liu,De Hu,Feilong Bao,Guanglai Gao
Keywords: Automatic Video Dubbing, Automatic Video, context, Current AVD models, Context-aware video Dubbing
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract: Automatic Video Dubbing (AVD) aims to take the given script and generate speech that aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual information of the current sentence to enhance the prosody of synthesized speech. However, it is crucial to consider whether the prosody of the generated dubbing aligns with the multimodal context, as the dubbing will be combined with the original context in the final video. This aspect has been overlooked in previous studies. To address this issue, we propose a Multimodal Context-aware video Dubbing model, termed MCDubber, to convert the modeling object from a single sentence to a longer sequence with context information to ensure the consistency of the global context prosody. MCDubber comprises three main components: (1) A context duration aligner aims to learn the context-aware alignment between the text and lip frames; (2) A context prosody predictor seeks to read the global context visual sequence and predict the context-aware global energy and pitch; (3) A context acoustic decoder ultimately predicts the global context mel-spectrogram with the assistance of adjacent ground-truth mel-spectrograms of the target sentence. Through this process, MCDubber fully considers the influence of multimodal context on the prosody expressiveness of the current sentence when dubbing. The extracted mel-spectrogram belonging to the target sentence from the output context mel-spectrograms is the final required dubbing audio. Extensive experiments on the Chem benchmark dataset demonstrate that our MCDubber significantly improves dubbing expressiveness compared to all advanced baselines. The code and demos are available at this https URL.

[CV-28] Toward Enhancing Vehicle Color Recognition in Adverse Conditions: A Dataset and Benchmark

Link: https://arxiv.org/abs/2408.11589
Authors: Gabriel E. Lima,Rayson Laroca,Eduardo Santos,Eduil Nascimento Jr.,David Menotti
Keywords: practical domains, criminal investigations, Vehicle Color Recognition, Vehicle information recognition, VCR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024

Abstract:Vehicle information recognition is crucial in various practical domains, particularly in criminal investigations. Vehicle Color Recognition (VCR) has garnered significant research interest because color is a visually distinguishable attribute of vehicles and is less affected by partial occlusion and changes in viewpoint. Despite the success of existing methods for this task, the relatively low complexity of the datasets used in the literature has been largely overlooked. This research addresses this gap by compiling a new dataset representing a more challenging VCR scenario. The images - sourced from six license plate recognition datasets - are categorized into eleven colors, and their annotations were validated using official vehicle registration information. We evaluate the performance of four deep learning models on a widely adopted dataset and our proposed dataset to establish a benchmark. The results demonstrate that our dataset poses greater difficulty for the tested models and highlights scenarios that require further exploration in VCR. Remarkably, nighttime scenes account for a significant portion of the errors made by the best-performing model. This research provides a foundation for future studies on VCR, while also offering valuable insights for the field of fine-grained vehicle classification.

[CV-29] RaNDT SLAM: Radar SLAM Based on Intensity-Augmented Normal Distributions Transform

Link: https://arxiv.org/abs/2408.11576
Authors: Maximilian Hilger,Nils Mandischer,Burkhard Corves
Keywords: Rescue robotics sets, robotics sets high, sets high requirements, perception algorithms due, Rescue robotics
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: This work was accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024

Abstract: Rescue robotics places high demands on perception algorithms due to the unstructured and potentially vision-denied environments. Pivoting Frequency-Modulated Continuous Wave radars are an emerging sensing modality for SLAM in this kind of environment. However, the complex noise characteristics of radar SLAM make applications, particularly indoor ones, computationally demanding and slow. In this work, we introduce a novel radar SLAM framework, RaNDT SLAM, that operates fast and generates accurate robot trajectories. The method is based on the Normal Distributions Transform augmented by radar intensity measures. Motion estimation is based on fusion of motion model, IMU data, and registration of the intensity-augmented Normal Distributions Transform. We evaluate RaNDT SLAM on a new benchmark dataset and the Oxford Radar RobotCar dataset. The new dataset contains indoor and outdoor environments as well as multiple sensing modalities (LiDAR, radar, and IMU).

[CV-30] Finite element-based space-time total variation-type regularization of the inverse problem in electrocardiographic imaging

Link: https://arxiv.org/abs/2408.11573
Authors: Manuel Haas,Thomas Grandits,Thomas Pinetz,Thomas Beiert,Simone Pezzuto,Alexander Effland
Keywords: cardiac electrical activity, severely ill-posed inverse, Reconstructing cardiac electrical, ill-posed inverse problem, electric potential measurements
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reconstructing cardiac electrical activity from body surface electric potential measurements results in the severely ill-posed inverse problem in electrocardiography. Many different regularization approaches have been proposed to improve numerical results and provide unique results. This work presents a novel approach for reconstructing the epicardial potential from body surface potential maps based on a space-time total variation-type regularization using finite elements, where a first-order primal-dual algorithm solves the underlying convex optimization problem. In several numerical experiments, the superior performance of this method and the benefit of space-time regularization for the reconstruction of epicardial potential on two-dimensional torso data and a three-dimensional rabbit heart compared to state-of-the-art methods are demonstrated.
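
To make the combination of total variation regularization and a first-order primal-dual solver concrete, here is a minimal sketch on a much simpler problem: 1D signal denoising with a Chambolle-Pock style iteration. It is a toy under stated assumptions, not the paper's space-time finite element formulation; the function name `tv_denoise_1d`, the step sizes, and the test signal are all chosen for illustration.

```python
import numpy as np

def tv_denoise_1d(b, lam, n_iter=2000):
    """Minimize 0.5*||x - b||^2 + lam*||D x||_1, with D the forward difference."""
    x = b.copy()
    x_bar = x.copy()
    y = np.zeros(b.size - 1)            # dual variable, one entry per difference
    tau = sigma = 0.45                  # tau*sigma*||D||^2 < 1 because ||D||^2 <= 4
    for _ in range(n_iter):
        # Dual ascent step followed by projection onto the box [-lam, lam].
        y = np.clip(y + sigma * np.diff(x_bar), -lam, lam)
        # Apply the transpose of the forward-difference operator to y.
        dt_y = np.concatenate(([-y[0]], y[:-1] - y[1:], [y[-1]]))
        # Primal step: proximal map of the quadratic data term.
        x_new = (x - tau * dt_y + tau * b) / (1.0 + tau)
        # Over-relaxation used by the primal-dual scheme.
        x_bar = 2.0 * x_new - x
        x = x_new
    return x

rng = np.random.default_rng(0)
clean = np.concatenate((np.zeros(50), np.ones(50)))     # piecewise-constant signal
noisy = clean + 0.1 * rng.standard_normal(clean.size)
denoised = tv_denoise_1d(noisy, lam=0.5)
print(np.abs(np.diff(noisy)).sum(), np.abs(np.diff(denoised)).sum())
```

In an inverse problem such as the one in the abstract, the quadratic term would involve a forward operator rather than the identity, so the primal step would no longer have this closed form.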

[CV-31] CHOTA: A Higher Order Accuracy Metric for Cell Tracking

Link: https://arxiv.org/abs/2408.11571
Authors: Timo Kaiser,Vladimir Ulman,Bodo Rosenhahn
Keywords: significantly impacting biomedical, impacting biomedical research, tracking, cell tracking, Cell-specific Higher Order
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BIC Workshop at European Conference on Computer Vision 2024, 14 pages, 4 figures, 2 tables

Abstract:The evaluation of cell tracking results steers the development of tracking methods, significantly impacting biomedical research. This is quantitatively achieved by means of evaluation metrics. Unfortunately, current metrics favor local correctness and weakly reward global coherence, impeding high-level biological analysis. To also foster global coherence, we propose the CHOTA metric (Cell-specific Higher Order Tracking Accuracy) which unifies the evaluation of all relevant aspects of cell tracking: cell detections and local associations, global coherence, and lineage tracking. We achieve this by introducing a new definition of the term ‘trajectory’ that includes the entire cell lineage and by including this into the well-established HOTA metric from general multiple object tracking. Furthermore, we provide a detailed survey of contemporary cell tracking metrics to compare our novel CHOTA metric and to show its advantages. All metrics are extensively evaluated on state-of-the-art real-data cell tracking results and synthetic results that simulate specific tracking errors. We show that CHOTA is sensitive to all tracking errors and gives a good indication of the biologically relevant capability of a method to reconstruct the full lineage of cells. It introduces a robust and comprehensive alternative to the currently used metrics in cell tracking. Python code is available at this https URL .

[CV-32] Positional Prompt Tuning for Efficient 3D Representation Learning

Link: https://arxiv.org/abs/2408.11567
Authors: Shaochen Zhang,Zekun Qi,Runpei Dong,Xiuxiu Bai,Xing Wei
Keywords: Point cloud analysis, point cloud classification, achieved significant development, Point cloud, multiple downstream tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: tech report

Abstract: Point cloud analysis has achieved significant development and performs well in multiple downstream tasks such as point cloud classification and segmentation. Being conscious of the simplicity of the position encoding structure in Transformer-based architectures, we attach importance to the position encoding as a high-dimensional part and the patch encoder to offer multi-scale information. Together with the sequential Transformer, the whole module with position encoding comprehensively constructs a multi-scale feature abstraction module that considers both the local parts from the patch and the global parts from center points as position encoding. With only a few parameters, the position embedding module fits the setting of PEFT (Parameter-Efficient Fine-Tuning) tasks well. Thus we unfreeze these parameters as a fine-tuning part. At the same time, we review the existing prompt and adapter tuning methods, proposing a fresh way of prompts and synthesizing them with adapters as dynamic adjustments. Our proposed PEFT method, namely PPT, with only 1.05% of parameters for training, achieves state-of-the-art results on several mainstream datasets, such as 95.01% accuracy on the ScanObjectNN OBJ_BG dataset. Code will be released at this https URL.

[CV-33] AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

Link: https://arxiv.org/abs/2408.11564
Authors: Minheng Ni,Chenfei Wu,Huaying Yuan,Zhengyuan Yang,Ming Gong,Lijuan Wang,Zicheng Liu,Wangmeng Zuo,Nan Duan
Keywords: achieved significant realism, generative models, significant realism, advancement of generative, speech has achieved
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism. However, the approach to generating multi-sensory outputs has not been fully explored, limiting its application in high-value scenarios such as directing a film. Developing a movie director agent faces two major challenges: (1) Lack of parallelism and online scheduling with production steps: In the production of multi-sensory films, there are complex dependencies between different sensory elements, and the production time for each element varies. (2) Diverse needs and clear communication demands with users: Users often cannot clearly express their needs until they see a draft, which requires human-computer interaction and iteration to continually adjust and optimize the film content based on user feedback. To address these issues, we introduce AutoDirector, an interactive multi-sensory composition framework that supports long shots, special effects, music scoring, dubbing, and lip-syncing. This framework improves the efficiency of multi-sensory film production through automatic scheduling and supports the modification and improvement of interactive tasks to meet user needs. AutoDirector not only expands the application scope of human-machine collaboration but also demonstrates the potential of AI in collaborating with humans in the role of a film director to complete multi-sensory films.

[CV-34] Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control

Link: https://arxiv.org/abs/2408.11561
Authors: Muhammad Aqeel,Shakiba Sharifi,Marco Cristani,Francesco Setti
Keywords: Iterative Refinement Process, Refinement Process, introduces the Iterative, Iterative Refinement, industrial quality control
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This study introduces the Iterative Refinement Process (IRP), a robust anomaly detection methodology designed for high-stakes industrial quality control. The IRP enhances defect detection accuracy through a cyclic data refinement strategy, iteratively removing misleading data points to improve model performance and robustness. We validate the IRP’s effectiveness using two benchmark datasets, Kolektor SDD2 (KSDD2) and MVTec AD, covering a wide range of industrial products and defect types. Our experimental results demonstrate that the IRP consistently outperforms traditional anomaly detection models, particularly in environments with high noise levels. This study highlights the IRP’s potential to significantly enhance anomaly detection processes in industrial settings, effectively managing the challenges of sparse and noisy data.
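
The cyclic refinement described above (fit, score, discard the most suspicious training points, repeat) can be sketched in a few lines. The example below is an illustration under loud assumptions: a per-feature Gaussian model stands in for the anomaly detector, and `iterative_refinement`, `n_rounds`, and `drop_fraction` are hypothetical names and settings, not taken from the paper.

```python
import numpy as np

def iterative_refinement(features, n_rounds=3, drop_fraction=0.05):
    """Return the indices of training samples kept after repeated pruning."""
    keep = np.arange(len(features))
    for _ in range(n_rounds):
        # Fit a simple stand-in model (mean and spread) on the current subset.
        mean = features[keep].mean(axis=0)
        std = features[keep].std(axis=0) + 1e-8
        # Anomaly score: norm of the standardized deviation from the mean.
        scores = np.linalg.norm((features[keep] - mean) / std, axis=1)
        # Discard the highest-scoring fraction, then refit in the next round.
        n_drop = int(np.ceil(drop_fraction * len(keep)))
        keep = keep[np.argsort(scores)[: len(keep) - n_drop]]
    return np.sort(keep)

rng = np.random.default_rng(0)
normal = rng.standard_normal((200, 4))                  # clean training samples
contaminated = 8.0 + rng.standard_normal((10, 4))       # mislabeled defective samples
features = np.vstack((normal, contaminated))            # indices 200..209 are contaminated

kept = iterative_refinement(features)
print(len(kept), int((kept >= 200).sum()))
```

A fixed drop fraction also removes some clean samples in every round, so the number of rounds trades robustness to contamination against the amount of training data retained.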

[CV-35] Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Link: https://arxiv.org/abs/2408.11559
Authors: Duc-Hai Pham,Duc Dung Nguyen,Hoang-Anh Pham,Ho Lai Tuan,Phong Ha Nguyen,Khoi Nguyen,Rang Nguyen
Keywords: enabling autonomous agents, visual images, planning and navigation, images is vital, vital in enabling
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.

[CV-36] GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation ICPR2024

Link: https://arxiv.org/abs/2408.11558
Authors: Abiao Li,Chenlei Lv,Guofeng Mei,Yifan Zuo,Jian Zhang,Yuming Fang
Keywords: Learning meaningful local, Learning meaningful, global information remains, point cloud segmentation, cloud segmentation tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICPR 2024

Abstract: Learning meaningful local and global information remains a challenge in point cloud segmentation tasks. When utilizing local information, prior studies indiscriminately aggregate neighbor information from different classes to update query points, potentially compromising the distinctive features of query points. In parallel, inaccurate modeling of long-distance contextual dependencies when utilizing global information can also impact model performance. To address these issues, we propose GSTran, a novel transformer network tailored for the segmentation task. The proposed network mainly consists of two principal components: a local geometric transformer and a global semantic transformer. In the local geometric transformer module, we explicitly calculate the geometric disparity within the local region. This enables amplifying the affinity with geometrically similar neighbor points while suppressing the association with other neighbors. In the global semantic transformer module, we design a multi-head voting strategy. This strategy evaluates semantic similarity across the entire spatial range, facilitating the precise capture of contextual dependencies. Experiments on ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method, showing its superiority over other algorithms. The code is available at this https URL.

[CV-37] AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Link: https://arxiv.org/abs/2408.11553
Authors: Yunfang Niu,Lingxiang Wu,Dong Yi,Jie Peng,Ning Jiang,Haiying Wu,Jinqiao Wang
Keywords: person appearance based, aims to modify, modify a person, person appearance, appearance based
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Fashion image editing aims to modify a person’s appearance based on a given instruction. Existing methods require auxiliary tools like segmenters and keypoint extractors, lacking a flexible and unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people in clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to include a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users can simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types and CLIP-encoded apparel features. Both qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.

[CV-38] UNetMamba: Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

Link: https://arxiv.org/abs/2408.11545
Authors: Enze Zhu,Zhan Chen,Dingkai Wang,Hanru Shi,Xiaoxuan Liu,Lei Wang
Keywords: high-resolution remote sensing, remote sensing images, sensing images plays, disaster assessment, remote sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The semantic segmentation of high-resolution remote sensing images plays a crucial role in downstream applications such as urban planning and disaster assessment. However, existing Transformer-based methods suffer from the trade-off between accuracy and efficiency. To overcome this dilemma, we propose UNetMamba, a novel Mamba-based semantic segmentation model. It incorporates a Mamba Segmentation Decoder (MSD) that can efficiently decode the complex information within high-resolution images, and a Local Supervision Module (LSM), which is train-only but can significantly enhance the perception of local contents. Extensive experiments demonstrate that UNetMamba outperforms the state-of-the-art methods with the mIoU increased by 0.87% on LoveDA and 0.36% on ISPRS Vaihingen, while achieving high efficiency through light weight, low memory footprint and low computational cost. The source code will soon be publicly available at this https URL.
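
For readers unfamiliar with the mIoU metric quoted above, the following sketch shows one common way to compute it from a confusion matrix. The helper `mean_iou` is a hypothetical name, and benchmark-specific conventions (ignored labels, per-dataset class lists) are deliberately left out, so this is not the evaluation code behind the reported numbers.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes that occur in pred or target."""
    # Confusion matrix: rows index the true class, columns the predicted class.
    index = num_classes * target.ravel() + pred.ravel()
    conf = np.bincount(index, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0                      # skip classes absent from both maps
    return (intersection[present] / union[present]).mean()

target = np.array([[0, 0], [1, 1]])          # ground-truth label map
pred = np.array([[0, 1], [1, 1]])            # predicted label map with one wrong pixel
miou = mean_iou(pred, target, num_classes=2)
print(miou)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.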

[CV-39] Evolution of Detection Performance throughout the Online Lifespan of Synthetic Images

Link: https://arxiv.org/abs/2408.11541
Authors: Dimitrios Karageorgiou,Quentin Bammey,Valentin Porcellini,Bertrand Goupil,Denis Teyssou,Symeon Papadopoulos
Keywords: disseminated online significantly, online significantly differ, Synthetic images disseminated, images disseminated online, significantly differ
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Synthetic images disseminated online significantly differ from those used during the training and evaluation of the state-of-the-art detectors. In this work, we analyze the performance of synthetic image detectors as deceptive synthetic images evolve throughout their online lifespan. Our study reveals that, despite advancements in the field, current state-of-the-art detectors struggle to distinguish between synthetic and real images in the wild. Moreover, we show that the time elapsed since the initial online appearance of a synthetic image negatively affects the performance of most detectors. Ultimately, by employing a retrieval-assisted detection approach, we demonstrate the feasibility to maintain initial detection performance throughout the whole online lifespan of an image and enhance the average detection efficacy across several state-of-the-art detectors by 6.7% and 7.8% for balanced accuracy and AUC metrics, respectively.

[CV-40] DeRainGS: Gaussian Splatting for Enhanced Scene Reconstruction in Rainy

Link: https://arxiv.org/abs/2408.11540
Authors: Shuhong Liu,Xiang Chen,Hongming Chen,Quanfeng Xu,Mingrui Li
Keywords: conditions poses significant, poses significant challenges, significant challenges due, rainy conditions poses, visual perception
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reconstruction under adverse rainy conditions poses significant challenges due to reduced visibility and the distortion of visual perception. These conditions can severely impair the quality of geometric maps, which is essential for applications ranging from autonomous planning to environmental monitoring. In response to these challenges, this study introduces the novel task of 3D Reconstruction in Rainy Environments (3DRRE), specifically designed to address the complexities of reconstructing 3D scenes under rainy conditions. To benchmark this task, we construct the HydroViews dataset that comprises a diverse collection of both synthesized and real-world scene images characterized by various intensities of rain streaks and raindrops. Furthermore, we propose DeRainGS, the first 3DGS method tailored for reconstruction in adverse rainy environments. Extensive experiments across a wide range of rain scenarios demonstrate that our method delivers state-of-the-art performance, remarkably outperforming existing occlusion-free methods by a large margin.

[CV-41] A Survey of Embodied Learning for Object-Centric Robotic Manipulation

Link: https://arxiv.org/abs/2408.11537
Authors: Ying Zheng,Lei Yao,Yuejiao Su,Yi Zhang,Yi Wang,Sicheng Zhao,Yiyi Zhang,Lap-Pui Chau
Keywords: rapidly developing, developing and challenging, challenging area, object-centric robotic manipulation, Embodied
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot’s performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at this https URL.

[CV-42] SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Link: https://arxiv.org/abs/2408.11535
Authors: Chongkai Yu,Anqi Li,Xiaochao Qu,Luoqi Liu,Ting Liu
Keywords: SAM extracts image, marks a significant, significant milestone, milestone for interactive, SAM extracts
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: The advent of the Segment Anything Model (SAM) marks a significant milestone for interactive segmentation using generalist models. As a late fusion model, SAM extracts image embeddings once and merges them with prompts in later interactions. This strategy limits the model’s ability to extract detailed information from the prompted target zone. Current specialist models utilize the early fusion strategy that encodes the combination of images and prompts to target the prompted objects, yet repetitive complex computations on the images result in high latency. The key to these issues is efficiently synergizing the images and prompts. We propose SAM-REF, a two-stage refinement framework that fully integrates images and prompts globally and locally while maintaining the accuracy of early fusion and the efficiency of late fusion. The first-stage GlobalDiff Refiner is a lightweight early fusion network that combines the whole image and prompts, focusing on capturing detailed information for the entire object. The second-stage PatchDiff Refiner locates the object detail window according to the mask and prompts, then refines the local details of the object. Experimentally, we demonstrate the high effectiveness and efficiency of our method in tackling complex cases with multiple interactions. Our SAM-REF model outperforms the current state-of-the-art method in most metrics on segmentation quality without compromising efficiency.

[CV-43] Just Project! Multi-Channel Despeckling the Easy Way

Link: https://arxiv.org/abs/2408.11531
Authors: Loïc Denis, Emanuele Dalsasso (EPFL), Florence Tupin (IMAGES, IDS)
Keywords: Reducing speckle fluctuations, multi-channel SAR images, Reducing speckle, interferometric height estimation, SAR images
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Reducing speckle fluctuations in multi-channel SAR images is essential in many applications of SAR imaging such as polarimetric classification or interferometric height estimation. While single-channel despeckling has widely benefited from the application of deep learning techniques, extensions to multi-channel SAR images are much more challenging. This paper introduces MuChaPro, a generic framework that exploits existing single-channel despeckling methods. The key idea is to generate numerous single-channel projections, restore these projections, and recombine them into the final multi-channel estimate. This simple approach is shown to be effective in polarimetric and/or interferometric modalities. A special appeal of MuChaPro is the possibility to apply a self-supervised training strategy to learn sensor-specific networks for single-channel despeckling.
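
The "project, restore, recombine" idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: pixels are real-valued and two-channel (real SAR data is complex-valued), projections are taken along directions (cos θ, sin θ), the "restorer" is a plain average standing in for a single-channel despeckling network, and recombination is an ordinary least-squares fit of the 2×2 second-moment matrix. The function names `project`, `recombine`, and `muchapro_toy` are hypothetical.

```python
import math
from statistics import fmean


def project(pixel, theta):
    """Intensity of a 2-channel pixel projected onto direction (cos theta, sin theta)."""
    a1, a2 = math.cos(theta), math.sin(theta)
    return (a1 * pixel[0] + a2 * pixel[1]) ** 2


def solve3(m, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            f = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= f * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = aug[r][3] - sum(aug[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / aug[r][r]
    return x


def recombine(thetas, restored):
    """Least-squares estimate of (c11, c22, c12) from restored projections.

    Each restored value is modelled as a^T C a with a = (cos t, sin t),
    which is linear in the three unknown entries of the symmetric matrix C.
    """
    rows = [
        [math.cos(t) ** 2, math.sin(t) ** 2, 2 * math.sin(t) * math.cos(t)]
        for t in thetas
    ]
    # Normal equations: (A^T A) x = A^T r
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atr = [sum(r[i] * v for r, v in zip(rows, restored)) for i in range(3)]
    return solve3(ata, atr)


def muchapro_toy(pixels, thetas, restore=fmean):
    """Project every pixel along each direction, restore, then recombine."""
    restored = [restore([project(p, t) for p in pixels]) for t in thetas]
    return recombine(thetas, restored)


pixels = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
thetas = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
c11, c22, c12 = muchapro_toy(pixels, thetas)
```

With the averaging restorer, the recombined values coincide with the sample second moments of the pixels, which is a quick sanity check that the projections carry enough information to recover the multi-channel statistics. At least three distinct directions are needed in this two-channel real-valued setting.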

[CV-44] EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Link: https://arxiv.org/abs/2408.11518
Authors: Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu
Keywords: virtual digital humans, increasingly vivid, virtual digital, recent years, creation of increasingly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in time and space. We also adopt, for the first time to our knowledge, an effective self-growing training scheme that combines teacher-forcing and scheduled sampling in a 3D face animation task. Additionally, since EmoFace is an autoregressive model, there is no requirement that the first frame of the training data must be a silent frame, which greatly reduces the data limitations and contributes to solving the current dilemma of insufficient datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343\times 10^{-5}$ mm for LVE and $1.0196\times 10^{-5}$ mm for EVE), and publicly available dataset VOCASET ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.
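
As a rough illustration of how teacher forcing and scheduled sampling can be mixed in an autoregressive rollout, here is a generic sketch. It is not the paper's "self-growing" scheme, whose details the abstract does not give; `rollout`, `teacher_prob_schedule`, the linear decay, and the toy `model_step` are all hypothetical.

```python
import random


def rollout(model_step, targets, first_frame, teacher_prob, rng):
    """Autoregressive rollout that mixes ground truth and model outputs.

    At every step the model predicts the next frame from the previous input.
    The next input is the ground-truth frame with probability `teacher_prob`
    (teacher forcing) and the model's own prediction otherwise.
    """
    preds = []
    prev = first_frame
    for target in targets:
        pred = model_step(prev)
        preds.append(pred)
        prev = target if rng.random() < teacher_prob else pred
    return preds


def teacher_prob_schedule(epoch, total_epochs):
    """Assumed schedule: decay linearly from 1.0 (first epoch) to 0.0 (last epoch)."""
    return max(0.0, 1.0 - epoch / max(1, total_epochs - 1))


# Toy "model": the next frame is the previous frame plus one.
model_step = lambda x: x + 1
targets = [10, 20, 30]

fully_teacher_forced = rollout(model_step, targets, 0, 1.0, random.Random(0))
fully_free_running = rollout(model_step, targets, 0, 0.0, random.Random(0))
```

With `teacher_prob = 1.0` every step is conditioned on ground truth, and with `0.0` the model only ever sees its own outputs; values in between expose the model to its own errors gradually during training.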

[CV-45] MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Link: https://arxiv.org/abs/2408.11505
Authors: Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang
Keywords: Multiple instance learning, Prompt tuning, Few-shot Weakly Supervised, Weakly Supervised WSI, Multiple instance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 5 tables

Abstract:Multiple instance learning (MIL) has become a standard paradigm for weakly supervised classification of whole slide images (WSI). However, this paradigm relies on the use of a large number of labelled WSIs for training. The lack of training data and the presence of rare diseases present significant challenges for these methods. Prompt tuning combined with the pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI classification (FSWC) tasks. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM’s text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC tasks. Specifically, MSCPT employs the frozen large language model to generate pathological visual language prior knowledge at multi-scale, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSI, and finally, a non-parametric cross-guided instance aggregation module has been introduced to get the WSI-level features. Based on two VLMs, extensive experiments and visualizations on three datasets demonstrated the powerful performance of our MSCPT.

[CV-46] XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays

Link: https://arxiv.org/abs/2408.11493
Authors: Umaima Rahman, Abhishek Basu, Muhammad Uzair Khattak, Aniq Ur Rahman
Keywords: binary classifiers trained, cross-disease transferability, study explores, explores the concept, concept of cross-disease
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Machine Learning for Healthcare Conference MLHC 2024

Abstract:This study explores the concept of cross-disease transferability (XDT) in medical imaging, focusing on the potential of binary classifiers trained on one disease to perform zero-shot classification on another disease affecting the same organ. Utilizing chest X-rays (CXR) as the primary modality, we investigate whether a model trained on one pulmonary disease can make predictions about another novel pulmonary disease, a scenario with significant implications for medical settings with limited data on emerging diseases. The XDT framework leverages the embedding space of a vision encoder, which, through kernel transformation, aids in distinguishing between diseased and non-diseased classes in the latent space. This capability is especially beneficial in resource-limited environments or in regions with low prevalence of certain diseases, where conventional diagnostic practices may fail. However, the XDT framework is currently limited to binary classification, determining only the presence or absence of a disease rather than differentiating among multiple diseases. This limitation underscores the supplementary role of XDT to traditional diagnostic tests in clinical settings. Furthermore, results show that XDT-CXR as a framework is able to make better predictions compared to other zero-shot learning (ZSL) baselines.

[CV-47] E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment

Link: https://arxiv.org/abs/2408.11481
Authors: Shangkun Sun, Xiaoyu Liang, Songlin Fan, Wenxu Gao, Wei Gao
Keywords: experienced rapid development, recently experienced rapid, video editing, Text-driven video editing, video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-driven video editing has recently experienced rapid development. Despite this, evaluating edited videos remains a considerable challenge. Current metrics tend to fail to align with human perceptions, and effective quantitative metrics for video editing are still notably absent. To address this, we introduce E-Bench, a benchmark suite tailored to the assessment of text-driven video editing. This suite includes E-Bench DB, a video quality assessment (VQA) database for video editing. E-Bench DB encompasses a diverse set of source videos featuring various motions and subjects, along with multiple distinct editing prompts, editing results from 8 different models, and the corresponding Mean Opinion Scores (MOS) from 24 human annotators. Based on E-Bench DB, we further propose E-Bench QA, a quantitative human-aligned measurement for the text-driven video editing task. In addition to the aesthetic, distortion, and other visual quality indicators that traditional VQA methods emphasize, E-Bench QA focuses on the text-video alignment and the relevance modeling between source and edited videos. It proposes a new assessment network for video editing that attains superior performance in alignment with human preferences. To the best of our knowledge, E-Bench introduces the first quality assessment dataset for video editing and an effective subjective-aligned quantitative metric for this domain. All data and code will be publicly available at this https URL.
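
To make "alignment with human preferences" concrete, the sketch below computes a Mean Opinion Score per video and the Spearman rank correlation between an automatic metric and those scores. Using Spearman correlation is an assumption (the abstract does not name the agreement measure), and the ratings and metric values are invented toy numbers, not data from E-Bench DB.

```python
from statistics import fmean


def mean_opinion_scores(ratings):
    """ratings[v] holds the annotator scores for video v; MOS is their mean."""
    return [fmean(r) for r in ratings]


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: three edited videos, three annotators each (hypothetical values).
ratings = [[4, 5, 3], [2, 2, 3], [5, 5, 4]]
metric_scores = [0.7, 0.1, 0.9]

mos = mean_opinion_scores(ratings)
agreement = spearman(metric_scores, mos)
```

A metric that orders the videos exactly as the annotators do reaches a correlation of 1, and one that reverses the order reaches -1.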

[CV-48] LAKD-Activation Mapping Distillation Based on Local Learning

Link: https://arxiv.org/abs/2408.11478
Authors: Yaoze Zhang, Yuming Zhang, Yu Zhao, Yue Zhang, Feiyu Zhu
Keywords: fundamental vision models, Knowledge distillation, Attention Knowledge Distillation, knowledge distillation methods, Knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures

Abstract:Knowledge distillation is widely applied in various fundamental vision models to enhance the performance of compact models. Existing knowledge distillation methods focus on designing different distillation targets to acquire knowledge from teacher models. However, these methods often overlook the efficient utilization of distilled information, crudely coupling different types of information, making it difficult to explain how the knowledge from the teacher network aids the student network in learning. This paper proposes a novel knowledge distillation framework, Local Attention Knowledge Distillation (LAKD), which more efficiently utilizes the distilled information from teacher networks, achieving higher interpretability and competitive performance. The framework establishes an independent interactive training mechanism through a separation-decoupling mechanism and non-directional activation mapping. LAKD decouples the teacher’s features and facilitates progressive interaction training from simple to complex. Specifically, the student network is divided into local modules with independent gradients to decouple the knowledge transferred from the teacher. The non-directional activation mapping helps the student network integrate knowledge from different local modules by learning coarse-grained feature knowledge. We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and the results show that our LAKD method significantly outperforms existing methods, consistently achieving state-of-the-art performance across different datasets.

[CV-49] TrackGo: A Flexible and Efficient Method for Controllable Video Generation

Link: https://arxiv.org/abs/2408.11475
Authors: Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, Changhu Wang
Keywords: Recent years, diffusion-based controllable video, substantial progress, progress in diffusion-based, diffusion-based controllable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores. The project page of TrackGo can be found at: this https URL

[CV-50] MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation BMVC2024

Link: https://arxiv.org/abs/2408.11465
Authors: Kim Yu-Ji, Hyunwoo Ha, Kim Youwang, Jaeheung Surh, Hyowon Ha, Tae-Hyun Oh
Keywords: single view image, long-standing challenge, Reconstructing, single view, view image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at BMVC 2024. [Project page] this https URL

Abstract:Reconstructing 3D from a single view image is a long-standing challenge. One of the popular approaches to tackle this problem is learning-based methods, but dealing with the test cases unfamiliar with training data (Out-of-distribution; OoD) introduces an additional challenge. To adapt for unseen samples in test time, we propose MeTTA, a test-time adaptation (TTA) exploiting generative prior. We design joint optimization of 3D geometry, appearance, and pose to handle OoD cases with only a single view image. However, the alignment between the reference image and the 3D shape via the estimated viewpoint could be erroneous, which leads to ambiguity. To address this ambiguity, we carefully design learnable virtual cameras and their self-calibration. In our experiments, we demonstrate that MeTTA effectively deals with OoD scenarios at failure cases of existing learning-based 3D reconstruction models and enables obtaining a realistic appearance with physically based rendering (PBR) textures.

[CV-51] MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

Link: https://arxiv.org/abs/2408.11464
Authors: Yonglin Tian, Songlin Bai, Zhiyao Luo, Yutong Wang, Yisheng Lv, Fei-Yue Wang
Keywords: autonomous driving systems, shown great superiority, attracted intensive attention, Occupancy prediction, Mamba-based occupancy prediction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Occupancy prediction has attracted intensive attention and shown great superiority in the development of autonomous driving systems. The fine-grained environmental representation brought by occupancy prediction in terms of both geometry and semantic information has facilitated the general perception and safe planning under open scenarios. However, it also brings high computation costs and heavy parameters in existing works that utilize voxel-based 3D dense representation and Transformer-based quadratic attention. To address these challenges, in this paper, we propose a Mamba-based occupancy prediction method (MambaOcc) adopting BEV features to ease the burden of 3D scenario representation, and linear Mamba-style attention to achieve efficient long-range perception. Besides, to address the sensitivity of Mamba to sequence order, we propose a local adaptive reordering (LAR) mechanism with deformable convolution and design a hybrid BEV encoder composed of convolution layers and Mamba. Extensive experiments on the Occ3D-nuScenes dataset demonstrate that MambaOcc achieves state-of-the-art performance in terms of both accuracy and computational efficiency. For example, compared to FlashOcc, MambaOcc delivers superior results while reducing the number of parameters by 42% and computational costs by 39%. Code will be available at this https URL.

[CV-52] Low-Light Object Tracking: A Benchmark

Link: https://arxiv.org/abs/2408.11463
Authors: Pengzhi Zhong, Xiaoyu Guo, Defeng Huang, Xiaojun Peng, Yian Li, Qijun Zhao, Shuiwang Li
Keywords: large-scale training datasets, made significant progress, recent years, application of large-scale, large-scale training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, the field of visual tracking has made significant progress with the application of large-scale training datasets. These datasets have supported the development of sophisticated algorithms, enhancing the accuracy and stability of visual object tracking. However, most research has primarily focused on favorable illumination circumstances, neglecting the challenges of tracking in low-light environments. In low-light scenes, lighting may change dramatically, targets may lack distinct texture features, and in some scenarios, targets may not be directly observable. These factors can lead to a severe decline in tracking performance. To address this issue, we introduce LLOT, a benchmark specifically designed for Low-Light Object Tracking. LLOT comprises 269 challenging sequences with a total of over 132K frames, each carefully annotated with bounding boxes. This specially designed dataset aims to promote innovation and advancement in object tracking techniques for low-light conditions, addressing challenges not adequately covered by existing benchmarks. To assess the performance of existing methods on LLOT, we conducted extensive tests on 39 state-of-the-art tracking algorithms. The results highlight a considerable gap in low-light tracking performance. In response, we propose H-DCPT, a novel tracker that incorporates historical and darkness clue prompts to set a stronger baseline. H-DCPT outperformed all 39 evaluated methods in our experiments, demonstrating significant improvements. We hope that our benchmark and H-DCPT will stimulate the development of novel and accurate methods for tracking objects in low-light conditions. The LLOT and code are available at this https URL.

[CV-53] Lookism: The overlooked bias in computer vision ECCV2024

Link: https://arxiv.org/abs/2408.11448
Authors: Aditya Gulati, Bruno Lepri, Nuria Oliver
Keywords: socially relevant applications, computer vision, recent years, relevant applications, security screening
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Paper accepted at the ECCV 2024 workshop named “Fairness and ethics towards transparent AI: facing the chalLEnge through model Debiasing (FAILED)”, this https URL

Abstract:In recent years, there have been significant advancements in computer vision which have led to the widespread deployment of image recognition and generation systems in socially relevant applications, from hiring to security screening. However, the prevalence of biases within these systems has raised significant ethical and social concerns. The most extensively studied biases in this context are related to gender, race and age. Yet, other biases are equally pervasive and harmful, such as lookism, i.e., the preferential treatment of individuals based on their physical appearance. Lookism remains under-explored in computer vision but can have profound implications not only by perpetuating harmful societal stereotypes but also by undermining the fairness and inclusivity of AI technologies. Thus, this paper advocates for the systematic study of lookism as a critical bias in computer vision models. Through a comprehensive review of existing literature, we identify three areas of intersection between lookism and computer vision. We illustrate them by means of examples and a user study. We call for an interdisciplinary approach to address lookism, urging researchers, developers, and policymakers to prioritize the development of equitable computer vision systems that respect and reflect the diversity of human appearances.

[CV-54] GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Link: https://arxiv.org/abs/2408.11447
Authors: Wanshui Gan, Fang Liu, Hongbin Xu, Ningkai Mo, Naoto Yokoya
Keywords: Gaussian splatting, propose Gaussian Splatting, occupancy estimation, Gaussian, propose Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We introduce GaussianOcc, a systematic method that investigates the two usages of Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground truth 6D poses from sensors during training. To address this limitation, we propose a Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground truth pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering).

[CV-55] BAdd: Bias Mitigation through Bias Addition

Link: https://arxiv.org/abs/2408.11439
Authors: Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
Keywords: Computer vision, deep learning models, perpetuated by deep, Computer, learning models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computer vision (CV) datasets often exhibit biases that are perpetuated by deep learning models. While recent efforts aim to mitigate these biases and foster fair representations, they fail in complex real-world scenarios. In particular, existing methods excel in controlled experiments involving benchmarks with single-attribute injected biases, but struggle with multi-attribute biases being present in well-established CV datasets. Here, we introduce BAdd, a simple yet effective method that allows for learning fair representations invariant to the attributes introducing bias by incorporating features representing these attributes into the backbone. BAdd is evaluated on seven benchmarks and exhibits competitive performance, surpassing state-of-the-art methods on both single- and multi-attribute benchmarks. Notably, BAdd achieves +27.5% and +5.5% absolute accuracy improvements on the challenging multi-attribute benchmarks, FB-Biased-MNIST and CelebA, respectively.

[CV-56] DABench: A Benchmark Dataset for Data-Driven Weather Data Assimilation

Link: https://arxiv.org/abs/2408.11438
Authors: Wuxin Wang, Weicheng Ni, Tao Han, Lei Bai, Boheng Duan, Kaijun Ren
Keywords: Large Weather Models, data-driven weather prediction, numerical weather prediction, weather prediction, weather prediction systems
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 37 pages, 12 figures, 6 tables

Abstract:Recent advancements in deep learning (DL) have led to the development of several Large Weather Models (LWMs) that rival state-of-the-art (SOTA) numerical weather prediction (NWP) systems. Up to now, these models still rely on traditional NWP-generated analysis fields as input and are far from being an autonomous system. While researchers are exploring data-driven data assimilation (DA) models to generate accurate initial fields for LWMs, the lack of a standard benchmark impedes the fair evaluation among different data-driven DA algorithms. Here, we introduce DABench, a benchmark dataset utilizing ERA5 data as ground truth to guide the development of end-to-end data-driven weather prediction systems. DABench contributes four standard features: (1) sparse and noisy simulated observations under the guidance of the observing system simulation experiment method; (2) a skillful pre-trained weather prediction model to generate background fields while fairly evaluating the impact of assimilation outcomes on predictions; (3) standardized evaluation metrics for model comparison; (4) a strong baseline called the DA Transformer (DaT). DaT integrates the four-dimensional variational DA prior knowledge into the Transformer model and outperforms the SOTA in physical state reconstruction, named 4DVarNet. Furthermore, we exemplify the development of an end-to-end data-driven weather prediction system by integrating DaT with the prediction model. Researchers can leverage DABench to develop their models and compare performance against established baselines, which will benefit the future advancements of data-driven weather prediction systems. The code is available on this Github repository and the dataset is available at the Baidu Drive.

[CV-57] T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

Link: https://arxiv.org/abs/2408.11432
Authors: Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu
Keywords: obtain retrieval results, similarity scores, rely on cross-modal, calculate their similarity, sorted to obtain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost that increases notably with the number of candidates. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at this https URL.

[CV-58] EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Link: https://arxiv.org/abs/2408.11424
Authors: Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, Heikki Kälviäinen
Keywords: important research topic, emotional artificial intelligence, FER, artificial intelligence, important research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs’ capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at this https URL.

[CV-59] Pano2Room: Novel View Synthesis from a Single Indoor Panorama SIGGRAPH

Link: https://arxiv.org/abs/2408.11413
Authors: Guo Pu, Yiming Zhao, Zhouhui Lian
Keywords: made significant advancements, leveraging knowledge distilled, Recent single-view, object datasets, panoramic RGBD inpainter
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 3–6, 2024, Tokyo, Japan

Abstract:Recent single-view 3D generative methods have made significant advancements by leveraging knowledge distilled from extensive 3D object datasets. However, challenges persist in the synthesis of 3D scenes from a single view, primarily due to the complexity of real-world environments and the limited availability of high-quality prior resources. In this paper, we introduce a novel approach called Pano2Room, designed to automatically reconstruct high-quality 3D indoor scenes from a single panoramic image. These panoramic images can be easily generated using a panoramic RGBD inpainter from captures at a single location with any camera. The key idea is to initially construct a preliminary mesh from the input panorama, and iteratively refine this mesh using a panoramic RGBD inpainter while collecting photo-realistic 3D-consistent pseudo novel views. Finally, the refined mesh is converted into a 3D Gaussian Splatting field and trained with the collected pseudo novel views. This pipeline enables the reconstruction of real-world 3D scenes, even in the presence of large occlusions, and facilitates the synthesis of photo-realistic novel views with detailed geometry. Extensive qualitative and quantitative experiments have been conducted to validate the superiority of our method in single-panorama indoor novel synthesis compared to the state-of-the-art. Our code and data are available at this https URL.

[CV-60] SelfDRSC: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

Link: https://arxiv.org/abs/2408.11411
Authors: Wei Shang, Dongwei Ren, Wanying Zhang, Qilong Wang, Pengfei Zhu, Wangmeng Zuo
Keywords: Modern consumer cameras, consumer cameras commonly, cameras commonly employ, Modern consumer, dynamic scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures, and the code is available at this https URL

Abstract:Modern consumer cameras commonly employ the rolling shutter (RS) imaging mechanism, via which images are captured by scanning scenes row-by-row, resulting in RS distortion for dynamic scenes. To correct RS distortion, existing methods adopt a fully supervised learning manner that requires high framerate global shutter (GS) images as ground-truth for supervision. In this paper, we propose an enhanced Self-supervised learning framework for Dual reversed RS distortion Correction (SelfDRSC++). Firstly, we introduce a lightweight DRSC network that incorporates a bidirectional correlation matching block to refine the joint optimization of optical flows and corrected RS features, thereby improving correction performance while reducing network parameters. Subsequently, to effectively train the DRSC network, we propose a self-supervised learning strategy that ensures cycle consistency between input and reconstructed dual reversed RS images. The RS reconstruction in SelfDRSC++ can be interestingly formulated as a specialized instance of video frame interpolation, where each row in reconstructed RS images is interpolated from predicted GS images by utilizing RS distortion time maps. By achieving superior performance while simplifying the training process, SelfDRSC++ enables feasible one-stage self-supervised training. Additionally, besides start and end RS scanning time, SelfDRSC++ allows supervision of GS images at arbitrary intermediate scanning times, thus enabling the learned DRSC network to generate high framerate GS videos. The code and trained models are available at this https URL.
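
The row-wise reconstruction described above can be sketched with a toy example. This is a deliberately simplified stand-in, not the paper's method: it blends two global-shutter frames linearly per row, whereas the abstract describes a learned, flow-based video frame interpolation. The names `row_times` and `synthesize_rs` and the normalized time range [0, 1] are assumptions.

```python
def row_times(height, reverse=False):
    """Normalized scan time in [0, 1] for each row of a rolling-shutter frame.

    With reverse=False the top row is read first; with reverse=True the
    bottom row is read first (the "dual reversed" scanning direction).
    """
    if height == 1:
        return [0.0]
    times = [r / (height - 1) for r in range(height)]
    return times[::-1] if reverse else times


def synthesize_rs(gs_start, gs_end, reverse=False):
    """Build a rolling-shutter frame by blending two global-shutter frames.

    Row r is a linear blend of the frames at scan start and scan end,
    weighted by the time at which that row is read out.
    """
    times = row_times(len(gs_start), reverse)
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for t, row_a, row_b in zip(times, gs_start, gs_end)
    ]


# Toy 3x2 frames: the scene brightens from 0 to 10 during the scan.
gs_start = [[0, 0], [0, 0], [0, 0]]
gs_end = [[10, 10], [10, 10], [10, 10]]

rs_top_to_bottom = synthesize_rs(gs_start, gs_end)
rs_bottom_to_top = synthesize_rs(gs_start, gs_end, reverse=True)
```

In a cycle-consistency setup, frames synthesized this way from the predicted GS images would be compared against the two captured RS inputs, so no ground-truth GS frames are needed for the loss.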

[CV-61] Latent Feature and Attention Dual Erasure Attack against Multi-View Diffusion Models for 3D Assets Protection

Link: https://arxiv.org/abs/2408.11408
Authors: Jingwei Sun,Xuchong Zhang,Changfeng Sun,Qicheng Bai,Hongbin Sun
Keywords: Multi-View Diffusion Models, Diffusion Models, enable remarkable improvements, increasing attention due, received increasing attention
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Multi-View Diffusion Models (MVDMs) enable remarkable improvements in the field of 3D geometric reconstruction, but the issue regarding intellectual property has received increasing attention due to unauthorized imitation. Recently, some works have utilized adversarial attacks to protect copyright. However, all these works focus on single-image generation tasks which only need to consider the inner feature of images. Previous methods are inefficient in attacking MVDMs because they lack the consideration of disrupting the geometric and visual consistency among the generated multi-view images. This paper is the first to address the intellectual property infringement issue arising from MVDMs. Accordingly, we propose a novel latent feature and attention dual erasure attack to disrupt the distribution of latent feature and the consistency across the generated images from multi-view and multi-domain simultaneously. The experiments conducted on SOTA MVDMs indicate that our approach achieves superior performances in terms of attack effectiveness, transferability, and robustness against defense methods. Therefore, this paper provides an efficient solution to protect 3D assets from MVDMs-based 3D geometry reconstruction.

[CV-62] Domain-invariant Progressive Knowledge Distillation for UAV-based Object Detection

Link: https://arxiv.org/abs/2408.11407
Authors: Liang Yao,Fan Liu,Chuanyi Zhang,Zhiquan Ou,Ting Wu
Keywords: object detection tasks, object detection, UAV-based object detection, detection tasks, compressing models
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Knowledge distillation (KD) is an effective method for compressing models in object detection tasks. Due to limited computational capability, UAV-based object detection (UAV-OD) widely adopts the KD technique to obtain lightweight detectors. Existing methods often overlook the significant differences in feature space caused by the large gap in scale between the teacher and student models. This limitation hampers the efficiency of knowledge transfer during the distillation process. Furthermore, the complex backgrounds in UAV images make it challenging for the student model to efficiently learn the object features. In this paper, we propose a novel knowledge distillation framework for UAV-OD. Specifically, a progressive distillation approach is designed to alleviate the feature gap between teacher and student models. Then a new feature alignment method is provided to extract object-related features for enhancing student model’s knowledge reception efficiency. Finally, extensive experiments are conducted to validate the effectiveness of our proposed approach. The results demonstrate that our proposed method achieves state-of-the-art (SoTA) performance on two UAV-OD datasets.

[CV-63] Video Diffusion Models are Strong Video Inpainter

Link: https://arxiv.org/abs/2408.11402
Authors: Minhyeok Lee,Suhwan Cho,Chajin Shin,Jungho Lee,Sunghun Yang,Sangyoun Lee
Keywords: Propagation-based video inpainting, garnered significant attention, recently garnered significant, Propagation-based video, optical flow
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame’s noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.

[CV-64] Revisiting FunnyBirds evaluation framework for prototypical parts networks

Link: https://arxiv.org/abs/2408.11401
Authors: Szymon Opłatek,Dawid Rymarczyk,Bartosz Zieliński
Keywords: post-hoc methods, popular due, produce more genuine, Prototypical parts networks, metric scores
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Published at 2nd XAI World Conference

Abstract:Prototypical parts networks, such as ProtoPNet, became popular due to their potential to produce more genuine explanations than post-hoc methods. However, for a long time, this potential has been strictly theoretical, and no systematic studies have existed to support it. That changed recently with the introduction of the FunnyBirds benchmark, which includes metrics for evaluating different aspects of explanations. However, this benchmark employs attribution maps visualization for all explanation techniques except for the ProtoPNet, for which the bounding boxes are used. This choice significantly influences the metric scores and questions the conclusions stated in FunnyBirds publication. In this study, we comprehensively compare metric scores obtained for two types of ProtoPNet visualizations: bounding boxes and similarity maps. Our analysis indicates that employing similarity maps aligns better with the essence of ProtoPNet, as evidenced by different metric scores obtained from FunnyBirds. Therefore, we advocate using similarity maps as a visualization technique for prototypical parts networks in explainability evaluation benchmarks.

[CV-65] EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning

Link: https://arxiv.org/abs/2408.11397
Authors: Zhihao Li,Yao Du,Yang Liu,Yan Zhang,Yufang Liu,Mengdi Zhang,Xunliang Cai
Keywords: Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large, recently experienced rapid
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Multi-modal Large Language Models have recently experienced rapid developments and excel in various multi-modal tasks. However, they still struggle with mathematical geometric problem solving, which requires exceptional visual perception proficiency. Existing MLLMs mostly optimize the LLM backbone to acquire geometric reasoning capabilities, while rarely emphasizing improvements in visual comprehension. In this paper, we first investigate the visual perception performance of MLLMs when facing geometric diagrams. Our findings reveal that current MLLMs severely suffer from inaccurate geometric perception and hallucinations. To address these limitations, we propose EAGLE, a novel two-stage end-to-end visual enhancement MLLM framework designed to ElevAte Geometric reasoning through LLM-Empowered visual instruction tuning. Specifically, in the preliminary stage, we feed geometric image-caption pairs into our MLLM that contains a fully fine-tuned CLIP ViT and a frozen LLM, aiming to endow our model with basic geometric knowledge. In the subsequent advanced stage, we incorporate LoRA modules into the vision encoder and unfreeze the LLM backbone. This enables the model to leverage the inherent CoT rationales within question-answer pairs, guiding the MLLM to focus on nuanced visual cues and enhancing its overall perceptual capacity. Moreover, we optimize the cross-modal projector in both stages to foster adaptive visual-linguistic alignments. After the two-stage visual enhancement, we develop the geometry expert model EAGLE-7B. Extensive experiments on popular benchmarks demonstrate the effectiveness of our model. For example, on the GeoQA benchmark, EAGLE-7B not only surpasses the exemplary G-LLaVA 7B model by 2.9%, but also marginally outperforms the larger G-LLaVA 13B model. On the MathVista benchmark, EAGLE-7B achieves a remarkable 3.8% improvement compared with the proprietary model GPT-4V.

[CV-66] Fairness measures for biometric quality assessment

Link: https://arxiv.org/abs/2408.11392
Authors: André Dörsch,Torsten Schlett,Peter Munch,Christian Rathgeb,Christoph Busch
Keywords: Quality assessment algorithms, captured biometric sample, Quality, Quality assessment, assessment algorithms
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Quality assessment algorithms measure the quality of a captured biometric sample. Since the sample quality strongly affects the recognition performance of a biometric system, it is essential to only process samples of sufficient quality and discard samples of low quality. Even though quality assessment algorithms are not intended to yield very different quality scores across demographic groups, quality score discrepancies are possible, resulting in different discard ratios. To ensure that quality assessment algorithms do not take demographic characteristics into account when assessing sample quality and consequently to ensure that the quality algorithms perform equally for all individuals, it is crucial to develop a fairness measure. In this work we propose and compare multiple fairness measures for evaluating quality components across demographic groups. The proposed measures could be used as potential candidates for an upcoming standard in this important field.

[CV-67] Current Status and Trends in Image Anti-Forensics Research: A Bibliometric Analysis

Link: https://arxiv.org/abs/2408.11365
Authors: Yihong Lu,Jianyi Liu,Ru Zhang
Keywords: Science Core Collection, Image anti-forensics, critical topic, privacy and security, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Image anti-forensics is a critical topic in the field of image privacy and security research. With the increasing ease of manipulating or generating human faces in images, the potential misuse of such forged images is a growing concern. This study aims to comprehensively review the knowledge structure and research hotspots related to image anti-forensics by analyzing publications in the Web of Science Core Collection (WoSCC) database. The bibliometric analysis conducted using VOSViewer software has revealed the research trends, major research institutions, most influential publications, top publishing venues, and most active contributors in this field. This is the first comprehensive bibliometric study summarizing research trends and developments in image anti-forensics. The information highlights recent and primary research directions, serving as a reference for future research in image anti-forensics.

[CV-68] HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

Link: https://arxiv.org/abs/2408.11357
Authors: Yi Wang,Jian Ma,Ruizhi Shao,Qiao Feng,Yu-kun Lai,Kun Li
Keywords: text prompts, paper aims, clothing, generation, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:This paper aims to generate physically-layered 3D humans from text prompts. Existing methods either generate 3D clothed humans as a whole or support only tight and simple clothing generation, which limits their applications to virtual try-on and part-level editing. To achieve physically-layered 3D human generation with reusable and complex clothing, we propose a novel layer-wise dressed human representation based on a physically-decoupled diffusion model. Specifically, to achieve layer-wise clothing generation, we propose a dual-representation decoupling framework for generating clothing decoupled from the human body, in conjunction with an innovative multi-layer fusion volume rendering method. To match the clothing with different body shapes, we propose an SMPL-driven implicit field deformation network that enables the free transfer and reuse of clothing. Extensive experiments demonstrate that our approach not only achieves state-of-the-art layered 3D human generation with complex clothing but also supports virtual try-on and layered human animation.

[CV-69] Image Score: Learning and Evaluating Human Preferences for Mercari Search

Link: https://arxiv.org/abs/2408.11349
Authors: Chingis Oinar,Miao Cao,Shanshan Fu
Keywords: million active monthly, marketplace in Japan, active monthly users, million active, active monthly
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Mercari is the largest C2C e-commerce marketplace in Japan, having more than 20 million active monthly users. Search being the fundamental way to discover desired items, we have always had a substantial amount of data with implicit feedback. Although we actively take advantage of that to provide the best service for our users, the correlation of implicit feedback for such tasks as image quality assessment is not trivial. Many traditional lines of research in Machine Learning (ML) are similarly motivated by the insatiable appetite of Deep Learning (DL) models for well-labelled training data. Weak supervision is about leveraging higher-level and/or noisier supervision over unlabeled data. Large Language Models (LLMs) are being actively studied and used for data labelling tasks. We present how we leverage a Chain-of-Thought (CoT) to enable an LLM to produce image aesthetics labels that correlate well with human behavior in e-commerce settings. Leveraging LLMs is more cost-effective compared to explicit human judgment, while significantly improving the explainability of deep image quality evaluation which is highly important for customer journey optimization at Mercari. We propose a cost-efficient LLM-driven approach for assessing and predicting image quality in e-commerce settings, which is very convenient for proof-of-concept testing. We show that our LLM-produced labels correlate with user behavior on Mercari. Finally, we show our results from an online experiment, where we achieved a significant growth in sales on the web platform.

[CV-70] FATE: Focal-modulated Attention Encoder for Temperature Prediction

Link: https://arxiv.org/abs/2408.11336
Authors: Tajamul Ashraf,Janibul Bashir
Keywords: rising sea levels, increased storm frequency, melting glaciers, evidenced by rising, sea levels
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:One of the major challenges of the twenty-first century is climate change, evidenced by rising sea levels, melting glaciers, and increased storm frequency. Accurate temperature forecasting is vital for understanding and mitigating these impacts. Traditional data-driven models often use recurrent neural networks (RNNs) but face limitations in parallelization, especially with longer sequences. To address this, we introduce a novel approach based on the FocalNet Transformer architecture. Our Focal modulation Attention Encoder (FATE) framework operates in a multi-tensor format, utilizing tensorized modulation to capture spatial and temporal nuances in meteorological data. Comparative evaluations against existing transformer encoders, 3D CNNs, LSTM, and ConvLSTM models show that FATE excels at identifying complex patterns in temperature data. Additionally, we present a new labeled dataset, the Climate Change Parameter dataset (CCPD), containing 40 years of data from Jammu and Kashmir on seven climate-related parameters. Experiments with real-world temperature datasets from the USA, Canada, and Europe show accuracy improvements of 12%, 23%, and 28%, respectively, over current state-of-the-art models. Our CCPD dataset also achieved a 24% improvement in accuracy. To support reproducible research, we have released the source code and pre-trained FATE model at this https URL.

[CV-71] Optimizing Transmit Field Inhomogeneity of Parallel RF Transmit Design in 7T MRI using Deep Learning

Link: https://arxiv.org/abs/2408.11323
Authors: Zhengyi Lu,Hao Liang,Xiao Wang,Xinqiang Yan,Yuankai Huo
Keywords: Magnetic Resonance Imaging, Magnetic Resonance, higher spatial resolution, UHF MRI introduces, UHF MRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) provides a higher signal-to-noise ratio and, thereby, higher spatial resolution. However, UHF MRI introduces challenges such as transmit radiofrequency (RF) field (B1+) inhomogeneities, leading to uneven flip angles and image intensity anomalies. These issues can significantly degrade imaging quality and its medical applications. This study addresses B1+ field homogeneity through a novel deep learning-based strategy. Traditional methods like Magnitude Least Squares (MLS) optimization have been effective but are time-consuming and dependent on the patient’s presence. Recent machine learning approaches, such as RF Shim Prediction by Iteratively Projected Ridge Regression and deep learning frameworks, have shown promise but face limitations like extensive training times and oversimplified architectures. We propose a two-step deep learning strategy. First, we obtain the desired reference RF shimming weights from multi-channel B1+ fields using random-initialized Adaptive Moment Estimation. Then, we employ Residual Networks (ResNets) to train a model that maps B1+ fields to target RF shimming outputs. Our approach does not rely on pre-calculated reference optimizations for the testing process and efficiently learns residual functions. Comparative studies with traditional MLS optimization demonstrate our method’s advantages in terms of speed and accuracy. The proposed strategy achieves a faster and more efficient RF shimming design, significantly improving imaging quality at UHF. This advancement holds potential for broader applications in medical imaging and diagnostics.
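
For readers unfamiliar with the MLS baseline named in this abstract, the problem is commonly written as below. The notation is an assumption made here for illustration and is not taken from the paper: A is the complex matrix of per-channel B1+ maps (voxels by transmit channels), w is the vector of complex RF shimming weights, and m is the desired magnitude profile.

```latex
\hat{\mathbf{w}}
  = \arg\min_{\mathbf{w} \in \mathbb{C}^{N_c}}
    \left\| \, \lvert \mathbf{A}\mathbf{w} \rvert - \mathbf{m} \, \right\|_2^2 ,
\qquad
\mathbf{A} \in \mathbb{C}^{N_v \times N_c}, \quad
\mathbf{m} \in \mathbb{R}_{\ge 0}^{N_v}
```

Because only the magnitude of the combined field is constrained, the objective is non-convex, which is one reason iterative solvers for it can be slow.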

[CV-72] TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Link: https://arxiv.org/abs/2408.11318
Authors: Hyeongmin Lee,Jin-Young Kim,Kyungjune Baek,Jihwan Kim,Hyojun Go,Seongsu Ha,Seokjin Han,Jiho Jang,Raehyuk Jung,Daewoo Kim,GeunOh Kim,JongMok Kim,Jongseok Kim,Junwan Kim,Soonwoo Kwon,Jangwon Lee,Seungjoon Park,Minjoon Seo,Jay Suh,Jaehyuk Yi,Aiden Lee
Keywords: improvement compared, discuss evaluating video, video foundation models, video foundation, foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 17 pages; Twelve Labs Technical Report

Abstract:In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, pretrained only on publicly accessible datasets, our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement compared to DFN (ViT-H), a 2.7%p improvement compared to V-JEPA (ViT-H) and a 2.8%p improvement compared to InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at this https URL.

[CV-73] Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework

Link: https://arxiv.org/abs/2408.11312
Authors: Xiao Han,Chen Zhu,Xiangyu Zhao,Hengshu Zhu
Keywords: geographic locations precisely, real-world geographic locations, geo-localization demands in-depth, advanced reasoning skills, demands in-depth knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Visual geo-localization demands in-depth knowledge and advanced reasoning skills to associate images with real-world geographic locations precisely. In general, traditional methods based on data-matching are hindered by the impracticality of storing adequate visual records of global landmarks. Recently, Large Vision-Language Models (LVLMs) have demonstrated the capability of geo-localization through Visual Question Answering (VQA), enabling a solution that does not require external geo-tagged image records. However, the performance of a single LVLM is still limited by its intrinsic knowledge and reasoning capabilities. Along this line, in this paper, we introduce a novel visual geo-localization framework that integrates the inherent knowledge of multiple LVLM agents via inter-agent communication to achieve effective geo-localization of images. Furthermore, our framework employs a dynamic learning strategy to optimize the communication patterns among agents, reducing unnecessary discussions among agents and improving the efficiency of the framework. To validate the effectiveness of the proposed framework, we construct GeoGlobe, a novel dataset for visual geo-localization tasks. Extensive testing on the dataset demonstrates that our approach significantly outperforms state-of-the-art methods.

[CV-74] Improving Out-of-Distribution Data Handling and Corruption Resistance via Modern Hopfield Networks

Link: https://arxiv.org/abs/2408.11309
Authors: Saleh Sargolzaei,Luis Rueda
Keywords: Modern Hopfield Networks, Hopfield Networks, Modern Hopfield, computer vision models, potential of Modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:This study explores the potential of Modern Hopfield Networks (MHN) in improving the ability of computer vision models to handle out-of-distribution data. While current computer vision models can generalize to unseen samples from the same distribution, they are susceptible to minor perturbations such as blurring, which limits their effectiveness in real-world applications. We suggest integrating MHN into the baseline models to enhance their robustness. This integration can be implemented during the test time for any model and combined with any adversarial defense method. Our research shows that the proposed integration consistently improves model performance on the MNIST-C dataset, achieving a state-of-the-art increase of 13.84% in average corruption accuracy, a 57.49% decrease in mean Corruption Error (mCE), and a 60.61% decrease in relative mCE compared to the baseline model. Additionally, we investigate the capability of MHN to converge to the original non-corrupted data. Notably, our method does not require test-time adaptation or augmentation with corruptions, underscoring its practical viability for real-world deployment. (Source code publicly available at: this https URL)
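
The abstract does not say how the MHN is wired into the baseline models, so the following is only a minimal sketch of the standard modern Hopfield retrieval step, in which a query is replaced by a softmax-weighted average of stored patterns. The function names, the toy patterns, and the inverse temperature `beta` are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(patterns, query, beta=2.0, steps=1):
    """Apply the update state <- sum_i softmax(beta * <p_i, state>)_i * p_i.

    `patterns` is a list of stored vectors; `query` is a possibly corrupted
    vector of the same length. Both are hypothetical toy inputs here.
    """
    state = list(query)
    for _ in range(steps):
        scores = [beta * sum(a * b for a, b in zip(p, state)) for p in patterns]
        weights = softmax(scores)
        state = [
            sum(w * p[j] for w, p in zip(weights, patterns))
            for j in range(len(state))
        ]
    return state


patterns = [
    [1.0, 1.0, 1.0, -1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0],
    [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
]
# The first stored pattern with its last entry flipped (a toy "corruption").
corrupted = [1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
restored = hopfield_retrieve(patterns, corrupted)
print([round(v, 3) for v in restored])
```

With a sufficiently large `beta`, a single update pulls the corrupted query close to the nearest stored pattern, which is the behaviour the abstract refers to when it discusses convergence to the original non-corrupted data.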

[CV-75] UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

Link: https://arxiv.org/abs/2408.11305
Authors: Xiangyu Zhao,Yuehan Zhang,Wenlong Zhang,Xiao-Ming Wu
Keywords: fashion domain, fashion domain encompasses, generation, fashion, encompasses a variety
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. The rapid advancements in artificial intelligence generated content, particularly in technologies like large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models in the fashion domain. However, tasks involving embeddings, such as image-to-text or text-to-image retrieval, have been largely overlooked from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain. The source code is available at this https URL.

[CV-76] Making Large Vision Language Models to be Good Few-shot Learners

Link: https://arxiv.org/abs/2408.11297
Authors: Fan Liu,Wenwen Cai,Jian Huo,Chuanyi Zhang,Delong Chen,Jun Zhou
Keywords: Large Vision Language, Vision Language Models, Large Vision, computer vision, fundamental yet challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs’ performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases. To tackle the above challenges, we adopt the meta-learning strategy to teach models to “learn to learn”. By constructing a rich set of meta-tasks for instruction fine-tuning, LVLMs enhance the ability to extract information from few-shot support data for classification. Additionally, we further boost LVLM’s few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stages, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs.

[CV-77] Taming Generative Diffusion for Universal Blind Image Restoration

Link: https://arxiv.org/abs/2408.11287
Authors: Siwei Tu,Weidong Yang,Ben Fei
Keywords: blind image restoration, image restoration, blind image, widely utilized, image restoration methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 14 pages, 9 figures, 8 tables

Abstract:Diffusion models have been widely utilized for image restoration. However, previous blind image restoration methods still need to assume the type of degradation model while leaving the parameters to be optimized, limiting their real-world applications. Therefore, we aim to tame generative diffusion prior for universal blind image restoration dubbed BIR-D, which utilizes an optimizable convolutional kernel to simulate the degradation model and dynamically update the parameters of the kernel in the diffusion steps, enabling it to achieve blind image restoration results even in various complex situations. Besides, based on mathematical reasoning, we have provided an empirical formula for choosing the adaptive guidance scale, eliminating the need for a grid search for the optimal parameter. Experimentally, our BIR-D has demonstrated superior practicality and versatility over off-the-shelf unsupervised methods across various tasks both on real-world and synthetic datasets, qualitatively and quantitatively. BIR-D is able to fulfill multi-guidance blind image restoration. Moreover, BIR-D can also restore images that undergo multiple and complicated degradations, demonstrating the practical applications.

[CV-78] Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Link: https://arxiv.org/abs/2408.11286
Authors: Mengying Ge,Dongkai Tang,Mingyang Li
Keywords: Multimodal emotion recognition, Multimodal emotion, great concern, task of great, Multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Multimodal emotion recognition is a task of great concern. However, traditional data sets are based on fixed labels, resulting in models that often focus on main emotions and ignore detailed emotional changes in complex scenes. This report introduces the solution of using MLLM technology to generate open-vocabulary emotion labels from a video. The solution covers the framework, data generation and processing, training methods, result generation, and multi-model co-judgment. In the MER-OV (Open-Word Emotion Recognition) of the MER2024 challenge, our method achieved significant advantages, demonstrating its superior capabilities in complex emotion computation.

[CV-79] Exploring Scene Coherence for Semi-Supervised 3D Semantic Segmentation

Link: https://arxiv.org/abs/2408.11280
Authors: Chuandong Liu,Shuguo Jiang,Xingxing Weng,Lei Yu,Pengcheng Li,Gui-Song Xia
Keywords: acquiring dense annotations, efficiently addresses, addresses the limitation, limitation of acquiring, acquiring dense
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Semi-supervised semantic segmentation, which efficiently addresses the limitation of acquiring dense annotations, is essential for 3D scene understanding. Most methods leverage the teacher model to generate pseudo labels, and then guide the learning of the student model on unlabeled scenes. However, they focus only on points with pseudo labels while directly overlooking points without pseudo labels, namely intra-scene inconsistency, leading to semantic ambiguity. Moreover, inter-scene correlation between labeled and unlabeled scenes contributes to transferring rich annotation information, yet this has not been explored for the semi-supervised tasks. To address these two problems, we propose to explore scene coherence for semi-supervised 3D semantic segmentation, dubbed CoScene. Inspired by the unstructured and unordered nature of the point clouds, our CoScene adopts the straightforward point erasure strategy to ensure the intra-scene consistency. Moreover, patch-based data augmentation is proposed to enhance the inter-scene information transfer between labeled and unlabeled scenes at both scene and instance levels. Extensive experimental results on SemanticKITTI and nuScenes show that our approach outperforms existing methods.

[CV-80] The Key of Parameter Skew in Federated Learning

Link: https://arxiv.org/abs/2408.11278
Authors: Sifan Wang, Junfeng Liao, Ye Yuan, Riquan Zhang
Keywords: performing deep learning, exchanging raw data, Federated Learning, deep learning, data owners
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Federated Learning (FL) has emerged as an excellent solution for performing deep learning on different data owners without exchanging raw data. However, statistical heterogeneity in FL presents a key challenge, leading to a phenomenon of skewness in local model parameter distributions that researchers have largely overlooked. In this work, we propose the concept of parameter skew to describe the phenomenon that can substantially affect the accuracy of global model parameter estimation. Additionally, we introduce FedSA, an aggregation strategy to obtain a high-quality global model, to address the implication from parameter skew. Specifically, we categorize parameters into high-dispersion and low-dispersion groups based on the coefficient of variation. For high-dispersion parameters, Micro-Classes (MIC) and Macro-Classes (MAC) represent the dispersion at the micro and macro levels, respectively, forming the foundation of FedSA. To evaluate the effectiveness of FedSA, we conduct extensive experiments with different FL algorithms on three computer vision datasets. FedSA outperforms eight state-of-the-art baselines by about 4.7% in test accuracy.
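
The grouping step described above, splitting parameters by coefficient of variation, can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold value, the function name `split_by_dispersion`, and the toy client vectors are hypothetical, and the MIC/MAC aggregation that FedSA builds on top is not shown because the abstract gives no details.

```python
import statistics

def split_by_dispersion(client_params, threshold=0.5):
    """Group parameter indices by coefficient of variation across clients.

    client_params: list of per-client parameter vectors of equal length.
    Returns (high_dispersion_indices, low_dispersion_indices).
    """
    high, low = [], []
    for j in range(len(client_params[0])):
        values = [params[j] for params in client_params]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if mean == 0:
            # CV is undefined at zero mean; treat any spread as high dispersion.
            cv = float("inf") if std > 0 else 0.0
        else:
            cv = std / abs(mean)
        (high if cv > threshold else low).append(j)
    return high, low

# Three hypothetical clients, three parameters each.
clients = [
    [1.00, 0.1, 2.0],
    [1.02, 0.9, 2.1],
    [0.98, -0.4, 1.9],
]
high, low = split_by_dispersion(clients, threshold=0.5)
print(high, low)
```

In this toy example only the second parameter varies widely relative to its mean, so it alone lands in the high-dispersion group.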

[CV-81] On Missing Scores in Evolving Multibiometric Systems ICPR

Link: https://arxiv.org/abs/2408.11271
Authors: Melissa R Dale, Anil Jain, Arun Ross
Keywords: operational biometric system, biometric system, multiple algorithms, face comparators, multiple modalities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2022 26th International Conference on Pattern Recognition (ICPR)

Abstract: The use of multiple modalities (e.g., face and fingerprint) or multiple algorithms (e.g., three face comparators) has been shown to improve the recognition accuracy of an operational biometric system. Over time a biometric system may evolve to add new modalities, retire old modalities, or be merged with other biometric systems. This can lead to scenarios where there are missing scores corresponding to the input probe set. Previous work on this topic has focused on either the verification or identification tasks, but not both. Further, the proportion of missing data considered has been less than 50%. In this work, we study the impact of missing score data for both the verification and identification tasks. We show that the application of various score imputation methods along with simple sum fusion can improve recognition accuracy, even when the proportion of missing scores increases to 90%. Experiments show that fusion after score imputation outperforms fusion with no imputation. Specifically, iterative imputation with K nearest neighbors consistently surpasses other imputation methods in both the verification and identification tasks, regardless of the number of missing scores, and provides imputed values that are consistent with the ground truth complete dataset.
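
To make the imputation-then-fusion idea concrete, here is a minimal sketch assuming a score matrix whose rows are comparisons and whose columns are comparators, with `None` marking a missing score. It uses a single-pass nearest-neighbor fill, which is a simplification of the iterative K-nearest-neighbor imputation named in the abstract; `knn_impute`, `sum_fusion`, and the score values are illustrative assumptions.

```python
def knn_impute(scores, k=2):
    """Fill None entries with the mean of that column over the k most similar rows.

    Similarity is the mean squared difference over columns both rows observe.
    Single pass only; entries with no usable neighbor stay None.
    """
    filled = [row[:] for row in scores]
    for i, row in enumerate(scores):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(scores):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def sum_fusion(scores):
    """Simple sum fusion; expects a fully imputed matrix (no None left)."""
    return [sum(row) for row in scores]

# Made-up match scores from three comparators; two entries are missing.
scores = [
    [0.9, 0.8, 0.7],
    [0.8, None, 0.6],
    [0.2, 0.1, 0.3],
    [0.1, 0.2, None],
]
filled = knn_impute(scores, k=1)
fused = sum_fusion(filled)
print(filled[1][1], filled[3][2])
```

With `k=1`, each missing score is copied from the single most similar row, so the high-scoring row borrows from its high-scoring neighbor and the low-scoring row from its low-scoring one.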

[CV-82] Automatic Image Annotation (AIA) of AlmondNet-20 Method for Almond Detection by Improved CNN-based Model

Link: https://arxiv.org/abs/2408.11253
Authors: Mohsen Asghari Ilani, Saba Moftakhar Tehran, Ashkan Kavei, Arian Radmehr
Keywords: competitive nut market, innovative methodology aimed, Convolutional Neural Networks, burgeoning global demand, Deep Convolutional Neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In response to the burgeoning global demand for premium agricultural products, particularly within the competitive nut market, this paper introduces an innovative methodology aimed at enhancing the grading process for almonds and their shells. Leveraging state-of-the-art Deep Convolutional Neural Networks (CNNs), specifically the AlmondNet-20 architecture, our study achieves exceptional accuracy exceeding 99%, facilitated by the utilization of a 20-layer CNN model. To bolster robustness in differentiating between almonds and shells, data augmentation techniques are employed, ensuring the reliability and accuracy of our classification system. Our model, meticulously trained over 1000 epochs, demonstrates remarkable performance, boasting an accuracy rate of 99% alongside a minimal loss function of 0.0567. Rigorous evaluation through test datasets further validates the efficacy of our approach, revealing impeccable precision, recall, and F1-score metrics for almond detection. Beyond its technical prowess, this advanced classification system offers tangible benefits to both industry experts and non-specialists alike, ensuring globally reliable almond classification. The application of deep learning algorithms, as showcased in our study, not only enhances grading accuracy but also presents opportunities for product patents, thereby contributing to the economic value of our nation. Through the adoption of cutting-edge technologies such as the AlmondNet-20 model, we pave the way for future advancements in agricultural product classification, ultimately enriching global trade and economic prosperity.

[CV-83] Irregularity Inspection using Neural Radiance Field

Link: https://arxiv.org/abs/2408.11251
Authors: Tianqi Ding, Dawei Xiang
Keywords: growth of industrialization, increasing growth, industries are relying, large-scale production machinery, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the increasing growth of industrialization, more and more industries are relying on machine automation for production. However, defect detection in large-scale production machinery is becoming increasingly important. Due to their large size and height, it is often challenging for professionals to conduct defect inspections on such large machinery. For example, the inspection of aging and misalignment of components on tall machinery like towers requires companies to assign dedicated personnel. Employees need to climb the towers and either visually inspect or take photos to detect safety hazards in these large machines. Direct visual inspection is limited by its low level of automation, lack of precision, and safety concerns associated with personnel climbing the towers. Therefore, in this paper, we propose a system based on neural network modeling (NeRF) of 3D twin models. By comparing two digital models, this system enables defect detection at the 3D interface of an object.

[CV-84] CNN-based Labelled Crack Detection for Image Annotation

Link: https://arxiv.org/abs/2408.11250
Authors: Mohsen Asghari Ilani, Leila Amini, Hossein Karimi, Maryam Shavali Kuhshuri
Keywords: human-conducted onsite inspections, Numerous image processing, Numerous image, offering an alternative, onsite inspections
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Numerous image processing techniques (IPTs) have been employed to detect crack defects, offering an alternative to human-conducted onsite inspections. These IPTs manipulate images to extract defect features, particularly cracks in surfaces produced through Additive Manufacturing (AM). This article presents a vision-based approach that utilizes deep convolutional neural networks (CNNs) for crack detection in AM surfaces. Traditional image processing techniques face challenges with diverse real-world scenarios and varying crack types. To overcome these challenges, our proposed method leverages CNNs, eliminating the need for extensive feature extraction. Annotation for CNN training is facilitated by LabelImg without the requirement for additional IPTs. The trained CNN, enhanced by OpenCV preprocessing techniques, achieves an outstanding 99.54% accuracy on a dataset of 14,982 annotated images with resolutions of 1536 x 1103 pixels. Evaluation metrics exceeding 96% precision, 98% recall, and a 97% F1-score highlight the precision and effectiveness of the entire process.

[CV-85] CooPre: Cooperative Pretraining for V2X Cooperative Perception

Link: https://arxiv.org/abs/2408.11241
Authors: Seth Z. Zhao, Hao Xiang, Chenfeng Xu, Xin Xia, Bolei Zhou, Jiaqi Ma
Keywords: accurate multi-agent, rely on accurate, perception methods rely, cooperative perception, Existing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present a self-supervised learning method for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance the perception performance. Beyond simply extending the previous pre-training methods for point-cloud representation learning, we introduce a novel self-supervised Cooperative Pretraining framework (termed as CooPre) customized for a collaborative scenario. We point out that cooperative point-cloud sensing compensates for information loss among agents. This motivates us to design a novel proxy task for the 3D encoder to reconstruct LiDAR point clouds across different agents. Besides, we develop a V2X bird-eye-view (BEV) guided masking strategy which effectively allows the model to pay attention to 3D features across heterogeneous V2X agents (i.e., vehicles and infrastructure) in the BEV space. Noticeably, such a masking strategy effectively pretrains the 3D encoder and is compatible with mainstream cooperative perception backbones. Our approach, validated through extensive experiments on representative datasets (i.e., V2X-Real, V2V4Real, and OPV2V), leads to a performance boost across all V2X settings. Additionally, we demonstrate the framework’s improvements in cross-domain transferability, data efficiency, and robustness under challenging scenarios. The code will be made publicly available.

[CV-86] Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

Link: https://arxiv.org/abs/2408.11237
Authors: Christos Constantinou, Georgios Ioannides, Aman Chadha, Aaron Elkins, Edwin Simpson
Keywords: machine learning applications, model overconfidence, crucial in machine, machine learning, learning applications
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The majority of existing OOD detection methods predominantly address uni-modal inputs, such as images or texts. In the context of multi-modal documents, there is a notable lack of extensive research on the performance of these methods, which have primarily been developed with a focus on computer vision tasks. We propose a novel methodology termed as attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and significantly decreases the false positive rate (FPR) compared to existing solutions up to 7.5%. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.

[CV-87] Unified Deep Learning Model for Global Prediction of Aboveground Biomass, Canopy Height and Cover from High-Resolution, Multi-Sensor Satellite Imagery

Link: https://arxiv.org/abs/2408.11234
Authors: Manuel Weber, Carly Beneke, Clyde Wheeler
Keywords: international climate initiatives, ground based assessments, carbon stock, carbon accounting, climate initiatives
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Regular measurement of carbon stock in the world’s forests is critical for carbon accounting and reporting under national and international climate initiatives, and for scientific research, but has been largely limited in scalability and temporal resolution due to a lack of ground based assessments. Increasing efforts have been made to address these challenges by incorporating remotely sensed data. We present a new methodology which uses multi-sensor, multi-spectral imagery at a resolution of 10 meters and a deep learning based model which unifies the prediction of above ground biomass density (AGBD), canopy height (CH), canopy cover (CC) as well as uncertainty estimations for all three quantities. The model is trained on millions of globally sampled GEDI-L2/L4 measurements. We validate the capability of our model by deploying it over the entire globe for the year 2023 as well as annually from 2016 to 2023 over selected areas. The model achieves a mean absolute error for AGBD (CH, CC) of 26.1 Mg/ha (3.7 m, 9.9 %) and a root mean squared error of 50.6 Mg/ha (5.4 m, 15.8 %) on a globally sampled test dataset, demonstrating a significant improvement over previously published results. We also report the model performance against independently collected ground measurements published in the literature, which show a high degree of correlation across varying conditions. We further show that our pre-trained model facilitates seamless transferability to other GEDI variables due to its multi-head architecture.
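
The two error metrics quoted above, mean absolute error and root mean squared error, can be computed as in this short sketch. The reference and predicted values are made up for illustration and are not from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical biomass densities in Mg/ha.
reference = [100.0, 150.0, 200.0, 50.0]
predicted = [110.0, 140.0, 230.0, 50.0]

print(mae(reference, predicted))   # 12.5
print(rmse(reference, predicted))
```

RMSE is never smaller than MAE on the same data and grows faster when a few errors are large, which is consistent with the reported RMSE (50.6 Mg/ha) exceeding the reported MAE (26.1 Mg/ha).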

[CV-88] On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

Link: https://arxiv.org/abs/2408.11221
Authors: Sadia Ilyas, Ido Freeman, Matthias Rottmann
Keywords: critical task focused, data distribution, training data, critical task, task focused
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Out-of-distribution (OOD) object detection is a critical task focused on detecting objects that originate from a data distribution different from that of the training data. In this study, we investigate to what extent state-of-the-art open-vocabulary object detectors can detect unusual objects in street scenes, which are considered as OOD or rare scenarios with respect to common street scene datasets. Specifically, we evaluate their performance on the OoDIS Benchmark, which extends RoadAnomaly21 and RoadObstacle21 from SegmentMeIfYouCan, as well as LostAndFound, which was recently extended to object level annotations. The objective of our study is to uncover short-comings of contemporary object detectors in challenging real-world, and particularly in open-world scenarios. Our experiments reveal that open vocabulary models are promising for OOD object detection scenarios, however far from perfect. Substantial improvements are required before they can be reliably deployed in real-world applications. We benchmark four state-of-the-art open-vocabulary object detection models on three different datasets. Noteworthily, Grounding DINO achieves the best results on RoadObstacle21 and LostAndFound in our study with an AP of 48.3% and 25.4% respectively. YOLO-World excels on RoadAnomaly21 with an AP of 21.2%.

[CV-89] Revisiting Min-Max Optimization Problem in Adversarial Training

Link: https://arxiv.org/abs/2408.11218
Authors: Sina Hajer Ahmadi, Hassan Bahrami
Keywords: computer vision applications, real world puts, deep neural networks, neural networks, convolutional neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: The rise of computer vision applications in the real world puts the security of the deep neural networks at risk. Recent works demonstrate that convolutional neural networks are susceptible to adversarial examples - where the input images look similar to the natural images but are classified incorrectly by the model. To provide a rebuttal to this problem, we propose a new method to build robust deep neural networks against adversarial attacks by reformulating the saddle point optimization problem of Madry et al. (2017). Our proposed method offers significant resistance and a concrete security guarantee against multiple adversaries. The goal of this paper is to act as a stepping stone for a new variation of deep learning models which would lead towards fully robust deep learning models.
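
For context, the saddle point problem that the abstract cites (cite key madry2017towards, i.e. Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks") is usually written as below. The symbols are the conventional ones from that work and are assumptions here: θ are model parameters, D the data distribution, S the set of allowed perturbations, and L the training loss. The reformulation proposed in this paper is not given in the abstract and is not shown.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{\delta \in \mathcal{S}} L(\theta, x + \delta, y) \right]
```

The inner maximization searches for the worst-case perturbation of each input, and the outer minimization trains the model against it.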

[CV-90] A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation

Link: https://arxiv.org/abs/2408.11210
Authors: Yufan He, Pengfei Guo, Yucheng Tang, Andriy Myronenko, Vishwesh Nath, Ziyue Xu, Dong Yang, Can Zhao, Daguang Xu, Wenqi Li
Keywords: release of Segment, medical image segmentation, image segmentation, actively evaluating, evaluating its performance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Since the release of Segment Anything 2 (SAM2), the medical imaging community has been actively evaluating its performance for 3D medical image segmentation. However, different studies have employed varying evaluation pipelines, resulting in conflicting outcomes that obscure a clear understanding of SAM2’s capabilities and potential applications. We briefly review existing benchmarks and point out that the SAM2 paper clearly outlines a zero-shot evaluation pipeline, which simulates user clicks iteratively for up to eight iterations. We reproduced this interactive annotation simulation on 3D CT datasets and provided the results and code at this https URL. Our findings reveal that directly applying SAM2 on 3D medical imaging in a zero-shot manner is far from satisfactory. It is prone to generating false positives when foreground objects disappear, and annotating more slices cannot fully offset this tendency. For smaller single-connected objects like kidney and aorta, SAM2 performs reasonably well but for most organs it is still far behind state-of-the-art 3D annotation methods. More research and innovation are needed for the 3D medical imaging community to use SAM2 correctly.

[CV-91] PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Link: https://arxiv.org/abs/2408.11208
Authors: Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren
Keywords: driven significant progress, Self-supervised learning, driven significant, significant progress, Self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.

[CV-92] Quantum Inverse Contextual Vision Transformers (Q-ICVT): A New Frontier in 3D Object Detection for AVs CIKM’24

Link: https://arxiv.org/abs/2408.11207
Authors: Sanjay Bhargav Dharavath, Tanmoy Dam, Supriyo Chakraborty, Prithwiraj Roy, Aniruddha Maiti
Keywords: predominantly leverages multi-modal, Contextual Vision Transformers, Inverse Contextual Vision, Quantum Inverse Contextual, autonomous vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted as a short paper at CIKM '24

Abstract: The field of autonomous vehicles (AVs) predominantly leverages multi-modal integration of LiDAR and camera data to achieve better performance compared to using a single modality. However, the fusion process encounters challenges in detecting distant objects due to the disparity between the high resolution of cameras and the sparse data from LiDAR. Insufficient integration of global perspectives with local-level details results in sub-optimal fusion. To address this issue, we have developed an innovative two-stage fusion process called Quantum Inverse Contextual Vision Transformers (Q-ICVT). This approach leverages adiabatic computing in quantum concepts to create a novel reversible vision transformer known as the Global Adiabatic Transformer (GAT). GAT aggregates sparse LiDAR features with semantic features in dense images for cross-modal integration in a global form. Additionally, the Sparse Expert of Local Fusion (SELF) module maps the sparse LiDAR 3D proposals and encodes position information of the raw point cloud onto the dense camera feature space using a gating point fusion approach. Our experiments show that Q-ICVT achieves an mAPH of 82.54 for L2 difficulties on the Waymo dataset, improving by 1.88% over current state-of-the-art fusion methods. We also analyze GAT and SELF in ablation studies to highlight the impact of Q-ICVT. Our code is available at this https URL.

[CV-93] Robust Long-Range Perception Against Sensor Misalignment in Autonomous Vehicles

Link: https://arxiv.org/abs/2408.11196
Authors: Zi-Xiang Xia, Sudeep Fadadu, Yi Shi, Louis Foucard
Keywords: Advances in machine, machine learning algorithms, road users, enhancing safety, fusion have significantly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Advances in machine learning algorithms for sensor fusion have significantly improved the detection and prediction of other road users, thereby enhancing safety. However, even a small angular displacement in the sensor’s placement can cause significant degradation in output, especially at long range. In this paper, we demonstrate a simple yet generic and efficient multi-task learning approach that not only detects misalignment between different sensor modalities but is also robust against them for long-range perception. Along with the amount of misalignment, our method also predicts calibrated uncertainty, which can be useful for filtering and fusing predicted misalignment values over time. In addition, we show that the predicted misalignment parameters can be used for self-correcting input sensor data, further improving the perception performance under sensor misalignment.

[CV-94] Compress Guidance in Conditional Diffusion Sampling

Link: https://arxiv.org/abs/2408.11194
Authors: Anh-Dung Dinh, Daochang Liu, Chang Xu
Keywords: proves counterproductive due, entire sampling process, Enforcing guidance, model-fitting issue, expected condition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, Computer Vision and Machine Learning

Abstract: Enforcing guidance throughout the entire sampling process often proves counterproductive due to the model-fitting issue, where samples are generated to match the classifier’s parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing the guidance densely in the early stages of the process, we observe a significant improvement in image quality and diversity while also reducing the required guidance timesteps by nearly 40%. This approach addresses a major challenge in applying guidance effectively to generative tasks. Consequently, our proposed method, termed Compress Guidance, allows for the exclusion of a substantial number of guidance timesteps while still surpassing baseline models in image quality. We validate our approach through benchmarks on label conditional and text-to-image generative tasks across various datasets and models.
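
The abstract does not give the actual schedule, so the following is only a toy sketch of what "dense early, sparse later" guidance could look like. It assumes sampling steps are indexed from 0 (the first, noisiest step), applies guidance at every step up to `dense_until`, and at every `stride`-th step afterwards; the function name and all parameter values are hypothetical.

```python
def guided_steps(num_steps, dense_until, stride):
    """Return the sampling-step indices at which guidance is applied.

    Every step before `dense_until` is guided; after that, only every
    `stride`-th step is. This is an illustrative rule, not the paper's.
    """
    early = list(range(dense_until))
    late = list(range(dense_until, num_steps, stride))
    return early + late

steps = guided_steps(num_steps=100, dense_until=40, stride=3)
print(len(steps))  # 60 of 100 steps are guided, i.e. 40% fewer guidance calls
```

With these made-up settings the sampler would skip the guidance computation on 40 of 100 steps, which is in the range of the roughly 40% reduction the abstract reports.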

[CV-95] CRACKS: Crowdsourcing Resources for Analysis and Categorization of Key Subsurface faults

Link: https://arxiv.org/abs/2408.11185
Authors: Mohit Prabhushankar, Kiran Kokilepersaud, Jorge Quesada, Yavuz Yarici, Chen Zhou, Mohammad Alotaibi, Ghassan AlRegib, Ahmad Mustafa, Yusufjon Kumakov
Keywords: Crowdsourcing annotations, machine learning, created a paradigm, paradigm shift, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Crowdsourcing annotations has created a paradigm shift in the availability of labeled data for machine learning. Availability of large datasets has accelerated progress in common knowledge applications involving visual and language data. However, specialized applications that require expert labels lag in data availability. One such application is fault segmentation in subsurface imaging. Detecting, tracking, and analyzing faults has broad societal implications in predicting fluid flows, earthquakes, and storing excess atmospheric CO2. However, delineating faults with current practices is a labor-intensive activity that requires precise analysis of subsurface imaging data by geophysicists. In this paper, we propose the CRACKS dataset to detect and segment faults in subsurface images by utilizing crowdsourced resources. We leverage Amazon Mechanical Turk to obtain fault delineations from sections of the Netherlands North Sea subsurface images from (i) 26 novices who have no exposure to subsurface data and were shown a video describing and labeling faults, (ii) 8 practitioners who have previously interacted and worked on subsurface data, (iii) one geophysicist to label 7636 faults in the region. Note that all novices, practitioners, and the expert segment faults on the same subsurface volume with disagreements between and among the novices and practitioners. Additionally, each fault annotation is equipped with the confidence level of the annotator. The paper provides benchmarks on detecting and segmenting the expert labels, given the novice and practitioner labels. Additional details along with the dataset links and codes are available at this https URL.

[CV-96] Statistical Challenges with Dataset Construction: Why You Will Never Have Enough Images

Link: https://arxiv.org/abs/2408.11160
Authors: Josh Goldman, John K. Tsotsos
Keywords: Deep neural networks, Deep neural, computer vision benchmarks, achieved impressive performance, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 13 pages

Abstract:Deep neural networks have achieved impressive performance on many computer vision benchmarks in recent years. However, can we be confident that impressive performance on benchmarks will translate to strong performance in real-world environments? Many environments in the real world are safety critical, and even slight model failures can be catastrophic. Therefore, it is crucial to test models rigorously before deployment. We argue, through both statistical theory and empirical evidence, that selecting representative image datasets for testing a model is likely implausible in many domains. Furthermore, performance statistics calculated with non-representative image datasets are highly unreliable. As a consequence, we cannot guarantee that models which perform well on withheld test images will also perform well in the real world. Creating larger and larger datasets will not help, and bias aware datasets cannot solve this problem either. Ultimately, there is little statistical foundation for evaluating models using withheld test sets. We recommend that future evaluation methodologies focus on assessing a model’s decision-making process, rather than metrics such as accuracy.

[CV-97] An Interpretable Deep Learning Approach for Morphological Script Type Analysis ICDAR2024

Link: https://arxiv.org/abs/2408.11150
Authors: Malamatenia Vlachou-Efstathiou, Ioannis Siglidis, Dominique Stutzmann, Mathieu Aubry
Keywords: Defining script types, establishing classification criteria, Defining script, establishing classification, medieval handwriting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDAR 2024 Workshop on Computational Paleography (IWCP, 31 August - Athens, Greece)

Abstract: Defining script types and establishing classification criteria for medieval handwriting is a central aspect of palaeographical analysis. However, existing typologies often encounter methodological challenges, such as descriptive limitations and subjective criteria. We propose an interpretable deep learning-based approach to morphological script type analysis, which enables systematic and objective analysis and contributes to bridging the gap between qualitative observations and quantitative measurements. More precisely, we adapt a deep instance segmentation method to learn comparable character prototypes, representative of letter morphology, and provide qualitative and quantitative tools for their comparison and analysis. We demonstrate our approach by applying it to the Textualis Formata script type and its two subtypes formalized by A. Derolez: Northern and Southern Textualis.

[CV-98] ISLES 2024: The first longitudinal multimodal multi-center real-world dataset in (sub-)acute stroke

Link: https://arxiv.org/abs/2408.11142
Authors: Evamaria O. Riedel, Ezequiel de la Rosa, TheAnh Baran, Moritz Hernandez Petzsche, Hakim Baazaoui, Kaiyuan Yang, David Robben, Joaquin Oscar Seia, Roland Wiest, Mauricio Reyes, Ruisheng Su, Claus Zimmer, Tobias Boeckh-Behrens, Maria Berndt, Bjoern Menze, Benedikt Wiestler, Susanne Wegener, Jan S. Kirschke
Keywords: heavy socioeconomic burden, Stroke Lesion Segmentation, morbidity and mortality, placing a heavy, socioeconomic burden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Stroke remains a leading cause of global morbidity and mortality, placing a heavy socioeconomic burden. Over the past decade, advances in endovascular reperfusion therapy and the use of CT and MRI imaging for treatment guidance have significantly improved patient outcomes and are now standard in clinical practice. To develop machine learning algorithms that can extract meaningful and reproducible models of brain function for both clinical and research purposes from stroke images - particularly for lesion identification, brain health quantification, and prognosis - large, diverse, and well-annotated public datasets are essential. While only a few datasets with (sub-)acute stroke data were previously available, several large, high-quality datasets have recently been made publicly accessible. However, these existing datasets include only MRI data. In contrast, our dataset is the first to offer comprehensive longitudinal stroke data, including acute CT imaging with angiography and perfusion, follow-up MRI at 2-9 days, as well as acute and longitudinal clinical data up to a three-month outcome. The dataset includes a training dataset of n = 150 and a test dataset of n = 100 scans. Training data is publicly available, while test data will be used exclusively for model validation. We are making this dataset available as part of the 2024 edition of the Ischemic Stroke Lesion Segmentation (ISLES) challenge (this https URL), which continuously aims to establish benchmark methods for acute and sub-acute ischemic stroke lesion segmentation, aiding in creating open stroke imaging datasets and evaluating cutting-edge image processing algorithms.

[CV-99] Target-Oriented Object Grasping via Multimodal Human Guidance ECCV2024

Link: https://arxiv.org/abs/2408.11138
Authors: Pengwei Xie, Siang Chen, Dingchang Hu, Yixiang Dai, Kaiqin Yang, Guijin Wang
Keywords: encounters numerous challenges, numerous challenges, context of human-robot, human-robot interaction, interaction and collaboration
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024 Workshop on Assistive Computer Vision and Robotics (ACVR 2024)

Abstract:In the context of human-robot interaction and collaboration scenarios, robotic grasping still encounters numerous challenges. Traditional grasp detection methods generally analyze the entire scene to predict grasps, leading to redundancy and inefficiency. In this work, we reconsider 6-DoF grasp detection from a target-referenced perspective and propose a Target-Oriented Grasp Network (TOGNet). TOGNet specifically targets local, object-agnostic region patches to predict grasps more efficiently. It integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. Thus our system comprises two primary functional modules: a guidance module that identifies the target object in 3D space and TOGNet, which detects region-focal 6-DoF grasps around the target, facilitating subsequent motion planning. Through 50 target-grasping simulation experiments in cluttered scenes, our system achieves a success rate improvement of about 13.7%. In real-world experiments, we demonstrate that our method excels in various target-oriented grasping scenarios.

[CV-100] Binocular Model: A deep learning solution for online melt pool temperature analysis using dual-wavelength Imaging Pyrometry

Link: https://arxiv.org/abs/2408.11126
Authors: Javid Akhavan,Chaitanya Krishna Vallabh,Xianyun Zhao,Souran Manoochehri
Keywords: metal Additive Manufacturing, Additive Manufacturing, Melt Pool, ensuring part quality, defect prevention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract: In metal Additive Manufacturing (AM), monitoring the temperature of the Melt Pool (MP) is crucial for ensuring part quality, process stability, defect prevention, and overall process optimization. Traditional methods are slow to converge and require extensive manual effort to translate data into actionable insights, rendering them impractical for real-time monitoring and control. To address this challenge, we propose an Artificial Intelligence (AI)-based solution aimed at reducing manual data processing reliance and improving the efficiency of transitioning from data to insight. In our study, we utilize a dataset comprising dual-wavelength real-time process monitoring data and corresponding temperature maps. We introduce a deep learning model called the “Binocular model,” which exploits dual input observations to perform a precise analysis of MP temperature in Laser Powder Bed Fusion (L-PBF). Through advanced deep learning techniques, we seamlessly convert raw data into temperature maps, significantly streamlining the process and enabling batch processing at a rate of up to 750 frames per second, approximately 1000 times faster than conventional methods. Our Binocular model achieves high accuracy in temperature estimation, evidenced by a 0.95 R-squared score, while simultaneously enhancing processing efficiency by a factor of approximately 1000. This model directly addresses the challenge of real-time MP temperature monitoring and offers insights into the encountered constraints and the benefits of our Deep Learning-based approach. By combining efficiency and precision, our work contributes to the advancement of temperature monitoring in L-PBF, thus driving progress in the field of metal AM.

[CV-101] GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Link: https://arxiv.org/abs/2408.11085
Authors: Changkun Liu,Shuai Chen,Yash Bhalgat,Siyan Hu,Zirui Wang,Ming Cheng,Victor Adrian Prisacariu,Tristan Braud
Keywords: Gaussian Splatting, test-time camera pose, representation and propose, test-time camera, camera pose refinement
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: The project page is available at https://gsloc.active.vision

Abstract:We leverage 3D Gaussian Splatting (3DGS) as a scene representation and propose a novel test-time camera pose refinement framework, GSLoc. This framework enhances the localization accuracy of state-of-the-art absolute pose regression and scene coordinate regression methods. The 3DGS model renders high-quality synthetic images and depth maps to facilitate the establishment of 2D-3D correspondences. GSLoc obviates the need for training feature extractors or descriptors by operating directly on RGB images, utilizing the 3D vision foundation model, MASt3R, for precise 2D matching. To improve the robustness of our model in challenging outdoor environments, we incorporate an exposure-adaptive module within the 3DGS framework. Consequently, GSLoc enables efficient pose refinement given a single RGB query and a coarse initial pose estimation. Our proposed approach surpasses leading NeRF-based optimization methods in both accuracy and runtime across indoor and outdoor visual localization benchmarks, achieving state-of-the-art accuracy on two indoor datasets.

[CV-102] Solving Oscillator ODEs via Soft-constrained Physics-informed Neural Network with Small Data

Link: https://arxiv.org/abs/2408.11077
Authors: Kai-liang Lu,Yu-meng Su,Cheng Qiu,Zhuo Bi,Wen-jun Zhang
Keywords: physics-informed neural network, compared physics-informed neural, conventional neural network, neural network, solving differential equations
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: 17 pages, 7 figures, 2 tables, etc

Abstract: This paper compared physics-informed neural network (PINN), conventional neural network (NN) and numerical discretization methods on solving differential equations through literature research. We formalized the mathematical framework and computational flow of the soft-constrained PINN method for solving differential equations (e.g., ODEs/PDEs). Its working mechanism and its accuracy and efficiency were experimentally verified by solving typical linear and non-linear oscillator ODEs. The implementation of the PINN method based on DeepXDE is not only lightweight in code and efficient in training, but also flexible across platforms. PINN greatly reduces the need for labeled data: when the nonlinearity of the ODE is weak, a very small amount of supervised training data plus a small number of collocation points are sufficient to predict the solution; in the minimalist case, only one or two training points (with initial values) are needed for first- or second-order ODEs, respectively. Strongly nonlinear ODEs also require only an appropriate increase in the number of training and collocation points, which still has significant advantages over conventional NN. With the aid of collocation points and the use of physical information, PINN has the ability to extrapolate data outside the time domain covered by the training set, and is robust to noisy data, thus offering enhanced generalization capabilities. Training is accelerated when the gains obtained along with the reduction in the amount of data outweigh the delay caused by the increase in the loss function terms. The soft-constrained PINN method can easily impose a physical law (e.g., energy conservation) constraint by adding a regularization term to the total loss function, thus improving the solution performance of ODEs that obey this physical law.
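
The soft-constrained loss described in this abstract (a data term, an ODE residual term at collocation points, and an optional physical-law regularizer) can be illustrated without a neural network. The following is a minimal sketch, not the authors' DeepXDE implementation: it assumes a toy harmonic oscillator u'' + ω²u = 0, uses central finite differences as a stand-in for automatic differentiation, and the names `OMEGA`, `w_ode` and `w_energy` are hypothetical.

```python
import math

OMEGA = 2.0  # assumed angular frequency of the toy oscillator u'' + OMEGA^2 * u = 0


def first_derivative(u, t, h=1e-3):
    """Central finite-difference estimate of u'(t) (stand-in for autodiff)."""
    return (u(t + h) - u(t - h)) / (2.0 * h)


def second_derivative(u, t, h=1e-3):
    """Central finite-difference estimate of u''(t)."""
    return (u(t + h) - 2.0 * u(t) + u(t - h)) / (h * h)


def energy(u, t):
    """Total energy 0.5*u'^2 + 0.5*(OMEGA*u)^2, conserved by exact solutions."""
    return 0.5 * first_derivative(u, t) ** 2 + 0.5 * (OMEGA * u(t)) ** 2


def soft_constrained_loss(u, data, collocation, w_ode=1.0, w_energy=1.0):
    """Data loss + weighted ODE residual loss + weighted energy-conservation loss."""
    data_loss = sum((u(t) - y) ** 2 for t, y in data) / len(data)
    ode_loss = sum(
        (second_derivative(u, t) + OMEGA ** 2 * u(t)) ** 2 for t in collocation
    ) / len(collocation)
    e0 = energy(u, collocation[0])
    energy_loss = sum((energy(u, t) - e0) ** 2 for t in collocation) / len(collocation)
    return data_loss + w_ode * ode_loss + w_energy * energy_loss


# One labeled point (the initial value) plus unlabeled collocation points.
data = [(0.0, 1.0)]
collocation = [0.1 * i for i in range(50)]
exact = lambda t: math.cos(OMEGA * t)        # satisfies the ODE
wrong = lambda t: math.cos(1.5 * OMEGA * t)  # fits the data point but violates the ODE
loss_exact = soft_constrained_loss(exact, data, collocation)
loss_wrong = soft_constrained_loss(wrong, data, collocation)
```

Both candidates match the single labeled point, so only the residual and energy terms separate them; this is the sense in which collocation points can substitute for labeled data.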

[CV-103] DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Link: https://arxiv.org/abs/2408.11071
Authors: Pucheng Dang,Xing Hu,Dong Li,Rui Zhang,Qi Guo,Kaidi Xu
Keywords: raise misuse concerns, models raise misuse, misuse concerns, raise misuse, creating prohibited
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract: Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods are proposed to enhance or expose the T2I model’s capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.
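
Zeroth-order optimization, the generic technique named in this abstract, estimates gradients from function evaluations alone. The sketch below shows only that idea on a toy continuous objective; it is not the DiffZOO method, it does not reproduce C-PRV or D-PRV, and the function names and step sizes are assumptions chosen for illustration.

```python
def zo_gradient(f, x, h=1e-4):
    """Coordinate-wise symmetric finite-difference gradient estimate.

    Uses only queries to f (no analytic gradient), two queries per coordinate.
    """
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad


def zo_descent(f, x0, lr=0.1, steps=100):
    """Plain gradient descent driven by the query-based gradient estimate."""
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Toy black-box objective with its minimum at (1, 1).
objective = lambda x: sum((xi - 1.0) ** 2 for xi in x)
solution = zo_descent(objective, [3.0, -2.0])
```

The query cost grows linearly with the number of coordinates, which is the usual trade-off of coordinate-wise estimators compared with random-direction ones.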

[CV-104] NuSegDG: Integration of Heterogeneous Space and Gaussian Kernel for Domain-Generalized Nuclei Segmentation

Link: https://arxiv.org/abs/2408.11787
Authors: Zhenye Lou,Qing Xu,Zekun Jiang,Xiangjian He,Zhen Chen,Yi Wang,Chenxin Li,Maggie M. He,Wenting Duan
Keywords: Domain-generalized nuclei segmentation, Domain-generalized nuclei, cell types, stain strategies, knowledge learned
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: Under Review

Abstract:Domain-generalized nuclei segmentation refers to the generalizability of models to unseen domains based on knowledge learned from source domains and is challenged by various image conditions, cell types, and stain strategies. Recently, the Segment Anything Model (SAM) has made great success in universal image segmentation by interactive prompt modes (e.g., point and box). Despite its strengths, the original SAM presents limited adaptation to medical images. Moreover, SAM requires providing manual bounding box prompts for each object to produce satisfactory segmentation masks, so it is laborious in nuclei segmentation scenarios. To address these limitations, we propose a domain-generalizable framework for nuclei image segmentation, abbreviated to NuSegDG. Specifically, we first devise a Heterogeneous Space Adapter (HS-Adapter) to learn multi-dimensional feature representations of different nuclei domains by injecting a small number of trainable parameters into the image encoder of SAM. To alleviate the labor-intensive requirement of manual prompts, we introduce a Gaussian-Kernel Prompt Encoder (GKP-Encoder) to generate density maps driven by a single point, which guides segmentation predictions by mixing position prompts and semantic prompts. Furthermore, we present a Two-Stage Mask Decoder (TSM-Decoder) to effectively convert semantic masks to instance maps without the manual demand for morphological shape refinement. Based on our experimental evaluations, the proposed NuSegDG demonstrates state-of-the-art performance in nuclei instance segmentation, exhibiting superior domain generalization capabilities. The source code is available at this https URL.
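
The abstract describes generating a density map from a single point prompt. A minimal sketch of that general idea follows, assuming an isotropic 2D Gaussian centered on the point; the function name and the `sigma` value are hypothetical, and the actual GKP-Encoder may differ.

```python
import math


def gaussian_density_map(height, width, point, sigma=2.0):
    """Return a height x width map that peaks at `point` = (row, col).

    Each cell holds exp(-d^2 / (2 * sigma^2)), where d is the distance to the point.
    """
    py, px = point
    return [
        [
            math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


density = gaussian_density_map(9, 9, (4, 4))
```

A larger `sigma` spreads the prompt over a wider neighborhood, so in practice it would presumably be chosen relative to the typical nucleus size.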

[CV-105] FedGS: Federated Gradient Scaling for Heterogeneous Medical Image Segmentation MICCAI2024

Link: https://arxiv.org/abs/2408.11701
Authors: Philip Schutte,Valentina Corbetta,Regina Beets-Tan,Wilson Silva
Keywords: automated medical image, Federated Learning, Deep Learning, enabling collaborative model, collaborative model training
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 2 figures, 1 table, accepted at MICCAI 2024 Workshop on Distributed, Collaborative, Federated Learning Workshop (DeCaF). This is the submitted manuscript with added link to github repo, funding acknowledgements and author names and affiliations. No further post submission improvements or corrections were integrated. Final version not published yet

Abstract: Federated Learning (FL) in Deep Learning (DL)-automated medical image segmentation helps preserve privacy by enabling collaborative model training without sharing patient data. However, FL faces challenges with data heterogeneity among institutions, leading to suboptimal global models. Integrating Disentangled Representation Learning (DRL) in FL can enhance robustness by separating data into distinct representations. Existing DRL methods assume heterogeneity lies solely in style features, overlooking content-based variability like lesion size and shape. We propose FedGS, a novel FL aggregation method, to improve segmentation performance on small, under-represented targets while maintaining overall efficacy. FedGS demonstrates superior performance over FedAvg, particularly for small lesions, across PolypGen and LiTS datasets. The code and pre-trained checkpoints are available at the following link: this https URL
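
For context, the FedAvg baseline that FedGS is compared against aggregates client models by a sample-size-weighted average. The sketch below shows only that baseline, with parameters flattened into plain lists; it is not the FedGS gradient-scaling rule, which the abstract does not spell out.

```python
def fed_avg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg).

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients; the second holds three times as much data and dominates the average.
global_weights = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Because the weighting follows dataset size, an institution with few examples of a small lesion type has little influence on the global model, which is the kind of under-representation the abstract sets out to address.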

[CV-106] LiFCal: Online Light Field Camera Calibration via Bundle Adjustment

Link: https://arxiv.org/abs/2408.11682
Authors: Aymeric Fleith,Doaa Ahmed,Daniel Cremers,Niclas Zeller
Keywords: MLA-based light field, geometric online calibration, light field, Pattern Recognition, light field cameras
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to the German Conference on Pattern Recognition (GCPR) 2024

Abstract: We propose LiFCal, a novel geometric online calibration pipeline for MLA-based light field cameras. LiFCal accurately determines model parameters from a moving camera sequence without precise calibration targets, integrating arbitrary metric scaling constraints. It optimizes intrinsic parameters of the light field camera model, the 3D coordinates of a sparse set of scene points and camera poses in a single bundle adjustment defined directly on micro image points. We show that LiFCal can reliably and repeatably calibrate a focused plenoptic camera using different input sequences, providing intrinsic camera parameters extremely close to state-of-the-art methods, while offering two main advantages: it can be applied in a target-free scene, and it is implemented online in a complete and continuous pipeline. Furthermore, we demonstrate the quality of the obtained camera parameters in downstream tasks like depth estimation and SLAM. Webpage: this https URL

[CV-107] OAPT: Offset-Aware Partition Transformer for Double JPEG Artifacts Removal

Link: https://arxiv.org/abs/2408.11480
Authors: Qiao Mo,Yukang Ding,Jinhua Hao,Qiang Zhu,Ming Sun,Chao Zhou,Feiyu Chen,Shuyuan Zhu
Keywords: Deep learning-based methods, shown remarkable performance, JPEG artifacts removal, single JPEG artifacts, Deep learning-based
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 9 figures. Codes and models are available at this https URL

Abstract: Deep learning-based methods have shown remarkable performance in the single JPEG artifacts removal task. However, existing methods tend to degrade on double JPEG images, which are prevalent in real-world scenarios. To address this issue, we propose the Offset-Aware Partition Transformer for double JPEG artifacts removal, termed OAPT. We conduct an analysis of double JPEG compression that results in up to four patterns within each 8x8 block and design our model to cluster the similar patterns to remedy the difficulty of restoration. Our OAPT consists of two components: a compression offset predictor and an image reconstructor. Specifically, the predictor estimates pixel offsets between the first and second compression, which are then utilized to divide different patterns. The reconstructor is mainly based on several Hybrid Partition Attention Blocks (HPAB), combining vanilla window-based self-attention and sparse attention for clustered pattern features. Extensive experiments demonstrate that OAPT outperforms the state-of-the-art method by more than 0.16dB in the double JPEG image restoration task. Moreover, without increasing any computation cost, the pattern clustering module in HPAB can serve as a plugin to enhance other transformer-based image restoration methods. The code will be available at this https URL.
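
One way to read the "up to four patterns within each 8x8 block" statement is geometric: if the second compression grid is shifted by a pixel offset relative to the first, each 8x8 block of the second grid overlaps up to four blocks of the first grid. The sketch below illustrates that reading only; it is an assumption about the abstract, not the authors' predictor or partition code, and the function names are hypothetical.

```python
def pattern_labels(dy, dx, block=8):
    """Label each pixel of one second-grid block by the first-grid block it came from.

    (dy, dx) is the assumed offset between the two compression grids.
    Labels are 0..3: 2 * (crossed a horizontal boundary) + (crossed a vertical boundary).
    """
    dy %= block
    dx %= block
    return [
        [2 * ((y + dy) // block) + ((x + dx) // block) for x in range(block)]
        for y in range(block)
    ]


def count_patterns(dy, dx, block=8):
    """Number of distinct regions (patterns) inside one block for a given offset."""
    return len({label for row in pattern_labels(dy, dx, block) for label in row})


aligned = count_patterns(0, 0)      # grids coincide
one_axis = count_patterns(0, 3)     # shifted along one axis only
both_axes = count_patterns(3, 5)    # shifted along both axes
```

Under this reading, pixels sharing a label went through the same first-compression block, which would explain why grouping them before attention is helpful.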

[CV-108] HMT-UNet: A hybrid Mamba-Transformer Vision UNet for Medical Image Segmentation

Link: https://arxiv.org/abs/2408.11289
Authors: Mingya Zhang,Limei Gu,Tingshen Ling,Xianping Tao
Keywords: State Space Models, medical image segmentation, State Space, models based, medical image
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: arXiv admin note: text overlap with arXiv:2403.09157 ; text overlap with arXiv:2407.08083 by other authors

Abstract: In the field of medical image segmentation, models based on both CNN and Transformer have been thoroughly investigated. However, CNNs have limited modeling capabilities for long-range dependencies, making it challenging to exploit the semantic information within images fully. On the other hand, the quadratic computational complexity poses a challenge for Transformers. State Space Models (SSMs), such as Mamba, have been recognized as a promising method. They not only demonstrate superior performance in modeling long-range interactions, but also preserve a linear computational complexity. The hybrid mechanism of SSM (State Space Model) and Transformer, after meticulous design, can enhance its capability for efficient modeling of visual features. Extensive experiments have demonstrated that integrating the self-attention mechanism into the hybrid part behind the layers of Mamba’s architecture can greatly improve the modeling capacity to capture long-range spatial dependencies. In this paper, leveraging the hybrid mechanism of SSM, we propose a U-shape architecture model for medical image segmentation, named Hybrid Transformer vision Mamba UNet (HTM-UNet). We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB, ETIS-Larib PolypDB public datasets and ZD-LCI-GIM private dataset. The results indicate that HTM-UNet exhibits competitive performance in medical image segmentation tasks. Our code is available at this https URL.

[CV-109] OCTCube: A 3D foundation model for optical coherence tomography that improves cross-dataset cross-disease cross-device and cross-modality analysis

Link: https://arxiv.org/abs/2408.11227
Authors: Zixuan Liu,Hanwen Xu,Addie Woicik,Linda G. Shapiro,Marian Blazes,Yue Wu,Cecilia S. Lee,Aaron Y. Lee,Sheng Wang
Keywords: Optical coherence tomography, Optical coherence, OCT, OCT images, coherence tomography
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Optical coherence tomography (OCT) has become critical for diagnosing retinal diseases as it enables 3D images of the retina and optic nerve. OCT acquisition is fast, non-invasive, affordable, and scalable. Due to its broad applicability, massive numbers of OCT images have been accumulated in routine exams, making it possible to train large-scale foundation models that can generalize to various diagnostic tasks using OCT images. Nevertheless, existing foundation models for OCT only consider 2D image slices, overlooking the rich 3D structure. Here, we present OCTCube, a 3D foundation model pre-trained on 26,605 3D OCT volumes encompassing 1.62 million 2D OCT images. OCTCube is developed based on 3D masked autoencoders and exploits FlashAttention to reduce the larger GPU memory usage caused by modeling 3D volumes. OCTCube outperforms 2D models when predicting 8 retinal diseases in both inductive and cross-dataset settings, indicating that utilizing the 3D structure in the model instead of 2D data results in significant improvement. OCTCube further shows superior performance on cross-device prediction and when predicting systemic diseases, such as diabetes and hypertension, further demonstrating its strong generalizability. Finally, we propose a contrastive-self-supervised-learning-based OCT-IR pre-training framework (COIP) for cross-modality analysis on OCT and infrared retinal (IR) images, where the OCT volumes are embedded using OCTCube. We demonstrate that COIP enables accurate alignment between OCT and IR en face images. Collectively, OCTCube, a 3D OCT foundation model, demonstrates significantly better performance against 2D models on 27 out of 29 tasks and comparable performance on the other two tasks, paving the way for AI-based retinal disease diagnosis.

Machine Learning

[LG-0] Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

Link: https://arxiv.org/abs/2408.11816
Authors: Anthony GX-Chen,Kenneth Marino,Rob Fergus
Keywords: difficult exploration problems, describing a set, face of difficult, difficult exploration, study whether giving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint

Abstract: In the face of difficult exploration problems in reinforcement learning, we study whether giving an agent an object-centric mapping (describing a set of items and their attributes) allows for more efficient learning. We found this problem is best solved hierarchically by modelling items at a higher level of state abstraction than pixels, and attribute change at a higher level of temporal abstraction than primitive actions. This abstraction simplifies the transition dynamic by making specific future states easier to predict. We make use of this to propose a fully model-based algorithm that learns a discriminative world model, plans to explore efficiently with only a count-based intrinsic reward, and can subsequently plan to reach any discovered (abstract) states. We demonstrate the model’s ability to (i) efficiently solve single tasks, (ii) transfer zero-shot and few-shot across item types and environments, and (iii) plan across long horizons. Across a suite of 2D crafting and MiniHack environments, we empirically show our model significantly out-performs state-of-the-art low-level methods (without abstraction), as well as performant model-free and model-based methods using the same abstraction. Finally, we show how to reinforce learn low level object-perturbing policies, as well as supervise learn the object mapping itself.
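
A count-based intrinsic reward, as mentioned in this abstract, pays the agent more for visiting states it has seen less often. The sketch below assumes the common 1/sqrt(N) form over hashable abstract states; the abstract does not give the exact bonus used, and the class name and state encoding are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Intrinsic reward 1 / sqrt(N(s)), where N(s) counts visits to abstract state s."""

    def __init__(self):
        self.counts = Counter()

    def reward(self, state):
        """Record a visit to `state` and return the resulting exploration bonus."""
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


bonus = CountBonus()
first_visit = bonus.reward(("wood", 1))    # novel abstract state
second_visit = bonus.reward(("wood", 1))   # same state again, smaller bonus
other_state = bonus.reward(("stone", 0))   # a different novel state
```

Counting over abstract item-attribute states rather than pixels keeps the table small, which is one reason a plain count can suffice when such an abstraction is available.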

[LG-1] Scaling Cross-Embodied Learning: One Policy for Manipulation Navigation Locomotion and Aviation

Link: https://arxiv.org/abs/2408.11812
Authors: Ria Doshi,Homer Walke,Oier Mees,Sudeep Dasari,Sergey Levine
Keywords: Modern machine learning, attain broad generalization, Modern machine, rely on large, attain broad
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Notes: Project website at this https URL

Abstract:Modern machine learning systems rely on large datasets to attain broad generalization, and this often poses a challenge in robot learning, where each robotic platform and task might have only a small dataset. By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets, which in turn can lead to better generalization and robustness. However, training a single policy on multi-robot data is challenging because robots can have widely varying sensors, actuators, and control frequencies. We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment. We train CrossFormer on the largest and most diverse dataset to date, 900K trajectories across 20 different robot embodiments. We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Unlike prior work, our model does not require manual alignment of the observation or action spaces. Extensive experiments in the real world show that our method matches the performance of specialist policies tailored for each embodiment, while also significantly outperforming the prior state of the art in cross-embodiment learning.

[LG-2] ACE: A Cross-Platform Visual-Exoskeletons System for Low-Cost Dexterous Teleoperation

Link: https://arxiv.org/abs/2408.11805
Authors: Shiqi Yang,Minghuan Liu,Yuzhe Qin,Runyu Ding,Jialong Li,Xuxin Cheng,Ruihan Yang,Sha Yi,Xiaolong Wang
Keywords: recently collected large-scale, large-scale robot data, collected large-scale robot, demonstrations has shown, effective approach
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Webpage: this https URL

Abstract: Learning from demonstrations has been shown to be an effective approach to robotic manipulation, especially with the recently collected large-scale robot data with teleoperation systems. Building an efficient teleoperation system across diverse robot platforms has become more crucial than ever. However, there is a notable lack of cost-effective and user-friendly teleoperation systems for different end-effectors, e.g., anthropomorphic robot hands and grippers, that can operate across multiple platforms. To address this issue, we develop ACE, a cross-platform visual-exoskeleton system for low-cost dexterous teleoperation. Our system utilizes a hand-facing camera to capture 3D hand poses and an exoskeleton mounted on a portable base, enabling accurate real-time capture of both finger and wrist poses. Compared to previous systems, which often require hardware customization according to different robots, our single system can generalize to humanoid hands, arm-hands, arm-gripper, and quadruped-gripper systems with high-precision teleoperation. This enables imitation learning for complex manipulation tasks on diverse platforms.

[LG-3] Approaching Deep Learning through the Spectral Dynamics of Weights

Link: https://arxiv.org/abs/2408.11804
Authors: David Yunis,Kumar Kshitij Patel,Samuel Wheeler,Pedro Savarese,Gal Vardi,Karen Livescu,Michael Maire,Matthew R. Walter
Keywords: empirical approach centered, deep learning, propose an empirical, empirical approach, approach centered
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract: We propose an empirical approach centered on the spectral dynamics of weights – the behavior of singular values and vectors during optimization – to unify and clarify several phenomena in deep learning. We identify a consistent bias in optimization across various experiments, from small-scale “grokking” to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers. We also demonstrate that weight decay enhances this bias beyond its role as a norm regularizer, even in practical systems. Moreover, we show that these spectral dynamics distinguish memorizing networks from generalizing ones, offering a novel perspective on this longstanding conundrum. Additionally, we leverage spectral dynamics to explore the emergence of well-performing sparse subnetworks (lottery tickets) and the structure of the loss surface through linear mode connectivity. Our findings suggest that spectral dynamics provide a coherent framework to better understand the behavior of neural networks across diverse settings.

[LG-4] LLM Pruning and Distillation in Practice: The Minitron Approach

Link: https://arxiv.org/abs/2408.11796
Authors: Sharath Turuvekere Sreenivas,Saurav Muralidharan,Raviraj Joshi,Marcin Chochowski,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro,Jan Kautz,Pavlo Molchanov
Keywords: present a comprehensive, comprehensive report, report on compressing, Evaluation Harness, Mistral NeMo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.
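
Distillation, as used in this abstract, trains the pruned student to match the teacher's output distribution. The sketch below assumes the common forward KL divergence between temperature-scaled softmax outputs for a single token position; the abstract does not state which loss the authors use, so both the loss form and the `temperature` parameter are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, [2.0, 0.5, -1.0])   # identical distributions
mismatched = distillation_loss(teacher, [0.0, 0.0, 0.0])  # uniform student
```

In a training loop this quantity would be averaged over token positions and minimized with respect to the student's parameters only.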

[LG-5] Optical ISAC: Fundamental Performance Limits and Transceiver Design

Link: https://arxiv.org/abs/2408.11792
Authors: Alireza Ghazavi Khorasgani,Mahtab Mirmohseni,Ahmed Elzanaty
Keywords: single-input single-output, single-input multiple-output, paper characterizes, system with single-input, optimal capacity-distortion
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Notes: 7 pages, 3 figures

Abstract:This paper characterizes the optimal capacity-distortion (C-D) tradeoff in an optical point-to-point (P2P) system with single-input single-output for communication and single-input multiple-output for sensing (SISO-SIMO-C/S) within an integrated sensing and communication (ISAC) framework. We introduce practical, asymptotically optimal maximum a posteriori (MAP) and maximum likelihood estimators (MLE) for target distance, addressing nonlinear measurement-to-state relationships and non-conjugate priors. Our results show these estimators converge to the Bayesian Cramer-Rao bound (BCRB) as sensing antennas increase. We also demonstrate that the achievable rate-CRB (AR-CRB) serves as an outer bound (OB) for the optimal C-D region. To optimize input distribution across the Pareto boundary of the C-D region, we propose two algorithms: an iterative Blahut-Arimoto algorithm (BAA)-type method and a memory-efficient closed-form (CF) approach, including a CF optimal distribution for high optical signal-to-noise ratio (O-SNR) conditions. Additionally, we extend and modify the Deterministic-Random Tradeoff (DRT) to this optical ISAC context.
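
The abstract's input-distribution optimizer is described as a Blahut-Arimoto-type method. For reference, the sketch below is the classical Blahut-Arimoto iteration for the capacity of a discrete memoryless channel; it does not include the sensing-distortion constraint or the optical channel model of the paper, and the binary symmetric channel used as input is an assumed toy example.

```python
import math


def blahut_arimoto(channel, iterations=200):
    """Classical Blahut-Arimoto iteration.

    channel[x][y] = P(y | x). Returns (capacity in bits, optimizing input distribution).
    """
    n_in, n_out = len(channel), len(channel[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution
    for _ in range(iterations):
        # Output distribution induced by the current input distribution.
        q = [sum(p[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
        # d[x] = exp(KL(channel[x] || q)); inputs that stand out from q gain mass.
        d = [
            math.exp(sum(
                channel[x][y] * math.log(channel[x][y] / q[y])
                for y in range(n_out) if channel[x][y] > 0.0
            ))
            for x in range(n_in)
        ]
        norm = sum(p[x] * d[x] for x in range(n_in))
        p = [p[x] * d[x] / norm for x in range(n_in)]
    q = [sum(p[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
    capacity = sum(
        p[x] * channel[x][y] * math.log2(channel[x][y] / q[y])
        for x in range(n_in) for y in range(n_out)
        if p[x] > 0.0 and channel[x][y] > 0.0
    )
    return capacity, p


# Binary symmetric channel with crossover probability 0.1.
capacity, input_dist = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
```

For this symmetric channel the result should agree with the closed form 1 - H(0.1), which gives a quick sanity check on the iteration.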

[LG-6] Critique-out-Loud Reward Models

Link: https://arxiv.org/abs/2408.11791
Authors: Zachary Ankner,Mansheej Paul,Brandon Cui,Jonathan D. Chang,Prithviraj Ammanabrolu
Keywords: CLoud reward models, reward models, CLoud reward, reward, underlying large language
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant’s response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.
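
The control flow described in this abstract (generate a critique, predict a scalar reward from it, optionally average over several critiques, then use the reward for Best-of-N selection) can be sketched as follows. The `generate_critique` and `score` callables are hypothetical placeholders for model calls, and the toy stand-ins below exist only to make the example runnable; none of this is the authors' code.

```python
from statistics import mean


def cloud_reward(prompt, response, generate_critique, score, n_critiques=1):
    """Critique first, then score; average over critiques for self-consistency."""
    rewards = []
    for seed in range(n_critiques):
        critique = generate_critique(prompt, response, seed)
        rewards.append(score(prompt, response, critique))
    return mean(rewards)


def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate response with the highest reward."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))


# Toy stand-ins: the "critique" reports the response length and the "reward" equals it.
def toy_critique(prompt, response, seed):
    return "length=%d (sample %d)" % (len(response), seed)


def toy_score(prompt, response, critique):
    return float(len(response))


averaged = cloud_reward("q", "abcd", toy_critique, toy_score, n_critiques=3)
chosen = best_of_n(
    "q", ["a", "abc", "ab"],
    lambda p, r: cloud_reward(p, r, toy_critique, toy_score),
)
```

Raising `n_critiques` spends more inference compute per reward estimate, which is the knob the abstract refers to as dynamic inference compute.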

[LG-7] RFID based Health Adherence Medicine Case Using Fair Federated Learning

Link: https://arxiv.org/abs/2408.11782
Authors: Ali Kamrani khodaei, Sina Hajer Ahmadi
Keywords: Smart Pill Case, nonadherence significantly reduces, Smart Pill, Pill Case, Intelligent Drug Administration
Categories: Machine Learning (cs.LG)

Abstract:Medication nonadherence significantly reduces the effectiveness of therapies, yet it remains prevalent among patients. Nonadherence has been linked to adverse outcomes, including increased risks of mortality and hospitalization. Although various methods exist to help patients track medication schedules, such as the Intelligent Drug Administration System (IDAS) and Smart Blister, these tools often face challenges that hinder their commercial viability. Building on the principles of dosage measurement and information communication in IoT, we introduce the Smart Pill Case, a smart health adherence tool that leverages RFID-based data recording and NFC-based data extraction. This system incorporates a load cell for precise dosage measurement and features an Android app to monitor medication intake, offer suggestions, and issue warnings. To enhance the effectiveness and personalization of the Smart Pill Case, we propose integrating federated learning into the system. Federated learning allows the Smart Pill Case to learn from medication adherence patterns across multiple users without compromising individual privacy. By training machine learning models on decentralized data collected from various Smart Pill Cases, the system can continuously improve its recommendations and warnings, adapting to the diverse needs and behaviors of users. This approach not only enhances the tool's ability to support medication adherence but also ensures that sensitive user data remains secure and private.

[LG-8] Sum of Squares Circuits

Link: https://arxiv.org/abs/2408.11778
Authors: Lorenzo Loconte, Stefan Mengel, Antonio Vergari
Keywords: Designing expressive generative, Designing expressive, support exact, exact and efficient, efficient inference
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Algebraic Geometry (math.AG)

Abstract:Designing expressive generative models that support exact and efficient inference is a core question in probabilistic ML. Probabilistic circuits (PCs) offer a framework where this tractability-vs-expressiveness trade-off can be analyzed theoretically. Recently, squared PCs encoding subtractive mixtures via negative parameters have emerged as tractable models that can be exponentially more expressive than monotonic PCs, i.e., PCs with positive parameters only. In this paper, we provide a more precise theoretical characterization of the expressiveness relationships among these models. First, we prove that squared PCs can be less expressive than monotonic ones. Second, we formalize a novel class of PCs – sum of squares PCs – that can be exponentially more expressive than both squared and monotonic PCs. Around sum of squares PCs, we build an expressiveness hierarchy that allows us to precisely unify and separate different tractable model classes such as Born Machines and PSD models, and other recently introduced tractable probabilistic models by using complex parameters. Finally, we empirically show the effectiveness of sum of squares circuits in performing distribution estimation.
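
To make the idea of a squared model with negative parameters concrete, here is a small sketch under simplifying assumptions: instead of a full probabilistic circuit it squares a one-dimensional mixture of Gaussians with one negative weight. The function names and the example weights are illustrative only. The point it shows is tractability: the normalizer of the squared mixture has a closed form, because the integral of a product of two Gaussian densities is itself a Gaussian density evaluated at the difference of the means.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def squared_mixture_normalizer(weights, means, variances):
    """Closed-form integral over the real line of (sum_i w_i N(x; mu_i, var_i))^2."""
    z = 0.0
    for wi, mi, vi in zip(weights, means, variances):
        for wj, mj, vj in zip(weights, means, variances):
            # Identity: integral of N(x; mi, vi) * N(x; mj, vj) dx = N(mi; mj, vi + vj)
            z += wi * wj * gaussian_pdf(mi, mj, vi + vj)
    return z

def squared_mixture_density(x, weights, means, variances, z):
    c = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return c * c / z  # non-negative even though one weight is negative

# Subtractive mixture: the second component is subtracted from the first.
weights, means, variances = [1.0, -0.6], [0.0, 0.0], [1.0, 0.25]
z = squared_mixture_normalizer(weights, means, variances)

# Numerical sanity check of the closed-form normalizer (Riemann sum on [-12, 12]).
step = 0.01
total_mass = step * sum(
    squared_mixture_density(-12.0 + k * step, weights, means, variances, z)
    for k in range(2401)
)
```

With these weights the unsquared mixture is negative near the origin, yet the squared, normalized function is a valid density, which is the kind of subtractive behaviour a monotonic (positive-weight) mixture cannot represent directly.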

[LG-9] Embedding Ordinality to Binary Loss Function for Improving Solar Flare Forecasting

Link: https://arxiv.org/abs/2408.11768
Authors: Chetraj Pandey, Anli Ji, Jinsu Hong, Rafal A. Angryk, Berkay Aydin
Keywords: intrinsic ordinal flare, skill score, True Skill Score, Heidke Skill Score, ordinal flare characteristics
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures. Accepted for publication at the DSAA 2024 conference. arXiv admin note: substantial text overlap with arXiv:2406.11054

Abstract:In this paper, we propose a novel loss function aimed at optimizing the binary flare prediction problem by embedding the intrinsic ordinal flare characteristics into the binary cross-entropy (BCE) loss function. This modification is intended to provide the model with better guidance based on the ordinal characteristics of the data and improve the overall performance of the models. For our experiments, we employ a ResNet34-based model with transfer learning to predict $\geq$M-class flares by utilizing the shape-based features of magnetograms of active region (AR) patches spanning from $-90^\circ$ to $+90^\circ$ of solar longitude as our input data. We use a composite skill score (CSS) as our evaluation metric, which is calculated as the geometric mean of the True Skill Score (TSS) and the Heidke Skill Score (HSS) to rank and compare our models’ performance. The primary contributions of this work are as follows: (i) We introduce a novel approach to encode ordinality into a binary loss function showing an application to solar flare prediction, (ii) We enhance solar flare forecasting by enabling flare predictions for each AR across the entire solar disk, without any longitudinal restrictions, and evaluate and compare performance, and (iii) Our candidate model, optimized with the proposed loss function, shows an improvement of $\sim$7%, $\sim$4%, and $\sim$3% for AR patches within $\pm 30^\circ$, $\pm 60^\circ$, and $\pm 90^\circ$ of solar longitude, respectively, in terms of CSS, when compared with standard BCE. Additionally, we demonstrate the ability to issue flare forecasts for ARs in near-limb regions (regions between $\pm 60^\circ$ and $\pm 90^\circ$) with a CSS=0.34 (TSS=0.50 and HSS=0.23), expanding the scope of AR-based models for solar flare prediction. This advances the reliability of solar flare forecasts, leading to more effective prediction capabilities.
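
The evaluation metric above is easy to reproduce from a binary confusion matrix. The sketch below uses the standard definitions of TSS and HSS (the two-class Heidke form) and takes CSS as their geometric mean, as the abstract states; the function names are illustrative, and rejecting a negative product is an assumption made here, since a geometric mean is not defined in that case.

```python
import math

def true_skill_score(tp, fp, fn, tn):
    """TSS = recall - false alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)

def heidke_skill_score(tp, fp, fn, tn):
    """Two-class Heidke Skill Score (skill relative to random chance)."""
    numerator = 2 * (tp * tn - fp * fn)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator

def composite_skill_score(tss, hss):
    """Geometric mean of TSS and HSS (assumes their product is non-negative)."""
    if tss * hss < 0:
        raise ValueError("geometric mean undefined for a negative product")
    return math.sqrt(tss * hss)

# Near-limb figures quoted in the abstract: TSS=0.50, HSS=0.23.
near_limb_css = composite_skill_score(0.50, 0.23)
```

Plugging in the quoted near-limb values gives $\sqrt{0.50 \times 0.23} \approx 0.34$, which agrees with the CSS reported in the abstract.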

[LG-10] Mixed Sparsity Training: Achieving $4\times$ FLOP Reduction for Transformer Pretraining

Link: https://arxiv.org/abs/2408.11746
Authors: Pihe Hu, Shaolong Li, Longbo Huang
Keywords: substantial computational demands, Large language models, made significant strides, Large language, Floating Point Operations
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about 75% of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
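
The two headline figures in the abstract, a reduction of about 75% of FLOPs and a $4\times$ reduction, are the same statement. Writing $r$ for the fraction of FLOPs removed (a symbol introduced here for illustration):

$$
\frac{F_{\text{dense}}}{F_{\text{sparse}}} = \frac{F_{\text{dense}}}{(1-r)\,F_{\text{dense}}} = \frac{1}{1-r}, \qquad r = 0.75 \;\Rightarrow\; \frac{1}{1-0.75} = 4 .
$$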

[LG-11] MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Link: https://arxiv.org/abs/2408.11743
Authors: Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh
Keywords: Large Language Models, Large Language, efficient GPU deployment, Language Models, machine learning applications
Categories: Machine Learning (cs.LG)

Abstract:As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains open whether speedups are achievable also in *batched* settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN’s near-optimal performance on individual LLM layers across different scenarios can also lead to end-to-end LLM inference speedups (of up to $2.8\times$) when integrated with the popular vLLM serving engine. Finally, MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
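
For readers unfamiliar with 4-bit weight quantization, the following sketch shows the generic idea of storing weights as packed 4-bit codes plus a scale and expanding them again before use. It is a plain symmetric round-to-nearest scheme written for illustration; it is not MARLIN's GPU kernel or its quantization format, and all function names are hypothetical.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization of one weight group: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0                          # codes live in [-8, 7]
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def pack_nibbles(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]                        # pad to an even count
    packed = bytearray()
    for low, high in zip(codes[0::2], codes[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)

def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles: recover `count` signed 4-bit codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.7, -0.33, 0.12, 0.0, -0.7, 0.26]
codes, scale = quantize_4bit(weights)
packed = pack_nibbles(codes)
restored = dequantize(unpack_nibbles(packed, len(weights)), scale)
```

Six weights occupy three bytes here. Assuming a 16-bit baseline for the uncompressed weights, that is a quarter of the memory traffic, which is where a ceiling of roughly $4\times$ on the speedup of a memory-bound kernel comes from.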

[LG-12] Iterative Object Count Optimization for Text-to-image Diffusion Models

Link: https://arxiv.org/abs/2408.11721
Authors: Oz Zafar, Lior Wolf, Idan Schwartz
Keywords: accurately generating, counting, counting model, models, objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Pre-print

Abstract:We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object's potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can consider non-derivable counting techniques based on detection models, (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods, and (iii) the optimized counting token can be reused to generate accurate images without additional optimization. We evaluate the generation of various objects and show significant improvements in accuracy. The project page is available at this https URL.

[LG-13] On Learnable Parameters of Optimal and Suboptimal Deep Learning Models

Link: https://arxiv.org/abs/2408.11720
Authors: Ziwei Zheng, Huizhi Liang, Vaclav Snasel, Vito Latora, Panos Pardalos, Giuseppe Nicosia, Varun Ojha
Keywords: deep learning models, learning models, deep learning, node interaction, scrutinize the structural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We scrutinize the structural and operational aspects of deep learning models, particularly focusing on the nuances of learnable parameters (weight) statistics, distribution, node interaction, and visualization. By establishing correlations between variance in weight patterns and overall network performance, we investigate the varying (optimal and suboptimal) performances of various deep-learning models. Our empirical analysis extends across widely recognized datasets such as MNIST, Fashion-MNIST, and CIFAR-10, and various deep learning models such as deep neural networks (DNNs), convolutional neural networks (CNNs), and vision transformer (ViT), enabling us to pinpoint characteristics of learnable parameters that correlate with successful networks. Through extensive experiments on the diverse architectures of deep learning models, we shed light on the critical factors that influence the functionality and efficiency of DNNs. Our findings reveal that successful networks, irrespective of datasets or models, are invariably similar to other successful networks in their converged weights statistics and distribution, while poor-performing networks vary in their weights. In addition, our research shows that the learnable parameters of widely varied deep learning models such as DNN, CNN, and ViT exhibit similar learning characteristics.

[LG-14] First line of defense: A robust first layer mitigates adversarial attacks

Link: https://arxiv.org/abs/2408.11680
Authors: Janani Suresh, Nancy Nayak, Sheetal Kalyani
Keywords: significant computational overhead, incurs significant computational, designing inherently robust, incurs significant, computational overhead
Categories: Machine Learning (cs.LG)

Abstract:Adversarial training (AT) incurs significant computational overhead, leading to growing interest in designing inherently robust architectures. We demonstrate that a carefully designed first layer of the neural network can serve as an implicit adversarial noise filter (ANF). This filter is created using a combination of large kernel size, increased convolution filters, and a maxpool operation. We show that integrating this filter as the first layer in architectures such as ResNet, VGG, and EfficientNet results in adversarially robust networks. Our approach achieves higher adversarial accuracies than existing natively robust architectures without AT and is competitive with adversarial-trained architectures across a wide range of datasets. Supporting our findings, we show that (a) the decision regions for our method have better margins, (b) the visualized loss surfaces are smoother, (c) the modified peak signal-to-noise ratio (mPSNR) values at the output of the ANF are higher, (d) high-frequency components are more attenuated, and (e) architectures incorporating ANF exhibit better denoising in Gaussian noise compared to baseline architectures. Code for all our experiments is available at this https URL.

[LG-15] Optimizing Federated Graph Learning with Inherent Structural Knowledge and Dual-Densely Connected GNNs

Link: https://arxiv.org/abs/2408.11662
Authors: Longwen Wang, Jianchun Liu, Zhi Liu, Jinyang Huang
Keywords: Federated Graph Learning, Graph Neural Networks, powerful Graph Neural, Neural Networks, collaboratively train powerful
Categories: Machine Learning (cs.LG)

Abstract:Federated Graph Learning (FGL) is an emerging technology that enables clients to collaboratively train powerful Graph Neural Networks (GNNs) in a distributed manner without exposing their private data. Nevertheless, FGL still faces the challenge of the severe non-Independent and Identically Distributed (non-IID) nature of graphs, which possess diverse node and edge structures, especially across varied domains. Thus, exploring the knowledge inherent in these structures becomes significantly crucial. Existing methods, however, either overlook the inherent structural knowledge in graph data or capture it at the cost of significantly increased resource demands (e.g., FLOPs and communication bandwidth), which can be detrimental to distributed paradigms. Inspired by this, we propose FedDense, a novel FGL framework that optimizes the utilization efficiency of inherent structural knowledge. To better acquire knowledge of diverse and underexploited structures, FedDense first explicitly encodes the structural knowledge inherent within graph data itself alongside node features. Besides, FedDense introduces a Dual-Densely Connected (DDC) GNN architecture that exploits the multi-scale (i.e., one-hop to multi-hop) feature and structure insights embedded in the aggregated feature maps at each layer. In addition to the exploitation of inherent structures, we consider resource limitations in FGL, devising exceedingly narrow layers atop the DDC architecture and adopting a selective parameter sharing strategy to reduce resource costs substantially. We conduct extensive experiments using 15 datasets across 4 different domains, demonstrating that FedDense consistently surpasses baselines by a large margin in training performance, while demanding minimal resources.

[LG-16] Macformer: Transformer with Random Maclaurin Feature Attention

Link: https://arxiv.org/abs/2408.11656
Authors: Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang
Keywords: adopts random fourier, random Maclaurin features, random fourier feature, Maclaurin Feature Attention, Random feature attention
Categories: Machine Learning (cs.LG)

Abstract:Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by RFA, we propose Macformer, a Transformer architecture that employs random Maclaurin features (RMF) to approximate various dot-product kernels, thereby accelerating attention computations for long sequences. Macformer consists of Random Maclaurin Feature Attention (RMFA) and pre-post Scaling Batch Normalization (ppSBN): the former is an unbiased approximation for dot-product kernelized attention, and the latter is a two-stage regularization mechanism that provides an error guarantee for RMFA. We conducted toy experiments to demonstrate the efficiency of RMFA and ppSBN, and experiments on the Long Range Arena (LRA) benchmark to validate the acceleration and accuracy of Macformer with different dot-product kernels. Experimental results of Macformer are consistent with our theoretical analysis.
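
The sketch below illustrates what a random Maclaurin feature map does for one dot-product kernel, $k(x, y) = \exp(x \cdot y)$, whose Maclaurin coefficients are $a_n = 1/n!$. It follows the generic random Maclaurin construction (random degree, products of Rademacher projections) rather than Macformer's RMFA layer, and the names, the feature count, and the example vectors are assumptions chosen for illustration.

```python
import math
import random

def sample_maclaurin_map(dim, num_features, rng):
    """Draw the randomness of the map: one list of Rademacher vectors per feature."""
    draws = []
    for _ in range(num_features):
        degree = 0
        while rng.random() < 0.5:       # P(degree = n) = 2^-(n+1)
            degree += 1
        draws.append([[rng.choice((-1.0, 1.0)) for _ in range(dim)]
                      for _ in range(degree)])
    return draws

def maclaurin_features(x, draws, coeff):
    """Map x to features whose inner products estimate sum_n coeff(n) * (x.y)^n."""
    features = []
    for vectors in draws:
        n = len(vectors)
        value = math.sqrt(coeff(n) * 2.0 ** (n + 1))   # importance weight for degree n
        for w in vectors:
            value *= sum(wi * xi for wi, xi in zip(w, x))
        features.append(value / math.sqrt(len(draws)))
    return features

def exp_coeff(n):
    return 1.0 / math.factorial(n)      # Maclaurin coefficients of exp

rng = random.Random(0)
draws = sample_maclaurin_map(dim=4, num_features=20000, rng=rng)
x = [0.3, -0.2, 0.1, 0.4]
y = [0.1, 0.3, -0.2, 0.2]
phi_x = maclaurin_features(x, draws, exp_coeff)
phi_y = maclaurin_features(y, draws, exp_coeff)
approx = sum(a * b for a, b in zip(phi_x, phi_y))
exact = math.exp(sum(a * b for a, b in zip(x, y)))
```

Because the kernel is replaced by an inner product of explicit features, attention of the form $\phi(Q)\,(\phi(K)^\top V)$ can be evaluated in time linear in the sequence length, which is the source of the acceleration the abstract refers to. Swapping `exp_coeff` for another non-negative coefficient sequence gives a different dot-product kernel.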

[LG-17] Optimizing Interpretable Decision Tree Policies for Reinforcement Learning

Link: https://arxiv.org/abs/2408.11632
Authors: Daniël Vos, Sicco Verwer
Keywords: made tremendous progress, decision tree, leveraging deep learning, techniques leveraging deep, recent years
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning techniques leveraging deep learning have made tremendous progress in recent years. However, the complexity of neural networks prevents practitioners from understanding their behavior. Decision trees have gained increased attention in supervised learning for their inherent interpretability, enabling modelers to understand the exact prediction process after learning. This paper considers the problem of optimizing interpretable decision tree policies to replace neural networks in reinforcement learning settings. Previous works have relaxed the tree structure, restricted to optimizing only tree leaves, or applied imitation learning techniques to approximately copy the behavior of a neural network policy with a decision tree. We propose the Decision Tree Policy Optimization (DTPO) algorithm that directly optimizes the complete decision tree using policy gradients. Our technique uses established decision tree heuristics for regression to perform policy optimization. We empirically show that DTPO is a competitive algorithm compared to imitation learning algorithms for optimizing decision tree policies in reinforcement learning.

[LG-18] A Markovian Model for Learning-to-Optimize

Link: https://arxiv.org/abs/2408.11629
Authors: Michael Sucker, Peter Ochs
Keywords: stochastic iterative algorithms, case of optimization, actual convergence rate, convergence rate, stochastic algorithms based
Categories: Machine Learning (cs.LG); Probability (math.PR)

Abstract:We present a probabilistic model for stochastic iterative algorithms with the use case of optimization algorithms in mind. Based on this model, we present PAC-Bayesian generalization bounds for functions that are defined on the trajectory of the learned algorithm, for example, the expected (non-asymptotic) convergence rate and the expected time to reach the stopping criterion. Thus, not only does this model allow for learning stochastic algorithms based on their empirical performance, it also yields results about their actual convergence rate and their actual convergence time. We stress that, since the model is valid in a more general setting than learning-to-optimize, it is of interest for other fields of application, too. Finally, we conduct five practically relevant experiments, showing the validity of our claims.

[LG-19] End-to-End Cost-Effective Incentive Recommendation under Budget Constraint with Uplift Modeling RECSYS2024

Link: https://arxiv.org/abs/2408.11623
Authors: Zexu Sun, Hao Yang, Dugang Liu, Yunpeng Weng, Xing Tang, Xiuqiang He
Keywords: enhance user engagement, modern online platforms, increase platform revenue, essential factors, factors that enhance
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by RecSys 2024

Abstract:In modern online platforms, incentives are essential factors that enhance user engagement and increase platform revenue. Over recent years, uplift modeling has been introduced as a strategic approach to assign incentives to individual customers. Especially in many real-world applications, online platforms can only incentivize customers with specific budget constraints. This problem can be reformulated as the multi-choice knapsack problem. This optimization aims to select the optimal incentive for each customer to maximize the return on investment. Recent works in this field frequently tackle the budget allocation problem using a two-stage approach. However, this solution is confronted with the following challenges: (1) The causal inference methods often ignore the domain knowledge in online marketing, where the expected response curve of a customer should be monotonic and smooth as the incentive increases. (2) An optimality gap between the two stages results in inferior sub-optimal allocation performance due to the loss of the incentive recommendation information for the uplift prediction under the limited budget constraint. To address these challenges, we propose a novel End-to-End Cost-Effective Incentive Recommendation (E3IR) model under budget constraints. Specifically, our methods consist of two modules, i.e., the uplift prediction module and the differentiable allocation module. In the uplift prediction module, we construct prediction heads to capture the incremental improvement between adjacent treatments with the marketing domain constraints (i.e., monotonic and smooth). We incorporate integer linear programming (ILP) as a differentiable layer input in the allocation module. Furthermore, we conduct extensive experiments on public and real product datasets, demonstrating that our E3IR improves allocation performance compared to existing two-stage approaches.
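
The multi-choice knapsack formulation mentioned above can be illustrated with a small exact solver: each customer receives exactly one incentive level, each level has a cost and a predicted uplift, and the total cost must stay within the budget. The dynamic program below is a swapped-in textbook method for tiny instances with integer costs, not the paper's differentiable ILP layer; the function name and the example numbers are made up, and maximizing total uplift stands in for the return-on-investment objective.

```python
def allocate_incentives(uplift, cost, budget):
    """Pick one incentive level per customer, maximizing total uplift within budget.

    uplift[i][k] and cost[i][k] describe level k for customer i; costs are
    non-negative integers. Returns (total_uplift, chosen_level_per_customer).
    """
    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * budget       # best[b]: max uplift spending exactly b
    backpointers = []
    for uplift_row, cost_row in zip(uplift, cost):
        new_best = [neg_inf] * (budget + 1)
        back = [None] * (budget + 1)
        for spent in range(budget + 1):
            if best[spent] == neg_inf:
                continue
            for level, (u, c) in enumerate(zip(uplift_row, cost_row)):
                total = spent + c
                if total <= budget and best[spent] + u > new_best[total]:
                    new_best[total] = best[spent] + u
                    back[total] = (spent, level)
        best = new_best
        backpointers.append(back)
    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        raise ValueError("no feasible allocation within the budget")
    total_uplift, levels, spent = best[end], [], end
    for back in reversed(backpointers):     # walk the choices backwards
        spent, level = back[spent]
        levels.append(level)
    levels.reverse()
    return total_uplift, levels

# Three customers, three levels each (level 0 = no incentive).
uplift = [[0.0, 0.3, 0.4], [0.0, 0.1, 0.5], [0.0, 0.2, 0.25]]
cost = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
total_uplift, levels = allocate_incentives(uplift, cost, budget=3)
```

In this toy instance the best allocation gives the first customer the small incentive, the second customer the large one, and the third customer nothing. A two-stage pipeline would first predict the `uplift` table and then run a solver like this; the abstract's argument is that optimizing the two stages separately leaves an optimality gap.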

[LG-20] Annealed Sinkhorn for Optimal Transport: convergence, regularization path and debiasing

Link: https://arxiv.org/abs/2408.11620
Authors: Lénaïc Chizat
Keywords: large-scale optimal transport, Annealed Sinkhorn, beta, Annealed Sinkhorn algorithm, solve large-scale optimal
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Sinkhorn’s algorithm is a method of choice to solve large-scale optimal transport (OT) problems. In this context, it involves an inverse temperature parameter $\beta$ that determines the speed-accuracy trade-off. To improve this trade-off, practitioners often use a variant of this algorithm, Annealed Sinkhorn, that uses a nondecreasing sequence $(\beta_t)_{t\in\mathbb{N}}$ where $t$ is the iteration count. However, except for the schedule $\beta_t=\Theta(\log t)$, which is impractically slow, it is not known whether this variant is guaranteed to actually solve OT. Our first contribution answers this question: we show that a concave annealing schedule asymptotically solves OT if and only if $\beta_t\to+\infty$ and $\beta_t-\beta_{t-1}\to 0$. The proof is based on an equivalence with Online Mirror Descent and further suggests that the iterates of Annealed Sinkhorn follow the solutions of a sequence of relaxed, entropic OT problems, the regularization path. An analysis of this path reveals that, in addition to the well-known “entropic” error in $\Theta(\beta_t^{-1})$, the annealing procedure induces a “relaxation” error in $\Theta(\beta_t-\beta_{t-1})$. The best error trade-off is achieved with the schedule $\beta_t=\Theta(\sqrt{t})$ which, albeit slow, is a universal limitation of this method. Going beyond this limitation, we propose a simple modification of Annealed Sinkhorn that reduces the relaxation error, and therefore enables faster annealing schedules. In toy experiments, we observe the effectiveness of our Debiased Annealed Sinkhorn’s algorithm: a single run of this algorithm spans the whole speed-accuracy Pareto front of the standard Sinkhorn’s algorithm.
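
A minimal sketch of Sinkhorn iterations with a time-varying inverse temperature may help fix ideas. It uses a common log-domain formulation with dual potentials `f` and `g` and re-evaluates the kernel with the current value of $\beta_t$ at every iteration; the paper's exact parametrization may differ, and the debiased variant is not shown. The function name, the example marginals, and the square-root schedule used in the call are illustrative assumptions.

```python
import math

def annealed_sinkhorn(p, q, cost, num_iters, schedule):
    """Alternate marginal projections while beta follows schedule(t)."""
    m, n = len(p), len(q)
    f = [0.0] * m                       # dual potential for the source marginal p
    g = [0.0] * n                       # dual potential for the target marginal q
    beta = schedule(1)
    for t in range(1, num_iters + 1):
        beta = schedule(t)              # inverse temperature at iteration t
        f = [-math.log(sum(q[j] * math.exp(beta * (g[j] - cost[i][j]))
                           for j in range(n))) / beta for i in range(m)]
        g = [-math.log(sum(p[i] * math.exp(beta * (f[i] - cost[i][j]))
                           for i in range(m))) / beta for j in range(n)]
    # Transport plan implied by the final potentials and the final beta.
    return [[p[i] * q[j] * math.exp(beta * (f[i] + g[j] - cost[i][j]))
             for j in range(n)] for i in range(m)]

p = [0.5, 0.5]
q = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = annealed_sinkhorn(p, q, cost, num_iters=400, schedule=lambda t: math.sqrt(t))
column_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

Since the last update is the projection onto the target marginal, the column sums of the returned plan match `q` up to rounding, while the row sums match `p` only approximately; that residual mismatch is one way to see the kind of relaxation error the abstract describes. Passing a constant `schedule` recovers ordinary Sinkhorn at a fixed $\beta$.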

[LG-21] Data-driven Modeling of Combined Sewer Systems for Urban Sustainability: An Empirical Evaluation

Link: https://arxiv.org/abs/2408.11619
Authors: Vipin Singh, Tianheng Ling, Teodor Chiaburu, Felix Biessmann
Keywords: Climate change poses, Climate change, poses complex challenges, change poses complex, Combined Sewer Systems
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, accepted at 47th German Conference on Artificial Intelligence, Wuerzburg 2024

Abstract:Climate change poses complex challenges, with extreme weather events becoming increasingly frequent and difficult to model. Examples include the dynamics of Combined Sewer Systems (CSS). Overburdened CSS during heavy rainfall will overflow untreated wastewater into surface water bodies. Classical approaches to modeling the impact of extreme rainfall events rely on physical simulations, which are particularly challenging to create for large urban infrastructures. Deep Learning (DL) models offer a cost-effective alternative for modeling the complex dynamics of sewer systems. In this study, we present a comprehensive empirical evaluation of several state-of-the-art DL time series models for predicting sewer system dynamics in a large urban infrastructure, utilizing three years of measurement data. We especially investigate the potential of DL models to maintain predictive precision during network outages by comparing global models, which have access to all variables within the sewer system, and local models, which are limited to data from a restricted set of local sensors. Our findings demonstrate that DL models can accurately predict the dynamics of sewer system load, even under network outage conditions. These results suggest that DL models can effectively aid in balancing the load redistribution in CSS, thereby enhancing the sustainability and resilience of urban infrastructures.

[LG-22] DTN: Deep Multiple Task-specific Feature Interactions Network for Multi-Task Recommendation

Link: https://arxiv.org/abs/2408.11611
Authors: Yaowen Bi, Yuteng Lian, Jie Cui, Jun Liu, Peijian Wang, Guanghui Li, Xuejun Chen, Jinglin Zhao, Hao Wen, Jing Zhang, Zhaoqi Zhang, Wenzhuo Song, Yang Sun, Weiwei Zhang, Mingchen Cai, Guanxing Zhang
Keywords: Neural-based multi-task learning, Neural-based multi-task, MTL, MTL models, DTN
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Neural-based multi-task learning (MTL) has been successfully applied to many recommendation applications. However, these MTL models (e.g., MMoE, PLE) did not consider feature interaction during the optimization, which is crucial for capturing complex high-order features and has been widely used in ranking models for real-world recommender systems. Moreover, through feature importance analysis across various tasks in MTL, we have observed an interesting divergence phenomenon: the same feature can have significantly different importance across different tasks in MTL. To address these issues, we propose Deep Multiple Task-specific Feature Interactions Network (DTN) with a novel model structure design. DTN introduces multiple diversified task-specific feature interaction methods and a task-sensitive network in MTL networks, enabling the model to learn task-specific diversified feature interaction representations, which improves the efficiency of joint representation learning in a general setup. We applied DTN to our company’s real-world E-commerce recommendation dataset, which consisted of over 6.3 billion samples; the results demonstrated that DTN significantly outperformed state-of-the-art MTL models. Moreover, during online evaluation of DTN in a large-scale E-commerce recommender system, we observed a 3.28% increase in clicks, a 3.10% increase in orders and a 2.70% increase in GMV (Gross Merchandise Value) compared to the state-of-the-art MTL models. Finally, extensive offline experiments conducted on public benchmark datasets demonstrate that DTN can be applied to various scenarios beyond recommendations, enhancing the performance of ranking models.

[LG-23] Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation

Link: https://arxiv.org/abs/2408.11607
Authors: Patrick Benjamin, Alessandro Abate
Keywords: Recent works, Munchausen Online Mirror, Online Mirror Descent, non-episodic run, Mean-Field Games
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Recent works have provided algorithms by which decentralised agents, which may be connected via a communication network, can learn equilibria in Mean-Field Games from a single, non-episodic run of the empirical system. However, these algorithms are given for tabular settings: this computationally limits the size of players’ observation space, meaning that the algorithms are not able to handle anything but small state spaces, nor to generalise beyond policies depending on the ego player’s state to so-called ‘population-dependent’ policies. We address this limitation by introducing function approximation to the existing setting, drawing on the Munchausen Online Mirror Descent method that has previously been employed only in finite-horizon, episodic, centralised settings. While this permits us to include the population’s mean-field distribution in the observation for each player’s policy, it is arguably unrealistic to assume that decentralised agents would have access to this global information: we therefore additionally provide new algorithms that allow agents to estimate the global empirical distribution based on a local neighbourhood, and to improve this estimate via communication over a given network. Our experiments showcase how the communication network allows decentralised agents to estimate the mean-field distribution for population-dependent policies, and that exchanging policy information helps networked agents to outperform both independent and even centralised agents in function-approximation settings, by an even greater margin than in tabular settings.

[LG-24] Improving Calibration by Relating Focal Loss, Temperature Scaling, and Properness ECAI2024

Link: https://arxiv.org/abs/2408.11598
Authors: Viacheslav Komisarenko, Meelis Kull
Keywords: produce class probabilities, test data, data, focal, focal loss
Categories: Machine Learning (cs.LG)
Comments: Accepted to ECAI 2024

Abstract:Proper losses such as cross-entropy incentivize classifiers to produce class probabilities that are well-calibrated on the training data. Due to the generalization gap, these classifiers tend to become overconfident on the test data, mandating calibration methods such as temperature scaling. The focal loss is not proper, but training with it has been shown to often result in classifiers that are better calibrated on test data. Our first contribution is a simple explanation about why focal loss training often leads to better calibration than cross-entropy training. For this, we prove that focal loss can be decomposed into a confidence-raising transformation and a proper loss. This is why focal loss pushes the model to provide under-confident predictions on the training data, resulting in being better calibrated on the test data, due to the generalization gap. Secondly, we reveal a strong connection between temperature scaling and focal loss through its confidence-raising transformation, which we refer to as the focal calibration map. Thirdly, we propose focal temperature scaling - a new post-hoc calibration method combining focal calibration and temperature scaling. Our experiments on three image classification datasets demonstrate that focal temperature scaling outperforms standard temperature scaling.
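
The two ingredients the abstract relates can be sketched as follows, assuming the standard textbook definitions: focal loss as -(1 - p)^gamma * log(p) for the true-class probability p, and temperature scaling as softmax(z / T). The function names and example logits are illustrative and not taken from the paper, and the paper's focal calibration map is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of the logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy for a single example with a hard label."""
    return -math.log(probs[label])

def focal_loss(probs, label, gamma=2.0):
    """Focal loss: cross-entropy down-weighted by (1 - p)^gamma."""
    p = probs[label]
    return -((1.0 - p) ** gamma) * math.log(p)

# Illustrative logits (an assumption, not data from the paper).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)                      # T = 1
softened = softmax(logits, temperature=2.0)  # T > 1 lowers confidence
```

With gamma = 0 the focal loss reduces to cross-entropy, and a temperature above 1 flattens the predicted distribution, which is the post-hoc remedy for overconfidence mentioned in the abstract.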

[LG-25] Calibrating the Predictions for Top-N Recommendations RECSYS2024

Link: https://arxiv.org/abs/2408.11596
Authors: Masahiro Sato
Keywords-EN: Well-calibrated predictions, top-N items, top-N, preferences are essential, items
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes: accepted at RecSys 2024

Abstract:Well-calibrated predictions of user preferences are essential for many applications. Since recommender systems typically select the top-N items for users, calibration for those top-N items, rather than for all items, is important. We show that previous calibration methods result in miscalibrated predictions for the top-N items, despite their excellent calibration performance when evaluated on all items. In this work, we address the miscalibration in the top-N recommended items. We first define evaluation metrics for this objective and then propose a generic method to optimize calibration models focusing on the top-N items. It groups the top-N items by their ranks and optimizes distinct calibration models for each group with rank-dependent training weights. We verify the effectiveness of the proposed method for both explicit and implicit feedback datasets, using diverse classes of recommender models.
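
To illustrate why calibration over all items and calibration over the top-N items can differ, here is a minimal sketch that computes a standard binned expected calibration error (ECE) restricted to each user's N highest-scored items. This is an assumption for illustration only: the paper defines its own evaluation metrics, which are not reproduced here, and the names `top_n_ece` and `user_items` as well as the toy data are hypothetical.

```python
def expected_calibration_error(preds, labels, n_bins=10):
    """Binned ECE: weighted mean of |mean prediction - mean label| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        mean_y = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_p - mean_y)
    return ece

def top_n_ece(user_items, n, n_bins=10):
    """ECE over each user's n highest-scored items only.

    user_items maps a user id to a list of (predicted_prob, label) pairs.
    """
    preds, labels = [], []
    for items in user_items.values():
        top = sorted(items, key=lambda pair: pair[0], reverse=True)[:n]
        preds.extend(p for p, _ in top)
        labels.extend(y for _, y in top)
    return expected_calibration_error(preds, labels, n_bins)

# Toy data (hypothetical): two users, four scored items each.
user_items = {
    "u1": [(0.9, 1), (0.8, 0), (0.2, 0), (0.1, 0)],
    "u2": [(0.9, 0), (0.7, 1), (0.3, 0), (0.1, 0)],
}
top2 = top_n_ece(user_items, n=2)
```

Because the top-N subset is selected by score, it is dominated by high predictions, so its calibration error need not match the error measured over all items.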

[LG-26] Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control

Link: https://arxiv.org/abs/2408.11561
Authors: Muhammad Aqeel,Shakiba Sharifi,Marco Cristani,Francesco Setti
Keywords-EN: Iterative Refinement Process, Refinement Process, introduces the Iterative, Iterative Refinement, industrial quality control
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:This study introduces the Iterative Refinement Process (IRP), a robust anomaly detection methodology designed for high-stakes industrial quality control. The IRP enhances defect detection accuracy through a cyclic data refinement strategy, iteratively removing misleading data points to improve model performance and robustness. We validate the IRP’s effectiveness using two benchmark datasets, Kolektor SDD2 (KSDD2) and MVTec AD, covering a wide range of industrial products and defect types. Our experimental results demonstrate that the IRP consistently outperforms traditional anomaly detection models, particularly in environments with high noise levels. This study highlights the IRP’s potential to significantly enhance anomaly detection processes in industrial settings, effectively managing the challenges of sparse and noisy data.

[LG-27] Memorization In In-Context Learning

Link: https://arxiv.org/abs/2408.11546
Authors: Shahriar Golchin,Mihai Surdeanu,Steven Bethard,Eduardo Blanco,Ellen Riloff
Keywords-EN: large language models, In-context learning, ICL, language models, strategy for improving
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: v1

Abstract:In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind these performance improvements remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers a hidden phenomenon – memorization – at the core of ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

[LG-28] A Survey of Embodied Learning for Object-Centric Robotic Manipulation

Link: https://arxiv.org/abs/2408.11537
Authors: Ying Zheng,Lei Yao,Yuejiao Su,Yi Zhang,Yi Wang,Sicheng Zhao,Yiyi Zhang,Lap-Pui Chau
Keywords-EN: rapidly developing, developing and challenging, challenging area, object-centric robotic manipulation, Embodied
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Embodied learning for object-centric robotic manipulation is a rapidly developing and challenging area in embodied AI. It is crucial for advancing next-generation intelligent robots and has garnered significant interest recently. Unlike data-driven machine learning methods, embodied learning focuses on robot learning through physical interaction with the environment and perceptual feedback, making it especially suitable for robotic manipulation. In this paper, we provide a comprehensive survey of the latest advancements in this field and categorize the existing work into three main branches: 1) Embodied perceptual learning, which aims to predict object pose and affordance through various data representations; 2) Embodied policy learning, which focuses on generating optimal robotic decisions using methods such as reinforcement learning and imitation learning; 3) Embodied task-oriented learning, designed to optimize the robot’s performance based on the characteristics of different tasks in object grasping and manipulation. In addition, we offer an overview and discussion of public datasets, evaluation metrics, representative applications, current challenges, and potential future research directions. A project associated with this survey has been established at this https URL.

[LG-29] The Vizier Gaussian Process Bandit Algorithm

Link: https://arxiv.org/abs/2408.11527
Authors: Xingyou Song,Qiuyi Zhang,Chansoo Lee,Emily Fertig,Tzu-Kuo Huang,Lior Belenki,Greg Kochanski,Setareh Ariafar,Srinivas Vasudevan,Sagi Perel,Daniel Golovin
Keywords-EN: accelerated numerous research, Bayesian optimization, success of Bayesian, Google Vizier, Open Source Vizier
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Notes: Google DeepMind Technical Report. Code can be found in this https URL

Abstract:Google Vizier has performed millions of optimizations and accelerated numerous research and production systems at Google, demonstrating the success of Bayesian optimization as a large-scale service. Over multiple years, its algorithm has been improved considerably, through the collective experiences of numerous research efforts and user feedback. In this technical report, we discuss the implementation details and design choices of the current default algorithm provided by Open Source Vizier. Our experiments on standardized benchmarks reveal its robustness and versatility against well-established industry baselines on multiple practical modes.

[LG-30] Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Link: https://arxiv.org/abs/2408.11513
Authors: Washim Uddin Mondal,Vaneet Aggarwal
Keywords-EN: Markov Decision Process, Constrained Markov Decision, Decision Process, Constrained Markov, Markov Decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement of the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
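
The regimes quoted in the abstract can be checked with a short case analysis. This is a sketch that assumes the bound reads $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-1/3}\})$ (the braces around the arguments of the minimum are a reconstruction of the extracted text) and that $\epsilon_{\mathrm{bias}}$ is treated as a fixed constant:

```latex
\[
\tilde{\mathcal{O}}\!\left(\epsilon^{-2}\min\{\epsilon^{-2},\,\epsilon_{\mathrm{bias}}^{-1/3}\}\right)
=
\begin{cases}
\tilde{\mathcal{O}}\!\left(\epsilon^{-4}\right),
  & \epsilon_{\mathrm{bias}} = 0 \ \text{or}\ \epsilon \ge \epsilon_{\mathrm{bias}}^{1/6},\\[4pt]
\tilde{\mathcal{O}}\!\left(\epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-1/3}\right)
  = \tilde{\mathcal{O}}\!\left(\epsilon^{-2}\right),
  & 0 < \epsilon < \epsilon_{\mathrm{bias}}^{1/6}.
\end{cases}
\]
```

The threshold follows from $\epsilon^{-2} > \epsilon_{\mathrm{bias}}^{-1/3} \iff \epsilon < \epsilon_{\mathrm{bias}}^{1/6}$, so below it the minimum is the constant $\epsilon_{\mathrm{bias}}^{-1/3}$; for $\epsilon_{\mathrm{bias}} = 0$ that term is unbounded and the minimum is always $\epsilon^{-2}$.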

[LG-31] Slicing Input Features to Accelerate Deep Learning: A Case Study with Graph Neural Networks

Link: https://arxiv.org/abs/2408.11500
Authors: Zhengjia Xu,Dingyang Lyu,Jinghui Zhang
Keywords-EN: full-batch GNN training, full-batch GNN, GNN training, single GPU memory, GPU
Categories: Machine Learning (cs.LG)
Notes:

Abstract:As graphs grow larger, full-batch GNN training becomes difficult to fit in a single GPU's memory. Therefore, to enhance the scalability of GNN training, some studies have proposed sampling-based mini-batch training and distributed graph learning. However, these methods still have drawbacks, such as performance degradation and heavy communication. This paper introduces SliceGCN, a feature-sliced distributed large-scale graph learning method. SliceGCN slices the node features, with each computing device, i.e., GPU, handling partial features. After each GPU processes its share, partial representations are obtained and concatenated to form complete representations, enabling a single GPU’s memory to handle the entire graph structure. This aims to avoid the accuracy loss typically associated with mini-batch training (due to incomplete graph structures) and to reduce inter-GPU communication during message passing (the forward propagation process of GNNs). To study and mitigate potential accuracy reductions due to slicing features, this paper proposes feature fusion and slice encoding. Experiments were conducted on six node classification datasets, yielding some interesting analytical results. These results indicate that while SliceGCN does not enhance efficiency on smaller datasets, it does improve efficiency on larger datasets. Additionally, we found that SliceGCN and its variants converge better, that feature fusion and slice encoding make training more stable and reduce accuracy fluctuations, and that the design of SliceGCN is potentially parameter-efficient.
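
The slice-then-concatenate idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not SliceGCN itself: the aggregation is a plain adjacency product, each "device" is just a loop iteration with its own weight matrix, and the feature fusion and slice encoding components are omitted. The helper names (`slice_columns`, `sliced_layer`) are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def slice_columns(features, num_slices):
    """Split every node's feature vector into equal contiguous chunks."""
    dim = len(features[0])
    if dim % num_slices != 0:
        raise ValueError("feature dimension must be divisible by num_slices")
    step = dim // num_slices
    return [[row[s:s + step] for row in features] for s in range(0, dim, step)]

def sliced_layer(adj, features, weights):
    """Aggregate each feature slice over the full graph, project it with its
    own weight matrix, then concatenate the partial outputs per node."""
    slices = slice_columns(features, len(weights))
    partials = [matmul(matmul(adj, x), w) for x, w in zip(slices, weights)]
    return [sum((p[i] for p in partials), []) for i in range(len(features))]

# Toy graph: two nodes, fully connected with self-loops (unnormalized).
adj = [[1, 1], [1, 1]]
features = [[1, 2, 3, 4], [5, 6, 7, 8]]
identity = [[1, 0], [0, 1]]
out = sliced_layer(adj, features, [identity, identity])
```

Every slice sees the whole adjacency matrix, so no edges are dropped as they would be with sampled mini-batches; only the feature dimension is partitioned.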

[LG-32] Learning Deep Dissipative Dynamics

Link: https://arxiv.org/abs/2408.11479
Authors: Yuji Okamoto,Ryosuke Kojima
Keywords-EN: challenges strictly guaranteeing, study challenges strictly, neural networks learned, time-series data, study challenges
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
Notes:

Abstract:This study tackles the challenge of strictly guaranteeing the "dissipativity" of a dynamical system represented by neural networks learned from given time-series data. Dissipativity is a crucial indicator for dynamical systems that generalizes stability and input-output stability, known to be valid across various systems including robotics, biological systems, and molecular dynamics. By analytically proving the general solution to the nonlinear Kalman-Yakubovich-Popov (KYP) lemma, which is the necessary and sufficient condition for dissipativity, we propose a differentiable projection that transforms any dynamics represented by neural networks into dissipative ones and a learning method for the transformed dynamics. Utilizing the generality of dissipativity, our method strictly guarantees stability, input-output stability, and energy conservation of trained dynamical systems. Finally, we demonstrate the robustness of our method against out-of-domain input through applications to robotic arms and fluid dynamics. Code is available at this https URL

[LG-33] LAKD-Activation Mapping Distillation Based on Local Learning

Link: https://arxiv.org/abs/2408.11478
Authors: Yaoze Zhang,Yuming Zhang,Yu Zhao,Yue Zhang,Feiyu Zhu
Keywords-EN: fundamental vision models, Knowledge distillation, Attention Knowledge Distillation, knowledge distillation methods, Knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 8 pages, 7 figures

Abstract:Knowledge distillation is widely applied in various fundamental vision models to enhance the performance of compact models. Existing knowledge distillation methods focus on designing different distillation targets to acquire knowledge from teacher models. However, these methods often overlook the efficient utilization of distilled information, crudely coupling different types of information, making it difficult to explain how the knowledge from the teacher network aids the student network in learning. This paper proposes a novel knowledge distillation framework, Local Attention Knowledge Distillation (LAKD), which more efficiently utilizes the distilled information from teacher networks, achieving higher interpretability and competitive performance. The framework establishes an independent interactive training mechanism through a separation-decoupling mechanism and non-directional activation mapping. LAKD decouples the teacher’s features and facilitates progressive interaction training from simple to complex. Specifically, the student network is divided into local modules with independent gradients to decouple the knowledge transferred from the teacher. The non-directional activation mapping helps the student network integrate knowledge from different local modules by learning coarse-grained feature knowledge. We conducted experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, and the results show that our LAKD method significantly outperforms existing methods, consistently achieving state-of-the-art performance across different datasets.

[LG-34] Using Part-based Representations for Explainable Deep Reinforcement Learning

Link: https://arxiv.org/abs/2408.11455
Authors: Manos Kirtas,Konstantinos Tsampazis,Loukia Avramelou,Nikolaos Passalis
Keywords-EN: Utilizing deep learning, holds significant potential, Utilizing deep, models incorporate latent, representations holds significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Utilizing deep learning models to learn part-based representations holds significant potential for interpretable-by-design approaches, as these models incorporate latent causes obtained from feature representations through simple addition. However, training a part-based learning model presents challenges, particularly in enforcing non-negative constraints on the model’s parameters, which can result in training difficulties such as instability and convergence issues. Moreover, applying such approaches in Deep Reinforcement Learning (RL) is even more demanding due to the inherent instabilities that impact many optimization methods. In this paper, we propose a non-negative training approach for actor models in RL, enabling the extraction of part-based representations that enhance interpretability while adhering to non-negative constraints. To this end, we employ a non-negative initialization technique, as well as a modified sign-preserving training method, which can ensure better gradient flow compared to existing approaches. We demonstrate the effectiveness of the proposed approach using the well-known Cartpole benchmark.

[LG-35] DABench: A Benchmark Dataset for Data-Driven Weather Data Assimilation

Link: https://arxiv.org/abs/2408.11438
Authors: Wuxin Wang,Weicheng Ni,Tao Han,Lei Bai,Boheng Duan,Kaijun Ren
Keywords-EN: Large Weather Models, data-driven weather prediction, numerical weather prediction, weather prediction, weather prediction systems
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Notes: 37 pages, 12 figures, 6 tables

Abstract:Recent advancements in deep learning (DL) have led to the development of several Large Weather Models (LWMs) that rival state-of-the-art (SOTA) numerical weather prediction (NWP) systems. Up to now, these models still rely on traditional NWP-generated analysis fields as input and are far from being an autonomous system. While researchers are exploring data-driven data assimilation (DA) models to generate accurate initial fields for LWMs, the lack of a standard benchmark impedes the fair evaluation among different data-driven DA algorithms. Here, we introduce DABench, a benchmark dataset utilizing ERA5 data as ground truth to guide the development of end-to-end data-driven weather prediction systems. DABench contributes four standard features: (1) sparse and noisy simulated observations under the guidance of the observing system simulation experiment method; (2) a skillful pre-trained weather prediction model to generate background fields while fairly evaluating the impact of assimilation outcomes on predictions; (3) standardized evaluation metrics for model comparison; (4) a strong baseline called the DA Transformer (DaT). DaT integrates the four-dimensional variational DA prior knowledge into the Transformer model and outperforms the SOTA in physical state reconstruction, named 4DVarNet. Furthermore, we exemplify the development of an end-to-end data-driven weather prediction system by integrating DaT with the prediction model. Researchers can leverage DABench to develop their models and compare performance against established baselines, which will benefit the future advancements of data-driven weather prediction systems. The code is available on this Github repository and the dataset is available at the Baidu Drive.

[LG-36] Towards Aligned Data Removal via Twin Machine Unlearning

Link: https://arxiv.org/abs/2408.11433
Authors: Yuyao Sun,Zhenxing Niu,Gang Hua,Rong Jin
Keywords-EN: Modern privacy regulations, Modern privacy, machine unlearning, Twin Machine Unlearning, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Modern privacy regulations have spurred the evolution of machine unlearning, a technique that enables the removal of data from an already trained ML model without requiring retraining from scratch. Previous unlearning methods tend to induce the model to achieve the lowest classification accuracy on the removal data. Nonetheless, the authentic objective of machine unlearning is to align the unlearned model with the gold model, i.e., achieving the same classification accuracy as the gold model. For this purpose, we present a Twin Machine Unlearning (TMU) approach, where a twin unlearning problem is defined corresponding to the original unlearning problem. As a result, the generalization-label predictor trained on the twin problem can be transferred to the original problem, facilitating aligned data removal. Comprehensive empirical experiments illustrate that our approach significantly enhances the alignment between the unlearned model and the gold model. Meanwhile, our method allows data removal without compromising the model accuracy.

[LG-37] Linear-time One-Class Classification with Repeated Element-wise Folding

Link: https://arxiv.org/abs/2408.11412
Authors: Jenni Raitoharju
Keywords-EN: Repeated Element-wise Folding, Repeated Element-wise, element-wise folding operation, Element-wise Folding, paper proposes
Categories: Machine Learning (cs.LG)
Notes: Accepted to EUSIPCO 2024

Abstract:This paper proposes an easy-to-use method for one-class classification: Repeated Element-wise Folding (REF). The algorithm consists of repeatedly standardizing and applying an element-wise folding operation on the one-class training data. Equivalent mappings are performed on unknown test items and the classification prediction is based on the item’s distance to the origin of the final distribution. As all the included operations have linear time complexity, the proposed algorithm provides a linear-time alternative for the commonly used computationally much more demanding approaches. Furthermore, REF can avoid the challenges of hyperparameter setting in one-class classification by providing robust default settings. The experiments show that the proposed method can produce similar classification performance or even outperform the more complex algorithms on various benchmark datasets. Matlab codes for REF are publicly available at this https URL.

[LG-38] Revisiting FunnyBirds evaluation framework for prototypical parts networks

Link: https://arxiv.org/abs/2408.11401
Authors: Szymon Opłatek,Dawid Rymarczyk,Bartosz Zieliński
Keywords-EN: post-hoc methods, popular due, produce more genuine, Prototypical parts networks, metric scores
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Published at 2nd XAI World Conference

Abstract:Prototypical parts networks, such as ProtoPNet, became popular due to their potential to produce more genuine explanations than post-hoc methods. However, for a long time, this potential has been strictly theoretical, and no systematic studies have existed to support it. That changed recently with the introduction of the FunnyBirds benchmark, which includes metrics for evaluating different aspects of explanations. However, this benchmark employs attribution maps visualization for all explanation techniques except for the ProtoPNet, for which the bounding boxes are used. This choice significantly influences the metric scores and questions the conclusions stated in FunnyBirds publication. In this study, we comprehensively compare metric scores obtained for two types of ProtoPNet visualizations: bounding boxes and similarity maps. Our analysis indicates that employing similarity maps aligns better with the essence of ProtoPNet, as evidenced by different metric scores obtained from FunnyBirds. Therefore, we advocate using similarity maps as a visualization technique for prototypical parts networks in explainability evaluation benchmarks.

[LG-39] First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

Link: https://arxiv.org/abs/2408.11393
Authors: Chi Ma,Mincong Huang,Ying Zhang,Chao Wang,Yujie Wang,Lei Yu,Chuan Liu,Wei Lin
Keywords-EN: large language models, Threshold-based Dynamic Activation, DejaVu and MoEfication, demonstrated their potential, enhance the inference
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:Dynamic activation (DA) techniques, such as DejaVu and MoEfication, have demonstrated their potential to significantly enhance the inference efficiency of large language models (LLMs). However, these techniques often rely on ReLU activation functions or require additional parameters and training to maintain performance. This paper introduces a training-free Threshold-based Dynamic Activation (TDA) method that leverages sequence information to exploit the inherent sparsity of models across various architectures. This method is designed to accelerate generation speed by 18-25% without significantly compromising task performance, thereby addressing the limitations of existing DA techniques. Moreover, we delve into the root causes of LLM sparsity and theoretically analyze two of its critical features: history-related activation uncertainty and semantic-irrelevant activation inertia. Our comprehensive analyses not only provide a robust theoretical foundation for DA methods but also offer valuable insights to guide future research in optimizing LLMs for greater efficiency and effectiveness.

[LG-40] Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features ECML PKDD2024

Link: https://arxiv.org/abs/2408.11384
Authors: Hiba Najjar,Marlon Nuske,Andreas Dengel
Keywords-EN: machine learning models, extensively leveraged, leveraged to enhance, machine learning, model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted at MACLEAN workshop, ECML/PKDD 2024

Abstract:The availability of temporal geospatial data in multiple modalities has been extensively leveraged to enhance the performance of machine learning models. While efforts on the design of adequate model architectures are approaching a level of saturation, focusing on a data-centric perspective can complement these efforts to achieve further enhancements in data usage efficiency and model generalization capacities. This work contributes to this direction. We leverage model explanation methods to identify the features crucial for the model to reach optimal performance and the smallest set of features sufficient to achieve this performance. We evaluate our approach on three temporal multimodal geospatial datasets and compare multiple model explanation techniques. Our results reveal that some datasets can reach their optimal accuracy with less than 20% of the temporal instances, while in other datasets, the time series of a single band from a single modality is sufficient.

[LG-41] A Unified Framework for Continual Learning and Machine Unlearning

Link: https://arxiv.org/abs/2408.11374
Authors: Romit Chatterjee,Vikram Chundawat,Ayush Tarun,Ankur Mali,Murari Mandal
Keywords-EN: typically addressed separately, typically addressed, addressed separately, Continual learning, learning
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Continual learning and machine unlearning are crucial challenges in machine learning, typically addressed separately. Continual learning focuses on adapting to new knowledge while preserving past information, whereas unlearning involves selectively forgetting specific subsets of data. In this paper, we introduce a novel framework that jointly tackles both tasks by leveraging controlled knowledge distillation. Our approach enables efficient learning with minimal forgetting and effective targeted unlearning. By incorporating a fixed memory buffer, the system supports learning new concepts while retaining prior knowledge. The distillation process is carefully managed to ensure a balance between acquiring new information and forgetting specific data as needed. Experimental results on benchmark datasets show that our method matches or exceeds the performance of existing approaches in both continual learning and machine unlearning. This unified framework is the first to address both challenges simultaneously, paving the way for adaptable models capable of dynamic learning and forgetting while maintaining strong overall performance.

[LG-42] Graph Classification via Reference Distribution Learning: Theory and Practice

Link: https://arxiv.org/abs/2408.11370
Authors: Zixiao Wang,Jicong Fan
Keywords-EN: challenging problem owing, Reference Distribution Learning, challenging problem, problem owing, difficulty in quantifying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Graph classification is a challenging problem owing to the difficulty in quantifying the similarity between graphs or representing graphs as vectors, though there have been a few methods using graph kernels or graph neural networks (GNNs). Graph kernels often suffer from computational costs and manual feature engineering, while GNNs commonly utilize global pooling operations, risking the loss of structural or semantic information. This work introduces Graph Reference Distribution Learning (GRDL), an efficient and accurate graph classification method. GRDL treats each graph’s latent node embeddings given by GNN layers as a discrete distribution, enabling direct classification without global pooling, based on maximum mean discrepancy to adaptively learned reference distributions. To fully understand this new model (the existing theories do not apply) and guide its configuration (e.g., network architecture, references’ sizes, number, and regularization) for practical use, we derive generalization error bounds for GRDL and verify them numerically. More importantly, our theoretical and numerical results both show that GRDL has a stronger generalization ability than GNNs with global pooling operations. Experiments on moderate-scale and large-scale graph datasets show the superiority of GRDL over the state-of-the-art, emphasizing its remarkable efficiency, being at least 10 times faster than leading competitors in both training and inference stages.
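
The pooling-free classification step can be illustrated with a minimal sketch: treat a graph's node embeddings as a point set and assign the label of the reference set with the smallest maximum mean discrepancy (MMD). Everything concrete here is an assumption for illustration: the Gaussian kernel, the biased MMD estimator, the fixed bandwidth, and the hand-written reference sets (in GRDL the references are learned, and the embeddings come from GNN layers).

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two point sets."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

def classify(node_embeddings, references, bandwidth=1.0):
    """Return the label whose reference set is closest in MMD."""
    return min(
        references,
        key=lambda label: mmd_squared(node_embeddings, references[label], bandwidth),
    )

# Hypothetical reference distributions and node embeddings for one graph.
references = {
    "a": [[0.0, 0.0], [0.1, 0.0]],
    "b": [[3.0, 3.0], [3.0, 3.1]],
}
graph_nodes = [[0.05, 0.0], [0.0, 0.1], [0.1, 0.1]]
predicted = classify(graph_nodes, references)
```

Because the comparison is between whole point sets, graphs with different numbers of nodes can be classified without first collapsing their embeddings into a single pooled vector.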

[LG-43] Towards Probabilistic Inductive Logic Programming with Neurosymbolic Inference and Relaxation

Link: https://arxiv.org/abs/2408.11367
Authors: Fieke Hillerstrom,Gertjan Burghouts
Keywords-EN: inductive logic programming, probabilistic background knowledge, logic programming, methods are incapable, coming from sensory
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 15 pages

Abstract:Many inductive logic programming (ILP) methods are incapable of learning programs from probabilistic background knowledge, e.g. coming from sensory data or neural networks with probabilities. We propose Propper, which handles flawed and probabilistic background knowledge by extending ILP with a combination of neurosymbolic inference, a continuous criterion for hypothesis selection (BCE) and a relaxation of the hypothesis constrainer (NoisyCombo). For relational patterns in noisy images, Propper can learn programs from as few as 8 examples. It outperforms binary ILP and statistical models such as a Graph Neural Network.

[LG-44] GeoReasoner: Reasoning On Geospatially Grounded Context For Natural Language Understanding

Link: https://arxiv.org/abs/2408.11366
Authors: Yibo Yan,Joey Lee
Keywords-EN: involves recognizing geographic, recognizing geographic entities, making informed inferences, reading and communication, individuals tend
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted by International Conference on Information and Knowledge Management 2024

Abstract:In human reading and communication, individuals tend to engage in geospatial reasoning, which involves recognizing geographic entities and making informed inferences about their interrelationships. To mimic such a cognitive process, current methods either utilize conventional natural language understanding toolkits, or directly apply models pretrained on geo-related natural language corpora. However, these methods face two significant challenges: i) they do not generalize well to unseen geospatial scenarios, and ii) they overlook the importance of integrating geospatial context from geographical databases with linguistic information from the Internet. To handle these challenges, we propose GeoReasoner, a language model capable of reasoning on geospatially grounded natural language. Specifically, it first leverages Large Language Models (LLMs) to generate a comprehensive location description based on linguistic and geospatial information. It also encodes direction and distance information into spatial embeddings by treating them as pseudo-sentences. Consequently, the model is trained on both anchor-level and neighbor-level inputs to learn geo-entity representation. Extensive experimental results demonstrate GeoReasoner’s superiority in three tasks: toponym recognition, toponym linking, and geo-entity typing, compared to the state-of-the-art baselines.

[LG-45] ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Link: https://arxiv.org/abs/2408.11363
Authors: Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang
Keywords: Understanding biological processes, biotechnological advancements requires, advancements requires detailed, requires detailed analysis, Understanding biological
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Notes: 19 pages, 9 figures, 5 tables

Abstract:Understanding biological processes, drug development, and biotechnological advancements requires detailed analysis of protein structures and sequences, a task in protein research that is inherently complex and time-consuming when performed manually. To streamline this process, we introduce ProteinGPT, a state-of-the-art multi-modal protein chat system, that allows users to upload protein sequences and/or structures for comprehensive protein analysis and responsive inquiries. ProteinGPT seamlessly integrates protein sequence and structure encoders with linear projection layers for precise representation adaptation, coupled with a large language model (LLM) to generate accurate and contextually relevant responses. To train ProteinGPT, we construct a large-scale dataset of 132,092 proteins with annotations, and optimize the instruction-tuning process using GPT-4o. This innovative system ensures accurate alignment between the user-uploaded data and prompts, simplifying protein analysis. Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.

[LG-46] Hypergraph Learning based Recommender System for Anomaly Detection Control and Optimization

Link: https://arxiv.org/abs/2408.11359
Authors: Sakhinana Sagar Srinivas, Rajat Kumar Sarkar, Venkataramana Runkana
Keywords: anomaly detection framework, Anomaly detection, self-adapting anomaly detection, challenging problem, applications in industry
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 16 pages, 10 figures, Accepted at IEEE International Conference on Big Data 2022, Osaka, Japan

Abstract: Anomaly detection is a fundamental yet challenging problem with practical applications in industry. Current approaches neglect the higher-order dependencies within the networks of interconnected sensors in high-dimensional time series (multisensor data) for anomaly detection. To this end, we present a self-adapting anomaly detection framework for joint learning of (a) discrete hypergraph structure and (b) modeling the temporal trends and spatial relations among the interdependent sensors using the hierarchical encoder-decoder architecture to overcome the challenges. The hypergraph representation learning-based framework exploits the relational inductive biases in the hypergraph-structured data to learn the pointwise single-step-ahead forecasts through the self-supervised autoregressive task and predicts the anomalies based on the forecast error. Furthermore, our framework incentivizes learning the anomaly-diagnosis ontology through a differentiable approach. It derives the anomaly information propagation-based computational hypergraphs for root cause analysis and provides recommendations through an offline, optimal predictive control policy to remedy an anomaly. We conduct extensive experiments to evaluate the proposed method on the benchmark datasets for fair and rigorous comparison with the popular baselines. The proposed method outperforms the baseline models and achieves SOTA performance. We report the ablation studies to support the efficacy of the framework.

[LG-47] One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning

Link: https://arxiv.org/abs/2408.11356
Authors: Kelei He, Tiejun Dong, Jinhui Wu, Junfeng Zhang
Keywords: Understanding the structure, Existing virtual structure, Understanding, virtual structure measurement, protein-ligand complex
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Understanding the structure of the protein-ligand complex is crucial to drug development. Existing virtual structure measurement and screening methods are dominated by docking and its derived methods combined with deep learning. However, the sampling and scoring methodology have largely restricted the accuracy and efficiency. Here, we show that these two fundamental tasks can be accurately tackled with a single model, namely LigPose, based on multi-task geometric deep learning. By representing the ligand and the protein pair as a graph, LigPose directly optimizes the three-dimensional structure of the complex, with the learning of binding strength and atomic interactions as auxiliary tasks, enabling its one-step prediction ability without docking tools. Extensive experiments show LigPose achieved state-of-the-art performance on major tasks in drug research. Its considerable improvements indicate a promising paradigm of AI-based pipeline for drug development.

[LG-48] Vision HgNN: An Electron-Micrograph is Worth Hypergraph of Hypernodes ICLR

Link: https://arxiv.org/abs/2408.11351
Authors: Sakhinana Sagar Srinivas, Rajat Kumar Sarkar, Sreeja Gangasani, Venkataramana Runkana
Keywords: electron micrographs, crucial but challenging, challenging task, task with applications, quantum materials
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 21 pages, Accepted in PML4DC Workshop at International Conference on Learning Representations (ICLR) 2023

Abstract: Material characterization using electron micrographs is a crucial but challenging task with applications in various fields, such as semiconductors, quantum materials, batteries, etc. The challenges in categorizing electron micrographs include but are not limited to the complexity of patterns, high level of detail, and imbalanced data distribution (long-tail distribution). Existing methods have difficulty in modeling the complex relational structure in electron micrographs, hindering their ability to effectively capture the complex relationships between different spatial regions of micrographs. We propose a hypergraph neural network (HgNN) backbone architecture, a conceptually alternative approach, to better model the complex relationships in electron micrographs and improve material characterization accuracy. By utilizing cost-effective GPU hardware, our proposed framework outperforms popular baselines. The results of the ablation studies demonstrate that the proposed framework is effective in achieving state-of-the-art performance on benchmark datasets and efficient in terms of computational and memory requirements for handling large-scale electron micrograph-based datasets.

[LG-49] Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

Link: https://arxiv.org/abs/2408.11344
Authors: Sonit Singh
Keywords: Natural Language Processing, radiology report generation, Recent developments, field of Natural, Language Processing
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 21 pages, 6 figures, 8 tables

Abstract: Recent developments in the field of Natural Language Processing, especially language models such as the transformer, have brought state-of-the-art results in language understanding and language generation. In this work, we investigate the use of the transformer model for radiology report generation from chest X-rays. We also highlight limitations in evaluating radiology report generation using only the standard language generation metrics. We then apply a transformer-based radiology report generation architecture and compare the performance of a transformer-based decoder with that of a recurrence-based decoder. Experiments were performed using the IU-CXR dataset, with the transformer-based decoder showing superior results to its LSTM counterpart and being significantly faster. Finally, we identify the need to evaluate radiology report generation systems using both language generation metrics and classification metrics, which helps to provide a robust measure of generated reports in terms of their coherence and diagnostic value.

[LG-50] Automatic Dataset Construction (ADC): Sample Collection Data Curation and Beyond

Link: https://arxiv.org/abs/2408.11338
Authors: Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu
Keywords: Large-scale data collection, developing personalized training, fine-tuning specialized models, Large-scale data, mitigating the shortage
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large-scale data collection is essential for developing personalized training data, mitigating the shortage of training data, and fine-tuning specialized models. However, creating high-quality datasets quickly and accurately remains a challenge due to annotation errors, the substantial time and costs associated with human labor. To address these issues, we propose Automatic Dataset Construction (ADC), an innovative methodology that automates dataset creation with negligible cost and high efficiency. Taking the image classification task as a starting point, ADC leverages LLMs for the detailed class design and code generation to collect relevant samples via search engines, significantly reducing the need for manual annotation and speeding up the data generation process. Despite these advantages, ADC also encounters real-world challenges such as label errors (label noise) and imbalanced data distributions (label bias). We provide open-source software that incorporates existing methods for label error detection, robust learning under noisy and biased data, ensuring a higher-quality training data and more robust model training procedure. Furthermore, we design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning. These datasets are vital because there are few existing datasets specifically for label noise detection, despite its importance. Finally, we evaluate the performance of existing popular methods on these datasets, thereby facilitating further research in the field.

[LG-51] FATE: Focal-modulated Attention Encoder for Temperature Prediction

Link: https://arxiv.org/abs/2408.11336
Authors: Tajamul Ashraf, Janibul Bashir
Keywords: rising sea levels, increased storm frequency, melting glaciers, evidenced by rising, sea levels
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract: One of the major challenges of the twenty-first century is climate change, evidenced by rising sea levels, melting glaciers, and increased storm frequency. Accurate temperature forecasting is vital for understanding and mitigating these impacts. Traditional data-driven models often use recurrent neural networks (RNNs) but face limitations in parallelization, especially with longer sequences. To address this, we introduce a novel approach based on the FocalNet Transformer architecture. Our Focal modulation Attention Encoder (FATE) framework operates in a multi-tensor format, utilizing tensorized modulation to capture spatial and temporal nuances in meteorological data. Comparative evaluations against existing transformer encoders, 3D CNNs, LSTM, and ConvLSTM models show that FATE excels at identifying complex patterns in temperature data. Additionally, we present a new labeled dataset, the Climate Change Parameter dataset (CCPD), containing 40 years of data from Jammu and Kashmir on seven climate-related parameters. Experiments with real-world temperature datasets from the USA, Canada, and Europe show accuracy improvements of 12%, 23%, and 28%, respectively, over current state-of-the-art models. Our CCPD dataset also achieved a 24% improvement in accuracy. To support reproducible research, we have released the source code and pre-trained FATE model at this https URL.

[LG-52] Design Principle Transfer in Neural Architecture Search via Large Language Models

Link: https://arxiv.org/abs/2408.11330
Authors: Xun Zhou, Liang Feng, Xingyu Wu, Zhichao Lu, Kay Chen Tan
Keywords: Transferable neural architecture, Transferable neural, efficient neural architectures, design efficient neural, efficient neural
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Transferable neural architecture search (TNAS) has been introduced to design efficient neural architectures for multiple tasks, to enhance the practical applicability of NAS in real-world scenarios. In TNAS, architectural knowledge accumulated in previous search processes is reused to warm up the architecture search for new tasks. However, existing TNAS methods still search in an extensive search space, necessitating the evaluation of numerous architectures. To overcome this challenge, this work proposes a novel transfer paradigm, i.e., design principle transfer. In this work, the linguistic description of various structural components’ effects on architectural performance is termed design principles. They are learned from established architectures and then can be reused to reduce the search space by discarding unpromising architectures. Searching in the refined search space can boost both the search performance and efficiency for new NAS tasks. To this end, a large language model (LLM)-assisted design principle transfer (LAPT) framework is devised. In LAPT, LLM is applied to automatically reason the design principles from a set of given architectures, and then a principle adaptation method is applied to refine these principles progressively based on the new search results. Experimental results show that LAPT can beat the state-of-the-art TNAS methods on most tasks and achieve comparable performance on others.

[LG-53] Improving Out-of-Distribution Data Handling and Corruption Resistance via Modern Hopfield Networks

Link: https://arxiv.org/abs/2408.11309
Authors: Saleh Sargolzaei, Luis Rueda
Keywords: Modern Hopfield Networks, Hopfield Networks, Modern Hopfield, computer vision models, potential of Modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This study explores the potential of Modern Hopfield Networks (MHN) in improving the ability of computer vision models to handle out-of-distribution data. While current computer vision models can generalize to unseen samples from the same distribution, they are susceptible to minor perturbations such as blurring, which limits their effectiveness in real-world applications. We suggest integrating MHN into the baseline models to enhance their robustness. This integration can be implemented during the test time for any model and combined with any adversarial defense method. Our research shows that the proposed integration consistently improves model performance on the MNIST-C dataset, achieving a state-of-the-art increase of 13.84% in average corruption accuracy, a 57.49% decrease in mean Corruption Error (mCE), and a 60.61% decrease in relative mCE compared to the baseline model. Additionally, we investigate the capability of MHN to converge to the original non-corrupted data. Notably, our method does not require test-time adaptation or augmentation with corruptions, underscoring its practical viability for real-world deployment. (Source code publicly available at: this https URL)
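
As a rough illustration of why a Modern Hopfield layer can pull a corrupted input back toward a clean stored pattern, the sketch below applies the standard retrieval update, state ← Xᵀ softmax(β X state), in plain Python. The abstract does not say how the authors attach MHN to their baseline models, so the function names, the toy patterns, and the values of `beta` and `steps` are all illustrative assumptions and not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(patterns, query, beta=2.0, steps=3):
    """Iterate the Modern Hopfield update: state <- X^T softmax(beta * X state).

    `patterns` is a list of stored vectors (rows of X); `query` is the
    possibly corrupted input. `beta` and `steps` are assumed toy values.
    """
    state = list(query)
    for _ in range(steps):
        # Similarity of the current state to every stored pattern.
        scores = [beta * sum(p * s for p, s in zip(pattern, state))
                  for pattern in patterns]
        weights = softmax(scores)
        # New state is the attention-weighted average of stored patterns.
        state = [sum(w * pattern[j] for w, pattern in zip(weights, patterns))
                 for j in range(len(state))]
    return state


# Three mutually orthogonal toy patterns and a query with one flipped entry.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
corrupted = [1, 1, 1, -1, -1, -1, -1, -1]
print(hopfield_retrieve(stored, corrupted))
```

With a sufficiently large `beta`, the softmax concentrates on the nearest stored pattern, so the retrieved state lands close to it; smaller values blend several patterns instead.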

[LG-54] KAN4TSF: Are KAN and KAN-based models Effective for Time Series Forecasting?

Link: https://arxiv.org/abs/2408.11306
Authors: Xiao Han, Xinfeng Zhang, Yiling Wu, Zhenduo Zhang, Zhe Wu
Keywords: Time series forecasting, Time series, series forecasting, crucial task, task that predicts
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasting is a crucial task that predicts the future values of variables based on historical data. Time series forecasting techniques have been developing in parallel with the machine learning community, from early statistical learning methods to current deep learning methods. Although existing methods have made significant progress, they still suffer from two challenges. The mathematical theory of mainstream deep learning-based methods does not establish a clear relation between network sizes and fitting capabilities, and these methods often lack interpretability. To this end, we introduce the Kolmogorov-Arnold Network (KAN) into time series forecasting research, which has better mathematical properties and interpretability. First, we propose the Reversible Mixture of KAN experts (RMoK) model, which is a KAN-based model for time series forecasting. RMoK uses a mixture-of-experts structure to assign variables to KAN experts. Then, we compare performance, integration, and speed between RMoK and various baselines on real-world datasets, and the experimental results show that RMoK achieves the best performance in most cases. And we find the relationship between temporal feature weights and data periodicity through visualization, which roughly explains RMoK’s mechanism. Thus, we conclude that KAN and KAN-based models (RMoK) are effective in time series forecasting. Code is available at KAN4TSF: this https URL.

[LG-55] FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts

Link: https://arxiv.org/abs/2408.11304
Authors: Hanzi Mei, Dongqi Cai, Ao Zhou, Shangguang Wang, Mengwei Xu
Keywords: Large Language Models, Large Language, making Federated Learning, Language Models, push the boundaries
Categories: Machine Learning (cs.LG)

Abstract:As Large Language Models (LLMs) push the boundaries of AI capabilities, their demand for data is growing. Much of this data is private and distributed across edge devices, making Federated Learning (FL) a de-facto alternative for fine-tuning (i.e., FedLLM). However, it faces significant challenges due to the inherent heterogeneity among clients, including varying data distributions and diverse task types. Towards a versatile FedLLM, we replace traditional dense model with a sparsely-activated Mixture-of-Experts (MoE) architecture, whose parallel feed-forward networks enable greater flexibility. To make it more practical in resource-constrained environments, we present FedMoE, the efficient personalized FL framework to address data heterogeneity, constructing an optimal sub-MoE for each client and bringing the knowledge back to global MoE. FedMoE is composed of two fine-tuning stages. In the first stage, FedMoE simplifies the problem by conducting a heuristic search based on observed activation patterns, which identifies a suboptimal submodel for each client. In the second stage, these submodels are distributed to clients for further training and returned for server aggregating through a novel modular aggregation strategy. Meanwhile, FedMoE progressively adjusts the submodels to optimal through global expert recommendation. Experimental results demonstrate the superiority of our method over previous personalized FL methods.
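
To make the first stage more concrete, here is a small sketch of one way a per-client submodel could be chosen from observed activation patterns: count how often each expert lands in a sample's top-k gate scores, then keep the most frequently activated experts. The abstract does not specify FedMoE's actual heuristic, so the functions `top_k_experts`, `activation_counts`, and `select_submodel`, the `budget` parameter, and the toy gate logits are all hypothetical.

```python
from collections import Counter


def top_k_experts(gate_logits, k):
    """Indices of the k experts with the highest gate logits for one sample."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    return ranked[:k]


def activation_counts(batch_gate_logits, k):
    """Count how often each expert is activated across a client's samples."""
    counts = Counter()
    for logits in batch_gate_logits:
        counts.update(top_k_experts(logits, k))
    return counts


def select_submodel(batch_gate_logits, k, budget):
    """Keep the `budget` most frequently activated experts (assumed heuristic)."""
    counts = activation_counts(batch_gate_logits, k)
    return sorted(expert for expert, _ in counts.most_common(budget))


# Toy gate logits for four samples over four experts (made-up numbers).
client_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 0.3, -0.5],
    [0.9, 1.1, -0.2, 0.0],
    [0.1, 0.4, 0.0, 2.0],
]
print(select_submodel(client_logits, k=2, budget=2))
```

A frequency count like this is only a starting point; the abstract notes that the initial submodel is suboptimal and is adjusted later through global expert recommendation.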

[LG-56] Koopman AutoEncoder via Singular Value Decomposition for Data-Driven Long-Term Prediction

Link: https://arxiv.org/abs/2408.11303
Authors: Jinho Choi, Sivaram Krishnan, Jihong Park
Keywords: modeling nonlinear dynamics, deep learning methods, Koopman autoencoder, data-driven technique, recent years
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes: 6 pages, 5 figures, to be presented at IEEE MLSP 2024

Abstract:The Koopman autoencoder, a data-driven technique, has gained traction for modeling nonlinear dynamics using deep learning methods in recent years. Given the linear characteristics inherent to the Koopman operator, controlling its eigenvalues offers an opportunity to enhance long-term prediction performance, a critical task for forecasting future trends in time-series datasets with long-term behaviors. However, controlling eigenvalues is challenging due to high computational complexity and difficulties in managing them during the training process. To tackle this issue, we propose leveraging the singular value decomposition (SVD) of the Koopman matrix to adjust the singular values for better long-term prediction. Experimental results demonstrate that, during training, the loss term for singular values effectively brings the eigenvalues close to the unit circle, and the proposed approach outperforms existing baseline methods for long-term prediction tasks.
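
A short note on why adjusting singular values can help, assuming the usual Koopman autoencoder setup in which the latent state evolves linearly as $z_{t+1} = K z_t$ with a square Koopman matrix $K$ (the symbols $z_t$, $K$, $\sigma_i$, and $\lambda_i$ are notation chosen here, not taken from the abstract). The largest and smallest singular values bound both the growth of multi-step predictions and the magnitudes of all eigenvalues:

```latex
\begin{aligned}
z_{t+n} &= K^{n} z_t, \qquad K = U \Sigma V^{\top}, \qquad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d), \\
\|z_{t+n}\|_2 &\le \sigma_{\max}(K)^{n} \, \|z_t\|_2, \\
\sigma_{\min}(K) &\le |\lambda_i(K)| \le \sigma_{\max}(K) \quad \text{for every eigenvalue } \lambda_i(K).
\end{aligned}
```

Under these bounds, a penalty that pushes every $\sigma_i$ toward 1 also squeezes the eigenvalue magnitudes toward the unit circle, which is consistent with the behavior the abstract reports. The exact form of the paper's loss term is not given in the abstract.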

[LG-57] Modeling Reference-dependent Choices with Graph Neural Networks

Link: https://arxiv.org/abs/2408.11302
Authors: Liang Zhang, Guannan Liu, Junjie Wu, Yong Tan
Keywords: classic Prospect Theory, Prospect Theory, recommender systems development, classic Prospect, Theory has highlighted
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:While the classic Prospect Theory has highlighted the reference-dependent and comparative nature of consumers’ product evaluation processes, few models have successfully integrated this theoretical hypothesis into data-driven preference quantification, particularly in the realm of recommender systems development. To bridge this gap, we propose a new research problem of modeling reference-dependent preferences from a data-driven perspective, and design a novel deep learning-based framework named Attributed Reference-dependent Choice Model for Recommendation (ArcRec) to tackle the inherent challenges associated with this problem. ArcRec features in building a reference network from aggregated historical purchase records for instantiating theoretical reference points, which is then decomposed into product attribute specific sub-networks and represented through Graph Neural Networks. In this way, the reference points of a consumer can be encoded at the attribute-level individually from her past experiences but also reflect the crowd influences. ArcRec also makes novel contributions to quantifying consumers’ reference-dependent preferences using a deep neural network-based utility function that integrates both interest-inspired and price-inspired preferences, with their complex interaction effects captured by an attribute-aware price sensitivity mechanism. Most importantly, ArcRec introduces a novel Attribute-level Willingness-To-Pay measure to the reference-dependent utility function, which captures a consumer’s heterogeneous salience of product attributes via observing her attribute-level price tolerance to a product. Empirical evaluations on both synthetic and real-world online shopping datasets demonstrate ArcRec’s superior performances over fourteen state-of-the-art baselines.

[LG-58] Offline Policy Learning via Skill-step Abstraction for Long-horizon Goal-Conditioned Tasks

Link: https://arxiv.org/abs/2408.11300
Authors: Donghoon Kim, Minjong Yoo, Honguk Woo
Keywords: policy learning, confronting long-horizon goals, policy, sparsity of rewards, long-horizon goals
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 pages, 4 figures, International Joint Conference on Artificial Intelligence 2024, Published version

Abstract: Goal-conditioned (GC) policy learning often faces a challenge arising from the sparsity of rewards when confronting long-horizon goals. To address the challenge, we explore skill-based GC policy learning in offline settings, where skills are acquired from existing data and long-horizon goals are decomposed into sequences of near-term goals that align with these skills. Specifically, we present an ‘offline GC policy learning via skill-step abstraction’ framework (GLvSA) tailored for tackling long-horizon GC tasks affected by goal distribution shifts. In the framework, a GC policy is progressively learned offline in conjunction with the incremental modeling of skill-step abstractions on the data. We also devise a GC policy hierarchy that not only accelerates GC policy learning within the framework but also allows for parameter-efficient fine-tuning of the policy. Through experiments with the maze and Franka kitchen environments, we demonstrate the superiority and efficiency of our GLvSA framework in adapting GC policies to a wide range of long-horizon goals. The framework achieves competitive zero-shot and few-shot adaptation performance, outperforming existing GC policy learning and skill-based methods.

[LG-59] ViIK: Flow-based Vision Inverse Kinematics Solver with Fusing Collision Checking

Link: https://arxiv.org/abs/2408.11293
Authors: Qinglong Meng, Chongkun Xia, Xueqian Wang
Keywords: Discrete Oriented Polytope, Inverse Kinematics, end effector, Inverse Kinematics solver, Vision Inverse Kinematics
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract: Inverse Kinematics (IK) is the problem of finding the robot’s configurations that satisfy the target pose of the end effector. In motion planning, diverse configurations are required in case a feasible trajectory is not found. Meanwhile, collision checking (CC), e.g. Oriented Bounding Box (OBB), Discrete Oriented Polytope (DOP), and Quickhull, needs to be done for each configuration provided by the IK solver to ensure every goal configuration for motion planning is available. This means the classical IK solver and CC algorithm should be executed repeatedly for every configuration. Thus, the preparation time is long when the required number of goal configurations is large, e.g. motion planning in cluster environments. Moreover, structured maps, which might be difficult to obtain, are required by classical collision-checking algorithms. To sidestep these two issues, we propose a flow-based vision method that can output diverse available configurations by fusing inverse kinematics and collision checking, named Vision Inverse Kinematics solver (ViIK). Moreover, ViIK uses RGB images as the perception of environments. ViIK can output 1000 configurations within 40 ms, and the accuracy is about 3 millimeters and 1.5 degrees. Higher accuracy can be obtained by refining the output with a classical IK solver within a few iterations. The self-collision rates can be lower than 2%. The collision-with-env rates can be lower than 10% in most scenes. The code is available at: this https URL.
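
For readers new to IK, the toy sketch below shows why a single target can admit several joint configurations. It solves the textbook closed-form IK for a planar two-link arm, which generally has two solutions for a reachable point. This is unrelated to ViIK's flow-based model; the arm geometry, link lengths, and function names are assumptions chosen only for illustration.

```python
import math


def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Closed-form IK for a planar two-link arm.

    Returns a list of (q1, q2) joint angles in radians that place the
    end effector at (x, y), or an empty list if the target is unreachable.
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0:
        return []  # target lies outside the reachable annulus
    solutions = []
    for sign in (1.0, -1.0):  # the two elbow branches
        q2 = sign * math.acos(cos_q2)
        q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2),
                                           l1 + l2 * math.cos(q2))
        solutions.append((q1, q2))
    return solutions


def forward(q1, q2, l1=1.0, l2=1.0):
    """Forward kinematics: end-effector position for the given joint angles."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


for q1, q2 in two_link_ik(1.2, 0.5):
    print((q1, q2), forward(q1, q2))
```

A real manipulator with six or seven joints has many more solutions, and each candidate still has to pass collision checking, which is the repeated cost the abstract aims to avoid.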

[LG-60] Taming Generative Diffusion for Universal Blind Image Restoration

Link: https://arxiv.org/abs/2408.11287
Authors: Siwei Tu, Weidong Yang, Ben Fei
Keywords: blind image restoration, image restoration, blind image, widely utilized, image restoration methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 14 pages, 9 figures, 8 tables

Abstract: Diffusion models have been widely utilized for image restoration. However, previous blind image restoration methods still need to assume the type of degradation model while leaving the parameters to be optimized, limiting their real-world applications. Therefore, we aim to tame the generative diffusion prior for universal blind image restoration, dubbed BIR-D, which utilizes an optimizable convolutional kernel to simulate the degradation model and dynamically updates the parameters of the kernel in the diffusion steps, enabling it to achieve blind image restoration results even in various complex situations. Besides, based on mathematical reasoning, we provide an empirical formula for choosing the adaptive guidance scale, eliminating the need for a grid search for the optimal parameter. Experimentally, BIR-D demonstrates superior practicality and versatility over off-the-shelf unsupervised methods across various tasks on both real-world and synthetic datasets, qualitatively and quantitatively. BIR-D is able to fulfill multi-guidance blind image restoration. Moreover, BIR-D can also restore images that undergo multiple and complicated degradations, demonstrating its practical applications.

[LG-61] Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method

Link: https://arxiv.org/abs/2408.11267
Authors: Chenyang Li, Zhao Song, Zhaoxing Xu, Junze Yin
Keywords: aiding regression analysis, randomized matrix computations, leverage score, leverage scores gradient, machine learning
Categories: Machine Learning (cs.LG)
Notes: arXiv admin note: text overlap with arXiv:2404.13785

Abstract: Leverage scores have become essential in statistics and machine learning, aiding regression analysis, randomized matrix computations, and various other tasks. This paper delves into the inverse problem, aiming to recover the intrinsic model parameters given the leverage scores gradient. This endeavor not only enriches the theoretical understanding of models trained with leverage score techniques but also has substantial implications for data privacy and adversarial security. We specifically scrutinize the inversion of the leverage score gradient, denoted as $g(x)$. An innovative iterative algorithm is introduced for the approximate resolution of the regularized least squares problem stated as $\min_{x \in \mathbb{R}^d} 0.5 \|g(x) - c\|_2^2 + 0.5 \|\mathrm{diag}(w) A x\|_2^2$. Our algorithm employs subsampled leverage score distributions to compute an approximate Hessian in each iteration, under standard assumptions, considerably mitigating the time complexity. Given that a total of $T = \log(\|x_0 - x^*\|_2 / \epsilon)$ iterations are required, the cost per iteration is optimized to the order of $O((\mathrm{nnz}(A) + d^{\omega}) \cdot \mathrm{poly}(\log(n/\delta)))$, where $\mathrm{nnz}(A)$ denotes the number of non-zero entries of $A$.

[LG-62] Practical Aspects on Solving Differential Equations Using Deep Learning: A Primer

Link: https://arxiv.org/abs/2408.11266
Authors: Georgios Is. Detorakis
Keywords: Deep Galerkin method, Deep Galerkin, Galerkin method, partial differential equations, Deep learning
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Notes: 32 pages, 12 figures, primer (tutorial)

Abstract: Deep learning has become a popular tool across many scientific fields, including the study of differential equations, particularly partial differential equations. This work introduces the basic principles of deep learning and the Deep Galerkin method, which uses deep neural networks to solve differential equations. This primer aims to provide technical and practical insights into the Deep Galerkin method and its implementation. We demonstrate how to solve the one-dimensional heat equation step-by-step. We also show how to apply the Deep Galerkin method to solve systems of ordinary differential equations and integral equations, such as Fredholm equations of the second kind. Additionally, we provide code snippets within the text and the complete source code on GitHub. The examples are designed so that one can run them on a simple computer without needing a GPU.
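
For orientation, the Deep Galerkin method is usually described as training a network $u_\theta(t, x)$ to minimize the squared PDE residual plus penalties for the initial and boundary conditions, all estimated on randomly sampled points. A sketch of that loss for the one-dimensional heat equation follows; the diffusivity $\alpha$, the domain $[0, T] \times [0, 1]$, the Dirichlet boundary data $g$, and the initial profile $u_0$ are assumptions made here, and the primer's exact setup may differ.

```latex
\begin{aligned}
&\partial_t u = \alpha \, \partial_{xx} u, \qquad (t, x) \in [0, T] \times [0, 1], \\
&J(\theta) =
\mathbb{E}_{(t, x)} \big[ \big( \partial_t u_\theta(t, x) - \alpha \, \partial_{xx} u_\theta(t, x) \big)^2 \big]
+ \mathbb{E}_{x} \big[ \big( u_\theta(0, x) - u_0(x) \big)^2 \big]
+ \mathbb{E}_{t, \, x_b \in \{0, 1\}} \big[ \big( u_\theta(t, x_b) - g(t, x_b) \big)^2 \big].
\end{aligned}
```

Each expectation is approximated by a fresh mini-batch of sampled points at every training step, which is what lets the method run without a mesh.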

[LG-63] Correlation Analysis of Adversarial Attack in Time Series Classification

Link: https://arxiv.org/abs/2408.11264
Authors: Zhengyang Li, Wenhao Liang, Chang Dong, Weitong Chen, Dong Huang
Keywords: time series classification, process local versus, Auto Correlation Function, Normalized Auto Correlation, local versus global
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Notes: 15 pages, 7 figures

Abstract: This study investigates the vulnerability of time series classification models to adversarial attacks, with a focus on how these models process local versus global information under such conditions. By leveraging the Normalized Auto Correlation Function (NACF), an exploration into the inclination of neural networks is conducted. It is demonstrated that regularization techniques, particularly those employing Fast Fourier Transform (FFT) methods and targeting frequency components of perturbations, markedly enhance the effectiveness of attacks. Meanwhile, defense strategies such as noise introduction and Gaussian filtering are shown to significantly lower the Attack Success Rate (ASR), with noise-based approaches notably effective in countering high-frequency distortions. Furthermore, models designed to prioritize global information are revealed to possess greater resistance to adversarial manipulations. These results underline the importance of designing attack and defense mechanisms, informed by frequency domain analysis, as a means to considerably reinforce the resilience of neural network models against adversarial threats.
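
The abstract does not define the NACF, so the sketch below uses the common textbook form: the autocovariance of the mean-centered series at each lag, divided by its value at lag 0. The function name `nacf`, the biased (non-rescaled) estimator, and the toy signal are assumptions; the paper may normalize differently.

```python
from statistics import fmean


def nacf(series, max_lag):
    """Normalized autocorrelation for lags 0..max_lag.

    Uses the biased estimator: every lag is divided by the lag-0 energy,
    so the value at lag 0 is always 1.
    """
    n = len(series)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(series)")
    mean = fmean(series)
    centered = [value - mean for value in series]
    energy = sum(c * c for c in centered)
    if energy == 0.0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


# A period-4 toy signal: correlation is strongly negative at lag 2
# (half a period) and strongly positive again at lag 4 (one full period).
signal = [0.0, 1.0, 0.0, -1.0] * 8
print(nacf(signal, max_lag=4))
```

Slowly decaying values across lags indicate that a series is dominated by global, low-frequency structure, while values that drop quickly point to local, high-frequency content.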

[LG-64] Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?

Link: https://arxiv.org/abs/2408.11243
Authors: Qian Ma,Haitao Mao,Jingzhe Liu,Zhehua Zhang,Chunlin Feng,Yu Song,Yihan Shao,Tianfan Fu,Yao Ma
Keywords: graph SSL techniques, existing graph SSL, graph SSL, effectively leveraging knowledge, SSL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Self-supervised learning (SSL) is essential to obtain foundation models in NLP and CV domains via effectively leveraging knowledge in large-scale unlabeled data. The reason for its success is that a suitable SSL design can help the model to follow the neural scaling law, i.e., the performance consistently improves with increasing model and dataset sizes. However, it remains a mystery whether existing SSL in the graph domain can follow the scaling behavior toward building Graph Foundation Models (GFMs) with large-scale pre-training. In this study, we examine whether existing graph SSL techniques can follow the neural scaling behavior with the potential to serve as the essential component for GFMs. Our benchmark includes comprehensive SSL technique implementations with analysis conducted on both the conventional SSL setting and many new settings adopted in other domains. Surprisingly, despite the SSL loss continuously decreasing, no existing graph SSL techniques follow the neural scaling behavior on the downstream performance. The model performance merely fluctuates across different data scales and model scales. Instead of the scales, the key factors influencing the performance are the choices of model architecture and pretext task design. This paper examines existing SSL techniques for the feasibility of Graph SSL techniques in developing GFMs and opens a new direction for graph SSL design with the new evaluation prototype. Our code implementation is available online to ease reproducibility at this https URL.

[LG-65] Asymmetric Graph Error Control with Low Complexity in Causal Bandits

Link: https://arxiv.org/abs/2408.11240
Authors: Chen Peng,Di Zhang,Urbashi Mitra
Keywords: causal bandit problem, causal graph, causal, causal graph learning, select an optimal
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:In this paper, the causal bandit problem is investigated, in which the objective is to select an optimal sequence of interventions on nodes in a causal graph. It is assumed that the graph is governed by linear structural equations; it is further assumed that both the causal topology and the distribution of interventions are unknown. By exploiting the causal relationships between the nodes whose signals contribute to the reward, interventions are optimized. First, based on the difference between the two types of graph identification errors (false positives and negatives), a causal graph learning method is proposed, which strongly reduces sample complexity relative to the prior art by learning sub-graphs. Under the assumption of Gaussian exogenous inputs and minimum-mean squared error weight estimation, a new uncertainty bound tailored to the causal bandit problem is derived. This uncertainty bound drives an upper confidence bound based intervention selection to optimize the reward. To cope with non-stationary bandits, a sub-graph change detection mechanism is proposed, with high sample efficiency. Numerical results compare the new methodology to existing schemes and show a substantial performance improvement in both stationary and non-stationary settings. Compared to existing approaches, the proposed scheme takes 67% fewer samples to learn the causal structure and achieves an average reward gain of 85%.

[LG-66] A Little Confidence Goes a Long Way

Link: https://arxiv.org/abs/2408.11239
Authors: John Scoville,Shang Gao,Devanshu Agrawal,Javed Qadrud-Din
Keywords: binary classification tasks, large language models, hidden state activations, introduce a group, group of related
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Comments: 13 pages, 2 figures

Abstract: We introduce a group of related methods for binary classification tasks using probes of the hidden state activations in large language models (LLMs). Performance is on par with the largest and most advanced LLMs currently available, while requiring orders of magnitude fewer computational resources and no labeled data. This approach involves translating class labels into a semantically rich description, spontaneous symmetry breaking of multilayer perceptron probes for unsupervised learning and inference, training probes to generate confidence scores (prior probabilities) from hidden state activations subject to known constraints via entropy maximization, and selecting the most confident probe model from an ensemble for prediction. These techniques are evaluated on four datasets using five base LLMs.

[LG-67] Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

Link: https://arxiv.org/abs/2408.11237
Authors: Christos Constantinou,Georgios Ioannides,Aman Chadha,Aaron Elkins,Edwin Simpson
Keywords: machine learning applications, model overconfidence, crucial in machine, machine learning, learning applications
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. Most existing OOD detection methods address uni-modal inputs, such as images or texts. In the context of multi-modal documents, there is a notable lack of extensive research on the performance of these methods, which have primarily been developed with a focus on computer vision tasks. We propose a novel methodology termed attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and decreases the false positive rate (FPR) by up to 7.5% compared to existing solutions. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.

[LG-68] Unified Deep Learning Model for Global Prediction of Aboveground Biomass, Canopy Height and Cover from High-Resolution Multi-Sensor Satellite Imagery

Link: https://arxiv.org/abs/2408.11234
Authors: Manuel Weber,Carly Beneke,Clyde Wheeler
Keywords: international climate initiatives, ground based assessments, carbon stock, carbon accounting, climate initiatives
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Regular measurement of carbon stock in the world’s forests is critical for carbon accounting and reporting under national and international climate initiatives, and for scientific research, but has been largely limited in scalability and temporal resolution due to a lack of ground based assessments. Increasing efforts have been made to address these challenges by incorporating remotely sensed data. We present a new methodology which uses multi-sensor, multi-spectral imagery at a resolution of 10 meters and a deep learning based model which unifies the prediction of above ground biomass density (AGBD), canopy height (CH), canopy cover (CC) as well as uncertainty estimations for all three quantities. The model is trained on millions of globally sampled GEDI-L2/L4 measurements. We validate the capability of our model by deploying it over the entire globe for the year 2023 as well as annually from 2016 to 2023 over selected areas. The model achieves a mean absolute error for AGBD (CH, CC) of 26.1 Mg/ha (3.7 m, 9.9 %) and a root mean squared error of 50.6 Mg/ha (5.4 m, 15.8 %) on a globally sampled test dataset, demonstrating a significant improvement over previously published results. We also report the model performance against independently collected ground measurements published in the literature, which show a high degree of correlation across varying conditions. We further show that our pre-trained model facilitates seamless transferability to other GEDI variables due to its multi-head architecture.

[LG-69] Revisiting Min-Max Optimization Problem in Adversarial Training

Link: https://arxiv.org/abs/2408.11218
Authors: Sina Hajer Ahmadi,Hassan Bahrami
Keywords: computer vision applications, real world puts, deep neural networks, neural networks, convolutional neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: The rise of computer vision applications in the real world puts the security of deep neural networks at risk. Recent works demonstrate that convolutional neural networks are susceptible to adversarial examples, where the input images look similar to the natural images but are classified incorrectly by the model. To address this problem, we propose a new method to build robust deep neural networks against adversarial attacks by reformulating the saddle point optimization problem in \cite{madry2017towards}. Our proposed method offers significant resistance and a concrete security guarantee against multiple adversaries. The goal of this paper is to act as a stepping stone for a new variation of deep learning models which would lead towards fully robust deep learning models.

[LG-70] Approximation of the Proximal Operator of the $\ell_\infty$ Norm Using a Neural Network

Link: https://arxiv.org/abs/2408.11211
Authors: Kathryn Linehan,Radu Balan
Keywords: partial sort similar, Computing the proximal, generally requires, similar to quicksort, infty
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 30 pages, 5 figures, 6 tables

Abstract: Computing the proximal operator of the $\ell_\infty$ norm, $\mathbf{prox}_{\alpha \|\cdot\|_\infty}(\mathbf{x})$, generally requires a sort of the input data, or at least a partial sort similar to quicksort. In order to avoid using a sort, we present an $O(m)$ approximation of $\mathbf{prox}_{\alpha \|\cdot\|_\infty}(\mathbf{x})$ using a neural network. A novel aspect of the network is that it is able to accept vectors of varying lengths due to a feature selection process that uses moments of the input data. We present results on the accuracy of the approximation, feature importance, and computational efficiency of the approach. We show that the network outperforms a “vanilla neural network” that does not use feature selection. We also present an algorithm with corresponding theory to calculate $\mathbf{prox}_{\alpha \|\cdot\|_\infty}(\mathbf{x})$ exactly, relate it to the Moreau decomposition, and compare its computational efficiency to that of the approximation.
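
For context on why a sort is normally needed, here is a minimal sketch of the standard sort-based route to the proximal operator of the ℓ∞ norm. It uses the Moreau decomposition, under which the operator equals the input minus its Euclidean projection onto the ℓ1 ball of radius α. This is the textbook construction, not the paper's neural approximation and not necessarily the exact algorithm the paper presents. The function names are chosen here for illustration.

```python
import math


def project_onto_l1_ball(x, radius):
    """Euclidean projection of x onto {v : ||v||_1 <= radius}, using a sort."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    if sum(abs(v) for v in x) <= radius:
        return list(x)  # already inside the ball

    # Find the soft-threshold level from the magnitudes in descending order.
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, magnitude in enumerate(magnitudes, start=1):
        cumulative += magnitude
        candidate = (cumulative - radius) / k
        if magnitude > candidate:
            threshold = candidate  # the last k that passes gives the level

    # Shrink every entry toward zero by the threshold, keeping its sign.
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in x]


def prox_linf(x, alpha):
    """prox of alpha * ||.||_inf at x, via the Moreau decomposition."""
    projection = project_onto_l1_ball(x, alpha)
    return [v - p for v, p in zip(x, projection)]


# The largest-magnitude entries are clipped to a common level.
result = prox_linf([3.0, 1.0, -2.0], alpha=1.0)
```

The sort dominates the cost at O(m log m) for an input of length m, which is the step an O(m) approximation would avoid.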

[LG-71] PooDLe: Pooled and dense self-supervised learning from naturalistic videos

Link: https://arxiv.org/abs/2408.11208
Authors: Alex N. Wang,Christopher Hoang,Yuwen Xiong,Yann LeCun,Mengye Ren
Keywords: driven significant progress, Self-supervised learning, driven significant, significant progress, Self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our findings indicate that a unified objective applied at multiple feature scales is essential for learning effective image representations from high-resolution, naturalistic videos. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.

[LG-72] UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library

Link: https://arxiv.org/abs/2408.11200
Authors: Alireza Moradzadeh,Lukasz Wawrzyniak,Miles Macklin,Saee G. Paliwal
Keywords: Kolmogorov-Arnold Networks, B-spline, GPU-accelerated library, B-spline coefficients, underlying components
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 7 figures, 4 tables

Abstract: In this work, we present a GPU-accelerated library for the underlying components of Kolmogorov-Arnold Networks (KANs), along with an algorithm to eliminate bounded grids in KANs. The GPU-accelerated library reduces the computational complexity of Basis Spline (B-spline) evaluation by a factor of $\mathcal{O}(\text{grid size})$ compared to existing codes, enabling batch computation for large-scale learning. To overcome the limitations of traditional KANs, we introduce Unbounded KANs (UKANs), which eliminate the need for a bounded grid and a fixed number of B-spline coefficients. To do so, we replace the KAN parameters (B-spline coefficients) with a coefficient generator (CG) model. The inputs to the CG model are designed based on the idea of an infinite symmetric grid extending from negative infinity to positive infinity. The positional encoding of grid group, a sequential collection of B-spline grid indexes, is fed into the CG model, and coefficients are consumed by the efficient implementation (matrix representations) of B-spline functions to generate outputs. We perform several experiments on regression, classification, and generative tasks, which are promising. In particular, UKAN does not require data normalization or a bounded domain for evaluation. Additionally, our benchmarking results indicate the superior memory and computational efficiency of our library compared to existing codes.

[LG-73] Active Learning of Molecular Data for Task-Specific Objectives

Link: https://arxiv.org/abs/2408.11191
Authors: Kunal Ghosh,Milica Todorović,Aki Vehtari,Patrick Rinke
Keywords: machine learning approach, data-efficient machine learning, Active learning, learning approach, machine learning
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:Active learning (AL) has shown promise for being a particularly data-efficient machine learning approach. Yet, its performance depends on the application and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes and GP noise settings. AL was insensitive to the acquisition batch size and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.

[LG-74] Reading with Intent

Link: https://arxiv.org/abs/2408.11189
Authors: Benjamin Reichman,Kartik Talamadupula,Toshish Jawale,Larry Heck
Keywords: integrating external information, external information sources, Retrieval augmented generation, RAG systems, open internet
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract: Retrieval augmented generation (RAG) systems augment how knowledgeable language models are by integrating external information sources such as Wikipedia, internal documents, scientific papers, or the open internet. RAG systems that rely on the open internet as their knowledge source have to contend with the complexities of human-generated content. Human communication extends much deeper than just the words rendered as text. Intent, tonality, and connotation can all change the meaning of what is being conveyed. Recent real-world deployments of RAG systems have shown some difficulty in understanding these nuances of human communication. One significant challenge for these systems lies in processing sarcasm. Though the Large Language Models (LLMs) that make up the backbone of these RAG systems are able to detect sarcasm, they currently do not always use these detections for the subsequent processing of text. To address these issues, in this paper, we synthetically generate sarcastic passages from the Natural Questions Wikipedia retrieval corpus. We then test the impact of these passages on the performance of both the retriever and reader portions of the RAG pipeline. We introduce a prompting system designed to enhance the model’s ability to interpret and generate responses in the presence of sarcasm, thus improving overall system performance. Finally, we conduct ablation studies to validate the effectiveness of our approach, demonstrating improvements in handling sarcastic content within RAG systems.

[LG-75] CRACKS: Crowdsourcing Resources for Analysis and Categorization of Key Subsurface faults

Link: https://arxiv.org/abs/2408.11185
Authors: Mohit Prabhushankar,Kiran Kokilepersaud,Jorge Quesada,Yavuz Yarici,Chen Zhou,Mohammad Alotaibi,Ghassan AlRegib,Ahmad Mustafa,Yusufjon Kumakov
Keywords: Crowdsourcing annotations, machine learning, created a paradigm, paradigm shift, data
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Crowdsourcing annotations has created a paradigm shift in the availability of labeled data for machine learning. Availability of large datasets has accelerated progress in common knowledge applications involving visual and language data. However, specialized applications that require expert labels lag in data availability. One such application is fault segmentation in subsurface imaging. Detecting, tracking, and analyzing faults has broad societal implications in predicting fluid flows, earthquakes, and storing excess atmospheric CO$_2$. However, delineating faults with current practices is a labor-intensive activity that requires precise analysis of subsurface imaging data by geophysicists. In this paper, we propose the CRACKS dataset to detect and segment faults in subsurface images by utilizing crowdsourced resources. We leverage Amazon Mechanical Turk to obtain fault delineations from sections of the Netherlands North Sea subsurface images from (i) 26 novices who have no exposure to subsurface data and were shown a video describing and labeling faults, (ii) 8 practitioners who have previously interacted and worked on subsurface data, (iii) one geophysicist to label 7636 faults in the region. Note that all novices, practitioners, and the expert segment faults on the same subsurface volume with disagreements between and among the novices and practitioners. Additionally, each fault annotation is equipped with the confidence level of the annotator. The paper provides benchmarks on detecting and segmenting the expert labels, given the novice and practitioner labels. Additional details along with the dataset links and codes are available at this https URL.

[LG-76] A Full DAG Score-Based Algorithm for Learning Causal Bayesian Networks with Latent Confounders ECAI’24

Link: https://arxiv.org/abs/2408.11181
Authors: Christophe Gonzales,Amir-Hosein Valizadeh
Keywords: Causal Bayesian networks, encode causal relations, Causal Bayesian, Bayesian networks, popular graphical probabilistic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, extended version with supplementary material of paper accepted at the 27th European Conference on Artificial Intelligence (ECAI’24)

Abstract:Causal Bayesian networks (CBN) are popular graphical probabilistic models that encode causal relations among variables. Learning their graphical structure from observational data has received a lot of attention in the literature. When there exists no latent (unobserved) confounder, i.e., no unobserved direct common cause of some observed variables, learning algorithms can be divided essentially into two classes: constraint-based and score-based approaches. The latter are often thought to be more robust than the former and to produce better results. However, to the best of our knowledge, when variables are discrete, no score-based algorithm is capable of dealing with latent confounders. This paper introduces the first fully score-based structure learning algorithm searching the space of DAGs (directed acyclic graphs) that is capable of identifying the presence of some latent confounders. It is justified mathematically and experiments highlight its effectiveness.

[LG-77] SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

Link: https://arxiv.org/abs/2408.11172
Authors: Xueliang Zhao,Lin Zheng,Haige Bo,Changran Hu,Urmish Thakker,Lingpeng Kong
Keywords: Formal theorem proving, large language models, theorem proving, Formal theorem, computer science
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments:

Abstract: Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs’ capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment’s advantages in subgoal-based proofs, SubgoalXL achieves a new state-of-the-art performance of 56.1% in Isabelle on the standard miniF2F dataset, marking an absolute improvement of 4.9%. Notably, SubgoalXL successfully solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F. These results underscore the effectiveness of maximizing limited data utility and employing targeted guidance for complex reasoning in formal theorem proving, contributing to the ongoing advancement of AI reasoning capabilities. The implementation is available at this https URL.

[LG-78] Swim till You Sink: Computing the Limit of a Game

Link: https://arxiv.org/abs/2408.11146
Authors: Rashida Hakim,Jason Milionis,Christos Papadimitriou,Georgios Piliouras
Keywords: Nash equilibria, game dynamics, limit behavior, game, natural game dynamics
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments:

Abstract:During 2023, two interesting results were proven about the limit behavior of game dynamics: First, it was shown that there is a game for which no dynamics converges to the Nash equilibria. Second, it was shown that the sink equilibria of a game adequately capture the limit behavior of natural game dynamics. These two results have created a need and opportunity to articulate a principled computational theory of the meaning of the game that is based on game dynamics. Given any game in normal form, and any prior distribution of play, we study the problem of computing the asymptotic behavior of a class of natural dynamics called the noisy replicator dynamics as a limit distribution over the sink equilibria of the game. When the prior distribution has pure strategy support, we prove this distribution can be computed efficiently, in near-linear time to the size of the best-response graph. When the distribution can be sampled – for example, if it is the uniform distribution over all mixed strategy profiles – we show through experiments that the limit distribution of reasonably large games can be estimated quite accurately through sampling and simulation.

[LG-79] Total Uncertainty Quantification in Inverse PDE Solutions Obtained with Reduced-Order Deep Learning Surrogate Models

Link: https://arxiv.org/abs/2408.11145
Authors: Yuanzhe Wang,Alexandre M. Tartakovsky
Keywords: including operator learning, approximate Bayesian method, operator learning models, machine learning surrogate, PDE solutions obtained
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We propose an approximate Bayesian method for quantifying the total uncertainty in inverse PDE solutions obtained with machine learning surrogate models, including operator learning models. The proposed method accounts for uncertainty in the observations and PDE and surrogate models. First, we use the surrogate model to formulate a minimization problem in the reduced space for the maximum a posteriori (MAP) inverse solution. Then, we randomize the MAP objective function and obtain samples of the posterior distribution by minimizing different realizations of the objective function. We test the proposed framework by comparing it with the iterative ensemble smoother and deep ensembling methods for a non-linear diffusion equation with an unknown space-dependent diffusion coefficient. Among other problems, this equation describes groundwater flow in an unconfined aquifer. Depending on the training dataset and ensemble sizes, the proposed method provides similar or more descriptive posteriors of the parameters and states than the iterative ensemble smoother method. Deep ensembling underestimates uncertainty and provides less informative posteriors than the other two methods.

[LG-80] MS3D: A RG Flow-Based Regularization for GAN Training with Limited Data

Link: https://arxiv.org/abs/2408.11135
Authors: Jian Wang,Xin Lan,Yuxin Tian,Jiancheng Lv
Keywords: Generative adversarial networks, made impressive advances, avoid degradation caused, require large-scale training, Generative adversarial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Generative adversarial networks (GANs) have made impressive advances in image generation, but they often require large-scale training data to avoid degradation caused by discriminator overfitting. To tackle this issue, we investigate the challenge of training GANs with limited data, and propose a novel regularization method based on the idea of renormalization group (RG) in physics. We observe that in the limited data setting, the gradient pattern that the generator obtains from the discriminator becomes more aggregated over time. In RG context, this aggregated pattern exhibits a high discrepancy from its coarse-grained versions, which implies a high-capacity and sensitive system, prone to overfitting and collapse. To address this problem, we introduce a multi-scale structural self-dissimilarity (MS$^3$D) regularization, which constrains the gradient field to have a consistent pattern across different scales, thereby fostering a more redundant and robust system. We show that our method can effectively enhance the performance and stability of GANs under limited data scenarios, and even allow them to generate high-quality images with very few data.

[LG-81] Binocular Model: A deep learning solution for online melt pool temperature analysis using dual-wavelength Imaging Pyrometry

Link: https://arxiv.org/abs/2408.11126
Authors: Javid Akhavan,Chaitanya Krishna Vallabh,Xianyun Zhao,Souran Manoochehri
Keywords: metal Additive Manufacturing, Additive Manufacturing, Melt Pool, ensuring part quality, defect prevention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract: In metal Additive Manufacturing (AM), monitoring the temperature of the Melt Pool (MP) is crucial for ensuring part quality, process stability, defect prevention, and overall process optimization. Traditional methods are slow to converge and require extensive manual effort to translate data into actionable insights, rendering them impractical for real-time monitoring and control. To address this challenge, we propose an Artificial Intelligence (AI)-based solution aimed at reducing manual data processing reliance and improving the efficiency of transitioning from data to insight. In our study, we utilize a dataset comprising dual-wavelength real-time process monitoring data and corresponding temperature maps. We introduce a deep learning model called the “Binocular model,” which exploits dual input observations to perform a precise analysis of MP temperature in Laser Powder Bed Fusion (L-PBF). Through advanced deep learning techniques, we seamlessly convert raw data into temperature maps, significantly streamlining the process and enabling batch processing at a rate of up to 750 frames per second, approximately 1000 times faster than conventional methods. Our Binocular model achieves high accuracy in temperature estimation, evidenced by a 0.95 R-squared score, while simultaneously enhancing processing efficiency by a factor of approximately 1000. This model directly addresses the challenge of real-time MP temperature monitoring and offers insights into the encountered constraints and the benefits of our Deep Learning-based approach. By combining efficiency and precision, our work contributes to the advancement of temperature monitoring in L-PBF, thus driving progress in the field of metal AM.

[LG-82] DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

Link: https://arxiv.org/abs/2408.11121
Authors: Tom Segal,Asaf Shabtai,Yuval Elovici
Keywords: depends heavily, quality and quantity, large language models, large language, LLMs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 11 pages, 3 figures

Abstract:The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a “min-bounded” average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
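
The abstract above names the harmonic mean as one example of a "min-bounded" average. As a rough illustration of that idea only, the sketch below combines two next-token probability distributions with an element-wise harmonic mean and renormalizes the result. The function name `min_bounded_aggregate`, the choice to work on probabilities, and the renormalization step are assumptions made here, not details taken from the paper.

```python
def min_bounded_aggregate(p, q):
    """Combine two probability distributions with a harmonic mean.

    The harmonic mean of a and b is 2ab / (a + b). It never exceeds twice
    the smaller of the two values, so a token that either model considers
    very unlikely stays unlikely in the combined distribution.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")

    combined = [
        0.0 if a == 0.0 or b == 0.0 else 2.0 * a * b / (a + b)
        for a, b in zip(p, q)
    ]
    total = sum(combined)
    if total == 0.0:
        raise ValueError("the two distributions share no support")
    return [c / total for c in combined]


# Token 0 is favored by only one model, token 2 is known to only one model.
aggregated = min_bounded_aggregate([0.9, 0.1, 0.0], [0.1, 0.1, 0.8])
```

In this toy case the third token, to which the first model assigns zero probability, receives zero probability after aggregation. This is the intuition behind using such an average when each model was trained without some restricted documents.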

[LG-83] Experimentation deployment and monitoring Machine Learning models: Approaches for applying MLOps

Link: https://arxiv.org/abs/2408.11112
Authors: Diego Nogare, Ismar Frango Silveira
Keywords: Data Science, significantly enhancing decision-making, recent years, tool for industry, significantly enhancing
Categories: Machine Learning (cs.LG)

Abstract:In recent years, Data Science has become increasingly relevant as a support tool for industry, significantly enhancing decision-making in a way never seen before. In this context, the MLOps discipline emerges as a solution to automate the life cycle of Machine Learning models, ranging from experimentation to monitoring in productive environments. Research results show MLOps is a constantly evolving discipline, with challenges and solutions for integrating development and production environments, publishing models in production environments, and monitoring models throughout the end-to-end development lifecycle. This paper contributes to the understanding of MLOps techniques and their most diverse applications.

[LG-84] ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks

Link: https://arxiv.org/abs/2408.11104
Authors: Qiang Liu, Mengyu Chu, Nils Thuerey
Keywords: conflicting update directions, yield conflicting update, multiple additive terms, Physics-Informed Neural Networks, problems contain multiple
Categories: Machine Learning (cs.LG)
Comments: Project homepage: this https URL

Abstract:The loss functions of many learning problems contain multiple additive terms that can disagree and yield conflicting update directions. For Physics-Informed Neural Networks (PINNs), loss terms on initial/boundary conditions and physics equations are particularly interesting as they are well-established as highly difficult tasks. To improve learning the challenging multi-objective task posed by PINNs, we propose the ConFIG method, which provides conflict-free updates by ensuring a positive dot product between the final update and each loss-specific gradient. It also maintains consistent optimization rates for all loss terms and dynamically adjusts gradient magnitudes based on conflict levels. We additionally leverage momentum to accelerate optimizations by alternating the back-propagation of different loss terms. The proposed method is evaluated across a range of challenging PINN scenarios, consistently showing superior performance and runtime compared to baseline methods. We also test the proposed method in a classic multi-task benchmark, where the ConFIG method likewise exhibits a highly promising performance. Source code is available at this https URL.
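
The "positive dot product with each loss-specific gradient" property can be seen in a two-gradient toy case. The sketch below is not the ConFIG operator from the paper; it only assumes the simplest construction with that property, the sum of the two unit gradients, and all names and numbers are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero gradient")
    return [x / norm for x in v]


def bisector_update(g1: Vector, g2: Vector) -> Vector:
    """Sum of the two unit gradients.

    It has the same positive cosine with g1 and g2 unless they point in
    exactly opposite directions.
    """
    u1, u2 = unit(g1), unit(g2)
    return [a + b for a, b in zip(u1, u2)]


# Hypothetical conflicting gradients of a boundary loss and a physics loss.
g_boundary = [1.0, 0.0]
g_physics = [-1.0, 1.0]

naive = [a + b for a, b in zip(g_boundary, g_physics)]  # plain sum of gradients
update = bisector_update(g_boundary, g_physics)
```

Here the plain sum `naive` makes no progress on the boundary term (its dot product with `g_boundary` is zero), while `update` has a positive dot product with both gradients.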

[LG-85] What can Large Language Models Capture about Code Functional Equivalence?

Link: https://arxiv.org/abs/2408.11081
Authors: Nickil Maveli, Antonio Vergari, Shay B. Cohen
Keywords: shown great progress, learning rich representations, large code corpora, classify code fragments, pre-trained on large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 37 pages

Abstract:Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code-)LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.

[LG-86] Solving Oscillator ODEs via Soft-constrained Physics-informed Neural Network with Small Data

Link: https://arxiv.org/abs/2408.11077
Authors: Kai-liang Lu, Yu-meng Su, Cheng Qiu, Zhuo Bi, Wen-jun Zhang
Keywords: physics-informed neural network, compared physics-informed neural, conventional neural network, neural network, solving differential equations
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 17 pages, 7 figures, 2 tables, etc

Abstract:This paper compared physics-informed neural network (PINN), conventional neural network (NN) and numerical discretization methods on solving differential equations through literature research. We formalized the mathematical framework and computational flow of the soft-constrained PINN method for solving differential equations (e.g., ODEs/PDEs). Its working mechanism and its accuracy and efficiency were experimentally verified by solving typical linear and non-linear oscillator ODEs. The implementation of the PINN method based on DeepXDE is not only lightweight in code and efficient in training, but also flexible across platforms. PINN greatly reduces the need for labeled data: when the nonlinearity of the ODE is weak, a very small amount of supervised training data plus a small amount of collocation points are sufficient to predict the solution; in the minimalist case, only one or two training points (with initial values) are needed for first- or second-order ODEs, respectively. Strongly nonlinear ODEs also require only an appropriate increase in the number of training and collocation points, which still has significant advantages over conventional NN. With the aid of collocation points and the use of physical information, PINN has the ability to extrapolate data outside the time domain covered by the training set, and is robust to noisy data, thus with enhanced generalization capabilities. Training is accelerated when the gains obtained along with the reduction in the amount of data outweigh the delay caused by the increase in the loss function terms. The soft-constrained PINN method can easily impose a physical law (e.g., energy conservation) constraint by adding a regularization term to the total loss function, thus improving the solution performance of ODEs that obey this physical law.
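
The soft-constrained loss described in this abstract (a data term plus a weighted ODE-residual term evaluated at collocation points) can be sketched for the linear oscillator x'' + ω²x = 0. This is only an illustration under stated assumptions: the candidate solution is an ordinary Python function rather than a neural network, derivatives come from central finite differences rather than automatic differentiation, and every name and number is hypothetical. It does not use the DeepXDE API.

```python
import math
from typing import Callable, List, Tuple


def second_derivative(u: Callable[[float], float], t: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of u''(t)."""
    return (u(t + h) - 2.0 * u(t) + u(t - h)) / (h * h)


def soft_constrained_loss(
    u: Callable[[float], float],
    data: List[Tuple[float, float]],
    collocation: List[float],
    omega: float,
    weight: float = 1.0,
) -> float:
    """Data misfit plus weighted mean squared residual of x'' + omega^2 x = 0."""
    data_loss = sum((u(t) - x) ** 2 for t, x in data) / len(data)
    residuals = [second_derivative(u, t) + omega ** 2 * u(t) for t in collocation]
    physics_loss = sum(r * r for r in residuals) / len(residuals)
    return data_loss + weight * physics_loss


omega = 2.0
data = [(0.0, 1.0)]                               # a single labeled point
collocation = [0.1 * k for k in range(1, 21)]     # unlabeled points in (0, 2]

loss_exact = soft_constrained_loss(lambda t: math.cos(omega * t), data, collocation, omega)
loss_wrong = soft_constrained_loss(lambda t: 1.0 - t, data, collocation, omega)
```

Both candidates fit the single labeled point, so only the physics term separates them: `loss_exact` is close to zero while `loss_wrong` is large, which is how collocation points compensate for scarce labels.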

[LG-87] Toward End-to-End Bearing Fault Diagnosis for Industrial Scenarios with Spiking Neural Networks

Link: https://arxiv.org/abs/2408.11067
Authors: Yongqi Ding, Lin Zuo, Mengmeng Jing, Kunshan Yang, Biao Chen, Yunqian Yu
Keywords: Spiking neural networks, received widespread attention, low-power binary spikes, neural networks, transmit information
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 10 figures

Abstract:Spiking neural networks (SNNs) transmit information via low-power binary spikes and have received widespread attention in areas such as computer vision and reinforcement learning. However, there have been very few explorations of SNNs in more practical industrial scenarios. In this paper, we focus on the application of SNNs in bearing fault diagnosis to facilitate the integration of high-performance AI algorithms and real-world industries. In particular, we identify two key limitations of existing SNN fault diagnosis methods: inadequate encoding capacity that necessitates cumbersome data preprocessing, and non-spike-oriented architectures that constrain the performance of SNNs. To alleviate these problems, we propose a Multi-scale Residual Attention SNN (MRA-SNN) to simultaneously improve the efficiency, performance, and robustness of SNN methods. By incorporating a lightweight attention mechanism, we have designed a multi-scale attention encoding module to extract multiscale fault features from vibration signals and encode them as spatio-temporal spikes, eliminating the need for complicated preprocessing. Then, the spike residual attention block extracts high-dimensional fault features and enhances the expressiveness of sparse spikes with the attention mechanism for end-to-end diagnosis. In addition, the performance and robustness of MRA-SNN are further enhanced by introducing the lightweight attention mechanism within the spiking neurons to simulate the biological dendritic filtering effect. Extensive experiments on MFPT and JNU benchmark datasets demonstrate that MRA-SNN significantly outperforms existing methods in terms of accuracy, energy consumption and noise robustness, and is more feasible for deployment in real-world industrial scenarios.

[LG-88] Tabular Transfer Learning via Prompting LLMs

Link: https://arxiv.org/abs/2408.11063
Authors: Jaehyun Nam, Woomin Song, Seong Hyeon Park, Jihoon Tack, Sukmin Yun, Jaehyung Kim, Kyu Hwan Oh, Jinwoo Shin
Keywords: transfer learning, tabular transfer learning, Learning, transfer, obtain annotations
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: COLM 2024

Abstract:Learning with a limited amount of labeled data is a central problem in real-world applications of machine learning, as it is often expensive to obtain annotations. To deal with the scarcity of labeled data, transfer learning is a conventional approach; it suggests learning transferable knowledge by training a neural network from multiple other sources. In this paper, we investigate transfer learning of tabular tasks, which has been less studied and successful in the literature, compared to other domains, e.g., vision and language. This is because tables are inherently heterogeneous, i.e., they contain different columns and feature spaces, making transfer learning difficult. On the other hand, recent advances in natural language processing suggest that the label scarcity issue can be mitigated by utilizing the in-context learning capability of large language models (LLMs). Inspired by this and the fact that LLMs can also process tables within a unified language space, we ask whether LLMs can be effective for tabular transfer learning, in particular, under the scenarios where the source and target datasets have different formats. As a positive answer, we propose a novel tabular transfer learning framework, coined Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with LLMs. Specifically, P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts. Experimental results demonstrate that P2T outperforms previous methods on various tabular learning benchmarks, showing good promise for the important, yet underexplored tabular transfer learning problem. Code is available at this https URL.

[LG-89] Plug-in estimation of Schrödinger bridges

Link: https://arxiv.org/abs/2408.11686
Authors: Aram-Alexandre Pooladian, Jonathan Niles-Weed
Keywords: probability distributions, propose a procedure, procedure for estimating, Schrödinger bridge, estimating the Schrödinger
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 39 pages, 3 figures, 1 table

Abstract:We propose a procedure for estimating the Schrödinger bridge between two probability distributions. Unlike existing approaches, our method does not require iteratively simulating forward and backward diffusions or training neural networks to fit unknown drifts. Instead, we show that the potentials obtained from solving the static entropic optimal transport problem between the source and target samples can be modified to yield a natural plug-in estimator of the time-dependent drift that defines the bridge between two measures. Under minimal assumptions, we show that our proposal, which we call the Sinkhorn bridge, provably estimates the Schrödinger bridge with a rate of convergence that depends on the intrinsic dimensionality of the target measure. Our approach combines results from the areas of sampling, and theoretical and statistical entropic optimal transport.

[LG-90] 5G NR PRACH Detection with Convolutional Neural Networks (CNN): Overcoming Cell Interference Challenges

Link: https://arxiv.org/abs/2408.11659
Authors: Desire Guel, Arsene Kabore, Didier Bassole
Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, Random Access Channel, Physical Random Access
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:In this paper, we present a novel approach to interference detection in 5G New Radio (5G-NR) networks using Convolutional Neural Networks (CNN). Interference in 5G networks challenges high-quality service due to dense user equipment deployment and increased wireless environment complexity. Our CNN-based model is designed to detect Physical Random Access Channel (PRACH) sequences amidst various interference scenarios, leveraging the spatial and temporal characteristics of PRACH signals to enhance detection accuracy and robustness. Comprehensive datasets of simulated PRACH signals under controlled interference conditions were generated to train and validate the model. Experimental results show that our CNN-based approach outperforms traditional PRACH detection methods in accuracy, precision, recall and F1-score. This study demonstrates the potential of AI/ML techniques in advancing interference management in 5G networks, providing a foundation for future research and practical applications in optimizing network performance and reliability.

[LG-91] Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Link: https://arxiv.org/abs/2408.11641
Authors: Paul Primus, Florian Schmid, Gerhard Widmer
Keywords: systems are commonly, commonly optimized, optimized with contrastive, contrastive learning, audio retrieval systems
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: In Proceedings of the 9th Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE, Tokyo, Japan, 2024. Implementation available on GitHub: this https URL

Abstract:Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we, therefore, suggest substituting it with estimated correspondences. To this end, we propose a two-staged training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models then serve as prediction targets. We evaluate our method on the ClothoV2 and the AudioCaps benchmark and show that it improves retrieval performance, even in a restricting self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.

[LG-92] Persistent Homology via Ellipsoids

Link: https://arxiv.org/abs/2408.11450
Authors: Sara Kališnik, Bastian Rieck, Ana Žegarac
Keywords: Persistent homology, Principal Component Analysis, persistent homology involves, Rips, ellipsoid
Categories: Algebraic Topology (math.AT); Machine Learning (cs.LG)

Abstract:Persistent homology is one of the most popular methods in Topological Data Analysis. An initial step in any analysis with persistent homology involves constructing a nested sequence of simplicial complexes, called a filtration, from a point cloud. There is an abundance of different complexes to choose from, with Rips, Alpha, and witness complexes being popular choices. In this manuscript, we build a different type of geometrically informed simplicial complex, called an ellipsoid complex. This complex is based on the idea that ellipsoids aligned with tangent directions better approximate the data compared to conventional (Euclidean) balls centered at sample points that are used in the construction of Rips and Alpha complexes, for instance. We use Principal Component Analysis to estimate tangent spaces directly from samples and present algorithms as well as an implementation for computing ellipsoid barcodes, i.e., topological descriptors based on ellipsoid complexes. Furthermore, we conduct extensive experiments and compare ellipsoid barcodes with standard Rips barcodes. Our findings indicate that ellipsoid complexes are particularly effective for estimating homology of manifolds and spaces with bottlenecks from samples. In particular, the persistence intervals corresponding to a ground-truth topological feature are longer compared to the intervals obtained when using the Rips complex of the data. Furthermore, ellipsoid barcodes lead to better classification results in sparsely-sampled point clouds. Finally, we demonstrate that ellipsoid barcodes outperform Rips barcodes in classification tasks.

[LG-93] Learning Flock: Enhancing Sets of Particles for Multi Sub-State Particle Filtering with Neural Augmentation

Link: https://arxiv.org/abs/2408.11348
Authors: Itai Nuri, Nir Shlezinger
Keywords: leading family, state estimation, estimation in dynamic, dynamic systems, systems with multiple
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under review for publication in the IEEE

Abstract:A leading family of algorithms for state estimation in dynamic systems with multiple sub-states is based on particle filters (PFs). PFs often struggle when operating under complex or approximated modelling (necessitating many particles) with low latency requirements (limiting the number of particles), as is typically the case in multi-target tracking (MTT). In this work, we introduce a deep neural network (DNN) augmentation for PFs termed learning flock (LF). LF learns to correct a particles-weights set, which we coin flock, based on the relationships between all sub-particles in the set itself, while disregarding the set acquisition procedure. Our proposed LF, which can be readily incorporated into different PF flows, is designed to facilitate rapid operation by maintaining accuracy with a reduced number of particles. We introduce a dedicated training algorithm, allowing both supervised and unsupervised training, and yielding a module that supports a varying number of sub-states and particles without necessitating re-training. We experimentally show the improvements in performance, robustness, and latency of LF augmentation for radar multi-target tracking, as well as its ability to mitigate the effect of a mismatched observation modelling. We also compare and illustrate the advantages of LF over a state-of-the-art DNN-aided PF, and demonstrate that LF enhances both classic PFs as well as DNN-based filters.

[LG-94] Transfer Learning and the Early Estimation of Single-Photon Source Quality using Machine Learning Methods

Link: https://arxiv.org/abs/2408.11322
Authors: David Jacob Kedziora, Anna Musiał, Wojciech Rudno-Rudziński, Bogdan Gabrys
Keywords: devices proposed amidst, central to numerous, numerous systems, systems and devices, devices proposed
Categories: Optics (physics.optics); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
Comments: The data and software that supports the findings of this study are openly available at this https URL

Abstract:The use of single-photon sources (SPSs) is central to numerous systems and devices proposed amidst a modern surge in quantum technology. However, manufacturing schemes remain imperfect, and single-photon emission purity must often be experimentally verified via interferometry. Such a process is typically slow and costly, which has motivated growing research into whether SPS quality can be more rapidly inferred from incomplete emission statistics. Hence, this study is a sequel to previous work that demonstrated significant uncertainty in the standard method of quality estimation, i.e. the least-squares fitting of a physically motivated function, and asks: can machine learning (ML) do better? The study leverages eight datasets obtained from measurements involving an exemplary quantum emitter, i.e. a single InGaAs/GaAs epitaxial quantum dot; these eight contexts predominantly vary in the intensity of the exciting laser. Specifically, via a form of ‘transfer learning’, five ML models, three linear and two ensemble-based, are trained on data from seven of the contexts and tested on the eighth. Validation metrics quickly reveal that even a linear regressor can outperform standard fitting when it is tested on the same contexts it was trained on, but the success of transfer learning is less assured, even though statistical analysis, made possible by data augmentation, suggests its superiority as an early estimator. Accordingly, the study concludes by discussing future strategies for grappling with the problem of SPS context dissimilarity, e.g. feature engineering and model adaptation.

[LG-95] Chernoff Bounds for Tensor Expanders on Riemannian Manifolds Using Graph Laplacian Approximation

Link: https://arxiv.org/abs/2408.11276
Authors: Shih-Yu Chang
Keywords: crucial statistical tool, probability tail bound, tail bound analysis, paper addresses, addresses the advancement
Categories: Probability (math.PR); Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)

Abstract:This paper addresses the advancement of probability tail bound analysis, a crucial statistical tool for assessing the probability of large deviations of random variables from their expected values. Traditional tail bounds, such as Markov’s, Chebyshev’s, and Chernoff bounds, have proven valuable across numerous scientific and engineering fields. However, as data complexity grows, there is a pressing need to extend tail bound estimation from scalar variables to high-dimensional random objects. Existing studies often rely on the assumption of independence among high-dimensional random objects, an assumption that may not always be valid. Building on the work of researchers like Garg et al. and Chang, who employed random walks to model high-dimensional ensembles, this study introduces a more generalized approach by exploring random walks over manifolds. To address the challenges of constructing an appropriate underlying graph for a manifold, we propose a novel method that enhances random walks on graphs approximating the manifold. This approach ensures spectral similarity between the original manifold and the approximated graph, including matching eigenvalues, eigenvectors, and eigenfunctions. Leveraging the graph approximation technique proposed by Burago et al. for manifolds, we derive the tensor Chernoff bound and establish its range for random walks on a Riemannian manifold according to the underlying manifold’s spectral characteristics.

[LG-96] Effective Off-Policy Evaluation and Learning in Contextual Combinatorial Bandits RECSYS2024

Link: https://arxiv.org/abs/2408.11202
Authors: Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka Kiyohara, Masahiro Nomura, Yuta Saito
Keywords: contextual combinatorial bandits, explore off-policy evaluation, evaluation and learning, combinatorial bandits, explore off-policy
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: accepted at RecSys2024

Abstract:We explore off-policy evaluation and learning (OPE/L) in contextual combinatorial bandits (CCB), where a policy selects a subset in the action space. For example, it might choose a set of furniture pieces (a bed and a drawer) from available items (bed, drawer, chair, etc.) for interior design sales. This setting is widespread in fields such as recommender systems and healthcare, yet OPE/L of CCB remains unexplored in the relevant literature. Typical OPE/L methods such as regression and importance sampling can be applied to the CCB problem; however, they face significant challenges due to high bias or variance, exacerbated by the exponential growth in the number of available subsets. To address these challenges, we introduce a concept of factored action space, which allows us to decompose each subset into binary indicators. This formulation allows us to distinguish between the “main effect” derived from the main actions, and the “residual effect”, originating from the supplemental actions, facilitating more effective OPE. Specifically, our estimator, called OPCB, leverages an importance sampling-based approach to unbiasedly estimate the main effect, while employing a regression-based approach to deal with the residual effect with low variance. OPCB achieves substantial variance reduction compared to conventional importance sampling methods and bias reduction relative to regression methods under certain conditions, as illustrated in our theoretical analysis. Experiments demonstrate OPCB’s superior performance over typical methods in both OPE and OPL.
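
For readers unfamiliar with the importance sampling baseline mentioned in this abstract, a minimal sketch of the standard inverse propensity scoring estimator follows. It is not the OPCB estimator; the log format, the policy function, and the toy data are hypothetical, and each action is a subset encoded as a tuple of items.

```python
from typing import Callable, Hashable, List, Tuple

# One logged interaction: (context, chosen subset, reward, logging probability).
LogEntry = Tuple[Hashable, Tuple[str, ...], float, float]


def ips_estimate(
    logs: List[LogEntry],
    target_policy: Callable[[Hashable, Tuple[str, ...]], float],
) -> float:
    """Average of rewards reweighted by target probability / logging probability."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def always_bed_and_drawer(context: Hashable, action: Tuple[str, ...]) -> float:
    """Hypothetical deterministic target policy that always offers the same subset."""
    return 1.0 if action == ("bed", "drawer") else 0.0


# Hypothetical logs collected by a policy that picks each subset with probability 0.5.
logs: List[LogEntry] = [
    ("user1", ("bed", "drawer"), 1.0, 0.5),
    ("user2", ("chair",), 0.0, 0.5),
]
estimate = ips_estimate(logs, always_bed_and_drawer)
```

Because the number of possible subsets grows exponentially with the number of items, logging probabilities become tiny and the weights large, which is the variance problem the abstract refers to.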

[LG-97] The Ensemble Epanechnikov Mixture Filter

Link: https://arxiv.org/abs/2408.11164
Authors: Andrey A. Popov, Renato Zanetti
Keywords: mixture kernel density, Epanechnikov mixture kernel, Gaussian mixture kernel, Epanechnikov mixture filter, Gaussian mixture filter
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Methodology (stat.ME)

Abstract:In the high-dimensional setting, Gaussian mixture kernel density estimates become increasingly suboptimal. In this work we aim to show that it is practical to instead use the optimal multivariate Epanechnikov kernel. We make use of this optimal Epanechnikov mixture kernel density estimate for the sequential filtering scenario through what we term the ensemble Epanechnikov mixture filter (EnEMF). We provide a practical implementation of the EnEMF that is as cost efficient as the comparable ensemble Gaussian mixture filter. We show on a static example that the EnEMF is robust to growth in dimension, and also that the EnEMF has a significant reduction in error per particle on the 40-variable Lorenz '96 system.
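
As background for this abstract, the Epanechnikov kernel and a kernel density estimate built from it can be sketched in one dimension. The paper works with the multivariate kernel inside a filter; the univariate version, the function names, the sample values, and the bandwidth below are simplifying assumptions for illustration only.

```python
from typing import List


def epanechnikov(u: float) -> float:
    """Univariate Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x: float, samples: List[float], bandwidth: float) -> float:
    """Kernel density estimate at x: an equally weighted mixture of scaled kernels."""
    return sum(epanechnikov((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)


# Hypothetical ensemble members and bandwidth.
samples = [0.0, 0.5, 2.0]
bandwidth = 0.8
density_at_zero = kde(0.0, samples, bandwidth)
```

Unlike a Gaussian kernel, each mixture component has compact support, so points farther than one bandwidth from every sample receive exactly zero density.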

[LG-98] Multi-level Monte-Carlo Gradient Methods for Stochastic Optimization with Biased Oracles

Link: https://arxiv.org/abs/2408.11084
Authors: Yifan Hu, Jie Wang, Xin Chen, Niao He
Keywords: MLMC gradient methods, MLMC gradient, obtaining stochastic gradients, gradient methods, stochastic
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: A preliminary version of this manuscript has appeared in a conference proceeding. Please refer to Yifan Hu, Xin Chen, and Niao He. On the bias-variance-cost tradeoff of stochastic optimization. Advances in Neural Information Processing Systems, 2021

Abstract:We consider stochastic optimization when one only has access to biased stochastic oracles of the objective and the gradient, and obtaining stochastic gradients with low biases comes at high costs. This setting captures various optimization paradigms, such as conditional stochastic optimization, distributionally robust optimization, shortfall risk optimization, and machine learning paradigms, such as contrastive learning. We examine a family of multi-level Monte Carlo (MLMC) gradient methods that exploit a delicate tradeoff among bias, variance, and oracle cost. We systematically study their total sample and computational complexities for strongly convex, convex, and nonconvex objectives and demonstrate their superiority over the widely used biased stochastic gradient method. When combined with the variance reduction techniques like SPIDER, these MLMC gradient methods can further reduce the complexity in the nonconvex regime. Our results imply that a series of stochastic optimization problems with biased oracles, previously considered to be more challenging, is fundamentally no harder than the classical stochastic optimization with unbiased oracles. We also delineate the boundary conditions under which these problems become more difficult. Moreover, MLMC gradient methods significantly improve the best-known complexities in the literature for conditional stochastic optimization and shortfall risk optimization. Our extensive numerical experiments on distributionally robust optimization, pricing and staffing scheduling problems, and contrastive learning demonstrate the superior performance of MLMC gradient methods.

Information Retrieval

[IR-0] Do We Really Need to Drop Items with Missing Modalities in Multimodal Recommendation? CIKM2024

Link: https://arxiv.org/abs/2408.11767
Authors: Daniele Malitesta, Emanuele Rossi, Claudio Pomo, Tommaso Di Noia, Fragkiskos D. Malliaros
Keywords: Generally, multimodal, multimodal recommendation, multimodal recommender system, recommendation
Categories: Information Retrieval (cs.IR)
Comments: Accepted at CIKM 2024 in the short paper track

Abstract:Generally, items with missing modalities are dropped in multimodal recommendation. However, with this work, we question this procedure, highlighting that it would further damage the pipeline of any multimodal recommender system. First, we show that the lack of (some) modalities is, in fact, a widely-diffused phenomenon in multimodal recommendation. Second, we propose a pipeline that imputes missing multimodal features in recommendation by leveraging traditional imputation strategies in machine learning. Then, given the graph structure of the recommendation data, we also propose three more effective imputation solutions that leverage the item-item co-purchase graph and the multimodal similarities of co-interacted items. Our method can be plugged into any multimodal RSs in the literature working as an untrained pre-processing phase, showing (through extensive experiments) that any data pre-filtering is not only unnecessary but also harmful to the performance.

[IR-1] A Novel Evaluation Perspective on GNNs-based Recommender Systems through the Topology of the User-Item Graph RECSYS2024

Link: https://arxiv.org/abs/2408.11762
Authors: Daniele Malitesta, Claudio Pomo, Vito Walter Anelli, Alberto Carlo Maria Mancino, Tommaso Di Noia, Eugenio Di Sciascio
Keywords: encountered great success, graph neural networks, neural networks, encountered great, great success
Categories: Information Retrieval (cs.IR)
Comments: Accepted at RecSys 2024 in the reproducibility track. arXiv admin note: substantial text overlap with arXiv:2308.10778

Abstract:Recently, graph neural networks (GNNs)-based recommender systems have encountered great success in recommendation. As the number of GNNs approaches rises, some works have started questioning the theoretical and empirical reasons behind their superior performance. Nevertheless, this investigation still disregards that GNNs treat the recommendation data as a topological graph structure. Building on this assumption, in this work, we provide a novel evaluation perspective on GNNs-based recommendation, which investigates the impact of the graph topology on the recommendation performance. To this end, we select some (topological) properties of the recommendation data and three GNNs-based recommender systems (i.e., LightGCN, DGCF, and SVD-GCN). Then, starting from three popular recommendation datasets (i.e., Yelp2018, Gowalla, and Amazon-Book) we sample them to obtain 1,800 size-reduced datasets that still resemble the original ones but can encompass a wider range of topological structures. We use this procedure to build a large pool of samples for which data characteristics and recommendation performance of the selected GNNs models are measured. Through an explanatory framework, we find strong correspondences between graph topology and GNNs performance, offering a novel evaluation perspective on these models.

[IR-2] Mathematical Information Retrieval: Search and Question Answering

Link: https://arxiv.org/abs/2408.11646
Authors: Richard Zanibbi,Behrooz Mansouri,Anurag Agarwal
Keywords-EN: essential for technical, Mathematical information, mathematical question answering, multimodal search engines, developed multimodal search
Categories: Information Retrieval (cs.IR)
*Comments: [DRAFT] 1st draft

Click to view abstract

Abstract:Mathematical information is essential for technical work, but its creation, interpretation, and search are challenging. To help address these challenges, researchers have developed multimodal search engines and mathematical question answering systems. This book begins with a simple framework characterizing the information tasks that people and systems perform as we work to answer math-related questions. The framework is used to organize and relate the other core topics of the book, including interactions between people and systems, representing math formulas in sources, and evaluation. We close with some key questions and concrete directions for future work. This book is intended for use by students, instructors, and researchers, and those who simply wish that it was easier to find and use mathematical information.

[IR-3] End-to-End Cost-Effective Incentive Recommendation under Budget Constraint with Uplift Modeling RECSYS2024

Link: https://arxiv.org/abs/2408.11623
Authors: Zexu Sun,Hao Yang,Dugang Liu,Yunpeng Weng,Xing Tang,Xiuqiang He
Keywords-EN: enhance user engagement, modern online platforms, increase platform revenue, essential factors, factors that enhance
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: Accepted by RecSys 2024

Click to view abstract

Abstract:In modern online platforms, incentives are essential factors that enhance user engagement and increase platform revenue. Over recent years, uplift modeling has been introduced as a strategic approach to assign incentives to individual customers. Especially in many real-world applications, online platforms can only incentivize customers with specific budget constraints. This problem can be reformulated as the multi-choice knapsack problem. This optimization aims to select the optimal incentive for each customer to maximize the return on investment. Recent works in this field frequently tackle the budget allocation problem using a two-stage approach. However, this solution is confronted with the following challenges: (1) The causal inference methods often ignore the domain knowledge in online marketing, where the expected response curve of a customer should be monotonic and smooth as the incentive increases. (2) An optimality gap between the two stages results in inferior sub-optimal allocation performance due to the loss of the incentive recommendation information for the uplift prediction under the limited budget constraint. To address these challenges, we propose a novel End-to-End Cost-Effective Incentive Recommendation (E3IR) model under budget constraints. Specifically, our methods consist of two modules, i.e., the uplift prediction module and the differentiable allocation module. In the uplift prediction module, we construct prediction heads to capture the incremental improvement between adjacent treatments with the marketing domain constraints (i.e., monotonic and smooth). We incorporate integer linear programming (ILP) as a differentiable layer input in the allocation module. Furthermore, we conduct extensive experiments on public and real product datasets, demonstrating that our E3IR improves allocation performance compared to existing two-stage approaches.
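
The multi-choice knapsack formulation mentioned in the abstract can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the E3IR model: the function name `allocate_incentives`, the integer costs, and the objective (total predicted uplift rather than return on investment) are all hypothetical choices made for the example, and the paper itself uses a differentiable ILP layer instead of a DP.

```python
from typing import List, Tuple


def allocate_incentives(
    options: List[List[Tuple[int, float]]], budget: int
) -> Tuple[float, List[int]]:
    """Multi-choice knapsack by dynamic programming (illustrative only).

    options[u] lists (cost, predicted_uplift) pairs for customer u, and
    exactly one pair is chosen per customer. Costs are non-negative
    integers. Returns (best total uplift, chosen option index per customer).
    """
    neg = float("-inf")
    # best[b] = max total uplift over the customers seen so far at total cost b
    best = [0.0] + [neg] * budget
    picks: List[List[int]] = []  # picks[u][b] = option chosen for u at cost b
    for opts in options:
        new_best = [neg] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg:
                continue
            for k, (cost, uplift) in enumerate(opts):
                nb = b + cost
                if nb <= budget and best[b] + uplift > new_best[nb]:
                    new_best[nb] = best[b] + uplift
                    pick[nb] = k
        best = new_best
        picks.append(pick)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg:
        raise ValueError("no feasible allocation within budget")

    # Backtrack from the best final cost to recover one choice per customer.
    chosen = [0] * len(options)
    b = end
    for u in range(len(options) - 1, -1, -1):
        k = picks[u][b]
        chosen[u] = k
        b -= options[u][k][0]
    return best[end], chosen
```

For example, with two customers who each have a free "no incentive" option and two paid coupons, `allocate_incentives([[(0, 0.0), (2, 0.3), (4, 0.5)], [(0, 0.0), (2, 0.1), (4, 0.6)]], 4)` spends the whole budget on the second customer's larger coupon.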

[IR-4] DTN: Deep Multiple Task-specific Feature Interactions Network for Multi-Task Recommendation

Link: https://arxiv.org/abs/2408.11611
Authors: Yaowen Bi,Yuteng Lian,Jie Cui,Jun Liu,Peijian Wang,Guanghui Li,Xuejun Chen,Jinglin Zhao,Hao Wen,Jing Zhang,Zhaoqi Zhang,Wenzhuo Song,Yang Sun,Weiwei Zhang,Mingchen Cai,Guanxing Zhang
Keywords-EN: Neural-based multi-task learning, Neural-based multi-task, MTL, MTL models, DTN
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Neural-based multi-task learning (MTL) has been successfully applied to many recommendation applications. However, these MTL models (e.g., MMoE, PLE) did not consider feature interaction during the optimization, which is crucial for capturing complex high-order features and has been widely used in ranking models for real-world recommender systems. Moreover, through feature importance analysis across various tasks in MTL, we have observed an interesting divergence phenomenon that the same feature can have significantly different importance across different tasks in MTL. To address these issues, we propose Deep Multiple Task-specific Feature Interactions Network (DTN) with a novel model structure design. DTN introduces multiple diversified task-specific feature interaction methods and task-sensitive network in MTL networks, enabling the model to learn task-specific diversified feature interaction representations, which improves the efficiency of joint representation learning in a general setup. We applied DTN to our company’s real-world E-commerce recommendation dataset, which consisted of over 6.3 billion samples, and the results demonstrated that DTN significantly outperformed state-of-the-art MTL models. Moreover, during online evaluation of DTN in a large-scale E-commerce recommender system, we observed a 3.28% increase in clicks, a 3.10% increase in orders and a 2.70% increase in GMV (Gross Merchandise Value) compared to the state-of-the-art MTL models. Finally, extensive offline experiments conducted on public benchmark datasets demonstrate that DTN can be applied to various scenarios beyond recommendations, enhancing the performance of ranking models.

[IR-5] Calibrating the Predictions for Top-N Recommendations RECSYS2024

Link: https://arxiv.org/abs/2408.11596
Authors: Masahiro Sato
Keywords-EN: Well-calibrated predictions, top-N items, top-N, preferences are essential, items
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: accepted at RecSys 2024

Click to view abstract

Abstract:Well-calibrated predictions of user preferences are essential for many applications. Since recommender systems typically select the top-N items for users, calibration for those top-N items, rather than for all items, is important. We show that previous calibration methods result in miscalibrated predictions for the top-N items, despite their excellent calibration performance when evaluated on all items. In this work, we address the miscalibration in the top-N recommended items. We first define evaluation metrics for this objective and then propose a generic method to optimize calibration models focusing on the top-N items. It groups the top-N items by their ranks and optimizes distinct calibration models for each group with rank-dependent training weights. We verify the effectiveness of the proposed method for both explicit and implicit feedback datasets, using diverse classes of recommender models.
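
To make the "all items versus top-N items" distinction concrete, the sketch below computes the standard expected calibration error (ECE) either over every prediction or only over each user's N highest-scored items. It is an illustration under assumptions: ECE with equal-width bins is a common calibration metric and is not necessarily the metric the paper defines, and the helper names `ece` and `top_n_pairs` are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, int]  # (predicted probability, observed 0/1 label)


def ece(pairs: List[Pair], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width probability bins."""
    if not pairs:
        return 0.0
    bins: List[List[Pair]] = [[] for _ in range(n_bins)]
    for p, y in pairs:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    error = 0.0
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            mean_label = sum(y for _, y in members) / len(members)
            error += len(members) / len(pairs) * abs(mean_pred - mean_label)
    return error


def top_n_pairs(per_user: Dict[str, List[Pair]], n: int) -> List[Pair]:
    """Keep only each user's n highest-scored items."""
    kept: List[Pair] = []
    for preds in per_user.values():
        kept.extend(sorted(preds, key=lambda t: t[0], reverse=True)[:n])
    return kept
```

Comparing `ece(all_pairs)` with `ece(top_n_pairs(per_user, n))` on the same model shows whether predictions that look calibrated overall remain calibrated on the items that are actually recommended.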

[IR-6] Oh Behave! Country Representation Dynamics Created by Feedback Loops in Music Recommender Systems RECSYS2024

Link: https://arxiv.org/abs/2408.11565
Authors: Oleg Lesota,Jonas Geiger,Max Walder,Dominik Kowald,Markus Schedl
Keywords-EN: Recent work suggests, music recommender systems, Recent work, disproportionally frequent recommendations, training data
Categories: Information Retrieval (cs.IR)
*Comments: RecSys 2024

Click to view abstract

Abstract:Recent work suggests that music recommender systems are prone to disproportionally frequent recommendations of music from countries more prominently represented in the training data, notably the US. However, it remains unclear to what extent feedback loops in music recommendation influence the dynamics of such imbalance. In this work, we investigate the dynamics of representation of local (i.e., country-specific) and US-produced music in user profiles and recommendations. To this end, we conduct a feedback loop simulation study using the standardized LFM-2b dataset. The results suggest that most of the investigated recommendation models decrease the proportion of music from local artists in their recommendations. Furthermore, we find that models preserving average proportions of US and local music do not necessarily provide country-calibrated recommendations. We also look into popularity calibration and, surprisingly, find that the most popularity-calibrated model in our study (ItemKNN) provides the least country-calibrated recommendations. In addition, users from less represented countries (e.g., Finland) are, in the long term, most affected by the under-representation of their local music in recommendations.

[IR-7] A Quick trustworthy spectral detection QA system based on the SDAAP Dataset and large language model

Link: https://arxiv.org/abs/2408.11557
Authors: Jiheng Liang,Ziru Yu,Zujie Xie,Xiangyang Yu
Keywords-EN: Large Language Model, natural language processing, Large Language, Language Model, demonstrated significant success
Categories: Information Retrieval (cs.IR)
*Comments: 16 pages, 10 figures, 3 tables

Click to view abstract

Abstract:Large Language Models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks within the general domain. The emergence of LLMs has introduced innovative methodologies across diverse fields, including the natural sciences. Researchers aim to implement automated, concurrent processes driven by LLMs to supplant conventional manual, repetitive and labor-intensive work. In the domain of spectral analysis and detection, it is imperative for researchers to autonomously acquire pertinent knowledge across various research objects, which encompasses the spectroscopic techniques and the chemometric methods that are employed in experiments and analysis. Paradoxically, despite the recognition of spectroscopic detection as an effective analytical method, the fundamental process of knowledge retrieval remains both time-intensive and repetitive. In response to this challenge, we first introduced the Spectral Detection and Analysis Based Paper (SDAAP) dataset, which is the first open-source textual knowledge dataset for spectral analysis and detection and contains annotated literature data as well as corresponding knowledge instruction data. Subsequently, we also designed an automated Q&A framework based on the SDAAP dataset, which can retrieve relevant knowledge and generate high-quality responses by extracting entities in the input as retrieval parameters. It is worth noting that, within this framework, the LLM is only used as a tool to provide generalizability, while the RAG technique is used to accurately capture the source of the knowledge. This approach not only improves the quality of the generated responses, but also ensures the traceability of the knowledge. Experimental results show that our framework generates responses with more reliable expertise compared to the baseline.

[IR-8] LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding

Link: https://arxiv.org/abs/2408.11523
Authors: Zhizhong Wan,Bin Yin,Junjie Xie,Fei Jiang,Xiang Li,Wei Lin
Keywords-EN: Click-Through Rate, provide personalized recommendation, personalized recommendation services, Recommendation System, Large Language Models
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Click-Through Rate (CTR) prediction is crucial for Recommendation Systems (RS), which aim to provide personalized recommendation services for users in many areas such as food delivery and e-commerce. However, traditional RS rely on collaborative signals, which lack semantic understanding of real-time scenes. We also noticed that a major challenge in utilizing Large Language Models (LLMs) for practical recommendation purposes is their efficiency in dealing with long text input. To address these problems, we propose Large Language Model Aided Real-time Scene Recommendation (LARR), which adopts LLMs for semantic understanding and utilizes real-time scene information in RS without requiring the LLM to process the entire real-time scene text directly, thereby enhancing the efficiency of LLM-based CTR modeling. Specifically, recommendation domain-specific knowledge is injected into the LLM, and then RS employs an aggregation encoder to build real-time scene information from the LLM’s separate outputs. First, an LLM is continually pretrained on a corpus built from recommendation data with the aid of special tokens. Subsequently, the LLM is fine-tuned via contrastive learning on three kinds of sample construction strategies. Through this step, the LLM is transformed into a text embedding model. Finally, the LLM’s separate outputs for different scene features are aggregated by an encoder and aligned to collaborative signals in RS, enhancing the performance of the recommendation model.

[IR-9] Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation

Link: https://arxiv.org/abs/2408.11372
Authors: Hao Wang,Yongqiang Han,Kefan Wang,Kai Cheng,Zhen Wang,Wei Guo,Yong Liu,Defu Lian,Enhong Chen
Keywords-EN: interacting with items, recommendation systems, Efficient Behavior Miner, Multi-Behavior Sequential Recommendation, enhance recommendation performance
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:In the realm of recommendation systems, users exhibit a diverse array of behaviors when interacting with items. This phenomenon has spurred research into learning the implicit semantic relationships between these behaviors to enhance recommendation performance. However, these methods often entail high computational complexity. To address concerns regarding efficiency, pre-training presents a viable solution. Its objective is to extract knowledge from extensive pre-training data and fine-tune the model for downstream tasks. Nevertheless, previous pre-training methods have primarily focused on single-behavior data, while multi-behavior data contains significant noise. Additionally, the fully fine-tuning strategy adopted by these methods still imposes a considerable computational burden. In response to this challenge, we propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation. Specifically, in the pre-training stage, we commence by proposing a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales, thereby facilitating the comprehension of the contextual semantics of multi-behavior sequences. Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module, which generates personalized, progressive, and diverse prompts to fully exploit the potential of the pre-trained model effectively. Extensive experiments on three real-world datasets have unequivocally demonstrated that DPCPL not only exhibits high efficiency and effectiveness, requiring minimal parameter adjustments but also surpasses the state-of-the-art performance across a diverse range of downstream tasks.

[IR-10] Deep Tree-based Retrieval for Efficient Recommendation: Theory and Method

Link: https://arxiv.org/abs/2408.11345
Authors: Ze Liu,Jin Zhang,Chao Feng,Defu Lian,Jie Wang,Enhong Chen
Keywords-EN: achieve remarkable improvements, deep recommendation models, tree-based deep recommendation, deep recommendation, recommendation models
Categories: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:With the development of deep learning techniques, deep recommendation models also achieve remarkable improvements in terms of recommendation accuracy. However, due to the large number of candidate items in practice and the high cost of preference computation, these methods also suffer from low efficiency of recommendation. The recently proposed tree-based deep recommendation models alleviate the problem by directly learning tree structure and representations under the guidance of recommendation objectives. However, such models have shortcomings. The max-heap assumption in the hierarchical tree, in which the preference for a parent node should be the maximum between the preferences for its children, is difficult to satisfy in their binary classification objectives. To this end, we propose Tree-based Deep Retrieval (TDR for short) for efficient recommendation. In TDR, all the trees generated during the training process are retained to form the forest. When learning the node representation of each tree, we have to satisfy the max-heap assumption as much as possible and mimic beam search behavior over the tree in the training stage. This is achieved by TDR to regard the training task as multi-classification over tree nodes at the same level. However, the number of tree nodes grows exponentially with levels, making us train the preference model with the guidance of the sampled-softmax technique. The experiments are conducted on real-world datasets, validating the effectiveness of the proposed preference model learning method and tree learning method.
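
The role of the max-heap assumption in tree-based retrieval can be seen in a toy beam search. The sketch below is not TDR itself: it assumes a complete binary tree stored heap-style (children of node `i` are `2*i+1` and `2*i+2`), uses hand-written preference scores instead of a learned model, and the function names are hypothetical.

```python
from typing import Callable, Dict, List


def max_heap_scores(leaf_scores: Dict[int, float]) -> Dict[int, float]:
    """Fill internal nodes so each parent equals the max of its children."""
    scores = dict(leaf_scores)
    first_leaf = min(leaf_scores)
    for node in range(first_leaf - 1, -1, -1):
        scores[node] = max(scores[2 * node + 1], scores[2 * node + 2])
    return scores


def beam_search(score: Callable[[int], float], depth: int, beam_size: int) -> List[int]:
    """Walk from the root (node 0), keeping the best beam_size nodes per level."""
    frontier = [0]
    for _ in range(depth):
        children = [c for node in frontier for c in (2 * node + 1, 2 * node + 2)]
        children.sort(key=score, reverse=True)
        frontier = children[:beam_size]
    return frontier


# Leaves 3..6 of a depth-2 tree carry the true item preferences.
leaves = {3: 0.1, 4: 0.7, 5: 0.9, 6: 0.2}
heap = max_heap_scores(leaves)
# Internal scores that violate the max-heap property (node 1 is overrated).
violated = {**leaves, 0: 1.0, 1: 0.95, 2: 0.3}
```

When the internal scores satisfy the max-heap property, `beam_search(heap.__getitem__, 2, 1)` reaches the most preferred leaf even with a beam of one, whereas the same search over `violated` is led down the wrong branch.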

[IR-11] Parallel Algorithms for Median Consensus Clustering in Complex Networks

Link: https://arxiv.org/abs/2408.11331
Authors: Md Taufique Hussain,Mahantesh Halappanavar,Samrat Chatterjee,Filippo Radicchi,Santo Fortunato,Ariful Azad
Keywords-EN: algorithm, clustering solutions, Abstract, median set, find median set
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS); Social and Information Networks (cs.SI)
*Comments: 12 pages

Click to view abstract

Abstract:We develop an algorithm that finds the consensus of many different clustering solutions of a graph. We formulate the problem as a median set partitioning problem and propose a greedy optimization technique. Unlike other approaches that find median set partitions, our algorithm takes graph structure into account and finds a comparable quality solution much faster than the other approaches. For graphs with known communities, our consensus partition captures the actual community structure more accurately than alternative approaches. To make it applicable to large graphs, we remove sequential dependencies from our algorithm and design a parallel algorithm. Our parallel algorithm achieves 35x speedup when utilizing 64 processing cores for large real-world graphs from single-cell experiments.
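
The median partition objective can be sketched in a few lines. The abstract does not say which distance between partitions is used, so this example assumes the pair-counting (Mirkin) distance, and instead of the paper's greedy graph-aware optimization it simply returns the input partition with the smallest total distance to all inputs. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pair_disagreements(a: Sequence[int], b: Sequence[int]) -> int:
    """Count node pairs grouped together in one partition but not the other."""
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )


def best_input_partition(partitions: List[Sequence[int]]) -> Tuple[Sequence[int], int]:
    """Return the input partition minimizing the summed distance to all inputs."""
    costs = [sum(pair_disagreements(p, q) for q in partitions) for p in partitions]
    best = min(range(len(partitions)), key=costs.__getitem__)
    return partitions[best], costs[best]
```

Each partition is a list of cluster labels indexed by node, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same clustering and have distance zero. A true median partition may lie outside the input set; restricting the search to the inputs keeps the example short.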

[IR-12] Reading with Intent

Link: https://arxiv.org/abs/2408.11189
Authors: Benjamin Reichman,Kartik Talamadupula,Toshish Jawale,Larry Heck
Keywords-EN: integrating external information, external information sources, Retrieval augmented generation, RAG systems, open internet
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Retrieval augmented generation (RAG) systems augment how knowledgeable language models are by integrating external information sources such as Wikipedia, internal documents, scientific papers, or the open internet. RAG systems that rely on the open internet as their knowledge source have to contend with the complexities of human-generated content. Human communication extends much deeper than just the words rendered as text. Intent, tonality, and connotation can all change the meaning of what is being conveyed. Recent real-world deployments of RAG systems have shown some difficulty in understanding these nuances of human communication. One significant challenge for these systems lies in processing sarcasm. Though the Large Language Models (LLMs) that make up the backbone of these RAG systems are able to detect sarcasm, they currently do not always use these detections for the subsequent processing of text. To address these issues, in this paper, we synthetically generate sarcastic passages from the Natural Questions Wikipedia retrieval corpus. We then test the impact of these passages on the performance of both the retriever and reader portions of the RAG pipeline. We introduce a prompting system designed to enhance the model’s ability to interpret and generate responses in the presence of sarcasm, thus improving overall system performance. Finally, we conduct ablation studies to validate the effectiveness of our approach, demonstrating improvements in handling sarcastic content within RAG systems.

[IR-13] Public Health in Disaster: Emotional Health and Life Incidents Extraction during Hurricane Harvey

Link: https://arxiv.org/abs/2408.11133
Authors: Thomas Hoang,Quynh Anh Nguyen,Long Nguyen
Keywords-EN: causing severe damage, Countless disasters, climate change, causing severe, resulted from climate
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Countless disasters have resulted from climate change, causing severe damage to infrastructure and the economy. These disasters have significant societal impacts, necessitating mental health services for the millions affected. To prepare for and respond effectively to such events, it is important to understand people’s emotions and the life incidents they experience before and after a disaster strikes. In this case study, we collected a dataset of approximately 400,000 public tweets related to the storm. Using a BERT-based model, we predicted the emotions associated with each tweet. To efficiently identify these topics, we utilized the Latent Dirichlet Allocation (LDA) technique for topic modeling, which allowed us to bypass manual content analysis and extract meaningful patterns from the data. However, rather than stopping at topic identification like previous methods \cite{math11244910}, we further refined our analysis by integrating Graph Neural Networks (GNN) and Large Language Models (LLM). The GNN was employed to generate embeddings and construct a similarity graph of the tweets, which was then used to optimize clustering. Subsequently, we used an LLM to automatically generate descriptive names for each event cluster, offering critical insights for disaster preparedness and response strategies.

[IR-14] Mistral-SPLADE: LLMs for better Learned Sparse Retrieval

Link: https://arxiv.org/abs/2408.11119
Authors: Meet Doshi,Vishwajeet Kumar,Rudra Murthy,Vignesh P,Jaydeep Sen
Keywords-EN: embedding-based dense retrievers, Learned Sparse Retrievers, traditional keyword-based sparse, keyword-based sparse retrievers, Sparse Retrievers
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Learned Sparse Retrievers (LSR) have evolved into an effective retrieval strategy that can bridge the gap between traditional keyword-based sparse retrievers and embedding-based dense retrievers. At its core, learned sparse retrievers try to learn the most important semantic keyword expansions from a query and/or document which can facilitate better retrieval with overlapping keyword expansions. LSR like SPLADE has typically been using encoder only models with MLM (masked language modeling) style objective in conjunction with known ways of retrieval performance improvement such as hard negative mining, distillation, etc. In this work, we propose to use decoder-only model for learning semantic keyword expansion. We posit, decoder only models that have seen much higher magnitudes of data are better equipped to learn keyword expansions needed for improved retrieval. We use Mistral as the backbone to develop our Learned Sparse Retriever similar to SPLADE and train it on a subset of sentence-transformer data which is often used for training text embedding models. Our experiments support the hypothesis that a sparse retrieval model based on decoder only large language model (LLM) surpasses the performance of existing LSR systems, including SPLADE and all its variants. The LLM based model (Echo-Mistral-SPLADE) now stands as a state-of-the-art learned sparse retrieval model on the BEIR text retrieval benchmark.
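
As background for how a SPLADE-style model turns language-model outputs into sparse keyword expansions, the sketch below applies the usual SPLADE pooling (here the max variant): each vocabulary term's weight is the maximum over input positions of `log(1 + ReLU(logit))`. It is a toy illustration with hand-written logits and a three-word vocabulary; it does not reproduce the Mistral backbone or training setup described in the abstract, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def splade_weights(logits: Sequence[Sequence[float]], vocab: List[str]) -> Dict[str, float]:
    """Pool per-position vocabulary logits into sparse term weights.

    logits[i][j] is the LM-head logit of vocabulary term j at input position i.
    Terms whose pooled weight is zero are dropped, which yields sparsity.
    """
    weights: Dict[str, float] = {}
    for j, term in enumerate(vocab):
        w = max(math.log1p(max(row[j], 0.0)) for row in logits)
        if w > 0.0:
            weights[term] = w
    return weights


def sparse_dot(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Relevance score: dot product over the terms both vectors share."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)
```

Because a term can receive a positive weight even if it never appears in the input text, a query and a document can match through shared expansion terms, which is the "overlapping keyword expansions" effect the abstract refers to.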

[IR-15] LLM Agents Improve Semantic Code Search

Link: https://arxiv.org/abs/2408.11058
Authors: Sarthak Jain(University of Illinois Urbana Champaign and Cisco),Aditya Dora(University of Illinois Urbana Champaign),Ka Seng Sam(University of Illinois Urbana Champaign),Prabhat Singh(Cisco)
Keywords-EN: solutions to problems, key task, developing solutions, Retrieval Augmented Generation, Augmented Generation
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*Comments: 12 pages, 1 Figure

Click to view abstract

Abstract:Code Search is a key task that many programmers often have to perform while developing solutions to problems. Current methodologies suffer from an inability to perform accurately on prompts that contain some ambiguity or ones that require additional context relative to a code-base. We introduce the approach of using Retrieval Augmented Generation (RAG) powered agents to inject information into user prompts, allowing for better inputs into embedding models. By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned. Additionally, we introduce a multi-stream ensemble approach which, when paired with an agentic workflow, can obtain improved retrieval accuracy, and which we deploy in an application called RepoRift (this http URL). Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods, achieving a 78.2% success rate at Success@10 and a 34.6% success rate at Success@1. This research presents a substantial advancement in semantic code search, highlighting the potential of agentic LLMs and RAG to enhance code retrieval systems.
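
For readers unfamiliar with the Success@k figures quoted in the abstract, the sketch below shows one common way such a metric is computed: the fraction of queries whose single correct snippet appears among the top k retrieved results. This definition and the names `success_at_k`, `ranked_results`, and `gold` are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List


def success_at_k(ranked_results: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold snippet id is within the top-k results."""
    if not ranked_results:
        return 0.0
    hits = sum(
        1 for query, ranked in ranked_results.items() if gold[query] in ranked[:k]
    )
    return hits / len(ranked_results)


# Toy example: two queries, each with one correct snippet id.
ranked = {
    "parse json file": ["s7", "s2", "s9"],
    "reverse linked list": ["s4", "s1", "s3"],
}
gold = {"parse json file": "s2", "reverse linked list": "s4"}
```

On this toy data only the second query has its correct snippet ranked first, while both have it within the top three, so Success@1 is lower than Success@3.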

Attachment Download

Click to download today's full paper list