This blog post presents the latest paper list retrieved from Arxiv.org on 2024-09-06. It is updated automatically and grouped into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: the paper data is retrieved from Arxiv.org daily and updated automatically around 10:30 each morning.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically around 10:30 each day.

Table of Contents

Overview (2024-09-06)

360 papers were updated today, including:

  • Natural Language Processing: 47 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 61 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 83 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 114 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Link: https://arxiv.org/abs/2409.03757
Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
Keywords: gained increasing attention, scene encoding strategies, encoding strategies playing, increasing attention, gained increasing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL, Github: this https URL


Abstract:Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.

[NLP-1] WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

Link: https://arxiv.org/abs/2409.03753
Authors: Yuntian Deng, Wenting Zhao, Jack Hessel, Xiang Ren, Claire Cardie, Yejin Choi
Keywords: offers exciting opportunities, data offers exciting, study user-chatbot interactions, conversation data offers, real-world conversation data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:


Abstract:The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis’s utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.

[NLP-2] Attention Heads of Large Language Models: A Survey

Link: https://arxiv.org/abs/2409.03752
Authors: Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li
Keywords: Large Language Models, Large Language, Language Models, advent of ChatGPT, black-box systems
Subjects: Computation and Language (cs.CL)
Comments: 20 pages, 11 figures, 4 tables


Abstract:Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aiming to identify the essence of their reasoning bottlenecks, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the interpretability and underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions. Our reference list is open-sourced at this https URL.

[NLP-3] Planning In Natural Language Improves LLM Search For Code Generation

Link: https://arxiv.org/abs/2409.03733
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
Keywords: scaling training compute, scaling inference compute, yielded analogous gains, training compute, compute has led
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
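The observations-to-plans-to-code search described in the abstract can be sketched as a three-stage loop. Everything below is an illustrative reconstruction, not the paper's implementation: `llm` stands in for any language-model API, and the prompts and stage sizes are invented for the example.

```python
# Illustrative sketch of plan-based search: observations -> plans -> code.
# `llm` is a stand-in for any language-model API; prompts and stage sizes
# here are invented, not taken from the paper.
from itertools import combinations

def plan_search(problem, llm, n_observations=4, plans_per_pair=1):
    # Stage 1: elicit diverse natural-language observations about the problem.
    observations = [
        llm(f"Observation {i} about how to solve: {problem}")
        for i in range(n_observations)
    ]
    # Stage 2: combine pairs of observations into candidate plans,
    # which is where the diversity of the search comes from.
    plans = [
        llm(f"Plan for '{problem}' using ideas: {'; '.join(pair)}")
        for pair in combinations(observations, 2)
        for _ in range(plans_per_pair)
    ]
    # Stage 3: translate each plan into a code candidate.
    return [llm(f"Implement this plan as code: {plan}") for plan in plans]
```

Searching over natural-language plans in this way multiplies the number of distinct solution strategies explored, since each observation subset seeds a different plan before any code is written.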

[NLP-4] RAG based Question-Answering for Contextual Response Prediction System CIKM’24

Link: https://arxiv.org/abs/2409.03708
Authors: Sriram Veturi, Saurabh Vaichal, Nafis Irtiza Tripto, Reshma Lal Jagadheesh, Nian Yan
Keywords: Large Language Models, Natural Language Processing, Large Language, Language Models, Language Processing
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at the 1st Workshop on GenAI and RAG Systems for Enterprise, CIKM’24. 6 pages


Abstract:Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
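The retrieve-then-generate pattern this paper builds on can be sketched in a few lines. This is a generic illustration, not the production system from the abstract: token overlap stands in for a real vector index, and `generate` is a placeholder for an LLM call.

```python
# Generic retrieve-then-generate sketch of the RAG pattern. Token overlap
# stands in for a real vector index, and `generate` is a placeholder for an
# LLM call; none of this is the paper's actual system.
def retrieve(query, documents, k=2):
    """Rank knowledge documents by token overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer(query, documents, history, generate, k=2):
    """Ground the response in retrieved documents plus prior chat history."""
    context = "\n".join(retrieve(query, documents, k))
    prompt = f"Context:\n{context}\n\nHistory:\n{history}\n\nQuestion: {query}"
    return generate(prompt)
```

Grounding the prompt in retrieved documents is what lets the system answer from a knowledge base instead of hallucinating, which is the core claim the paper evaluates.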

[NLP-5] A Different Level Text Protection Mechanism With Differential Privacy

Link: https://arxiv.org/abs/2409.03707
Authors: Qingwen Fu
Keywords: BERT pre-training model, BERT pre-training, pre-training model, model and proves, proves the effectiveness
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:The article introduces a method for extracting words of different degrees of importance based on the BERT pre-training model and proves the effectiveness of this method. The article also discusses the impact of maintaining the same perturbation results for words of different importance on the overall text utility. This method can be applied to long text protection.

[NLP-6] LAST: Language Model Aware Speech Tokenization

Link: https://arxiv.org/abs/2409.03701
Authors: Arnon Turetzky, Yossi Adi
Keywords: perform various tasks, Speech, Speech tokenization serves, spoken language modeling, tokenization serves
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:


Abstract:Speech tokenization serves as the foundation of speech language model (LM), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization process and its usage afterward. In this study, we propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs. We advocate for the integration of this objective into the process of learning discrete speech representations. Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs. We empirically investigate the impact of various model design choices, including speech vocabulary size and text LM size. Our results demonstrate the proposed tokenization method outperforms the evaluated baselines considering both spoken language modeling and speech-to-text. More importantly, unlike prior work, the proposed method allows the utilization of a single pre-trained LM for processing both speech and text inputs, setting it apart from conventional tokenization approaches.

[NLP-7] A Fused Large Language Model for Predicting Startup Success

Link: https://arxiv.org/abs/2409.03668
Authors: Abdurahman Maarouf, Stefan Feuerriegel, Nicolas Pröllochs
Keywords: continuously seeking profitable, predict startup success, continuously seeking, startup success, startup
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:


Abstract:Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup’s probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup’s innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.

[NLP-8] The representation landscape of few-shot learning and fine-tuning in large language models

Link: https://arxiv.org/abs/2409.03662
Authors: Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga
Keywords: In-context learning, modern large language, supervised fine-tuning, modern large, large language models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.

[NLP-9] LLM-based multi-agent poetry generation in non-cooperative environments

Link: https://arxiv.org/abs/2409.03659
Authors: Ran Zhang, Steffen Eger
Keywords: training process differs, process differs greatly, poetry generation, large language models, generated poetry lacks
Subjects: Computation and Language (cs.CL)
Comments: preprint


Abstract:Despite substantial progress of large language models (LLMs) for automatic poetry generation, the generated poetry lacks diversity while the training process differs greatly from human learning. Under the rationale that the learning process of the poetry generation systems should be more human-like and their output more diverse and novel, we introduce a framework based on social learning where we emphasize non-cooperative interactions besides cooperative interactions to encourage diversity. Our experiments are the first attempt at LLM-based multi-agent systems in non-cooperative environments for poetry generation employing both TRAINING-BASED agents (GPT-2) and PROMPTING-BASED agents (GPT-3 and GPT-4). Our evaluation based on 96k generated poems shows that our framework benefits the poetry generation process for TRAINING-BASED agents resulting in 1) a 3.0-3.7 percentage point (pp) increase in diversity and a 5.6-11.3 pp increase in novelty according to distinct and novel n-grams. The generated poetry from TRAINING-BASED agents also exhibits group divergence in terms of lexicons, styles and semantics. PROMPTING-BASED agents in our framework also benefit from non-cooperative environments and a more diverse ensemble of models with non-homogeneous agents has the potential to further enhance diversity, with an increase of 7.0-17.5 pp according to our experiments. However, PROMPTING-BASED agents show a decrease in lexical diversity over time and do not exhibit the group-based divergence intended in the social network. Our paper argues for a paradigm shift in creative tasks such as automatic poetry generation to include social learning processes (via LLM-based agent modeling) similar to human interaction.
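The diversity gains in the abstract are reported "according to distinct and novel n-grams". A common formulation of distinct-n, which the paper's metric is presumably a variant of, is the fraction of n-grams in the generated corpus that are unique:

```python
# Distinct-n: the fraction of n-grams across a set of generated texts that
# are unique. A standard diversity measure; the paper's exact variant may
# differ from this sketch.
def distinct_n(texts, n=2):
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Repeated generations drive the score toward 0, while fully novel phrasing keeps it at 1, which is why agents that resample near-identical outputs show up as low-diversity under this family of metrics.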

[NLP-10] On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Link: https://arxiv.org/abs/2409.03650
Authors: Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang
Keywords: Human Feedback, Reinforcement Learning, aligning language models, Direct Preference Optimization, human preferences
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 8 tables, 2 figures


Abstract:Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM’s effectiveness directly implies the optimality of the learned policy, and also has practical implication for LLM alignment methods including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and EXRM. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiates the integration of an explicit reward model in iterative DPO approaches.
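For context, the implicit reward model that DPO induces (the DPORM studied above) is the standard quantity r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)). A minimal sketch of how it is computed from per-token log-probabilities, independent of the paper's specific experimental setup:

```python
# The implicit reward induced by DPO (the DPORM of the abstract):
#     r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
# computed from per-token log-probabilities. This is the standard DPO
# quantity; the paper's training and evaluation setup goes beyond this sketch.
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit reward of one response, from its per-token log-probs."""
    return beta * (sum(logp_policy) - sum(logp_ref))

def ranks_chosen_higher(chosen, rejected, beta=0.1):
    """Does the implicit reward prefer the chosen response over the rejected one?
    Each argument is a (logp_policy, logp_ref) pair of per-token log-probs."""
    return implicit_reward(*chosen, beta) > implicit_reward(*rejected, beta)
```

Evaluating a DPORM as a reward model, as the paper does, amounts to checking how often `ranks_chosen_higher` holds on held-out preference pairs, including ones with distribution shift.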

[NLP-11] CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

Link: https://arxiv.org/abs/2409.03643
Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He
Keywords: presents significant challenges, significant challenges due, recognition presents significant, Formula recognition presents, Formula recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project Website: this https URL


Abstract:Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly sensitive to the distribution of training data, thereby causing the unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring the evaluation objectivity by designing a image-level rather than LaTex-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. Their results demonstrate that the CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
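The unfairness the paper targets is easy to reproduce: two LaTeX strings that render to the same formula can still be far apart as text. A plain Levenshtein implementation makes the point; CDM avoids the problem by matching characters in the rendered image instead.

```python
# Two LaTeX strings that render to the same formula still differ as text,
# so a text-level metric reports a nonzero "error". This is the bias that
# an image-level metric like CDM is designed to remove.
def edit_distance(a, b):
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For example, `x_i^2` and `x^2_i` render identically, yet `edit_distance` scores them a nonzero distance apart; an image-level comparison sees no difference at all.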

[NLP-12] Attend First Consolidate Later: On the Importance of Attention in Different LLM Layers

Link: https://arxiv.org/abs/2409.03621
Authors: Amit Ben Artzy, Roy Schwartz
Keywords: serves two purposes, attention mechanism, mechanism of future, layer serves, current token
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens; e.g. by replacing the hidden states at some layer k with random vectors. Our experimenting with four LLMs and four tasks show that this operation often leads to small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model-k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt; e.g., replacing the word “Italy” with “France” in “What is the capital of Italy?”. We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering “Rome”). However if we apply it before, the model conforms to the switch (“Paris”). Our results hint at a two stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
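The core intervention, replacing hidden states at some layer k with random vectors, can be illustrated on a toy stack of layers. This sketch only shows the mechanics of the swap on a single vector; in the paper the replacement is applied to previous tokens' states inside real LLMs, and the model and layer shapes below are invented.

```python
import math
import random

# Toy version of the intervention described above: run a stack of "layers"
# and, at layer k, overwrite the hidden state with a random vector before
# the remaining layers continue.
def forward_with_swap(x, weights, k=None, seed=0):
    h = list(x)
    rng = random.Random(seed)
    for layer_idx, w in enumerate(weights):
        if layer_idx == k:  # the intervention point
            h = [rng.gauss(0, 1) for _ in h]
        # stand-in "layer": linear map followed by tanh
        h = [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]
    return h
```

The paper's finding corresponds to swaps with k in the final 30-50% of layers barely changing the output, while swaps in early layers destroy it; this toy model has no such structure and only demonstrates the operation itself.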

[NLP-13] 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD

Link: https://arxiv.org/abs/2409.03563
Authors: Lorenzo Pacchiardi, Lucy G. Cheke, José Hernández-Orallo
Keywords: individual task instances, task instances, LLM, performance, instances
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models


Abstract:Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
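The prediction scheme can be sketched as follows. The 1-nearest-neighbour rule below is an illustrative stand-in for the learned generic assessor, and the feature vectors and reference results are toy data; the paper trains an actual assessor over richer instance features.

```python
# Toy sketch of a "generic assessor": predict whether a new LLM succeeds on
# an instance from (a) its results on a small reference set and (b) the
# instance's features. 1-NN stands in for the trained assessor.
def predict_success(instance_features, reference_features, reference_results):
    """Return the new LLM's result on the most similar reference instance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(reference_features)),
                  key=lambda i: sq_dist(instance_features, reference_features[i]))
    return reference_results[nearest]
```

The point of the paper is that `reference_features`/`reference_results` can stay around 100 instances per new LLM once a generic assessor has been fit on previously tested models, rather than re-evaluating each new LLM on the full benchmark.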

[NLP-14] From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

Link: https://arxiv.org/abs/2409.03512
Authors: Jifan Yu, Zheyuan Zhang, Daniel Zhang-li, Shangqing Tu, Zhanxin Hao, Rui Miao Li, Haoxuan Li, Yuanchun Wang, Hanming Li, Linlu Gong, Jie Cao, Jiayin Lin, Jinchang Zhou, Fei Qin, Haohua Wang, Jianxiao Jiang, Lijun Deng, Yisi Zhan, Chaojun Xiao, Xusheng Dai, Xuan Yan, Nianyi Lin, Nan Zhang, Ruixin Ni, Yang Dang, Lei Hou, Yu Zhang, Xu Han, Manli Li, Juanzi Li, Zhiyuan Liu, Huiqin Liu, Maosong Sun
Keywords: sparked extensive discussion, widespread adoption, uploaded to accessible, accessible and shared, scaling the dissemination
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

点击查看摘要

Abstract:Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China’s leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education.
摘要:自从第一次出现在线教育,课程被上传到可访问和共享的在线平台以来,这种将人类知识传播到更广泛受众的形式引发了广泛的讨论和广泛采用。认识到个性化学习仍然具有巨大的改进潜力,新的AI技术不断融入这种学习业态,产生了教育推荐、智能辅导等各种教育AI应用。大型语言模型(LLM)中智能的出现使得这些教育增强可以建立在统一的基础模型上,从而实现更深层次的集成。在这种背景下,我们提出了大规模人工智能赋能课程(MAIC),这是一种新的在线教育形式,利用LLM驱动的多代理系统来构建人工智能增强的课堂,平衡可伸缩性和适应性。除了探索概念框架和技术创新外,我们还在中国的顶尖大学之一清华大学进行了初步实验。从500多名学生的10万多份学习记录中,我们获得了一系列有价值的观察和初步分析。该项目将继续发展,最终目标是建立一个全面的开放平台,支持和统一研究、技术和应用,探索大模型人工智能时代在线教育的可能性。我们将这个平台设想为一个协作中心,将教育工作者、研究人员和创新者聚集在一起,共同探索人工智能驱动的在线教育的未来。

[NLP-15] How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
[NLP-15] 多少数据才是足够的数据?内部翻译的微调大型语言模型:跨多种数据集大小的性能评估

链接: https://arxiv.org/abs/2409.03454
作者: Inacio Vieira,Will Allred,Seamus Lankford,Sheila Castilho Monteiro De Sousa,Andy Way
关键词-EN: Decoder-only LLMs, generate high-quality translations, shown impressive performance, shown impressive, ability to learn
关键词-ZH: 仅限解码器的LLM,生成高质量的翻译,表现出令人印象深刻的性能,表现出令人印象深刻的学习能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.
摘要:仅解码器的LLM在机器翻译中表现出了令人印象深刻的性能,因为它们能够从大量的数据集中学习并生成高质量的翻译。然而,LLM往往难以处理特定于组织的翻译所需的细微差别和风格。在这项研究中,我们探索了利用翻译记忆库(TM)对大型语言模型(LLM)进行微调的有效性,特别是Llama 3 8B Instruct,将其作为提高准确性和效率的宝贵资源。我们调查了使用来自软件行业特定组织的TM微调Llama 3模型的影响。我们的实验涵盖了不同资源水平语言的五个翻译方向(英语到巴西葡萄牙语、捷克语、德语、芬兰语和韩语)。我们分析了不同大小的训练数据集(1k到207k句段),以评估它们对翻译质量的影响。我们为每个训练集微调单独的模型,并基于自动指标BLEU、chrF++、TER和COMET来评估它们的性能。我们的发现表明,在所有指标上,更大的数据集都带来了翻译性能的提高。与基线模型相比,在最大训练集上,BLEU和COMET的平均得分分别提高了13分和25分。值得注意的是,当仅在1k和2k示例上进行微调时,与基线模型相比,性能有所下降;然而,我们观察到随着训练数据集规模的增加,性能有了实质性的改善。这项研究强调了将TM与LLM相结合的潜力,以创建适合企业特定需求的定制翻译模型,从而提高翻译质量并缩短周转时间。这种方法为寻求利用TM和LLM实现最佳翻译结果的组织提供了宝贵的见解,尤其是在较窄的领域中。
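The metrics named above (BLEU, chrF++, TER, COMET) are normally computed with libraries such as sacreBLEU or the COMET package. As a flavor of how the character-level chrF family works, here is a deliberately simplified character n-gram F-score in pure Python; it is illustrative only and is not the official chrF++ implementation (which adds word n-grams, β-weighting, and whitespace handling rules).

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring spaces (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=3):
    """Simplified character n-gram F1, averaged over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        p, r = overlap / sum(hyp.values()), overlap / sum(ref.values())
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)
```

Character-level overlap makes the metric robust to morphological variants, which is one reason chrF++ is popular for morphologically rich target languages like Czech, Finnish, and Korean.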

[NLP-16] Fine-tuning large language models for domain adaptation: Exploration of training strategies scaling model merging and synergistic capabilities
[NLP-16] 微调大型语言模型以实现领域适应:探索训练策略扩展模型合并和协同能力

链接: https://arxiv.org/abs/2409.03444
作者: Wei Lu,Rachel K. Luu,Markus J. Buehler
关键词-EN: Large Language Models, Large Language, Direct Preference Optimization, Ratio Preference Optimization, Odds Ratio Preference
关键词-ZH: 大型语言模型、大型语言、直接偏好优化、比率偏好优化、赔率偏好
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
摘要:在材料科学和工程等领域应用的大语言模型(LLM)的进步取决于微调策略的发展,该策略使模型适应于专门的技术能力。在这项工作中,我们探讨了持续预训练(CPT)、有监督精调(SFT)以及各种基于偏好的优化方法,包括直接偏好优化(DPO)和赔率比偏好优化(ORPO),对微调LLM性能的影响。我们的分析显示了这些策略如何影响模型结果,并揭示了多个微调模型的合并可以导致出现超过父模型的单独贡献的能力。我们发现,模型合并带来了两个父模型都不能单独实现的新功能,从而提高了特定领域评估的性能。用不同的模型结构进行了实验,包括Llama 3.1 8B和Mistral 7B模型,在这些模型上观察到了相似的行为。为了探索这一结果是否也适用于小得多的模型,我们使用了具有17亿个参数的微型LLM,并表明非常小的LLM在模型合并下不一定具有紧急能力,这表明模型缩放可能是一个关键组件。在人类和人工智能模型之间开放但一致的聊天对话中,我们的评估揭示了对不同模型变体如何执行的详细洞察,并表明最小的模型在包括推理深度、创造力、清晰度和量化精度在内的关键标准上获得了高智能分数。其他实验包括基于不同的生物材料设计概念开发图像生成提示,以创建新的微结构、建筑概念,以及基于生物材料启发的建筑原则的城市设计。
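Model merging as discussed above combines the weights of separately fine-tuned models. One of the simplest recipes is uniform linear interpolation of matching parameter tensors; the sketch below uses plain Python lists as a scalar stand-in for tensors, and the paper does not commit to this exact scheme (other options include task-arithmetic or SLERP merging).

```python
def merge_state_dicts(state_a, state_b, alpha=0.5):
    """Linearly interpolate two models' parameters: alpha*A + (1-alpha)*B.
    Both models must share the same architecture (same parameter names/shapes)."""
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {name: [alpha * a + (1 - alpha) * b
                   for a, b in zip(state_a[name], state_b[name])]
            for name in state_a}

# Two toy "models" with a single weight vector each:
merged = merge_state_dicts({"w": [1.0, 3.0]}, {"w": [3.0, 1.0]})
```

The emergent-capability claim in the abstract is precisely that such a merged model can behave differently from either parent, not merely average their scores.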

[NLP-17] Rx Strategist: Prescription Verification using LLM Agents System
[NLP-17] Rx策略师:使用LLM代理系统进行处方验证

链接: https://arxiv.org/abs/2409.03440
作者: Phuc Phan Van,Dat Nguyen Minh,An Dinh Ngoc,Huy Phan Thanh
关键词-EN: Large Language Models, protect patient safety, pharmaceutical complexity demands, complexity demands strict, modern pharmaceutical complexity
关键词-ZH: 大型语言模型,保护患者安全,制药复杂性要求,复杂性要求严格,现代制药复杂性
类目: Computation and Language (cs.CL)
备注: 17 Pages, 6 Figures, Under Review

点击查看摘要

Abstract:To protect patient safety, modern pharmaceutical complexity demands strict prescription verification. We offer a new approach - Rx Strategist - that makes use of knowledge graphs and different search strategies to enhance the power of Large Language Models (LLMs) inside an agentic framework. This multifaceted technique allows for a multi-stage LLM pipeline and reliable information retrieval from a custom-built active ingredient database. Different facets of prescription verification, such as indication, dose, and possible drug interactions, are covered in each stage of the pipeline. We alleviate the drawbacks of monolithic LLM techniques by spreading reasoning over these stages, improving correctness and reliability while reducing memory demands. Our findings demonstrate that Rx Strategist surpasses many current LLMs, achieving performance comparable to that of a highly experienced clinical pharmacist. In the complicated world of modern medications, this combination of LLMs with organized knowledge and sophisticated search methods presents a viable avenue for reducing prescription errors and enhancing patient outcomes.
摘要:为了保护患者的安全,现代药物的复杂性要求严格的处方审核。我们提出了一种新的方法-Rx策略师-利用知识图和不同的搜索策略来增强大型语言模型(LLM)在代理框架内的能力。这种多方面的技术允许多阶段LLM管道和从定制的活性成分数据库中可靠地检索信息。处方验证的不同方面,如适应症、剂量和可能的药物相互作用,都在流水线的每个阶段涵盖。我们通过在这些阶段传播推理,在减少内存需求的同时提高正确性和可靠性,从而缓解了单片LLM技术的缺陷。我们的发现表明,Rx策略师超过了许多当前的LLM,取得了与经验丰富的临床药剂师相当的业绩。在现代药物的复杂世界中,LLM与有组织的知识和复杂的搜索方法相结合,为减少处方错误和提高患者结果提供了一条可行的途径。

[NLP-18] CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
[NLP-18] CogniDual框架:在双系统理论框架内自我训练大型语言模型,以改善认知任务

链接: https://arxiv.org/abs/2409.03381
作者: Yongxin Deng(1),Xihe Qiu(1),Xiaoyu Tan(2),Chao Qu(2),Jing Pan(3),Yuan Cheng(3),Yinghui Xu(4),Wei Chu(2) ((1) School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China, (2) INF Technology (Shanghai) Co., Ltd., Shanghai, China, (3) School of Art, Design and Architecture, Monash University, Melbourne, Australia, (4) Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China)
关键词-EN: psychology investigates perception, investigates perception, Cognitive psychology investigates, rational System, System
关键词-ZH: 心理学调查知觉,调查知觉,认知心理学调查,理性系统,系统
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman’s dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language Models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the CogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs’ response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.
摘要:认知心理学研究感知、注意力、记忆、语言、解决问题、决策和推理。卡纳曼的双系统理论阐明了人类的决策过程,区分了快速、直观的系统1和深思熟虑的理性系统2。最近的进步将大语言模型(LLM)定位为在各种认知任务中接近人类水平的强大工具。尽管如此,在LLMS中是否存在类似于人类认知的双重系统框架仍未被探索。本研究介绍了LLMS的认知双重框架(CFLLMS),旨在评估LLMS是否能够通过自我训练从刻意的演绎演变为直觉反应,从而模仿人类获取和掌握新信息的过程。我们的发现揭示了LLMS反应生成背后的认知机制,增强了我们对他们认知心理学能力的理解。实际上,自训练模型可以对某些查询提供更快的响应,从而减少推理过程中的计算需求。

[NLP-19] Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
[NLP-19] 通过自然语言处理利用大型语言模型,实时提供可解释的心理恶化机器学习预测

链接: https://arxiv.org/abs/2409.03375
作者: Francisco de Arriba-Pérez,Silvia García-Méndez
关键词-EN: million people worldwide, Based on official, million people, natural language analysis, official estimates
关键词-ZH: 全球百万人,基于官方、百万人、自然语言分析、官方估计
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches need more semantic knowledge management and explicability capabilities. Moreover, using Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning prediction of cognitive decline in real-time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. More in detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome. Classification results exceed 80 % in all evaluation metrics, with a recall value for the mental deterioration class about 85 %. To sum up, we contribute with an affordable, flexible, non-invasive, personalized diagnostic system to this work.
摘要:据官方估计,全球有5000万人患有痴呆症,而且这一数字每年还会增加1000万新患者。在无法治愈的情况下,临床预测和早期干预是延缓其进展的最有效方法。为此,可以利用人工智能和计算语言学进行自然语言分析、个性化评估、监测和治疗。然而,传统的方法需要更强的语义知识管理和可解释能力。此外,使用大型语言模型(LLM)进行认知衰退诊断的研究仍然很少,尽管这些模型代表了使用智能系统进行临床医患交流的最先进方式。因此,我们在聊天机器人解决方案中利用采用最新自然语言处理(NLP)技术的LLM,实时提供可解释的机器学习认知衰退预测。利用语言-概念特征进行适当的自然语言分析。通过可解释性,我们旨在对抗模型的潜在偏见,并提高其帮助临床工作者做出诊断决策的潜力。更详细地说,拟议的流程包括:(1)采用基于NLP的提示工程的数据提取;(2)基于流的数据处理,包括特征工程、分析和选择;(3)实时分类;(4)可解释性仪表板,提供对预测结果的可视化和自然语言描述。在所有评价指标中,分类结果超过80%,其中精神恶化类别的召回率约为85%。总而言之,我们为这项工作贡献了一个经济、灵活、非侵入性的个性化诊断系统。

[NLP-20] Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
[NLP-20] Con-ReCall:通过对比解码检测LLM中的预训练数据

链接: https://arxiv.org/abs/2409.03363
作者: Cheng Wang,Yiwei Wang,Bryan Hooi,Yujun Cai,Nanyun Peng,Kai-Wei Chang
关键词-EN: large language models, security risks, large language, language models, models is key
关键词-ZH: 大型语言模型,安全风险,大型语言,语言模型,模型是关键
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
摘要:大型语言模型中的训练数据是它们成功的关键,但它也存在隐私和安全风险,因为其中可能包含敏感信息。检测训练前数据对于缓解这些担忧至关重要。现有方法通常孤立地或仅使用非成员上下文来分析目标文本,忽略了同时考虑成员和非成员上下文的潜在洞察力。虽然以前的工作表明,由于成员上下文引起的微小分布变化,它们提供的信息很少,但我们的分析表明,与非成员上下文相比,这些微妙的变化可以有效地利用。在本文中,我们提出了一种新的方法Con-Recall,它通过对比解码来利用成员和非成员上下文引起的非对称分布移位,放大细微的差异来增强成员推理。广泛的实验评估表明,CON-Recall在WikiMIA基准测试中实现了最先进的性能,并且对各种文本处理技术具有健壮性。
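The contrastive idea in Con-ReCall can be sketched as follows: score how much a prefix of known member text shifts the target's log-likelihood, versus how much a known non-member prefix does, and use the asymmetry as a membership signal. The exact scoring function is defined in the paper; the toy log-likelihood below is invented purely so the sketch runs end to end.

```python
def contrastive_score(loglik, target, member_ctx, nonmember_ctx):
    """Con-ReCall-style membership signal (sketch): compare how much a
    member-text prefix vs. a non-member prefix shifts the target's score."""
    base = loglik(target)
    member_shift = loglik(member_ctx + target) - base
    nonmember_shift = loglik(nonmember_ctx + target) - base
    return member_shift - nonmember_shift

def toy_loglik(text):
    # Toy stand-in for an LLM log-likelihood: rewards a "memorized" phrase
    # and penalizes length. A real attack queries the model's logprobs.
    return text.count("alpha") * 1.0 - 0.01 * len(text)

score = contrastive_score(toy_loglik, "alpha beta", "alpha ", "gamma ")
```

A large positive score suggests the target behaves like training data; thresholding this score over a calibration set yields the actual membership-inference classifier.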

[NLP-21] Sketch: A Toolkit for Streamlining LLM Operations
[NLP-21] 草图:简化LLM运营的工具包

链接: https://arxiv.org/abs/2409.03346
作者: Xin Jiang,Xiang Li,Wenjia Ma,Xuezhi Fang,Yiqun Yao,Naitong Yu,Xuying Meng,Peng Han,Jing Li,Aixin Sun,Yequan Wang
关键词-EN: Large language models, achieved remarkable success, represented by GPT, Large language, GPT family
关键词-ZH: 大型语言模型,取得显着成功,以GPT、大型语言、GPT家族为代表
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model’s outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of ‘‘plug-and-play’’ for various applications. The components of Sketch will be progressively open-sourced at this https URL.
摘要:以GPT家族为代表的大语言模型取得了令人瞩目的成就。LLM的特点在于它们能够通过生成式方法适应广泛的任务。然而,其输出格式的灵活性给控制和利用模型的输出带来了挑战,从而限制了LLM在各个领域的应用。在这项工作中,我们介绍了Sketch,这是一个旨在简化不同领域LLM操作的创新工具包。Sketch由以下部分组成:(1)一套包含各种NLP任务的任务描述模式和提示模板;(2)用户友好的交互过程,用于构建针对各种NLP任务的结构化输出LLM服务;(3)用于输出格式控制的开源数据集,以及数据集构建工具;(4)基于LLaMA3-8B-Instruct的开源模型,它熟练地理解并遵守输出格式指令。我们预计这一举措将为LLM用户带来相当大的便利,实现各种应用的"即插即用"目标。Sketch的组件将在此https URL上逐步开源。
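Output-format control of the kind Sketch targets usually boils down to validating a model's raw response against a task schema before using it. Here is a minimal stdlib-only sketch; the schema shown is a hypothetical entity-extraction schema invented for illustration, not Sketch's actual schema format.

```python
import json

# Hypothetical output schema for an entity-extraction task (illustration only).
TASK_SCHEMA = {"entities": list, "label": str}

def validate_output(raw_text, schema):
    """Accept an LLM response only if it parses as JSON and every required
    key is present with the expected type."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    return all(k in data and isinstance(data[k], t) for k, t in schema.items())
```

In a production toolkit, a failed validation would typically trigger a retry or a repair prompt rather than a hard error.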

[NLP-22] Normal forms in Virus Machines
[NLP-22] 病毒机器中的正常形式

链接: https://arxiv.org/abs/2409.03327
作者: A. Ramírez-de-Arellano,F. G. C. Cabarle,D. Orellana-Martín,M. J. Pérez-Jiménez
关键词-EN: study the computational, virus machines, normal forms, VMs, present work
关键词-ZH: 研究计算、病毒机、范式、虚拟机、当前工作
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:In the present work, we further study the computational power of virus machines (VMs in short). VMs provide a computing paradigm inspired by the transmission and replication networks of viruses. VMs consist of process units (called hosts) structured by a directed graph whose arcs are called channels and an instruction graph that controls the transmissions of virus objects among hosts. The present work complements our understanding of the computing power of VMs by introducing normal forms; these expressions restrict the features in a given computing model. Some of the features that we restrict in our normal forms include (a) the number of hosts, (b) the number of instructions, and (c) the number of virus objects in each host. After we recall some known results on the computing power of VMs, we give our normal forms, such as the size of the loops in the network, proving new characterisations of families of sets, such as the finite sets, semilinear sets, or NRE.
摘要:在本工作中,我们进一步研究了病毒机(简称为VMs)的计算能力。虚拟机提供了一种受病毒传输和复制网络启发的计算范式。虚拟机由进程单元(称为主机)组成,该进程单元由有向图(其弧线称为通道)和指令图(控制主机之间病毒对象的传输)结构。本工作通过引入范式来补充我们对虚拟机计算能力的理解;这些表达限制了给定计算模型中的特征。我们在正常形式中限制的一些功能包括(a)主机数量、(b)指令数量和(c)每个主机中病毒对象的数量。在我们回忆起有关虚拟机计算能力的一些已知结果后,我们给出了我们的范式,例如网络中循环的大小,证明了集族的新特征,例如有限集、半线性集或NRE。

[NLP-23] N-gram Prediction and Word Difference Representations for Language Modeling
[NLP-23] 语言建模的N-gram预测和词差表示

链接: https://arxiv.org/abs/2409.03295
作者: DongNyeong Heo,Daniela Noemi Rim,Heeyoul Choi
关键词-EN: Causal language modeling, underpinning remarkable successes, recent large language, foundational framework underpinning, framework underpinning remarkable
关键词-ZH: 因果语言建模,支撑着非凡的成功,最近的大型语言,支撑着基础框架,支撑着非凡的框架
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have been introduced to predict future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words’ prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
摘要:因果语言模型(CLM)是近年来大型语言模型(LLM)取得显著成就的基础框架。尽管取得了成功,但用于下一个单词预测的训练方法带来了潜在的风险,可能会导致模型过度关注句子中的局部依存关系。虽然以前的研究是为了同时预测未来的N个单词,但它们主要应用于掩蔽语言建模(MLM)和神经机器翻译(NMT)等任务。在这项研究中,我们介绍了一个简单的N元语法预测框架,用于CLM任务。此外,我们在N元语法预测框架的基础上,在模型训练过程中引入了单词差异表示(WDR)作为替代和上下文目标表示。为了进一步提高下一个词的预测质量,我们提出了一种融合未来N个词的预测结果的集成方法。对包含CLM和NMT任务的多个基准数据集的经验评估表明,我们提出的方法比传统的CLM具有显著的优势。
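The ensemble step described above combines the next-word predictions produced by the different N-gram heads (e.g. the 1-ahead head fired at step t and the 2-ahead head fired at step t-1). A minimal sketch of that combination, with toy hand-written distributions in place of real model outputs:

```python
def ensemble_next_word(head_distributions, weights=None):
    """Average next-word distributions from several N-gram prediction heads
    and return the argmax token (a sketch of the paper's ensemble idea;
    the actual weighting scheme is not reproduced here)."""
    if weights is None:
        weights = [1.0 / len(head_distributions)] * len(head_distributions)
    combined = {}
    for dist, w in zip(head_distributions, weights):
        for token, p in dist.items():
            combined[token] = combined.get(token, 0.0) + w * p
    return max(combined, key=combined.get)

# Two heads disagree; the ensemble resolves in favor of the stronger vote.
choice = ensemble_next_word([{"cat": 0.6, "dog": 0.4}, {"dog": 0.9, "cat": 0.1}])
```

Because each head was trained to look further ahead, the ensemble injects longer-range signal into what is otherwise a purely local next-word decision.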

[NLP-24] LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP
[NLP-24] LLM探测器仍然达不到现实世界:LLM生成的短新闻类帖子的案例

链接: https://arxiv.org/abs/2409.03291
作者: Henrique Da Silva Gameiro,Andrei Kucharavy,Ljiljana Dolamic
关键词-EN: large Language Models, Language Models, major concern, emergence of widely, widely available powerful
关键词-ZH: 大型语言模型,语言模型,主要关注点,广泛使用的出现,强大的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 20 pages, 7 tables, 13 figures, under consideration for EMNLP

点击查看摘要

Abstract:With the emergence of widely available powerful LLMs, disinformation generated by large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations – short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between the adversarial evasion resilience and overfitting to the reference human text, with both needing evaluation in benchmarks and currently absent. We believe this suggests a re-consideration of current LLM detector benchmarking approaches and provides a dynamically extensible benchmark to allow it (this https URL).
摘要:随着广泛使用的功能强大的LLM的出现,大语言模型(LLM)产生的虚假信息已经成为人们关注的主要问题。从历史上看,LLM探测器一直被吹捧为一种解决方案,但它们在现实世界中的有效性仍有待证明。在这篇文章中,我们关注信息操作中的一个重要环境–由中等经验丰富的攻击者生成的类似新闻的短帖子。我们证明,现有的LLM探测器,无论是零炮还是专门训练的,都没有准备好在那种情况下用于现实世界。所有经过测试的零射击探测器的性能都与以前的基准不一致,并且非常容易受到采样温度升高的影响,这是最近的基准中所没有的一种轻微攻击。可以开发出一种专门训练的检测器,可以在LLMS和不可见攻击中推广,但它无法推广到新的人类书写的文本。我们认为,前者表明需要特定领域的基准测试,而后者则建议在对抗性回避韧性和对参考人类文本的过度匹配之间进行权衡,两者都需要在基准中进行评估,目前还没有。我们认为这表明了对当前LLM探测器基准方法的重新考虑,并提供了一个动态可扩展的基准来允许它(这个HTTPS URL)。

[NLP-25] iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models
[NLP-25] iText2KG:使用大型语言模型增量构建知识图谱

链接: https://arxiv.org/abs/2409.03284
作者: Yassir Lairgi,Ludovic Moncla,Rémy Cazabet,Khalid Benabdeslem,Pierre Cléau
关键词-EN: access valuable information, challenging to access, access valuable, making it challenging, building Knowledge Graphs
关键词-ZH: 访问有价值的信息,访问具有挑战性,访问有价值,使其具有挑战性,构建知识图
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024

点击查看摘要

Abstract:Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models’ capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.
摘要:大多数可用的数据都是非结构化的,这使得访问有价值的信息变得具有挑战性。自动构建知识图(KG)对于结构化数据和使其可访问至关重要,从而使用户能够有效地搜索信息。KG还有助于洞察、推理和推理。传统的自然语言处理方法,如命名实体识别和关系提取,是信息检索的关键,但面临着限制,包括使用预定义的实体类型和需要监督学习。目前的研究利用了大型语言模型的能力,例如零机会或极少机会学习。然而,未解决的和语义上重复的实体和关系仍然构成挑战,导致图形不一致,并需要广泛的后处理。此外,大多数方法都是主题相关的。在本文中,我们提出了一种无需后处理的增量、主题无关的KG构建方法iText2KG。这种即插即用的零镜头方法适用于广泛的KG构建场景,包括四个模块:文档蒸馏器、增量实体抽取器、增量关系抽取器和图形集成器和可视化。与基线方法相比,我们的方法在三个场景中表现出了优越的性能:将科学论文转换为图表,将网站转换为图表,以及将简历转换为图表。
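The "unresolved and semantically duplicated entities" problem mentioned above is what the Incremental Entity Extractor must handle: each new batch of extracted entities has to be matched against the nodes already in the graph. The sketch below uses simple name normalization as a toy stand-in; iText2KG itself resolves duplicates with richer semantic matching.

```python
def normalize(name):
    """Canonical form for matching: lowercase, collapsed whitespace."""
    return " ".join(name.lower().split())

def merge_entities(graph_entities, new_entities):
    """Incrementally add entities, reusing an existing node whenever a newly
    extracted entity has the same normalized name (toy stand-in for the
    semantic matching an incremental KG builder needs)."""
    index = {normalize(e): e for e in graph_entities}
    for e in new_entities:
        index.setdefault(normalize(e), e)
    return list(index.values())

entities = merge_entities(["Large Language Models"],
                          ["Large  language models", "Knowledge Graph"])
```

Without a step like this, each document processed would mint fresh duplicate nodes, and the resulting graph would need exactly the heavy post-processing the paper aims to avoid.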

[NLP-26] ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding
[NLP-26] ChartMoE:用于高级图表理解的专家连接器混合

链接: https://arxiv.org/abs/2409.03277
作者: Zhengzhuo Xu,Bowen Qu,Yiyan Qi,Sinan Du,Chengjin Xu,Chun Yuan,Jian Guo
关键词-EN: Automatic chart understanding, Automatic chart, document parsing, chart understanding, crucial for content
关键词-ZH: 自动图表理解,自动图表,文档解析,图表理解,对内容至关重要
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
摘要:自动图表理解是内容理解和文档解析的关键。多模态大型语言模型(MLLM)通过特定领域的对齐和微调,在图表理解方面表现出了非凡的能力。然而,对齐训练在图表领域内的应用仍未得到充分探索。为了解决这一问题,我们提出了ChartMoE,它采用混合专家(MoE)架构来取代传统的线性投影器,以弥合模态差距。具体来说,我们通过不同的对齐任务来训练多个线性连接器,并将其用作不同专家的基础初始化参数。此外,我们引入了ChartMoE-Align,这是一个拥有超过90万个图表-表格-JSON-代码四元组的数据集,用于执行三个对齐任务(图表-表格/JSON/代码)。结合原始连接器,我们通过四种不同的方式初始化不同的专家,并采用高质量的知识学习来进一步优化MoE连接器和LLM参数。大量实验证明了MoE连接器和我们的初始化策略的有效性,例如,ChartMoE在ChartQA基准上将此前最先进方法的准确率从80.48%提高到84.64%。
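A mixture-of-experts connector of the kind described above routes each visual feature through several projectors and mixes their outputs with softmax gate scores. The sketch below is a scalar toy version for illustration: real ChartMoE experts are full linear layers over feature vectors, and the gating network is learned, not the hand-set weights used here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_connector(feature, experts, gates):
    """Mix several linear experts (w, b) with softmax gate scores computed
    from the input feature (scalar toy version of an MoE connector)."""
    scores = softmax([g * feature for g in gates])
    outputs = [w * feature + b for w, b in experts]
    return sum(s * o for s, o in zip(scores, outputs))

# Two experts, uniform gating (gate weights of 0 give equal scores):
projected = moe_connector(2.0, [(1.0, 0.0), (3.0, 0.0)], [0.0, 0.0])
```

Initializing each expert from a different alignment task, as the abstract describes, amounts to choosing different starting (w, b) pairs before joint training.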

[NLP-27] Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation
[NLP-27] 战略思想链:通过战略启发指导法学硕士的准确推理

链接: https://arxiv.org/abs/2409.03271
作者: Yu Wang,Shiwan Zhao,Zhihu Wang,Heyuan Huang,Ming Fan,Yubo Zhang,Zhixing Wang,Haijun Wang,Ting Liu
关键词-EN: large language models, paradigm has emerged, capabilities of large, large language, LLM performance
关键词-ZH: 大型语言模型、范式已经出现、大型语言的能力、LLM性能
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the Strategic Chain-of-Thought (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05% increase on the GSM8K dataset and 24.13% on the Tracking_Objects dataset, respectively, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
摘要:思想链(CoT)范式已经成为增强大型语言模型(LLM)推理能力的关键方法。然而,尽管CoT方法被广泛采用并取得了成功,但由于无法一致地保证所生成推理路径的质量,它们常常表现出不稳定,导致推理性能欠佳。为了应对这一挑战,我们提出了战略思想链(SCoT),这是一种新方法,旨在通过在生成中间推理步骤之前整合策略性知识来优化LLM的性能。SCoT在单个提示中采用两阶段方法:首先引出一个有效的解题策略,然后利用该策略指导生成高质量的CoT路径和最终答案。我们在8个具有挑战性的推理数据集上的实验表明了显著的改进:使用Llama3-8b模型时,在GSM8K数据集和Tracking_Objects数据集上分别提升了21.05%和24.13%。此外,我们扩展了SCoT框架,开发出一种带有自动匹配示例的少样本(few-shot)方法,取得了更强的结果。这些发现印证了SCoT的有效性,突显了它在复杂推理任务中显著提升LLM表现的潜力。
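SCoT 的核心是在单个提示内完成“先引出策略、再按策略推理”两个阶段。下面是一个示意性的提示构造草图(提示词措辞与函数名均为笔者假设,非论文官方实现):

```python
def build_scot_prompt(question: str) -> str:
    """单提示两阶段:先引出解题策略,再让模型按该策略生成 CoT 推理与最终答案。"""
    return (
        "You will solve the problem in two stages within this single response.\n"
        "Stage 1: First, propose an effective problem-solving strategy for the question.\n"
        "Stage 2: Then, following that strategy, reason step by step and give the final answer.\n\n"
        f"Question: {question}\n"
        "Stage 1 (strategy):"
    )

prompt = build_scot_prompt(
    "Natalia sold 48 clips in April and half as many in May. How many clips in total?"
)
```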

[NLP-28] GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
[NLP-28] GraphInsight:挖掘大型语言模型中的见解以理解图结构

链接: https://arxiv.org/abs/2409.03258
作者: Yukun Cao,Shuo Han,Zengyi Gao,Zezhong Ding,Xike Xie,S. Kevin Zhou
关键词-EN: Large Language Models, Language Models, Large Language, graph description sequences, description sequences
关键词-ZH: 大型语言模型、语言模型、大型语言、图描述序列、描述序列
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ‘‘positional biases’’. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs’ comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.
摘要:虽然大型语言模型(LLM)在处理图方面显示出了潜力,但它们很难通过图描述序列的提示理解图结构信息,尤其是随着图规模的增大。我们将这一挑战归因于LLM在图描述序列不同位置上不均衡的记忆表现,即所谓的“位置偏差”(positional biases)。为了解决这个问题,我们提出了GraphInsight,这是一个旨在提高LLM对宏观和微观层面图信息理解能力的新框架。GraphInsight基于两个关键策略:1)将关键图信息放置在LLM记忆表现更强的位置;2)受检索增强生成(RAG)的启发,为记忆表现较弱的区域研究轻量级外部知识库。此外,GraphInsight还探索将这两种策略集成到LLM代理流程中,用于需要多步推理的复合图任务。在涵盖广泛评估任务的基准上的大量实证研究表明,GraphInsight在理解不同规模的图结构方面显著优于所有其他图描述方法(例如提示技术和重排序策略)。
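GraphInsight 的第一条策略(把关键图信息放到 LLM 记忆较强的序列位置)可以用如下草图示意:按重要性把边重排,使最重要的边落在描述序列的开头和结尾。其中的重要性打分方式纯属演示用假设,与论文的具体实现无关:

```python
def reorder_graph_description(edges, importance):
    """按重要性排序后,最重要的边放到序列开头,次重要的放到结尾,其余留在中部。"""
    ranked = sorted(edges, key=importance, reverse=True)
    head, tail, middle = [], [], []
    for i, e in enumerate(ranked):
        if i < len(ranked) // 3:        # 最重要的约三分之一放开头
            head.append(e)
        elif i < 2 * len(ranked) // 3:  # 次重要的放结尾
            tail.append(e)
        else:                           # 其余放中部(LLM 记忆较弱的位置)
            middle.append(e)
    return head + middle + tail

# 以边权作为示意性的重要性分数
edges = [("A", "B", 5), ("B", "C", 1), ("C", "D", 9), ("D", "E", 2)]
order = reorder_graph_description(edges, importance=lambda e: e[2])
```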

[NLP-29] Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
[NLP-29] 通过纵向研究了解LLM发展:来自开放Ko-LLM排行榜的见解

链接: https://arxiv.org/abs/2409.03257
作者: Chanjun Park,Hyeonwoo Kim
关键词-EN: Open Ko-LLM Leaderboard, Open Ko-LLM, restricted observation periods, Ko-LLM Leaderboard, eleven months
关键词-ZH: 开放Ko-LLM排行榜,开放Ko-LLM,限制观察期,Ko-LLM排行榜,十一个月
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which have relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard?. By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
摘要:本文进行了一项为期11个月的纵向研究,以克服此前关于Open Ko-LLM排行榜研究的局限:这些研究依赖的实证观察期仅有五个月。通过延长分析时长,我们旨在更全面地理解韩语大型语言模型(LLM)的发展进程。我们的研究围绕三个主要研究问题展开:(1)随着时间的推移,在Open Ko-LLM排行榜的多样任务上提升LLM性能面临哪些具体挑战?(2)模型规模如何影响各基准上任务性能的相关性?(3)Open Ko-LLM排行榜上的排名模式随时间发生了怎样的变化?通过分析此期间的1,769个模型,我们的研究全面考察了LLM的持续进步以及评估框架的演变。

[NLP-30] E2CL: Exploration-based Error Correction Learning for Embodied Agents
[NLP-30] E2CL:面向具身智能体的基于探索的纠错学习

链接: https://arxiv.org/abs/2409.03256
作者: Hanlin Wang,Chak Tou Leong,Jian Wang,Wenjie Li
关键词-EN: exhibiting increasing capability, Language models, utilization and reasoning, models are exhibiting, exhibiting increasing
关键词-ZH: 表现出不断增长的能力,语言模型,利用率和推理,模型正在表现出,表现出不断增长
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models are exhibiting increasing capability in knowledge utilization and reasoning. However, when applied as agents in embodied environments, they often suffer from misalignment between their intrinsic knowledge and environmental knowledge, leading to infeasible actions. Traditional environment alignment methods, such as supervised learning on expert trajectories and reinforcement learning, face limitations in covering environmental knowledge and achieving efficient convergence, respectively. Inspired by human learning, we propose Exploration-based Error Correction Learning (E2CL), a novel framework that leverages exploration-induced errors and environmental feedback to enhance environment alignment for LM-based agents. E2CL incorporates teacher-guided and teacher-free exploration to gather environmental feedback and correct erroneous actions. The agent learns to provide feedback and self-correct, thereby enhancing its adaptability to target environments. Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents outperform those trained by baseline methods and exhibit superior self-correction capabilities.
摘要:语言模型在知识利用和推理方面表现出越来越强的能力。然而,当它们作为智能体应用于具身环境时,其内在知识与环境知识往往不一致,导致动作不可行。传统的环境对齐方法,如基于专家轨迹的监督学习和强化学习,分别在覆盖环境知识和实现高效收敛方面存在局限。受人类学习的启发,我们提出了基于探索的纠错学习(E2CL),这是一个利用探索引发的错误和环境反馈来增强基于语言模型的智能体环境对齐的新框架。E2CL结合教师指导探索和无教师探索来收集环境反馈并纠正错误动作。智能体学会提供反馈并进行自我纠正,从而增强对目标环境的适应性。在Virtualhome环境中的评估表明,经E2CL训练的智能体优于基线方法训练的智能体,并表现出卓越的自我纠正能力。

[NLP-31] Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition
[NLP-31] 保留BERT中的经验概率以实现小样本临床实体识别

链接: https://arxiv.org/abs/2409.03238
作者: Abdul Rehman,Jian Jun Zhang,Xiaosong Yang
关键词-EN: Named Entity Recognition, Entity Recognition, Named Entity, equitable entity recognition, encounters the challenge
关键词-ZH: 命名实体识别,实体识别,命名实体,公平实体识别,遇到挑战
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 8 figures

点击查看摘要

Abstract:Named Entity Recognition (NER) encounters the challenge of unbalanced labels, where certain entity types are overrepresented while others are underrepresented in real-world datasets. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels of the BERT-based pre-trained model. We analyze the different mechanisms of loss calculation and loss propagation for the task of token classification on randomized datasets. Then we propose ways to improve the token classification for the highly imbalanced task of clinical entity recognition.
摘要:命名实体识别(NER)面临标签不平衡的挑战:在真实数据集中,某些实体类型被过度代表,而另一些则代表不足。这种不平衡可能导致模型产生偏差,在少数实体类别上表现不佳,从而妨碍准确、公平的实体识别。本文探讨了不平衡实体标签对基于BERT的预训练模型的影响。我们分析了随机化数据集上标记(token)分类任务中损失计算和损失传播的不同机制。然后,我们提出了改进标记分类的方法,以应对高度不平衡的临床实体识别任务。
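针对这类标签不平衡问题,一个常见的通用做法是按标签经验频率的倒数为交叉熵损失加权,使稀有实体类别获得更大权重。下面的草图仅演示这一通用思路,并非该论文提出的具体机制:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """labels: 训练集中所有 token 的标签序列;返回 {标签: 权重},稀有标签权重更大。"""
    counts = Counter(labels)
    total = sum(counts.values())
    # 权重 = 总数 / (类别数 * 该类频数),各类的 权重*频数 之和等于总样本数
    return {lab: total / (len(counts) * c) for lab, c in counts.items()}

# 示意性的临床标签分布:O 占绝大多数,实体标签稀少
labels = ["O"] * 90 + ["B-DRUG"] * 8 + ["B-DOSE"] * 2
w = inverse_frequency_weights(labels)
```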

[NLP-32] Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
[NLP-32] 通过非典型表现再校准增强医疗LLM的可信度

链接: https://arxiv.org/abs/2409.03225
作者: Jeremy Qin,Bang Liu,Quoc Dinh Nguyen
关键词-EN: Black-box large language, large language models, making it essential, large language, increasingly deployed
关键词-ZH: 黑匣子大型语言、大型语言模型,使其变得至关重要,大型语言,越来越多地部署
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model’s confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
摘要:黑盒大语言模型(LLM)越来越多地部署在各种环境中,这使得这些模型必须有效地传达其信心和不确定性,尤其是在高风险场景中。然而,这些模型往往表现出过度自信,导致潜在风险和误判。现有用于引出和校准LLM置信度的技术主要集中在一般推理数据集上,仅带来了有限的改进。准确的校准对于明智决策和避免不良后果至关重要,但由于这些模型执行任务的复杂性和多变性,校准仍然具有挑战性。在这项工作中,我们考察了医疗场景下黑盒LLM的校准失准行为。我们提出了一种新方法,即非典型表现再校准(Atypical Presentations Recalibration),它利用非典型表现来调整模型的置信度估计。我们的方法显著改进了校准,在三个医学问答数据集上将校准误差降低了约60%,并优于现有方法,如普通口头置信度、CoT口头置信度等。此外,我们还深入分析了非典型性在再校准框架中的作用。
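评估此类校准方法时,常用的指标是期望校准误差(ECE):把预测按置信度分桶,累加每个桶内“平均置信度与实际正确率之差”的加权和。下面是该通用指标的一个极简实现草图(分桶数等细节为常见默认设置,并非论文原代码):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """confidences: 模型给出的置信度 (0~1];corrects: 对应回答是否正确 (bool)。"""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # 置信度落入的桶
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)   # 桶内平均置信度
        accuracy = sum(ok for _, ok in b) / len(b) # 桶内实际正确率
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

校准良好的模型 ECE 接近 0;论文所称“校准误差降低约 60%”即指此类指标上的改进。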

[NLP-33] xLAM: A Family of Large Action Models to Empower AI Agent Systems
[NLP-33] xLAM:一系列支持人工智能代理系统的大型动作模型

链接: https://arxiv.org/abs/2409.03215
作者: Jianguo Zhang,Tian Lan,Ming Zhu,Zuxin Liu,Thai Hoang,Shirley Kokane,Weiran Yao,Juntao Tan,Akshara Prabhakar,Haolin Chen,Zhiwei Liu,Yihao Feng,Tulika Awalgaonkar,Rithesh Murthy,Eric Hu,Zeyuan Chen,Ran Xu,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
关键词-EN: significant research interest, attracted significant research, research interest, agent tasks, Autonomous agents powered
关键词-ZH: 显着的研究兴趣,吸引了显着的研究,研究兴趣,代理任务,自主代理动力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical report for the Salesforce xLAM model series

点击查看摘要

Abstract:Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents’ generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL
摘要:基于大语言模型(LLM)的自主智能体已经引起了极大的研究兴趣。然而,由于高质量智能体数据集稀缺且该领域缺乏标准协议,开源社区在为智能体任务开发专门模型方面面临许多挑战。我们引入并公开发布了xLAM,这是一系列为AI智能体任务设计的大型动作模型。xLAM系列包括五个具有稠密和混合专家架构的模型,参数规模从1B到8x22B不等,使用可扩展、灵活的流水线进行训练,该流水线统一、增强并合成多样的数据集,以提升AI智能体在不同环境中的泛化能力和性能。我们的实验结果表明,xLAM在多个智能体能力基准上始终表现出色,尤其是在Berkeley函数调用排行榜上获得第一名,在工具使用方面优于GPT-4、Claude-3和许多其他模型。通过发布xLAM系列,我们的目标是提升面向自主AI智能体的开源LLM的性能,有望加速研究进展,并让更多人能够使用面向智能体任务的高性能模型。模型可通过此 https URL 获取。

[NLP-34] An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
[NLP-34] 在低资源情感分类中有效部署扩散LM进行数据增强

链接: https://arxiv.org/abs/2409.03203
作者: Zhuowei Chen,Lianxi Wang,Yuben Wu,Xinfeng Liao,Yujia Tian,Junyang Zhong
关键词-EN: imbalanced label distributions, imbalanced label, label distributions, Sentiment classification, language model
关键词-ZH: 不平衡标签分布,不平衡标签,标签分布,情感分类,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored, moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework’s modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
摘要:情感分类(SC)经常面临低资源挑战,例如特定领域的上下文、不平衡的标签分布和少样本(few-shot)场景。扩散语言模型(LM)用于文本数据增强(DA)的潜力仍未被发掘;此外,文本DA方法难以平衡新样本的多样性和一致性。大多数DA方法要么执行逻辑修改,要么用语言模型改写原始序列中不太重要的标记。在SC的背景下,强烈的情感标记可能对整个序列的情感起关键作用。因此,与改写次要上下文相反,我们提出了DiffusionCLS,利用扩散LM捕获领域内知识,并通过重构与标签强相关的标记来生成伪样本。这种方法在一致性和多样性之间取得平衡,避免引入噪声并增强数据集的关键特征。DiffusionCLS还包含一个抗噪训练目标,以帮助模型泛化。实验证明了该方法在各种低资源场景中的有效性,包括特定领域和通用领域问题。消融研究证实了框架各模块的有效性,可视化研究则指出了最佳部署条件,进一步支持了我们的结论。

[NLP-35] Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
[NLP-35] 绕过DARCY防御:无法区分的通用对抗触发器

链接: https://arxiv.org/abs/2409.03183
作者: Zuquan Peng,Yuanyuan He,Jianbing Ni,Ben Niu
关键词-EN: Natural Language Processing, Universal Adversarial Triggers, Neural networks, Universal Adversarial, Language Processing
关键词-ZH: 自然语言处理、通用对抗触发器、神经网络、通用对抗、语言处理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the “honeypot” concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY’s detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT’s adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
摘要:用于自然语言处理(NLP)的神经网络(NN)分类模型容易受到通用对抗触发器(UAT)的攻击,UAT会触发模型对任何输入产生特定的预测。DARCY借用“蜜罐”概念布设多个陷门,可以有效检测UAT生成的对抗样本。不幸的是,我们发现了一种新的UAT生成方法,称为IndisUAT,它生成触发器(即token),并用它们构造对抗样本,这些样本在DARCY检测层上的特征分布与随机选择类别中良性样本的特征分布无法区分。所产生的对抗样本使受DARCY保护的模型的预测结果损失最大化。同时,所产生的触发器在文本生成、文本推理和阅读理解的黑盒模型中都是有效的。最后,在NLP任务的NN模型下的评估结果表明,IndisUAT方法可以有效绕过DARCY并穿透其他防御。例如,在RNN和CNN模型中,IndisUAT可使DARCY检测的真阳性率分别下降至少40.8%和90.6%,并使准确率分别下降至少33.3%和51.6%。IndisUAT使BERT对抗防御模型的准确率至少降低34.0%,并使GPT-2语言模型即使在非种族语境下也会产生种族主义输出。

[NLP-36] MARAGS: A Multi-Adapter System for Multi-Task Retrieval Augmented Generation Question Answering KDD
[NLP-36] MARAGS:用于多任务检索增强生成问题回答的多适配器系统

链接: https://arxiv.org/abs/2409.03171
作者: Mitchell DeHaven
关键词-EN: Meta Comprehensive RAG, Meta Comprehensive, multi-adapter retrieval augmented, Comprehensive RAG, retrieval augmented generation
关键词-ZH: Meta Comprehensive RAG、Meta Comprehensive、多适配器检索增强、Comprehensive RAG、检索增强生成
类目: Computation and Language (cs.CL)
备注: Accepted to CRAG KDD Cup 24 Workshop

点击查看摘要

Abstract:In this paper we present a multi-adapter retrieval augmented generation system (MARAGS) for Meta’s Comprehensive RAG (CRAG) competition for KDD CUP 2024. CRAG is a question answering dataset contains 3 different subtasks aimed at realistic question and answering RAG related tasks, with a diverse set of question topics, question types, time dynamic answers, and questions featuring entities of varying popularity. Our system follows a standard setup for web based RAG, which uses processed web pages to provide context for an LLM to produce generations, while also querying API endpoints for additional information. MARAGS also utilizes multiple different adapters to solve the various requirements for these tasks with a standard cross-encoder model for ranking candidate passages relevant for answering the question. Our system achieved 2nd place for Task 1 as well as 3rd place on Task 2.
摘要:在本文中,我们提出了一个面向Meta在KDD CUP 2024举办的综合RAG(CRAG)竞赛的多适配器检索增强生成系统(MARAGS)。CRAG是一个问答数据集,包含3个针对现实问答RAG相关任务的不同子任务,涵盖多样的问题主题、问题类型、时间动态答案以及涉及不同热度实体的问题。我们的系统遵循基于Web的RAG标准设置:使用处理后的网页为LLM提供上下文以生成回答,同时查询API端点获取额外信息。MARAGS还利用多个不同的适配器来满足这些任务的各种要求,并使用标准的交叉编码器模型对与回答问题相关的候选段落进行排序。我们的系统在任务1中获得第二名,在任务2中获得第三名。
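MARAGS 中“对候选段落按与问题的相关性排序”的流程可以用如下草图示意。其中 overlap_score 只是演示排序流程的词重叠占位打分,实际系统中应替换为真实的 cross-encoder 模型打分:

```python
def rank_passages(question, passages, score_fn):
    """按 score_fn(question, passage) 的得分从高到低排序候选段落。"""
    return sorted(passages, key=lambda p: score_fn(question, p), reverse=True)

def overlap_score(q, p):
    """词重叠率占位打分,仅为演示流程;非真实相关性模型。"""
    qs, ps = set(q.lower().split()), set(p.lower().split())
    return len(qs & ps) / max(len(qs), 1)

ranked = rank_passages(
    "who won the world cup in 2018",
    ["France won the 2018 World Cup final.", "The weather today is sunny."],
    overlap_score,
)
```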

[NLP-37] Continual Skill and Task Learning via Dialogue
[NLP-37] 通过对话进行持续的技能与任务学习

链接: https://arxiv.org/abs/2409.03166
作者: Weiwei Gu,Suresh Kondepudi,Lixiao Huang,Nakul Gopalan
关键词-EN: sample efficiency, challenging problem, perpetually with sample, robot, skills
关键词-ZH: 样本效率,具有挑战性的问题,永远与样本、机器人、技能在一起
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Continual and interactive robot learning is a challenging problem as the robot is present with human users who expect the robot to learn novel skills to solve novel tasks perpetually with sample efficiency. In this work we present a framework for robots to query and learn visuo-motor robot skills and task relevant information via natural language dialog interactions with human users. Previous approaches either focus on improving the performance of instruction following agents, or passively learn novel skills or concepts. Instead, we used dialog combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three different components for our agent. Firstly, we propose a novel visual-motor control policy ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Secondly, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding allowing us to know when to ask questions and/or demonstrations from users. Finally, we integrated an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with a 100% accuracy when trained with only five demonstrations for a novel skill while still maintaining a 74.75% accuracy on pre-trained skills in the RLBench dataset where other models fall significantly short. We also performed a human-subjects study with 8 subjects to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% in the task of sandwich making with the real robot learning from participant data demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.
摘要:持续且交互式的机器人学习是一个具有挑战性的问题,因为机器人要与人类用户共处,而用户期望它能以较高的样本效率持续学习新技能来解决新任务。在这项工作中,我们提出了一个框架,让机器人通过与人类用户的自然语言对话交互来查询并学习视觉-运动机器人技能和任务相关信息。以往的方法要么着眼于提高指令跟随智能体的性能,要么被动地学习新的技能或概念。相反,我们使用对话并结合语言-技能接地嵌入来查询或确认用户请求的技能和/或任务。为了实现这一目标,我们为智能体开发并集成了三个不同的组件。首先,我们提出了一种新的低秩自适应视觉-运动控制策略ACT(ACT-LoRA),使现有的SoTA ACT模型能够进行少样本持续学习。其次,我们开发了一个对齐模型,将跨技能实施方式的演示投影到共享嵌入中,从而让我们知道何时向用户提问和/或请求演示。最后,我们集成了一个现有的LLM与人类用户交互,执行接地的交互式持续技能学习来解决任务。我们的ACT-LoRA模型在仅用五次演示训练新技能时能以100%的准确率学习微调后的新技能,同时在RLBench数据集的预训练技能上仍保持74.75%的准确率,而其他模型在这方面明显不足。我们还对8名受试者进行了人类受试者研究,以展示组合框架的持续学习能力。在真实机器人从参与者数据中学习的设置下,我们在三明治制作任务中取得了75%的成功率,这表明使用我们的方法,机器人可以通过与非专家用户的对话学习新的技能或任务知识。

[NLP-38] MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
[NLP-38] MaterialBENCH:评估大型语言模型的大学水平材料科学解题能力

链接: https://arxiv.org/abs/2409.03161
作者: Michiko Yoshitake(1),Yuta Suzuki(2),Ryo Igarashi(1),Yoshitaka Ushiku(1),Keisuke Nagato(3) ((1) OMRON SINIC X, (2) Osaka Univ., (3) Univ. Tokyo)
关键词-EN: college-level benchmark dataset, materials science field, large language models, science field, college-level benchmark
关键词-ZH: 大学水平基准数据集、材料科学领域、大型语言模型、科学领域、大学水平基准
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed. Performance differences between the free-response type and multiple-choice type in the same models and the influence of using system massages on multiple-choice problems are also studied. We anticipate that MaterialBENCH will encourage further developments of LLMs in reasoning abilities to solve more complicated problems and eventually contribute to materials research and discovery.
摘要:我们构建了一个面向材料科学领域大型语言模型(LLM)的大学水平基准数据集MaterialBENCH。该数据集由基于大学教科书的问题-答案对组成。问题分为两种类型:一种是自由回答型,另一种是多项选择型。多项选择题通过在正确答案之外添加三个错误答案作为选项来构建,使LLM可以从四个选项中选择一个作为回答。除了答案格式不同外,自由回答型和多项选择型的大多数问题是重叠的。我们还使用MaterialBENCH对LLM进行了实验,包括ChatGPT-3.5、ChatGPT-4、Bard(实验当时)以及通过OpenAI API调用的GPT-3.5和GPT-4。我们分析并讨论了MaterialBENCH所测得的各LLM性能的异同,还研究了同一模型在自由回答型和多项选择型上的表现差异,以及使用系统消息(system message)对多项选择题的影响。我们预计MaterialBENCH将推动LLM推理能力的进一步发展,以解决更复杂的问题,并最终为材料研究和发现做出贡献。
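文中“在正确答案之外添加三个错误答案构成四选一”的多项选择题构造方式,可以用如下草图示意(字段名与随机种子均为示意性假设):

```python
import random

def make_multiple_choice(question, correct, distractors, seed=0):
    """把一道自由回答题转换为四选一:打乱选项顺序并记录正确选项对应的字母。"""
    assert len(distractors) == 3
    rng = random.Random(seed)  # 固定种子保证可复现
    choices = [correct] + list(distractors)
    rng.shuffle(choices)
    letters = "ABCD"
    answer = letters[choices.index(correct)]
    return {"question": question, "choices": dict(zip(letters, choices)), "answer": answer}

item = make_multiple_choice(
    "What is the SI unit of stress?", "pascal", ["newton", "joule", "watt"]
)
```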

[NLP-39] Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models
[NLP-39] 图上辩论:面向大型语言模型的灵活可靠的推理框架

链接: https://arxiv.org/abs/2409.03155
作者: Jie Ma,Zhitao Gao,Qi Chai,Wangchun Sun,Pinghui Wang,Hongbin Pei,Jing Tao,Lingyun Song,Jun Liu,Chen Zhang,Lizhen Cui
关键词-EN: Large Language Models, real-world applications due, knowledge graphs, Large Language, Language Models
关键词-ZH: 大型语言模型、现实世界应用程序、知识图、大型语言、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths distracting from the answer generation, and false-positive relations hindering the path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to perform answer trying after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at this https URL.
摘要:由于缺乏相关知识,大型语言模型(LLM)在现实应用中可能会出现幻觉。相比之下,知识图谱包含广泛的多关系结构,存储了大量符号化事实。因此,将LLM与知识图谱集成已被广泛探索,而知识图谱问答(KGQA)是这类集成的关键试金石。该任务要求LLM通过从知识图谱中检索相关三元组来回答自然语言问题。然而,现有方法面临两大挑战:过长的推理路径会分散答案生成的注意力,而假阳性关系会阻碍路径的精炼。本文提出了一种迭代交互式KGQA框架,利用LLM的交互学习能力在图上进行推理和辩论(Debating over Graphs,DoG)。具体而言,DoG采用子图聚焦机制,允许LLM在每个推理步骤后尝试作答,从而缓解冗长推理路径的影响。另一方面,DoG利用多角色辩论团队逐步简化复杂问题,减少假阳性关系的影响。这一辩论机制确保了推理过程的可靠性。在五个公共数据集上的实验结果证明了该架构的有效性和优越性。值得注意的是,DoG在WebQuestions和GrailQA上的准确率分别比最先进的方法ToG高出23.7%和9.1%。此外,在上述数据集上与多种LLM的集成实验突显了DoG的灵活性。代码可在此 https URL 获取。

[NLP-40] GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
[NLP-40] GraphEx:一种基于图的广告商关键词推荐提取方法

链接: https://arxiv.org/abs/2409.03140
作者: Ashirbad Mishra,Soumik Dey,Marshall Wu,Jinyu Zhao,He Yu,Kaichen Ni,Binbin Li,Kamesh Madduri
关键词-EN: Extreme Multi-Label Classification, Online sellers, listed products, enhance their sales, advertisers are recommended
关键词-ZH: 极端多标签分类、在线卖家、上市产品、提高销售额、推荐广告商
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
摘要:在线卖家和广告商会获得针对其上架产品推荐的关键词,并通过对这些关键词出价来提升销量。生成此类推荐的一个流行范式是极端多标签分类(XMC),它将关键词标记/映射到商品上。我们概述了在电商平台上使用传统的基于商品-查询的标记或映射技术进行关键词推荐的局限性。我们介绍了GraphEx,这是一种创新的基于图的方法,通过从商品标题中提取token排列来向卖家推荐关键词。此外,我们证明在实际应用中依赖查准率/查全率等传统指标可能产生误导,因此需要组合多种指标来评估真实场景中的性能。这些指标旨在评估关键词与商品的相关性以及触达买家的潜力。GraphEx在eBay的表现优于生产模型,实现了上述目标。它支持在资源受限的生产环境中进行近实时推理,并可有效扩展到数十亿个商品。
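GraphEx 的出发点是从商品标题中提取 token 排列作为候选关键词。下面的草图仅枚举长度不超过 max_len 的 token 排列来演示这一思路,与论文中基于图的高效实现并不等同:

```python
from itertools import permutations

def candidate_keyphrases(title, max_len=2):
    """枚举标题 token 的所有长度不超过 max_len 的排列,作为候选关键词。"""
    tokens = title.lower().split()
    phrases = set()
    for n in range(1, max_len + 1):
        for perm in permutations(tokens, n):
            phrases.add(" ".join(perm))
    return phrases

cands = candidate_keyphrases("Wireless Gaming Mouse")
```

注意排列(而非子串)同时覆盖 "gaming mouse" 和 "mouse gaming" 这类词序不同的买家查询;真实系统还需按相关性对候选做过滤与排序。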

[NLP-41] Well that escalated quickly: The Single-Turn Crescendo Attack (STCA)
[NLP-41] 嗯,事情迅速升级了:单轮渐强攻击(STCA)

链接: https://arxiv.org/abs/2409.03131
作者: Alan Aqrawi
关键词-EN: large language models, Single-Turn Crescendo Attack, multi-turn crescendo attack, crescendo attack established, Crescendo Attack
关键词-ZH: 大型语言模型,单轮渐强攻击,多轮渐强攻击,渐强攻击已建立,渐强攻击
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores a novel approach to adversarial attacks on large language models (LLM): the Single-Turn Crescendo Attack (STCA). The STCA builds upon the multi-turn crescendo attack established by Mark Russinovich, Ahmed Salem, Ronen Eldan. Traditional multi-turn adversarial strategies gradually escalate the context to elicit harmful or controversial responses from LLMs. However, this paper introduces a more efficient method where the escalation is condensed into a single interaction. By carefully crafting the prompt to simulate an extended dialogue, the attack bypasses typical content moderation systems, leading to the generation of responses that would normally be filtered out. I demonstrate this technique through a few case studies. The results highlight vulnerabilities in current LLMs and underscore the need for more robust safeguards. This work contributes to the broader discourse on responsible AI (RAI) safety and adversarial testing, providing insights and practical examples for researchers and developers. This method is unexplored in the literature, making it a novel contribution to the field.
摘要:本文探讨了一种针对大型语言模型(LLM)的新型对抗攻击方法:单轮渐强攻击(Single-Turn Crescendo Attack,STCA)。STCA建立在Mark Russinovich、Ahmed Salem、Ronen Eldan提出的多轮渐强攻击的基础上。传统的多轮对抗策略通过逐步升级上下文来诱导LLM产生有害或有争议的回应。而本文介绍了一种更高效的方法,将这种升级压缩到单次交互中。通过精心设计提示词来模拟一段延长的对话,该攻击可以绕过典型的内容审核系统,生成通常会被过滤掉的回应。我通过几个案例研究演示了这项技术。结果突显了当前LLM的脆弱性,并强调需要更强有力的防护措施。这项工作有助于关于负责任AI(RAI)安全和对抗性测试的更广泛讨论,为研究人员和开发人员提供见解和实践示例。该方法在文献中尚未被探索,是对该领域的新贡献。

[NLP-42] Probing self-attention in self-supervised speech models for cross-linguistic differences
[NLP-42] 探测自监督语音模型中的自注意力以研究跨语言差异

链接: https://arxiv.org/abs/2409.03115
作者: Sai Gopinath,Joselyn Rodriguez
关键词-EN: gained traction, increase in accuracy, transformer architectures, Speech, models
关键词-ZH: 获得吸引力、准确性提高、Transformer架构、语音、模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 18 figures

点击查看摘要

Abstract:Speech models have gained traction thanks to increase in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse ranging from almost entirely diagonal to almost entirely global regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
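
The diagonal-versus-global contrast drawn in the abstract can be quantified directly from an attention matrix. A minimal sketch (hypothetical toy matrices, not TERA's actual heads) scores a head by the attention mass it places near the diagonal:

```python
def diagonality(attn, band=1):
    """Fraction of attention mass within `band` positions of the diagonal.

    `attn` is a row-stochastic matrix (list of lists): attn[i][j] is how
    much query position i attends to key position j. Values near 1.0
    indicate a local (diagonal) head; values near the uniform baseline
    indicate a global head.
    """
    n = len(attn)
    near = sum(attn[i][j] for i in range(n) for j in range(n) if abs(i - j) <= band)
    return near / n  # each row sums to 1, so total mass is n

# Hypothetical heads: perfectly diagonal vs. fully global (uniform).
diag_head = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]

print(diagonality(diag_head))    # 1.0
print(diagonality(global_head))  # 0.625
```

Scores like this, computed per head and per layer, are one way to place heads on the diagonal-to-global spectrum the abstract describes.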

[NLP-43] Quantification of stylistic differences in human- and ASR-produced transcripts of African American English INTERSPEECH2024

Link: https://arxiv.org/abs/2409.03059
Authors: Annika Heuser, Tyler Kendall, Miguel del Rio, Quinten McNamara, Nishchal Bhandari, Corey Miller, Migüel Jetté
Keywords: conflate multiple sources, Common measures, automatic speech recognition, ASR performance evaluation, conflate multiple
Categories: Computation and Language (cs.CL)
Comments: Published in Interspeech 2024 Proceedings, 5 pages excluding references, 5 figures

Abstract:Common measures of accuracy used to assess the performance of automatic speech recognition (ASR) systems, as well as human transcribers, conflate multiple sources of error. Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation when differences exist between training and test datasets. The problem is compounded for speech from underrepresented varieties, where the speech to orthography mapping is not as standardized. We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). The results, and overall analysis, help clarify how ASR outputs are a function of the decisions made by the training data’s human transcribers.
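
Word error rate, the comparison metric used above, is word-level Levenshtein distance normalized by reference length. A minimal sketch shows how a verbatim/non-verbatim style mismatch alone inflates WER:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, via the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# A verbatim reference against a non-verbatim hypothesis that drops a filler:
# one deletion out of six reference words, so WER = 1/6 despite an arguably
# "correct" transcript.
print(wer("i uh went to the store", "i went to the store"))
```

This is exactly the confound the paper points to: the metric charges a full error for a stylistic choice, not a recognition mistake.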

[NLP-44] Oddballness: universal anomaly detection with language models

Link: https://arxiv.org/abs/2409.03046
Authors: Filip Graliński, Ryszard Staruch, Krzysztof Jurkiewicz
Keywords: totally unsupervised manner, language model, detect anomalies, unsupervised manner, totally unsupervised
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric introduced in this paper: oddballness. Oddballness measures how ``strange’’ a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.
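
The abstract does not reproduce the oddballness formula, so the sketch below uses an illustrative rank-aware proxy (an assumption, not the authors' definition) to show why raw low likelihood is a poor anomaly signal: under a flat distribution every token is unlikely, yet none is strange:

```python
def low_likelihood(dist, token):
    """Baseline: flag tokens simply because the model gave them low probability."""
    return 1.0 - dist[token]

def strangeness(dist, token):
    """Illustrative rank-aware proxy (an assumption, not the paper's formula):
    the probability mass assigned to tokens strictly more likely than the
    observed one. 0 for a top token; near 1 for a genuinely surprising one."""
    p = dist[token]
    return sum(q for q in dist.values() if q > p)

flat = {t: 0.1 for t in "abcdefghij"}                  # flat next-token distribution
peaked = {"a": 0.91} | {t: 0.01 for t in "bcdefghij"}  # peaked distribution

print(low_likelihood(flat, "c"))  # 0.9 — the baseline over-flags a normal token
print(strangeness(flat, "c"))     # 0   — no token is more likely, so nothing is strange
print(strangeness(peaked, "c"))   # 0.91 — here the token really is anomalous
```

The proxy captures the spirit of the paper's argument (judge a token relative to the whole distribution, not by its raw probability); the paper's actual oddballness metric should be taken from the source.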

[NLP-45] CLUE: Concept-Level Uncertainty Estimation for Large Language Models

Link: https://arxiv.org/abs/2409.03021
Authors: Yu-Hsiang Wang, Andrew Bai, Che-Ping Tsai, Cho-Jui Hsieh
Keywords: Large Language Models, Large Language, Language Models, natural language generation, demonstrated remarkable proficiency
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs’ generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.
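
A simplified way to see the concept-level idea: sample several generations, extract each one's concepts (stubbed out here as pre-extracted sets, since the paper delegates extraction to an LLM), and score a concept by how consistently it recurs:

```python
from math import log2

def concept_uncertainty(samples):
    """Per-concept uncertainty from repeated sampling.

    `samples`: one set of extracted concepts per sampled generation (the
    extraction step, delegated to an LLM in the paper, is assumed done).
    A concept present in every sample scores 0 bits; one present in half
    of them scores the maximal 1 bit (binary entropy of its frequency).
    """
    scores = {}
    for concept in set().union(*samples):
        p = sum(concept in s for s in samples) / len(samples)
        scores[concept] = 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
    return scores

samples = [
    {"mathematician", "born in 1912"},
    {"mathematician", "born in 1912"},
    {"mathematician", "born in 1915"},  # the birth year wavers; the profession does not
    {"mathematician", "born in 1912"},
]
u = concept_uncertainty(samples)
print(u["mathematician"])  # 0.0 — stable concept
print(u["born in 1912"])   # ≈0.811 — unstable, a hallucination candidate
```

Sequence-level uncertainty would assign one score to the whole answer; the per-concept breakdown isolates exactly which claim is shaky, which is the interpretability gain the abstract describes.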

[NLP-46] Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models

Link: https://arxiv.org/abs/2409.02976
Authors: Gabriel Y. Arteaga, Thomas B. Schön, Nicolas Pielawski
Keywords: Uncertainty estimation, high-risk settings, Large Language Models, autonomous cars, component when implementing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 3 figures

Abstract:Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurances. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they need a large amount of computations and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice as only one GPU is needed for training and inference.
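
One common recipe for turning ensemble outputs into a hallucination signal (a standard decomposition, not necessarily this paper's exact estimator) splits predictive entropy into an agreement term and a disagreement term:

```python
from math import log

def entropy(p):
    return -sum(q * log(q) for q in p if q > 0)

def decompose(member_probs):
    """Split ensemble uncertainty (in nats) into parts.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of individual members (data noise)
    epistemic = total - aleatoric, the disagreement between members;
                a large value is the hallucination flag.
    """
    k, n = len(member_probs), len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / k for i in range(n)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / k
    return total, aleatoric, total - aleatoric

agree = [[0.7, 0.3], [0.7, 0.3]]         # members agree: no epistemic term
disagree = [[0.99, 0.01], [0.01, 0.99]]  # confident disagreement

print(decompose(agree)[2])     # 0.0
print(decompose(disagree)[2])  # ≈0.637 — strong disagreement
```

The paper's contribution is making the ensemble itself cheap to train and run; once the member distributions exist, a disagreement score of this kind is the usual way to read them.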

Artificial Intelligence

[AI-0] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Link: https://arxiv.org/abs/2409.03757
Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
Keywords: gained increasing attention, scene encoding strategies, encoding strategies playing, increasing attention, gained increasing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL, Github: this https URL

Abstract:Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.

[AI-1] WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

Link: https://arxiv.org/abs/2409.03753
Authors: Yuntian Deng, Wenting Zhao, Jack Hessel, Xiang Ren, Claire Cardie, Yejin Choi
Keywords: offers exciting opportunities, data offers exciting, study user-chatbot interactions, conversation data offers, real-world conversation data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis’s utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
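
The optimizations listed above (embedding precomputation plus caching for responsive search) can be sketched in a few lines; the two-dimensional embeddings below are toy stand-ins for real conversation embeddings, not WildVis's actual implementation:

```python
import math
from functools import lru_cache

# Precomputed conversation embeddings (toy 2-D vectors standing in for
# real embedding-model output, computed offline in a tool like this).
EMBEDDINGS = {
    "conv-1": (1.0, 0.0),
    "conv-2": (0.9, 0.1),
    "conv-3": (0.0, 1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

@lru_cache(maxsize=1024)  # cache repeated queries so the UI stays responsive
def nearest(query, k=2):
    """Ids of the k conversations most similar to a query embedding."""
    ranked = sorted(EMBEDDINGS, key=lambda c: cosine(EMBEDDINGS[c], query), reverse=True)
    return tuple(ranked[:k])

print(nearest((1.0, 0.05)))  # ('conv-1', 'conv-2')
```

At million-conversation scale one would swap the linear scan for an approximate-nearest-neighbor index, but the precompute-then-cache shape stays the same.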

[AI-2] LLM-CI: Assessing Contextual Integrity Norms in Language Models

Link: https://arxiv.org/abs/2409.03735
Authors: Yan Shvartzshnaider, Vasisht Duddu, John Lacalamita
Keywords: Large language models, training data scraped, Large language, inadvertently encode societal, encode societal preferences
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: 20 pages, 8 Figures, 4 Tables

Abstract:Large language models (LLMs), while memorizing parts of their training data scraped from the Internet, may also inadvertently encode societal preferences and norms. As these models are integrated into sociotechnical systems, it is crucial that the norms they encode align with societal expectations. These norms could vary across models, hyperparameters, optimization techniques, and datasets. This is especially challenging due to prompt sensitivity - small variations in prompts yield different responses, rendering existing assessment methodologies unreliable. There is a need for a comprehensive framework covering various models, optimization, and datasets, along with a reliable methodology to assess encoded norms. We present LLM-CI, the first open-sourced framework to assess privacy norms encoded in LLMs. LLM-CI uses a Contextual Integrity-based factorial vignette methodology to assess the encoded norms across different contexts and LLMs. We propose the multi-prompt assessment methodology to address prompt sensitivity by assessing the norms from only the prompts that yield consistent responses across multiple variants. Using LLM-CI and our proposed methodology, we comprehensively evaluate LLMs using IoT and COPPA vignettes datasets from prior work, examining the impact of model properties (e.g., hyperparameters, capacity) and optimization strategies (e.g., alignment, quantization).

[AI-3] Planning In Natural Language Improves LLM Search For Code Generation

Link: https://arxiv.org/abs/2409.03733
Authors: Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
Keywords: scaling training compute, scaling inference compute, yielded analogous gains, training compute, compute has led
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
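
The pass@1 and pass@200 figures quoted above are conventionally computed with the standard unbiased estimator (assumed here; the abstract does not spell out its estimator): generate n samples per problem, count the c that pass, and estimate the chance that a budget of k draws contains a pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = samples generated per problem, c = samples that pass the tests,
    k = hypothetical draw budget; the result is the probability that a
    random size-k subset of the n samples contains at least one pass.
    """
    if n - c < k:
        return 1.0  # fewer failures than draws, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 10, 1))    # ≈0.05, i.e. c/n when k = 1
print(pass_at_k(200, 10, 200))  # 1.0: with k = n, any passing sample is drawn
print(pass_at_k(10, 0, 5))      # 0.0: nothing passes, nothing to find
```

Under this metric, diversity pays directly: repeated near-identical samples add little to the chance that some subset of k contains a pass, which is the abstract's argument for searching over distinct plans.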

[AI-4] Sample-Efficient Diffusion for Text-To-Speech Synthesis INTERSPEECH2024

Link: https://arxiv.org/abs/2409.03717
Authors: Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
Keywords: work introduces Sample-Efficient, introduces Sample-Efficient Speech, effective speech synthesis, modest data regimes, Sample-Efficient Speech Diffusion
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Interspeech 2024

Abstract:This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, that we call U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech - far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% the training data.

[AI-5] Applications and Advances of Artificial Intelligence in Music Generation:A Review

Link: https://arxiv.org/abs/2409.03715
Authors: Yanxu Chen, Linshu Huang, Tian Gou
Keywords: made significant progress, music generation, artificial intelligence, recent years, driving innovation
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In recent years, artificial intelligence (AI) has made significant progress in the field of music generation, driving innovation in music creation and applications. This paper provides a systematic review of the latest research advancements in AI music generation, covering key technologies, models, datasets, evaluation methods, and their practical applications across various fields. The main contributions of this review include: (1) presenting a comprehensive summary framework that systematically categorizes and compares different technological approaches, including symbolic generation, audio generation, and hybrid models, helping readers better understand the full spectrum of technologies in the field; (2) offering an extensive survey of current literature, covering emerging topics such as multimodal datasets and emotion expression evaluation, providing a broad reference for related research; (3) conducting a detailed analysis of the practical impact of AI music generation in various application domains, particularly in real-time interaction and interdisciplinary applications, offering new perspectives and insights; (4) summarizing the existing challenges and limitations of music quality evaluation methods and proposing potential future research directions, aiming to promote the standardization and broader adoption of evaluation techniques. Through these innovative summaries and analyses, this paper serves as a comprehensive reference tool for researchers and practitioners in AI music generation, while also outlining future directions for the field.

[AI-6] A Different Level Text Protection Mechanism With Differential Privacy

Link: https://arxiv.org/abs/2409.03707
Authors: Qingwen Fu
Keywords: BERT pre-training model, BERT pre-training, pre-training model, model and proves, proves the effectiveness
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The article introduces a method for extracting words of different degrees of importance based on the BERT pre-training model and proves the effectiveness of this method. The article also discusses the impact of maintaining the same perturbation results for words of different importance on the overall text utility. This method can be applied to long text protection.

[AI-7] View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

Link: https://arxiv.org/abs/2409.03685
Authors: Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, Jiajun Wu
Keywords: Large-scale visuomotor policy, visuomotor policy learning, generalizable manipulation systems, visuomotor policy, promising approach
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CoRL 2024

Abstract:Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at this https URL.

[AI-8] TRACE-cs: Trustworthy Reasoning for Contrastive Explanations in Course Scheduling Problems

Link: https://arxiv.org/abs/2409.03671
Authors: Stylianos Loukas Vasileiou, William Yeoh
Keywords: large language models, address contrastive queries, combines symbolic reasoning, hybrid system, system that combines
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present TRACE-cs, a novel hybrid system that combines symbolic reasoning with large language models (LLMs) to address contrastive queries in scheduling problems. TRACE-cs leverages SAT solving techniques to encode scheduling constraints and generate explanations for user queries, while utilizing an LLM to process the user queries into logical clauses as well as refine the explanations generated by the symbolic solver to natural language sentences. By integrating these components, our approach demonstrates the potential of combining symbolic methods with LLMs to create explainable AI agents with correctness guarantees.
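
The contrastive-query loop can be illustrated without a SAT solver; in this toy sketch (hypothetical courses and constraints, with brute-force checking standing in for the paper's SAT encoding and plain strings for its LLM-refined prose), the explanation for "why not this alternative schedule?" is simply the set of constraints the alternative violates:

```python
def violated(assignment, constraints):
    """Constraints an assignment breaks: the raw material of a contrastive
    explanation ("your alternative fails because of these")."""
    return [desc for desc, ok in constraints if not ok(assignment)]

# Hypothetical courses and constraints.
constraints = [
    ("CS101 and MATH200 must not share a slot",
     lambda a: a["CS101"] != a["MATH200"]),
    ("CS101 must be scheduled in the morning",
     lambda a: a["CS101"] in {"9am", "10am"}),
]

schedule = {"CS101": "9am", "MATH200": "11am"}
alternative = {"CS101": "11am", "MATH200": "11am"}  # "why not CS101 at 11am?"

print(violated(schedule, constraints))     # [] — the proposed schedule is valid
print(violated(alternative, constraints))  # both constraints: the explanation
```

In the hybrid system described above, the symbolic side plays this role with correctness guarantees, while the LLM translates the user's question into constraints and the violated set back into fluent language.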

[AI-9] Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG

Link: https://arxiv.org/abs/2409.03646
Authors: Manshan Guo, Bhavin Choksi, Sari Sadiya, Alessandro T. Gifford, Martina G. Vilas, Radoslaw M. Cichy, Gemma Roig
Keywords: artificial neural networks, artificial neural, neural networks, remain relatively susceptible, EEG prediction accuracy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive bias from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases robustness to ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction; and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed significant correlation between the networks’ EEG prediction accuracy, often highest around 100 ms post stimulus onset, and their gains in adversarial robustness. Although effect size was limited, effects were consistent across different random initializations and robust for architectural variants. We further teased apart the data from individual EEG channels and observed strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens up avenues for future efforts that scale to larger datasets under diverse stimuli conditions with the promise of stronger effects.

[AI-10] Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Cord Paralysis

Link: https://arxiv.org/abs/2409.03597
Authors: Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Faya Liang, Ming Li
Keywords: Multimodal Analyzing System, automatically extract key, extract key segments, presents the Multimodal, laryngeal videostroboscopic videos
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This paper presents the Multimodal Analyzing System for Laryngoscope (MASL), a system that combines audio and video data to automatically extract key segments and metrics from laryngeal videostroboscopic videos for clinical assessment. MASL integrates glottis detection with keyword spotting to analyze patient vocalizations and refine video highlights for better inspection of vocal cord movements. The system includes a strobing video extraction module that identifies frames by analyzing hue, saturation, and value fluctuations. MASL also provides effective metrics for vocal cord paralysis detection, employing a two-stage glottis segmentation process using U-Net followed by diffusion-based refinement to reduce false positives. Instead of glottal area waveforms, MASL estimates anterior glottic angle waveforms (AGAW) from glottis masks, evaluating both left and right vocal cords to detect unilateral vocal cord paralysis (UVFP). By comparing AGAW variances, MASL distinguishes between left and right paralysis. Ablation studies and experiments on public and real-world datasets validate MASL’s segmentation module and demonstrate its ability to provide reliable metrics for UVFP diagnosis.

[AI-11] 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD

Link: https://arxiv.org/abs/2409.03563
Authors: Lorenzo Pacchiardi, Lucy G. Cheke, José Hernández-Orallo
Keywords: individual task instances, task instances, LLM, performance, instances
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Abstract:Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
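
A deliberately simple stand-in for the generic assessor (nearest-neighbor matching on the reference profile, not the paper's trained model) shows the mechanics: match the new LLM to the previously tested LLM it agrees with most on the reference instances, then reuse that LLM's recorded outcome:

```python
def predict(new_ref, past, instance):
    """Predict a new LLM's success on `instance` from reference results.

    A nearest-neighbor stand-in for the paper's trained assessor:
    pick the previously evaluated LLM whose reference-set results agree
    most with the new LLM's, and reuse its recorded outcome. `past` maps
    model name -> (reference results, results on other instances);
    results are 1 (success) or 0 (failure).
    """
    def agreement(ref):
        return sum(ref[q] == new_ref[q] for q in new_ref)
    best = max(past, key=lambda name: agreement(past[name][0]))
    return past[best][1][instance]

# Hypothetical prior evaluations: reference answers plus one held-out instance.
past = {
    "model-a": ({"q1": 1, "q2": 1, "q3": 0}, {"hard-proof": 0}),
    "model-b": ({"q1": 1, "q2": 0, "q3": 0}, {"hard-proof": 1}),
}
new_ref = {"q1": 1, "q2": 1, "q3": 0}  # the new LLM behaves like model-a
print(predict(new_ref, past, "hard-proof"))  # 0
```

The paper's assessor also conditions on features of the target instance, but the core economy is the same: roughly 100 reference evaluations of the new model replace a full-benchmark run.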

[AI-12] DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

Link: https://arxiv.org/abs/2409.03550
Authors: Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, Liqiang Nie
Keywords: high computational demands, demonstrated exceptional generative, exceptional generative capabilities, slow inference speeds, Diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, while they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either unaccessible or too enormous to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing it from slowing down the optimization process as the generation is slow. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as “datasets” for training new DMs.

[AI-13] Prediction Accuracy Reliability: Classification and Object Localization under Distribution Shift

Link: https://arxiv.org/abs/2409.03543
Authors: Fabian Diet, Moussa Kassem Sbeyti, Michelle Karg
Keywords: Natural distribution shift, convolutional neural networks, distribution shift, Natural distribution, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This preprint has not undergone any post-submission improvements or corrections

Abstract:Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows to quantize the impact of different types of shifts on both, task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification stronger than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.

[AI-14] LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

Link: https://arxiv.org/abs/2409.03516
Authors: Jeongsoo Kim, Jongho Nang, Junsuk Choe
Keywords: Recent Vision Transformer, Recent Vision, Vision Transformer, demonstrated impressive performance, demonstrated impressive
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our codes are availiable at this https URL.

[AI-15] Disclosure of AI-Generated News Increases Engagement but Does Not Reduce Aversion Despite Positive Quality Ratings

链接: https://arxiv.org/abs/2409.03500
作者: Fabrizio Gilardi,Sabrina Di Lorenzo,Juri Ezzaini,Beryl Santa,Benjamin Streiff,Eric Zurfluh,Emma Hoes
关键词-EN: including journalism, artificial intelligence, advancement of artificial, articles, AI-generated
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of artificial intelligence (AI) has led to its application in many areas, including journalism. One key issue is the public’s perception of AI-generated content. This preregistered study investigates (i) the perceived quality of AI-assisted and AI-generated versus human-generated news articles, (ii) whether disclosure of AI’s involvement in generating these news articles influences engagement with them, and (iii) whether such awareness affects the willingness to read AI-generated articles in the future. We employed a between-subjects survey experiment with 599 participants from the German-speaking part of Switzerland, who evaluated the credibility, readability, and expertise of news articles. These articles were either written by journalists (control group), rewritten by AI (AI-assisted group), or entirely generated by AI (AI-generated group). Our results indicate that all news articles, regardless of whether they were written by journalists or AI, were perceived to be of equal quality. When participants in the treatment groups were subsequently made aware of AI’s involvement in generating the articles, they expressed a higher willingness to engage with (i.e., continue reading) the articles than participants in the control group. However, they were not more willing to read AI-generated news in the future. These results suggest that aversion to AI usage in news media is not primarily rooted in a perceived lack of quality, and that by disclosing using AI, journalists could attract more immediate engagement with their content, at least in the short term.

[AI-16] Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation

链接: https://arxiv.org/abs/2409.03470
作者: Prerak Mody,Nicolas F. Chaves-de-Plaza,Chinmay Rao,Eleftheria Astrenidou,Mischa de Ridder,Nienke Hoekstra,Klaus Hildebrandt,Marius Staring
关键词-EN: medical image segmentation, Increased usage, learning in medical, medical image, image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error, however, no work has been done on improving the “utility” of Bayesian uncertainty maps such that it is only present in inaccurate regions and not in the accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, viz. head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that when compared to the Bayesian baseline the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
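The AvU metric that the loss above optimizes is commonly defined as the fraction of voxels that are either accurate-and-certain or inaccurate-and-uncertain; the published loss is a differentiable relaxation of this ratio. A discrete sketch (our simplification, not the paper's training code):

```python
import numpy as np

def avu(accurate, uncertain):
    """Accuracy-vs-Uncertainty: fraction of voxels that are either
    accurate-and-certain or inaccurate-and-uncertain. AvU = 1 means
    uncertainty appears exactly where the prediction is wrong."""
    accurate = np.asarray(accurate, dtype=bool)
    uncertain = np.asarray(uncertain, dtype=bool)
    n_ac = np.sum(accurate & ~uncertain)   # accurate, certain (desired)
    n_au = np.sum(accurate & uncertain)    # accurate, uncertain (undesired)
    n_ic = np.sum(~accurate & ~uncertain)  # inaccurate, certain (undesired)
    n_iu = np.sum(~accurate & uncertain)   # inaccurate, uncertain (desired)
    return (n_ac + n_iu) / max(n_ac + n_au + n_ic + n_iu, 1)

# Toy voxel masks: uncertainty flags exactly the two wrong voxels.
acc = [1, 1, 0, 1, 0, 1]
unc = [0, 0, 1, 0, 1, 0]
score = avu(acc, unc)
```

The binary `uncertain` mask would in practice come from thresholding the predictive-entropy heatmap mentioned in the abstract.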

[AI-17] Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks

链接: https://arxiv.org/abs/2409.03463
作者: Lorenzo Bini,Marco Sorbi,Stephane Marchand-Maillet
关键词-EN: Graph Neural Networks, Neural Networks, effectively modeling data, Graph Neural, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess model robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.
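The "activation ratio" idea behind the detection method can be illustrated with a toy detector: compare the largest activation magnitude in a layer to the typical (median) magnitude and flag the layer when one value dwarfs the rest. The threshold and exact ratio here are our simplification, not the paper's published definition:

```python
import numpy as np

def massive_activation_ratio(activations, eps=1e-12):
    """Ratio of the largest activation magnitude to the median magnitude
    across a layer's outputs; MAs manifest as extreme values of this ratio."""
    mags = np.abs(np.asarray(activations, dtype=float)).ravel()
    return mags.max() / (np.median(mags) + eps)

def has_massive_activation(activations, threshold=100.0):
    # Hypothetical cutoff; real detection uses the full ratio distribution.
    return massive_activation_ratio(activations) >= threshold

normal = np.ones((4, 8))                     # flat edge-feature activations
spiked = normal.copy()
spiked[0, 0] = 1e4                           # one massive value
```

Running the detector over attention layers of GraphTransformer-style models is what lets the study localize where MAs emerge.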

[AI-18] How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes

链接: https://arxiv.org/abs/2409.03454
作者: Inacio Vieira,Will Allred,Seamus Lankford,Sheila Castilho Monteiro De Sousa,Andy Way
关键词-EN: Decoder-only LLMs, generate high-quality translations, shown impressive performance, shown impressive, ability to learn
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.

[AI-19] Fine-tuning large language models for domain adaptation: Exploration of training strategies scaling model merging and synergistic capabilities

链接: https://arxiv.org/abs/2409.03444
作者: Wei Lu,Rachel K. Luu,Markus J. Buehler
关键词-EN: Large Language Models, Large Language, Direct Preference Optimization, Ratio Preference Optimization, Odds Ratio Preference
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results hold also for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and show that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
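The model-merging step that produces the emergent capabilities described above is, in its simplest form, parameter-space interpolation between fine-tuned checkpoints that share an architecture. The paper does not spell out its exact recipe in this abstract, so the uniform "model soup"-style averaging below is an illustrative assumption:

```python
import numpy as np

def merge_models(state_dict_a, state_dict_b, alpha=0.5):
    """Linear weight interpolation of two fine-tuned checkpoints with the
    same architecture: merged = alpha * A + (1 - alpha) * B per tensor."""
    assert state_dict_a.keys() == state_dict_b.keys()
    return {name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
            for name in state_dict_a}

# Two toy 'checkpoints' with a single weight matrix each
# (stand-ins for, e.g., a CPT+SFT model and a DPO model).
ckpt_a = {"w": np.array([[2.0, 0.0], [0.0, 2.0]])}
ckpt_b = {"w": np.array([[0.0, 2.0], [2.0, 0.0]])}
merged = merge_models(ckpt_a, ckpt_b)
```

The paper's observation is that such merges can yield functionality neither parent exhibits alone, but that this effect may not appear for very small (1.7B-parameter) models.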

[AI-20] KiloBot: A Programming Language for Deploying Perception-Guided Industrial Manipulators at Scale

链接: https://arxiv.org/abs/2409.03439
作者: Wei Gao,Jingqiang Wang,Xinv Zhu,Jun Zhong,Yue Shen,Youshuang Ding
关键词-EN: handle unstructured environments, handle unstructured, unstructured environments, environments with cameras, industrial robots
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:We would like industrial robots to handle unstructured environments with cameras and perception pipelines. In contrast to traditional industrial robots that replay offline-crafted trajectories, online behavior planning is required for these perception-guided industrial applications. Aside from perception and planning algorithms, deploying perception-guided manipulators also requires substantial effort in integration. One approach is writing scripts in a traditional language (such as Python) to construct the planning problem and perform integration with other algorithmic modules and external devices. While scripting in Python is feasible for a handful of robots and applications, deploying perception-guided manipulation at scale (e.g., more than 10000 robot workstations in over 2000 customer sites) becomes intractable. To resolve this challenge, we propose a Domain-Specific Language (DSL) for perception-guided manipulation applications. To scale up the deployment, our DSL provides: 1) an easily accessible interface to construct and solve a sub-class of Task and Motion Planning (TAMP) problems that are important in practical applications; and 2) a mechanism to implement flexible control flow to perform integration and address the customized requirements of distinct industrial applications. Combined with an intuitive graphical programming frontend, our DSL is mainly used by machine operators without coding experience in traditional programming languages. Within hours of training, operators are capable of orchestrating interesting, sophisticated manipulation behaviors with our DSL. Extensive practical deployments demonstrate the efficacy of our method.

[AI-21] Reinforcement Learning Approach to Optimizing Profilometric Sensor Trajectories for Surface Inspection

链接: https://arxiv.org/abs/2409.03429
作者: Sara Roos-Hoefgeest,Mario Roos-Hoefgeest,Ignacio Alvarez,Rafael C. González
关键词-EN: High-precision surface defect, surface defect detection, High-precision surface, defect detection, detection in manufacturing
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:High-precision surface defect detection in manufacturing is essential for ensuring quality control. Laser triangulation profilometric sensors are key to this process, providing detailed and accurate surface measurements over a line. To achieve a complete and precise surface scan, accurate relative motion between the sensor and the workpiece is required. It is crucial to control the sensor pose to maintain optimal distance and relative orientation to the surface. It is also important to ensure uniform profile distribution throughout the scanning process. This paper presents a novel Reinforcement Learning (RL) based approach to optimize robot inspection trajectories for profilometric sensors. Building upon the Boustrophedon scanning method, our technique dynamically adjusts the sensor position and tilt to maintain optimal orientation and distance from the surface, while also ensuring a consistent profile distance for uniform and high-quality scanning. Utilizing a simulated environment based on the CAD model of the part, we replicate real-world scanning conditions, including sensor noise and surface irregularities. This simulation-based approach enables offline trajectory planning based on CAD models. Key contributions include the modeling of the state space, action space, and reward function, specifically designed for inspection applications using profilometric sensors. We use Proximal Policy Optimization (PPO) algorithm to efficiently train the RL agent, demonstrating its capability to optimize inspection trajectories with profilometric sensors. To validate our approach, we conducted several experiments where a model trained on a specific training piece was tested on various parts in simulation. Also, we conducted a real-world experiment by executing the optimized trajectory, generated offline from a CAD model, to inspect a part using a UR3e robotic arm model.
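The reward design sketched in the abstract balances three terms: deviation from the sensor's optimal stand-off distance, deviation from a perpendicular orientation, and non-uniform profile spacing. The paper's exact reward function and weights are not given here, so the exponential shaping and coefficients below are our illustrative assumptions:

```python
import math

def inspection_reward(distance, target_distance, tilt_deg,
                      profile_gap, target_gap,
                      w_dist=1.0, w_tilt=0.5, w_gap=1.0):
    """Illustrative reward for the profilometric-inspection RL task:
    each term peaks at 1.0 when its quantity is ideal and decays
    exponentially with the deviation, giving a maximum reward of 3.0."""
    r_dist = math.exp(-w_dist * abs(distance - target_distance))  # stand-off
    r_tilt = math.exp(-w_tilt * abs(tilt_deg))                    # orientation
    r_gap = math.exp(-w_gap * abs(profile_gap - target_gap))      # uniformity
    return r_dist + r_tilt + r_gap

# Ideal pose vs. a pose that is too far, tilted, and unevenly spaced.
ideal = inspection_reward(0.10, 0.10, 0.0, 0.002, 0.002)
off = inspection_reward(0.15, 0.10, 10.0, 0.004, 0.002)
```

A PPO agent trained in the CAD-based simulator would then maximize the accumulated reward along the Boustrophedon scan path.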

[AI-22] KAN See In the Dark

链接: https://arxiv.org/abs/2409.03404
作者: Aoxiang Ning,Minglong Xue,Jinhong He,Chengyun Song
关键词-EN: Existing low-light image, complex nonlinear relationship, low-light image enhancement, low-light images due, Existing low-light
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing low-light image enhancement methods are difficult to fit the complex nonlinear relationship between normal and low-light images due to uneven illumination and noise effects. The recently proposed Kolmogorov-Arnold networks (KANs) feature spline-based convolutional layers and learnable activation functions, which can effectively capture nonlinear dependencies. In this paper, we design a KAN-Block based on KANs and innovatively apply it to low-light image enhancement. This method effectively alleviates the limitations of current methods constrained by linear network structures and lack of interpretability, further demonstrating the potential of KANs in low-level vision tasks. Given the poor perception of current low-light image enhancement methods and the stochastic nature of the inverse diffusion process, we further introduce frequency-domain perception for visually oriented enhancement. Extensive experiments demonstrate the competitive performance of our method on benchmark datasets. The code will be available at: this https URL.
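The defining feature of KANs is that each edge carries its own learnable 1-D activation parameterized by control points. Real KANs use B-splines; the piecewise-linear sketch below (via `np.interp`, with control points we chose by hand rather than learned) only illustrates the idea of a pointwise nonlinearity shaped by trainable knots:

```python
import numpy as np

def spline_activation(x, knots_x, knots_y):
    """KAN-style learnable 1-D activation, simplified to piecewise-linear
    interpolation between control points (real KANs use B-spline bases)."""
    return np.interp(x, knots_x, knots_y)

# Control points a training loop would learn; this shape brightens
# dark inputs more than bright ones - a low-light-enhancement flavour.
kx = np.array([0.0, 0.25, 0.5, 1.0])
ky = np.array([0.0, 0.45, 0.7, 1.0])

dark, bright = 0.1, 0.9
out_dark = spline_activation(dark, kx, ky)      # lifted strongly
out_bright = spline_activation(bright, kx, ky)  # lifted slightly
```

Because the knot values are ordinary parameters, gradients flow through them during training, which is what lets KAN-Blocks fit the nonlinear low-light-to-normal mapping.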

[AI-23] Game On: Towards Language Models as RL Experimenters

链接: https://arxiv.org/abs/2409.03402
作者: Jingwei Zhang,Thomas Lampe,Abbas Abdolmaleki,Jost Tobias Springenberg,Martin Riedmiller
关键词-EN: learning experiment workflow, common reinforcement learning, enable automated mastery, reinforcement learning experiment, automates parts
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We propose an agent architecture that automates parts of the common reinforcement learning experiment workflow, to enable automated mastery of control domains for embodied agents. To do so, it leverages a VLM to perform some of the capabilities normally required of a human experimenter, including the monitoring and analysis of experiment progress, the proposition of new tasks based on past successes and failures of the agent, decomposing tasks into a sequence of subtasks (skills), and retrieval of the skill to execute - enabling our system to build automated curricula for learning. We believe this is one of the first proposals for a system that leverages a VLM throughout the full experiment cycle of reinforcement learning. We provide a first prototype of this system, and examine the feasibility of current models and techniques for the desired level of automation. For this, we use a standard Gemini model, without additional fine-tuning, to provide a curriculum of skills to a language-conditioned Actor-Critic algorithm, in order to steer data collection so as to aid learning new skills. Data collected in this way is shown to be useful for learning and iteratively improving control policies in a robotics domain. Additional examination of the ability of the system to build a growing library of skills, and to judge the progress of the training of those skills, also shows promising results, suggesting that the proposed architecture provides a potential recipe for fully automated mastery of tasks and domains for embodied agents.

[AI-24] Hardware Acceleration of LLMs: A comprehensive survey and comparison

链接: https://arxiv.org/abs/2409.03384
作者: Nikoletta Koilia,Christoforos Kachris
关键词-EN: Large Language Models, generate human-like text, language processing tasks, natural language processing, Large Language
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the several research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators. The survey presents the frameworks that have been proposed and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the energy efficiency, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in comparison is that every proposed scheme is implemented on a different process technology, making a fair comparison hard. The main contribution of this paper is that we extrapolate the results of the performance and the energy efficiency on the same technology to make a fair comparison; one theoretical and one more practical. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then we make a fair comparison of the performance.

[AI-25] CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks

链接: https://arxiv.org/abs/2409.03381
作者: Yongxin Deng(1),Xihe Qiu(1),Xiaoyu Tan(2),Chao Qu(2),Jing Pan(3),Yuan Cheng(3),Yinghui Xu(4),Wei Chu(2) ((1) School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China, (2) INF Technology (Shanghai) Co., Ltd., Shanghai, China, (3) School of Art, Design and Architecture, Monash University, Melbourne, Australia, (4) Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China)
关键词-EN: psychology investigates perception, investigates perception, Cognitive psychology investigates, rational System, System
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman’s dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the CogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs’ response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.

[AI-26] Raw Speech Enhancement with Deep State Space Modeling

链接: https://arxiv.org/abs/2409.03377
作者: Yan Ru Pei,Ritik Shrivastava,FNU Sidharth
关键词-EN: simple deep state-space, deep state-space autoencoder, state-space autoencoder configured, efficient online raw, online raw speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network’s performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.
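The low-resource regime mentioned at the end of the abstract (input compressed to 4000 Hz and 4 bits) is easy to simulate. The paper's exact resampling and quantization pipeline is not specified, so the naive decimation and uniform quantizer below are minimal stand-ins:

```python
import numpy as np

def degrade(signal, rate, target_rate=4000, bits=4):
    """Simulate a low-resource input: crude decimation to ~target_rate Hz
    followed by uniform quantization to the given bit depth."""
    step = max(int(rate // target_rate), 1)
    x = np.asarray(signal, dtype=float)[::step]   # naive downsample
    levels = 2 ** bits                            # 16 levels for 4 bits
    x = np.clip(x, -1.0, 1.0 - 1e-9)
    q = np.floor((x + 1.0) / 2.0 * levels)        # map [-1, 1) to {0..15}
    return (q / levels) * 2.0 - 1.0               # back to [-1, 1)

t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)         # 440 Hz tone at 16 kHz
lofi = degrade(clean, rate=16000)
```

Feeding such degraded waveforms to the denoiser is what the abstract's 4000 Hz / 4-bit robustness claim refers to.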

[AI-27] Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time

链接: https://arxiv.org/abs/2409.03375
作者: Francisco de Arriba-Pérez,Silvia García-Méndez
关键词-EN: million people worldwide, Based on official, million people, natural language analysis, official estimates
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack sufficient semantic knowledge management and explicability capabilities. Moreover, the use of Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning prediction of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. More in detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) an explainability dashboard that provides visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall of about 85% for the mental deterioration class. To sum up, this work contributes an affordable, flexible, non-invasive, personalized diagnostic system.

[AI-28] Sketch: A Toolkit for Streamlining LLM Operations

链接: https://arxiv.org/abs/2409.03346
作者: Xin Jiang,Xiang Li,Wenjia Ma,Xuezhi Fang,Yiqun Yao,Naitong Yu,Xuying Meng,Peng Han,Jing Li,Aixin Sun,Yequan Wang
关键词-EN: Large language models, achieved remarkable success, represented by GPT, Large language, GPT family
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) represented by GPT family have achieved remarkable success. The characteristics of LLMs lie in their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model’s outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of "plug-and-play" for various applications. The components of Sketch will be progressively open-sourced at this https URL.

[AI-29] YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving

链接: https://arxiv.org/abs/2409.03320
作者: Jingyu Zhang,Wenqing Zhang,Chaoyi Tan,Xiangtian Li,Qianyi Sun
关键词-EN: autonomous driving systems, traffic signs efficiently, detect traffic signs, traffic sign detection, proposed YOLO PPA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It is very important to detect traffic signs efficiently and accurately in autonomous driving systems. However, the farther the distance, the smaller the traffic signs. Existing object detection algorithms can hardly detect these small-scale objects. In addition, the performance of embedded devices on vehicles limits the scale of the detection models. To address these challenges, a YOLO PPA based traffic sign detection algorithm is proposed in this paper. The experimental results on the GTSDB dataset show that compared to the original YOLO, the proposed method improves inference efficiency by 11.2%. The mAP 50 is also improved by 93.2%, which demonstrates the effectiveness of the proposed YOLO PPA.

[AI-30] N-gram Prediction and Word Difference Representations for Language Modeling

链接: https://arxiv.org/abs/2409.03295
作者: DongNyeong Heo,Daniela Noemi Rim,Heeyoul Choi
关键词-EN: Causal language modeling, underpinning remarkable successes, recent large language, foundational framework underpinning, framework underpinning remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have been introduced to predict future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words’ prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
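The word difference representation (WDR) idea can be sketched at the embedding level: instead of regressing toward the raw next-word embedding, the target at each position is the difference between consecutive word embeddings, which contextualizes the target on the current word. This is our reading of the abstract, not the paper's exact formulation:

```python
import numpy as np

def wdr_targets(embeddings):
    """Word Difference Representation (as we interpret it): the target at
    position t is e[t+1] - e[t], a contextualized surrogate for the raw
    next-word embedding. Returns an array of shape (seq_len - 1, dim)."""
    e = np.asarray(embeddings, dtype=float)
    return e[1:] - e[:-1]

# Toy 2-D embeddings for a 4-token sentence.
emb = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [1.0, 1.0],
                [2.0, 1.0]])
targets = wdr_targets(emb)
```

Note the targets are lossless with respect to the sequence: cumulatively summing them from the first embedding reconstructs every later embedding, so no next-word information is discarded.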

[AI-31] LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP

链接: https://arxiv.org/abs/2409.03291
作者: Henrique Da Silva Gameiro,Andrei Kucharavy,Ljiljana Dolamic
关键词-EN: large Language Models, Language Models, major concern, emergence of widely, widely available powerful
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 20 pages, 7 tables, 13 figures, under consideration for EMNLP

点击查看摘要

Abstract:With the emergence of widely available powerful LLMs, disinformation generated by large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations – short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between the adversarial evasion resilience and overfitting to the reference human text, with both needing evaluation in benchmarks and currently absent. We believe this suggests a re-consideration of current LLM detector benchmarking approaches and provides a dynamically extensible benchmark to allow it (this https URL).
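The "trivial attack" the abstract describes is just generating at a higher sampling temperature: dividing logits by a temperature above 1 flattens the token distribution, shifting the text statistics that zero-shot detectors rely on. A minimal sketch of the mechanism (toy logits, names ours):

```python
import numpy as np

def sample_distribution(logits, temperature):
    """Softmax with temperature: T < 1 sharpens the distribution toward
    the argmax token, T > 1 flattens it toward uniform."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.0, 1.0, 0.0]    # toy next-token logits
cool = sample_distribution(logits, temperature=0.7)
hot = sample_distribution(logits, temperature=1.5)
```

Because the attacker controls the temperature at generation time, a benchmark that fixes it underestimates how easily detector statistics can be shifted.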

[AI-32] iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models

链接: https://arxiv.org/abs/2409.03284
作者: Yassir Lairgi,Ludovic Moncla,Rémy Cazabet,Khalid Benabdeslem,Pierre Cléau
关键词-EN: access valuable information, challenging to access, access valuable, making it challenging, building Knowledge Graphs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024

点击查看摘要

Abstract:Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models’ capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.

[AI-33] ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding

链接: https://arxiv.org/abs/2409.03277
作者: Zhengzhuo Xu,Bowen Qu,Yiyan Qi,Sinan Du,Chengjin Xu,Chun Yuan,Jian Guo
关键词-EN: Automatic chart understanding, Automatic chart, document parsing, chart understanding, crucial for content
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
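The mixture-of-experts connector described above can be pictured with a minimal NumPy sketch. This is not ChartMoE's actual code (which is not shown in the abstract); the dimensions, initialisation, and gating form are illustrative assumptions, showing only the general idea of replacing one linear projector with several gated linear experts:

```python
import numpy as np

# Illustrative MoE connector sketch: each expert is a linear map from the
# visual feature space to the LLM embedding space (in ChartMoE, each expert
# would be initialised from a different alignment task), and a gating
# network mixes the expert outputs per token.
rng = np.random.default_rng(0)
d_vis, d_llm, n_experts = 8, 16, 4

experts = [rng.normal(0, 0.02, (d_vis, d_llm)) for _ in range(n_experts)]
gate_w = rng.normal(0, 0.02, (d_vis, n_experts))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_connector(tokens):
    """Project visual tokens (n, d_vis) into the LLM space (n, d_llm)."""
    gates = softmax(tokens @ gate_w)                    # (n, n_experts), rows sum to 1
    outs = np.stack([tokens @ w for w in experts], 1)   # (n, n_experts, d_llm)
    return (gates[..., None] * outs).sum(axis=1)        # gated mixture of experts

visual_tokens = rng.normal(size=(5, d_vis))
projected = moe_connector(visual_tokens)
print(projected.shape)  # (5, 16)
```

In the paper's setting the experts and gate would then be refined jointly with the LLM during high-quality knowledge learning; here the weights are simply random for demonstration.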

[AI-34] Recent Advances in Attack and Defense Approaches of Large Language Models

链接: https://arxiv.org/abs/2409.03274
作者: Jing Cui,Yishi Xu,Zhewei Huang,Shuchang Zhou,Jianbin Jiao,Junge Zhang
关键词-EN: Large Language Models, Large Language, revolutionized artificial intelligence, advanced text processing, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities. However, their widespread deployment has raised significant safety and reliability concerns. Established vulnerabilities in deep neural networks, coupled with emerging threat models, may compromise security evaluations and create a false sense of security. Given the extensive research in the field of LLM security, we believe that summarizing the current state of affairs will help the research community better understand the present landscape and inform future developments. This paper reviews current research on LLM vulnerabilities and threats, and evaluates the effectiveness of contemporary defense mechanisms. We analyze recent studies on attack vectors and model weaknesses, providing insights into attack mechanisms and the evolving threat landscape. We also examine current defense strategies, highlighting their strengths and limitations. By contrasting advancements in attack and defense methodologies, we identify research gaps and propose future directions to enhance LLM security. Our goal is to advance the understanding of LLM safety challenges and guide the development of more robust security measures.

[AI-35] Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

链接: https://arxiv.org/abs/2409.03271
作者: Yu Wang,Shiwan Zhao,Zhihu Wang,Heyuan Huang,Ming Fan,Yubo Zhang,Zhixing Wang,Haijun Wang,Ting Liu
关键词-EN: large language models, paradigm has emerged, capabilities of large, large language, LLM performance
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:The Chain-of-Thought (CoT) paradigm has emerged as a critical approach for enhancing the reasoning capabilities of large language models (LLMs). However, despite their widespread adoption and success, CoT methods often exhibit instability due to their inability to consistently ensure the quality of generated reasoning paths, leading to sub-optimal reasoning performance. To address this challenge, we propose the Strategic Chain-of-Thought (SCoT), a novel methodology designed to refine LLM performance by integrating strategic knowledge prior to generating intermediate reasoning steps. SCoT employs a two-stage approach within a single prompt: first eliciting an effective problem-solving strategy, which is then used to guide the generation of high-quality CoT paths and final answers. Our experiments across eight challenging reasoning datasets demonstrate significant improvements, including a 21.05% increase on the GSM8K dataset and 24.13% on the Tracking_Objects dataset, respectively, using the Llama3-8b model. Additionally, we extend the SCoT framework to develop a few-shot method with automatically matched demonstrations, yielding even stronger results. These findings underscore the efficacy of SCoT, highlighting its potential to substantially enhance LLM performance in complex reasoning tasks.
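The two-stage flow described in the abstract (elicit a strategy, then generate the reasoning guided by it) can be sketched as a small prompting wrapper. The `call_llm` function below is a hypothetical stand-in, not a real model API; the prompt wording is also an assumption, kept only to show the stage structure:

```python
# Sketch of the two-stage Strategic Chain-of-Thought flow. `call_llm` is a
# placeholder that echoes canned responses; a real deployment would call an
# actual LLM endpoint here.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns fixed text so the sketch is runnable."""
    if "elicit" in prompt.lower():
        return "Work backwards from the target quantity."
    return "Step 1: ... Step 2: ... Final answer: 42"

def strategic_cot(question: str) -> dict:
    # Stage 1: elicit an effective problem-solving strategy for the question.
    strategy = call_llm(f"Elicit an effective strategy for solving:\n{question}")
    # Stage 2: generate the CoT path and final answer, guided by the strategy.
    answer = call_llm(
        f"Using the strategy '{strategy}', reason step by step and answer:\n{question}"
    )
    return {"strategy": strategy, "answer": answer}

result = strategic_cot("What is 6 * 7?")
print(result["strategy"])
print(result["answer"])
```

Note that both stages live in one prompt pipeline, mirroring the paper's claim that SCoT works "within a single prompt"; the split into two calls here is purely for readability.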

[AI-36] Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision ECCV2024

链接: https://arxiv.org/abs/2409.03261
作者: Jinhee Kim,Taesung Kim,Jaegul Choo
关键词-EN: Recent advances, minimizing user intervention, vertebrae keypoint estimation, keypoint estimation, enhanced accuracy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 33 pages, ECCV 2024, Project Page: this https URL

点击查看摘要

Abstract:Recent advances in interactive keypoint estimation methods have enhanced accuracy while minimizing user intervention. However, these methods require user input for error correction, which can be costly in vertebrae keypoint estimation where inaccurate keypoints are densely clustered or overlap. We introduce a novel approach, KeyBot, specifically designed to identify and correct significant and typical errors in existing models, akin to user revision. By characterizing typical error types and using simulated errors for training, KeyBot effectively corrects these errors and significantly reduces user workload. Comprehensive quantitative and qualitative evaluations on three public datasets confirm that KeyBot significantly outperforms existing methods, achieving state-of-the-art performance in interactive vertebrae keypoint estimation. The source code and demo video are available at: this https URL

[AI-37] In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems via Search

链接: https://arxiv.org/abs/2409.03260
作者: Emir Demirović,Christian Schilling,Anna Lukina
关键词-EN: attractive as control, control policies, Decision trees, formal synthesis, policies
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages main text incl. references, 1 page appendix

点击查看摘要

Abstract:Decision trees, owing to their interpretability, are attractive as control policies for (dynamical) systems. Unfortunately, constructing, or synthesising, such policies is a challenging task. Previous approaches do so by imitating a neural-network policy, approximating a tabular policy obtained via formal synthesis, employing reinforcement learning, or modelling the problem as a mixed-integer linear program. However, these works may require access to a hard-to-obtain accurate policy or a formal model of the environment (within reach of formal synthesis), and may not provide guarantees on the quality or size of the final tree policy. In contrast, we present an approach to synthesise optimal decision-tree policies given a black-box environment and specification, and a discretisation of the tree predicates, where optimality is defined with respect to the number of steps to achieve the goal. Our approach is a specialised search algorithm which systematically explores the (exponentially large) space of decision trees under the given discretisation. The key component is a novel pruning mechanism that significantly reduces the search space. Our approach represents a conceptually novel way of synthesising small decision-tree policies with optimality guarantees even for black-box environments with black-box specifications.
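The idea of systematically searching a discretised policy space against a black-box environment can be illustrated on a toy problem. This is not the paper's specialised search algorithm (which includes a pruning mechanism over full decision trees); it is a deliberately tiny version: enumerate depth-1 threshold policies and score each by the number of steps it takes to reach the goal:

```python
# Toy illustration of policy synthesis by search: exhaustively try discretised
# depth-1 threshold policies against a black-box 1-D move-to-goal environment,
# keeping the policy that minimises the total number of steps to the goal.

def rollout(policy, start, max_steps=50):
    """Run a policy from `start`; return steps to reach the goal at 0."""
    pos = start
    for step in range(max_steps):
        if pos == 0:          # goal reached
            return step
        pos += policy(pos)    # action is -1 or +1
    return max_steps          # failed within the step budget

def search_threshold_policy(starts, thresholds):
    """Return the threshold with the fewest total steps over all start states."""
    best_t, best_cost = None, float("inf")
    for t in thresholds:
        policy = lambda pos, t=t: 1 if pos < t else -1
        cost = sum(rollout(policy, s) for s in starts)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t, cost = search_threshold_policy(starts=[-3, -1, 2, 4], thresholds=range(-5, 6))
print(t, cost)  # the best split is the goal itself: t = 0, total cost 10
```

The environment is queried only through rollouts, never through a formal model, which is the sense in which both the environment and the specification are "black-box" here.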

[AI-38] Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

链接: https://arxiv.org/abs/2409.03257
作者: Chanjun Park,Hyeonwoo Kim
关键词-EN: Open Ko-LLM Leaderboard, Open Ko-LLM, restricted observation periods, Ko-LLM Leaderboard, eleven months
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which have relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.

[AI-39] E2CL: Exploration-based Error Correction Learning for Embodied Agents

链接: https://arxiv.org/abs/2409.03256
作者: Hanlin Wang,Chak Tou Leong,Jian Wang,Wenjie Li
关键词-EN: exhibiting increasing capability, Language models, utilization and reasoning, models are exhibiting, exhibiting increasing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Language models are exhibiting increasing capability in knowledge utilization and reasoning. However, when applied as agents in embodied environments, they often suffer from misalignment between their intrinsic knowledge and environmental knowledge, leading to infeasible actions. Traditional environment alignment methods, such as supervised learning on expert trajectories and reinforcement learning, face limitations in covering environmental knowledge and achieving efficient convergence, respectively. Inspired by human learning, we propose Exploration-based Error Correction Learning (E2CL), a novel framework that leverages exploration-induced errors and environmental feedback to enhance environment alignment for LM-based agents. E2CL incorporates teacher-guided and teacher-free exploration to gather environmental feedback and correct erroneous actions. The agent learns to provide feedback and self-correct, thereby enhancing its adaptability to target environments. Evaluations in the Virtualhome environment demonstrate that E2CL-trained agents outperform those trained by baseline methods and exhibit superior self-correction capabilities.

[AI-40] Granular-ball Representation Learning for Deep CNN on Learning with Label Noise

链接: https://arxiv.org/abs/2409.03254
作者: Dawei Dai,Hao Zhu,Shuyin Xia,Guoyin Wang
关键词-EN: deep CNN models, actual scenarios, automatically annotated, manually or automatically, noise is inevitably
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In actual scenarios, whether manually or automatically annotated, label noise is inevitably generated in the training data, which can affect the effectiveness of deep CNN models. The popular solutions require data cleaning or designing additional optimizations to penalize mislabeled data, thereby enhancing the robustness of models. However, these methods come at the cost of weakening or even losing some data during the training process. As we know, content is the inherent attribute of an image that does not change with changes in annotations. In this study, we propose a general granular-ball computing (GBC) module that can be embedded into a CNN model, where the classifier finally predicts the label of granular-ball (gb) samples instead of each individual sample. Specifically, considering the classification task: (1) in the forward process, we split the input samples into gb samples at the feature level, each of which can correspond to multiple samples with varying numbers and share one single label; (2) during the backpropagation process, we modify the gradient allocation strategy of the GBC module to enable it to propagate normally; and (3) we develop an experience replay policy to ensure the stability of the training process. Experiments demonstrate that the proposed method can improve the robustness of CNN models with no additional data or optimization.
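The core grouping idea, that many samples can share one granular-ball label so isolated noisy annotations are outvoted, can be sketched with simple clustering. This is an illustrative stand-in, not the paper's GBC module (which operates at the CNN feature level inside the network); the k-means-style grouping and majority vote are assumptions made for a self-contained example:

```python
import numpy as np

def make_granular_balls(features, labels, n_balls=2, iters=10):
    """Group samples into balls; each ball shares one (majority) label."""
    # Deterministic init: evenly spaced samples as initial ball centres.
    idx = np.linspace(0, len(features) - 1, n_balls).astype(int)
    centers = features[idx].copy()
    assignment = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Assign each sample to its nearest ball centre (k-means style).
        assignment = np.argmin(
            ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1
        )
        for k in range(n_balls):
            if np.any(assignment == k):
                centers[k] = features[assignment == k].mean(axis=0)
    balls = []
    for k in range(n_balls):
        member_labels = labels[assignment == k]
        if member_labels.size == 0:
            continue
        # The ball's single shared label is the majority label of its members,
        # which suppresses isolated noisy annotations.
        balls.append((centers[k], int(np.bincount(member_labels).argmax())))
    return balls

# Two well-separated clusters; one flipped (noisy) label in each cluster.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5, 0.1, (10, 2))])
y = np.array([0] * 9 + [1] + [1] * 9 + [0])
balls = make_granular_balls(X, y)
for center, label in balls:
    print(np.round(center).tolist(), label)
```

Each ball ends up carrying the clean majority label of its cluster, so the single flipped annotation in each cluster never reaches the classifier as a training target.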

[AI-41] DiffGrad for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.03239
作者: Jamshaid Ul Rahman,Nimra
关键词-EN: Physics-Informed Neural Networks, addressing highly nonlinear, highly nonlinear problems, nonlinear problems based, Physics-Informed Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注: 20 pages, 14 figures

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are regarded as state-of-the-art tools for addressing highly nonlinear problems based on partial differential equations. Despite their broad range of applications, PINNs encounter several performance challenges, including issues related to efficiency, minimization of computational cost, and enhancement of accuracy. Burgers’ equation, a fundamental equation in fluid dynamics that is extensively used in PINNs, provides flexible results with the Adam optimizer that does not account for past gradients. This paper introduces a novel strategy for solving Burgers’ equation by incorporating DiffGrad with PINNs, a method that leverages the difference between current and immediately preceding gradients to enhance performance. A comprehensive computational analysis is conducted using optimizers such as Adam, Adamax, RMSprop, and DiffGrad to evaluate and compare their effectiveness. Our approach includes visualizing the solutions over space at various time intervals to demonstrate the accuracy of the network. The results show that DiffGrad not only improves the accuracy of the solution but also reduces training time compared to the other optimizers.
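The DiffGrad rule the abstract refers to (Dubey et al.) scales an Adam-style update by a "friction" term, the sigmoid of the absolute change between the current and immediately preceding gradient. The sketch below applies it to a toy quadratic loss rather than to a PINN solving Burgers' equation, so the loss and learning rate are illustrative assumptions:

```python
import numpy as np

# Minimal DiffGrad optimiser sketch: Adam-style first and second moments,
# with the step scaled by sigmoid(|previous gradient - current gradient|).

def diffgrad_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=300):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    prev_g = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (uncentred variance)
        m_hat = m / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        # Friction coefficient in (0.5, 1): sigmoid of the gradient change.
        xi = 1.0 / (1.0 + np.exp(-np.abs(prev_g - g)))
        theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
        prev_g = g
    return theta

# Toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself.
theta = diffgrad_minimize(lambda th: th, np.array([3.0, -2.0]))
print(np.abs(theta).max())  # driven close to the minimiser at 0
```

Swapping the `xi` line for a constant `1.0` recovers plain Adam, which is exactly the comparison the paper's computational analysis performs across optimisers.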

[AI-42] Content Moderation by LLM: From Accuracy to Legitimacy

链接: https://arxiv.org/abs/2409.03219
作者: Tao Huang
关键词-EN: large language model, LLM, large language, language model, content moderation
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One trending application of LLM (large language model) is to use it for content moderation in online platforms. Most current studies on this application have focused on the metric of accuracy - the extent to which LLM makes correct decisions about content. This article argues that accuracy is insufficient and misleading, because it fails to grasp the distinction between easy cases and hard cases as well as the inevitable trade-offs in achieving higher accuracy. Closer examination reveals that content moderation is a constitutive part of platform governance, the key of which is to gain and enhance legitimacy. Instead of making moderation decisions correct, the chief goal of LLM is to make them legitimate. In this regard, this article proposes a paradigm shift from the single benchmark of accuracy towards a legitimacy-based framework of evaluating the performance of LLM moderators. The framework suggests that for easy cases, the key is to ensure accuracy, speed and transparency, while for hard cases, what matters is reasoned justification and user participation. Examined under this framework, LLM’s real potential in moderation is not accuracy improvement. Rather, LLM can better contribute in four other aspects: to conduct screening of hard cases from easy cases, to provide quality explanations for moderation decisions, to assist human reviewers in getting more contextual information, and to facilitate user participation in a more interactive way. Using normative theories from law and social sciences to critically assess the new technological application, this article seeks to redefine LLM’s role in content moderation and redirect relevant research in this field.

[AI-43] xLAM: A Family of Large Action Models to Empower AI Agent Systems

链接: https://arxiv.org/abs/2409.03215
作者: Jianguo Zhang,Tian Lan,Ming Zhu,Zuxin Liu,Thai Hoang,Shirley Kokane,Weiran Yao,Juntao Tan,Akshara Prabhakar,Haolin Chen,Zhiwei Liu,Yihao Feng,Tulika Awalgaonkar,Rithesh Murthy,Eric Hu,Zeyuan Chen,Ran Xu,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
关键词-EN: significant research interest, attracted significant research, research interest, agent tasks, Autonomous agents powered
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Technical report for the Salesforce xLAM model series

点击查看摘要

Abstract:Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents’ generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL

[AI-44] TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

链接: https://arxiv.org/abs/2409.03206
作者: Mingze Gao,Jingyu Liu,Mingda Li,Jiangtao Xie,Qingbin Liu,Bo Zhao,Xi Chen,Hui Xiong
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, significantly improved performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model’s capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM’s temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
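The mechanism TC-LLaVA builds on, Rotary Position Embedding (RoPE), can be shown in a few lines: pairs of dimensions are rotated by an angle proportional to the token position, so attention scores depend only on relative distance. The sketch below is vanilla RoPE, not the paper's Temporal-Aware Dual RoPE, whose temporal component is not specified in enough detail here to reproduce:

```python
import numpy as np

# Vanilla RoPE sketch: rotate (x1_i, x2_i) dimension pairs by pos * freq_i.
# The key property: the dot product of two rotated vectors depends only on
# the difference of their positions, not on the absolute positions.

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

q = np.array([0.3, -1.2, 0.7, 0.5])
k = np.array([1.1, 0.4, -0.6, 0.9])
# Same relative offset (2) at different absolute positions: identical score.
s1 = rope(q, 5) @ rope(k, 3)
s2 = rope(q, 12) @ rope(k, 10)
print(np.isclose(s1, s2))  # True
```

TC-LLaVA's extension, as described, adds temporal position information into this rotation for video tokens while preserving the relative-position relationship demonstrated above.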

[AI-45] An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification

链接: https://arxiv.org/abs/2409.03203
作者: Zhuowei Chen,Lianxi Wang,Yuben Wu,Xinfeng Liao,Yujia Tian,Junyang Zhong
关键词-EN: imbalanced label distributions, imbalanced label, label distributions, Sentiment classification, language model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored, moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework’s modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.

[AI-46] Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers

链接: https://arxiv.org/abs/2409.03183
作者: Zuquan Peng,Yuanyuan He,Jianbing Ni,Ben Niu
关键词-EN: Natural Language Processing, Universal Adversarial Triggers, Neural networks, Universal Adversarial, Language Processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the “honeypot” concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY’s detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT’s adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.

[AI-47] InfraLib: Enabling Reinforcement Learning and Decision Making for Large Scale Infrastructure Management

链接: https://arxiv.org/abs/2409.03167
作者: Pranay Thangeda,Trevor S. Betz,Michael N. Grussing,Melkior Ornik
关键词-EN: Efficient management, economic stability, public safety, crucial for economic, Efficient
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Efficient management of infrastructure systems is crucial for economic stability, sustainability, and public safety. However, infrastructure management is challenging due to the vast scale of systems, stochastic deterioration of components, partial observability, and resource constraints. While data-driven approaches like reinforcement learning (RL) offer a promising avenue for optimizing management policies, their application to infrastructure has been limited by the lack of suitable simulation environments. We introduce InfraLib, a comprehensive framework for modeling and analyzing infrastructure management problems. InfraLib employs a hierarchical, stochastic approach to realistically model infrastructure systems and their deterioration. It supports practical functionality such as modeling component unavailability, cyclical budgets, and catastrophic failures. To facilitate research, InfraLib provides tools for expert data collection, simulation-driven analysis, and visualization. We demonstrate InfraLib’s capabilities through case studies on a real-world road network and a synthetic benchmark with 100,000 components.
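The kind of problem InfraLib models, stochastic component deterioration under a cyclical budget, can be sketched in a few lines. This is not InfraLib's API (the abstract does not show it); component count, decay model, and the repair-the-worst policy are all illustrative assumptions:

```python
import numpy as np

# Toy infrastructure-management simulation: components deteriorate
# stochastically each cycle, and a fixed per-cycle budget repairs the
# worst components back to perfect condition.

def simulate(n_components=100, cycles=50, budget=10, decay=0.05, seed=0):
    rng = np.random.default_rng(seed)
    condition = np.ones(n_components)          # 1.0 = perfect, 0.0 = failed
    for _ in range(cycles):
        # Stochastic deterioration of every component.
        condition -= rng.uniform(0, decay, n_components)
        condition = np.clip(condition, 0.0, 1.0)
        # Spend the cyclical budget repairing the worst components first.
        worst = np.argsort(condition)[:budget]
        condition[worst] = 1.0
    return condition

final = simulate()
print(final.mean())  # average fleet condition under the maintenance policy
```

Even this toy version exposes the trade-off an RL policy would optimise over: with `budget=0` the fleet decays to failure, while a modest budget keeps average condition high.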

[AI-48] Continual Skill and Task Learning via Dialogue

链接: https://arxiv.org/abs/2409.03166
作者: Weiwei Gu,Suresh Kondepudi,Lixiao Huang,Nakul Gopalan
关键词-EN: sample efficiency, challenging problem, perpetually with sample, robot, skills
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Continual and interactive robot learning is a challenging problem as the robot is present with human users who expect the robot to learn novel skills to solve novel tasks perpetually with sample efficiency. In this work we present a framework for robots to query and learn visuo-motor robot skills and task relevant information via natural language dialog interactions with human users. Previous approaches either focus on improving the performance of instruction following agents, or passively learn novel skills or concepts. Instead, we used dialog combined with a language-skill grounding embedding to query or confirm skills and/or tasks requested by a user. To achieve this goal, we developed and integrated three different components for our agent. Firstly, we propose a novel visual-motor control policy ACT with Low Rank Adaptation (ACT-LoRA), which enables the existing SoTA ACT model to perform few-shot continual learning. Secondly, we develop an alignment model that projects demonstrations across skill embodiments into a shared embedding allowing us to know when to ask questions and/or demonstrations from users. Finally, we integrated an existing LLM to interact with a human user to perform grounded interactive continual skill learning to solve a task. Our ACT-LoRA model learns novel fine-tuned skills with a 100% accuracy when trained with only five demonstrations for a novel skill while still maintaining a 74.75% accuracy on pre-trained skills in the RLBench dataset where other models fall significantly short. We also performed a human-subjects study with 8 subjects to demonstrate the continual learning capabilities of our combined framework. We achieve a success rate of 75% in the task of sandwich making with the real robot learning from participant data demonstrating that robots can learn novel skills or task knowledge from dialogue with non-expert users using our approach.

[AI-49] Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models

链接: https://arxiv.org/abs/2409.03155
作者: Jie Ma,Zhitao Gao,Qi Chai,Wangchun Sun,Pinghui Wang,Hongbin Pei,Jing Tao,Lingyun Song,Jun Liu,Chen Zhang,Lizhen Cui
关键词-EN: Large Language Models, real-world applications due, knowledge graphs, Large Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths distracting from the answer generation, and false-positive relations hindering the path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to perform answer trying after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. On the other hand, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at this https URL.
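The "answer trying after each reasoning step" idea can be sketched without any LLM at all: expand a subgraph one hop at a time from the topic entity and attempt an answer after every hop. This omits DoG's multi-role debate entirely; the dictionary knowledge graph and entities are hypothetical examples:

```python
# Toy sketch of subgraph-focused KGQA: expand one hop per step from the
# topic entity and try to answer after each step, so reasoning stops as
# soon as a relevant triple is found instead of walking a long path.

KG = {  # (head, relation) -> tail; a tiny hypothetical knowledge graph
    ("Paris", "capital_of"): "France",
    ("France", "continent"): "Europe",
    ("Europe", "population_rank"): "3rd",
}

def answer_by_iterative_expansion(topic, target_relation, max_hops=3):
    frontier = [topic]
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for (head, rel), tail in KG.items():
                if head != entity:
                    continue
                if rel == target_relation:   # answer trying after this step
                    return tail
                next_frontier.append(tail)
        frontier = next_frontier
    return None  # no supporting triple found within the hop budget

print(answer_by_iterative_expansion("Paris", "continent"))        # Europe
print(answer_by_iterative_expansion("Paris", "population_rank"))  # 3rd
```

In DoG proper, an LLM (rather than exact relation matching) decides whether the current subgraph already supports an answer, and the debate team rewrites the question as hops are consumed.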

[AI-50] Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning

链接: https://arxiv.org/abs/2409.03147
作者: Juan A. Berrios Moya
关键词-EN: rapid global aging, global aging trend, accurate diagnostic methods, underscoring the urgent, rapid global
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid global aging trend has led to an increase in dementia cases, including Alzheimer’s disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer’s disease and other forms of dementia.

[AI-51] Backdoor defense learnability and obfuscation

链接: https://arxiv.org/abs/2409.03077
作者: Paul Christiano,Jacob Hilton,Victor Lecomte,Mark Xu
关键词-EN: introduce a formal, formal notion, attacker, PAC learnability, function class
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 29 pages

点击查看摘要

Abstract:We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the “trigger”, while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker’s strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
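The attacker/defender game can be made concrete with a toy simulation. The sketch below is only loosely inspired by the voting idea credited to Hanneke et al. (all names are hypothetical): a backdoored function agrees with other plausible hypotheses almost everywhere, so disagreement among hypotheses localizes candidate triggers.

```python
def make_backdoored(f, trigger, payload):
    # Attacker: behaves like f everywhere except on the trigger input.
    def g(x):
        return payload if x == trigger else f(x)
    return g

def disagreement(candidates, x):
    # Defender: flag x when plausible hypotheses disagree on it; on almost
    # every other input the backdoored function is indistinguishable.
    return len({h(x) for h in candidates}) > 1

clean = lambda x: x % 2                              # ground-truth function: parity
backdoored = make_backdoored(clean, trigger=42, payload=1 - clean(42))

# Hypothesis class: the backdoored function plus alternatives backdoored elsewhere.
candidates = [backdoored] + [make_backdoored(clean, t, 1 - clean(t)) for t in (7, 13)]

print(disagreement(candidates, 42), disagreement(candidates, 5))
```

In this toy, inputs backdoored by the alternative hypotheses are also flagged; the paper's voting argument suppresses such false positives by taking a majority over many hypotheses, which is why a randomly chosen trigger is the key constraint.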

[AI-52] MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.03062
作者: Shehan Perera,Yunus Erzurumlu,Deepak Gulati,Alper Yilmaz
关键词-EN: medical image analysis, cancer segmentation poses, poses a significant, significant challenge, challenge in medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV 2024 - BioImage Computing Workshop (Oral)

点击查看摘要

Abstract:Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally, many Transformer-based approaches rely primarily on CNN-based decoders, overlooking the benefits of Transformer-based decoding models. Recognizing these limitations, we address the need for efficient, lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features. 1) MobileUNETR comprises a lightweight hybrid CNN-Transformer encoder to help balance local and global contextual feature extraction in an efficient manner; 2) A novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) surpassing large and complex architectures, MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOPs, resulting in 10x and 23x reductions in parameters and FLOPs, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets, including ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The code will be publicly available at: this https URL

[AI-53] Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.03060
作者: Min Wu,Xiaofu Li,Haoze Wu,Clark Barrett
关键词-EN: learning model outputs, machine learning model, producing optimal verified, Building on VeriX, present VeriX
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking to improve the generation time; the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.
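To make the notion of a verified explanation concrete, here is a toy deletion-based search, not the paper's bound-propagation or binary-search procedure. The `preserved` callback is a hypothetical stand-in for the formal verifier query "does fixing these features alone pin the model's output?"; a minimal subset for which it holds is the explanation.

```python
def minimal_explanation(features, preserved):
    # Greedy deletion: try dropping each feature in turn; keep the drop only
    # if the invariance check still holds on the smaller set.
    explanation = list(features)
    for f in list(features):
        trial = [g for g in explanation if g != f]
        if preserved(trial):
            explanation = trial
    return explanation

# Toy "verifier": the output is pinned iff features 1 and 3 are both fixed.
preserved = lambda kept: 1 in kept and 3 in kept
print(minimal_explanation([0, 1, 2, 3, 4], preserved))
```

This linear pass makes one verifier call per feature; the paper's binary search and confidence ranking exist precisely to cut down the number and cost of such calls.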

[AI-54] Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV2024

链接: https://arxiv.org/abs/2409.03043
作者: Christiaan Viviers,Amaan Valiuddin,Francisco Caetano,Lemar Abdi,Lena Filatova,Peter de With,Fons van der Sommen
关键词-EN: high-level image statistics, normal and In-Distribution, high-level image, distribution shift aims, OOD detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with high-level image statistics different from those of the captured, normal, In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect most commonly occurring sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.

[AI-55] Large Language Model-Based Agents for Software Engineering: A Survey

链接: https://arxiv.org/abs/2409.02977
作者: Junwei Liu,Kaixin Wang,Yixuan Chen,Xin Peng,Zhenpeng Chen,Lingming Zhang,Yiling Lou
关键词-EN: Large Language Models, Language Models, Large Language, advance in Large, LLM-based agents
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 106 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at this https URL.

[AI-56] Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models

链接: https://arxiv.org/abs/2409.02976
作者: Gabriel Y. Arteaga,Thomas B. Schön,Nicolas Pielawski
关键词-EN: Uncertainty estimation, high-risk settings, Large Language Models, autonomous cars, component when implementing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurance. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they need a large amount of computation and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice as only one GPU is needed for training and inference.
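One simple way to turn ensemble disagreement into a hallucination signal, consistent with the idea above though not necessarily the authors' exact scoring rule, is the entropy of the answers produced by the ensemble members:

```python
from collections import Counter
import math

def ensemble_uncertainty(answers):
    # Entropy (in bits) of the answer distribution across ensemble members:
    # high entropy means the members disagree, flagging a hallucination candidate.
    counts = Counter(answers)
    n = len(answers)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(ensemble_uncertainty(["Paris"] * 4))                     # members agree
print(ensemble_uncertainty(["1912", "1915", "1912", "1920"]))  # members disagree
```

A unanimous ensemble scores zero entropy, while scattered answers score high; a threshold on this score then separates confident claims from likely hallucinations.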

[AI-57] Managing multiple agents by automatically adjusting incentives

链接: https://arxiv.org/abs/2409.02960
作者: Shunichi Akatsuka,Yaemi Teramoto,Aaron Courville
关键词-EN: coming years, complex decisions, including in situations, groups of people, making more complex
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注: 7 pages

点击查看摘要

Abstract:In the coming years, AI agents will be used for making more complex decisions, including in situations involving many different groups of people. One big challenge is that AI agents tend to act in their own interest, unlike humans, who often think about what will be best for everyone in the long run. In this paper, we explore a method to get self-interested agents to work towards goals that benefit society as a whole. We propose adding a manager agent that mediates agent interactions by assigning incentives to certain actions. We tested our method on a supply-chain management problem and showed that this framework (1) increases the raw reward by 22.2%, (2) increases the agents’ reward by 23.8%, and (3) increases the manager’s reward by 20.1%.
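The mechanism of a manager steering self-interested agents with incentives can be sketched in miniature. The sketch below is a hypothetical illustration, not the paper's method: a single agent best-responds to private payoffs plus a bonus, and the manager raises the bonus until the socially beneficial action becomes the best response.

```python
def best_response(private_payoff, incentive):
    # A self-interested agent picks the action maximizing its private payoff
    # plus whatever bonus the manager attaches to that action.
    return max(private_payoff, key=lambda a: private_payoff[a] + incentive.get(a, 0.0))

def tune_incentive(private_payoff, target_action, step=0.5, max_steps=20):
    # Manager agent: raise the bonus on the socially beneficial action until
    # the agent's best response switches to it.
    bonus = 0.0
    for _ in range(max_steps):
        if best_response(private_payoff, {target_action: bonus}) == target_action:
            break
        bonus += step
    return bonus

payoff = {"selfish": 5.0, "cooperate": 3.0}
bonus = tune_incentive(payoff, "cooperate")
print(bonus)
```

The manager stops at the smallest bonus (on this grid) that flips the agent's choice; the paper's version learns such incentives jointly across many interacting agents.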

[AI-58] Multi-Modal Adapter for Vision-Language Models

链接: https://arxiv.org/abs/2409.02958
作者: Dominykas Seputis,Serghei Mihailov,Soham Chatterjee,Zehao Xiao
关键词-EN: Large pre-trained vision-language, Large pre-trained, pre-trained vision-language models, requiring retraining, image classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large pre-trained vision-language models, such as CLIP, have demonstrated state-of-the-art performance across a wide range of image classification tasks, without requiring retraining. Few-shot CLIP is competitive with existing specialized architectures that were trained on the downstream tasks. Recent research demonstrates that the performance of CLIP can be further improved using lightweight adaptation approaches. However, previous methods adapt different modalities of the CLIP model individually, ignoring the interactions and relationships between visual and textual representations. In this work, we propose Multi-Modal Adapter, an approach for Multi-Modal adaptation of CLIP. Specifically, we add a trainable Multi-Head Attention layer that combines text and image features to produce an additive adaptation of both. Multi-Modal Adapter demonstrates improved generalizability, based on its performance on unseen classes compared to existing adaptation methods. We perform additional ablations and investigations to validate and interpret the proposed approach.

[AI-59] CortexCompile: Harnessing Cortical-Inspired Architectures for Enhanced Multi-Agent NLP Code Synthesis

链接: https://arxiv.org/abs/2409.02938
作者: Gautham Ramachandran,Rick Yang
关键词-EN: automated code generation, Natural Language Processing, lack real-time adaptability, automated code, complex programming tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Current approaches to automated code generation often rely on monolithic models that lack real-time adaptability and scalability. This limitation is particularly evident in complex programming tasks that require dynamic adjustment and efficiency. The integration of neuroscience principles into Natural Language Processing (NLP) has the potential to revolutionize automated code generation. This paper presents CortexCompile, a novel modular system inspired by the specialized functions of the human brain’s cortical regions. By emulating the distinct roles of the Prefrontal Cortex, Parietal Cortex, Temporal Lobe, and Motor Cortex, CortexCompile achieves significant advancements in scalability, efficiency, and adaptability compared to traditional monolithic models like GPT-4o. The system’s architecture features a Task Orchestration Agent that manages dynamic task delegation and parallel processing, facilitating the generation of highly accurate and optimized code across increasingly complex programming tasks. Experimental evaluations demonstrate that CortexCompile consistently outperforms GPT-4o in development time, accuracy, and user satisfaction, particularly in tasks involving real-time strategy games and first-person shooters. These findings underscore the viability of neuroscience-inspired architectures in addressing the limitations of current NLP models, paving the way for more efficient and human-like AI systems.
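The task-delegation idea behind the Task Orchestration Agent can be illustrated with a trivial router. Everything below is hypothetical and only loosely inspired by the abstract: the "agents" are stub functions standing in for the cortical modules, not the system's actual components.

```python
def orchestrate(subtasks, agents):
    # Orchestration sketch: route each subtask to the specialist agent whose
    # role matches, then collect the partial results in order.
    return [agents[sub["role"]](sub["spec"]) for sub in subtasks]

# Hypothetical specialist agents standing in for the cortical modules.
agents = {
    "planner": lambda spec: f"# plan: {spec}",
    "coder":   lambda spec: f"print('{spec}')",
}
subtasks = [{"role": "planner", "spec": "greet the user"},
            {"role": "coder",   "spec": "hello"}]
print(orchestrate(subtasks, agents))
```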

[AI-60] A method to benchmark high-dimensional process drift detection

链接: https://arxiv.org/abs/2409.03669
作者: Edgar Wolf,Tobias Windisch
关键词-EN: multi-variate finite time, finite time series, time series data, series data coming, manufacturing processes
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Process curves are multi-variate finite time series data coming from manufacturing processes. This paper studies machine learning methods for detecting drifts in process curves. A theoretic framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which quantifies how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is presented.
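One plausible reading of an area-under-the-curve score over drift segments is a pairwise ranking probability; the toy below illustrates that reading only, since the paper defines the exact "temporal area under the curve", and all names here are illustrative.

```python
def temporal_auc(scores, drift_mask):
    # AUC-style ranking score: probability that a curve from a drift segment
    # receives a higher drift score than one from a stable segment (ties = 0.5).
    pos = [s for s, d in zip(scores, drift_mask) if d]
    neg = [s for s, d in zip(scores, drift_mask) if not d]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

drift_scores = [0.1, 0.2, 0.9, 0.8]    # per-curve drift scores from some model
in_drift     = [False, False, True, True]
print(temporal_auc(drift_scores, in_drift))
```

A score of 1.0 means the model ranks every drift-segment curve above every stable curve; 0.5 is chance level.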

计算机视觉

[CV-0] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

链接: https://arxiv.org/abs/2409.03757
作者: Yunze Man,Shuhong Zheng,Zhipeng Bao,Martial Hebert,Liang-Yan Gui,Yu-Xiong Wang
关键词-EN: gained increasing attention, scene encoding strategies, encoding strategies playing, increasing attention, gained increasing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Project page: this https URL , Github: this https URL

点击查看摘要

Abstract:Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.

[CV-1] DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation ECCV2024

链接: https://arxiv.org/abs/2409.03755
作者: Wenliang Zhao,Haolin Wang,Jie Zhou,Jiwen Lu
关键词-EN: Diffusion probabilistic models, computationally expensive due, shown remarkable performance, predictor-corrector diffusion samplers, probabilistic models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024×1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at this https URL

[CV-2] Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution ECCV2024

链接: https://arxiv.org/abs/2409.03754
作者: Marga Don,Stijn Pinson,Blanca Guillen Cebrian,Yuki M. Asano
关键词-EN: Foundation models, popular topic, topic of research, Foundation, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024 Green Foundation Models workshop

点击查看摘要

Abstract:Foundation models (FMs) are a popular topic of research in AI. Their ability to generalize to new tasks and datasets without retraining or needing an abundance of data makes them an appealing candidate for applications on specialist datasets. In this work, we compare the performance of FMs to finetuned pre-trained supervised models in the task of semantic segmentation on an entirely new dataset. We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce. We release the code and dataset for this work on GitHub.

[CV-3] ArtiFade: Learning to Generate High-quality Subject from Blemished Images

链接: https://arxiv.org/abs/2409.03745
作者: Shuya Yang,Shaozhe Hao,Yukang Cao,Kwan-Yee K. Wong
关键词-EN: witnessed remarkable advancements, generation has witnessed, witnessed remarkable, remarkable advancements, ability to learn
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.

[CV-4] Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

链接: https://arxiv.org/abs/2409.03718
作者: Slava Elizarov,Ciara Rowles,Simon Donné
关键词-EN: textual descriptions remains, challenging problem due, Geometry Image Diffusion, computational cost, Generating high-quality
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 11 pages, 9 figures, Project page: this https URL

点击查看摘要

Abstract:Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images, thereby avoiding the need for complex 3D-aware architectures. By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data (allowing us to use only high-quality training data) as well as retaining compatibility with guidance techniques such as IPAdapter. In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.

[CV-5] View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

链接: https://arxiv.org/abs/2409.03685
作者: Stephen Tian,Blake Wulfe,Kyle Sargent,Katherine Liu,Sergey Zakharov,Vitor Guizilini,Jiajun Wu
关键词-EN: Large-scale visuomotor policy, visuomotor policy learning, generalizable manipulation systems, visuomotor policy, promising approach
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to CoRL 2024

点击查看摘要

Abstract:Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at this https URL.

[CV-6] RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images

链接: https://arxiv.org/abs/2409.03644
作者: Benzhi Wang,Jingkai Zhou,Jingqi Bai,Yang Yang,Weihua Chen,Fan Wang,Zhen Lei
关键词-EN: Generative Adversarial Networks, Adversarial Networks, Generative Adversarial, outperforming traditional frameworks, revolutionized visual generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics. Code is available at this https URL.

[CV-7] CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

链接: https://arxiv.org/abs/2409.03643
作者: Bin Wang,Fan Wu,Linke Ouyang,Zhuangcheng Gu,Rui Zhang,Renqiu Xia,Bo Zhang,Conghui He
关键词-EN: presents significant challenges, significant challenges due, recognition presents significant, Formula recognition presents, Formula recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Project Website: this https URL

点击查看摘要

Abstract:Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware, character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
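The motivation for an image-level metric can be shown with a deliberately crude toy: two LaTeX strings that differ as text but render to the same characters should score a perfect match. The `render` lambda below is a hypothetical stand-in for actual rendering, and this sketch omits the spatial matching that the real CDM metric performs.

```python
from collections import Counter

def char_match_score(pred_chars, gt_chars):
    # Toy CDM-style score: F1 over the multisets of rendered characters,
    # ignoring the spatial position information used by the real metric.
    p, g = Counter(pred_chars), Counter(gt_chars)
    matched = sum((p & g).values())
    return 2 * matched / (sum(p.values()) + sum(g.values()))

# Crude stand-in for rendering: drop LaTeX control characters so that
# notational variants of the same formula map to the same character multiset.
render = lambda s: [c for c in s if c not in "{}^_\\"]
print(char_match_score(render("x^{2}+1"), render("x^2+1")))
```

A text-level metric like Edit Distance would penalize `x^{2}` against `x^2` even though they render identically; the image-level comparison does not.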

[CV-8] Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction ECCV2024

链接: https://arxiv.org/abs/2409.03634
作者: Rui Peng,Shihe Shen,Kaiqiang Xiong,Huachen Gao,Jianbo Jiao,Xiaodong Gu,Ronggang Wang
关键词-EN: attracted widespread attention, Reconstructing the high-fidelity, multi-view images, recent years, critical and practical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Accepted

点击查看摘要

Abstract:Reconstructing a high-fidelity surface from multi-view images, especially sparse images, is a critical and practical task that has attracted widespread attention in recent years. However, existing methods are impeded by memory constraints or the requirement of ground-truth depths and cannot recover satisfactory geometric details. To this end, we propose SuRF, a new Surface-centric framework that incorporates a new Region sparsification based on a matching Field, achieving good trade-offs between performance, efficiency and scalability. To our knowledge, this is the first unsupervised method achieving end-to-end sparsification, powered by the introduced matching field, which leverages the weight distribution to efficiently locate the boundary regions containing the surface. Instead of predicting an SDF value for each voxel, we present a new region sparsification approach that sparsifies the volume by judging whether a voxel is inside the surface region. In this way, our model can exploit higher-frequency features around the surface with less memory and computational consumption. Extensive experiments on multiple benchmarks containing complex large-scale scenes show that our reconstructions exhibit high-quality details and achieve new state-of-the-art performance, i.e., 46% improvements with 80% less memory consumption. Code is available at this https URL.
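The region-sparsification step, keeping only voxels the matching field deems likely to contain the surface, reduces to a top-k selection over field weights. The sketch below is an illustrative simplification (a 1D weight list instead of a 3D volume; all names hypothetical), not the SuRF implementation.

```python
def sparsify_volume(weights, keep_ratio=0.2):
    # Keep only the voxels with the largest matching-field weights, i.e.
    # those most likely to lie in the boundary region containing the surface.
    k = max(1, int(len(weights) * keep_ratio))
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    return sorted(order[:k])

field_weights = [0.01, 0.9, 0.05, 0.8, 0.02]
print(sparsify_volume(field_weights, keep_ratio=0.4))
```

Downstream feature extraction then runs only on the surviving voxel indices, which is where the memory and compute savings come from.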

[CV-9] SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing

Link: https://arxiv.org/abs/2409.03605
Authors: Lingyu Xiong,Xize Cheng,Jintao Tan,Xianjia Wu,Xiandong Li,Lei Zhu,Fei Ma,Minglei Li,Huang Xu,Zhihu Hu
Keywords-EN: Audio-driven talking face, face generation aims, Audio-driven talking, input audio, talking face generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 7 figures, 3 tables

Abstract:Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as intermediate representation. Specifically, given the mask of image employed by a parsing network, we first leverage the speech to drive the mask and generate talking segmentation. Then we disentangle semantic regions of image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames. In this way, most of the textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g. hair, lip, eyebrows), our approach enables facial editing seamlessly when generating talking face video. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.

[CV-10] TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces

Link: https://arxiv.org/abs/2409.03600
Authors: Bernardo Biesseck,Pedro Vidal,Luiz Coelho,Roger Granada,David Menotti
Keywords-EN: Condition Diffusion Model, Triple Condition Diffusion, include a large, large number, numerous samples
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIBGRAPI 2024

Abstract:A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets have been discontinued, such as MS1MV3, and synthetic face generators have been proposed, utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, aiming to supply this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets in real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: this https URL.

[CV-11] A practical approach to evaluating the adversarial distance for machine learning classifiers

Link: https://arxiv.org/abs/2409.03598
Authors: Georg Siedel,Ekagra Gupta,Andrey Morozov
Keywords-EN: ensure consistent performance, adversarial, machine learning, adversarial robustness, critical for machine
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted manuscript at International Mechanical Engineering Congress and Exposition IMECE2024

Abstract:Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, methods to accurately compute adversarial robustness have been challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy on specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
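
The upper/lower-bound idea is easiest to see on a linear toy classifier, where the adversarial distance has a closed form to check against; this sketch is ours, not the paper's attack or certification code:

```python
import numpy as np

# For a linear classifier sign(w.x + b), the true L2 distance to the
# decision boundary is |w.x + b| / ||w||, so an iterative estimate can
# be validated exactly. Deep models lack this closed form, which is
# why the paper combines attacks (upper bound) with certification
# (lower bound).

def true_adv_distance(w, b, x):
    return abs(w @ x + b) / np.linalg.norm(w)

def upper_bound_by_search(w, b, x, steps=40):
    # Bisect the step size along the direction toward the boundary;
    # any perturbation that flips the label yields an upper bound.
    f = np.sign(w @ x + b)
    direction = -f * w / np.linalg.norm(w)
    lo, hi = 0.0, 10.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if np.sign(w @ (x + mid * direction) + b) != f:
            hi = mid   # label flipped: mid is a valid upper bound
        else:
            lo = mid
    return hi

w, b = np.array([3.0, 4.0]), -1.0
x = np.array([1.0, 1.0])
exact = true_adv_distance(w, b, x)        # |3 + 4 - 1| / 5 = 1.2
estimate = upper_bound_by_search(w, b, x)
```

The iterative estimate converges onto the exact distance from above, which is the defining property of an attack-based upper bound.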

[CV-12] Text-Guided Mixup Towards Long-Tailed Image Categorization BMVC’24

Link: https://arxiv.org/abs/2409.03583
Authors: Richard Franklin,Jiawei Yao,Deyang Zhong,Qi Qian,Juhua Hu
Keywords-EN: require heavy amounts, training deep neural, challenges traditional approaches, deep neural networks, class label distribution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC’24, code is available at this https URL

Abstract:In many real-world applications, the frequency distribution of class labels for training data can exhibit a long-tailed distribution, which challenges traditional approaches of training deep neural networks that require heavy amounts of balanced data. Gathering and labeling data to balance out the class label distribution can be both costly and time-consuming. Many existing solutions that enable ensemble learning, re-balancing strategies, or fine-tuning applied to deep neural networks are limited by the inert problem of few class samples across a subset of classes. Recently, vision-language models like CLIP have been observed as effective solutions to zero-shot or few-shot learning by grasping a similarity between vision and language features for image and text pairs. Considering that large pre-trained vision-language models may contain valuable side textual information for minor classes, we propose to leverage text supervision to tackle the challenge of long-tailed learning. Concretely, we propose a novel text-guided mixup technique that takes advantage of the semantic relations between classes recognized by the pre-trained text encoder to help alleviate the long-tailed problem. Our empirical study on benchmark long-tailed tasks demonstrates the effectiveness of our proposal with a theoretical guarantee. Our code is available at this https URL.
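
One way to picture the text-guided pairing; the embeddings and the softmax weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Sketch: when mixing a sample of class c, draw the mixup partner
# class with probability proportional to the similarity of the
# classes' text embeddings, so semantically close classes (which the
# pre-trained text encoder recognizes) get mixed together.

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def partner_probs(class_idx, text_emb):
    sims = np.array([cosine(text_emb[class_idx], e) for e in text_emb])
    sims[class_idx] = -np.inf            # never mix a class with itself
    weights = np.exp(sims - sims.max())  # softmax over similarities
    weights[class_idx] = 0.0
    return weights / weights.sum()

# Hypothetical embeddings: "tabby cat" sits near "cat", far from "truck".
text_emb = np.array([
    [1.0, 0.0],   # class 0: cat
    [0.9, 0.1],   # class 1: tabby cat
    [0.0, 1.0],   # class 2: truck
])
p = partner_probs(0, text_emb)   # p[1] > p[2]: cats get mixed together
```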

[CV-13] MaskVal: Simple but Effective Uncertainty Quantification for 6D Pose Estimation

Link: https://arxiv.org/abs/2409.03556
Authors: Philipp Quentin,Daniel Goehring
Keywords-EN: predictable operational performance, utmost importance, importance to ensure, predictable operational, pose
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:For the use of 6D pose estimation in robotic applications, reliable poses are of utmost importance to ensure a safe, reliable and predictable operational performance. Despite these requirements, state-of-the-art 6D pose estimators often do not provide any uncertainty quantification for their pose estimates at all, or if they do, it has been shown that the uncertainty provided is only weakly correlated with the actual true error. To address this issue, we investigate a simple but effective uncertainty quantification, that we call MaskVal, which compares the pose estimates with their corresponding instance segmentations by rendering and does not require any modification of the pose estimator itself. Despite its simplicity, MaskVal significantly outperforms a state-of-the-art ensemble method on both a dataset and a robotic setup. We show that by using MaskVal, the performance of a state-of-the-art 6D pose estimator is significantly improved towards a safe and reliable operation. In addition, we propose a new and specific approach to compare and evaluate uncertainty quantification methods for 6D pose estimation in the context of robotic manipulation.
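
Our reading of the core MaskVal check, with rendering faked by hand-made binary masks (the real method renders the object model under the estimated 6D pose):

```python
import numpy as np

# Score a pose estimate by the IoU between the mask rendered from that
# pose and the instance segmentation: agreement suggests a reliable
# pose, disagreement flags it as uncertain. Masks here are toy stand-ins.

def mask_iou(rendered, segmented):
    inter = np.logical_and(rendered, segmented).sum()
    union = np.logical_or(rendered, segmented).sum()
    return inter / union if union else 0.0

seg = np.zeros((8, 8), dtype=bool)
seg[2:6, 2:6] = True                    # instance segmentation: 4x4 blob

good_pose = np.zeros((8, 8), dtype=bool)
good_pose[2:6, 2:6] = True              # rendering agrees with the mask

bad_pose = np.zeros((8, 8), dtype=bool)
bad_pose[4:8, 4:8] = True               # rendering is shifted

score_good = mask_iou(good_pose, seg)   # 1.0 -> trust this pose
score_bad = mask_iou(bad_pose, seg)     # low -> flag as uncertain
```

The appeal of this style of check is that it needs no change to the pose estimator itself, matching the abstract's claim.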

[CV-14] Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection

Link: https://arxiv.org/abs/2409.03555
Authors: Ali Aghababaei-Harandi,Massih-Reza Amini
Keywords-EN: complex neural networks, significant computational resources, neural networks demand, networks demand significant, demand significant computational
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite their high accuracy, complex neural networks demand significant computational resources, posing challenges for deployment on resource-constrained devices such as mobile phones and embedded systems. Compression algorithms have been developed to address these challenges by reducing model size and computational demands while maintaining accuracy. Among these approaches, factorization methods based on tensor decomposition are theoretically sound and effective. However, they face difficulties in selecting the appropriate rank for decomposition. This paper tackles this issue by presenting a unified framework that simultaneously applies decomposition and optimal rank selection, employing a composite compression loss within defined rank constraints. Our approach includes an automatic rank search in a continuous space, efficiently identifying optimal rank configurations without the use of training data, making it computationally efficient. Combined with a subsequent fine-tuning step, our approach maintains the performance of highly compressed models on par with their original counterparts. Using various benchmark datasets, we demonstrate the efficacy of our method through a comprehensive analysis.
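
The rank-selection problem itself can be shown with a plain SVD sweep; the paper searches ranks in a continuous space during training, so this discrete, data-free sweep only illustrates the size/accuracy trade-off being optimized:

```python
import numpy as np

# Pick the smallest SVD rank whose reconstruction keeps the relative
# error under a budget. The layer below is contrived so the answer is
# obvious: three directions carry almost all the energy.

def choose_rank(W, max_rel_err=0.05):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    total = np.linalg.norm(W)
    for r in range(1, len(S) + 1):
        approx = (U[:, :r] * S[:r]) @ Vt[:r]   # rank-r reconstruction
        if np.linalg.norm(W - approx) / total <= max_rel_err:
            return r, approx
    return len(S), W

W = np.diag([10.0, 5.0, 2.0, 0.01, 0.005, 0.001])
rank, approx = choose_rank(W)   # rank == 3: the near-noise tail is cut
```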

[CV-15] Organized Grouped Discrete Representation for Object-Centric Learning

Link: https://arxiv.org/abs/2409.03553
Authors: Rongzhen Zhao,Vivienne Wang,Juho Kannala,Joni Pajarinen
Keywords-EN: represents dense image, represents dense, Variational Autoencoder, Grouped Discrete Representation, dense image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object-Centric Learning (OCL) represents dense image or video pixels as sparse object features. Representative methods utilize discrete representation composed of Variational Autoencoder (VAE) template features to suppress pixel-level information redundancy and guide object-level feature aggregation. The most recent advancement, Grouped Discrete Representation (GDR), further decomposes these template features into attributes. However, its naive channel grouping as decomposition may erroneously group channels belonging to different attributes together and discretize them as sub-optimal template attributes, which loses information and harms expressivity. We propose Organized GDR (OGDR) to organize channels belonging to the same attributes together for correct decomposition from features into attributes. In unsupervised segmentation experiments, OGDR is fully superior to GDR in augmenting classical transformer-based OCL methods; it even improves state-of-the-art diffusion-based ones. Codebook PCA and representation similarity analyses show that compared with GDR, our OGDR eliminates redundancy and preserves information better for guiding object representation learning. The source code is available in the supplementary material.

[CV-16] DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

Link: https://arxiv.org/abs/2409.03550
Authors: Qianlong Xiang,Miao Zhang,Yuzhang Shang,Jianlong Wu,Yan Yan,Liqiang Nie
Keywords-EN: high computational demands, demonstrated exceptional generative, exceptional generative capabilities, slow inference speeds, Diffusion models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, while they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either inaccessible or too enormous to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing it from slowing down the optimization process as the generation is slow. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as “datasets” for training new DMs.
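
The data-free distillation loop can be caricatured with a linear "teacher"; every function below is a stand-in of our own, not DKDM code:

```python
import numpy as np

# The teacher synthesizes its own denoising training pairs, and the
# student is fit to the teacher's outputs -- no original dataset is
# ever touched, which is the defining property of the data-free setup.

rng = np.random.default_rng(0)

def teacher_denoise(x_noisy):
    # Stand-in "large pretrained model": a fixed linear map.
    return 0.5 * x_noisy

def fit_student(num_samples=2000, dim=4):
    x_noisy = rng.normal(size=(num_samples, dim))  # teacher-made inputs
    target = teacher_denoise(x_noisy)              # teacher-made labels
    # "Student" is a single scalar; least squares against the teacher.
    w = np.sum(x_noisy * target) / np.sum(x_noisy * x_noisy)
    return w

w_student = fit_student()   # recovers the teacher's behavior
```

A real student would be a smaller denoising network trained with gradient steps, but the flow of information (teacher synthesizes, student imitates) is the same.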

[CV-17] Prediction Accuracy &amp; Reliability: Classification and Object Localization under Distribution Shift

Link: https://arxiv.org/abs/2409.03543
Authors: Fabian Diet,Moussa Kassem Sbeyti,Michelle Karg
Keywords-EN: Natural distribution shift, convolutional neural networks, distribution shift, Natural distribution, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This preprint has not undergone any post-submission improvements or corrections

Abstract:Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows quantifying the impact of different types of shifts on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification more strongly than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
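
The MC-Dropout-style confidence scoring such benchmarks evaluate can be sketched in its usual generic form (this is the standard recipe, not the paper's code): average the softmax outputs of T stochastic forward passes and take the entropy of the mean.

```python
import numpy as np

# Predictive entropy from multiple stochastic passes: agreement across
# passes yields low entropy (confident); disagreement yields high
# entropy (uncertain). Logits below are hand-made stand-ins for the
# outputs of a network with dropout active at test time.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predictive_entropy(logit_samples):
    mean_p = np.mean([softmax(z) for z in logit_samples], axis=0)
    return -np.sum(mean_p * np.log(mean_p + 1e-12))

# Confident case: all passes agree on class 0.
agree = [np.array([5.0, 0.0, 0.0]) for _ in range(10)]
# Uncertain case: passes split between class 0 and class 1.
disagree = [np.array([5.0, 0.0, 0.0]) if t % 2 else np.array([0.0, 5.0, 0.0])
            for t in range(10)]
h_agree = predictive_entropy(agree)        # low entropy
h_disagree = predictive_entropy(disagree)  # close to ln 2
```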

[CV-18] Use of triplet loss for facial restoration in low-resolution images

Link: https://arxiv.org/abs/2409.03530
Authors: Sebastian Pulgar,Domingo Mery
Keywords-EN: achieving impressive results, recent years, biometric tool, achieving impressive, numerous datasets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 10 pages, 8 figures

Abstract:In recent years, facial recognition (FR) models have become the most widely used biometric tool, achieving impressive results on numerous datasets. However, inherent hardware challenges or shooting distances often result in low-resolution images, which significantly impact the performance of FR models. To address this issue, several solutions have been proposed, including super-resolution (SR) models that generate highly realistic faces. Despite these efforts, significant improvements in FR algorithms have not been achieved. We propose a novel SR model FTLGAN, which focuses on generating high-resolution images that preserve individual identities rather than merely improving image quality, thereby maximizing the performance of FR models. The results are compelling, demonstrating a mean value of d’ 21% above the best current state-of-the-art models, specifically having a value of d’ = 1.099 and AUC = 0.78 for 14x14 pixels, d’ = 2.112 and AUC = 0.92 for 28x28 pixels, and d’ = 3.049 and AUC = 0.98 for 56x56 pixels. The contributions of this study are significant in several key areas. Firstly, a notable improvement in facial recognition performance has been achieved in low-resolution images, specifically at resolutions of 14x14, 28x28, and 56x56 pixels. Secondly, the enhancements demonstrated by FTLGAN show a consistent response across all resolutions, delivering outstanding performance uniformly, unlike other comparative models. Thirdly, an innovative approach has been implemented using triplet loss logic, enabling the training of the super-resolution model solely with real images, contrasting with current models, and expanding potential real-world applications. Lastly, this study introduces a novel model that specifically addresses the challenge of improving classification performance in facial recognition systems by integrating facial recognition quality as a loss during model training.
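
The triplet loss the abstract builds on, in its standard form (how FTLGAN wires it into super-resolution training is not shown here): pull an anchor embedding toward a same-identity positive and push it away from a different-identity negative by at least a margin.

```python
import numpy as np

# Standard triplet loss on embeddings. The vectors are toy stand-ins
# for face embeddings; in the identity-preserving SR setting, the
# anchor could come from a low-resolution face and the positive from
# the same identity at high resolution.

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([1.0, 0.0])   # e.g. embedding of a low-res face
positive = np.array([0.9, 0.1])   # same identity, high-res
negative = np.array([0.0, 1.0])   # different identity

loss_easy = triplet_loss(anchor, positive, negative)       # satisfied
loss_hard = triplet_loss(anchor, positive, anchor + 0.05)  # violated
```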

[CV-19] FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2409.03525
Authors: Xi Chen,Haosen Yang,Sheng Jin,Xiatian Zhu,Hongxun Yao
Keywords-EN: Open-vocabulary segmentation poses, poses significant challenges, segmentation poses significant, Open-vocabulary segmentation, unconstrained environments
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures

Abstract:Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still encounter the critical issue of generating precise mask proposals for unseen categories and scenarios, resulting in inferior segmentation performance eventually. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP), in a synergistic framework. Taking the ViL model’s visual encoder as the feature backbone, we inject the space-aware feature into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy for further improving the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models, focusing optimization efforts solely on a lightweight transformer decoder for mask proposal generation, the performance bottleneck. Extensive experiments demonstrate that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner. Code is available at this https URL.

[CV-20] Have Large Vision-Language Models Mastered Art History?

Link: https://arxiv.org/abs/2409.03521
Authors: Ombretta Strafforello,Derya Soydaner,Michiel Willems,Anne-Sofie Maerten,Stefanie De Winter
Keywords-EN: large Vision-Language Models, Vision-Language Models, recently established, established new baselines, multiple domains
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The emergence of large Vision-Language Models (VLMs) has recently established new baselines in image classification across multiple domains. However, the performance of VLMs in the specific task of artwork classification, particularly art style classification of paintings - a domain traditionally mastered by art historians - has not been explored yet. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively predict the art historical attributes of paintings. We conduct an in-depth analysis of four VLMs, namely CLIP, LLaVA, OpenFlamingo, and GPT-4o, focusing on zero-shot classification of art style, author and time period using two public benchmarks of artworks. Additionally, we present ArTest, a well-curated test set of artworks, including pivotal paintings studied by art historians.

[CV-21] LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

Link: https://arxiv.org/abs/2409.03516
Authors: Jeongsoo Kim,Jongho Nang,Junsuk Choe
Keywords-EN: Recent Vision Transformer, Recent Vision, Vision Transformer, demonstrated impressive performance, demonstrated impressive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our codes are available at this https URL.

[CV-22] Blended Latent Diffusion under Attention Control for Real-World Video Editing

Link: https://arxiv.org/abs/2409.03514
Authors: Deyin Liu,Lin Yuanbo Wu,Xianghua Xie
Keywords-EN: face grand challenges, editing methods tend, current video editing, Due to lack, build on pre-trained
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models, however, they still face grand challenges in dealing with the local editing of video with temporal information. First, although existing methods attempt to focus on local area editing by a pre-defined mask, the preservation of the outside-area background is non-ideal due to the spatially entire generation of each frame. In addition, requiring the user to provide a mask is an additional costly undertaking, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, an image-level pretrained model hasn’t learned temporal information across frames of a video which is vital for expressing the motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones to better preserve the background information of the input video. We further introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps. Finally, we enhance the temporal consistency across video frames by transforming the self-attention blocks of U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness in different real-world video editing tasks.

[CV-23] Domain-Guided Weight Modulation for Semi-Supervised Domain Generalization WACV25

Link: https://arxiv.org/abs/2409.03509
Authors: Chamuditha Jayanaga Galappaththige,Zachary Izzo,Xilin He,Honglu Zhou,Muhammad Haris Khan
Keywords-EN: low developmental costs, great practical significance, practical significance due, unseen domain data, deep learning models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV25

Abstract:Unarguably, deep learning models capable of generalizing to unseen domain data while leveraging a few labels are of great practical significance due to low developmental costs. In search of this endeavor, we study the challenging problem of semi-supervised domain generalization (SSDG), where the goal is to learn a domain-generalizable model while using only a small fraction of labeled data and a relatively large fraction of unlabeled data. Domain generalization (DG) methods show subpar performance under the SSDG setting, whereas semi-supervised learning (SSL) methods demonstrate relatively better performance, however, they are considerably poor compared to the fully-supervised DG methods. Towards handling this new, but challenging problem of SSDG, we propose a novel method that can facilitate the generation of accurate pseudo-labels under various domain shifts. This is accomplished by retaining the domain-level specialism in the classifier during training corresponding to each source domain. Specifically, we first create domain-level information vectors on the fly which are then utilized to learn a domain-aware mask for modulating the classifier’s weights. We provide a mathematical interpretation for the effect of this modulation procedure on both pseudo-labeling and model training. Our method is plug-and-play and can be readily applied to different SSL baselines for SSDG. Extensive experiments on six challenging datasets in two different SSDG settings show that our method provides visible gains over the various strong SSL-based SSDG baselines.
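
A one-line reading of the modulation step; the shapes and the sigmoid mask below are our assumptions, not the paper's exact design:

```python
import numpy as np

# Keep one shared classifier weight matrix W, derive a per-domain mask
# from a domain information vector, and classify with the masked
# weights (W * mask). The mapping M from domain vector to feature mask
# is learned in the paper; here it is a fixed random stand-in.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def modulated_logits(x, W, domain_vec, M):
    mask = sigmoid(domain_vec @ M)   # (features,) values in (0, 1)
    return (W * mask) @ x            # domain-aware classifier

rng = np.random.default_rng(0)
num_classes, num_features, dom_dim = 4, 8, 3
W = rng.normal(size=(num_classes, num_features))
M = rng.normal(size=(dom_dim, num_features))

x = rng.normal(size=num_features)
logits_d0 = modulated_logits(x, W, np.array([1.0, 0.0, 0.0]), M)
logits_d1 = modulated_logits(x, W, np.array([0.0, 1.0, 0.0]), M)
# Same sample, same W: the domain vector alone changes the prediction,
# which is how domain-level specialism is retained in one classifier.
```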

[CV-24] Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis

Link: https://arxiv.org/abs/2409.03501
Authors: Rizhao Cai,Cecelia Soh,Zitong Yu,Haoliang Li,Wenhan Yang,Alex Kot
Keywords-EN: Face Anti-Spoofing, FAS, data, cross-domain, domain gap
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Journal of Computer Vision (IJCV) in Sept 2024

Abstract:Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at this https URL.
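
One recapture artifact of the kind FAS-Aug synthesizes, in toy form; the paper's physics-based pipeline is far richer than a per-channel gain, and the gain values here are ours:

```python
import numpy as np

# A per-channel color gain, as a printer or display might introduce
# when a face image is recaptured: boost red, attenuate blue.

def color_distort(img, gains=(1.08, 1.0, 0.92)):
    out = img.astype(np.float64) * np.array(gains)  # scale each channel
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

img = np.full((2, 2, 3), 100, dtype=np.uint8)  # flat gray patch
aug = color_distort(img)                       # R boosted, B reduced
```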

[CV-25] ScreenMark: Watermarking Arbitrary Visual Content on Screen

Link: https://arxiv.org/abs/2409.03487
Authors: Xiujian Liang,Gaozhi Liu,Yichao Si,Xiaoxiao Hu,Zhenxing Qian,Xinpeng Zhang
Keywords-EN: protecting multimedia content, Digital watermarking, protecting multimedia, Screen Content, Digital
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Digital watermarking has demonstrated its effectiveness in protecting multimedia content. However, existing watermarking methods are predominantly tailored for specific media types, rendering them less effective for the protection of content displayed on computer screens, which is often multimodal and dynamic. Visual Screen Content (VSC) is particularly susceptible to theft and leakage via screenshots, a vulnerability that current watermarking methods fail to adequately address. To tackle these challenges, we propose ScreenMark, a robust and practical watermarking method designed specifically for arbitrary VSC protection. ScreenMark utilizes a three-stage progressive watermarking framework. Initially, inspired by diffusion principles, we initialize the mutual transformation between regular watermark information and irregular watermark patterns. Subsequently, these patterns are integrated with screen content using a pre-multiplication alpha blending technique, supported by a pre-trained screen decoder for accurate watermark retrieval. The progressively complex distorter enhances the robustness of the watermark in real-world screenshot scenarios. Finally, the model undergoes fine-tuning guided by a joint-level distorter to ensure optimal performance. To validate the effectiveness of ScreenMark, we compiled a dataset comprising 100,000 screenshots from various devices and resolutions. Extensive experiments across different datasets confirm the method’s superior robustness, imperceptibility, and practical applicability.
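
The pre-multiplication alpha blending mentioned in the abstract, in its textbook form; the alpha value and patterns below are ours, not ScreenMark's:

```python
import numpy as np

# Composite a faint watermark pattern over screen content with a small
# alpha: the pattern is pre-multiplied by alpha before blending, so it
# stays imperceptible while remaining recoverable by a decoder that
# knows what residual to look for.

def premultiplied_blend(screen, pattern, alpha=0.03):
    premul = pattern * alpha            # pre-multiply pattern by alpha
    return premul + screen * (1.0 - alpha)

screen = np.full((4, 4), 200.0)         # flat gray screen content
pattern = np.zeros((4, 4))
pattern[::2] = 255.0                    # stripes encoding watermark bits

marked = premultiplied_blend(screen, pattern)
# With alpha = 0.03 every pixel moves by fewer than 8 gray levels.
```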

[CV-26] Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation

Link: https://arxiv.org/abs/2409.03470
Authors: Prerak Mody,Nicolas F. Chaves-de-Plaza,Chinmay Rao,Eleftheria Astrenidou,Mischa de Ridder,Nienke Hoekstra,Klaus Hildebrandt,Marius Staring
Keywords-EN: medical image segmentation, Increased usage, learning in medical, medical image, image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the “utility” of Bayesian uncertainty maps such that uncertainty is only present in inaccurate regions and not in the accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that, when compared to the Bayesian baseline, the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
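
The Accuracy-vs-Uncertainty idea can be illustrated with the AvU measure the loss is built on: a voxel counts as desirable when it is accurate-and-certain or inaccurate-and-uncertain. A minimal sketch assuming a hard threshold on predictive entropy (the trainable AvU loss uses soft, differentiable counts instead):

```python
# AvU = (nAC + nIU) / (nAC + nAU + nIC + nIU), where A/I = accurate/inaccurate
# and C/U = certain/uncertain.  AvU = 1 means uncertainty appears exactly on
# the inaccurate voxels.  Threshold-based illustration only.

def avu(accurate, uncertainty, threshold=0.5):
    """accurate: list of bools; uncertainty: floats (e.g. predictive entropy)."""
    n_ac = n_au = n_ic = n_iu = 0
    for a, u in zip(accurate, uncertainty):
        certain = u < threshold
        if a and certain:
            n_ac += 1          # accurate and certain: good
        elif a:
            n_au += 1          # accurate but uncertain: wasted QA time
        elif certain:
            n_ic += 1          # inaccurate but certain: missed error
        else:
            n_iu += 1          # inaccurate and uncertain: error flagged
    return (n_ac + n_iu) / max(1, n_ac + n_au + n_ic + n_iu)

# Ideal case: uncertainty present only on the inaccurate voxel.
print(avu([True, True, False], [0.1, 0.2, 0.9]))  # -> 1.0
```

Maximizing this quantity during training is what pushes uncertainty maps to act as usable error-detection heatmaps rather than diffuse noise.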

[CV-27] LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones WACV2025

链接: https://arxiv.org/abs/2409.03460
作者: Moritz Nottebaum,Matteo Dunnhofer,Christian Micheloni
关键词-EN: transformer blocks, mixture of convolutions, convolutions and transformer, Research, efficient vision backbones
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at WACV 2025. Features 11 pages in total

点击查看摘要

Abstract:Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter, however, often does not accurately measure how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally, we introduce a simple slimmed-down version of MultiHead Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at this https URL altair199797/LowFormer.
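
The core argument that MACs mispredict real speed implies measuring throughput and latency directly. A toy timing harness under the assumption that `fn` stands in for one batched forward pass (real model benchmarking would additionally need warmed-up devices and GPU synchronization):

```python
import time

def measure(fn, batch_size=8, warmup=3, iters=10):
    """Return (latency in ms per batch, throughput in items per second).

    A stand-in for latency/throughput benchmarking: `fn` is any callable
    that processes one batch; warmup runs are excluded from the timing.
    """
    for _ in range(warmup):              # warm caches before timing
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    throughput = batch_size * iters / elapsed
    return latency_ms, throughput

lat, thr = measure(lambda: sum(i * i for i in range(10_000)))
print(f"{lat:.3f} ms/batch, {thr:.1f} items/s")
```

Two workloads with identical MAC counts can differ substantially under this measurement, which is exactly the gap (memory access cost, parallelism) the paper exploits in its design recipe.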

[CV-28] Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.03458
作者: Akshay Jain,Shiv Ram Dubey,Satish Kumar Singh,KC Santosh,Bidyut Baran Chaudhuri
关键词-EN: Convolutional Neural Networks, Convolutional Neural, Neural Networks, made remarkable strides, NUI
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have made remarkable strides; however, they remain susceptible to vulnerabilities, particularly in the face of minor image perturbations that humans can easily recognize. This weakness, often termed ‘attacks’, underscores the limited robustness of CNNs and the need for research into fortifying their resistance against such manipulations. This study introduces a novel Non-Uniform Illumination (NUI) attack technique, where images are subtly altered using varying NUI masks. Extensive experiments are conducted on widely-accepted datasets including CIFAR10, TinyImageNet, and CalTech256, focusing on image classification with 12 different NUI attack models. The resilience of VGG, ResNet, MobilenetV3-small and InceptionV3 models against NUI attacks is evaluated. Our results show a substantial decline in the CNN models’ classification accuracy when subjected to NUI attacks, indicating their vulnerability under non-uniform illumination. To mitigate this, a defense strategy is proposed that includes NUI-attacked images, generated through the new NUI transformation, in the training set. The results demonstrate a significant enhancement in CNN model performance when confronted with perturbed images affected by NUI attacks. This strategy seeks to bolster CNN models’ resilience against NUI attacks.
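
A non-uniform illumination perturbation can be as simple as a smooth multiplicative brightness ramp across the image. A hypothetical single-mask sketch; the paper evaluates 12 different mask designs, which this does not reproduce:

```python
# One plausible NUI mask: a left-to-right linear gain ramp applied
# multiplicatively to pixel intensities, clipped to the valid range.
# Illustrative only; not the authors' exact mask set.

def nui_attack(image, low=0.6, high=1.4):
    """image: 2D list of grayscale values in [0, 255]."""
    w = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x, px in enumerate(row):
            gain = low + (high - low) * x / max(1, w - 1)  # ramp across columns
            new_row.append(min(255, max(0, round(px * gain))))
        out.append(new_row)
    return out

img = [[100, 100, 100], [100, 100, 100]]
print(nui_attack(img))  # left side darkened, right side brightened
```

Because every pixel keeps its local structure while global brightness varies slowly, such perturbations remain easy for humans to read but shift the input statistics CNNs were trained on.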

[CV-29] LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors

链接: https://arxiv.org/abs/2409.03456
作者: Hanyang Yu,Xiaoxiao Long,Ping Tan
关键词-EN: large-scale vision models, vision models, address sparse-view reconstruction, aim to address, large-scale vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We aim to address sparse-view reconstruction of a 3D scene by leveraging priors from large-scale vision models. While recent advancements such as 3D Gaussian Splatting (3DGS) have demonstrated remarkable successes in 3D reconstruction, these methods typically necessitate hundreds of input images that densely capture the underlying scene, making them time-consuming and impractical for real-world applications. However, sparse-view reconstruction is inherently ill-posed and under-constrained, often resulting in inferior and incomplete outcomes. This is due to issues such as failed initialization, overfitting on input images, and a lack of details. To mitigate these challenges, we introduce LM-Gaussian, a method capable of generating high-quality reconstructions from a limited number of images. Specifically, we propose a robust initialization module that leverages stereo priors to aid in the recovery of camera poses and reliable point clouds. Additionally, a diffusion-based refinement is iteratively applied to incorporate image diffusion priors into the Gaussian optimization process to preserve intricate scene details. Finally, we utilize video diffusion priors to further enhance the rendered images for realistic visual effects. Overall, our approach significantly reduces the data acquisition requirements compared to previous 3DGS methods. We validate the effectiveness of our framework through experiments on various public datasets, demonstrating its potential for high-quality 360-degree scene reconstruction. Visual results are on our website.

[CV-30] Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration

链接: https://arxiv.org/abs/2409.03455
作者: Pei Wang,Xiaotong Luo,Yuan Xie,Yanyun Qu
关键词-EN: witnessed incredible progress, expensive data acquisition, data acquisition impair, increasing model capacity, Multi-weather image restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-weather image restoration has witnessed incredible progress, while the increasing model capacity and expensive data acquisition impair its applications in memory-limited devices. Data-free distillation provides an alternative that allows learning a lightweight student model from a pre-trained teacher model without relying on the original training data. The existing data-free learning methods mainly optimize the models with the pseudo data generated by GANs or the real data collected from the Internet. However, they inevitably suffer from the problems of unstable training or domain shifts with the original data. In this paper, we propose a novel Data-free Distillation with Degradation-prompt Diffusion framework for multi-weather Image Restoration (D4IR). It replaces GANs with pre-trained diffusion models to avoid model collapse and incorporates a degradation-aware prompt adapter to facilitate content-driven conditional diffusion for generating domain-related images. Specifically, a contrast-based degradation prompt adapter is first designed to capture degradation-aware prompts from web-collected degraded images. Then, the collected unpaired clean images are perturbed to latent features of stable diffusion, and conditioned with the degradation-aware prompts to synthesize new domain-related degraded images for knowledge distillation. Experiments illustrate that our proposal achieves comparable performance to the model distilled with original training data, and is even superior to other mainstream unsupervised methods.

[CV-31] Automatic occlusion removal from 3D maps for maritime situational awareness

链接: https://arxiv.org/abs/2409.03451
作者: Felix Sattler,Borja Carrillo Perez,Maurice Stephan,Sarah Barnes
关键词-EN: specifically targeting occlusion, targeting occlusion removal, large-scale maritime environments, occlusion removal, removal in large-scale
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint of SPIE Sensor + Imaging 2024 conference paper

点击查看摘要

Abstract:We introduce a novel method for updating 3D geospatial models, specifically targeting occlusion removal in large-scale maritime environments. Traditional 3D reconstruction techniques often face problems with dynamic objects, like cars or vessels, that obscure the true environment, leading to inaccurate models or requiring extensive manual editing. Our approach leverages deep learning techniques, including instance segmentation and generative inpainting, to directly modify both the texture and geometry of 3D meshes without the need for costly reprocessing. By selectively targeting occluding objects and preserving static elements, the method enhances both geometric and visual accuracy. This approach not only preserves structural and textural details of map data but also maintains compatibility with current geospatial standards, ensuring robust performance across diverse datasets. The results demonstrate significant improvements in 3D model fidelity, making this method highly applicable for maritime situational awareness and the dynamic display of auxiliary information.

[CV-32] Shuffle Vision Transformer: Lightweight Fast and Efficient Recognition of Driver Facial Expression

链接: https://arxiv.org/abs/2409.03438
作者: Ibtissam Saadi,Douglas W. Cunningham,Taleb-ahmed Abdelmalik,Abdenour Hadid,Yassin El Hillali
关键词-EN: facial expression recognition, computationally intensive, rendering them unsuitable, Existing methods, expression recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication in The 6th IEEE International Conference on Artificial Intelligence Circuits and Systems (IEEE AICAS 2024), 5 pages, 3 figures

点击查看摘要

Abstract:Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications. In this work, we introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which elegantly combines computational efficiency and accuracy. This is achieved by harnessing the strengths of two lightweight and efficient models using convolutional neural network (CNN) and vision transformers (ViT). We efficiently fuse the extracted features to enhance the performance of the model in accurately recognizing the facial expressions of the driver. Our experimental results on two public benchmark datasets, KMU-FED and KDEF, highlight the validity of our proposed method for real-time application, with superior performance compared to state-of-the-art methods.

[CV-33] A Key-Driven Framework for Identity-Preserving Face Anonymization NDSS2025

链接: https://arxiv.org/abs/2409.03434
作者: Miaomiao Wang,Guang Hua,Sheng Li,Guorui Feng
关键词-EN: Virtual faces, Virtual, face, original face, original
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NDSS Symposium 2025. Please cite this paper as “Miaomiao Wang, Guang Hua, Sheng Li, and Guorui Feng. A Key-Driven Framework for Identity-Preserving Face Anonymization. In the 32nd Annual Network and Distributed System Security Symposium (NDSS 2025).”

点击查看摘要

Abstract:Virtual faces are crucial content in the metaverse. Recently, attempts have been made to generate virtual faces for privacy protection. Nevertheless, these virtual faces either permanently remove the identifiable information or map the original identity into a virtual one, which loses the original identity forever. In this study, we first attempt to address the conflict between privacy and identifiability in virtual faces, where a key-driven face anonymization and authentication recognition (KFAAR) framework is proposed. Concretely, the KFAAR framework consists of a head posture-preserving virtual face generation (HPVFG) module and a key-controllable virtual face authentication (KVFA) module. The HPVFG module uses a user key to project the latent vector of the original face into a virtual one. Then it maps the virtual vectors to obtain an extended encoding, based on which the virtual face is generated. By simultaneously adding a head posture and facial expression correction module, the virtual face has the same head posture and facial expression as the original face. During the authentication, we propose a KVFA module to directly recognize the virtual faces using the correct user key, which can obtain the original identity without exposing the original face image. We also propose a multi-task learning objective to train HPVFG and KVFA. Extensive experiments demonstrate the advantages of the proposed HPVFG and KVFA modules, which effectively achieve both facial anonymity and identifiability.

[CV-34] UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images

链接: https://arxiv.org/abs/2409.03431
作者: Lulin Li,Ben Chen,Xuechao Zou,Junliang Xing,Pin Tao
关键词-EN: diverse geographical environments, highly challenging task, urban village boundaries, remote sensing images, high-resolution remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space models (SSM) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture utilizes an encoder-decoder framework, including an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder to integrate the extracted semantic information. We conducted experiments on the Beijing and Xi’an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi’an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
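
The IoU figures quoted above are the standard intersection-over-union metric for segmentation masks. A minimal sketch over flat binary label lists (not the authors' evaluation code):

```python
# IoU = |prediction AND ground truth| / |prediction OR ground truth|,
# computed here over flattened 0/1 pixel labels.

def iou(pred, gt):
    """pred, gt: flat lists of 0/1 pixel labels of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0   # empty masks count as a match

pred = [1, 1, 0, 1, 0, 0]
gt   = [1, 0, 0, 1, 1, 0]
print(iou(pred, gt))  # 2 overlapping pixels of 4 in the union -> 0.5
```

Because IoU penalizes both false positives and false negatives, a 1-3% IoU gain on boundary detection is a meaningful improvement, not rounding noise.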

[CV-35] Weight Conditioning for Smooth Optimization of Neural Networks ECCV2024

链接: https://arxiv.org/abs/2409.03424
作者: Hemanth Saratchandran,Thomas X. Wang,Simon Lucey
关键词-EN: term weight conditioning, Neural Radiance Fields, Convolutional Neural Networks, neural network weight, neural network
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:In this article, we introduce a novel normalization technique for neural network weight matrices, which we term weight conditioning. This approach aims to narrow the gap between the smallest and largest singular values of the weight matrices, resulting in better-conditioned matrices. The inspiration for this technique partially derives from numerical linear algebra, where well-conditioned matrices are known to facilitate stronger convergence results for iterative solvers. We provide a theoretical foundation demonstrating that our normalization technique smoothens the loss landscape, thereby enhancing convergence of stochastic gradient descent algorithms. Empirically, we validate our normalization across various neural network architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViT), Neural Radiance Fields (NeRF), and 3D shape modeling. Our findings indicate that our normalization method is not only competitive but also outperforms existing weight normalization techniques from the literature.
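
The notion of a "better-conditioned" weight matrix can be made concrete via the condition number sigma_max / sigma_min. A small sketch for 2x2 matrices using the closed form for singular values; blending toward the identity below is only an illustrative way to narrow the singular-value gap, not the normalization the paper proposes:

```python
import math

# Condition number of a 2x2 matrix from the eigenvalues of W^T W:
# the singular values are the square roots of those eigenvalues.

def singular_values_2x2(w):
    (a, b), (c, d) = w
    t = a * a + b * b + c * c + d * d          # trace of W^T W
    det = (a * d - b * c) ** 2                 # determinant of W^T W
    disc = math.sqrt(max(0.0, t * t - 4.0 * det))
    return math.sqrt((t + disc) / 2.0), math.sqrt(max(0.0, (t - disc) / 2.0))

def condition_number(w):
    s_max, s_min = singular_values_2x2(w)
    return s_max / s_min

w = [[3.0, 0.0], [0.0, 1.0]]                   # singular values 3 and 1
identity = [[1.0, 0.0], [0.0, 1.0]]
blended = [[(x + y) / 2.0 for x, y in zip(rw, ri)]
           for rw, ri in zip(w, identity)]

print(condition_number(w))        # 3.0: poorly conditioned
print(condition_number(blended))  # 2.0: singular-value gap narrowed
```

Well-conditioned matrices are exactly the case in which iterative solvers (and, as the paper argues, SGD on the induced loss landscape) converge faster, which is why shrinking this ratio is the stated goal of weight conditioning.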

[CV-36] mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

链接: https://arxiv.org/abs/2409.03420
作者: Anwen Hu,Haiyang Xu,Liang Zhang,Jiabo Ye,Ming Yan,Ji Zhang,Qin Jin,Fei Huang,Jingren Zhou
关键词-EN: Multimodel Large Language, Large Language Models, Multimodel Large, Large Language, achieved promising OCR-free
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at this https URL.

[CV-37] G-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model

链接: https://arxiv.org/abs/2409.03412
作者: Yihao Zhao,Enhao Zhong,Cuiyun Yuan,Yang Li,Man Zhao,Chunxia Li,Jun Hu,Chenbin Liu
关键词-EN: Text-Guided Large Multi-Modal, Large Multi-Modal Model, Text-Guided Large, leverages textual descriptions, Large Multi-Modal
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:We propose TG-LMM (Text-Guided Large Multi-Modal Model), a novel approach that leverages textual descriptions of organs to enhance segmentation accuracy in medical images. Existing medical image segmentation methods face several challenges: current medical automatic segmentation models do not effectively utilize prior knowledge, such as descriptions of organ locations; previous text-visual models focus on identifying the target rather than improving the segmentation accuracy; prior models attempt to use prior knowledge to enhance accuracy but do not incorporate pre-trained models. To address these issues, TG-LMM integrates prior knowledge, specifically expert descriptions of the spatial locations of organs, into the segmentation process. Our model utilizes pre-trained image and text encoders to reduce the number of training parameters and accelerate the training process. Additionally, we designed a comprehensive image-text information fusion structure to ensure thorough integration of the two modalities of data. We evaluated TG-LMM on three authoritative medical image datasets, encompassing the segmentation of various parts of the human body. Our method demonstrated superior performance compared to existing approaches, such as MedSAM, SAM and nnUnet.

[CV-38] KAN See In the Dark

链接: https://arxiv.org/abs/2409.03404
作者: Aoxiang Ning,Minglong Xue,Jinhong He,Chengyun Song
关键词-EN: Existing low-light image, complex nonlinear relationship, low-light image enhancement, low-light images due, Existing low-light
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Existing low-light image enhancement methods are difficult to fit the complex nonlinear relationship between normal and low-light images due to uneven illumination and noise effects. The recently proposed Kolmogorov-Arnold networks (KANs) feature spline-based convolutional layers and learnable activation functions, which can effectively capture nonlinear dependencies. In this paper, we design a KAN-Block based on KANs and innovatively apply it to low-light image enhancement. This method effectively alleviates the limitations of current methods constrained by linear network structures and lack of interpretability, further demonstrating the potential of KANs in low-level vision tasks. Given the poor perception of current low-light image enhancement methods and the stochastic nature of the inverse diffusion process, we further introduce frequency-domain perception for visually oriented enhancement. Extensive experiments demonstrate the competitive performance of our method on benchmark datasets. The code will be available at: this https URL.

[CV-39] Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression

链接: https://arxiv.org/abs/2409.03385
作者: Jingcheng Ke,Dele Wang,Jun-Cheng Chen,I-Hong Jhuo,Chia-Wen Lin,Yen-Yu Lin
关键词-EN: referring expression comprehension, existing graph-based methods, expression comprehension, common belief, complex models
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 12 pages to appear in IEEE Transactions on Multimedia

点击查看摘要

Abstract:One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performances of various graph-based REC methods. Without any pretraining, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.

[CV-40] MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice ECCV

链接: https://arxiv.org/abs/2409.03358
作者: Friedhelm Hamann,Hanxiong Li,Paul Mieske,Lars Lewejohann,Guillermo Gallego
关键词-EN: made remarkable progress, Enabled by large, recent years, made remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 18 pages, 5 figures, ECCV Workshops

点击查看摘要

Abstract:Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the inputs are quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. this https URL

[CV-41] Few-Shot Continual Learning for Activity Recognition in Classroom Surveillance Images

链接: https://arxiv.org/abs/2409.03354
作者: Yilei Qian,Kanglei Geng,Kailong Chen,Shaoxu Cheng,Linfeng Xu,Hongliang Li,Fanman Meng,Qingbo Wu
关键词-EN: gaining increasing attention, activity recognition, image activity recognition, field is gaining, activity recognition called
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The application of activity recognition in the “AI + Education” field is gaining increasing attention. However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. In real classroom settings, normal teaching activities such as reading account for a large proportion of samples, while rare non-teaching activities such as eating continue to appear. This requires a model that can learn non-teaching activities from few samples without forgetting the normal teaching activities, which necessitates few-shot continual learning (FSCL) capability. To address this gap, we constructed a continual learning dataset focused on classroom surveillance image activity recognition called ARIC (Activity Recognition in Classroom). The dataset has advantages such as multiple perspectives, a wide variety of activities, and real-world scenarios, but it also presents challenges like similar activities and imbalanced sample distribution. To overcome these challenges, we designed a few-shot continual learning method that combines supervised contrastive learning (SCL) and an adaptive covariance classifier (ACC). During the base phase, we proposed a SCL approach based on feature augmentation to enhance the model’s generalization ability. In the incremental phase, we employed an ACC to more accurately describe the distribution of new classes. Experimental results demonstrate that our method outperforms other existing methods on the ARIC dataset.
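
The covariance-based classification in the incremental phase can be approximated in spirit by a nearest-class Mahalanobis classifier. A simplified diagonal-covariance sketch with made-up class statistics (the paper's ACC is adaptive and more elaborate):

```python
import math

# Each class is summarized by a (mean, per-dimension variance) pair in
# feature space; a sample is assigned to the class with the smallest
# Mahalanobis distance.  Diagonal covariance only, for illustration.

def mahalanobis_diag(x, mean, var):
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def classify(x, classes):
    """classes: dict of name -> (mean vector, variance vector)."""
    return min(classes, key=lambda c: mahalanobis_diag(x, *classes[c]))

# Hypothetical 2-D feature statistics for two activities.
classes = {
    "reading": ([0.0, 0.0], [1.0, 1.0]),
    "eating":  ([3.0, 3.0], [0.5, 0.5]),
}
print(classify([2.8, 3.1], classes))  # -> eating
```

Modeling each class by its own covariance lets a tightly clustered rare class (like "eating") claim nearby samples even when a broad majority class has a closer Euclidean mean, which is the kind of imbalance the ARIC dataset exhibits.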

[CV-42] Estimating Indoor Scene Depth Maps from Ultrasonic Echoes ICIP2024

链接: https://arxiv.org/abs/2409.03336
作者: Junpei Honma,Akisato Kimura,Go Irie
关键词-EN: indoor scenes requires, scenes requires dedicated, dedicated depth sensors, requires dedicated depth, depth estimation
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: ICIP 2024

点击查看摘要

Abstract:Measuring 3D geometric structures of indoor scenes requires dedicated depth sensors, which are not always available. Echo-based depth estimation has recently been studied as a promising alternative solution. All previous studies have assumed the use of echoes in the audible range. However, one major problem is that audible echoes cannot be used in quiet spaces or other situations where producing audible sounds is prohibited. In this paper, we consider echo-based depth estimation using inaudible ultrasonic echoes. While ultrasonic waves provide high measurement accuracy in theory, the actual depth estimation accuracy when ultrasonic echoes are used has remained unclear, due to their disadvantage of being sensitive to noise and susceptible to attenuation. We first investigate the depth estimation accuracy when the frequency of the sound source is restricted to the high-frequency band, and find that the accuracy decreases when the frequency is limited to ultrasonic ranges. Based on this observation, we propose a novel deep learning method to improve the accuracy of ultrasonic echo-based depth estimation by using audible echoes as auxiliary data only during training. Experimental results with a public dataset demonstrate that our method improves the estimation accuracy.

[CV-43] Enhancing User-Centric Privacy Protection: An Interactive Framework through Diffusion Models and Machine Unlearning

链接: https://arxiv.org/abs/2409.03326
作者: Huaxi Huang,Xin Yuan,Qiyu Liao,Dadong Wang,Tongliang Liu
关键词-EN: multimedia data analysis, privacy protection, privacy, realm of multimedia, escalated concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the realm of multimedia data analysis, the extensive use of image datasets has escalated concerns over privacy protection within such data. Current research predominantly focuses on privacy protection either in data sharing or upon the release of trained machine learning models. Our study pioneers a comprehensive privacy protection framework that safeguards image data privacy concurrently during data sharing and model publication. We propose an interactive image privacy protection framework that utilizes generative machine learning models to modify image information at the attribute level and employs machine unlearning algorithms for the privacy preservation of model parameters. This user-interactive framework allows for adjustments in privacy protection intensity based on user feedback on generated images, striking a balance between maximal privacy safeguarding and maintaining model performance. Within this framework, we instantiate two modules: a differential privacy diffusion model for protecting attribute information in images and a feature unlearning algorithm for efficient updates of the trained model on the revised image dataset. Our approach demonstrated superiority over existing methods on facial datasets across various attribute classifications.

[CV-44] YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving

链接: https://arxiv.org/abs/2409.03320
作者: Jingyu Zhang,Wenqing Zhang,Chaoyi Tan,Xiangtian Li,Qianyi Sun
关键词-EN: autonomous driving systems, traffic signs efficiently, detect traffic signs, traffic sign detection, proposed YOLO PPA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:It is very important to detect traffic signs efficiently and accurately in autonomous driving systems. However, the farther the distance, the smaller the traffic signs. Existing object detection algorithms can hardly detect these small-scale signs. In addition, the performance of embedded devices on vehicles limits the scale of detection models. To address these challenges, a YOLO-PPA based traffic sign detection algorithm is proposed in this paper. The experimental results on the GTSDB dataset show that compared to the original YOLO, the proposed method improves inference efficiency by 11.2%. The mAP 50 is also improved by 93.2%, which demonstrates the effectiveness of the proposed YOLO-PPA.

[CV-45] Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization

链接: https://arxiv.org/abs/2409.03303
作者: Nayeong Kim,Juwon Kang,Sungsoo Ahn,Jungseul Ok,Suha Kwak
关键词-EN: multiple biases, unbiased and accurate, accurate model, multiple, training
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: International Conference on Machine Learning 2024

点击查看摘要

Abstract:We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in the multi-objective optimization theory, encourages to achieve the minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
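
The dynamic adjustment of group-wise loss weights described above can be sketched in the spirit of a GroupDRO-style exponentiated-gradient update. This is an illustrative stand-in rather than the paper's exact scheme; the function names, learning rate `eta`, and toy loss values are assumptions:

```python
import math

def update_group_weights(weights, group_losses, eta=0.5):
    """GroupDRO-style update: up-weight groups with higher loss,
    then renormalize so the weights stay on the probability simplex."""
    scaled = [w * math.exp(eta * l) for w, l in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def weighted_loss(weights, group_losses):
    """Scalar training objective: linear combination of group-wise losses."""
    return sum(w * l for w, l in zip(weights, group_losses))

# Three bias groups with unequal losses; start from uniform weights.
w = [1 / 3, 1 / 3, 1 / 3]
losses = [0.2, 1.0, 0.5]
w = update_group_weights(w, losses)  # the worst group's weight grows
```

Repeating this update while minimizing `weighted_loss` pushes training toward the minimax (worst-group) solution that the abstract's Pareto argument targets.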

[CV-46] ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding

链接: https://arxiv.org/abs/2409.03277
作者: Zhengzhuo Xu,Bowen Qu,Yiyan Qi,Sinan Du,Chengjin Xu,Chun Yuan,Jian Guo
关键词-EN: Automatic chart understanding, Automatic chart, document parsing, chart understanding, crucial for content
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.

[CV-47] OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

链接: https://arxiv.org/abs/2409.03272
作者: Julong Wei,Shanshuai Yuan,Pengfei Li,Qingda Hu,Zhongxue Gan,Wenchao Ding
关键词-EN: spurred their applications, autonomous driving, large language models, multi-modal large language, applications in autonomous
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The rise of multi-modal large language models (MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering their sparsity and class imbalance. Then, we build a unified multi-modal vocabulary for vision, language, and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
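
The VQVAE-like scene tokenizer above rests on nearest-neighbor codebook quantization. A minimal sketch of that step follows; the toy 2-D codebook and the `quantize` helper are illustrative, not the paper's implementation:

```python
def quantize(vector, codebook):
    """Map a continuous feature vector to the index of the nearest
    codebook entry (squared Euclidean distance), as in VQ-VAE."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Toy codebook of three "scene tokens" in a 2-D feature space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
token_id = quantize([0.9, 0.1], codebook)  # nearest to [1.0, 0.0]
```

The resulting indices are what get interleaved with language and action tokens in the unified autoregressive vocabulary.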

[CV-48] SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model

链接: https://arxiv.org/abs/2409.03270
作者: Weipeng Tan,Chuming Lin,Chengming Xu,Xiaozhong Ji,Junwei Zhu,Chengjie Wang,Yanwei Fu
关键词-EN: Talking Head Generation, Talking Head, broad application prospects, Head Generation, film production
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods present high quality and stable content generation, they often overlook the intrinsic style, which encompasses personalized features such as speaking habits and facial expressions of a video. As a consequence, the generated video content lacks diversity and vividness, thus being limited in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG. Specifically, we first introduce novel probabilistic style prior learning to model the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. Then we finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.

[CV-49] Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision ECCV2024

链接: https://arxiv.org/abs/2409.03261
作者: Jinhee Kim,Taesung Kim,Jaegul Choo
关键词-EN: Recent advances, minimizing user intervention, vertebrae keypoint estimation, keypoint estimation, enhanced accuracy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 33 pages, ECCV 2024, Project Page: this https URL

点击查看摘要

Abstract:Recent advances in interactive keypoint estimation methods have enhanced accuracy while minimizing user intervention. However, these methods require user input for error correction, which can be costly in vertebrae keypoint estimation where inaccurate keypoints are densely clustered or overlap. We introduce a novel approach, KeyBot, specifically designed to identify and correct significant and typical errors in existing models, akin to user revision. By characterizing typical error types and using simulated errors for training, KeyBot effectively corrects these errors and significantly reduces user workload. Comprehensive quantitative and qualitative evaluations on three public datasets confirm that KeyBot significantly outperforms existing methods, achieving state-of-the-art performance in interactive vertebrae keypoint estimation. The source code and demo video are available at: this https URL

[CV-50] Granular-ball Representation Learning for Deep CNN on Learning with Label Noise

链接: https://arxiv.org/abs/2409.03254
作者: Dawei Dai,Hao Zhu,Shuyin Xia,Guoyin Wang
关键词-EN: deep CNN models, actual scenarios, automatically annotated, manually or automatically, noise is inevitably
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In actual scenarios, whether manually or automatically annotated, label noise is inevitably generated in the training data, which can affect the effectiveness of deep CNN models. The popular solutions require data cleaning or designing additional optimizations to penalize mislabeled data, thereby enhancing the robustness of models. However, these methods come at the cost of weakening or even losing some data during the training process. As we know, content is an inherent attribute of an image that does not change with changes in annotations. In this study, we propose a general granular-ball computing (GBC) module that can be embedded into a CNN model, where the classifier finally predicts the label of granular-ball ( gb ) samples instead of each individual sample. Specifically, considering the classification task: (1) in the forward process, we split the input samples into gb samples at the feature level, each of which corresponds to a varying number of samples sharing a single label; (2) during the backpropagation process, we modify the gradient allocation strategy of the GBC module to enable it to propagate normally; and (3) we develop an experience replay policy to ensure the stability of the training process. Experiments demonstrate that the proposed method can improve the robustness of CNN models with no additional data or optimization.

[CV-51] Gr-IoU: Ground-Intersection over Union for Robust Multi-Object Tracking with 3D Geometric Constraints ECCV2024

链接: https://arxiv.org/abs/2409.03252
作者: Keisuke Toida,Naoki Kato,Osamu Segawa,Takeshi Nakamura,Kazuhiro Hotta
关键词-EN: problem in multi-object, data association problem, multi-object tracking, association problem, tracking objects detected
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for the ECCV 2024 Workshop on Affective Behavior Analysis in-the-wild(ABAW)

点击查看摘要

Abstract:We propose a Ground IoU (Gr-IoU) to address the data association problem in multi-object tracking. When tracking objects detected by a camera, it often occurs that the same object is assigned different IDs in consecutive frames, especially when objects are close to each other or overlapping. To address this issue, we introduce Gr-IoU, which takes into account the 3D structure of the scene. Gr-IoU transforms traditional bounding boxes from the image space to the ground plane using the vanishing point geometry. The IoU calculated with these transformed bounding boxes is more sensitive to the front-to-back relationships of objects, thereby improving data association accuracy and reducing ID switches. We evaluated our Gr-IoU method on the MOT17 and MOT20 datasets, which contain diverse tracking scenarios including crowded scenes and sequences with frequent occlusions. Experimental results demonstrated that Gr-IoU outperforms conventional real-time methods without appearance features.
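
Once boxes have been projected onto the ground plane, Gr-IoU reduces to the standard rectangle IoU on the transformed boxes. A sketch of that final step follows, assuming axis-aligned ground-plane boxes; the vanishing-point projection itself is the paper's contribution and is not reproduced here:

```python
def iou(box_a, box_b):
    """Standard IoU for axis-aligned boxes given as (x1, y1, x2, y2).
    Gr-IoU applies this after mapping boxes to the ground plane,
    where overlap reflects front-to-back object relationships."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7: a 1x1 intersection over a union of 4 + 4 - 1.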

[CV-52] Multiple weather images restoration using the task transformer and adaptive mixup strategy

链接: https://arxiv.org/abs/2409.03249
作者: Yang Wen,Anyu Lai,Bo Qian,Hao Wang,Wuzhen Shi,Wenming Cao
关键词-EN: severe weather removal, removal predominantly focuses, weather, weather removal, weather removal predominantly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures and 2 table

点击查看摘要

Abstract:The current state-of-the-art in severe weather removal predominantly focuses on single-task applications, such as rain removal, haze removal, and snow removal. However, real-world weather conditions often consist of a mixture of several weather types, and the degree of weather mixing in autonomous driving scenarios remains unknown. In the presence of complex and diverse weather conditions, a single weather removal model often encounters challenges in producing clear images from severe weather images. Therefore, there is a need for the development of multi-task severe weather removal models that can effectively handle mixed weather conditions and improve image quality in autonomous driving scenarios. In this paper, we introduce a novel multi-task severe weather removal model that can effectively handle complex weather conditions in an adaptive manner. Our model incorporates a weather task sequence generator, enabling the self-attention mechanism to selectively focus on features specific to different weather types. To tackle the challenge of repairing large areas of weather degradation, we introduce Fast Fourier Convolution (FFC) to increase the receptive field. Additionally, we propose an adaptive upsampling technique that effectively processes both the weather task information and underlying image features by selectively retaining relevant information. Our proposed model has achieved state-of-the-art performance on the publicly available dataset.
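
The Fast Fourier Convolution mentioned above enlarges the receptive field by filtering in the frequency domain. Below is a 1-D pure-Python sketch of the spectral path (transform, pointwise multiply by a learnable filter, transform back), using a naive DFT for clarity; the helper names are illustrative. With an all-ones filter the round trip is an identity:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2), for illustration)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    """Inverse DFT."""
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def spectral_filter(x, weights):
    """Core of an FFC-style spectral path: scale each frequency bin
    by a (learnable) weight, then transform back to the signal domain."""
    X = dft(x)
    return [v.real for v in idft([Xj * w for Xj, w in zip(X, weights)])]

signal = [1.0, 2.0, 3.0, 4.0]
identity = spectral_filter(signal, [1.0] * 4)  # all-ones filter: recovers input
```

Because every frequency bin mixes all input positions, a single spectral layer sees the whole signal, which is why FFC helps repair large degraded regions.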

[CV-53] UAV (Unmanned Aerial Vehicles): Diverse Applications of UAV Datasets in Segmentation, Classification, Detection and Tracking

链接: https://arxiv.org/abs/2409.03245
作者: Md. Mahfuzur Rahman,Sunzida Siddique,Marufa Kamal,Rakib Hossain Rifat,Kishor Datta Gupta
关键词-EN: providing unmatched adaptability, Unmanned Aerial Vehicles, Unmanned Aerial, diverse research domains, UAV datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of Unmanned Aerial Vehicle (UAV) datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking. They facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.

[CV-54] Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection

链接: https://arxiv.org/abs/2409.03236
作者: Chenglizhao Chen,Xinyu Liu,Mengke Song,Luming Li,Xu Yu,Shanchen Pang
关键词-EN: surveillance applications, crucial for surveillance, Detecting anomalies, scenes, methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13pages, 9 figures

点击查看摘要

Abstract:Detecting anomalies in human-related videos is crucial for surveillance applications. Current methods primarily include appearance-based and action-based techniques. Appearance-based methods rely on low-level visual features such as color, texture, and shape. They learn a large number of pixel patterns and features related to known scenes during training, making them effective in detecting anomalies within these familiar contexts. However, when encountering new or significantly changed scenes, i.e., unknown scenes, they often fail because existing SOTA methods do not effectively capture the relationship between actions and their surrounding scenes, resulting in low generalization. In contrast, action-based methods focus on detecting anomalies in human actions but are usually less informative because they tend to overlook the relationship between actions and their scenes, leading to incorrect detection. For instance, the normal event of running on the beach and the abnormal event of running on the street might both be considered normal due to the lack of scene information. In short, current methods struggle to integrate low-level visual and high-level action features, leading to poor anomaly detection in varied and complex scenes. To address this challenge, we propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD). DecoAD significantly improves the integration of visual and action features through the decoupling and interweaving of scenes and actions, thereby enabling a more intuitive and accurate understanding of complex behaviors and scenes. DecoAD supports fully supervised, weakly supervised, and unsupervised settings.

[CV-55] Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation

链接: https://arxiv.org/abs/2409.03228
作者: Xixi Jiang,Dong Zhang,Xiang Li,Kangyi Liu,Kwang-Ting Cheng,Xin Yang
关键词-EN: medical image segmentation, image segmentation aims, unified semantic segmentation, semantic segmentation model, multi-organ medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by Medical Image Analysis

点击查看摘要

Abstract:Partially-supervised multi-organ medical image segmentation aims to develop a unified semantic segmentation model by utilizing multiple partially-labeled datasets, with each dataset providing labels for a single class of organs. However, the limited availability of labeled foreground organs and the absence of supervision to distinguish unlabeled foreground organs from the background pose a significant challenge, which leads to a distribution mismatch between labeled and unlabeled pixels. Although existing pseudo-labeling methods can be employed to learn from both labeled and unlabeled pixels, they are prone to performance degradation in this task, as they rely on the assumption that labeled and unlabeled pixels have the same distribution. In this paper, to address the problem of distribution mismatch, we propose a labeled-to-unlabeled distribution alignment (LTUDA) framework that aligns feature distributions and enhances discriminative capability. Specifically, we introduce a cross-set data augmentation strategy, which performs region-level mixing between labeled and unlabeled organs to reduce distribution discrepancy and enrich the training set. Besides, we propose a prototype-based distribution alignment method that implicitly reduces intra-class variation and increases the separation between the unlabeled foreground and background. This can be achieved by encouraging consistency between the outputs of two prototype classifiers and a linear classifier. Extensive experimental results on the AbdomenCT-1K dataset and a union of four benchmark datasets (including LiTS, MSD-Spleen, KiTS, and NIH82) demonstrate that our method outperforms the state-of-the-art partially-supervised methods by a considerable margin, and even surpasses the fully-supervised methods. The source code is publicly available at this https URL.
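
The cross-set, region-level mixing between labeled and unlabeled images can be sketched CutMix-style on a toy 2-D grid. The fixed rectangular region below is an illustrative assumption; the paper mixes organ-level regions:

```python
def region_mix(labeled, unlabeled, top, left, h, w):
    """Paste an (h x w) region from the unlabeled image into the
    labeled image at (top, left) -- region-level CutMix across sets."""
    mixed = [row[:] for row in labeled]  # copy rows of the 2-D grid
    for i in range(top, top + h):
        for j in range(left, left + w):
            mixed[i][j] = unlabeled[i][j]
    return mixed

# Toy 4x4 "images": labeled pixels are 0, unlabeled pixels are 1.
labeled = [[0] * 4 for _ in range(4)]
unlabeled = [[1] * 4 for _ in range(4)]
mixed = region_mix(labeled, unlabeled, top=1, left=1, h=2, w=2)
```

Training on such mixed inputs interpolates the two pixel populations, which is how the augmentation reduces the labeled-to-unlabeled distribution discrepancy.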

[CV-56] Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

链接: https://arxiv.org/abs/2409.03223
作者: Chenguang Zhu,Shan Gao,Huafeng Chen,Guangqian Guo,Chaowei Wang,Yaoxing Wang,Chen Shu Lei,Quanjiang Fan
关键词-EN: Multi-modality image fusion, render high-quality fusion, Multi-modality image, high-quality fusion images, image fusion aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of linear Transformer and Mamba, which has global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.

[CV-57] Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction

链接: https://arxiv.org/abs/2409.03213
作者: Shen Chen,Jiale Zhou,Lei Li
关键词-EN: Neural Radiance Fields, Radiance Fields, Neural Radiance, computational overhead compared, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF). However, 3DGS is susceptible to high-frequency artifacts and demonstrates suboptimal performance under sparse viewpoint conditions, thereby limiting its applicability in robotics and computer vision. To address these limitations, we introduce SVS-GS, a novel framework for Sparse Viewpoint Scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts. Furthermore, our approach incorporates a Depth Gradient Profile Prior (DGPP) loss with a dynamic depth mask to sharpen edges and 2D diffusion with Score Distillation Sampling (SDS) loss to enhance geometric consistency in novel view synthesis. Experimental evaluations on the MipNeRF-360 and SeaThru-NeRF datasets demonstrate that SVS-GS markedly improves 3D reconstruction from sparse viewpoints, offering a robust and efficient solution for scene understanding in robotics and computer vision applications.

[CV-58] Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty

链接: https://arxiv.org/abs/2409.03212
作者: Hersh Vakharia,Xiaoxiao Du
关键词-EN: improve reliability, Multiple Instance Learning, Sensor fusion combines, multiple sensor sources, Choquet integral
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, 7 tables; Accepted to 2024 FUZZ-IEEE and presented at 2024 IEEE WCCI; Code available at this https URL

点击查看摘要

Abstract:Sensor fusion combines data from multiple sensor sources to improve reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular, the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on the normalized fuzzy measures, which bounds the fuzzy measure values between [0, 1]. This can be limiting in cases where the underlying scales of input data sources are bipolar (i.e., between [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced “bi-mi-kee”), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bi-polar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to “bags” (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses on the behavior of the fuzzy measures to demonstrate our fusion process.
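
Bi-MIChI builds on the discrete Choquet integral; below is a sketch of the standard normalized-measure form it generalizes (the bi-capacity extension to [-1, 1] is the paper's contribution and is not reproduced). The sensor names and measure values are illustrative; with the additive measure shown, the integral reduces to a weighted sum:

```python
def choquet(values, measure):
    """Discrete Choquet integral of per-source confidences `values`
    (dict: source -> h(x)) w.r.t. a fuzzy measure `measure`
    (dict: frozenset of sources -> g(A)). Sort sources by descending
    value and accumulate (h_i - h_{i+1}) * g(top-i sources)."""
    order = sorted(values, key=values.get, reverse=True)
    result, prev_set = 0.0, frozenset()
    for i, src in enumerate(order):
        cur_set = prev_set | {src}
        nxt = values[order[i + 1]] if i + 1 < len(order) else 0.0
        result += (values[src] - nxt) * measure[cur_set]
        prev_set = cur_set
    return result

values = {"radar": 0.8, "lidar": 0.5}
g = {frozenset({"radar"}): 0.4, frozenset({"lidar"}): 0.6,
     frozenset({"radar", "lidar"}): 1.0}
fused = choquet(values, g)  # additive measure: 0.4*0.8 + 0.6*0.5 = 0.62
```

Setting `g` of the pair above (or below) the sum of the singletons models synergy (or redundancy) between sources, which is the nonlinear interaction the abstract refers to.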

[CV-59] iSeg: An Iterative Refinement-based Framework for Training-free Segmentation

链接: https://arxiv.org/abs/2409.03209
作者: Lin Sun,Jiale Cao,Jin Xie,Fahad Shahbaz Khan,Yanwei Pang
关键词-EN: Stable diffusion, strong semantic clue, demonstrated strong image, strong image synthesis, employing stable diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Stable diffusion has demonstrated strong image synthesis ability given text descriptions, suggesting that it contains strong semantic cues for grouping objects. Inspired by this, researchers have explored employing stable diffusion for training-free segmentation. Most existing approaches either simply employ the cross-attention map or refine it by the self-attention map to generate segmentation masks. We believe that iterative refinement with the self-attention map would lead to better results. However, we empirically demonstrate that such a refinement is sub-optimal, likely due to the self-attention map containing irrelevant global information which hampers accurately refining the cross-attention map over multiple iterations. To address this, we propose an iterative refinement framework for training-free segmentation, named iSeg, featuring an entropy-reduced self-attention module which utilizes a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, our iSeg stably improves the refined cross-attention map with iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks reveal the merits of the proposed contributions, leading to promising performance on diverse segmentation tasks. For unsupervised semantic segmentation on Cityscapes, our iSeg achieves an absolute gain of 3.8% in terms of mIoU compared to the best existing training-free approach in the literature. Moreover, our proposed iSeg can support segmentation with different kinds of images and interactions.
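
The entropy-reduction idea can be sketched on a single attention row: take a gradient-descent step on the entropy of the softmax distribution, which sharpens the attention and suppresses weak, diffuse responses. The gradient of softmax entropy with respect to the logits is dH/dz_i = -p_i (log p_i + H); the learning rate and toy logits below are illustrative, not the paper's settings:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_descent_step(logits, lr=0.5):
    """One gradient-descent step on H(softmax(logits)),
    using dH/dz_i = -p_i * (log p_i + H)."""
    p = softmax(logits)
    h = entropy(p)
    return [z - lr * (-pi * (math.log(pi) + h)) for z, pi in zip(logits, p)]

row = [1.0, 0.5, 0.0]               # one row of a self-attention logit map
row2 = entropy_descent_step(row)
h_before = entropy(softmax(row))
h_after = entropy(softmax(row2))    # smaller: the attention row sharpens
```

Iterating this step is the sketch-level analogue of the entropy-reduced self-attention module's update.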

[CV-60] TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

链接: https://arxiv.org/abs/2409.03206
作者: Mingze Gao,Jingyu Liu,Mingda Li,Jiangtao Xie,Qingbin Liu,Bo Zhao,Xi Chen,Hui Xiong
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, significantly improved performance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model’s capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM’s temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
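
The Frame-wise Block Causal Attention Mask can be constructed directly: full bidirectional attention among tokens within the same frame, causal attention across frames. A boolean-mask sketch with illustrative sizes (the real model interleaves text tokens as well, which is omitted here):

```python
def frame_block_causal_mask(num_frames, tokens_per_frame):
    """mask[q][k] is True when query token q may attend to key token k:
    tokens attend to everything in their own frame and in earlier
    frames, but never to tokens in future frames."""
    n = num_frames * tokens_per_frame
    frame_of = lambda t: t // tokens_per_frame
    return [[frame_of(k) <= frame_of(q) for k in range(n)] for q in range(n)]

# 3 frames x 2 visual tokens each -> a 6x6 mask.
mask = frame_block_causal_mask(num_frames=3, tokens_per_frame=2)
```

Compared to a strict token-level causal mask, this widens within-frame interactions while still preventing information leaking backward from future frames.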

[CV-61] Active Fake: DeepFake Camouflage

链接: https://arxiv.org/abs/2409.03200
作者: Pu Sun,Honggang Qi,Yuezun Li
关键词-EN: gained significant attention, significant attention due, manipulate facial attributes, Deep Neural Networks, high realism
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:DeepFake technology has gained significant attention due to its ability to manipulate facial attributes with high realism, raising serious societal concerns. Face-Swap DeepFake is the most harmful among these techniques, as it fabricates behaviors by swapping original faces with synthesized ones. Existing forensic methods, primarily based on Deep Neural Networks (DNNs), effectively expose these manipulations and have become important authenticity indicators. However, these methods mainly concentrate on capturing the blending inconsistency in DeepFake faces, raising a new security issue, termed Active Fake, which emerges when individuals intentionally create blending inconsistencies in their authentic videos to evade responsibility. This tactic is called DeepFake Camouflage. To achieve this, we introduce a new framework for creating DeepFake camouflage that generates blending inconsistencies while ensuring imperceptibility, effectiveness, and transferability. This framework, optimized via an adversarial learning strategy, crafts imperceptible yet effective inconsistencies to mislead forensic detectors. Extensive experiments demonstrate the effectiveness and robustness of our method, highlighting the need for further research in active fake detection.

[CV-62] RoomDiffusion: A Specialized Diffusion Model in the Interior Design Industry

链接: https://arxiv.org/abs/2409.03198
作者: Zhaowei Wang,Ying Hao,Hao Wei,Qing Xiao,Lulu Chen,Yulong Li,Yue Yang,Tianyi Li
关键词-EN: design remains underexplored, Recent advancements, visual content generation, significantly transformed visual, transformed visual content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-image diffusion models have significantly transformed visual content generation, yet their application in specialized fields such as interior design remains underexplored. In this paper, we present RoomDiffusion, a pioneering diffusion model meticulously tailored for the interior design industry. To begin with, we build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. Subsequently, techniques such as multi-aspect training, multi-stage fine-tuning, and model fusion are applied to enhance both the visual appeal and precision of the generated results. Lastly, leveraging the Latent Consistency Distillation method, we distill and expedite the model for optimal efficiency. Unlike existing models optimized for general scenarios, RoomDiffusion addresses specific challenges in interior design, such as lack of fashion, high furniture duplication rates, and inaccurate style. Through our holistic human evaluation protocol with more than 20 professional human evaluators, RoomDiffusion demonstrates industry-leading performance in terms of aesthetics, accuracy, and efficiency, surpassing all existing open-source models such as Stable Diffusion and SDXL.

[CV-63] PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

链接: https://arxiv.org/abs/2409.03192
作者: Bowen Tian,Songning Lai,Lujundong Li,Zhihao Shuai,Runwei Guan,Tian Wu,Yutao Yue
关键词-EN: computer vision technologies, witnessed significant advancements, vision technologies, Fine-grained image classification, advent of deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness. Our code has been open sourced at this https URL.
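PEPL 的 CAM 引导语义混合是该论文特有的,但任何伪标签流程的通用第一步——只保留模型高置信度预测的无标注样本——可以用几行代码勾勒出来(仅作示意;函数名与阈值为本文自拟,并非论文实现):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.95):
    """Keep only unlabeled samples whose top predicted class probability
    exceeds the confidence threshold; return (indices, labels)."""
    probs = np.asarray(probs)
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), labels[keep]

# Toy softmax outputs for four unlabeled images over three classes.
p = np.array([
    [0.97, 0.02, 0.01],   # confident -> kept
    [0.40, 0.35, 0.25],   # ambiguous -> discarded
    [0.01, 0.03, 0.96],   # confident -> kept
    [0.50, 0.45, 0.05],   # ambiguous -> discarded
])
idx, lbl = select_pseudo_labels(p, threshold=0.9)
print(idx.tolist(), lbl.tolist())  # [0, 2] [0, 2]
```

论文的第二阶段再用 CAM 对这些初始伪标签做语义级细化,此处不展开。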

[CV-64] Mastoidectomy Multi-View Synthesis from a Single Microscopy Image

链接: https://arxiv.org/abs/2409.03190
作者: Yike Zhang,Jack Noble
关键词-EN: Cochlear Implant, procedures involve performing, procedures involve, involve performing, performing an invasive
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Submitted to Medical Imaging 2025: Image-Guided Procedures, Robotic Interventions, and Modeling

点击查看摘要

Abstract:Cochlear Implant (CI) procedures involve performing an invasive mastoidectomy to insert an electrode array into the cochlea. In this paper, we introduce a novel pipeline that is capable of generating synthetic multi-view videos from a single CI microscope image. In our approach, we use a patient’s pre-operative CT scan to predict the post-mastoidectomy surface using a method designed for this purpose. We manually align the surface with a selected microscope frame to obtain an accurate initial pose of the reconstructed CT mesh relative to the microscope. We then perform UV projection to transfer the colors from the frame to surface textures. Novel views of the textured surface can be used to generate a large dataset of synthetic frames with ground truth poses. We evaluated the quality of synthetic views rendered using Pytorch3D and PyVista. We found both rendering engines lead to similarly high-quality synthetic novel-view frames compared to ground truth with a structural similarity index for both methods averaging about 0.86. A large dataset of novel views with known poses is critical for ongoing training of a method to automatically estimate microscope pose for 2D to 3D registration with the pre-operative CT to facilitate augmented reality surgery. This dataset will empower various downstream tasks, such as integrating Augmented Reality (AR) in the OR, tracking surgical tools, and supporting other video analysis studies.

[CV-65] Developing Analyzing and Evaluating Self-Drive Algorithms Using Drive-by-Wire Electric Vehicles

链接: https://arxiv.org/abs/2409.03114
作者: Beñat Froemming-Aldanondo,Tatiana Rastoskueva,Michael Evans,Marcial Machado,Anna Vadella,Rickey Johnson,Luis Escamilla,Milan Jostes,Devson Butani,Ryan Kaddis,Chan-Jin Chung,Joshua Siegel
关键词-EN: effective autonomous driving, Robot Operating System, essential for safe, safe and effective, effective autonomous
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Supported by the National Science Foundation under Grants No. 2150292 and 2150096

点击查看摘要

Abstract:Reliable lane-following algorithms are essential for safe and effective autonomous driving. This project was primarily focused on developing and evaluating different lane-following programs to find the most reliable algorithm for a Vehicle to Everything (V2X) project. The algorithms were first tested on a simulator and then with real vehicles equipped with a drive-by-wire system using ROS (Robot Operating System). Their performance was assessed through reliability, comfort, speed, and adaptability metrics. The results show that the two most reliable approaches detect both lane lines and use unsupervised learning to separate them. These approaches proved to be robust in various driving scenarios, making them suitable candidates for integration into the V2X project.
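摘要提到用无监督学习把检测到的两条车道线分开,但未给出具体算法;下面用一个极简的一维 2-means 对车道像素的 x 坐标聚类来示意这种分离思路(函数名与数据均为本文虚构的演示):

```python
import numpy as np

def separate_lanes(xs, iters=10):
    """Separate detected lane-line pixel x-coordinates into left/right
    lanes with a tiny 1-D 2-means (a stand-in illustration for the
    unsupervised separation step described in the abstract)."""
    xs = np.asarray(xs, dtype=float)
    c = np.array([xs.min(), xs.max()])          # initial cluster centers
    for _ in range(iters):
        assign = np.abs(xs[:, None] - c).argmin(axis=1)
        c = np.array([xs[assign == k].mean() for k in (0, 1)])
    return assign, c

xs = [102, 98, 105, 411, 405, 399, 101]         # pixel columns of lane points
assign, centers = separate_lanes(xs)
print(sorted(centers.round(1)))  # [101.5, 405.0]
```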

[CV-66] FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

链接: https://arxiv.org/abs/2409.03109
作者: Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Abdenour Hadid
关键词-EN: Fake Image Detection, introduce FIDAVL, Vision-Language Model, Detection and Attribution, FIDAVL
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language along with a soft prompt-tuning strategy to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47% while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at this https URL.

[CV-67] Spatial Diffusion for Cell Layout Generation MICCAI2024

链接: https://arxiv.org/abs/2409.03106
作者: Chen Li,Xiaoling Hu,Shahira Abousamra,Meilong Xu,Chao Chen
关键词-EN: augment training sets, augment training, training sets, Generative models, Generative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 4 figures, accepted by MICCAI 2024

点击查看摘要

Abstract:Generative models, such as GANs and diffusion models, have been used to augment training sets and boost performance in different tasks. We focus on generative models for cell detection instead, i.e., locating and classifying cells in given pathology images. One important source of information that has been largely overlooked is the spatial patterns of the cells. In this paper, we propose a spatial-pattern-guided generative model for cell layout generation. Specifically, we propose a novel diffusion model that is guided by spatial features and generates realistic cell layouts. We explore different density models as spatial features for the diffusion model. In downstream tasks, we show that the generated cell layouts can be used to guide the generation of high-quality pathology images. Augmenting with these images can significantly boost the performance of SOTA cell detection methods. The code is available at this https URL.

[CV-68] MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation ECCV2024

链接: https://arxiv.org/abs/2409.03062
作者: Shehan Perera,Yunus Erzurumlu,Deepak Gulati,Alper Yilmaz
关键词-EN: medical image analysis, cancer segmentation poses, poses a significant, significant challenge, challenge in medical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at ECCV 2024 - BioImage Computing Workshop (Oral)

点击查看摘要

Abstract:Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally, many Transformer-based approaches rely primarily on CNN-based decoders, overlooking the benefits of Transformer-based decoding models. Recognizing these limitations, we address the need for efficient, lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features. 1) MobileUNETR comprises a lightweight hybrid CNN-Transformer encoder to help balance local and global contextual feature extraction in an efficient manner; 2) A novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) Surpassing large and complex architectures, MobileUNETR achieves superior performance with 3 million parameters and a computational complexity of 1.3 GFLOPs, resulting in 10x and 23x reductions in parameters and FLOPS, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets, including ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. The code will be publicly available at: this https URL

[CV-69] Incorporating dense metric depth into neural 3D representations for view synthesis and relighting

链接: https://arxiv.org/abs/2409.03061
作者: Arkadeep Narayan Chaudhury,Igor Vasiljevic,Sergey Zakharov,Vitor Guizilini,Rares Ambrus,Srinivasa Narasimhan,Christopher G. Atkeson
关键词-EN: Synthesizing accurate geometry, Synthesizing accurate, convenient product capture, virtual reality, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Synthesizing accurate geometry and photo-realistic appearance of small scenes is an active area of research with compelling use cases in gaming, virtual reality, robotic-manipulation, autonomous driving, convenient product capture, and consumer-level photography. When applying scene geometry and appearance estimation techniques to robotics, we found that the narrow cone of possible viewpoints due to the limited range of robot motion and scene clutter caused current estimation techniques to produce poor quality estimates or even fail. On the other hand, in robotic applications, dense metric depth can often be measured directly using stereo and illumination can be controlled. Depth can provide a good initial estimate of the object geometry to improve reconstruction, while multi-illumination images can facilitate relighting. In this work we demonstrate a method to incorporate dense metric depth into the training of neural 3D representations and address an artifact observed while jointly refining geometry and appearance by disambiguating between texture and geometry edges. We also discuss a multi-flash stereo camera system developed to capture the necessary data for our pipeline and show results on relighting and view synthesis with a few training views.

[CV-70] Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV2024

链接: https://arxiv.org/abs/2409.03043
作者: Christiaan Viviers,Amaan Valiuddin,Francisco Caetano,Lemar Abdi,Lena Filatova,Peter de With,Fons van der Sommen
关键词-EN: high-level image statistics, normal and In-Distribution, high-level image, distribution shift aims, OOD detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with high-level image statistics that differ from the captured, normal, In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect most occurring sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image-components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
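基于生成模型的 OOD 检测核心思路是:在 ID 数据上拟合一个密度模型,对低似然输入判为 OOD。下面用对角高斯代替论文中的条件 Normalizing Flow 做一个最小示意(类名、阈值选择均为本文假设,非论文实现):

```python
import numpy as np

class GaussianDensityOOD:
    """Stand-in for a learned density model (the paper uses conditional
    normalizing flows): fit a diagonal Gaussian to in-distribution
    features and flag low log-likelihood samples as OOD."""
    def fit(self, x):
        self.mu = x.mean(axis=0)
        self.var = x.var(axis=0) + 1e-6
        return self

    def log_likelihood(self, x):
        return -0.5 * (((x - self.mu) ** 2 / self.var)
                       + np.log(2 * np.pi * self.var)).sum(axis=1)

    def is_ood(self, x, threshold):
        return self.log_likelihood(x) < threshold

rng = np.random.default_rng(0)
id_data = rng.normal(0.0, 1.0, size=(1000, 8))     # in-distribution features
shifted = rng.normal(5.0, 1.0, size=(10, 8))       # covariate-shifted samples
model = GaussianDensityOOD().fit(id_data)
thr = np.percentile(model.log_likelihood(id_data), 1)  # ~1% FPR on ID data
print(model.is_ood(shifted, thr).all())  # True: shifted samples score far below ID
```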

[CV-71] MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes

链接: https://arxiv.org/abs/2409.03034
作者: Avigail Cohen Rimon,Tal Shnitzer,Mirela Ben Chen
关键词-EN: Fourier Filter Bank, Neural Fourier Filter, frequency domains, framework for representing, triangle meshes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.
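摘要中提到的 Fourier feature mapping 是标准技巧;对标量坐标的一个最小版本如下,频带按 2 的幂指数间隔,高频带对应更精细的空间细节(`num_bands` 与输入范围 [0, 1] 为本文自选,非论文设置):

```python
import numpy as np

def fourier_features(v, num_bands=4):
    """Map scalar coordinates v in [0, 1] to sin/cos features at
    exponentially spaced frequencies (2^0 .. 2^{num_bands-1} times pi),
    so higher bands capture finer detail."""
    v = np.asarray(v, dtype=float)[:, None]            # (N, 1)
    freqs = 2.0 ** np.arange(num_bands) * np.pi        # (B,)
    angles = v * freqs                                  # (N, B)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (N, 2B)

feats = fourier_features(np.array([0.0, 0.5, 1.0]), num_bands=4)
print(feats.shape)  # (3, 8)
```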

[CV-72] A General Albedo Recovery Approach for Aerial Photogrammetric Images through Inverse Rendering

链接: https://arxiv.org/abs/2409.03032
作者: Shuang Song,Rongjun Qin
关键词-EN: Modeling outdoor scenes, complicated unmodeled physics, Modeling outdoor, ill-posed problem due, volume scattering
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: ISPRS Journal of Photogrammetry and Remote Sensing

点击查看摘要

Abstract:Modeling outdoor scenes for the synthetic 3D environment requires the recovery of reflectance/albedo information from raw images, which is an ill-posed problem due to the complicated unmodeled physics in this process (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in a practical context. The recovered albedo can facilitate model relighting and shading, which can further enhance the realism of rendered models and the applications of digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently embed unwanted lighting artifacts (at the time of capture) into the texture. Therefore, these polluted textures are suboptimal for a synthetic environment to enable realistic rendering. In addition, these embedded environmental lightings further bring challenges to photo-consistencies across different images that cause image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illuminations and derives the inverse model to resolve the albedo information through inverse rendering intrinsic image decomposition. Our approach builds on the fact that both the sun illumination and scene geometry are estimable in aerial photogrammetry, thus they can provide direct inputs for this ill-posed problem. This physics-based approach does not require additional input other than data acquired through the typical drone-based photogrammetric collection and was shown to favorably outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image processing tasks in photogrammetry such as feature and dense matching, edge, and line extraction.
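在最简化的 Lambertian 假设下,反照率恢复退化为"把估计出的 shading 除掉":I = albedo × E × max(n·l, 0)。下面的玩具示例只演示这一直觉(是本文的极端简化,远不及论文的完整成像模型;太阳方向与法线在航摄测量中均可估计,正如摘要所述):

```python
import numpy as np

def recover_albedo(image, sun_dir, normals, sun_irradiance=1.0, eps=1e-4):
    """Toy Lambertian inverse: divide out the per-pixel shading
    I = albedo * E * max(n . l, 0) to recover albedo."""
    shading = sun_irradiance * np.clip(normals @ sun_dir, eps, None)
    return image / shading[..., None]

# One flat, upward-facing pixel and one tilted pixel, same true albedo 0.5.
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.6, 0.8]])
sun = np.array([0.0, 0.0, 1.0])
albedo_true = np.array([[0.5], [0.5]])
image = albedo_true * np.clip(normals @ sun, 0, None)[..., None]
print(recover_albedo(image, sun, normals).ravel())  # [0.5 0.5]
```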

[CV-73] No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

链接: https://arxiv.org/abs/2409.03025
作者: Manu Gaur,Darshan Singh S,Makarand Tapaswi
关键词-EN: unable to generate, trained on data, generate fine-grained captions, Visual Caption Boosting, human annotations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model’s fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner’s ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
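Self-retrieval 的打分直觉是:一条描述的嵌入能否在干扰图像中检索回它自己的图像。下面用一个玩具的 caption-image 相似度矩阵计算 recall@1 示意这一思路(数值为虚构,非论文的评测代码):

```python
import numpy as np

def self_retrieval_recall_at_1(sim):
    """Given a (captions x images) similarity matrix where caption i
    describes image i, score the fraction of captions whose most
    similar image is their own (recall@1)."""
    sim = np.asarray(sim)
    return float((sim.argmax(axis=1) == np.arange(len(sim))).mean())

# Toy similarities: captions 0 and 2 retrieve their own image; caption 1 fails.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],   # caption 1 is closest to image 0, not image 1
    [0.2, 0.1, 0.7],
])
print(self_retrieval_recall_at_1(sim))  # 0.666...
```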

[CV-74] Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes

链接: https://arxiv.org/abs/2409.03022
作者: Mehmet Kerem Turkcan,Ian Li,Chengbo Zang,Javad Ghaderi,Gil Zussman,Zoran Kostic
关键词-EN: dense urban streetscapes, data generation system, photo-realistic synthetic data, highly accurate object, enabling highly accurate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Boundless, a photo-realistic synthetic data generation system for enabling highly accurate object detection in dense urban streetscapes. Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling) with an automated and configurable process. Boundless is based on the Unreal Engine 5 (UE5) City Sample project with improvements enabling accurate collection of 3D bounding boxes across different lighting and scene variability conditions. We evaluate the performance of object detection models trained on the dataset generated by Boundless when used for inference on a real-world dataset acquired from medium-altitude cameras. We compare the performance of the Boundless-trained model against the CARLA-trained model and observe an improvement of 7.8 mAP. The results we achieved support the premise that synthetic data generation is a credible methodology for training/fine-tuning scalable object detection models for urban scenes.
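上文的检测精度以 mAP 报告,其底层的匹配准则是包围框的交并比(IoU);下面是一个与论文评测代码无关的最小 IoU 辅助函数,仅说明该指标的计算方式:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2),
    the overlap criterion underlying detection metrics such as mAP."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1428...
```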

[CV-75] Design and Evaluation of Camera-Centric Mobile Crowdsourcing Applications

链接: https://arxiv.org/abs/2409.03012
作者: Abby Stylianou,Michelle Brachman,Albatool Wazzan,Samuel Black,Richard Souvenir
关键词-EN: underlies automated methods, machine learning, fine-grained recognition, underlies automated, automated methods
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The data that underlies automated methods in computer vision and machine learning, such as image retrieval and fine-grained recognition, often comes from crowdsourcing. In contexts that rely on the intrinsic motivation of users, we seek to understand how the application design affects a user’s willingness to contribute and the quantity and quality of the data they capture. In this project, we designed three versions of a camera-based mobile crowdsourcing application, which varied in the amount of labeling effort requested of the user and conducted a user study to evaluate the trade-off between the level of user-contributed information requested and the quantity and quality of labeled images collected. The results suggest that higher levels of user labeling do not lead to reduced contribution. Users collected and annotated the most images using the application version with the highest requested level of labeling with no decrease in user satisfaction. In preliminary experiments, the additional labeled data supported increased performance on an image retrieval task.

[CV-76] Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors

链接: https://arxiv.org/abs/2409.02979
作者: Haiyu Wu,Jaskirat Singh,Sicong Tian,Liang Zheng,Kevin W. Bowyer
关键词-EN: non-existent persons, paper studies, synthesize face images, face, identities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
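摘要的两个目标——类间分离与类内变化——可以在向量空间中直观模拟:拒绝采样出两两余弦相似度低的身份向量,再对单个身份向量做小幅扰动。以下草图只演示这一向量空间直觉,并非 Vec2Face 本身(所有函数名与参数均为本文假设):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_identity_vectors(n, dim, max_sim=0.3, seed=0):
    """Rejection-sample identity vectors whose pairwise cosine similarity
    stays below max_sim (inter-class separation)."""
    rng = np.random.default_rng(seed)
    vecs = []
    while len(vecs) < n:
        v = rng.normal(size=dim)
        if all(cosine(v, u) < max_sim for u in vecs):
            vecs.append(v)
    return np.stack(vecs)

ids = sample_identity_vectors(n=5, dim=64, max_sim=0.3)
# Intra-class variation: small perturbations of one identity vector
rng = np.random.default_rng(1)
variants = ids[0] + 0.05 * rng.normal(size=(3, 64))
print(all(cosine(ids[0], v) > 0.9 for v in variants))  # True: same identity
```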

[CV-77] Multi-Modal Adapter for Vision-Language Models

链接: https://arxiv.org/abs/2409.02958
作者: Dominykas Seputis,Serghei Mihailov,Soham Chatterjee,Zehao Xiao
关键词-EN: Large pre-trained vision-language, Large pre-trained, pre-trained vision-language models, requiring retraining, image classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large pre-trained vision-language models, such as CLIP, have demonstrated state-of-the-art performance across a wide range of image classification tasks, without requiring retraining. Few-shot CLIP is competitive with existing specialized architectures that were trained on the downstream tasks. Recent research demonstrates that the performance of CLIP can be further improved using lightweight adaptation approaches. However, previous methods adapt different modalities of the CLIP model individually, ignoring the interactions and relationships between visual and textual representations. In this work, we propose Multi-Modal Adapter, an approach for Multi-Modal adaptation of CLIP. Specifically, we add a trainable Multi-Head Attention layer that combines text and image features to produce an additive adaptation of both. Multi-Modal Adapter demonstrates improved generalizability, based on its performance on unseen classes compared to existing adaptation methods. We perform additional ablations and investigations to validate and interpret the proposed approach.
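核心思想——在文本与图像特征的拼接上做注意力,使每个模态的适配依赖另一模态——可以极简地勾勒如下(单头、无学习投影,仅为示意而非论文中的 adapter 实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_feat, image_feat):
    """Scaled dot-product attention over the concatenation of text and
    image features, so the fused output of each token can depend on
    both modalities (single head, no learned projections, for brevity)."""
    x = np.concatenate([text_feat, image_feat], axis=0)   # (T+I, d)
    d = x.shape[1]
    attn = softmax(x @ x.T / np.sqrt(d))                  # (T+I, T+I)
    return attn @ x                                       # fused features

rng = np.random.default_rng(0)
text, image = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
fused = joint_attention(text, image)
print(fused.shape)  # (8, 16)
```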

[CV-78] Tissue Concepts: supervised foundation models in computational pathology

链接: https://arxiv.org/abs/2409.03519
作者: Till Nicke,Jan Raphael Schaefer,Henning Hoefener,Friedrich Feuerhake,Dorit Merhof,Fabian Kiessling,Johannes Lotz
关键词-EN: quantitative biomarker evaluation, Tissue Concepts encoder, Tissue Concepts, support diagnostic tasks, Tissue Concepts model
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 Pages, 3 Figures, submitted to and under revision at Computers in Biology and Medicine

点击查看摘要

Abstract:Due to the increasing workload of pathologists, the need for automation to support diagnostic tasks and quantitative biomarker evaluation is becoming more and more apparent. Foundation models have the potential to improve generalizability within and across centers and serve as starting points for data-efficient development of specialized yet robust AI models. However, training foundation models is itself usually very expensive in terms of data, computation, and time. This paper proposes a supervised training method that drastically reduces these expenses. The proposed method is based on multi-task learning to train a joint encoder, by combining 16 different classification, segmentation, and detection tasks on a total of 912,000 patches. Since the encoder is capable of capturing the properties of the samples, we term it the Tissue Concepts encoder. To evaluate the performance and generalizability of the Tissue Concepts encoder across centers, classification of whole slide images from four of the most prevalent solid cancers - breast, colon, lung, and prostate - was used. The experiments show that the Tissue Concepts model achieves comparable performance to models trained with self-supervision, while requiring only 6% of the amount of training patches. Furthermore, the Tissue Concepts encoder outperforms an ImageNet pre-trained encoder on both in-domain and out-of-domain data.

[CV-79] TBConvL-Net: A Hybrid Deep Learning Architecture for Robust Medical Image Segmentation

链接: https://arxiv.org/abs/2409.03367
作者: Shahzaib Iqbal,Tariq M. Khan,Syed S. Naqvi,Asim Naveed,Erik Meijering
关键词-EN: shown great potential, automated medical image, disease diagnostics, medical image segmentation, shown great
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning has shown great potential for automated medical image segmentation to improve the precision and speed of disease diagnostics. However, the task presents significant difficulties due to variations in the scale, shape, texture, and contrast of the pathologies. Traditional convolutional neural network (CNN) models have certain limitations when it comes to effectively modelling multiscale context information and facilitating information interaction between skip connections across levels. To overcome these limitations, a novel deep learning architecture is introduced for medical image segmentation, taking advantage of CNNs and vision transformers. Our proposed model, named TBConvL-Net, involves a hybrid network that combines the local features of a CNN encoder-decoder architecture with long-range and temporal dependencies using biconvolutional long-short-term memory (LSTM) networks and vision transformers (ViT). This enables the model to capture contextual channel relationships in the data and account for the uncertainty of segmentation over time. Additionally, we introduce a novel composite loss function that considers both the segmentation robustness and the boundary agreement of the predicted output with the gold standard. Our proposed model shows consistent improvement over the state of the art on ten publicly available datasets of seven different medical imaging modalities.
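此类复合损失中的"分割鲁棒性"项通常是 Dice 项(1 − Dice 作为损失);下面给出一个最小的 Dice score 实现作示意,命名为本文自拟,并非论文的精确损失定义:

```python
import numpy as np

def dice_score(pred, target, eps=1e-6):
    """Dice overlap between a binary prediction and ground-truth mask;
    1 - dice_score is a common segmentation loss term."""
    pred, target = np.asarray(pred).ravel(), np.asarray(target).ravel()
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

pred   = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_score(pred, target), 3))  # 2*2/(3+3) = 0.667
```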

[CV-80] Perceptual-Distortion Balanced Image Super-Resolution is a Multi-Objective Optimization Problem

链接: https://arxiv.org/abs/2409.03179
作者: Qiwen Zhu,Yanjie Wang,Shilv Cai,Liqun Chen,Jiahuan Zhou,Luxin Yan,Sheng Zhong,Xu Zou
关键词-EN: PSNR and SSIM, pixel-based regression losses, blurry images due, Training Single-Image Super-Resolution, distortion metrics scores
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Training Single-Image Super-Resolution (SISR) models using pixel-based regression losses can achieve high distortion metrics scores (e.g., PSNR and SSIM), but often results in blurry images due to insufficient recovery of high-frequency details. Conversely, using GAN or perceptual losses can produce sharp images with high perceptual metric scores (e.g., LPIPS), but may introduce artifacts and incorrect textures. Balancing these two types of losses can help achieve a trade-off between distortion and perception, but the challenge lies in tuning the loss function weights. To address this issue, we propose a novel method that incorporates Multi-Objective Optimization (MOO) into the training process of SISR models to balance perceptual quality and distortion. We conceptualize the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions to be optimized within our Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework. This approach automates the hyperparameter tuning process, reduces overall computational cost, and enables the use of numerous loss functions simultaneously. Extensive experiments demonstrate that MOBOSR outperforms state-of-the-art methods in terms of both perceptual quality and distortion, significantly advancing the perception-distortion Pareto frontier. Our work points towards a new direction for future research on balancing perceptual quality and fidelity in nearly all image restoration tasks. The source code and pretrained models are available at: this https URL.
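论文优化的 perception-distortion 权衡本质上是 Pareto 前沿问题;下面用一个小的支配关系检查(候选点取值为虚构,两个目标均取越小越好)示意这一概念:

```python
def pareto_front(points):
    """Return the points not dominated by any other, where both
    objectives (e.g. distortion and perceptual scores) are minimized."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(qi <= pi for qi, pi in zip(q, p))
            and any(qi < pi for qi, pi in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (distortion, perceptual) pairs for four loss-weight settings; lower is better.
candidates = [(0.10, 0.50), (0.20, 0.20), (0.50, 0.10), (0.40, 0.40)]
print(pareto_front(candidates))  # [(0.1, 0.5), (0.2, 0.2), (0.5, 0.1)]
```

(0.40, 0.40) 在两个目标上都被 (0.20, 0.20) 支配,因此不在前沿上;MOBOSR 的贝叶斯优化即在此类前沿上搜索损失权重。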

[CV-81] MSTT-199: MRI Dataset for Musculoskeletal Soft Tissue Tumor Segmentation

链接: https://arxiv.org/abs/2409.03110
作者: Tahsin Reasat,Stephen Chenard,Akhil Rekulapelli,Nicholas Chadwick,Joanna Shechtel,Katherine van Schaik,David S. Smith,Joshua Lawrenz
关键词-EN: Accurate musculoskeletal soft, influencing patient outcomes, Accurate musculoskeletal, musculoskeletal soft tissue, response to treatment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Dataset will be made publicly available after the acceptance of the paper

点击查看摘要

Abstract:Accurate musculoskeletal soft tissue tumor segmentation is vital for assessing tumor size, location, diagnosis, and response to treatment, thereby influencing patient outcomes. However, segmentation of these tumors requires clinical expertise, and an automated segmentation model would save valuable time for both clinicians and patients. Training an automatic model requires a large dataset of annotated images. In this work, we describe the collection of an MR imaging dataset of 199 musculoskeletal soft tissue tumors from 199 patients. We trained segmentation models on this dataset and then benchmarked them on a publicly available dataset. Our model achieved a state-of-the-art Dice score of 0.79 out of the box, without any fine-tuning, which shows the diversity and utility of our curated dataset. We analyzed the model predictions and found that its performance suffered on fibrous and vascular tumors due to their diverse anatomical location, size, and intensity heterogeneity. The code and models are available in the following GitHub repository, this https URL
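For reference, the Dice score cited above measures mask overlap as twice the intersection over the total mask size; a minimal version for flat binary masks (illustrative, not the authors' code):

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    assert len(pred) == len(target)
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * intersection / total
```

A score of 0.79 thus means the predicted and expert masks overlap on roughly four-fifths of their combined area.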

[CV-82] Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation

链接: https://arxiv.org/abs/2409.03087
作者: Amir Syahmi,Xiangrong Lu,Yinxuan Li,Haoxuan Yao,Hanjun Jiang,Ishita Acharya,Shiyi Wang,Yang Nan,Xiaodan Xing,Guang Yang
关键词-EN: Recent advancements, high-quality annotated datasets, enhanced diagnostic capabilities, greatly enhanced diagnostic, artificial intelligence
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, we introduce a robust and versatile framework that combines AI and crowdsourcing to improve both the quality and quantity of medical image datasets across different modalities. Our approach utilises a user-friendly online platform that enables a diverse group of crowd annotators to label medical images efficiently. By integrating the MedSAM segmentation AI with this platform, we accelerate the annotation process while maintaining expert-level quality through an algorithm that merges crowd-labelled images. Additionally, we employ pix2pixGAN, a generative AI model, to expand the training dataset with synthetic images that capture realistic morphological features. These methods are combined into a cohesive framework designed to produce an enhanced dataset, which can serve as a universal pre-processing pipeline to boost the training of any medical deep learning segmentation model. Our results demonstrate that this framework significantly improves model performance, especially when training data is limited.
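The abstract does not spell out the crowd-label merging algorithm; a common baseline for merging several crowd-sourced binary masks is a pixelwise vote, sketched here as an assumption rather than the paper's method:

```python
def merge_crowd_masks(masks, threshold=0.5):
    """Merge several crowd-sourced binary masks (flat 0/1 lists of equal
    length) by pixelwise voting: a pixel is kept if at least `threshold`
    of the annotators marked it."""
    n = len(masks)
    length = len(masks[0])
    merged = []
    for i in range(length):
        votes = sum(m[i] for m in masks)
        merged.append(1 if votes / n >= threshold else 0)
    return merged
```

With AI pre-segmentations (e.g., from MedSAM) added as extra "annotators", the same vote blends model and crowd labels.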

机器学习

[LG-0] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

链接: https://arxiv.org/abs/2409.03757
作者: Yunze Man,Shuhong Zheng,Zhipeng Bao,Martial Hebert,Liang-Yan Gui,Yu-Xiong Wang
关键词-EN: gained increasing attention, scene encoding strategies, encoding strategies playing, increasing attention, gained increasing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Project page: this https URL , Github: this https URL

点击查看摘要

Abstract:Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.

[LG-1] WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

链接: https://arxiv.org/abs/2409.03753
作者: Yuntian Deng,Wenting Zhao,Jack Hessel,Xiang Ren,Claire Cardie,Yejin Choi
关键词-EN: offers exciting opportunities, data offers exciting, study user-chatbot interactions, conversation data offers, real-world conversation data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis’s utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.

[LG-2] Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron

链接: https://arxiv.org/abs/2409.03749
作者: Christian Schmid,James M. Murray
关键词-EN: efficiently learn depends, learn depends crucially, learning, equations describing learning, efficiently learn
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The ability of a brain or a neural network to efficiently learn depends crucially on both the task structure and the learning rule. Previous works have analyzed the dynamical equations describing learning in the relatively simplified context of the perceptron under assumptions of a student-teacher framework or a linearized output. While these assumptions have facilitated theoretical understanding, they have precluded a detailed understanding of the roles of the nonlinearity and input-data distribution in determining the learning dynamics, limiting the applicability of the theories to real biological or artificial neural networks. Here, we use a stochastic-process approach to derive flow equations describing learning, applying this framework to the case of a nonlinear perceptron performing binary classification. We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron’s learning curve and the forgetting curve as subsequent tasks are learned. In particular, we find that the input-data noise differently affects the learning speed under SL vs. RL, and determines how quickly learning of a task is overwritten by subsequent learning. Additionally, we verify our approach with real data using the MNIST dataset. This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
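As a toy companion to the setting above, here is plain stochastic gradient descent for a sigmoid (nonlinear) perceptron doing binary classification; this is a hedged sketch of the supervised-learning case only, not the paper's flow-equation analysis:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(data, lr=0.5, epochs=200, seed=0):
    """Supervised logistic-loss SGD for a nonlinear (sigmoid) perceptron.
    data: list of (inputs, label) pairs with label in {0, 1}; a bias can
    be included as a constant input component."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [rng.gauss(0, 0.1) for _ in range(dim)]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```

The RL variant the paper contrasts with would replace the supervised error term with a reward-weighted update; the flow-equation analysis then describes the average trajectory of `w` under either rule.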

[LG-3] Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm? NDSS

链接: https://arxiv.org/abs/2409.03741
作者: Rui Wen,Michael Backes,Yang Zhang
关键词-EN: revolutionized numerous domains, enabling data-centric processes, Machine learning, numerous domains, playing a crucial
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: To Appear in Network and Distributed System Security (NDSS) Symposium 2025

点击查看摘要

Abstract:Machine learning has revolutionized numerous domains, playing a crucial role in driving advancements and enabling data-centric processes. The significance of data in training models and shaping their performance cannot be overstated. Recent research has highlighted the heterogeneous impact of individual data samples, particularly the presence of valuable data that significantly contributes to the utility and effectiveness of machine learning models. However, a critical question remains unanswered: are these valuable data samples more vulnerable to machine learning attacks? In this work, we investigate the relationship between data importance and machine learning attacks by analyzing five distinct attack types. Our findings reveal notable insights. For example, we observe that high importance data samples exhibit increased vulnerability in certain attacks, such as membership inference and model stealing. By analyzing the linkage between membership inference vulnerability and data importance, we demonstrate that sample characteristics can be integrated into membership metrics by introducing sample-specific criteria, therefore enhancing the membership inference performance. These findings emphasize the urgent need for innovative defense mechanisms that strike a balance between maximizing utility and safeguarding valuable data against potential exploitation.
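A classic membership inference baseline, consistent with the abstract's point that well-fit samples leak more, simply thresholds the per-sample loss: members tend to have lower loss than non-members. This sketch is illustrative, not the paper's attack:

```python
def pick_threshold(member_losses, nonmember_losses):
    """Simple heuristic: midpoint of the mean losses of known members and
    non-members (e.g., estimated from a shadow model)."""
    m = sum(member_losses) / len(member_losses)
    n = sum(nonmember_losses) / len(nonmember_losses)
    return (m + n) / 2.0

def loss_threshold_attack(losses, threshold):
    """Predict 'member' (True) when the target model's loss on a sample
    falls below the threshold, since members are typically fit better."""
    return [loss < threshold for loss in losses]
```

The paper's finding that sample importance can be folded into such membership metrics would correspond to making the threshold sample-specific rather than global.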

[LG-4] Differentiable Discrete Event Simulation for Queuing Network Control

链接: https://arxiv.org/abs/2409.03740
作者: Ethan Che,Jing Dong,Hongseok Namkoong
关键词-EN: Queuing network control, Queuing network, manufacturing processes, essential for managing, managing congestion
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Queuing network control is essential for managing congestion in job-processing systems such as service systems, communication networks, and manufacturing processes. Despite growing interest in applying reinforcement learning (RL) techniques, queueing network control poses distinct challenges, including high stochasticity, large state and action spaces, and lack of stability. To tackle these challenges, we propose a scalable framework for policy optimization based on differentiable discrete event simulation. Our main insight is that by implementing a well-designed smoothing technique for discrete event dynamics, we can compute pathwise policy gradients for large-scale queueing networks using auto-differentiation software (e.g., Tensorflow, PyTorch) and GPU parallelization. Through extensive empirical experiments, we observe that our policy gradient estimators are several orders of magnitude more accurate than typical REINFORCE-based estimators. In addition, we propose a new policy architecture, which drastically improves stability while maintaining the flexibility of neural-network policies. In a wide variety of scheduling and admission control tasks, we demonstrate that training control policies with pathwise gradients leads to a 50-1000x improvement in sample efficiency over state-of-the-art RL methods. Unlike prior tailored approaches to queueing, our methods can flexibly handle realistic scenarios, including systems operating in non-stationary environments and those with non-exponential interarrival/service times.

[LG-5] LLM-CI: Assessing Contextual Integrity Norms in Language Models

链接: https://arxiv.org/abs/2409.03735
作者: Yan Shvartzshnaider,Vasisht Duddu,John Lacalamita
关键词-EN: Large language models, training data scraped, Large language, inadvertently encode societal, encode societal preferences
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
*备注: 20 pages, 8 Figures, 4 Tables

点击查看摘要

Abstract:Large language models (LLMs), while memorizing parts of their training data scraped from the Internet, may also inadvertently encode societal preferences and norms. As these models are integrated into sociotechnical systems, it is crucial that the norms they encode align with societal expectations. These norms could vary across models, hyperparameters, optimization techniques, and datasets. This is especially challenging due to prompt sensitivity - small variations in prompts yield different responses, rendering existing assessment methodologies unreliable. There is a need for a comprehensive framework covering various models, optimization, and datasets, along with a reliable methodology to assess encoded norms. We present LLM-CI, the first open-sourced framework to assess privacy norms encoded in LLMs. LLM-CI uses a Contextual Integrity-based factorial vignette methodology to assess the encoded norms across different contexts and LLMs. We propose the multi-prompt assessment methodology to address prompt sensitivity by assessing the norms from only the prompts that yield consistent responses across multiple variants. Using LLM-CI and our proposed methodology, we comprehensively evaluate LLMs using IoT and COPPA vignettes datasets from prior work, examining the impact of model properties (e.g., hyperparameters, capacity) and optimization strategies (e.g., alignment, quantization).
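The multi-prompt idea, keeping only prompts whose paraphrased variants agree, can be sketched as a simple consistency filter; this is an assumption about the mechanics, not the released framework's code:

```python
from collections import Counter

def consistent_prompts(responses_by_prompt, min_agreement=1.0):
    """Keep only prompts whose paraphrased variants yield the same answer.
    responses_by_prompt: {prompt_id: [answer for each variant]}.
    min_agreement: fraction of variants that must agree (1.0 = unanimous).
    Returns {prompt_id: majority_answer} for the prompts that pass."""
    kept = {}
    for pid, answers in responses_by_prompt.items():
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            kept[pid] = top
    return kept
```

Norms are then assessed only from the surviving prompts, which is how the methodology sidesteps prompt sensitivity.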

[LG-6] Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry

链接: https://arxiv.org/abs/2409.03734
作者: Meena Jagadeesan,Michael I. Jordan,Jacob Steinhardt
关键词-EN: large-scale machine learning, Emerging marketplaces, exhibit market concentration, barriers to entry, machine learning
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); General Economics (econ.GN); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work, we study this issue from both an economic and an algorithmic point of view, focusing on a phenomenon that reduces barriers to entry. Specifically, an incumbent company risks reputational damage unless its model is sufficiently aligned with safety objectives, whereas a new company can more easily avoid reputational damage. To study this issue formally, we define a multi-objective high-dimensional regression framework that captures reputational damage, and we characterize the number of data points that a new company needs to enter the market. Our results demonstrate how multi-objective considerations can fundamentally reduce barriers to entry – the required number of data points can be significantly smaller than the incumbent company’s dataset size. En route to proving these results, we develop scaling laws for high-dimensional linear regression in multi-objective environments, showing that the scaling rate becomes slower when the dataset size is large, which could be of independent interest.

[LG-7] Planning In Natural Language Improves LLM Search For Code Generation

链接: https://arxiv.org/abs/2409.03733
作者: Evan Wang,Federico Cassano,Catherine Wu,Yunfeng Bai,Will Song,Vaskar Nath,Ziwen Han,Sean Hendryx,Summer Yue,Hugh Zhang
关键词-EN: scaling training compute, scaling inference compute, yielded analogous gains, training compute, compute has led
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
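The pass@k numbers quoted above are conventionally computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021): given n generated samples of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k), i.e. one minus the probability that a draw of k samples contains no correct one:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, passes."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Search methods like the one described raise this metric by making the n generations more diverse, so fewer of the k draws are near-duplicates of the same wrong idea.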

[LG-8] A Deep Generative Learning Approach for Two-stage Adaptive Robust Optimization

链接: https://arxiv.org/abs/2409.03731
作者: Aron Brenner,Rahman Khorramfar,Jennifer Sun,Saurabh Amin
关键词-EN: recourse decisions made, Two-stage adaptive robust, first-stage decisions, decisions made, adaptive robust optimization
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Two-stage adaptive robust optimization is a powerful approach for planning under uncertainty that aims to balance costs of “here-and-now” first-stage decisions with those of “wait-and-see” recourse decisions made after uncertainty is realized. To embed robustness against uncertainty, modelers typically assume a simple polyhedral or ellipsoidal set over which contingencies may be realized. However, these simple uncertainty sets tend to yield highly conservative decision-making when uncertainties are high-dimensional. In this work, we introduce AGRO, a column-and-constraint generation algorithm that performs adversarial generation for two-stage adaptive robust optimization using a variational autoencoder. AGRO identifies realistic and cost-maximizing contingencies by optimizing over spherical uncertainty sets in a latent space using a projected gradient ascent approach that differentiates the optimal recourse cost with respect to the latent variable. To demonstrate the cost- and time-efficiency of our approach experimentally, we apply AGRO to an adaptive robust capacity expansion problem for a regional power system and show that AGRO is able to reduce costs by up to 7.8% and runtimes by up to 77% in comparison to the conventional column-and-constraint generation algorithm.
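The projected gradient ascent over a spherical latent set can be sketched generically; `grad` here is a stand-in for the differentiated recourse cost, which in the paper comes from differentiating through the optimization layer rather than a closed form:

```python
import math

def project_to_ball(z, radius):
    """Project a latent vector onto an L2 ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)
    return [v * radius / norm for v in z]

def projected_gradient_ascent(grad, z0, radius, lr=0.1, steps=50):
    """Maximize a cost over a spherical uncertainty set in latent space:
    ascend the gradient, then project back onto the feasible ball."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
        z = project_to_ball(z, radius)
    return z
```

Decoding the resulting latent point through the VAE yields the realistic, cost-maximizing contingency that is then added as a new constraint.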

[LG-9] Sample-Efficient Diffusion for Text-To-Speech Synthesis INTERSPEECH2024

链接: https://arxiv.org/abs/2409.03717
作者: Justin Lovelace,Soham Ray,Kwangyoun Kim,Kilian Q. Weinberger,Felix Wu
关键词-EN: work introduces Sample-Efficient, introduces Sample-Efficient Speech, effective speech synthesis, modest data regimes, Sample-Efficient Speech Diffusion
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Interspeech 2024

点击查看摘要

Abstract:This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech - far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.

[LG-10] Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters

链接: https://arxiv.org/abs/2409.03713
作者: Simon Linke,Gerrit Wendt,Rolf Bader
关键词-EN: Western ensembles, Indonesian and Western, Western, Indonesian, large-scale form differences
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 5 figures, 4 tables

点击查看摘要

Abstract:Indonesian and Western gamelan ensembles are investigated with respect to performance differences. Thereby, the often exotistic history of this music in the West might be reflected in contemporary tonal system, articulation, or large-scale form differences. Analyzing recordings of four Western and five Indonesian orchestras with respect to tonal systems and timbre features, and using the self-organizing Kohonen map (SOM) as a machine learning algorithm, a clear clustering between Indonesian and Western ensembles appears for certain psychoacoustic features. These point to a reduced articulation and large-scale form variability of Western ensembles compared to Indonesian ones. The SOM also clusters the ensembles with respect to their tonal systems, but no clusters between Indonesian and Western ensembles can be found in this respect. A clear analogy therefore appears between the lower articulatory and large-scale form variability of Western ensembles and a more exotistic, meditative, and calm performance expectation and reception of gamelan in the West.
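A minimal 1-D Kohonen SOM of the kind used for the clustering above can be written in a few lines (a generic sketch, not the authors' feature pipeline):

```python
import math
import random

def train_som(data, n_units=4, epochs=100, lr=0.5, sigma=0.5, seed=0):
    """Minimal 1-D self-organizing map: each unit holds a weight vector;
    for each sample, the best-matching unit (BMU) and its neighbours on
    the 1-D grid move toward the sample, with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # linearly decaying learning rate
        for x in data:
            bmu = min(range(n_units),
                      key=lambda u: sum((units[u][d] - x[d]) ** 2
                                        for d in range(dim)))
            for u in range(n_units):
                h = math.exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    units[u][d] += rate * h * (x[d] - units[u][d])
    return units

def best_unit(units, x):
    """Index of the unit nearest to x (used to read off the clustering)."""
    return min(range(len(units)),
               key=lambda u: sum((units[u][d] - x[d]) ** 2
                                 for d in range(len(x))))
```

After training on psychoacoustic feature vectors, recordings that map to distant units belong to different clusters, which is how the Indonesian/Western separation is read off the map.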

[LG-11] Inverse decision-making using neural amortized Bayesian actors

链接: https://arxiv.org/abs/2409.03710
作者: Dominik Straub,Tobias F. Niehues,Jan Peters,Constantin A. Rothkopf
关键词-EN: provided normative explanations, sensorimotor control, phenomena in perception, science and neuroscience, provided normative
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Bayesian observer and actor models have provided normative explanations for many behavioral phenomena in perception, sensorimotor control, and other areas of cognitive science and neuroscience. They attribute behavioral variability and biases to different interpretable entities such as perceptual and motor uncertainty, prior beliefs, and behavioral costs. However, when extending these models to more complex tasks with continuous actions, solving the Bayesian decision-making problem is often analytically intractable. Moreover, inverting such models to perform inference over their parameters given behavioral data is computationally even more difficult. Therefore, researchers typically constrain their models to easily tractable components, such as Gaussian distributions or quadratic cost functions, or resort to numerical methods. To overcome these limitations, we amortize the Bayesian actor using a neural network trained on a wide range of different parameter settings in an unsupervised fashion. Using the pre-trained neural network enables performing gradient-based Bayesian inference of the Bayesian actor model’s parameters. We show on synthetic data that the inferred posterior distributions are in close alignment with those obtained using analytical solutions where they exist. Where no analytical solution is available, we recover posterior distributions close to the ground truth. We then show that identifiability problems between priors and costs can arise in more complex cost functions. Finally, we apply our method to empirical data and show that it explains systematic individual differences of behavioral patterns.

[LG-12] Classification and Prediction of Heart Diseases using Machine Learning Algorithms

链接: https://arxiv.org/abs/2409.03697
作者: Akua Sekyiwaa Osei-Nkwantabisa,Redeemer Ntumy
关键词-EN: Heart disease, worldwide health issue, Heart, predicting heart diseases, disease
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Heart disease is a serious worldwide health issue because it claims the lives of many people who might have been treated if the disease had been identified earlier. Cardiovascular disease, usually referred to as heart disease, is the leading cause of death in the world. Creating reliable, effective, and precise predictions for these diseases is one of the biggest challenges facing the medical world today. Although there are tools for predicting heart disease, they are either expensive or difficult to apply when determining a patient’s risk. Identifying the best classifier for predicting and detecting heart disease was the aim of this research. This experiment examined a range of machine learning approaches, including Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Artificial Neural Networks, to determine which algorithm was most effective at predicting heart disease. The data set for this study came from the UCI heart disease repository, one of the most frequently used data sets for this purpose. The K-Nearest Neighbor technique was shown to be the most effective machine learning algorithm for determining whether a patient has heart disease. Further studies on the application of additional machine learning algorithms for heart disease prediction would be beneficial.
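A bare-bones K-Nearest Neighbor classifier of the kind the study found most effective looks like this (illustrative; the feature vectors below are hypothetical stand-ins for the UCI attributes, and a real pipeline would normalize features first):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """k-nearest-neighbour classification by Euclidean distance.
    train: list of (feature_vector, label) pairs; returns the majority
    label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]
```

Because KNN makes no parametric assumptions, it often performs well on small tabular datasets like the UCI heart disease data, at the cost of storing the whole training set.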

[LG-13] View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

链接: https://arxiv.org/abs/2409.03685
作者: Stephen Tian,Blake Wulfe,Kyle Sargent,Katherine Liu,Sergey Zakharov,Vitor Guizilini,Jiajun Wu
关键词-EN: Large-scale visuomotor policy, visuomotor policy learning, generalizable manipulation systems, visuomotor policy, promising approach
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to CoRL 2024

点击查看摘要

Abstract:Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at this https URL.

[LG-14] A New First-Order Meta-Learning Algorithm with Convergence Guarantees

链接: https://arxiv.org/abs/2409.03682
作者: El Mahdi Chayti,Martin Jaggi
关键词-EN: prior experience gathered, Learning new tasks, intelligent system, drawing on prior, prior experience
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Learning new tasks by drawing on prior experience gathered from other (related) tasks is a core property of any intelligent system. Gradient-based meta-learning, especially MAML and its variants, has emerged as a viable solution to accomplish this goal. One problem MAML encounters is the computational and memory burden needed to compute the meta-gradients. We propose a new first-order variant of MAML that we prove converges to a stationary point of the MAML objective, unlike other first-order variants. We also show that the MAML objective does not satisfy the smoothness assumption made in previous works; we show instead that its smoothness constant grows with the norm of the meta-gradient, which theoretically suggests the use of normalized or clipped-gradient methods compared to the plain gradient method used in previous works. We validate our theory on a synthetic experiment.
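The first-order idea, adapting with a few inner SGD steps and then applying the post-adaptation gradient at the meta-parameters while dropping the Hessian term, can be sketched as follows; this is generic first-order MAML (FOMAML), not the paper's new variant:

```python
def sgd_steps(w, grad_fn, lr, steps):
    """Plain SGD inner loop used for per-task adaptation."""
    for _ in range(steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def first_order_maml(w, tasks, inner_lr=0.1, outer_lr=0.1, inner_steps=5):
    """One meta-update of first-order MAML: adapt on each task, then apply
    the gradient evaluated at the adapted parameters directly to the
    meta-parameters (second-order Hessian terms are dropped).
    tasks: list of per-task gradient functions."""
    meta_grad = [0.0] * len(w)
    for grad_fn in tasks:
        w_task = sgd_steps(list(w), grad_fn, inner_lr, inner_steps)
        g = grad_fn(w_task)  # first-order: no backprop through adaptation
        meta_grad = [mg + gi for mg, gi in zip(meta_grad, g)]
    n = len(tasks)
    return [wi - outer_lr * mg / n for wi, mg in zip(w, meta_grad)]
```

On two quadratic tasks with optima at +1 and −1, repeated meta-updates drive the meta-parameter toward 0, the point from which both tasks are easiest to adapt to, which is the behaviour the convergence analysis formalizes.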

[LG-15] Practical Forecasting of Cryptocoins Timeseries using Correlation Patterns

链接: https://arxiv.org/abs/2409.03674
作者: Pasquale De Rosa,Pascal Felber,Valerio Schiavoni
关键词-EN: tradable digital assets, digital assets, tradable digital, Cryptocoins, Litecoin
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cryptocoins (e.g., Bitcoin, Ether, Litecoin) are tradable digital assets. Ownership of cryptocoins is registered on distributed ledgers (i.e., blockchains). Secure encryption techniques guarantee the security of the transactions (transfers of coins among owners) registered in the ledger. Cryptocoins are exchanged at specific trading prices, whose extreme volatility across all different sets of crypto-assets remains undisputed. However, the relations between the trading prices of different cryptocoins remain largely unexplored. Major coin exchanges indicate trend correlation to advise for sells or buys, yet price correlations have received little systematic study. We shed some light on the trend correlations across a large variety of cryptocoins by investigating their coin/price correlation trends over the past two years. We study the causality between the trends and exploit the derived correlations to assess the accuracy of state-of-the-art forecasting techniques for time-series modeling (e.g., GBMs, LSTM, and GRU) of correlated cryptocoins. Our evaluation shows (i) strong correlation patterns between the most traded coins (e.g., Bitcoin and Ether) and other types of cryptocurrencies, and (ii) that state-of-the-art time-series forecasting algorithms can be used to forecast cryptocoin price trends. We released datasets and code to allow the research community to reproduce our analysis.
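Price-trend correlation of the kind studied above is conventionally measured with the Pearson coefficient, often computed on log-returns rather than raw prices so that shared long-term trends do not produce spurious correlation (a generic sketch, not the authors' pipeline):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def log_returns(prices):
    """Log-returns of a price series: log(p[t+1] / p[t])."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]
```

A high return correlation between, say, Bitcoin and another coin is what justifies using one series as an input feature when forecasting the other.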

[LG-16] Wind turbine condition monitoring based on intra- and inter-farm federated learning

链接: https://arxiv.org/abs/2409.03672
作者: Albin Grataloup,Stefan Jonas,Angela Meyer
关键词-EN: maximizing energy production, wind energy adoption, wind, adoption is growing, ensuring the efficient
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:As wind energy adoption is growing, ensuring the efficient operation and maintenance of wind turbines becomes essential for maximizing energy production and minimizing costs and downtime. Many AI applications in wind energy, such as in condition monitoring and power forecasting, may benefit from using operational data not only from individual wind turbines but from multiple turbines and multiple wind farms. Collaborative distributed AI which preserves data privacy holds a strong potential for these applications. Federated learning has emerged as a privacy-preserving distributed machine learning approach in this context. We explore federated learning in wind turbine condition monitoring, specifically for fault detection using normal behaviour models. We investigate various federated learning strategies, including collaboration across different wind farms and turbine models, as well as collaboration restricted to the same wind farm and turbine model. Our case study results indicate that federated learning across multiple wind turbines consistently outperforms models trained on a single turbine, especially when training data is scarce. Moreover, the amount of historical data necessary to train an effective model can be significantly reduced by employing a collaborative federated learning strategy. Finally, our findings show that extending the collaboration to multiple wind farms may result in inferior performance compared to restricting learning within a farm, specifically when faced with statistical heterogeneity and imbalanced datasets.
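The standard aggregation step behind such federated strategies is FedAvg: average the clients' model weights, weighted by each client's (turbine's) number of local training samples (a generic sketch, not the paper's exact setup):

```python
def fed_avg(client_weights, client_sizes):
    """Federated averaging: aggregate per-client model parameters into a
    global model, weighting each client by its local sample count.
    client_weights: list of flat parameter vectors, one per client."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[d] * s for w, s in zip(client_weights, client_sizes)) / total
            for d in range(dim)]
```

Only these parameter vectors leave each site, which is what preserves data privacy; the paper's comparison of intra- vs. inter-farm collaboration amounts to varying which clients participate in this average.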

[LG-17] A Fused Large Language Model for Predicting Startup Success

Link: https://arxiv.org/abs/2409.03668
Authors: Abdurahman Maarouf,Stefan Feuerriegel,Nicolas Pröllochs
Keywords-EN: continuously seeking profitable, predict startup success, continuously seeking, startup success, startup
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup’s probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual description of a startup’s innovation and business model, which is widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.
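The "fused" design can be pictured as late fusion: concatenate a text embedding of the self-description with the tabular fundamentals, then score. A toy sketch with invented dimensions and weights; the actual model is a trained large language model, not a hand-set logistic layer.

```python
import math

def fuse_features(text_embedding, fundamentals):
    # Late fusion by concatenation: text representation ++ tabular startup features.
    return list(text_embedding) + list(fundamentals)

def success_probability(features, weights, bias):
    """Logistic score over the fused feature vector."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs: a 3-dim text embedding plus 2 fundamentals
# (e.g., startup age in years, number of founders).
x = fuse_features([0.2, -0.1, 0.4], [3.0, 2.0])
p = success_probability(x, weights=[1.0, 1.0, 1.0, 0.1, 0.2], bias=-1.0)
```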

[LG-18] Threat Classification on Deployed Optical Networks Using MIMO Digital Fiber Sensing Wavelets and Machine Learning

Link: https://arxiv.org/abs/2409.03667
Authors: Khouloud Abdelli,Henrique Pavani,Christian Dorize,Sterenn Guerrier,Haik Mardoyan,Patricia Layec,Jeremie Renaudier
Keywords-EN: leveraging wavelet transform, operational network link, demonstrate mechanical threats, mechanical threats classification, threats classification including
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:We demonstrate mechanical threats classification including jackhammers and excavators, leveraging wavelet transform of MIMO-DFS output data across a 57-km operational network link. Our machine learning framework incorporates transfer learning and shows 93% classification accuracy from field data, with benefits for optical network supervision.
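The wavelet features driving such a classifier can be illustrated with one level of the Haar transform, the simplest wavelet: detail coefficients localize abrupt events. The trace below is a toy signal, not MIMO-DFS output.

```python
import math

def haar_dwt_level(signal):
    """One level of the Haar wavelet transform: (approximation, detail) coefficients."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

# A flat "quiet fiber" trace has zero detail energy; an abrupt jolt
# (e.g., a jackhammer strike) shows up in the detail coefficients.
quiet = [1.0, 1.0, 1.0, 1.0]
jolt = [1.0, 1.0, 5.0, 1.0]
_, d_quiet = haar_dwt_level(quiet)
_, d_jolt = haar_dwt_level(jolt)
detail_energy = lambda d: sum(c * c for c in d)
```

Such per-band energies are the kind of features a downstream classifier could consume.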

[LG-19] Weather-Adaptive Multi-Step Forecasting of State of Polarization Changes in Aerial Fibers Using Wavelet Neural Networks

Link: https://arxiv.org/abs/2409.03663
Authors: Khouloud Abdelli,Matteo Lonardi,Jurgen Gripp,Samuel Olsson,Fabien Boitier,Patricia Layec
Keywords-EN: aerial fiber links, multi-scale SOP, fiber links, aerial fiber, weather-adaptive approach
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*Comments: ECOC 2024

Click to view abstract

Abstract:We introduce a novel weather-adaptive approach for multi-step forecasting of multi-scale SOP changes in aerial fiber links. By harnessing the discrete wavelet transform and incorporating weather data, our approach improves forecasting accuracy by over 65% in RMSE and 63% in MAPE compared to baselines.
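The reported gains are in RMSE and MAPE, both straightforward to compute; the series below are invented, not SOP measurements.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes no zero targets."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy illustration of a relative-improvement claim:
baseline_err = rmse([1.0, 2.0, 4.0], [2.0, 1.0, 5.0])
improved_err = rmse([1.0, 2.0, 4.0], [1.0, 2.5, 4.0])
improvement = 100.0 * (1.0 - improved_err / baseline_err)
```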

[LG-20] The representation landscape of few-shot learning and fine-tuning in large language models

Link: https://arxiv.org/abs/2409.03662
Authors: Diego Doimo,Alessandro Serra,Alessio Ansuini,Alberto Cazzaniga
Keywords-EN: In-context learning, modern large language, supervised fine-tuning, modern large, large language models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to make a step towards designing optimal methods to extract information from language models.

[LG-21] A DNN Biophysics Model with Topological and Electrostatic Features

Link: https://arxiv.org/abs/2409.03658
Authors: Elyssa Sliheet,Md Abu Talha,Weihua Geng
Keywords-EN: based biophysics model, deep-learning neural network, based biophysics, predict protein properties, features
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*Comments:

Click to view abstract

Abstract:In this project, we provide a deep-learning neural network (DNN) based biophysics model to predict protein properties. The model uses multi-scale and uniform topological and electrostatic features generated with protein structural information and force field, which governs the molecular mechanics. The topological features are generated using the element specified persistent homology (ESPH) while the electrostatic features are fast computed using a Cartesian treecode. These features are uniform in number for proteins with various sizes, thus the broadly available protein structure database can be used in training the network. These features are also multi-scale, thus the resolution and computational cost can be balanced by the users. The machine learning simulation on over 4000 protein structures shows the efficiency and fidelity of these features in representing the protein structure and force field for the prediction of their biophysical properties such as electrostatic solvation energy. Tests on topological or electrostatic features alone and the combination of both showed the optimal performance when both features are used. This model shows its potential as a general tool in assisting biophysical properties and function prediction for the broad biomolecules using data from both theoretical computing and experiments.

[LG-22] Unsupervised Anomaly Detection and Localization with Generative Adversarial Networks

Link: https://arxiv.org/abs/2409.03657
Authors: Khouloud Abdelli,Matteo Lonardi,Jurgen Gripp,Samuel Olsson,Fabien Boitier,Patricia Layec
Keywords-EN: unsupervised anomaly detection, anomaly detection approach, generative adversarial networks, SOP-derived spectrograms, unsupervised anomaly
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: ECOC 2024

Click to view abstract

Abstract:We propose a novel unsupervised anomaly detection approach using generative adversarial networks and SOP-derived spectrograms. Demonstrating remarkable efficacy, our method achieves over 97% accuracy on SOP datasets from both submarine and terrestrial fiber links, all achieved without the need for labelled data.

[LG-23] On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Link: https://arxiv.org/abs/2409.03650
Authors: Yong Lin,Skyler Seto,Maartje ter Hoeve,Katherine Metcalf,Barry-John Theobald,Xuan Wang,Yizhe Zhang,Chen Huang,Tong Zhang
Keywords-EN: Human Feedback, Reinforcement Learning, aligning language models, Direct Preference Optimization, human preferences
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments: 12 pages, 8 tables, 2 figures

Click to view abstract

Abstract:Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO (denoted as DPORM) can approximate an EXRM in the limit. DPORM’s effectiveness directly implies the optimality of the learned policy, and also has practical implication for LLM alignment methods including iterative DPO. However, it is unclear how well DPORM empirically matches the performance of EXRM. This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and EXRM. Our findings indicate that even though DPORM fits the training dataset comparably, it generalizes less effectively than EXRM, especially when the validation datasets contain distribution shifts. Across five out-of-distribution settings, DPORM has a mean drop in accuracy of 3% and a maximum drop of 7%. These findings highlight that DPORM has limited generalization ability and substantiates the integration of an explicit reward model in iterative DPO approaches.
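The DPORM studied here is DPO's implicit reward, beta * log(pi_theta / pi_ref), defined up to a prompt-only constant. A minimal sketch of that reward and the resulting pairwise loss; the scalar log-probabilities are invented stand-ins for sequence log-likelihoods.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO's implicit reward (up to a prompt-dependent constant):
    # beta * log(pi_theta(y|x) / pi_ref(y|x))
    return beta * (logp_policy - logp_ref)

def dpo_loss(lp_pol_chosen, lp_pol_rejected, lp_ref_chosen, lp_ref_rejected, beta=0.1):
    """-log sigmoid(r(x, y_chosen) - r(x, y_rejected)) for one preference pair."""
    margin = (implicit_reward(lp_pol_chosen, lp_ref_chosen, beta)
              - implicit_reward(lp_pol_rejected, lp_ref_rejected, beta))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probabilities the margin is zero and the loss equals log 2; raising the chosen answer's policy log-probability lowers the loss.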

[LG-24] Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG

Link: https://arxiv.org/abs/2409.03646
Authors: Manshan Guo,Bhavin Choksi,Sari Sadiya,Alessandro T. Gifford,Martina G. Vilas,Radoslaw M. Cichy,Gemma Roig
Keywords-EN: artificial neural networks, artificial neural, neural networks, remain relatively susceptible, EEG prediction accuracy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Click to view abstract

Abstract:In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive bias from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases robustness to ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction; and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed significant correlation between the networks’ EEG prediction accuracy, often highest around 100 ms post stimulus onset, and their gains in adversarial robustness. Although effect size was limited, effects were consistent across different random initializations and robust for architectural variants. We further teased apart the data from individual EEG channels and observed strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens up avenues for future efforts that scale to larger datasets under diverse stimuli conditions with the promise of stronger effects.
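The dual-task training can be sketched as a weighted sum of a classification term and an EEG-regression term; the numbers below are invented, and the real model is a ResNet50 backbone rather than this scalar toy.

```python
import math

def dual_task_loss(logprob_true_class, eeg_pred, eeg_target, lam=0.5):
    """Classification cross-entropy plus a weighted EEG-regression (MSE) term.
    `lam` (hypothetical here) trades off the two objectives."""
    ce = -logprob_true_class
    mse = sum((p - t) ** 2 for p, t in zip(eeg_pred, eeg_target)) / len(eeg_target)
    return ce + lam * mse

# Hypothetical example: the network puts probability 0.8 on the true class
# and predicts a flat EEG trace against a target of ones.
loss = dual_task_loss(math.log(0.8), [0.0, 0.0], [1.0, 1.0])
```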

[LG-25] Beyond Model Interpretability: Socio-Structural Explanations in Machine Learning

Link: https://arxiv.org/abs/2409.03632
Authors: Andrew Smart,Atoosa Kasirzadeh
Keywords-EN: machine learning, machine learning models, opaque machine learning, learning, machine
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:What is it to interpret the outputs of an opaque machine learning model? One approach is to develop interpretable machine learning techniques. These techniques aim to show how machine learning models function by providing either model centric local or global explanations, which can be based on mechanistic interpretations revealing the inner working mechanisms of models or nonmechanistic approximations showing input feature output data relationships. In this paper, we draw on social philosophy to argue that interpreting machine learning outputs in certain normatively salient domains could require appealing to a third type of explanation that we call sociostructural explanation. The relevance of this explanation type is motivated by the fact that machine learning models are not isolated entities but are embedded within and shaped by social structures. Sociostructural explanations aim to illustrate how social structures contribute to and partially explain the outputs of machine learning models. We demonstrate the importance of sociostructural explanations by examining a racially biased healthcare allocation algorithm. Our proposal highlights the need for transparency beyond model interpretability, understanding the outputs of machine learning systems could require a broader analysis that extends beyond the understanding of the machine learning model itself.

[LG-26] 1 Modular Parallel Manipulator for Long-Term Soft Robotic Data Collection

Link: https://arxiv.org/abs/2409.03614
Authors: Kiyn Chin,Carmel Majidi,Abhinav Gupta
Keywords-EN: large-scale data collection, Performing long-term experimentation, Performing long-term, experimental flexibility required, large-scale data
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Performing long-term experimentation or large-scale data collection for machine learning in the field of soft robotics is challenging, due to the hardware robustness and experimental flexibility required. In this work, we propose a modular parallel robotic manipulation platform suitable for such large-scale data collection and compatible with various soft-robotic fabrication methods. Considering the computational and theoretical difficulty of replicating the high-fidelity, faster-than-real-time simulations that enable large-scale data collection in rigid robotic systems, a robust soft-robotic hardware platform becomes a high priority development task for the field. The platform’s modules consist of a pair of off-the-shelf electrical motors which actuate a customizable finger consisting of a compliant parallel structure. The parallel mechanism of the finger can be as simple as a single 3D-printed urethane or molded silicone bulk structure, due to the motors being able to fully actuate a passive structure. This design flexibility allows experimentation with soft mechanism varied geometries, bulk properties and surface properties. Additionally, while the parallel mechanism does not require separate electronics or additional parts, these can be included, and it can be constructed using multi-functional soft materials to study compatible soft sensors and actuators in the learning process. In this work, we validate the platform’s ability to be used for policy gradient reinforcement learning directly on hardware in a benchmark 2D manipulation task. We additionally demonstrate compatibility with multiple fingers and characterize the design constraints for compatible extensions.

[LG-27] VFLGAN-TS: Vertical Federated Learning-based Generative Adversarial Networks for Publication of Vertically Partitioned Time-Series Data

Link: https://arxiv.org/abs/2409.03612
Authors: Xun Yuan,Zilong Zhao,Prosanta Gope,Biplab Sikdar
Keywords-EN: current artificial intelligence, artificial intelligence, current artificial, scale and quality, play a crucial
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, often original data cannot be shared due to privacy concerns and regulations. A potential solution is to release a synthetic dataset with a similar distribution to the private dataset. Nevertheless, in some scenarios, the attributes required to train an AI model are distributed among different parties, and the parties cannot share the local data for synthetic data construction due to privacy regulations. In PETS 2024, we recently introduced the first Vertical Federated Learning-based Generative Adversarial Network (VFLGAN) for publishing vertically partitioned static data. However, VFLGAN cannot effectively handle time-series data, presenting both temporal and attribute dimensions. In this article, we proposed VFLGAN-TS, which combines the ideas of attribute discriminator and vertical federated learning to generate synthetic time-series data in the vertically partitioned scenario. The performance of VFLGAN-TS is close to that of its counterpart, which is trained in a centralized manner and represents the upper limit for VFLGAN-TS. To further protect privacy, we apply a Gaussian mechanism to make VFLGAN-TS satisfy (\epsilon,\delta)-differential privacy. Besides, we develop an enhanced privacy auditing scheme to evaluate the potential privacy breach through the framework of VFLGAN-TS and synthetic datasets.
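The Gaussian mechanism behind the (epsilon, delta) guarantee admits a compact sketch using the classic noise calibration (valid for epsilon < 1). The values are illustrative; the paper applies the noise inside VFLGAN-TS training, not to a single scalar query.

```python
import math
import random

def gaussian_sigma(sensitivity, epsilon, delta):
    # Classic analytic calibration (valid for epsilon < 1):
    # sigma >= sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release `value` with additive Gaussian noise calibrated to (eps, delta)-DP."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))

rng = random.Random(42)
sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
noisy = gaussian_mechanism(10.0, 1.0, 0.5, 1e-5, rng)
```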

[LG-28] A practical approach to evaluating the adversarial distance for machine learning classifiers

Link: https://arxiv.org/abs/2409.03598
Authors: Georg Siedel,Ekagra Gupta,Andrey Morozov
Keywords-EN: ensure consistent performance, adversarial, machine learning, adversarial robustness, critical for machine
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted manuscript at International Mechanical Engineering Congress and Exposition IMECE2024

Click to view abstract

Abstract:Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, methods to accurately compute adversarial robustness have been challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy on specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
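An upper bound on the adversarial distance can be estimated by bisecting the perturbation magnitude along a fixed attack direction until the prediction flips. The sketch below uses a linear toy classifier; in the paper the models are real ML classifiers and the attack directions come from iterative gradient-based attacks.

```python
def predict(x, w, b):
    """Linear toy classifier standing in for a trained model."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def adversarial_distance(x, w, b, direction, max_step=10.0, iters=40):
    """Bisection for the smallest step along `direction` that flips the prediction;
    this is an upper bound on the true adversarial distance for that direction."""
    original = predict(x, w, b)
    perturbed = lambda eps: [xi + eps * di for xi, di in zip(x, direction)]
    if predict(perturbed(max_step), w, b) == original:
        return None  # no flip within the attack budget
    lo, hi = 0.0, max_step
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if predict(perturbed(mid), w, b) == original:
            lo = mid
        else:
            hi = mid
    return hi

# Decision boundary at x0 = 1, starting point at the origin: distance should be ~1.
dist = adversarial_distance([0.0, 0.0], [1.0, 0.0], -1.0, [1.0, 0.0])
```

A certification method would complement this with a lower bound, bracketing the true distance from both sides.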

[LG-29] Costs Estimation in Unit Commitment Problems using Simulation-Based Inference

Link: https://arxiv.org/abs/2409.03588
Authors: Matthias Pirlet,Adrien Bolland,Gilles Louppe,Damien Ernst
Keywords-EN: Unit Commitment, key optimization task, finite time period, power units, power systems
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The Unit Commitment (UC) problem is a key optimization task in power systems to forecast the generation schedules of power units over a finite time period by minimizing costs while meeting demand and technical constraints. However, many parameters required by the UC problem are unknown, such as the costs. In this work, we estimate these unknown costs using simulation-based inference on an illustrative UC problem, which provides an approximated posterior distribution of the parameters given observed generation schedules and demands. Our results highlight that the learned posterior distribution effectively captures the underlying distribution of the data, providing a range of possible values for the unknown parameters given a past observation. This posterior allows for the estimation of past costs using observed past generation schedules, enabling operators to better forecast future costs and make more robust generation scheduling forecasts. We present avenues for future research to address overconfidence in posterior estimation, enhance the scalability of the methodology and apply it to more complex UC problems modeling the network constraints and renewable energy sources.
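Simulation-based inference can be illustrated with its simplest instance, rejection ABC: draw candidate costs from a prior, simulate schedules, and keep the draws that reproduce the observation. The one-parameter "UC simulator" below is invented for illustration; the paper's UC model and inference machinery are far richer.

```python
import random

def simulate_schedule(cost, demand, rng):
    # Hypothetical toy simulator: dispatched generation shrinks as cost rises,
    # plus observation noise. Stands in for solving an actual UC problem.
    return demand / (1.0 + cost) + rng.gauss(0.0, 0.05)

def abc_rejection(observed, demand, prior_lo, prior_hi, rng, n_draws=20000, tol=0.05):
    """Rejection ABC: keep cost draws whose simulated schedule lands near the observation."""
    accepted = []
    for _ in range(n_draws):
        cost = rng.uniform(prior_lo, prior_hi)
        if abs(simulate_schedule(cost, demand, rng) - observed) < tol:
            accepted.append(cost)
    return accepted

rng = random.Random(7)
# With demand 10 and observed schedule 5, the toy simulator implies cost ~ 1.
posterior = abc_rejection(observed=5.0, demand=10.0, prior_lo=0.0, prior_hi=3.0, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```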

[LG-30] CHIRPs: Change-Induced Regret Proxy metrics for Lifelong Reinforcement Learning

Link: https://arxiv.org/abs/2409.03577
Authors: John Birkbeck,Adam Sobey,Federico Cerutti,Katherine Heseltine Hurley Flynn,Timothy J. Norman
Keywords-EN: costly to train, train and fragile, Reinforcement learning, CHIRP, Reinforcement learning agents
Subjects: Machine Learning (cs.LG)
*Comments: 8 pages, 9 figures

Click to view abstract

Abstract:Reinforcement learning agents can achieve superhuman performance in static tasks but are costly to train and fragile to task changes. This limits their deployment in real-world scenarios where training experience is expensive or the context changes through factors like sensor degradation, environmental processes or changing mission priorities. Lifelong reinforcement learning aims to improve sample efficiency and adaptability by studying how agents perform in evolving problems. The difficulty that these changes pose to an agent is rarely measured directly, however. Agent performances can be compared across a change, but this is often prohibitively expensive. We propose Change-Induced Regret Proxy (CHIRP) metrics, a class of metrics for approximating a change’s difficulty while avoiding the high costs of using trained agents. A relationship between a CHIRP metric and agent performance is identified in two environments, a simple grid world and MetaWorld’s suite of robotic arm tasks. We demonstrate two uses for these metrics: for learning, an agent that clusters MDPs based on a CHIRP metric achieves 17% higher average returns than three existing agents in a sequence of MetaWorld tasks. We also show how a CHIRP can be calibrated to compare the difficulty of changes across distinctly different environments.

[LG-31] 100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances KDD

Link: https://arxiv.org/abs/2409.03563
Authors: Lorenzo Pacchiardi,Lucy G. Cheke,José Hernández-Orallo
Keywords-EN: individual task instances, task instances, LLM, performance, instances
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Presented at the 2024 KDD workshop on Evaluation and Trustworthiness of Generative AI Models

Click to view abstract

Abstract:Predicting the performance of LLMs on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, a possibility is to evaluate the considered LLM on a set of task instances and train an assessor to predict its performance based on features of the instances. However, this approach requires evaluating each new LLM on a sufficiently large set of task instances to train an assessor specific to it. In this work, we leverage the evaluation results of previously tested LLMs to reduce the number of evaluations required to predict the performance of a new LLM. In practice, we propose to test the new LLM on a small set of reference instances and train a generic assessor which predicts the performance of the LLM on an instance based on the performance of the former on the reference set and features of the instance of interest. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until the January 2024 version of GPT4. When predicting performance on instances with the same distribution as those used to train the generic assessor, we find this achieves performance comparable to the LLM-specific assessors trained on the full set of instances. Additionally, we find that randomly selecting the reference instances performs as well as some advanced selection methods we tested. For out of distribution, however, no clear winner emerges and the overall performance is worse, suggesting that the inherent predictability of LLMs is low.
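One naive way to picture the generic assessor (not the paper's actual trained assessor) is a similarity-weighted vote: weight each previously evaluated LLM by its agreement with the new LLM on the reference instances, then average those LLMs' known outcomes on the target instance.

```python
def agreement(results_a, results_b):
    """Fraction of reference instances on which two LLMs agree (both right or both wrong)."""
    return sum(a == b for a, b in zip(results_a, results_b)) / len(results_a)

def predict_success(new_llm_ref, past_ref, past_on_instance):
    """Similarity-weighted estimate of the new LLM's success on a target instance."""
    weights = [agreement(new_llm_ref, r) for r in past_ref]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, past_on_instance)) / total

# Hypothetical data: 1 = correct, 0 = wrong, on 4 reference instances.
past_ref = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
past_on_instance = [1, 0, 0]   # how those LLMs did on the target instance
new_llm_ref = [1, 1, 0, 0]     # the new LLM, tested only on the references
p = predict_success(new_llm_ref, past_ref, past_on_instance)
```

The paper instead trains an assessor on reference-set performance plus instance features, but the intuition is the same: a few reference evaluations anchor the prediction.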

[LG-32] MaskVal: Simple but Effective Uncertainty Quantification for 6D Pose Estimation

Link: https://arxiv.org/abs/2409.03556
Authors: Philipp Quentin,Daniel Goehring
Keywords-EN: predictable operational performance, utmost importance, importance to ensure, predictable operational, pose
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:For the use of 6D pose estimation in robotic applications, reliable poses are of utmost importance to ensure a safe, reliable and predictable operational performance. Despite these requirements, state-of-the-art 6D pose estimators often do not provide any uncertainty quantification for their pose estimates at all, or if they do, it has been shown that the uncertainty provided is only weakly correlated with the actual true error. To address this issue, we investigate a simple but effective uncertainty quantification, that we call MaskVal, which compares the pose estimates with their corresponding instance segmentations by rendering and does not require any modification of the pose estimator itself. Despite its simplicity, MaskVal significantly outperforms a state-of-the-art ensemble method on both a dataset and a robotic setup. We show that by using MaskVal, the performance of a state-of-the-art 6D pose estimator is significantly improved towards a safe and reliable operation. In addition, we propose a new and specific approach to compare and evaluate uncertainty quantification methods for 6D pose estimation in the context of robotic manipulation.
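MaskVal's core comparison reduces to an IoU between the mask rendered from the estimated pose and the detector's instance segmentation. A sketch on flattened toy masks; the 0.5 acceptance threshold is invented for illustration.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two flattened binary masks (1 = object pixel)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0

rendered = [1, 1, 1, 0, 0, 0]   # mask rendered from the estimated 6D pose
segmented = [0, 1, 1, 1, 0, 0]  # instance segmentation from the detector
score = mask_iou(rendered, segmented)
accept = score >= 0.5           # hypothetical acceptance threshold
```

A low score flags poses whose rendered silhouette disagrees with what the detector saw, without touching the pose estimator itself.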

[LG-33] Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection

Link: https://arxiv.org/abs/2409.03555
Authors: Ali Aghababaei-Harandi,Massih-Reza Amini
Keywords-EN: complex neural networks, significant computational resources, neural networks demand, networks demand significant, demand significant computational
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Despite their high accuracy, complex neural networks demand significant computational resources, posing challenges for deployment on resource-constrained devices such as mobile phones and embedded systems. Compression algorithms have been developed to address these challenges by reducing model size and computational demands while maintaining accuracy. Among these approaches, factorization methods based on tensor decomposition are theoretically sound and effective. However, they face difficulties in selecting the appropriate rank for decomposition. This paper tackles this issue by presenting a unified framework that simultaneously applies decomposition and optimal rank selection, employing a composite compression loss within defined rank constraints. Our approach includes an automatic rank search in a continuous space, efficiently identifying optimal rank configurations without the use of training data, making it computationally efficient. Combined with a subsequent fine-tuning step, our approach maintains the performance of highly compressed models on par with their original counterparts. Using various benchmark datasets, we demonstrate the efficacy of our method through a comprehensive analysis.
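Rank selection can be sketched as keeping the smallest rank that retains a target share of spectral energy. The singular values and the 95% threshold below are illustrative; the paper searches ranks in a continuous space under a composite compression loss rather than using this simple heuristic.

```python
def select_rank(singular_values, energy_keep=0.95):
    """Smallest rank whose leading singular values retain `energy_keep` of the energy."""
    svals = sorted(singular_values, reverse=True)
    total = sum(s * s for s in svals)
    acc = 0.0
    for rank, s in enumerate(svals, start=1):
        acc += s * s
        if acc >= energy_keep * total:
            return rank
    return len(svals)

def compression_ratio(m, n, rank):
    # Storing the rank-r factors (m x r and r x n) instead of the full m x n matrix.
    return (m * rank + rank * n) / (m * n)

rank = select_rank([10.0, 3.0, 1.0, 0.5])
ratio = compression_ratio(128, 128, rank)
```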

[LG-34] DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

Link: https://arxiv.org/abs/2409.03550
Authors: Qianlong Xiang,Miao Zhang,Yuzhang Shang,Jianlong Wu,Yan Yan,Liqiang Nie
Keywords-EN: high computational demands, demonstrated exceptional generative, exceptional generative capabilities, slow inference speeds, Diffusion models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, while they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either unaccessible or too enormous to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing it from slowing down the optimization process as the generation is slow. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as “datasets” for training new DMs.
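The data-free objective can be caricatured with one-parameter "models": the teacher's denoising outputs on synthetic inputs become the training signal for a faster student, with no source data involved. Everything below (the 0.9 teacher, the candidate scan) is invented for illustration and is not the paper's actual DKDM objective.

```python
import random

def teacher_denoise(x_noisy):
    # Stand-in for a large pretrained diffusion model's one-step denoising prediction.
    return 0.9 * x_noisy

def dkdm_objective(student_scale, samples):
    """Match the teacher's denoising outputs on synthetic noisy inputs;
    the 'student' here is a one-parameter model x -> student_scale * x."""
    return sum((student_scale * x - teacher_denoise(x)) ** 2 for x in samples) / len(samples)

rng = random.Random(0)
synthetic = [rng.gauss(0.0, 1.0) for _ in range(100)]  # teacher-generated, not source data
# Scan a few candidate student parameters; 0.9 should minimize the objective.
candidates = [0.5, 0.7, 0.9, 1.1]
best = min(candidates, key=lambda s: dkdm_objective(s, synthetic))
```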

[LG-35] The Power of Second Chance: Personalized Submodular Maximization with Two Candidates

Link: https://arxiv.org/abs/2409.03545
Authors: Jing Yuan,Shaojie Tang
Keywords-EN: user-specific functions, existing studies, focus on selecting, functions, candidate solutions
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments:

Click to view abstract

Abstract:Most existing studies on submodular maximization focus on selecting a subset of items that maximizes a single submodular function. However, in many real-world scenarios, we might have multiple user-specific functions, each of which models the utility of a particular type of user. In these settings, our goal would be to choose a set of items that performs well across all the user-specific functions. One way to tackle this problem is to select a single subset that maximizes the sum of all of the user-specific functions. Although this aggregate approach is efficient in the sense that it avoids computation of sets for individual functions, it really misses the power of personalization - for it does not allow to choose different sets for different functions. In this paper, we introduce the problem of personalized submodular maximization with two candidate solutions. For any two candidate solutions, the utility of each user-specific function is defined as the better of these two candidates. Our objective is, therefore, to select the best set of two candidates that maximize the sum of utilities of all the user-specific functions. We have designed effective algorithms for this problem. We also discuss how our approach generalizes to multiple candidate solutions, increasing flexibility and personalization in our solution.
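The two-candidate objective, together with a plain greedy heuristic (not necessarily the paper's algorithm), can be sketched with coverage-style user functions: when two user types have disjoint interests, the two candidate sets specialize.

```python
def best_of_two(functions, s1, s2):
    """Total utility when every user gets the better of the two candidate sets."""
    return sum(max(f(s1), f(s2)) for f in functions)

def greedy_two_candidates(items, functions, k):
    """Greedily add the item/set pair with the largest objective value, up to size k each."""
    s1, s2 = set(), set()
    for _ in range(2 * k):
        best = None  # (value, which_set, item)
        for item in items:
            for idx, s in ((0, s1), (1, s2)):
                if item in s or len(s) >= k:
                    continue
                cand = (s1 | {item}, s2) if idx == 0 else (s1, s2 | {item})
                val = best_of_two(functions, *cand)
                if best is None or val > best[0]:
                    best = (val, idx, item)
        if best is None:
            break
        _, idx, item = best
        (s1 if idx == 0 else s2).add(item)
    return s1, s2

# Two user types with disjoint interests: coverage of {1, 2} vs coverage of {3, 4}.
f_a = lambda s: len(s & {1, 2})
f_b = lambda s: len(s & {3, 4})
s1, s2 = greedy_two_candidates([1, 2, 3, 4], [f_a, f_b], k=2)
value = best_of_two([f_a, f_b], s1, s2)
```

A single aggregate set of size 2 could cover at most one item per user type (total utility 2), while the two specialized candidates reach 4, illustrating the "power of second chance".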

[LG-36] Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift

Link: https://arxiv.org/abs/2409.03543
Authors: Fabian Diet,Moussa Kassem Sbeyti,Michelle Karg
Keywords-EN: Natural distribution shift, convolutional neural networks, distribution shift, Natural distribution, neural networks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: This preprint has not undergone any post-submission improvements or corrections

Click to view abstract

Abstract:Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows quantifying the impact of different types of shifts on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification stronger than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.

[LG-37] Risk-based Calibration for Probabilistic Classifiers

链接: https://arxiv.org/abs/2409.03542
作者: Aritz Pérez,Carlos Echegoyen,Guzmán Santafé
关键词-EN: called risk-based calibration, general iterative procedure, iterative procedure called, procedure called risk-based, risk-based calibration
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a general iterative procedure called risk-based calibration (RC) designed to minimize the empirical risk under the 0-1 loss (empirical error) for probabilistic classifiers. These classifiers are based on modeling probability distributions, including those constructed from the joint distribution (generative) and those based on the class conditional distribution (conditional). RC can be particularized to any probabilistic classifier provided a specific learning algorithm that computes the classifier’s parameters in closed form using data statistics. RC reinforces the statistics aligned with the true class while penalizing those associated with other classes, guided by the 0-1 loss. The proposed method has been empirically tested on 30 datasets using naïve Bayes, quadratic discriminant analysis, and logistic regression classifiers. RC improves the empirical error of the original closed-form learning algorithms and, more notably, consistently outperforms the gradient descent approach with the three classifiers.
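
One plausible reading of the RC loop, sketched for a discrete naive Bayes classifier: for every example the current model misclassifies (0-1 loss), its count statistics are added to the true class and subtracted from the wrongly predicted class. The update rule, learning rate, and smoothing below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def rc_naive_bayes(X, y, n_classes, iters=20, lr=1.0):
    """Risk-based-calibration-style loop for discrete naive Bayes:
    reinforce statistics of the true class and penalize those of the
    wrongly predicted class, guided by the 0-1 loss. Illustrative sketch."""
    n_vals = X.max() + 1
    # Class counts and per-feature value counts, Laplace-smoothed.
    cls = np.ones(n_classes)
    feat = np.ones((n_classes, X.shape[1], n_vals))
    for c in range(n_classes):
        cls[c] += (y == c).sum()
        for j in range(X.shape[1]):
            for v in range(n_vals):
                feat[c, j, v] += ((y == c) & (X[:, j] == v)).sum()

    def predict(x):
        logp = np.log(cls / cls.sum())
        for c in range(n_classes):
            for j, v in enumerate(x):
                logp[c] += np.log(feat[c, j, v] / feat[c, j].sum())
        return int(np.argmax(logp))

    for _ in range(iters):
        for x, t in zip(X, y):
            p = predict(x)
            if p != t:  # 0-1 loss guides the statistics update
                cls[t] += lr
                cls[p] = max(cls[p] - lr, 1e-6)
                for j, v in enumerate(x):
                    feat[t, j, v] += lr
                    feat[p, j, v] = max(feat[p, j, v] - lr, 1e-6)
    return predict

# Tiny dataset where the first feature determines the class.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
predict = rc_naive_bayes(X, y, n_classes=2)
```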

[LG-38] A Physics-Informed Machine Learning Approach for Solving Distributed Order Fractional Differential Equations

链接: https://arxiv.org/abs/2409.03507
作者: Alireza Afzal Aghaei
关键词-EN: physics-informed machine learning, machine learning framework, solving distributed-order fractional, fractional differential equations, distributed-order fractional differential
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This paper introduces a novel methodology for solving distributed-order fractional differential equations using a physics-informed machine learning framework. The core of this approach involves extending the support vector regression (SVR) algorithm to approximate the unknown solutions of the governing equations during the training phase. By embedding the distributed-order functional equation into the SVR framework, we incorporate physical laws directly into the learning process. To further enhance computational efficiency, Gegenbauer orthogonal polynomials are employed as the kernel function, capitalizing on their fractional differentiation properties to streamline the problem formulation. Finally, the resulting optimization problem of SVR is addressed either as a quadratic programming problem or as a positive definite system in its dual form. The effectiveness of the proposed approach is validated through a series of numerical experiments on Caputo-based distributed-order fractional differential equations, encompassing both ordinary and partial derivatives.

[LG-39] Sparsifying Parametric Models with L0 Regularization

链接: https://arxiv.org/abs/2409.03489
作者: Nicolò Botteghi,Urban Fasel
关键词-EN: sparsifying parametric models, educational introduction, problem of sparsifying, sparsifying parametric, parametric models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This document contains an educational introduction to the problem of sparsifying parametric models with L0 regularization. We utilize this approach together with dictionary learning to learn sparse polynomial policies for deep reinforcement learning to control parametric partial differential equations. The code and a tutorial are provided here: this https URL.
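
The L0 penalty itself is non-differentiable, so trainable sparsification typically relies on a smoothed gate. The hard-concrete relaxation below (Louizos et al. style) is one standard choice; the abstract does not say which relaxation the tutorial uses, so the parameterization and constants here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Sample a stretched, clipped concrete gate in [0, 1]; gates with very
    negative log_alpha snap to exactly 0, producing true sparsity."""
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def expected_l0(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Probability each gate is non-zero: the differentiable L0 penalty."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))
```

During training, each parameter is multiplied by its gate and the sum of `expected_l0` values is added to the loss, driving unneeded parameters to exact zeros.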

[LG-40] LLM-based event abstraction and integration for IoT-sourced logs

链接: https://arxiv.org/abs/2409.03478
作者: Mohsen Shirali,Mohammadreza Fani Sani,Zahra Ahmadi,Estefania Serral
关键词-EN: Internet of Things, collected by Internet, Large Language Models, continuous flow, revolutionised our ability
类目: Databases (cs.DB); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:The continuous flow of data collected by Internet of Things (IoT) devices, has revolutionised our ability to understand and interact with the world across various applications. However, this data must be prepared and transformed into event data before analysis can begin. In this paper, we shed light on the potential of leveraging Large Language Models (LLMs) in event abstraction and integration. Our approach aims to create event records from raw sensor readings and merge the logs from multiple IoT sources into a single event log suitable for further Process Mining applications. We demonstrate the capabilities of LLMs in event abstraction considering a case study for IoT application in elderly care and longitudinal health monitoring. The results show an average accuracy of 90% in detecting high-level activities. These results highlight LLMs’ promising potential in addressing event abstraction and integration challenges, effectively bridging the existing gap.
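
The integration step can be pictured as merging per-sensor event streams into one chronologically ordered, case-keyed event log (the abstraction from raw readings to activities would be performed by the LLM and is omitted here). The `(case_id, timestamp, activity)` schema follows process-mining convention; the field names and sample events are invented for illustration.

```python
def merge_event_logs(logs):
    """Merge already-abstracted event streams from several IoT sources into
    a single event log, ordered per case by timestamp, ready for Process
    Mining. Schema is an assumed process-mining-style convention."""
    merged = [event for log in logs for event in log]
    return sorted(merged, key=lambda e: (e["case_id"], e["timestamp"]))

# Hypothetical per-sensor streams after LLM-based event abstraction.
motion = [{"case_id": "resident_1", "timestamp": 2, "activity": "walking"}]
kitchen = [{"case_id": "resident_1", "timestamp": 1, "activity": "cooking"},
           {"case_id": "resident_1", "timestamp": 3, "activity": "eating"}]
log = merge_event_logs([motion, kitchen])
```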

[LG-41] Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation

链接: https://arxiv.org/abs/2409.03470
作者: Prerak Mody,Nicolas F. Chaves-de-Plaza,Chinmay Rao,Eleftheria Astrenidou,Mischa de Ridder,Nienke Hoekstra,Klaus Hildebrandt,Marius Staring
关键词-EN: medical image segmentation, Increased usage, learning in medical, medical image, image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error, however, no work has been done on improving the “utility” of Bayesian uncertainty maps such that it is only present in inaccurate regions and not in the accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, c.f. head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that when compared to the Bayesian baseline the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
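
The Accuracy-vs-Uncertainty idea can be sketched as a utility built from soft counts of accurate-certain (nAC), accurate-uncertain (nAU), inaccurate-certain (nIC), and inaccurate-uncertain (nIU) voxels, with AvU = (nAC + nIU)/total and loss = -log(AvU). The tanh weighting and threshold below follow the AvU literature generally, but are assumptions with respect to this paper's exact formulation.

```python
import numpy as np

def avu_loss(correct, uncertainty, u_thresh):
    """Soft AvU utility and its negative-log loss: high when uncertainty
    sits on inaccurate voxels and certainty on accurate ones."""
    u = np.tanh(uncertainty)            # squash uncertainty into [0, 1)
    certain = uncertainty < u_thresh
    acc = correct.astype(float)
    n_ac = np.sum(acc * (1 - u) * certain)
    n_au = np.sum(acc * u * ~certain)
    n_ic = np.sum((1 - acc) * (1 - u) * certain)
    n_iu = np.sum((1 - acc) * u * ~certain)
    avu = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu + 1e-8)
    return -np.log(avu + 1e-8), avu

# A well-behaved model is certain where correct, uncertain where wrong.
correct = np.array([1, 1, 1, 0, 0])
good = np.array([0.1, 0.1, 0.1, 2.0, 2.0])   # uncertainty on the errors
bad = np.array([2.0, 2.0, 2.0, 0.1, 0.1])    # uncertainty misplaced
loss_good, avu_good = avu_loss(correct, good, u_thresh=1.0)
loss_bad, avu_bad = avu_loss(correct, bad, u_thresh=1.0)
```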

[LG-42] Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks

链接: https://arxiv.org/abs/2409.03463
作者: Lorenzo Bini,Marco Sorbi,Stephane Marchand-Maillet
关键词-EN: Graph Neural Networks, Neural Networks, effectively modeling data, Graph Neural, increasingly popular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.
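
A minimal, hypothetical detector in the spirit of the paper's ratio-based definition: flag activations whose magnitude exceeds a large multiple of the layer's median absolute activation. The threshold value is an assumption for illustration.

```python
import numpy as np

def detect_massive_activations(acts, ratio_thresh=100.0):
    """Flag massive activations (MAs) as entries whose magnitude exceeds
    ratio_thresh times the median absolute activation of the layer."""
    mags = np.abs(acts)
    med = np.median(mags) + 1e-12       # guard against an all-zero layer
    mask = mags / med > ratio_thresh
    return mask, float(mags.max() / med)

# Toy attention-layer activations with one massive outlier.
acts = np.ones(100)
acts[0] = 500.0
mask, peak_ratio = detect_massive_activations(acts)
```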

[LG-43] Raw Speech Enhancement with Deep State Space Modeling

链接: https://arxiv.org/abs/2409.03377
作者: Yan Ru Pei,Ritik Shrivastava,FNU Sidharth
关键词-EN: simple deep state-space, deep state-space autoencoder, state-space autoencoder configured, efficient online raw, online raw speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network’s performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.

[LG-44] Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time

链接: https://arxiv.org/abs/2409.03375
作者: Francisco de Arriba-Pérez,Silvia García-Méndez
关键词-EN: million people worldwide, Based on official, million people, natural language analysis, official estimates
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches need more semantic knowledge management and explicability capabilities. Moreover, using Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning prediction of cognitive decline in real-time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. More in detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall value of about 85% for the mental deterioration class. In summary, this work contributes an affordable, flexible, non-invasive, personalized diagnostic system.

[LG-45] Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

链接: https://arxiv.org/abs/2409.03365
作者: Yujie Wang,Shenhan Zhu,Fangcheng Fu,Xupeng Miao,Jie Zhang,Juan Zhu,Fan Hong,Yong Li,Bin Cui
关键词-EN: Recent foundation models, Recent foundation, handling multiple machine, specialized model components, multiple machine learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with speedup ratio up to 71% compared to state-of-the-art training systems.

[LG-46] MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice ECCV

链接: https://arxiv.org/abs/2409.03358
作者: Friedhelm Hamann,Hanxiong Li,Paul Mieske,Lars Lewejohann,Guillermo Gallego
关键词-EN: made remarkable progress, Enabled by large, recent years, made remarkable, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 18 pages, 5 figures, ECCV Workshops

点击查看摘要

Abstract:Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input are quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. this https URL

[LG-47] Towards training digitally-tied analog blocks via hybrid gradient computation

链接: https://arxiv.org/abs/2409.03306
作者: Timothy Nest,Maxence Ernoult
关键词-EN: Power efficiency, digital electronics realm, efficiency is plateauing, electronics realm, needed to reduce
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Power efficiency is plateauing in the standard digital electronics realm such that novel hardware, models, and algorithms are needed to reduce the costs of AI training. The combination of energy-based analog circuits and the Equilibrium Propagation (EP) algorithm constitutes one compelling alternative compute paradigm for gradient-based optimization of neural nets. Existing analog hardware accelerators, however, typically incorporate digital circuitry to sustain auxiliary non-weight-stationary operations, mitigate analog device imperfections, and leverage existing digital accelerators. This heterogeneous hardware approach calls for a new theoretical model building block. In this work, we introduce Feedforward-tied Energy-based Models (ff-EBMs), a hybrid model comprising feedforward and energy-based blocks accounting for digital and analog circuits. We derive a novel algorithm to compute gradients end-to-end in ff-EBMs by backpropagating and “eq-propagating” through feedforward and energy-based parts respectively, enabling EP to be applied to much more flexible and realistic architectures. We experimentally demonstrate the effectiveness of the proposed approach on ff-EBMs where Deep Hopfield Networks (DHNs) are used as energy-based blocks. We first show that a standard DHN can be arbitrarily split into any uniform size while maintaining performance. We then train ff-EBMs on ImageNet32 where we establish new SOTA performance in the EP literature (46% top-1). Our approach offers a principled, scalable, and incremental roadmap to gradually integrate self-trainable analog computational primitives into existing digital accelerators.

[LG-48] Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization

链接: https://arxiv.org/abs/2409.03303
作者: Nayeong Kim,Juwon Kang,Sungsoo Ahn,Jungseul Ok,Suha Kwak
关键词-EN: multiple biases, unbiased and accurate, accurate model, multiple, training
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: International Conference on Machine Learning 2024

点击查看摘要

Abstract:We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in the multi-objective optimization theory, encourages to achieve the minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
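
The dynamic weighting of group-wise losses can be sketched with an exponentiated-gradient update that up-weights the worst-performing groups, steering training toward a minimax/Pareto balance. This rule is an illustrative stand-in for the paper's multi-objective update, not its exact algorithm.

```python
import numpy as np

def update_weights(w, group_losses, eta=0.5):
    """Exponentially up-weight groups with higher loss, then renormalize,
    so the linear combination of group losses emphasizes lagging groups."""
    w = w * np.exp(eta * np.asarray(group_losses))
    return w / w.sum()

# Three bias-defined groups; group 0 currently has the highest loss.
w = np.ones(3) / 3
w = update_weights(w, [1.0, 0.1, 0.1])
```

The total training loss at each step would then be `sum(w[g] * group_losses[g])`, so the group with the worst loss dominates the gradient until it catches up.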

[LG-49] ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models

链接: https://arxiv.org/abs/2409.03301
作者: Qi Ju,Falin Hei,Zhemei Fang,Yunfeng Luo
关键词-EN: Reinforcement Learning, highly dependent, meticulous design, Reinforcement, Learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) is highly dependent on the meticulous design of the reward function. However, accurately assigning rewards to each state-action pair in Long-Term RL (LTRL) challenges is formidable. Consequently, RL agents are predominantly trained with expert guidance. Drawing on the principles of ordinal utility theory from economics, we propose a novel reward estimation algorithm: ELO-Rating based RL (ERRL). This approach is distinguished by two main features. Firstly, it leverages expert preferences over trajectories instead of cardinal rewards (utilities) to compute the ELO rating of each trajectory as its reward. Secondly, a new reward redistribution algorithm is introduced to mitigate training volatility in the absence of a fixed anchor reward. Our method demonstrates superior performance over several leading baselines in long-term scenarios (extending up to 5000 steps), where conventional RL algorithms falter. Furthermore, we conduct a thorough analysis of how expert preferences affect the outcomes.
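
The trajectory-rating step reduces to the standard ELO update applied to expert pairwise preferences over trajectories; the resulting rating serves as the trajectory's reward. The K-factor of 32 and scale of 400 below are the conventional chess values, assumed here for illustration.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """One standard ELO update from a single pairwise preference."""
    expected_w = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_w)
    return r_winner + delta, r_loser - delta

# Rate three trajectories from repeated expert preferences (winner, loser).
ratings = {"tau_a": 1000.0, "tau_b": 1000.0, "tau_c": 1000.0}
prefs = [("tau_a", "tau_b"), ("tau_a", "tau_c"), ("tau_b", "tau_c")] * 10
for winner, loser in prefs:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```

After repeated comparisons the ratings order the trajectories consistently with the expert's ordinal preferences, without ever requiring cardinal rewards.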

[LG-50] Bringing the RT-1-X Foundation Model to a SCARA robot

链接: https://arxiv.org/abs/2409.03299
作者: Jonathan Salzer,Arnoud Visser
关键词-EN: Traditional robotic systems, systems require specific, robotic systems require, require specific training, specific training data
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 pages, submitted to the joint Artificial Intelligence Machine Learning conference for Belgium, Netherlands Luxembourg (BNAIC/BeNeLearn)

点击查看摘要

Abstract:Traditional robotic systems require specific training data for each task, environment, and robot form. While recent advancements in machine learning have enabled models to generalize across new tasks and environments, the challenge of adapting these models to entirely new settings remains largely unexplored. This study addresses this by investigating the generalization capabilities of the RT-1-X robotic foundation model to a type of robot unseen during its training: a SCARA robot from UMI-RTX. Initial experiments reveal that RT-1-X does not generalize zero-shot to the unseen type of robot. However, fine-tuning of the RT-1-X model by demonstration allows the robot to learn a pickup task which was part of the foundation model (but learned for another type of robot). When the robot is presented with an object that is included in the foundation model but not in the fine-tuning dataset, it demonstrates that only the skill, but not the object-specific knowledge, has been transferred.

[LG-51] LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts EMNLP

链接: https://arxiv.org/abs/2409.03291
作者: Henrique Da Silva Gameiro,Andrei Kucharavy,Ljiljana Dolamic
关键词-EN: large Language Models, Language Models, major concern, emergence of widely, widely available powerful
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 20 pages, 7 tables, 13 figures, under consideration for EMNLP

点击查看摘要

Abstract:With the emergence of widely available powerful LLMs, disinformation generated by large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations – short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between the adversarial evasion resilience and overfitting to the reference human text, with both needing evaluation in benchmarks and currently absent. We believe this suggests a re-consideration of current LLM detector benchmarking approaches and provides a dynamically extensible benchmark to allow it (this https URL).

[LG-52] Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions

链接: https://arxiv.org/abs/2409.03282
作者: Zemian Ke,Haocheng Duan,Sean Qian
关键词-EN: follow periodic patterns, follow periodic, traffic speed prediction, Non-recurrent conditions caused, traffic speed
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Non-recurrent conditions caused by incidents are different from recurrent conditions that follow periodic patterns. Existing traffic speed prediction studies are incident-agnostic and use one single model to learn all possible patterns from these drastically diverse conditions. This study proposes a novel Mixture of Experts (MoE) model to improve traffic speed prediction under two separate conditions, recurrent and non-recurrent (i.e., with and without incidents). The MoE leverages separate recurrent and non-recurrent expert models (Temporal Fusion Transformers) to capture the distinct patterns of each traffic condition. Additionally, we propose a training pipeline for non-recurrent models to remedy the limited data issues. To train our model, multi-source datasets, including traffic speed, incident reports, and weather data, are integrated and processed to be informative features. Evaluations on a real road network demonstrate that the MoE achieves lower errors compared to other benchmark algorithms. The model predictions are interpreted in terms of temporal dependencies and variable importance in each condition separately to shed light on the differences between recurrent and non-recurrent conditions.
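
The two-expert routing can be pictured with a hard gate on the incident flag. The paper learns the routing and uses Temporal Fusion Transformers as experts; the sinusoidal "experts" below are toy stand-ins, and all numbers are invented.

```python
import numpy as np

def expert_recurrent(hour):
    # Recurrent expert: smooth daily speed pattern (km/h), toy model.
    return 60.0 + 20.0 * np.cos(2 * np.pi * (hour - 4) / 24)

def expert_nonrecurrent(hour):
    # Non-recurrent expert: incidents roughly halve speeds, toy model.
    return 0.5 * expert_recurrent(hour)

def moe_speed(hour, incident):
    """Route the prediction to the matching expert for the condition."""
    return expert_nonrecurrent(hour) if incident else expert_recurrent(hour)
```

The benefit of the mixture is that each expert only ever sees, and thus only has to model, one regime, instead of one model averaging over drastically different patterns.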

[LG-53] Tensor network square root Kalman filter for online Gaussian process regression

链接: https://arxiv.org/abs/2409.03276
作者: Clara Menzen,Manon Kok,Kim Batselier
关键词-EN: network Kalman filter, tensor network Kalman, Kalman filter lifts, high-dimensional recursive estimation, Kalman filter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The state-of-the-art tensor network Kalman filter lifts the curse of dimensionality for high-dimensional recursive estimation problems. However, the required rounding operation can cause filter divergence due to the loss of positive definiteness of covariance matrices. We solve this issue by developing, for the first time, a tensor network square root Kalman filter, and apply it to high-dimensional online Gaussian process regression. In our experiments, we demonstrate that our method is equivalent to the conventional Kalman filter when choosing a full-rank tensor network. Furthermore, we apply our method to a real-life system identification problem where we estimate 4^14 parameters on a standard laptop. The estimated model outperforms the state-of-the-art tensor network Kalman filter in terms of prediction accuracy and uncertainty quantification.

[LG-54] In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems via Search

链接: https://arxiv.org/abs/2409.03260
作者: Emir Demirović,Christian Schilling,Anna Lukina
关键词-EN: attractive as control, control policies, Decision trees, formal synthesis, policies
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages main text incl. references, 1 page appendix

点击查看摘要

Abstract:Decision trees, owing to their interpretability, are attractive as control policies for (dynamical) systems. Unfortunately, constructing, or synthesising, such policies is a challenging task. Previous approaches do so by imitating a neural-network policy, approximating a tabular policy obtained via formal synthesis, employing reinforcement learning, or modelling the problem as a mixed-integer linear program. However, these works may require access to a hard-to-obtain accurate policy or a formal model of the environment (within reach of formal synthesis), and may not provide guarantees on the quality or size of the final tree policy. In contrast, we present an approach to synthesise optimal decision-tree policies given a black-box environment and specification, and a discretisation of the tree predicates, where optimality is defined with respect to the number of steps to achieve the goal. Our approach is a specialised search algorithm which systematically explores the (exponentially large) space of decision trees under the given discretisation. The key component is a novel pruning mechanism that significantly reduces the search space. Our approach represents a conceptually novel way of synthesising small decision-tree policies with optimality guarantees even for black-box environments with black-box specifications.

[LG-55] Dual-TSST: A Dual-Branch Temporal-Spectral-Spatial Transformer Model for EEG Decoding

链接: https://arxiv.org/abs/2409.03251
作者: Hongqi Li,Haodong Zhang,Yitong Chen
关键词-EN: user intentions conveniently, signals allows access, intentions conveniently, human-machine interaction, EEG
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The decoding of electroencephalography (EEG) signals allows access to user intentions conveniently, which plays an important role in the fields of human-machine interaction. To effectively extract sufficient characteristics of the multichannel EEG, a novel decoding architecture network with a dual-branch temporal-spectral-spatial transformer (Dual-TSST) is proposed in this study. Specifically, by utilizing convolutional neural networks (CNNs) on different branches, the proposed processing network first extracts the temporal-spatial features of the original EEG and the temporal-spectral-spatial features of time-frequency domain data converted by wavelet transformation, respectively. These perceived features are then integrated by a feature fusion block, serving as the input of the transformer to capture the global long-range dependencies entailed in the non-stationary EEG, and being classified via the global average pooling and multi-layer perceptron blocks. To evaluate the efficacy of the proposed approach, the competitive experiments are conducted on three publicly available datasets of BCI IV 2a, BCI IV 2b, and SEED, with the head-to-head comparison of more than ten other state-of-the-art methods. As a result, our proposed Dual-TSST performs superiorly in various tasks, which achieves the promising EEG classification performance of average accuracy of 80.67% in BCI IV 2a, 88.64% in BCI IV 2b, and 96.65% in SEED, respectively. Extensive ablation experiments conducted between the Dual-TSST and comparative baseline model also reveal the enhanced decoding performance with each module of our proposed method. This study provides a new approach to high-performance EEG decoding, and has great potential for future CNN-Transformer based applications.

[LG-56] DiffGrad for Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2409.03239
作者: Jamshaid Ul Rahman,Nimra
关键词-EN: Physics-Informed Neural Networks, addressing highly nonlinear, highly nonlinear problems, nonlinear problems based, Physics-Informed Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注: 20 pages, 14 figures

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are regarded as state-of-the-art tools for addressing highly nonlinear problems based on partial differential equations. Despite their broad range of applications, PINNs encounter several performance challenges, including issues related to efficiency, minimization of computational cost, and enhancement of accuracy. Burgers’ equation, a fundamental equation in fluid dynamics that is extensively used in PINNs, provides flexible results with the Adam optimizer that does not account for past gradients. This paper introduces a novel strategy for solving Burgers’ equation by incorporating DiffGrad with PINNs, a method that leverages the difference between current and immediately preceding gradients to enhance performance. A comprehensive computational analysis is conducted using optimizers such as Adam, Adamax, RMSprop, and DiffGrad to evaluate and compare their effectiveness. Our approach includes visualizing the solutions over space at various time intervals to demonstrate the accuracy of the network. The results show that DiffGrad not only improves the accuracy of the solution but also reduces training time compared to the other optimizers.
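
DiffGrad对Adam的核心改动是引入一个由当前梯度与前一步梯度之差决定的"摩擦"系数。下面是该更新规则在一个玩具二次函数上的NumPy极简示意(并非Burgers方程;学习率与迭代步数均为示意取值):

```python
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, t, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One DiffGrad update: Adam moments scaled by a friction
    coefficient derived from the change between successive gradients."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Friction: near 1 when the gradient changes sharply,
    # near 0.5 when it is changing slowly (damps steps near a minimum).
    xi = 1.0 / (1.0 + np.exp(-np.abs(prev_grad - grad)))
    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = (x - 3)^2 as a toy problem.
theta = np.array([0.0])
m = v = np.zeros(1)
prev_grad = np.zeros(1)
for t in range(1, 501):
    grad = 2 * (theta - 3.0)
    theta, m, v = diffgrad_step(theta, grad, prev_grad, m, v, t)
    prev_grad = grad
print(theta)  # should approach 3
```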

[LG-57] Preserving Empirical Probabilities in BERT for Small-sample Clinical Entity Recognition

链接: https://arxiv.org/abs/2409.03238
作者: Abdul Rehman,Jian Jun Zhang,Xiaosong Yang
关键词-EN: Named Entity Recognition, Entity Recognition, Named Entity, equitable entity recognition, encounters the challenge
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Named Entity Recognition (NER) encounters the challenge of unbalanced labels, where certain entity types are overrepresented while others are underrepresented in real-world datasets. This imbalance can lead to biased models that perform poorly on minority entity classes, impeding accurate and equitable entity recognition. This paper explores the effects of unbalanced entity labels of the BERT-based pre-trained model. We analyze the different mechanisms of loss calculation and loss propagation for the task of token classification on randomized datasets. Then we propose ways to improve the token classification for the highly imbalanced task of clinical entity recognition.

[LG-58] Robust Q-Learning under Corrupted Rewards

链接: https://arxiv.org/abs/2409.03237
作者: Sreejeet Maity,Aritra Mitra
关键词-EN: model-free reinforcement learning, reinforcement learning algorithms, Q-learning algorithm, surge of interest, interest in analyzing
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted to the Decision and Control Conference (CDC) 2024

点击查看摘要

Abstract:Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small, inevitable O(ε) error term that scales with the adversarial corruption fraction ε. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
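
摘要未给出鲁棒估计量的具体形式;构造抗污染经验Bellman算子的一个标准做法,是用截尾均值(trimmed mean)代替历史奖励的普通平均。下面只是这一通用组件的示意,并非论文的精确构造:

```python
import numpy as np

def trimmed_mean(rewards, eps=0.1):
    """Robust mean of historical rewards: drop the eps-fraction of
    smallest and largest observations before averaging, limiting the
    influence an adversary can exert by corrupting a few samples."""
    r = np.sort(np.asarray(rewards, dtype=float))
    k = int(np.ceil(eps * len(r)))
    if 2 * k >= len(r):
        return float(np.median(r))
    return float(r[k:len(r) - k].mean())

clean = np.full(100, 1.0)
corrupted = clean.copy()
corrupted[:5] = 1e6              # adversary perturbs 5% of the rewards
print(trimmed_mean(corrupted))   # ≈ 1.0 despite the corruption
print(corrupted.mean())          # blown up by the outliers
```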

[LG-59] State-space models are accurate and efficient neural operators for dynamical systems

链接: https://arxiv.org/abs/2409.03231
作者: Zheyuan Hu,Nazanin Ahmadi Daryakenari,Qianli Shen,Kenji Kawaguchi,George Em Karniadakis
关键词-EN: Physics-informed machine learning, Physics-informed machine, offering faster, generalizable solutions, promising alternative
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 34 pages

点击查看摘要

Abstract:Physics-informed machine learning (PIML) has emerged as a promising alternative to classical methods for predicting dynamical systems, offering faster and more generalizable solutions. However, existing models, including recurrent neural networks (RNNs), transformers, and neural operators, face challenges such as long-time integration, long-range dependencies, chaotic dynamics, and extrapolation, to name a few. To this end, this paper introduces state-space models implemented in Mamba for accurate and efficient dynamical system operator learning. Mamba addresses the limitations of existing architectures by dynamically capturing long-range dependencies and enhancing computational efficiency through reparameterization techniques. To extensively test Mamba and compare against another 11 baselines, we introduce several strict extrapolation testbeds that go beyond the standard interpolation benchmarks. We demonstrate Mamba’s superior performance in both interpolation and challenging extrapolation tasks. Mamba consistently ranks among the top models while maintaining the lowest computational cost and exceptional extrapolation capabilities. Moreover, we demonstrate the good performance of Mamba for a real-world application in quantitative systems pharmacology for assessing the efficacy of drugs in tumor growth under limited data scenarios. Taken together, our findings highlight Mamba’s potential as a powerful tool for advancing scientific machine learning in dynamical systems modeling. (The code will be available at this https URL upon acceptance.)
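
对不熟悉状态空间模型的读者,其底层递推只是一个受驱动的线性系统。下面是最简(非选择性)的扫描示意,略去了Mamba依赖输入的参数化与硬件感知实现,矩阵取值仅为说明:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Run a discrete linear state-space model
    x_{t+1} = A x_t + B u_t,  y_t = C x_t  over an input sequence."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        ys.append(C @ x)
        x = A @ x + B * u_t
    return np.array(ys)

# A stable 2-state system driven by an impulse: the output decays
# geometrically, i.e. the state "remembers" the input over long ranges.
A = np.array([[0.9, 0.0], [0.0, 0.5]])
B = np.array([1.0, 1.0])
C = np.array([1.0, 1.0])
y = ssm_scan(A, B, C, [1.0] + [0.0] * 9)
print(y)
```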

[LG-60] FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks ICSE2025

链接: https://arxiv.org/abs/2409.03220
作者: Brian Hyeongseok Kim,Jingbo Wang,Chao Wang
关键词-EN: quantifying individual fairness, formally certifying, certifying and quantifying, DNN, quantifying individual
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: To Appear In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025)

点击查看摘要

Abstract:We propose a method for formally certifying and quantifying individual fairness of deep neural networks (DNN). Individual fairness guarantees that any two individuals who are identical except for a legally protected attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they tend to suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the symbolic interval based analysis from conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on four popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.
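
下面是该方法所依赖的符号区间分析的一个极简示意:把输入盒通过仿射层与ReLU层逐层传播,得到对输出的可靠(sound)上下界。论文的抽象-精化与定量统计部分未展示,网络权重为示意取值:

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate an axis-aligned box through x -> W @ x + b."""
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def interval_relu(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

# Tiny 2-layer ReLU network (weights are illustrative only).
W1 = np.array([[1.0, -2.0], [0.5, 1.0]]); b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]]);              b2 = np.array([0.0])

def net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

# Box covering "individuals" identical except for feature 1 (the
# protected attribute, allowed to range over [0, 1]).
lo = np.array([0.3, 0.0]); hi = np.array([0.3, 1.0])
l, h = interval_affine(lo, hi, W1, b1)
l, h = interval_relu(l, h)
l, h = interval_affine(l, h, W2, b2)
print(l, h)  # sound bounds on every output in the box
```

若区间内所有输出落在同一决策侧,则该盒内的个体公平性即被证明;FairQuant在此之上再做精化与比例量化。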

[LG-61] Content Moderation by LLM: From Accuracy to Legitimacy

链接: https://arxiv.org/abs/2409.03219
作者: Tao Huang
关键词-EN: large language model, LLM, large language, language model, content moderation
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One trending application of LLM (large language model) is to use it for content moderation in online platforms. Most current studies on this application have focused on the metric of accuracy - the extent to which LLM makes correct decisions about content. This article argues that accuracy is insufficient and misleading, because it fails to grasp the distinction between easy cases and hard cases as well as the inevitable trade-offs in achieving higher accuracy. Closer examination reveals that content moderation is a constitutive part of platform governance, the key of which is to gain and enhance legitimacy. Instead of making moderation decisions correct, the chief goal of LLM is to make them legitimate. In this regard, this article proposes a paradigm shift from the single benchmark of accuracy towards a legitimacy-based framework of evaluating the performance of LLM moderators. The framework suggests that for easy cases, the key is to ensure accuracy, speed and transparency, while for hard cases, what matters is reasoned justification and user participation. Examined under this framework, LLM’s real potential in moderation is not accuracy improvement. Rather, LLM can better contribute in four other aspects: to conduct screening of hard cases from easy cases, to provide quality explanations for moderation decisions, to assist human reviewers in getting more contextual information, and to facilitate user participation in a more interactive way. Using normative theories from law and social sciences to critically assess the new technological application, this article seeks to redefine LLM’s role in content moderation and redirect relevant research in this field.

[LG-62] Application Research On Real-Time Perception Of Device Performance Status

链接: https://arxiv.org/abs/2409.03218
作者: Zhe Wang,Zhen Wang,Jianwen Wu,Wangzhong Xiao,Yidong Chen,Zihua Feng,Dian Yang,Hongchen Liu,Bo Liang,Jiaojiao Fu
关键词-EN: Ideal Solution, Preference by Similarity, Similarity to Ideal, Order Preference, Technique for Order
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In order to accurately identify the performance status of mobile devices and finely adjust the user experience, a real-time performance perception evaluation method based on TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) combined with the entropy weighting method and time series model construction was studied. After collecting the performance characteristics of various mobile devices, the device performance profile was fitted using PCA (principal component analysis) dimensionality reduction and feature engineering methods such as descriptive time series analysis. The ability of performance features and profiles to describe the real-time performance status of devices was examined by applying the TOPSIS method and multi-level weighting processing. A time series model was constructed for the feature set under objective weighting, and multiple-sensitivity (real-time, short-term, long-term) performance status perception results were provided to obtain real-time performance evaluation data and long-term stable performance prediction data. Finally, by configuring dynamic AB experiments and overlaying fine-grained power reduction strategies, the usability of the method was verified, and the accuracy of device performance status identification and prediction was compared across approaches, including profile features with dimensionality-reduced time series modeling, the TOPSIS method with entropy weighting, subjective weighting, and the HMA method. The results show that accurate real-time performance perception results can greatly enhance business value, and this research has application effectiveness and certain forward-looking significance.
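
其核心打分机制(熵权法 + TOPSIS)可以如下示意;数据矩阵为虚构,且所有指标均按效益型处理:

```python
import numpy as np

def entropy_weights(X):
    """Objective criterion weights from the entropy method: criteria
    with more dispersion across alternatives get larger weights."""
    P = X / X.sum(axis=0)
    E = -(P * np.log(P)).sum(axis=0) / np.log(len(X))
    d = 1.0 - E
    return d / d.sum()

def topsis(X, w):
    """Rank alternatives by relative closeness to the ideal solution
    (all criteria assumed benefit-type here)."""
    V = X / np.linalg.norm(X, axis=0) * w
    best, worst = V.max(axis=0), V.min(axis=0)
    d_best = np.linalg.norm(V - best, axis=1)
    d_worst = np.linalg.norm(V - worst, axis=1)
    return d_worst / (d_best + d_worst)

# Rows: devices, columns: performance indicators (higher is better).
X = np.array([[0.9, 0.8, 0.7],
              [0.5, 0.9, 0.6],
              [0.2, 0.3, 0.1]])
w = entropy_weights(X)
scores = topsis(X, w)
print(scores.argmax())  # index of the best-performing device
```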

[LG-63] xLAM: A Family of Large Action Models to Empower AI Agent Systems

链接: https://arxiv.org/abs/2409.03215
作者: Jianguo Zhang,Tian Lan,Ming Zhu,Zuxin Liu,Thai Hoang,Shirley Kokane,Weiran Yao,Juntao Tan,Akshara Prabhakar,Haolin Chen,Zhiwei Liu,Yihao Feng,Tulika Awalgaonkar,Rithesh Murthy,Eric Hu,Zeyuan Chen,Ran Xu,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
关键词-EN: significant research interest, attracted significant research, research interest, agent tasks, Autonomous agents powered
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Technical report for the Salesforce xLAM model series

点击查看摘要

Abstract:Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents’ generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL

[LG-64] Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty

链接: https://arxiv.org/abs/2409.03212
作者: Hersh Vakharia,Xiaoxiao Du
关键词-EN: improve reliability, Multiple Instance Learning, Sensor fusion combines, multiple sensor sources, Choquet integral
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, 7 tables; Accepted to 2024 FUZZ-IEEE and presented at 2024 IEEE WCCI; Code available at this https URL

点击查看摘要

Abstract:Sensor fusion combines data from multiple sensor sources to improve reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular, the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on the normalized fuzzy measures, which bounds the fuzzy measure values between [0, 1]. This can be limiting in cases where the underlying scales of input data sources are bipolar (i.e., between [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced “bi-mi-kee”), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bi-polar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to “bags” (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses on the behavior of the fuzzy measures to demonstrate our fusion process.
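
作为参考,关于(归一化)模糊测度的离散Choquet积分可如下计算;测度取值为虚构,论文将其扩展到[-1, 1]刻度的bi-capacity部分未展示:

```python
import numpy as np

def choquet_integral(x, mu):
    """Discrete Choquet integral of inputs x with respect to a fuzzy
    measure mu, given as a dict mapping frozensets of source indices
    to values in [0, 1]."""
    order = np.argsort(x)[::-1]        # sources sorted by descending value
    total, prev, subset = 0.0, 0.0, frozenset()
    for i in order:
        subset = subset | {int(i)}
        g = mu[subset]
        total += x[i] * (g - prev)     # weight inputs by measure increments
        prev = g
    return total

# Hypothetical normalized fuzzy measure over two sensor sources:
# source 0 alone is trusted more (0.7) than source 1 alone (0.4).
mu = {frozenset(): 0.0,
      frozenset({0}): 0.7,
      frozenset({1}): 0.4,
      frozenset({0, 1}): 1.0}
print(choquet_integral(np.array([0.9, 0.2]), mu))  # 0.9*0.7 + 0.2*0.3 = 0.69
```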

[LG-65] Pricing American Options using Machine Learning Algorithms

链接: https://arxiv.org/abs/2409.03204
作者: Prudence Djagba,Callixte Ndizihiwe
关键词-EN: Monte Carlo simulations, Monte Carlo, Monte Carlo methods, pricing American options, American options
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:This study investigates the application of machine learning algorithms, particularly in the context of pricing American options using Monte Carlo simulations. Traditional models, such as the Black-Scholes-Merton framework, often fail to adequately address the complexities of American options, which include the ability for early exercise and non-linear payoff structures. By leveraging Monte Carlo methods in conjunction with the Least Squares Method (LSM), machine learning was applied to the pricing problem. This research aims to improve the accuracy and efficiency of option pricing. The study evaluates several machine learning models, including neural networks and decision trees, highlighting their potential to outperform traditional approaches. The results from applying machine learning algorithms within LSM indicate that integrating machine learning with Monte Carlo simulations can enhance pricing accuracy and provide more robust predictions, offering significant insights into quantitative finance by merging classical financial theories with modern computational techniques. The dataset was split into features and the target variable representing bid prices, with an 80-20 train-validation split. LSTM and GRU models were constructed using TensorFlow’s Keras API, each with four hidden layers of 200 neurons and an output layer for bid price prediction, optimized with the Adam optimizer and MSE loss function. The GRU model outperformed the LSTM model across all evaluated metrics, demonstrating lower mean absolute error, mean squared error, and root mean squared error, along with greater stability and efficiency in training.
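
该研究所依托的最小二乘蒙特卡洛(LSM)主干可以紧凑地示意如下:这是使用二次多项式基的经典Longstaff-Schwartz过程,并非论文中的神经网络/决策树变体;参数取自常见的教科书例子:

```python
import numpy as np

def lsm_american_put(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0,
                     steps=50, paths=20000, seed=0):
    """Longstaff-Schwartz least-squares Monte Carlo for an American put:
    regress continuation values on a polynomial of the spot price and
    exercise whenever the immediate payoff beats the fitted continuation."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    z = rng.standard_normal((paths, steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt
                              + sigma * np.sqrt(dt) * z, axis=1))
    payoff = np.maximum(K - S[:, -1], 0.0)      # value at maturity
    for t in range(steps - 2, -1, -1):
        payoff *= np.exp(-r * dt)               # discount one step back
        itm = K - S[:, t] > 0                   # regress on in-the-money paths
        if itm.sum() > 10:
            coeffs = np.polyfit(S[itm, t], payoff[itm], 2)
            cont = np.polyval(coeffs, S[itm, t])
            exercise = (K - S[itm, t]) > cont
            idx = np.where(itm)[0][exercise]
            payoff[idx] = K - S[idx, t]
    return float(np.exp(-r * dt) * payoff.mean())

price = lsm_american_put()
print(price)
```

对于这组经典参数,估计值应落在Longstaff与Schwartz报告的4.5附近;神经网络或树模型在这一框架中替换的正是多项式回归这一步。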

[LG-66] How noise affects memory in linear recurrent networks

链接: https://arxiv.org/abs/2409.03187
作者: JingChuan Guan,Tomoyuki Kubota,Yasuo Kuniyoshi,Kohei Nakajima
关键词-EN: linear recurrent network, theoretically investigated, linear recurrent, recurrent network, noise
类目: Neural and Evolutionary Computing (cs.NE); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:The effects of noise on memory in a linear recurrent network are theoretically investigated. Memory is characterized by the network’s ability to store previous inputs in its instantaneous state, which receives correlated or uncorrelated noise. Two major properties are revealed: First, the memory reduced by noise is uniquely determined by the noise’s power spectral density (PSD). Second, the memory will not decrease regardless of noise intensity if the PSD is in a certain class of distributions (including power law). The results are verified using human brain signals, showing good agreement.

[LG-67] Machine learning-based algorithms for at-home respiratory disease monitoring and respiratory assessment

链接: https://arxiv.org/abs/2409.03180
作者: Negar Orangi-Fard,Alexandru Bogdan,Hersh Sagreiya
关键词-EN: management practices primarily, practices primarily reliant, specialist clinical testing, Respiratory diseases impose, global health
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Respiratory diseases impose a significant burden on global health, with current diagnostic and management practices primarily reliant on specialist clinical testing. This work aims to develop machine learning-based algorithms to facilitate at-home respiratory disease monitoring and assessment for patients undergoing continuous positive airway pressure (CPAP) therapy. Data were collected from 30 healthy adults, encompassing respiratory pressure, flow, and dynamic thoraco-abdominal circumferential measurements under three breathing conditions: normal, panting, and deep breathing. Various machine learning models, including the random forest classifier, logistic regression, and support vector machine (SVM), were trained to predict breathing types. The random forest classifier demonstrated the highest accuracy, particularly when incorporating breathing rate as a feature. These findings support the potential of AI-driven respiratory monitoring systems to transition respiratory assessments from clinical settings to home environments, enhancing accessibility and patient autonomy. Future work involves validating these models with larger, more diverse populations and exploring additional machine learning techniques.

[LG-68] InfraLib: Enabling Reinforcement Learning and Decision Making for Large Scale Infrastructure Management

链接: https://arxiv.org/abs/2409.03167
作者: Pranay Thangeda,Trevor S. Betz,Michael N. Grussing,Melkior Ornik
关键词-EN: Efficient management, economic stability, public safety, crucial for economic, Efficient
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Efficient management of infrastructure systems is crucial for economic stability, sustainability, and public safety. However, infrastructure management is challenging due to the vast scale of systems, stochastic deterioration of components, partial observability, and resource constraints. While data-driven approaches like reinforcement learning (RL) offer a promising avenue for optimizing management policies, their application to infrastructure has been limited by the lack of suitable simulation environments. We introduce InfraLib, a comprehensive framework for modeling and analyzing infrastructure management problems. InfraLib employs a hierarchical, stochastic approach to realistically model infrastructure systems and their deterioration. It supports practical functionality such as modeling component unavailability, cyclical budgets, and catastrophic failures. To facilitate research, InfraLib provides tools for expert data collection, simulation-driven analysis, and visualization. We demonstrate InfraLib’s capabilities through case studies on a real-world road network and a synthetic benchmark with 100,000 components.

[LG-69] A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers

链接: https://arxiv.org/abs/2409.03164
作者: Zhen Li,Weikai Yang,Jun Yuan,Jing Wu,Changjian Chen,Yao Ming,Fan Yang,Hui Zhang,Shixia Liu
关键词-EN: hard to understand, ensemble classifiers benefits, tree ensemble classifiers, high performance, rules
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注: 15 pages, 10 figures

点击查看摘要

Abstract:The high performance of tree ensemble classifiers benefits from a large set of rules, which, in turn, makes the models hard to understand. To improve interpretability, existing methods extract a subset of rules for approximation using model reduction techniques. However, by focusing on the reduced rule set, these methods often lose fidelity and ignore anomalous rules that, despite their infrequency, play crucial roles in real-world applications. This paper introduces a scalable visual analysis method to explain tree ensemble classifiers that contain tens of thousands of rules. The key idea is to address the issue of losing fidelity by adaptively organizing the rules as a hierarchy rather than reducing them. To ensure the inclusion of anomalous rules, we develop an anomaly-biased model reduction method to prioritize these rules at each hierarchical level. Synergized with this hierarchical organization of rules, we develop a matrix-based hierarchical visualization to support exploration at different levels of detail. Our quantitative experiments and case studies demonstrate how our method fosters a deeper understanding of both common and anomalous rules, thereby enhancing interpretability without sacrificing comprehensiveness.

[LG-70] Standing on the shoulders of giants

链接: https://arxiv.org/abs/2409.03151
作者: Lucas Felipe Ferraro Cardoso,José de Sousa Ribeiro Filho,Vitor Cirilo Araujo Santos,Regiane Silva Kawasaki Frances,Ronnie Cley de Oliveira Alves
关键词-EN: Machine Learning, advancement of Machine, classic evaluation metrics, evaluation metrics extracted, Item Response Theory
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 8 figures, 3 tables, submitted for the BRACIS’24 conference

点击查看摘要

Abstract:Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models’ performance, without considering the complexity of the data or the quality of the hit. To overcome these limitations, recent research has introduced the use of psychometric metrics such as Item Response Theory (IRT), which allows an assessment at the level of latent characteristics of instances. This work investigates how IRT concepts can enrich a confusion matrix in order to identify which model is the most appropriate among options with similar performance. In the study carried out, IRT does not replace but rather complements classical metrics by offering a new layer of evaluation and observation of the fine-grained behavior of models on specific instances. It was also observed, with 97% confidence, that the IRT score contributes information distinct from that of 66% of the classical metrics analyzed.
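
IRT的基本构件很简单:例如在双参数逻辑斯蒂(2PL)模型下,模型"答对"某实例的概率是其潜在能力与该实例难度、区分度的逻辑斯蒂函数(下列参数取值仅为示意):

```python
import math

def irt_2pl(theta, a, b):
    """Two-parameter logistic IRT model: probability that a respondent
    (here, a classifier) with latent ability theta answers correctly an
    item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A more able model (higher theta) has a higher hit probability
# on the same item.
p_weak = irt_2pl(theta=-1.0, a=1.2, b=0.0)
p_strong = irt_2pl(theta=1.5, a=1.2, b=0.0)
print(p_weak, p_strong)
```

正是这种按实例难度分解的视角,使IRT能区分"答对难题"与"答对送分题",从而补充混淆矩阵指标。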

[LG-71] Discovering Cyclists’ Street Visual Preferences Through Multi-Source Big Data Using Deep Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2409.03148
作者: Ren Kezhou,Gong Yongxi
关键词-EN: gained global popularity, positive urban impacts, cyclists’ preferences, gained global, global popularity
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: 38 pages, 16 figures

点击查看摘要

Abstract:Cycling has gained global popularity for its health benefits and positive urban impacts. To effectively promote cycling, early studies have extensively investigated the relationship between cycling behaviors and environmental factors, especially cyclists’ preferences when making route decisions. However, these studies often struggle to comprehensively describe detailed cycling procedures at a large scale due to data limitations, and they tend to overlook the complex nature of cyclists’ preferences. To address these issues, we propose a novel framework aimed at quantifying and interpreting cyclists’ complicated street visual preferences from cycling records by leveraging maximum entropy deep inverse reinforcement learning (MEDIRL) and explainable artificial intelligence (XAI). Implemented in Bantian Sub-district, Shenzhen, we adapt the MEDIRL model for efficient estimation of the cycling reward function by integrating dockless-bike-sharing (DBS) trajectory and street view images (SVIs), which serve as a representation of cyclists’ preferences for street visual environments during routing. In addition, we demonstrate the feasibility and reliability of MEDIRL in discovering cyclists’ street visual preferences. Further analysis reveals the nonlinear and interactive effects of street visual elements on cyclists’ preferences, offering a holistic perspective on streetscape design. Our proposed framework advances the understanding of individual cycling behaviors and provides actionable insights for urban planners to design bicycle-friendly streetscapes that prioritize cyclists’ preferences.

[LG-72] Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning

链接: https://arxiv.org/abs/2409.03147
作者: Juan A. Berrios Moya
关键词-EN: rapid global aging, global aging trend, accurate diagnostic methods, underscoring the urgent, rapid global
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid global aging trend has led to an increase in dementia cases, including Alzheimer’s disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer’s disease and other forms of dementia.

[LG-73] Causal Temporal Representation Learning with Nonstationary Sparse Transition

链接: https://arxiv.org/abs/2409.03142
作者: Xiangchen Song,Zijian Li,Guangyi Chen,Yujia Zheng,Yewen Fan,Xinshuai Dong,Kun Zhang
关键词-EN: Temporal Representation Learning, Causal Temporal Representation, temporal causal dynamics, nonstationary temporal sequences, complex nonstationary temporal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal Temporal Representation Learning (Ctrl) methods aim to identify the temporal causal dynamics of complex nonstationary temporal sequences. Despite the success of existing Ctrl methods, they require either directly observing the domain variables or assuming a Markov prior on them. Such requirements limit the application of these methods in real-world scenarios when we do not have such prior knowledge of the domain variables. To address this problem, this work adopts a sparse transition assumption, aligned with intuitive human understanding, and presents identifiability results from a theoretical perspective. In particular, we explore under what conditions on the significance of the variability of the transitions we can build a model to identify the distribution shifts. Based on the theoretical result, we introduce a novel framework, Causal Temporal Representation Learning with Nonstationary Sparse Transition (CtrlNS), designed to leverage the constraints on transition sparsity and conditional independence to reliably identify both distribution shifts and latent factors. Our experimental evaluations on synthetic and real-world datasets demonstrate significant improvements over existing baselines, highlighting the effectiveness of our approach.

[LG-74] Towards Autonomous Cybersecurity: An Intelligent AutoML Framework for Autonomous Intrusion Detection CCS2024

链接: https://arxiv.org/abs/2409.03141
作者: Li Yang,Abdallah Shami
关键词-EN: network management systems, Machine Learning, Automated Machine Learning, Intrusion Detection Systems, rapid evolution
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
*备注: Accepted to the Workshop on Autonomous Cybersecurity, ACM CCS 2024; Code is available at Github link: this https URL

点击查看摘要

Abstract:The rapid evolution of mobile networks from 5G to 6G has necessitated the development of autonomous network management systems, such as Zero-Touch Networks (ZTNs). However, the increased complexity and automation of these networks have also escalated cybersecurity risks. Existing Intrusion Detection Systems (IDSs) leveraging traditional Machine Learning (ML) techniques have shown effectiveness in mitigating these risks, but they often require extensive manual effort and expert knowledge. To address these challenges, this paper proposes an Automated Machine Learning (AutoML)-based autonomous IDS framework towards achieving autonomous cybersecurity for next-generation networks. To achieve autonomous intrusion detection, the proposed AutoML framework automates all critical procedures of the data analytics pipeline, including data pre-processing, feature engineering, model selection, hyperparameter tuning, and model ensemble. Specifically, it utilizes a Tabular Variational Auto-Encoder (TVAE) method for automated data balancing, tree-based ML models for automated feature selection and base model learning, Bayesian Optimization (BO) for hyperparameter optimization, and a novel Optimized Confidence-based Stacking Ensemble (OCSE) method for automated model ensemble. The proposed AutoML-based IDS was evaluated on two public benchmark network security datasets, CICIDS2017 and 5G-NIDD, and demonstrated improved performance compared to state-of-the-art cybersecurity methods. This research marks a significant step towards fully autonomous cybersecurity in next-generation networks, potentially revolutionizing network security applications.
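
摘要未给出OCSE的具体定义;作为"基于置信度的集成"这一思想的通用示意,下面按每个基模型逐样本的置信度(最大类别概率)加权其概率输出再归一化。模型数量、权重与概率均为虚构,并非论文的OCSE实现:

```python
import numpy as np

def confidence_stacking(probas, weights=None):
    """Combine base-model class-probability outputs, scaling each
    model's vote by its per-sample confidence (max class probability).
    `probas` has shape (n_models, n_samples, n_classes)."""
    probas = np.asarray(probas, dtype=float)
    conf = probas.max(axis=2, keepdims=True)      # per-sample confidence
    if weights is None:
        weights = np.ones(len(probas))
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)
    combined = (w * conf * probas).sum(axis=0)
    return combined / combined.sum(axis=1, keepdims=True)

# Two hypothetical base IDS models scoring one flow as attack/benign.
p = [[[0.9, 0.1]],     # confident model
     [[0.45, 0.55]]]   # near-random model
fused = confidence_stacking(p)
print(fused)  # the confident model dominates the fused decision
```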

[LG-75] GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation

链接: https://arxiv.org/abs/2409.03140
作者: Ashirbad Mishra,Soumik Dey,Marshall Wu,Jinyu Zhao,He Yu,Kaichen Ni,Binbin Li,Kamesh Madduri
关键词-EN: Extreme Multi-Label Classification, Online sellers, listed products, enhance their sales, advertisers are recommended
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
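The core extraction idea, generating candidate keyphrases as token permutations from an item title, can be sketched in a few lines (function name and length cap hypothetical; GraphEx's graph construction and ranking are omitted):

```python
# Hypothetical sketch of keyphrase candidate generation from an item title,
# following the token-permutation idea described for GraphEx.
from itertools import permutations

def candidate_keyphrases(title, max_len=2):
    tokens = [t.lower() for t in title.split()]
    cands = set()
    for n in range(1, max_len + 1):
        for perm in permutations(tokens, n):
            cands.add(" ".join(perm))
    return cands

cands = candidate_keyphrases("Apple iPhone 13 Case")
print(len(cands))
```

Note that permutations (not just subsequences) are generated, so both "iphone case" and "case iphone" become candidates to match against buyer queries.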

[LG-76] The AdEMAMix Optimizer: Better, Faster, Older

链接: https://arxiv.org/abs/2409.03137
作者: Matteo Pagliardini,Pierre Ablin,David Grangier
关键词-EN: Momentum based optimizers, machine learning applications, Exponential Moving Average, Momentum based, learning applications
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages, 27 figures

点击查看摘要

Abstract:Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show – quite surprisingly – that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a 1.3 B parameter AdEMAMix LLM trained on 101 B tokens performs comparably to an AdamW model trained on 197 B tokens ( +95% ). Moreover, our method significantly slows-down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.
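A simplified sketch of the two-EMA update: Adam's fast EMA plus a second, slower EMA of past gradients whose contribution is scaled by alpha. Hyperparameter values are illustrative and the paper's schedulers for alpha and beta3 are omitted:

```python
import numpy as np

def ademamix_step(theta, g, state, lr=1e-3, b1=0.9, b2=0.999, b3=0.9999,
                  alpha=5.0, eps=1e-8):
    """One simplified AdEMAMix step: Adam plus a slow second EMA of gradients."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = b1 * state["m1"] + (1 - b1) * g    # fast EMA, as in Adam
    state["m2"] = b3 * state["m2"] + (1 - b3) * g    # slow EMA of older gradients
    state["v"] = b2 * state["v"] + (1 - b2) * g**2
    m1_hat = state["m1"] / (1 - b1**t)               # bias correction
    v_hat = state["v"] / (1 - b2**t)
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)

theta = np.array([1.0, -1.0])
state = {"m1": np.zeros(2), "m2": np.zeros(2), "v": np.zeros(2), "t": 0}
for _ in range(100):
    g = 2 * theta                                    # gradient of ||theta||^2
    theta = ademamix_step(theta, g, state)
print(theta)
```

With b3 close to 1, the slow EMA keeps a non-negligible weight on gradients from tens of thousands of steps ago, which a single EMA cannot do while also weighting the immediate past highly.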

[LG-77] Subsidy design for better social outcomes

链接: https://arxiv.org/abs/2409.03129
作者: Maria-Florina Balcan,Matteo Pozzi,Dravyansh Sharma
关键词-EN: Overcoming the impact, Price of Anarchy, players in multiagent, system performance, multiagent systems
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 30 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Overcoming the impact of selfish behavior of rational players in multiagent systems is a fundamental problem in game theory. Without any intervention from a central agent, strategic users take actions in order to maximize their personal utility, which can lead to extremely inefficient overall system performance, often indicated by a high Price of Anarchy. Recent work (Lin et al. 2021) investigated and formalized yet another undesirable behavior of rational agents, that of avoiding freely available information about the game for selfish reasons, leading to worse social outcomes. A central planner can significantly mitigate these issues by injecting a subsidy to reduce certain costs associated with the system and obtain net gains in the system performance. Crucially, the planner needs to determine how to allocate this subsidy effectively. We formally show that designing subsidies that perfectly optimize the social good, in terms of minimizing the Price of Anarchy or preventing the information avoidance behavior, is computationally hard under standard complexity theoretic assumptions. On the positive side, we show that we can learn provably good values of subsidy in repeated games coming from the same domain. This data-driven subsidy design approach avoids solving computationally hard problems for unseen games by learning over polynomially many games. We also show that optimal subsidy can be learned with no-regret given an online sequence of games, under mild assumptions on the cost matrix. Our study focuses on two distinct games: a Bayesian extension of the well-studied fair cost-sharing game, and a component maintenance game with engineering applications. 
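For small games, the Price of Anarchy mentioned above can be computed by brute force. A toy two-player fair cost-sharing instance (numbers illustrative): a shared resource of total cost 2.0 split among its users versus a private option costing 1.1 each:

```python
# Brute-force Price of Anarchy for a toy 2-player fair cost-sharing game.
from itertools import product

def costs(profile):
    # "S": shared resource (cost 2.0 split among users); "P": private (cost 1.1).
    n_shared = profile.count("S")
    return tuple(2.0 / n_shared if a == "S" else 1.1 for a in profile)

actions = ["S", "P"]
profiles = list(product(actions, repeat=2))

def is_nash(profile):
    c = costs(profile)
    for i in range(2):
        for alt in actions:
            dev = list(profile); dev[i] = alt
            if costs(tuple(dev))[i] < c[i] - 1e-9:   # profitable deviation
                return False
    return True

nash = [p for p in profiles if is_nash(p)]
social = {p: sum(costs(p)) for p in profiles}
opt = min(social.values())
poa = max(social[p] for p in nash) / opt             # worst NE vs. optimum
print(nash, round(poa, 3))
```

Here both-shared (social cost 2.0, optimal) and both-private (social cost 2.2) are pure Nash equilibria, so the Price of Anarchy is 1.1; a subsidy lowering the shared cost would eliminate the inefficient equilibrium.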

[LG-78] Probing self-attention in self-supervised speech models for cross-linguistic differences

链接: https://arxiv.org/abs/2409.03115
作者: Sai Gopinath,Joselyn Rodriguez
关键词-EN: gained traction, increase in accuracy, transformer architectures, Speech, models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 10 pages, 18 figures

点击查看摘要

Abstract:Speech models have gained traction thanks to increase in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse ranging from almost entirely diagonal to almost entirely global regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
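One simple way to quantify the diagonal-versus-global spectrum of attention heads described above is the fraction of attention mass within a narrow band around the diagonal (metric name and bandwidth are hypothetical, not necessarily the paper's measure):

```python
import numpy as np

def diagonality(attn, band=1):
    """Fraction of attention mass within `band` positions of the diagonal."""
    T = attn.shape[0]
    mask = np.abs(np.subtract.outer(np.arange(T), np.arange(T))) <= band
    return (attn * mask).sum() / attn.sum()

T = 8
diag_head = np.eye(T) * 0.9 + 0.1 / T      # head attending mostly to itself
global_head = np.full((T, T), 1.0 / T)     # uniform ("global") attention
print(diagonality(diag_head), diagonality(global_head))
```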

[LG-79] RoboKoop: Efficient Control Conditioned Representations from Visual Input in Robotics using Koopman Operator

链接: https://arxiv.org/abs/2409.03107
作者: Hemant Kumawat,Biswadeep Chakraborty,Saibal Mukhopadhyay
关键词-EN: requires underlying robust, underlying robust task, Developing agents, underlying visual representations, requires underlying
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted to the 8th Conference on Robot Learning (CoRL 2024)

点击查看摘要

Abstract:Developing agents that can perform complex control tasks from high-dimensional observations is a core ability of autonomous agents that requires underlying robust task control policies and adapting the underlying visual representations to the task. Most existing policies need a lot of training samples and treat this problem from the lens of two-stage learning with a controller learned on top of pre-trained vision models. We approach this problem from the lens of Koopman theory and learn visual representations from robotic agents conditioned on specific downstream tasks in the context of learning stabilizing control for the agent. We introduce a Contrastive Spectral Koopman Embedding network that allows us to learn efficient linearized visual representations from the agent’s visual data in a high dimensional latent space and utilizes reinforcement learning to perform off-policy control on top of the extracted representations with a linear controller. Our method enhances stability and control in gradient dynamics over time, significantly outperforming existing approaches by improving efficiency and accuracy in learning task policies over extended horizons.
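The Koopman premise, that dynamics become linear in the right latent space, can be illustrated with a least-squares fit of z' = A z + B u from trajectories (a DMDc-style sketch on known linear dynamics, not the paper's contrastive spectral embedding):

```python
import numpy as np

# Fit linear latent dynamics z' = A z + B u by least squares from rollouts.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [0.5]])

Z, U, Znext = [], [], []
z = rng.normal(size=2)
for _ in range(200):
    u = rng.normal(size=1)
    z_next = A_true @ z + B_true @ u
    Z.append(z); U.append(u); Znext.append(z_next)
    z = z_next

X = np.hstack([np.array(Z), np.array(U)])          # stacked [z, u] per step
Theta, *_ = np.linalg.lstsq(X, np.array(Znext), rcond=None)
A_hat, B_hat = Theta[:2].T, Theta[2:].T
print(np.round(A_hat, 3))
```

Once such a linear model holds in the embedding space, standard linear controllers (e.g. LQR) can act on the latent state, which is the control structure the paper exploits.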

[LG-80] Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources

链接: https://arxiv.org/abs/2409.03103
作者: Amadou Ba,Pavithra Harsha,Chitra Subramanian
关键词-EN: Modern web services, web services adopt, services adopt cloud-native, adopt cloud-native principles, Service Level Agreements
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high Quality of Service (QoS) according to Service Level Agreements (SLAs), ensure satisfactory user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurately provisioning microservices with adequate resources is complex and depends on many factors, including workload intensity and the complex interconnections between microservices. To address this challenge, we develop a model that captures the relationship between an end-to-end latency, requests at the front-end level, and resource utilization. We then use the developed model to predict the end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the prediction results indicate SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the response variable being the desired latency, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments required to the features to ensure SLA compliance. We demonstrate the merit of our approach with a microservice-based application and provide a roadmap to deployment.
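The second stage described above, regressing the desired latency on TFT feature-importance scores with Kernel Ridge Regression, can be sketched with scikit-learn (the importances here are random stand-ins, since no TFT is trained in this snippet):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Stand-in TFT feature-importance scores and observed end-to-end latencies.
rng = np.random.default_rng(0)
importances = rng.uniform(size=(100, 3))
latency = 50 + 30 * importances[:, 0] + rng.normal(scale=1.0, size=100)

# KRR with feature importances as covariates and latency as the response.
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=1.0)
krr.fit(importances, latency)
pred = krr.predict(importances[:5])
print(np.round(pred, 1))
```

The learned kernel weights then indicate how much each feature must shift to bring the predicted latency back within the SLA target.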

[LG-81] Backdoor defense learnability and obfuscation

链接: https://arxiv.org/abs/2409.03077
作者: Paul Christiano,Jacob Hilton,Victor Lecomte,Mark Xu
关键词-EN: introduce a formal, formal notion, attacker, PAC learnability, function class
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 29 pages

点击查看摘要

Abstract:We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the “trigger”, while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker’s strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.

[LG-82] Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection

链接: https://arxiv.org/abs/2409.03060
作者: Min Wu,Xiaofu Li,Haoze Wu,Clark Barrett
关键词-EN: learning model outputs, machine learning model, producing optimal verified, Building on VeriX, present VeriX
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time – the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.

[LG-83] An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2409.03052
作者: Christopher Amato
关键词-EN: Multi-agent reinforcement learning, Multi-agent reinforcement, recent years, execution, CTDE
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: arXiv admin note: text overlap with arXiv:2405.06161

点击查看摘要

Abstract:Multi-agent reinforcement learning (MARL) has exploded in popularity in recent years. Many approaches have been developed but they can be divided into three main types: centralized training and execution (CTE), centralized training for decentralized execution (CTDE), and Decentralized training and execution (DTE). CTDE methods are the most common as they can use centralized information during training but execute in a decentralized manner – using only information available to that agent during execution. CTDE is the only paradigm that requires a separate training phase where any available information (e.g., other agent policies, underlying states) can be used. As a result, they can be more scalable than CTE methods, do not require communication during execution, and can often perform well. CTDE fits most naturally with the cooperative case, but can be potentially applied in competitive or mixed settings depending on what information is assumed to be observed. This text is an introduction to CTDE in cooperative MARL. It is meant to explain the setting, basic concepts, and common methods. It does not cover all work in CTDE MARL as the subarea is quite extensive. I have included work that I believe is important for understanding the main concepts in the subarea and apologize to those that I have omitted.

[LG-84] Can Your Generative Model Detect Out-of-Distribution Covariate Shift? ECCV2024

链接: https://arxiv.org/abs/2409.03043
作者: Christiaan Viviers,Amaan Valiuddin,Francisco Caetano,Lemar Abdi,Lena Filatova,Peter de With,Fons van der Sommen
关键词-EN: high-level image statistics, normal and In-Distribution, high-level image, distribution shift aims, OOD detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with different high-level image statistics to the captured, normal and In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involves a variety of models. To this end, we conjecture that it is sufficient to detect most occurring sensory faults (anomalies and deviations in global signals statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image-components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
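The pre-processing premise, that high-frequency image detail suffices to expose sensory faults, can be illustrated without the flow model: extract the residual of a simple blur and compare its energy on clean versus corrupted images (the blur and score here are crude stand-ins, not CovariateFlow itself):

```python
import numpy as np

# Smooth, periodic test image plus a simulated sensor fault (additive noise).
x = np.arange(32)
X, Y = np.meshgrid(x, x)
img = 0.5 + 0.25 * np.sin(2 * np.pi * X / 32) * np.sin(2 * np.pi * Y / 32)
rng = np.random.default_rng(0)
noisy = img + rng.normal(scale=0.2, size=img.shape)

def box_blur(z):
    # 5-point averaging with wrap-around edges.
    return (z + np.roll(z, 1, 0) + np.roll(z, -1, 0)
              + np.roll(z, 1, 1) + np.roll(z, -1, 1)) / 5.0

def hf_energy(z):
    # Energy of the high-frequency residual (image minus its blurred copy).
    return np.mean((z - box_blur(z)) ** 2)

print(hf_energy(img) < hf_energy(noisy))
```

A cNF-based detector replaces this fixed energy score with a learned likelihood over the high-frequency component, but the shift signal lives in the same residual.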

[LG-85] MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes

链接: https://arxiv.org/abs/2409.03034
作者: Avigail Cohen Rimon,Tal Shnitzer,Mirela Ben Chen
关键词-EN: Fourier Filter Bank, Neural Fourier Filter, frequency domains, framework for representing, triangle meshes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.
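The Fourier feature mapping mentioned above, which ties finer resolution levels to higher frequency bands, can be sketched as embedding a coordinate into sines and cosines at geometrically increasing frequencies (band count illustrative):

```python
import numpy as np

def fourier_features(v, n_bands=4):
    """Map scalar coordinates to sin/cos features at frequencies pi * 2^k."""
    freqs = 2.0 ** np.arange(n_bands) * np.pi        # pi, 2pi, 4pi, 8pi
    angles = np.outer(v, freqs)                      # (N, n_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

v = np.linspace(0.0, 1.0, 5)
emb = fourier_features(v)
print(emb.shape)
```

In the paper's architecture, each DiffusionNet level's output would be modulated by a different band of such features before the wavelet-style aggregation.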

[LG-86] NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks

链接: https://arxiv.org/abs/2409.03024
作者: Chris Stanford,Suman Adari,Xishun Liao,Yueshuai He,Qinhua Jiang,Chenchen Kuai,Jiaqi Ma,Emmanuel Tung,Yinlong Qian,Lingyi Zhao,Zihao Zhou,Zeeshan Rasheed,Khurram Shafique
关键词-EN: Collecting real-world mobility, Collecting real-world, Collecting, anomaly detection, NUMOSIM
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Collecting real-world mobility data is challenging. It is often fraught with privacy concerns, logistical difficulties, and inherent biases. Moreover, accurately annotating anomalies in large-scale data is nearly impossible, as it demands meticulous effort to distinguish subtle and complex patterns. These challenges significantly impede progress in geospatial anomaly detection research by restricting access to reliable data and complicating the rigorous evaluation, comparison, and benchmarking of methodologies. To address these limitations, we introduce a synthetic mobility dataset, NUMOSIM, that provides a controlled, ethical, and diverse environment for benchmarking anomaly detection techniques. NUMOSIM simulates a wide array of realistic mobility scenarios, encompassing both typical and anomalous behaviours, generated through advanced deep learning models trained on real mobility data. This approach allows NUMOSIM to accurately replicate the complexities of real-world movement patterns while strategically injecting anomalies to challenge and evaluate detection algorithms based on how effectively they capture the interplay between demographic, geospatial, and temporal factors. Our goal is to advance geospatial mobility analysis by offering a realistic benchmark for improving anomaly detection and mobility modeling techniques. To support this, we provide open access to the NUMOSIM dataset, along with comprehensive documentation, evaluation metrics, and benchmark results.

[LG-87] CLUE: Concept-Level Uncertainty Estimation for Large Language Models

链接: https://arxiv.org/abs/2409.03021
作者: Yu-Hsiang Wang,Andrew Bai,Che-Ping Tsai,Cho-Jui Hsieh
关键词-EN: Large Language Models, Large Language, Language Models, natural language generation, demonstrated remarkable proficiency
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs’ generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.
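The contrast with sequence-level uncertainty can be illustrated with a toy scoring rule (hypothetical, not CLUE's exact formulation): score each extracted concept by the mean negative log-probability of its tokens, so that one uncertain concept is visible even inside an otherwise confident sequence:

```python
import math

def concept_uncertainty(concept_token_probs):
    """Mean negative log-probability per concept (toy concept-level score)."""
    return {c: -sum(math.log(p) for p in probs) / len(probs)
            for c, probs in concept_token_probs.items()}

# Hypothetical token probabilities for two concepts from one generated answer.
scores = concept_uncertainty({
    "capital of France": [0.9, 0.95, 0.92],   # model confident
    "population figure": [0.3, 0.4],          # model unsure
})
print(scores)
```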

[LG-88] PIETRA: Physics-Informed Evidential Learning for Traversing Out-of-Distribution Terrain

链接: https://arxiv.org/abs/2409.03005
作者: Xiaoyi Cai,James Queeney,Tong Xu,Aniket Datar,Chenhui Pan,Max Miller,Ashton Flather,Philip R. Osteen,Nicholas Roy,Xuesu Xiao,Jonathan P. How
关键词-EN: powerful approach, approach for developing, developing traversability models, developing traversability, Self-supervised learning
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to RA-L. Video: this https URL

点击查看摘要

Abstract:Self-supervised learning is a powerful approach for developing traversability models for off-road navigation, but these models often struggle with inputs unseen during training. Existing methods utilize techniques like evidential deep learning to quantify model uncertainty, helping to identify and avoid out-of-distribution terrain. However, always avoiding out-of-distribution terrain can be overly conservative, e.g., when novel terrain can be effectively analyzed using a physics-based model. To overcome this challenge, we introduce Physics-Informed Evidential Traversability (PIETRA), a self-supervised learning framework that integrates physics priors directly into the mathematical formulation of evidential neural networks and introduces physics knowledge implicitly through an uncertainty-aware, physics-informed training loss. Our evidential network seamlessly transitions between learned and physics-based predictions for out-of-distribution inputs. Additionally, the physics-informed loss regularizes the learned model, ensuring better alignment with the physics model. Extensive simulations and hardware experiments demonstrate that PIETRA improves both learning accuracy and navigation performance in environments with significant distribution shifts.
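The evidential-uncertainty mechanism referenced above can be sketched in its standard subjective-logic form (not the paper's exact physics-informed loss): the network outputs non-negative evidence per class, and total uncertainty is K divided by the Dirichlet strength:

```python
import numpy as np

def evidential_uncertainty(evidence):
    """Dirichlet-based class probabilities and total uncertainty."""
    alpha = np.asarray(evidence, dtype=float) + 1.0   # Dirichlet parameters
    K = alpha.size
    prob = alpha / alpha.sum()
    uncertainty = K / alpha.sum()                     # in (0, 1]
    return prob, uncertainty

# In-distribution-like input: plenty of evidence -> low uncertainty.
_, u_in = evidential_uncertainty([20.0, 1.0, 1.0])
# Out-of-distribution-like input: almost no evidence -> uncertainty near 1.
_, u_ood = evidential_uncertainty([0.1, 0.1, 0.1])
print(u_in, u_ood)
```

PIETRA's contribution is to let a physics-based prior take over exactly in the high-uncertainty (low-evidence) regime this score identifies.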

[LG-89] Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models

链接: https://arxiv.org/abs/2409.02976
作者: Gabriel Y. Arteaga,Thomas B. Schön,Nicolas Pielawski
关键词-EN: Uncertainty estimation, high-risk settings, Large Language Models, autonomous cars, component when implementing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurances. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they need a large amount of computations and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice as only one GPU is needed for training and inference.
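The detection signal an ensemble provides (independent of the paper's memory-efficient training trick) can be sketched as disagreement between members, e.g. the summed variance of their next-token probability estimates:

```python
import numpy as np

def disagreement(member_probs):
    """Summed per-token variance of probabilities across ensemble members."""
    p = np.asarray(member_probs)            # shape: (n_members, vocab)
    return p.var(axis=0).sum()

confident = [[0.9, 0.05, 0.05]] * 4                       # members agree
conflicted = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6], [0.5, 0.3, 0.2]]           # members disagree
print(disagreement(confident), disagreement(conflicted))
```

High disagreement on a generated claim is the hallucination flag; the paper's method makes producing the members cheap enough to run on one GPU.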

[LG-90] SDOoop: Capturing Periodical Patterns and Out-of-phase Anomalies in Streaming Data Analysis

链接: https://arxiv.org/abs/2409.02973
作者: Alexander Hartl,Félix Iglesias Vázquez,Tanja Zseby
关键词-EN: required in applications, mechatronics or cyber-physical, cyber-physical systems, analysis is increasingly, increasingly required
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Streaming data analysis is increasingly required in applications, e.g., IoT, cybersecurity, robotics, mechatronics or cyber-physical systems. Despite its relevance, it is still an emerging field with open challenges. SDO is a recent anomaly detection method designed to meet requirements of speed, interpretability and intuitive parameterization. In this work, we present SDOoop, which extends the capabilities of SDO’s streaming version to retain temporal information of data structures. SDOoop spots contextual anomalies undetectable by traditional algorithms, while enabling the inspection of data geometries, clusters and temporal patterns. We used SDOoop to model real network communications in critical infrastructures and extract patterns that disclose their dynamics. Moreover, we evaluated SDOoop with data from intrusion detection and natural science domains and obtained performances equivalent or superior to state-of-the-art approaches. Our results show the high potential of new model-based methods to analyze and explain streaming data. Since SDOoop operates with constant per-sample space and time complexity, it is ideal for big data, being able to instantly process large volumes of information. SDOoop conforms to next-generation machine learning, which, in addition to accuracy and speed, is expected to provide highly interpretable and informative models.

[LG-91] Meal-taking activity monitoring in the elderly based on sensor data: Comparison of unsupervised classification methods

链接: https://arxiv.org/abs/2409.02971
作者: Abderrahim Derouiche(LAAS-S4M, UT3),Damien Brulin(LAAS-S4M, UT2J),Eric Campo(LAAS-S4M, UT2J),Antoine Piau
关键词-EN: improve nutritional monitoring, older population, increase in frailty, era marked, demographic change
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:In an era marked by a demographic change towards an older population, there is an urgent need to improve nutritional monitoring in view of the increase in frailty. This research aims to enhance the identification of meal-taking activities by combining K-Means, GMM, and DBSCAN techniques. Using the Davies-Bouldin Index (DBI) for the optimal meal taking activity clustering, the results show that K-Means seems to be the best solution, thanks to its unrivalled efficiency in data demarcation, compared with the capabilities of GMM and DBSCAN. Although capable of identifying complex patterns and outliers, the latter methods are limited by their operational complexities and dependence on precise parameter configurations. In this paper, we have processed data from 4 houses equipped with sensors. The findings indicate that applying the K-Means method results in high performance, evidenced by a particularly low Davies-Bouldin Index (DBI), illustrating optimal cluster separation and cohesion. Calculating the average duration of each activity using the GMM algorithm allows distinguishing various categories of meal-taking activities. Alternatively, this can correspond to different times of the day fitting to each meal-taking activity. Using K-Means, GMM, and DBSCAN clustering algorithms, the study demonstrates an effective strategy for thoroughly understanding the data. This approach facilitates the comparison and selection of the most suitable method for optimal meal-taking activity clustering.
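The model-selection step described above, clustering with K-Means and scoring by the Davies-Bouldin Index, is directly available in scikit-learn; a minimal version on synthetic (hour-of-day, duration) meal features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for sensor-derived meal events: (hour, duration in min).
rng = np.random.default_rng(0)
meals = np.vstack([rng.normal([8, 10], 0.3, (40, 2)),    # breakfast-like
                   rng.normal([12, 25], 0.3, (40, 2)),   # lunch-like
                   rng.normal([19, 30], 0.3, (40, 2))])  # dinner-like

# Pick the number of clusters with the lowest DBI (lower = better separation).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(meals)
    scores[k] = davies_bouldin_score(meals, labels)
best_k = min(scores, key=scores.get)
print(best_k)
```

On real house data the features would come from the sensor pipeline rather than a simulator, but the DBI-driven selection loop is the same.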

[LG-92] LibMOON: A Gradient-based MultiObjective OptimizatioN Library in PyTorch

链接: https://arxiv.org/abs/2409.02969
作者: Xiaoyuan Zhang,Liang Zhao,Yingying Yu,Xi Lin,Zhenkun Wang,Han Zhao,Qingfu Zhang
关键词-EN: Multiobjective optimization problems, robustness constraints, prevalent in machine, applications in multi-task, fairness or robustness
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Multiobjective optimization problems (MOPs) are prevalent in machine learning, with applications in multi-task learning, learning under fairness or robustness constraints, etc. Instead of reducing multiple objective functions into a scalar objective, MOPs aim to optimize for the so-called Pareto optimality or Pareto set learning, which involves optimizing more than one objective function simultaneously, over models with millions of parameters. Existing benchmark libraries for MOPs mainly focus on evolutionary algorithms, most of which are zeroth-order methods that do not effectively utilize higher-order information from objectives and cannot scale to large-scale models with millions of parameters. In light of the above gap, this paper introduces LibMOON, the first multiobjective optimization library that supports state-of-the-art gradient-based methods, provides a fair benchmark, and is open-sourced for the community.

[LG-93] Do We Trust What They Say or What They Do? A Multimodal User Embedding Provides Personalized Explanations

链接: https://arxiv.org/abs/2409.02965
作者: Zhicheng Ren,Zhiping Xiao,Yizhou Sun
关键词-EN: analyzing social network, social network user, social media, network user data, user
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of social media, the importance of analyzing social network user data has also been put on the agenda. User representation learning in social media is a critical area of research, based on which we can conduct personalized content delivery, or detect malicious actors. Being more complicated than many other types of data, social network user data has inherent multimodal nature. Various multimodal approaches have been proposed to harness both text (i.e. post content) and relation (i.e. inter-user interaction) information to learn user embeddings of higher quality. The advent of Graph Neural Network models enables more end-to-end integration of user text embeddings and user interaction graphs in social networks. However, most of those approaches do not adequately elucidate which aspects of the data - text or graph structure information - are more helpful for predicting each specific user under a particular task, putting some burden on personalized downstream analysis and untrustworthy information filtering. We propose a simple yet effective framework called Contribution-Aware Multimodal User Embedding (CAMUE) for social networks. We have demonstrated with empirical evidence, that our approach can provide personalized explainable predictions, automatically mitigating the impact of unreliable information. We also conducted case studies to show how reasonable our results are. We observe that for most users, graph structure information is more trustworthy than text information, but there are some reasonable cases where text helps more. Our work paves the way for more explainable, reliable, and effective social media user embedding which allows for better personalized content delivery.

[LG-94] Multi-Modal Adapter for Vision-Language Models

链接: https://arxiv.org/abs/2409.02958
作者: Dominykas Seputis,Serghei Mihailov,Soham Chatterjee,Zehao Xiao
关键词-EN: Large pre-trained vision-language, Large pre-trained, pre-trained vision-language models, requiring retraining, image classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large pre-trained vision-language models, such as CLIP, have demonstrated state-of-the-art performance across a wide range of image classification tasks, without requiring retraining. Few-shot CLIP is competitive with existing specialized architectures that were trained on the downstream tasks. Recent research demonstrates that the performance of CLIP can be further improved using lightweight adaptation approaches. However, previous methods adapt different modalities of the CLIP model individually, ignoring the interactions and relationships between visual and textual representations. In this work, we propose Multi-Modal Adapter, an approach for Multi-Modal adaptation of CLIP. Specifically, we add a trainable Multi-Head Attention layer that combines text and image features to produce an additive adaptation of both. Multi-Modal Adapter demonstrates improved generalizability, based on its performance on unseen classes compared to existing adaptation methods. We perform additional ablations and investigations to validate and interpret the proposed approach.
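The additive adaptation described above — a trainable attention layer over the frozen text and image features whose output is added back to both — can be sketched with a single head. This is an illustrative simplification, not the paper's Multi-Modal Adapter: the paper uses multi-head attention, and the dimension, the scale factor `alpha`, and the random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # shared feature dimension (CLIP uses larger, e.g. 512)

# Frozen CLIP-style features for one image and one class prompt (toy values).
img_feat = rng.normal(size=d)
txt_feat = rng.normal(size=d)

# Trainable projections of a single attention head.
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))

def adapter(img, txt, alpha=0.2):
    """Attend over the two modality tokens and add the result residually."""
    tokens = np.stack([img, txt])                     # (2, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
    adapted = attn @ v                                # additive adaptation
    return img + alpha * adapted[0], txt + alpha * adapted[1]

img_adapted, txt_adapted = adapter(img_feat, txt_feat)
```

Because each token attends over both tokens, the adaptation of the image feature depends on the text feature and vice versa, which is exactly the cross-modality interaction the abstract says earlier adapters ignored.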

[LG-95] Comparison of Epilepsy Induced by Ischemic Hypoxic Brain Injury and Hypoglycemic Brain Injury using Multilevel Fusion of Data Features

链接: https://arxiv.org/abs/2409.02957
作者: Sameer Kadem,Noor Sami,Ahmed Elaraby,Shahad Alyousif,Mohammed Jalil,M. Altaee,Muntather Almusawi,A. Ghany Ismaeel,Ali Kamil Kareem,Massila Kamalrudin,Adnan Allwi ftaiet
关键词-EN: Bayesian Neural Network, brain damage caused, Support Vector Machine, Neural Network, Bayesian Neural
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注: 16 Pages, 12 Figures, 2 Tables

点击查看摘要

Abstract:The study aims to investigate the similarities and differences in the brain damage caused by Hypoxia-Ischemia (HI), Hypoglycemia, and Epilepsy. Hypoglycemia poses a significant challenge in improving glycemic regulation for insulin-treated patients, while HI brain disease in neonates is associated with low oxygen levels. The study examines the possibility of using a combination of medical data and Electroencephalography (EEG) measurements to predict outcomes over a two-year period, employing a multilevel fusion of data features to enhance the accuracy of the predictions. This paper therefore proposes a hybridized classification model for Hypoxia-Ischemia and Hypoglycemia, Epilepsy brain injury (HCM-BI). A Support Vector Machine is applied with clinical details to define the Hypoxia-Ischemia outcomes of each infant, and the newborns are reassessed every two years to track their neural development. A selection of four attributes is derived from the EEG records, and since the SVM alone does not reach conclusions regarding the classification of diseases, the final feature extraction of the EEG signal is optimized by a Bayesian Neural Network (BNN) to obtain a clear picture of the health condition of Hypoglycemia and Epilepsy patients. By monitoring and assessing the physical effects observed in the EEG, the BNN is used to extract the test samples with the most log data and to report hypoglycemia and epilepsy. Keywords: Hypoxia-Ischemia, Hypoglycemia, Epilepsy, Multilevel Fusion of Data Features, Bayesian Neural Network (BNN), Support Vector Machine (SVM)

[LG-96] A Note On Deterministic Submodular Maximization With Bounded Curvature

链接: https://arxiv.org/abs/2409.02943
作者: Wenxin Li
关键词-EN: Buchbinder and Feldman, recent breakthrough result, kappa, approximate algorithm, function with curvature
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We show that the recent breakthrough result of [Buchbinder and Feldman, FOCS’24] could further lead to a deterministic $(1-\kappa_f/e-\varepsilon)$-approximate algorithm for maximizing a submodular function with curvature $\kappa_f$ under a matroid constraint.

[LG-97] CortexCompile: Harnessing Cortical-Inspired Architectures for Enhanced Multi-Agent NLP Code Synthesis

链接: https://arxiv.org/abs/2409.02938
作者: Gautham Ramachandran,Rick Yang
关键词-EN: automated code generation, Natural Language Processing, lack real-time adaptability, automated code, complex programming tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Current approaches to automated code generation often rely on monolithic models that lack real-time adaptability and scalability. This limitation is particularly evident in complex programming tasks that require dynamic adjustment and efficiency. The integration of neuroscience principles into Natural Language Processing (NLP) has the potential to revolutionize automated code generation. This paper presents CortexCompile, a novel modular system inspired by the specialized functions of the human brain’s cortical regions. By emulating the distinct roles of the Prefrontal Cortex, Parietal Cortex, Temporal Lobe, and Motor Cortex, CortexCompile achieves significant advancements in scalability, efficiency, and adaptability compared to traditional monolithic models like GPT-4o. The system’s architecture features a Task Orchestration Agent that manages dynamic task delegation and parallel processing, facilitating the generation of highly accurate and optimized code across increasingly complex programming tasks. Experimental evaluations demonstrate that CortexCompile consistently outperforms GPT-4o in development time, accuracy, and user satisfaction, particularly in tasks involving real-time strategy games and first-person shooters. These findings underscore the viability of neuroscience-inspired architectures in addressing the limitations of current NLP models, paving the way for more efficient and human-like AI systems.

[LG-98] Iterative thresholding for non-linear learning in the strong varepsilon-contamination model

链接: https://arxiv.org/abs/2409.03703
作者: Arvind Rathnashyam,Alex Gittens
关键词-EN: possibly corrupted adversarially, thresholded gradient descent, single neuron models, learning single neuron, epsilon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:We derive approximation bounds for learning single neuron models using thresholded gradient descent when both the labels and the covariates are possibly corrupted adversarially. We assume the data follows the model $y = \sigma(\mathbf{w}^* \cdot \mathbf{x}) + \xi$, where $\sigma$ is a nonlinear activation function, the noise $\xi$ is Gaussian, and the covariate vector $\mathbf{x}$ is sampled from a sub-Gaussian distribution. We study sigmoidal, leaky-ReLU, and ReLU activation functions and derive a $O(\nu\sqrt{\epsilon}\log(1/\epsilon))$ approximation bound in $\ell_2$-norm, with sample complexity $O(d/\epsilon)$ and failure probability $e^{-\Omega(d)}$. We also study the linear regression problem, where $\sigma(\mathbf{x}) = \mathbf{x}$. We derive a $O(\nu\epsilon\log(1/\epsilon))$ approximation bound, improving upon the previous $O(\nu)$ approximation bounds for the gradient-descent based iterative thresholding algorithms of Bhatia et al. (NeurIPS 2015) and Shen and Sanghavi (ICML 2019). Our algorithm has a $O(\mathrm{polylog}(N,d)\log(R/\epsilon))$ runtime complexity when $\|\mathbf{w}^*\|_2 \leq R$, improving upon the $O(\mathrm{polylog}(N,d)/\epsilon^2)$ runtime complexity of Awasthi et al. (NeurIPS 2022).
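The iterative-thresholding idea this line of work builds on can be illustrated in the linear case: take gradient steps while keeping only the samples with the smallest residuals, so corrupted points are dropped from each update. A toy numpy sketch, with all constants chosen for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 500, 5, 0.1          # samples, dimension, contamination rate

w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

# Adversarial-style corruption of an eps-fraction of the labels.
bad = rng.choice(n, size=int(eps * n), replace=False)
y[bad] += rng.normal(loc=10.0, size=bad.size)

def thresholded_gd(X, y, eps, steps=200, lr=0.5):
    """Gradient descent on the (1 - eps) fraction of smallest residuals."""
    n = len(y)
    keep = int((1.0 - eps) * n)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        r = X @ w - y
        idx = np.argsort(np.abs(r))[:keep]   # threshold: drop large residuals
        w -= lr * X[idx].T @ r[idx] / keep
    return w

w_hat = thresholded_gd(X, y, eps)
```

With 10% label contamination the kept set quickly concentrates on the clean samples, and the estimate lands close to `w_true`; the paper's contribution is the sharper analysis of this style of algorithm, including nonlinear activations.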

[LG-99] Predicting quantum channels over general product distributions

链接: https://arxiv.org/abs/2409.03684
作者: Sitan Chen,Jaume de Dios Pont,Jun-Ting Hsieh,Hsin-Yuan Huang,Jane Lange,Jerry Li
关键词-EN: unknown quantum channels, investigate the problem, problem of predicting, predicting the output, output behavior
类目: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 20 pages, comments welcome

点击查看摘要

Abstract:We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $E$ and an observable $O$, we aim to learn the mapping $\rho \mapsto \mathrm{Tr}(O\,E[\rho])$ to within a small error for most $\rho$ sampled from a distribution $D$. Previously, Huang, Chen, and Preskill proved a surprising result that even if $E$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $D$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $D$, provided it is not “classical”, in which case there is a trivial exponential lower bound. Our method employs a “biased Pauli analysis,” analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.

[LG-100] A method to benchmark high-dimensional process drift detection

链接: https://arxiv.org/abs/2409.03669
作者: Edgar Wolf,Tobias Windisch
关键词-EN: multi-variate finite time, finite time series, time series data, series data coming, manufacturing processes
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Process curves are multi-variate finite time series data coming from manufacturing processes. This paper studies machine learning methods for detecting drifts in process curves. A theoretic framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which quantifies how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is shown.
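One way to picture the benchmark setup — synthetic process curves, a controlled drift segment, and a score for how well a detector separates drift curves from normal ones — is the toy sketch below. The drift generator, the distance-based detector, and the plain AUC used here are stand-ins chosen for illustration; the paper's temporal area under the curve and its generation framework are defined differently.

```python
import numpy as np

rng = np.random.default_rng(3)

# 200 process curves of length 50; curves 120-159 lie in a drift segment
# where a slowly growing offset is added (a stand-in drift generator).
n_curves, T = 200, 50
curves = rng.normal(size=(n_curves, T))
drift = np.zeros(n_curves, dtype=bool)
drift[120:160] = True
curves[drift] += np.linspace(0.5, 2.0, drift.sum())[:, None]

# Score each curve by its distance to the mean of the first 100
# (drift-free) curves -- a deliberately simple drift detector.
ref_mean = curves[:100].mean(axis=0)
scores = np.linalg.norm(curves - ref_mean, axis=1)

def auc(scores, labels):
    """Probability that a random drift curve outscores a random normal one."""
    pos, neg = scores[labels], scores[~labels]
    return float((pos[:, None] > neg[None, :]).mean())
```

Because the ground-truth drift segment is known by construction, any detector's scores can be ranked against it, which is what makes synthetic generation useful for benchmarking.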

[LG-101] Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization

链接: https://arxiv.org/abs/2409.03655
作者: Zexin Cai,Henry Li Xinyuan,Ashi Garg,Leibny Paola García-Perera,Kevin Duh,Sanjeev Khudanpur,Nicholas Andrews,Matthew Wiesner
关键词-EN: personally identifiable information, unprecedented access, access to personally, personally identifiable, Advances
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: accepted by 2024 IEEE Spoken Language Technology Workshop

点击查看摘要

Abstract:Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we developed various speaker anonymization pipelines and find that approaches either excel at anonymization or preserving emotion state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we found that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.

[LG-102] DART2: a robust multiple testing method to smartly leverage helpful or misleading ancillary information

链接: https://arxiv.org/abs/2409.03618
作者: Xuechan Li,Jichun Xie
关键词-EN: ancillary information, ancillary, information, reflecting the hypothesis, alternative status
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:In many applications of multiple testing, ancillary information is available, reflecting the hypothesis null or alternative status. Several methods have been developed to leverage this ancillary information to enhance testing power, typically requiring the ancillary information is helpful enough to ensure favorable performance. In this paper, we develop a robust and effective distance-assisted multiple testing procedure named DART2, designed to be powerful and robust regardless of the quality of ancillary information. When the ancillary information is helpful, DART2 can asymptotically control FDR while improving power; otherwise, DART2 can still control FDR and maintain power at least as high as ignoring the ancillary information. We demonstrated DART2’s superior performance compared to existing methods through numerical studies under various settings. In addition, DART2 has been applied to a gene association study where we have shown its superior accuracy and robustness under two different types of ancillary information.

[LG-103] Survey of Data-driven Newsvendor: Unified Analysis and Spectrum of Achievable Regrets

链接: https://arxiv.org/abs/2409.03505
作者: Zhuoxin Chen,Will Ma
关键词-EN: Newsvendor problem, guess the number, asymmetric consequences, consequences for guessing, Data-driven Newsvendor
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the Newsvendor problem, the goal is to guess the number that will be drawn from some distribution, with asymmetric consequences for guessing too high vs. too low. In the data-driven version, the distribution is unknown, and one must work with samples from the distribution. Data-driven Newsvendor has been studied under many variants: additive vs. multiplicative regret, high probability vs. expectation bounds, and different distribution classes. This paper studies all combinations of these variants, filling in many gaps in the literature and simplifying many proofs. In particular, we provide a unified analysis based on the notion of clustered distributions, which in conjunction with our new lower bounds, shows that the entire spectrum of regrets between $1/\sqrt{n}$ and $1/n$ can be possible.
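For intuition on the problem being analyzed: the standard sample-average solution to the data-driven Newsvendor orders the empirical quantile of demand at the critical ratio $b/(b+h)$, where $b$ and $h$ are the per-unit under- and over-ordering costs. A minimal sketch (the demand distribution and cost parameters are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

# Per-unit cost of ordering too little (b, lost sales) vs too much (h, holding).
b, h = 4.0, 1.0
critical_ratio = b / (b + h)       # optimal order = this quantile of demand

demand = rng.exponential(scale=100.0, size=2000)   # illustrative demand data

# Sample-average-approximation decision: the empirical critical-ratio quantile.
q_hat = np.quantile(demand, critical_ratio)

def newsvendor_cost(q, d):
    return b * np.maximum(d - q, 0.0) + h * np.maximum(q - d, 0.0)

# For Exp(scale=100) demand the true optimum is known in closed form.
q_star = -100.0 * np.log(1.0 - critical_ratio)
```

The regret bounds surveyed in the paper quantify how fast decisions like `q_hat` approach the true optimum `q_star` as the sample size grows.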

[LG-104] Maximum likelihood inference for high-dimensional problems with multiaffine variable relations

链接: https://arxiv.org/abs/2409.03495
作者: Jean-Sébastien Brouillon,Florian Dörfler,Giancarlo Ferrari-Trecate
关键词-EN: Maximum Likelihood Estimation, Maximum Likelihood, Likelihood Estimation, Estimation of continuous, potentially complex probability
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Maximum Likelihood Estimation of continuous variable models can be very challenging in high dimensions, due to potentially complex probability distributions. The existence of multiple interdependencies among variables can make it very difficult to establish convergence guarantees. This leads to a wide use of brute-force methods, such as grid searching and Monte-Carlo sampling and, when applicable, complex and problem-specific algorithms. In this paper, we consider inference problems where the variables are related by multiaffine expressions. We propose a novel Alternating and Iteratively-Reweighted Least Squares (AIRLS) algorithm, and prove its convergence for problems with Generalized Normal Distributions. We also provide an efficient method to compute the variance of the estimates obtained using AIRLS. Finally, we show how the method can be applied to graphical statistical models. We perform numerical experiments on several inference problems, showing significantly better performance than state-of-the-art approaches in terms of scalability, robustness to noise, and convergence speed due to an empirically observed super-linear convergence rate.
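The alternating idea behind AIRLS can be illustrated on the simplest multiaffine model, $y_i = (a^\top x_i)(b^\top z_i) + \xi_i$: with one factor fixed, the problem becomes ordinary linear least squares in the other. The sketch below keeps only that alternation (the iterative reweighting that handles generalized normal noise is omitted), and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X, Z = rng.normal(size=(n, 3)), rng.normal(size=(n, 3))
a_true, b_true = rng.normal(size=3), rng.normal(size=3)
y = (X @ a_true) * (Z @ b_true) + 0.01 * rng.normal(size=n)

def alternating_ls(X, Z, y, iters=50):
    """Fix one factor, solve the now-linear least squares, alternate."""
    a = rng.normal(size=X.shape[1])
    b = rng.normal(size=Z.shape[1])
    for _ in range(iters):
        # With b fixed, y ~ (X * (Z @ b)[:, None]) @ a is linear in a.
        a = np.linalg.lstsq(X * (Z @ b)[:, None], y, rcond=None)[0]
        b = np.linalg.lstsq(Z * (X @ a)[:, None], y, rcond=None)[0]
    return a, b

a_hat, b_hat = alternating_ls(X, Z, y)
pred = (X @ a_hat) * (Z @ b_hat)
# a and b are only identified up to a shared scale, so compare predictions.
```

Each sub-step is a cheap closed-form solve, which is why this family of methods scales to high dimensions where grid search and Monte-Carlo sampling do not.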

[LG-105] Distributionally Robust Optimisation with Bayesian Ambiguity Sets

链接: https://arxiv.org/abs/2409.03492
作者: Charita Dellaporta,Patrick O’Hara,Theodoros Damoulas
关键词-EN: data-generating process, Bayesian Ambiguity Sets, Distributionally Robust Optimisation, Decision making, DGP
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures. Under review

点击查看摘要

Abstract:Decision making under uncertainty is challenging since the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs about the model’s parameters. However, minimising the expected risk under these posterior beliefs can lead to sub-optimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against uncertainty in the model by optimising the worst-case risk over a posterior-informed ambiguity set. We show that our method admits a closed-form dual representation for many exponential family members and showcase its improved out-of-sample robustness against existing Bayesian DRO methodology in the Newsvendor problem.

[LG-106] Panopticon: a novel deep learning model to detect single transit events with no prior data filtering in PLATO light curves

链接: https://arxiv.org/abs/2409.03466
作者: H.G. Vivien,M. Deleuil,N. Jannsen,J. De Ridder,D. Seynaeve,M.-A. Carpine,Y. Zerah
关键词-EN: PLATO light curves, future PLATO light, light curves, PLATO light, deep learning model
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Submitted to A&A

点击查看摘要

Abstract:To prepare for the analyses of the future PLATO light curves, we develop a deep learning model, Panopticon, to detect transits in high precision photometric light curves. Since PLATO’s main objective is the detection of temperate Earth-size planets around solar-type stars, the code is designed to detect individual transit events. The filtering step, required by conventional detection methods, can affect the transit, which could be an issue for long and shallow transits. To protect transit shape and depth, the code is also designed to work on unfiltered light curves. We trained the model on a set of simulated PLATO light curves in which we injected, at pixel level, either planetary, eclipsing binary, or background eclipsing binary signals. We also include a variety of noises in our data, such as granulation, stellar spots or cosmic rays. The approach is able to recover 90% of our test population, including more than 25% of the Earth-analogs, even in the unfiltered light curves. The model also recovers the transits irrespective of the orbital period, and is able to retrieve transits on a unique event basis. These figures are obtained when accepting a false alarm rate of 1%. When keeping the false alarm rate low (0.01%), it is still able to recover more than 85% of the transit signals. Any transit deeper than 180ppm is essentially guaranteed to be recovered. This method is able to recover transits on a unique event basis, and does so with a low false alarm rate. Thanks to light curves being one-dimensional, model training is fast, on the order of a few hours per model. This speed in training and inference, coupled to the recovery effectiveness and precision of the model make it an ideal tool to complement, or be used ahead of, classical approaches.

[LG-107] Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data

链接: https://arxiv.org/abs/2409.03335
作者: Eyar Azar,Boaz Nadler
关键词-EN: data yields significantly, SSL, premise of semi-supervised, yields significantly, unlabeled data yields
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high dimensional sparse Gaussian classification. To construct an accurate classifier a key task is feature selection, detecting the few variables that separate the two classes. For this SSL setting, we analyze information theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to construct in polynomial time an accurate SSL classifier. However, any computationally efficient supervised or unsupervised learning schemes that separately use only the labeled or unlabeled data would fail. Our work highlights the provable benefits of combining labeled and unlabeled data for classification and feature selection in high dimensions. We present simulations that complement our theoretical analysis.

[LG-108] Fourier Neural Operators for Learning Dynamics in Quantum Spin Systems

链接: https://arxiv.org/abs/2409.03302
作者: Freya Shah,Taylor L. Patti,Julius Berner,Bahareh Tolooshams,Jean Kossaifi,Anima Anandkumar
关键词-EN: Fourier Neural Operators, Fourier Neural, Neural Operators, partial differential equations, functional data
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Fourier Neural Operators (FNOs) excel on tasks using functional data, such as those originating from partial differential equations. Such characteristics render them an effective approach for simulating the time evolution of quantum wavefunctions, which is a computationally challenging, yet coveted task for understanding quantum systems. In this manuscript, we use FNOs to model the evolution of random quantum spin systems, so chosen due to their representative quantum dynamics and minimal symmetry. We explore two distinct FNO architectures and examine their performance for learning and predicting time evolution using both random and low-energy input states. Additionally, we apply FNOs to a compact set of Hamiltonian observables ($\sim \mathrm{poly}(n)$) instead of the entire $2^n$ quantum wavefunction, which greatly reduces the size of our inputs and outputs and, consequently, the requisite dimensions of the resulting FNOs. Moreover, this Hamiltonian observable-based method demonstrates that FNOs can effectively distill information from high-dimensional spaces into lower-dimensional spaces. The extrapolation of Hamiltonian observables to times later than those used in training is of particular interest, as this stands to fundamentally increase the simulatability of quantum systems past both the coherence times of contemporary quantum architectures and the circuit-depths of tractable tensor networks.
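The core of an FNO layer — transform to Fourier space, apply learned weights to a truncated set of low-frequency modes, transform back — fits in a few lines for a 1-D signal. This strips away the channel dimension, pointwise bypass path, and nonlinearity of a real FNO block; the sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def fourier_layer(u, weights, modes):
    """FFT -> learned complex weights on the lowest `modes` frequencies
    (higher modes truncated to zero) -> inverse FFT."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = weights * u_hat[:modes]
    return np.fft.irfft(out_hat, n=len(u))

n, modes = 64, 8
weights = rng.normal(size=modes) + 1j * rng.normal(size=modes)
u = np.sin(2 * np.pi * np.arange(n) / n)    # a single low-frequency mode
v = fourier_layer(u, weights, modes)
# The output is band-limited: only the retained modes can be nonzero.
```

Because the learned weights live on a fixed number of modes rather than on grid points, the layer's parameter count is independent of the discretization, which is what makes FNOs natural for functional data.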

[LG-109] SpinMultiNet: Neural Network Potential Incorporating Spin Degrees of Freedom with Multi-Task Learning

链接: https://arxiv.org/abs/2409.03253
作者: Koki Ueno,Satoru Ohuchi,Kazuhide Ichikawa,Kei Amii,Kensuke Wakasugi
关键词-EN: Neural Network Potentials, Neural Network, Network Potentials, density functional theory, attracted significant attention
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Neural Network Potentials (NNPs) have attracted significant attention as a method for accelerating density functional theory (DFT) calculations. However, conventional NNP models typically do not incorporate spin degrees of freedom, limiting their applicability to systems where spin states critically influence material properties, such as transition metal oxides. This study introduces SpinMultiNet, a novel NNP model that integrates spin degrees of freedom through multi-task learning. SpinMultiNet achieves accurate predictions without relying on correct spin values obtained from DFT calculations. Instead, it utilizes initial spin estimates as input and leverages multi-task learning to optimize the spin latent representation while maintaining both E(3) and time-reversal equivariance. Validation on a dataset of transition metal oxides demonstrates the high predictive accuracy of SpinMultiNet. The model successfully reproduces the energy ordering of stable spin configurations originating from superexchange interactions and accurately captures the rhombohedral distortion of the rocksalt structure. These results pave the way for new possibilities in materials simulations that consider spin degrees of freedom, promising future applications in large-scale simulations of various material systems, including magnetic materials.

[LG-110] Non-stationary and Sparsely-correlated Multi-output Gaussian Process with Spike-and-Slab Prior

链接: https://arxiv.org/abs/2409.03149
作者: Wang Xinming,Li Yongxiang,Yue Xiaowei,Wu Jianguo
关键词-EN: Multi-output Gaussian process, Multi-output Gaussian, Gaussian process, method to leverage, MGP
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Multi-output Gaussian process (MGP) is commonly used as a transfer learning method to leverage information among multiple outputs. A key advantage of MGP is providing uncertainty quantification for prediction, which is highly important for subsequent decision-making tasks. However, traditional MGP may not be sufficiently flexible to handle multivariate data with dynamic characteristics, particularly when dealing with complex temporal correlations. Additionally, since some outputs may lack correlation, transferring information among them may lead to negative transfer. To address these issues, this study proposes a non-stationary MGP model that can capture both the dynamic and sparse correlation among outputs. Specifically, the covariance functions of MGP are constructed using convolutions of time-varying kernel functions. Then a dynamic spike-and-slab prior is placed on correlation parameters to automatically decide which sources are informative to the target output in the training process. An expectation-maximization (EM) algorithm is proposed for efficient model fitting. Both numerical studies and a real case demonstrate its efficacy in capturing dynamic and sparse correlation structure and mitigating negative transfer for high-dimensional time-series data. Finally, a mountain-car reinforcement learning case highlights its potential application in decision making problems.

[LG-111] Generative artificial intelligence for computational chemistry: a roadmap to predicting emergent phenomena

链接: https://arxiv.org/abs/2409.03118
作者: Pratyush Tiwary,Lukas Herron,Richard John,Suemin Lee,Disha Sanwal,Ruiyu Wang
关键词-EN: Generative Artificial Intelligence, Artificial Intelligence, introduced exciting possibilities, Generative Artificial, recent surge
类目: atistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:The recent surge in Generative Artificial Intelligence (AI) has introduced exciting possibilities for computational chemistry. Generative AI methods have made significant progress in sampling molecular structures across chemical species, developing force fields, and speeding up simulations. This Perspective offers a structured overview, beginning with the fundamental theoretical concepts in both Generative AI and computational chemistry. It then covers widely used Generative AI methods, including autoencoders, generative adversarial networks, reinforcement learning, flow models and language models, and highlights their selected applications in diverse areas including force field development, and protein/RNA structure prediction. A key focus is on the challenges these methods face before they become truly predictive, particularly in predicting emergent chemical phenomena. We believe that the ultimate goal of a simulation method or theory is to predict phenomena not seen before, and that Generative AI should be subject to these same standards before it is deemed useful for chemistry. We suggest that to overcome these challenges, future AI models need to integrate core chemical principles, especially from statistical mechanics.

[LG-112] How DREAMS are made: Emulating Satellite Galaxy and Subhalo Populations with Diffusion Models and Point Clouds

链接: https://arxiv.org/abs/2409.02980
作者: Tri Nguyen,Francisco Villaescusa-Navarro,Siddharth Mishra-Sharma,Carolina Cuesta-Lazaro,Paul Torrey,Arya Farahi,Alex M. Garcia,Jonah C. Rose,Stephanie O’Neil,Mark Vogelsberger,Xuejian Shen,Cian Roche,Daniel Anglés-Alcázar,Nitya Kallivayalil,Julian B. Muñoz,Francis-Yan Cyr-Racine,Sandip Roy,Lina Necib,Kassidy E. Kollmann
关键词-EN: host dark matter, dark matter, understanding of cosmology, Halo Occupation Distribution, host dark
类目: Astrophysics of Galaxies (astro-ph.GA); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
*备注: Submitted to ApJ; 30 + 6 pages; 11 + 4 figures; Comments welcomed

点击查看摘要

Abstract:The connection between galaxies and their host dark matter (DM) halos is critical to our understanding of cosmology, galaxy formation, and DM physics. To maximize the return of upcoming cosmological surveys, we need an accurate way to model this complex relationship. Many techniques have been developed to model this connection, from Halo Occupation Distribution (HOD) to empirical and semi-analytic models to hydrodynamic simulations. Hydrodynamic simulations can incorporate more detailed astrophysical processes but are computationally expensive; HODs, on the other hand, are computationally cheap but have limited accuracy. In this work, we present NeHOD, a generative framework based on a variational diffusion model and a Transformer, for painting galaxies/subhalos on top of DM with the accuracy of hydrodynamic simulations but at a computational cost similar to HOD. By modeling galaxies/subhalos as point clouds, instead of binning or voxelization, we can resolve small spatial scales down to the resolution of the simulations. For each halo, NeHOD predicts the positions, velocities, masses, and concentrations of its central and satellite galaxies. We train NeHOD on the TNG-Warm DM suite of the DREAMS project, which consists of 1024 high-resolution zoom-in hydrodynamic simulations of Milky Way-mass halos with varying warm DM mass and astrophysical parameters. We show that our model captures the complex relationships between subhalo properties as a function of the simulation parameters, including the mass functions, stellar-halo mass relations, concentration-mass relations, and spatial clustering. Our method can be used for a large variety of downstream applications, from galaxy clustering to strong lensing studies.

[LG-113] Fair Minimum Representation Clustering via Integer Programming

链接: https://arxiv.org/abs/2409.02963
作者: Connor Lawless,Oktay Gunluk
关键词-EN: unsupervised learning task, unsupervised learning, learning task, task that aims, aims to partition
类目: Optimization and Control (math.OC); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2302.03151

点击查看摘要

Abstract:Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g., electoral districts, playlists, TV channels) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g., 50% to elect their desired candidate). In this paper, we study the k-means and k-medians clustering problems with the additional constraint that each group (e.g., demographic group) must have a minimum level of representation in at least a given number of clusters. We formulate the problem through a mixed-integer optimization framework and present an alternating minimization algorithm, called MiniReL, that directly incorporates the fairness constraints. While incorporating the fairness criteria leads to an NP-Hard assignment problem within the algorithm, we provide computational approaches that make the algorithm practical even for large datasets. Numerical results show that the approach is able to create fairer clusters with practically no increase in the clustering cost across standard benchmark datasets.
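The crux of MiniReL is the assignment step under the minimum-representation constraint. The paper solves that step exactly as an ILP; the sketch below substitutes a greedy repair for illustration only, and all function names are assumptions, not the authors' code:

```python
def sqdist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def fair_assign(points, centers, group_of, min_count):
    """Assign points to nearest centers, then repair so that every group
    has at least `min_count` members in at least one cluster."""
    k = len(centers)
    assign = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
    for g in set(group_of):
        members = [i for i, gg in enumerate(group_of) if gg == g]
        counts = {}
        for i in members:
            counts[assign[i]] = counts.get(assign[i], 0) + 1
        if any(c >= min_count for c in counts.values()):
            continue  # constraint already satisfied for this group
        best = None  # (move cost, target cluster, members to move)
        for j in range(k):
            outside = sorted((sqdist(points[i], centers[j]), i)
                             for i in members if assign[i] != j)
            need = min_count - counts.get(j, 0)
            if need > len(outside):
                continue
            cost = sum(d for d, _ in outside[:need])
            if best is None or cost < best[0]:
                best = (cost, j, [i for _, i in outside[:need]])
        if best is not None:
            for i in best[2]:
                assign[i] = best[1]
    return assign
```

Alternating this constrained assignment with a standard center-update step gives the overall MiniReL-style loop.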

信息检索

[IR-0] WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild

链接: https://arxiv.org/abs/2409.03753
作者: Yuntian Deng,Wenting Zhao,Jack Hessel,Xiang Ren,Claire Cardie,Yejin Choi
关键词-EN: offers exciting opportunities, data offers exciting, study user-chatbot interactions, conversation data offers, real-world conversation data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing availability of real-world conversation data offers exciting opportunities for researchers to study user-chatbot interactions. However, the sheer volume of this data makes manually examining individual conversations impractical. To overcome this challenge, we introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis. WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria. To manage million-scale datasets, we implemented optimizations including search index construction, embedding precomputation and compression, and caching to ensure responsive user interactions within seconds. We demonstrate WildVis’s utility through three case studies: facilitating chatbot misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns. WildVis is open-source and designed to be extendable, supporting additional datasets and customized search and visualization functionalities.
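The combination of index construction and caching that keeps million-scale search responsive can be sketched minimally as below. This is a hypothetical toy, not WildVis's actual implementation (which also covers embedding-space search and compression):

```python
from functools import lru_cache

class ChatIndex:
    """Inverted index over conversations with cached query results."""

    def __init__(self, conversations):
        self.conversations = list(conversations)
        self.index = {}  # token -> set of conversation ids
        for cid, text in enumerate(self.conversations):
            for tok in set(text.lower().split()):
                self.index.setdefault(tok, set()).add(cid)
        # cache repeated queries so responses stay interactive
        self.search = lru_cache(maxsize=1024)(self._search)

    def _search(self, query):
        toks = query.lower().split()
        if not toks:
            return ()
        hits = set.intersection(*(self.index.get(t, set()) for t in toks))
        return tuple(sorted(hits))
```

Precomputing the index once and caching query results are the same two optimizations the abstract describes, just at toy scale.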

[IR-1] RAG based Question-Answering for Contextual Response Prediction System CIKM’24

链接: https://arxiv.org/abs/2409.03708
作者: Sriram Veturi,Saurabh Vaichal,Nafis Irtiza Tripto,Reshma Lal Jagadheesh,Nian Yan
关键词-EN: Large Language Models, Natural Language Processing, Large Language, Language Models, Language Processing
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted at the 1st Workshop on GenAI and RAG Systems for Enterprise, CIKM’24. 6 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
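The retrieve-then-generate loop described above can be sketched as follows. This is a toy TF-IDF retriever with the LLM call stubbed out as a prompt builder; it is an illustration of the RAG pattern, not the paper's production system:

```python
import math
from collections import Counter

def _vec(tokens, idf):
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def _cos(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Rank knowledge documents by TF-IDF cosine similarity to the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    qv = _vec(query.lower().split(), idf)
    ranked = sorted(range(n), key=lambda i: _cos(qv, _vec(tokenized[i], idf)),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]

def build_prompt(query, docs, history=()):
    """Assemble retrieved context plus chat history; an LLM call would follow."""
    context = retrieve(query, docs)
    return ("Context:\n" + "\n".join(context) +
            "\nHistory:\n" + "\n".join(history) +
            f"\nQuestion: {query}\nAnswer:")
```

In the paper's setting the retriever runs over the company knowledge base and the prompt is sent to an LLM to produce agent-facing response suggestions.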

[IR-2] HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps KDD’21

链接: https://arxiv.org/abs/2409.03504
作者: Jizhou Huang,Haifeng Wang,Yibo Sun,Miao Fan,Zhengjie Huang,Chunyuan Yuan,Yawen Li
关键词-EN: Baidu Maps, increasing interest, interest in international, interests in multiple, international travel
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD’21

点击查看摘要

Abstract:The increasing interest in international travel has raised the demand of retrieving point of interests in multiple languages. This is even superior to find local venues such as restaurants and scenic spots in unfamiliar languages when traveling abroad. Multilingual POI retrieval, enabling users to find desired POIs in a demanded language using queries in numerous languages, has become an indispensable feature of today’s global map applications such as Baidu Maps. This task is non-trivial because of two key challenges: (1) visiting sparsity and (2) multilingual query-POI matching. To this end, we propose a Heterogeneous Graph Attention Matching Network (HGAMN) to concurrently address both challenges. Specifically, we construct a heterogeneous graph that contains two types of nodes: POI node and query node using the search logs of Baidu Maps. To alleviate challenge #1, we construct edges between different POI nodes to link the low-frequency POIs with the high-frequency ones, which enables the transfer of knowledge from the latter to the former. To mitigate challenge #2, we construct edges between POI and query nodes based on the co-occurrences between queries and POIs, where queries in different languages and formulations can be aggregated for individual POIs. Moreover, we develop an attention-based network to jointly learn node representations of the heterogeneous graph and further design a cross-attention module to fuse the representations of both types of nodes for query-POI relevance scoring. Extensive experiments conducted on large-scale real-world datasets from Baidu Maps demonstrate the superiority and effectiveness of HGAMN. In addition, HGAMN has already been deployed in production at Baidu Maps, and it successfully keeps serving hundreds of millions of requests every day.
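The two edge types HGAMN builds from search logs can be illustrated with a toy graph constructor. Function names and the popularity threshold are invented for illustration; the real system learns attention weights over this graph:

```python
from collections import Counter, defaultdict

def build_hetero_graph(search_logs, popular_threshold=2):
    """search_logs: iterable of (query, poi) click pairs.

    Returns query-POI co-occurrence edges (challenge 2: multilingual
    query-POI matching) and POI-POI edges linking low-frequency POIs
    to high-frequency ones sharing a query (challenge 1: visiting sparsity).
    """
    visits = Counter(poi for _, poi in search_logs)
    query_poi = defaultdict(set)
    for q, poi in search_logs:
        query_poi[q].add(poi)
    qp_edges = {(q, p) for q, pois in query_poi.items() for p in pois}
    popular = {p for p, c in visits.items() if c >= popular_threshold}
    pp_edges = set()
    for pois in query_poi.values():
        for a in pois:
            if a in popular:
                continue
            for b in pois:
                if b in popular:
                    pp_edges.add((a, b))  # knowledge flows popular -> rare
    return qp_edges, pp_edges
```

Queries in different languages that click the same POI end up aggregated on that POI node, which is what enables the cross-lingual matching.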

[IR-3] MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu’s Sponsored Search KDD’19

链接: https://arxiv.org/abs/2409.03449
作者: Miao Fan,Jiacheng Guo,Shuai Zhu,Shuo Miao,Mingming Sun,Ping Li
关键词-EN: web search engine, largest commercial web, Baidu runs, commercial web search, sponsored search engine
类目: Information Retrieval (cs.IR)
*备注: Accepted by KDD’19

点击查看摘要

Abstract:Baidu runs the largest commercial web search engine in China, serving hundreds of millions of online users every day in response to a great variety of queries. In order to build a high-efficiency sponsored search engine, we used to adopt a three-layer funnel-shaped structure to screen and sort hundreds of ads from billions of ad candidates subject to the requirement of low response latency and the restraints of computing resources. Given a user query, the top matching layer is responsible for providing semantically relevant ad candidates to the next layer, while the ranking layer at the bottom concerns more about business indicators (e.g., CPM, ROI, etc.) of those ads. The clear separation between the matching and ranking objectives results in a lower commercial return. The Mobius project has been established to address this serious issue. It is our first attempt to train the matching layer to consider CPM as an additional optimization objective besides the query-ad relevance, via directly predicting CTR (click-through rate) from billions of query-ad pairs. Specifically, this paper will elaborate on how we adopt active learning to overcome the insufficiency of click history at the matching layer when training our neural click networks offline, and how we use the SOTA ANN search technique for retrieving ads more efficiently (Here “ANN” stands for approximate nearest neighbor search). We contribute the solutions to Mobius-V1 as the first version of our next generation query-ad matching system.
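The shift from relevance-only matching to CPM-aware matching amounts to a change in the retrieval score. A hedged sketch of the idea follows; brute-force cosine stands in for the ANN index, and all names and the exact scoring form are assumptions, not Baidu's implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_ads(query_vec, ads, k=2):
    """ads: list of (ad_vec, predicted_ctr, bid) triples.

    Scores blend semantic relevance with expected revenue (CTR x bid),
    so the matching layer no longer optimizes relevance alone. A real
    system would replace the exhaustive scan with ANN search.
    """
    scored = [(cosine(query_vec, vec) * ctr * bid, i)
              for i, (vec, ctr, bid) in enumerate(ads)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

With relevance-only scoring, an irrelevant but lucrative ad can never win; with the blended score, a relevant ad with higher expected CPM is preferred among relevant candidates.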

[IR-4] Federated Prototype-based Contrastive Learning for Privacy-Preserving Cross-domain Recommendation

链接: https://arxiv.org/abs/2409.03294
作者: Li Wang,Quangui Zhang,Lei Sang,Qiang Wu,Min Xu
关键词-EN: Cross-domain recommendation, improve recommendation accuracy, recommendation accuracy, user, CDR
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Cross-domain recommendation (CDR) aims to improve recommendation accuracy in sparse domains by transferring knowledge from data-rich domains. However, existing CDR methods often assume the availability of user-item interaction data across domains, overlooking user privacy concerns. Furthermore, these methods suffer from performance degradation in scenarios with sparse overlapping users, as they typically depend on a large number of fully shared users for effective knowledge transfer. To address these challenges, we propose a Federated Prototype-based Contrastive Learning (CL) method for Privacy-Preserving CDR, named FedPCL-CDR. This approach utilizes non-overlapping user information and prototypes to improve multi-domain performance while protecting user privacy. FedPCL-CDR comprises two modules: local domain (client) learning and global server aggregation. In the local domain, FedPCL-CDR clusters all user data to learn representative prototypes, effectively utilizing non-overlapping user information and addressing the sparse overlapping user issue. It then facilitates knowledge transfer by employing both local and global prototypes returned from the server in a CL manner. Simultaneously, the global server aggregates representative prototypes from local domains to learn both local and global prototypes. The combination of prototypes and federated learning (FL) ensures that sensitive user data remains decentralized, with only prototypes being shared across domains, thereby protecting user privacy. Extensive experiments on four CDR tasks using two real-world datasets demonstrate that FedPCL-CDR outperforms the state-of-the-art baselines.
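The privacy mechanism, sharing cluster prototypes instead of raw user vectors, can be sketched with a toy k-means. Deterministic initialization and all function names are assumptions for illustration; the paper additionally uses the prototypes in a contrastive-learning objective, which is omitted here:

```python
def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm with deterministic init (sketch only)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            buckets[j].append(p)
        centers = [[sum(xs) / len(b) for xs in zip(*b)] if b else centers[j]
                   for j, b in enumerate(buckets)]
    return [tuple(c) for c in centers]

def client_prototypes(user_vectors, k=2):
    # only these centroids leave the client; raw user data stays local
    return kmeans(user_vectors, k)

def server_aggregate(all_client_prototypes):
    # the server pools prototypes across domains; it never sees users
    return [p for protos in all_client_prototypes for p in protos]
```

Because a centroid summarizes many users, non-overlapping users still contribute to cross-domain transfer without their individual interactions being shared.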

[IR-5] iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models

链接: https://arxiv.org/abs/2409.03284
作者: Yassir Lairgi,Ludovic Moncla,Rémy Cazabet,Khalid Benabdeslem,Pierre Cléau
关键词-EN: access valuable information, challenging to access, access valuable, making it challenging, building Knowledge Graphs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted at The International Web Information Systems Engineering conference (the WISE conference) 2024

点击查看摘要

Abstract:Most available data is unstructured, making it challenging to access valuable information. Automatically building Knowledge Graphs (KGs) is crucial for structuring data and making it accessible, allowing users to search for information effectively. KGs also facilitate insights, inference, and reasoning. Traditional NLP methods, such as named entity recognition and relation extraction, are key in information retrieval but face limitations, including the use of predefined entity types and the need for supervised learning. Current research leverages large language models’ capabilities, such as zero- or few-shot learning. However, unresolved and semantically duplicated entities and relations still pose challenges, leading to inconsistent graphs and requiring extensive post-processing. Additionally, most approaches are topic-dependent. In this paper, we propose iText2KG, a method for incremental, topic-independent KG construction without post-processing. This plug-and-play, zero-shot method is applicable across a wide range of KG construction scenarios and comprises four modules: Document Distiller, Incremental Entity Extractor, Incremental Relation Extractor, and Graph Integrator and Visualization. Our method demonstrates superior performance compared to baseline methods across three scenarios: converting scientific papers to graphs, websites to graphs, and CVs to graphs.
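A minimal sketch of the Graph Integrator idea, incremental triple insertion with duplicate-entity resolution, is below. The string-normalization matcher is a naive stand-in for the paper's LLM/embedding-based resolution, and the class design is an assumption:

```python
class IncrementalKG:
    """Accumulates triples across documents without a post-processing pass."""

    def __init__(self):
        self.entities = {}   # canonical key -> first surface form seen
        self.triples = set()

    def _resolve(self, name):
        # naive duplicate resolution: case/whitespace normalization stands
        # in for iText2KG's semantic entity matching
        key = " ".join(name.lower().split())
        self.entities.setdefault(key, name)
        return key

    def add_triples(self, triples):
        """Merge (subject, relation, object) triples from a new document."""
        for s, r, o in triples:
            self.triples.add((self._resolve(s), r.lower(), self._resolve(o)))

    def neighbors(self, name):
        key = " ".join(name.lower().split())
        return {(r, o) for s, r, o in self.triples if s == key}
```

Each new document's extracted triples merge into the same graph, so entities repeated across documents collapse to one node instead of producing duplicates.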

[IR-6] GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation

链接: https://arxiv.org/abs/2409.03140
作者: Ashirbad Mishra,Soumik Dey,Marshall Wu,Jinyu Zhao,He Yu,Kaichen Ni,Binbin Li,Kamesh Madduri
关键词-EN: Extreme Multi-Label Classification, Online sellers, listed products, enhance their sales, advertisers are recommended
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online sellers and advertisers are recommended keyphrases for their listed products, which they bid on to enhance their sales. One popular paradigm that generates such recommendations is Extreme Multi-Label Classification (XMC), which involves tagging/mapping keyphrases to items. We outline the limitations of using traditional item-query based tagging or mapping techniques for keyphrase recommendations on E-Commerce platforms. We introduce GraphEx, an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. Additionally, we demonstrate that relying on traditional metrics such as precision/recall can be misleading in practical applications, thereby necessitating a combination of metrics to evaluate performance in real-world scenarios. These metrics are designed to assess the relevance of keyphrases to items and the potential for buyer outreach. GraphEx outperforms production models at eBay, achieving the objectives mentioned above. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
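The core extraction idea, keyphrases as token permutations of item titles, can be sketched as below. Validating candidates against a set of known buyer queries is a simplified stand-in for GraphEx's graph construction, and the names are hypothetical:

```python
from itertools import permutations

def keyphrase_candidates(title, known_queries, max_len=3):
    """Generate keyphrases as ordered token permutations of the item title,
    keeping only those observed in a buyer query log."""
    tokens = title.lower().split()
    out = set()
    for n in range(1, min(max_len, len(tokens)) + 1):
        for combo in permutations(tokens, n):
            phrase = " ".join(combo)
            if phrase in known_queries:
                out.add(phrase)
    return out
```

Because permutations (not just contiguous n-grams) are considered, reorderings like "shoes nike" are recoverable when buyers actually search that way; capping `max_len` keeps the candidate space tractable.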

[IR-7] Do We Trust What They Say or What They Do? A Multimodal User Embedding Provides Personalized Explanations

链接: https://arxiv.org/abs/2409.02965
作者: Zhicheng Ren,Zhiping Xiao,Yizhou Sun
关键词-EN: analyzing social network, social network user, social media, network user data, user
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of social media, the importance of analyzing social network user data has also been put on the agenda. User representation learning in social media is a critical area of research, based on which we can conduct personalized content delivery, or detect malicious actors. Being more complicated than many other types of data, social network user data has inherent multimodal nature. Various multimodal approaches have been proposed to harness both text (i.e. post content) and relation (i.e. inter-user interaction) information to learn user embeddings of higher quality. The advent of Graph Neural Network models enables more end-to-end integration of user text embeddings and user interaction graphs in social networks. However, most of those approaches do not adequately elucidate which aspects of the data - text or graph structure information - are more helpful for predicting each specific user under a particular task, putting some burden on personalized downstream analysis and untrustworthy information filtering. We propose a simple yet effective framework called Contribution-Aware Multimodal User Embedding (CAMUE) for social networks. We have demonstrated with empirical evidence, that our approach can provide personalized explainable predictions, automatically mitigating the impact of unreliable information. We also conducted case studies to show how reasonable our results are. We observe that for most users, graph structure information is more trustworthy than text information, but there are some reasonable cases where text helps more. Our work paves the way for more explainable, reliable, and effective social media user embedding which allows for better personalized content delivery.
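The contribution-aware idea can be illustrated with an explicit per-user gate between the two modalities. The scalar-gate form below is a deliberate simplification assumed for illustration (CAMUE's actual gating is learned end-to-end):

```python
def fuse(text_emb, graph_emb, gate):
    """Per-user gate in [0, 1]: gate weights the graph modality, (1 - gate)
    the text modality, making each modality's contribution inspectable."""
    return [gate * g + (1.0 - gate) * t for t, g in zip(text_emb, graph_emb)]

def dominant_modality(gate):
    """Read off which signal the model trusted more for this user."""
    return "graph" if gate >= 0.5 else "text"
```

Inspecting the learned gate per user is what yields the personalized explanations the abstract describes, e.g. the observation that graph structure dominates for most users while text dominates for some.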

附件下载

点击下载今日全部论文列表