本篇博文主要展示 2024-08-21 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址,邮件同样在每天10:30左右定时自动发送。

目录

概览 (2024-08-21)

今日共更新455篇论文,其中:

  • 自然语言处理68篇(Computation and Language (cs.CL))
  • 人工智能165篇(Artificial Intelligence (cs.AI))
  • 计算机视觉115篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习122篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
[NLP-0] FLAME:学习在城市环境中使用多模态LLM导航

链接: https://arxiv.org/abs/2408.11051
作者: Yunzhe Xu,Yiyuan Pan,Zhe Liu,Hesheng Wang
关键词-EN: Large Language Models, Large Language, Language Models, specialized VLN models, applications face challenges
关键词-ZH: 大型语言模型、大型语言、语言模型、专业VLN模型、应用程序面临挑战
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME’s superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion rate on Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI. Project page: this https URL
摘要:大语言模型(LLM)在视觉与语言导航(VLN)任务中显示出了潜力,但目前的应用仍面临挑战。虽然LLM在一般对话场景中表现出色,但它们在专门的导航任务中表现不佳,性能逊于专用的VLN模型。本文介绍了FLAME(FLAMingo-Architected Embodied Agent),这是一种新的基于多模态LLM的智能体与架构,专为城市VLN任务设计,能够高效处理多个观测。我们的方法采用三阶段微调技术来有效适应导航任务,包括用于街景描述的单一感知微调、用于轨迹摘要的多感知微调,以及在VLN数据集上的端到端训练。扩充数据集由自动合成得到。实验结果表明,FLAME优于现有方法,在Touchdown数据集上的任务完成率比最先进方法高出7.3%。这项工作展示了多模态LLM(MLLM)在复杂导航任务中的潜力,代表着MLLM在具身智能(Embodied AI)实际应用方面的进步。项目页面:此HTTPS URL

[NLP-1] MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
[NLP-1] MagicDec:通过推测解码打破长上下文生成的延迟-吞吐量权衡

链接: https://arxiv.org/abs/2408.11049
作者: Jian Chen,Vashisth Tiwari,Ranajoy Sadhukhan,Zhuoming Chen,Jinyuan Shi,Ian En-Hsu Yen,Beidi Chen
关键词-EN: Large Language Models, Large Language, serve long-context requests, Language Models, interactive chatbots
关键词-ZH: 大型语言模型、大型语言、服务长上下文请求、语言模型、交互式聊天机器人
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency without sacrificing performance but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high throughput inference. Then, it leverages draft models with sparse KV cache to address the KV bottleneck that scales with both sequence length and batch size.
摘要:大语言模型(LLM)在交互式聊天机器人、文档分析和智能体工作流等长上下文应用中日益普及,但以低延迟和高吞吐量服务长上下文请求仍具有挑战性。推测解码(SD)是一种广泛使用的、在不牺牲性能的前提下降低延迟的技术,但传统观点认为其有效性仅限于小批量场景。在MagicDec中,我们出人意料地证明,即使在中长序列的高吞吐量推理场景下,SD也能实现加速。更有趣的是,基于我们的严格分析,智能的草稿(drafting)策略可以随着批量增大实现更好的加速比。MagicDec首先识别随批量和序列长度增加而发生的瓶颈转移,并利用这些洞察更有效地部署推测解码以实现高吞吐量推理;然后,它利用带稀疏KV缓存的草稿模型来解决随序列长度和批量同时扩展的KV瓶颈。
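推测解码的核心"草稿提议、目标验证"流程可以用下面的极简Python示意。其中 draft_next、target_next 均为虚构的玩具函数,仅用于演示接受逻辑,并非 MagicDec 的实际实现:

```python
# 推测解码(speculative decoding)的极简示意:
# 草稿模型一次提议 k 个 token,目标模型并行验证,
# 接受与目标模型预测一致的最长前缀。此处用两个玩具函数代替真实模型。

def draft_next(token):        # 假设的草稿模型:便宜但可能出错
    return (token * 2 + 1) % 7

def target_next(token):       # 假设的目标模型:权威但昂贵
    return (token * 2 + 1) % 7 if token % 3 else (token + 1) % 7

def speculative_step(last_token, k):
    # 1) 草稿模型自回归地提议 k 个候选 token
    proposals = []
    t = last_token
    for _ in range(k):
        t = draft_next(t)
        proposals.append(t)
    # 2) 目标模型对候选序列做一次验证(真实系统中为并行前向)
    accepted = []
    t = last_token
    for p in proposals:
        expect = target_next(t)
        if p != expect:
            accepted.append(expect)  # 第一个分歧处改用目标模型的 token
            break
        accepted.append(p)
        t = p
    return accepted
```

当草稿与目标完全一致时,一次验证即可产出 k 个 token,这正是延迟收益的来源;MagicDec 的贡献在于分析该收益在大批量、长序列下何时仍然成立。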

[NLP-2] Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders ECAI24
[NLP-2] 黑匣子内部:检测预训练语言编码器中的数据泄露

链接: https://arxiv.org/abs/2408.11046
作者: Yuan Xin,Zheng Li,Ning Yu,Dingfan Chen,Mario Fritz,Michael Backes,Yang Zhang
关键词-EN: Natural Language Processing, copyright concerns due, large-scale web-scraped data, field of Natural, Language Processing
关键词-ZH: 自然语言处理、版权问题、大规模网络抓取数据、自然、语言处理领域
类目: Computation and Language (cs.CL)
备注: ECAI24

点击查看摘要

Abstract:Despite being prevalent in the general field of Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns due to their nature of training on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, specifically focusing on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders-an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. Alongside, we present in-depth analysis and insights toward guiding future researchers and practitioners in addressing the privacy considerations in developing pre-trained language models.
摘要:尽管在自然语言处理(NLP)的一般领域中普遍存在,但由于其针对大规模网络抓取数据的训练的性质,预训练的语言模型固有地存在隐私和版权问题。在本文中,我们率先对与预训练的语言编码器相关的此类风险进行了系统的探索,特别是关注了通过改编自预训练的语言编码器的下游模型暴露的预训练数据的成员泄漏-这一方面在现有文献中基本上被忽视了。我们的研究涵盖了四种类型的预训练编码器体系结构、三个具有代表性的下游任务和五个基准数据集的全面实验。有趣的是,我们的评估首次揭示了成员泄露的存在,即使只暴露了下游模型的黑盒输出,这突显了隐私风险远远大于之前的假设。同时,我们还提供了深入的分析和见解,以指导未来的研究人员和实践者在开发预先培训的语言模型时解决隐私考虑因素。
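基于置信度阈值的黑盒成员推断是此类攻击最基础的形式,可用如下草图说明。阈值与logit数值均为虚构演示,论文中针对编码器下游模型的攻击方法远比这复杂:

```python
import math

# 黑盒成员推断攻击的最简示意:仅凭下游模型输出的置信度,
# 用阈值判断某样本是否出现在训练集中(阈值与数据均为虚构演示值)。

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def confidence_attack(logits, threshold=0.9):
    # 直觉:训练样本往往被模型记忆,输出置信度偏高;
    # 最大类别概率超过阈值即判为"成员"(member)。
    return max(softmax(logits)) >= threshold

# 虚构的两组黑盒输出:训练集样本 vs. 未见样本
member_logits = [8.0, 1.0, 0.5]      # 高置信度
nonmember_logits = [2.0, 1.5, 1.0]   # 低置信度
```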

[NLP-3] Scaling Law with Learning Rate Annealing
[NLP-3] 带学习率退火的缩放定律

链接: https://arxiv.org/abs/2408.11029
作者: Howe Tissue,Venus Wang,Lu Wang
关键词-EN: models empirically adhere, rate annealing area, scaling law, learning rate, cross-entropy loss curves
关键词-ZH: 模型经验上符合、学习率退火区域、缩放定律、学习率、交叉熵损失曲线
类目: Computation and Language (cs.CL)
备注: 25 pages, 23 figures

点击查看摘要

Abstract:We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps s: L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2, where S_1 is the forward area and S_2 is the learning rate annealing area. This formulation takes into account two factors: (1) The forward scaling defined as typical scaling law, and (2) the additional loss drop brought by LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss of language model training at any given step and across any learning rate scheduler (LRS). Furthermore, this equation accurately describes the dynamics during the training process, and provides a theoretical verification and explanation for numerous experimental findings of previous studies, particularly those focusing on LR schedule and LR annealing. The resulting insights also serve as a guide for researchers to select critical LRS in advance by prediction using our equation. Most significantly, since all the points in a full training curve follow the equation, we can achieve accurate loss prediction at any given step across any learning rate scheduler, while expending less than 1% of the computational cost required by the chinchilla scaling law to fit language modeling loss. This approach extremely democratizes scaling law fitting and predicting in developing large language models.
摘要:我们发现,神经语言模型的交叉熵损失曲线在经验上符合一个带学习率(LR)退火的缩放定律,随训练步数 s 变化:L(s) = L_0 + A·S_1^{-α} − C·S_2,其中 S_1 为前向面积(forward area),S_2 为学习率退火面积。该公式考虑了两个因素:(1)定义为典型缩放定律的前向缩放;(2)LR退火带来的额外损失下降。因此,该公式可以描述每一步的完整损失曲线,而不仅是训练结束时的单个损失点。应用带LR退火的缩放定律并只拟合一到两条训练曲线,我们就能准确预测语言模型训练在任意给定步数、任意学习率调度器(LRS)下的损失。此外,该方程准确描述了训练过程中的动态,并为以往研究的大量实验发现提供了理论验证和解释,特别是关于LR调度和LR退火的结果。由此得到的洞察也可指导研究者通过该方程的预测提前选择关键的LRS。最重要的是,由于完整训练曲线中的所有点都服从该方程,我们可以在任意学习率调度器下、任意给定步数上实现准确的损失预测,而所需计算成本不到用Chinchilla缩放定律拟合语言建模损失所需的1%。这一方法极大降低了在开发大语言模型时进行缩放定律拟合与预测的门槛。
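按摘要中的公式,给定拟合参数后即可逐步计算整条损失曲线。下面是一个演示性的草图,其中 S_2 的具体累积定义与所有参数取值均为假设,仅示意计算方式,并非论文的拟合结果:

```python
# 按公式 L(s) = L0 + A * S1^(-alpha) - C * S2 逐步计算损失曲线。
# S1 为学习率的前向面积(累积和);S2 此处定义为峰值学习率与
# 当前学习率之差的累积(演示性假设,并非论文原始定义)。

def loss_curve(lrs, L0, A, alpha, C):
    losses = []
    s1 = 0.0          # 前向面积:学习率的累积和
    s2 = 0.0          # 退火面积:相对峰值学习率的衰减量累积
    lr_max = 0.0
    for lr in lrs:
        lr_max = max(lr_max, lr)
        s1 += lr
        s2 += lr_max - lr
        losses.append(L0 + A * s1 ** (-alpha) - C * s2)
    return losses

# 演示:恒定学习率下 S2 恒为 0,损失按 S1^(-alpha) 单调下降
losses = loss_curve([1.0] * 4, L0=2.0, A=1.0, alpha=0.5, C=0.1)
```

可以看到,恒定学习率时第二项主导,曲线退化为普通缩放定律;一旦学习率开始退火,S_2 增大带来额外的损失下降,这正是公式刻画的两种效应。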

[NLP-4] Athena: Safe Autonomous Agents with Verbal Contrastive Learning
[NLP-4] Athena:基于言语对比学习的安全自主智能体

链接: https://arxiv.org/abs/2408.11021
作者: Tanmana Sadhu,Ali Pesaranghader,Yanan Chen,Dong Hoon Yi
关键词-EN: large language models, large language, language models, degree of autonomy, utilized as language-based
关键词-ZH: 大型语言模型、大型语言、语言模型、自治程度、作为基于语言的使用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step. Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.
摘要:大型语言模型(LLM)由于具有应急能力,已被用作基于语言的代理,以执行各种任务和进行决策,并具有越来越高的自主性。这些自主代理可以理解高级指令,与其环境交互,并使用可用工具执行复杂的任务。随着代理能力的扩大,确保其安全性和可信性变得更加迫切。在本研究中,我们引入了雅典娜框架,该框架利用言语对比学习的概念,将过去的安全和不安全轨迹用作上下文(对比)示例,以指导代理人在完成给定任务时走向安全。该框架还包含了一种批评机制,以指导代理在每一步都防止危险的行为。此外,由于缺乏关于基于LLM的代理的安全推理能力的现有基准,我们挑选了8个类别180个场景的80个工具包来提供安全评估基准。我们对封闭和开放源代码的LLMS的实验评估表明,言语对比学习和交互级别的批评显著提高了安全率。

[NLP-5] While GitHub Copilot Excels at Coding Does It Ensure Responsible Output?
[NLP-5] GitHub Copilot擅长编码,它能否确保负责任的输出?

链接: https://arxiv.org/abs/2408.11006
作者: Wen Cheng,Ke Sun,Xinyu Zhang,Wei Wang
关键词-EN: code completion capabilities, Code Completion Tools, advanced code completion, LLM-based Code Completion, large language models
关键词-ZH: 代码完成功能、代码完成工具、高级代码完成、基于LLM的代码完成、大型语言模型
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, We successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at this https URL.
摘要:大型语言模型(LLM)的快速发展极大地提升了代码补全能力,催生了新一代基于LLM的代码补全工具(LCCT)。与通用LLM不同,这些工具拥有独特的工作流:将多个信息源整合为输入,并优先提供代码建议而非自然语言交互,这带来了独特的安全挑战。此外,LCCT通常依赖专有代码数据集进行训练,引发了敏感数据可能被泄露的担忧。本文利用LCCT的这些显著特点,针对越狱攻击和训练数据提取攻击这两类关键安全风险,提出了有针对性的攻击方法。我们的实验结果暴露了LCCT中的重大漏洞,包括对GitHub Copilot的越狱攻击成功率达99.4%,对Amazon Q达46.3%。此外,我们成功地从GitHub Copilot中提取了敏感用户数据,包括与GitHub用户名关联的54个真实电子邮件地址和314个物理地址。我们的研究还表明,这些基于代码的攻击方法对GPT系列等通用LLM同样有效,凸显了现代LLM在处理代码时更广泛的安全错位。这些发现强调了与LCCT相关的关键安全挑战,并为加强其安全框架指明了重要方向。我们研究的示例代码和攻击样本在此HTTPS URL中提供。

[NLP-6] Disentangling segmental and prosodic factors to non-native speech comprehensibility
[NLP-6] 解析非母语言语可理解性的分段和韵律因素

链接: https://arxiv.org/abs/2408.10997
作者: Waris Quamer,Ricardo Gutierrez-Osuna
关键词-EN: Current accent conversion, Current accent, segmental, Current, prosodic characteristics
关键词-ZH: 当前口音转换、当前口音、小节、当前、韵律特征
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker’s segmental and/or prosodic channels independently is critical to quantify how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allows the system to transfer prosody and improve voice similarity. We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech. Our results indicate that, contrary to prior research in non-native speech, segmental features have a larger impact on comprehensibility than prosody. The proposed AC system may also be used to study how segmental and prosody cues affect social attitudes towards non-native speech.
摘要:当前的重音转换(AC)系统没有区分非母语口音的两个主要来源:音节特征和韵律特征。能够独立地操纵非母语说话人的音段和/或韵律通道对于量化这两个通道如何有助于言语理解和社会态度至关重要。我们提出了一种AC系统,它不仅将语音质量与口音分离,而且将后者分离为其节段和韵律特征。该系统能够生成结合(1)来自源话语的分段特征、(2)来自目标话语的声音特征以及(3)参考话语的韵律的重音转换。我们证明,声学嵌入的矢量量化和连续重复码字的去除使系统能够传输韵律并提高语音相似度。我们进行了知觉听力测试,以量化分段特征和韵律对非母语言语感知可理解性的个体贡献。我们的研究结果表明,与以往对非母语语音的研究相反,音段特征对可理解性的影响比韵律更大。所提出的AC系统也可以用来研究节段和韵律线索如何影响社会对非母语言语的态度。
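摘要提到"去除连续重复码字"这一步以帮助迁移韵律,其本身操作很简单,可以用几行代码示意(类似CTC解码中的去重步骤;码字序列为虚构演示):

```python
# 对矢量量化得到的码字(codeword)序列,把连续重复压缩为单个码字,
# 从而去掉时长信息、只保留内容序列,便于由参考话语重新赋予韵律。

def dedup_codewords(codes):
    out = []
    for c in codes:
        if not out or out[-1] != c:
            out.append(c)
    return out
```

例如 [3, 3, 3, 7, 7, 2] 会压缩为 [3, 7, 2]:同一码字停留多少帧(即音段时长)被抹去,这正是韵律通道可被独立操控的前提之一。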

[NLP-7] CTP-LLM: Clinical Trial Phase Transition Prediction Using Large Language Models
[NLP-7] CTP-LLM:使用大型语言模型的临床试验阶段转换预测

链接: https://arxiv.org/abs/2408.10995
作者: Michael Reinisch,Jianfeng He,Chenxi Liao,Sauleh Ahmad Siddiqui,Bei Xiao
关键词-EN: medical treatment development, treatment development requires, development requires multiple, requires multiple phases, trial
关键词-ZH: 医疗发展,治疗发展需要,发展需要多个,需要多个阶段,试验
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:New medical treatment development requires multiple phases of clinical trials. Despite the significant human and financial costs of bringing a drug to market, less than 20% of drugs in testing will make it from the first phase to final approval. Recent literature indicates that the design of the trial protocols significantly contributes to trial performance. We investigated Clinical Trial Outcome Prediction (CTOP) using trial design documents to predict phase transitions automatically. We propose CTP-LLM, the first Large Language Model (LLM) based model for CTOP. We also introduce the PhaseTransition (PT) Dataset; which labels trials based on their progression through the regulatory process and serves as a benchmark for CTOP evaluation. Our fine-tuned GPT-3.5-based model (CTP-LLM) predicts clinical trial phase transition by analyzing the trial’s original protocol texts without requiring human-selected features. CTP-LLM achieves a 67% accuracy rate in predicting trial phase transitions across all phases and a 75% accuracy rate specifically in predicting the transition from Phase~III to final approval. Our experimental performance highlights the potential of LLM-powered applications in forecasting clinical trial outcomes and assessing trial design.
摘要:新的治疗方法的发展需要多阶段的临床试验。尽管将一种药物推向市场需要巨大的人力和经济成本,但只有不到20%的正在测试的药物能够从第一阶段进入最终批准。最近的文献表明,试验方案的设计对试验性能有很大的贡献。我们研究了临床试验结果预测(CTOP),使用试验设计文件自动预测相变。我们提出了第一个基于大语言模型的CTOP模型CTP-LLM。我们还介绍了阶段转换(PT)数据集;它根据试验在监管过程中的进展来标记试验,并作为CTOP评估的基准。我们基于GPT-3.5的微调模型(CTP-LLM)通过分析试验的原始协议文本来预测临床试验的阶段转变,而不需要人类选择的特征。CTP-LLM在预测所有阶段的试验阶段转变时的准确率为67%,特别是在预测从第三阶段到最终批准阶段的转变时,准确率为75%。我们的实验表现突出了LLM驱动的应用在预测临床试验结果和评估试验设计方面的潜力。

[NLP-8] The fusion of phonography and ideographic characters into virtual Chinese characters – Based on Chinese and English
[NLP-8] 表音文字与表意文字融合为虚拟汉字——基于汉语和英语

链接: https://arxiv.org/abs/2408.10979
作者: Hongfa Zi,Zhen Liu
关键词-EN: modern countries, divided into ideographic, characters, ideographic characters, phonetic characters
关键词-ZH: 现代国家,分为表意文字、文字、表意文字、拼音文字
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures

点击查看摘要

Abstract:The characters used in modern countries are mainly divided into ideographic characters and phonetic characters, both of which have their advantages and disadvantages. Chinese is difficult to learn and easy to master, while English is easy to learn but has a large vocabulary. There is still no language that combines the advantages of both languages and has less memory capacity, can form words, and is easy to learn. Therefore, inventing new characters that can be combined and the popularization of deep knowledge, and reduce disputes through communication. Firstly, observe the advantages and disadvantages of Chinese and English, such as their vocabulary, information content, and ease of learning in deep scientific knowledge, and create a new writing system. Then, use comparative analysis to observe the total score of the new language. Through this article, it can be concluded that the new text combines the advantages of both pictographic and alphabetical writing: new characters that can be combined into words reduces the vocabulary that needs to be learned; Special prefixes allow beginners to quickly guess the approximate category and meaning of unseen words; New characters can enable humans to quickly learn more advanced knowledge.
摘要:现代国家使用的文字主要有表意文字和表音文字,它们各有优缺点。汉语很难学,很容易掌握,而英语很容易学,但词汇量很大。目前还没有一种语言能够结合两种语言的优点,记忆负担较小,可以组成单词,而且很容易学习。因此,有必要发明可以组合的新文字,以普及深层知识,并通过交流减少分歧。首先,观察汉语和英语在词汇、信息量、深层科学知识易学程度等方面的优势和劣势,创造一种新的书写系统。然后,用对比分析的方法观察新语言的总分。通过这篇文章可以得出结论,新文字结合了象形文字和字母书写的优点:可以组合成单词的新字符减少了需要学习的词汇;特殊前缀让初学者可以快速猜测未见单词的大致类别和含义;新字符可以让人类快速学习更高级的知识。

[NLP-9] NLP for The Greek Language: A Longer Survey
[NLP-9] 面向希腊语的NLP:一份较长的综述

链接: https://arxiv.org/abs/2408.10962
作者: Katerina Papantoniou,Yannis Tzitzikas
关键词-EN: Natural Language Processing, English language, Natural Language, Greek language, offered methods
关键词-ZH: 自然语言处理,英语,自然语言,希腊语,提供的方法
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:English language is in the spotlight of the Natural Language Processing (NLP) community with other languages, like Greek, lagging behind in terms of offered methods, tools and resources. Due to the increasing interest in NLP, in this paper we try to condense research efforts for the automatic processing of Greek language covering the last three decades. In particular, we list and briefly discuss related works, resources and tools, categorized according to various processing layers and contexts. We are not restricted to the modern form of Greek language but also cover Ancient Greek and various Greek dialects. This survey can be useful for researchers and students interested in NLP tasks, Information Retrieval and Knowledge Management for the Greek language.
摘要:英语是自然语言处理(NLP)社区的焦点,而希腊语等其他语言在提供的方法、工具和资源方面落后。由于人们对NLP的兴趣日益浓厚,在本文中,我们试图浓缩过去三十年来希腊语自动处理的研究工作。特别是,我们列出并简要讨论相关作品、资源和工具,并根据各种处理层和上下文进行分类。我们不仅限于希腊语言的现代形式,还涵盖古希腊和各种希腊方言。这项调查对于对希腊语的NLP任务、信息检索和知识管理感兴趣的研究人员和学生很有用。

[NLP-10] Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models ACL2024
[NLP-10] Dr.Academy:评估大型语言模型教育提问能力的基准

链接: https://arxiv.org/abs/2408.10947
作者: Yuyan Chen,Chenwei Wu,Songzhou Yan,Panjun Liu,Haoyu Zhou,Yanghua Xiao
关键词-EN: large language models, important area, language models, area of study, imparting knowledge
关键词-ZH: 大语言模型,重要领域,语言模型,学习领域,传授知识
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to ACL 2024

点击查看摘要

Abstract:Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs’ capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl’s taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs’ outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.
摘要:教师在传授知识和指导学生方面起着重要的作用,而大型语言模型作为潜在的教育者的作用正在成为一个重要的研究领域。认识到LLMS生成教育内容的能力可以促进自动化和个性化学习的进步。虽然LLMS的理解能力和解决问题的能力已经过测试,但它们的教学能力在很大程度上仍未得到开发。在教学中,提问是引导学生分析、评价和综合核心概念和原则的一项关键技能。因此,本研究采用Anderson和Krathwohl的一般、单学科和跨学科领域的分类方法,通过评估LLMS教师生成的教育问题,引入了一个基准来评估LLMS教师的教育提问能力。我们将重点从作为学习者的LLMS转移到作为教育者的LLMS,通过指导他们产生问题来评估他们的教学能力。我们应用相关性、覆盖率、代表性和一致性四个指标来评估LLMS输出的教育质量。我们的结果表明,GPT-4在普通、人文和科学课程的教学中显示出巨大的潜力;Claude2似乎更适合作为一名跨学科教师。此外,自动评分与人类的观点一致。

[NLP-11] SysBench: Can Large Language Models Follow System Messages?
[NLP-11] SysBench:大型语言模型能否遵循系统消息?

链接: https://arxiv.org/abs/2408.10943
作者: Yanzhao Qin,Tao Zhang,Tao Zhang,Yanjun Shen,Wenjing Luo,Haoze Sun,Yan Zhang,Yujing Qiao,Weipeng Chen,Zenan Zhou,Wentao Zhang,Bin Cui
关键词-EN: Large Language Models, Large Language, Language Models, system messages, increasingly critical
关键词-ZH: 大型语言模型、大型语言、语言模型、系统消息,越来越重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become instrumental across various applications, with the customization of these models to specific scenarios becoming increasingly critical. System message, a fundamental component of LLMs, is consist of carefully crafted instructions that guide the behavior of model to meet intended goals. Despite the recognized potential of system messages to optimize AI-driven solutions, there is a notable absence of a comprehensive benchmark for evaluating how well different LLMs follow these system messages. To fill this gap, we introduce SysBench, a benchmark that systematically analyzes system message following ability in terms of three challenging aspects: constraint complexity, instruction misalignment and multi-turn stability. In order to enable effective evaluation, SysBench constructs multi-turn user conversations covering various interaction relationships, based on six common types of constraints from system messages in real-world scenarios. Our dataset contains 500 system messages from various domains, each paired with 5 turns of user conversations, which have been manually formulated and checked to guarantee high quality. SysBench provides extensive evaluation across various LLMs, measuring their ability to follow specified constraints given in system messages. The results highlight both the strengths and weaknesses of existing models, offering key insights and directions for future research. The open source library SysBench is available at this https URL.
摘要:大型语言模型(LLM)已成为各类应用中的重要工具,将这些模型定制到特定场景变得日益关键。系统消息(system message)是LLM的基本组成部分,由精心设计的指令构成,用于引导模型行为以达成预期目标。尽管系统消息在优化AI驱动解决方案方面的潜力已被认可,但目前仍明显缺乏一个全面的基准来评估不同LLM遵循系统消息的能力。为填补这一空白,我们提出了SysBench,该基准从约束复杂性、指令错位和多轮稳定性三个具有挑战性的方面系统地分析模型遵循系统消息的能力。为实现有效评估,SysBench基于现实场景中系统消息的六类常见约束,构建了涵盖多种交互关系的多轮用户对话。我们的数据集包含来自不同领域的500条系统消息,每条消息配以5轮用户对话,均经过人工编写和检查以保证高质量。SysBench对各类LLM进行了广泛评估,衡量它们遵循系统消息中给定约束的能力。结果突显了现有模型的优势与不足,为未来研究提供了关键洞察和方向。开源库SysBench可通过此HTTPS URL获得。

[NLP-12] LBC: Language-Based-Classifier for Out-Of-Variable Generalization
[NLP-12] LBC:用于变量外(OOV)泛化的基于语言的分类器

链接: https://arxiv.org/abs/2408.10923
作者: Kangjun Noh,Baekryun Seong,Hoyoon Byun,Sungjin Song,Kyungwoo Song
关键词-EN: Large Language Models, natural language processing, Large Language, language processing tasks, natural language
关键词-ZH: 大型语言模型、自然语言处理、大型语言、语言处理任务、自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have great success in natural language processing tasks such as response generation. However, their use in tabular data has been limited due to their inferior performance compared to traditional machine learning models (TMLs) such as XGBoost. We find that the pre-trained knowledge of LLMs enables them to interpret new variables that appear in a test without additional training, a capability central to the concept of Out-of-Variable (OOV). From the findings, we propose a Language-Based-Classifier (LBC), a classifier that maximizes the benefits of LLMs to outperform TMLs on OOV tasks. LBC employs three key methodological strategies: 1) Categorical changes to adjust data to better fit the model’s understanding, 2) Advanced order and indicator to enhance data representation to the model, and 3) Using verbalizer to map logit scores to classes during inference to generate model predictions. These strategies, combined with the pre-trained knowledge of LBC, emphasize the model’s ability to effectively handle OOV tasks. We empirically and theoretically validate the superiority of LBC. LBC is the first study to apply an LLM-based model to OOV tasks. The source code is at this https URL.
摘要:大语言模型(LLM)在响应生成等自然语言处理任务中取得了巨大成功。然而,与XGBoost等传统机器学习模型(TML)相比性能较差,它们在表格数据上的应用一直受限。我们发现,LLM的预训练知识使其无需额外训练即可解释测试中出现的新变量,这是变量外(OOV)概念的核心能力。基于这些发现,我们提出了基于语言的分类器(LBC),它最大化LLM的优势,在OOV任务上超越TML。LBC采用三种关键方法策略:1)通过类别变换调整数据以更符合模型的理解;2)利用高级排序和指示符增强数据对模型的表示;3)在推理时使用verbalizer将logit分数映射到类别以生成模型预测。这些策略与LBC的预训练知识相结合,强化了模型有效处理OOV任务的能力。我们从经验和理论上验证了LBC的优越性。LBC是首个将基于LLM的模型应用于OOV任务的研究。源代码位于此HTTPS URL。
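其中verbalizer的作用是把语言模型对若干"标签词"的logit分数映射回分类标签,其基本思路可用如下草图示意(标签词与logit数值均为虚构演示,并非LBC的实际实现):

```python
# verbalizer 示意:每个类别对应一个标签词(如 0 -> "no", 1 -> "yes"),
# 取语言模型对各标签词打分最高者作为预测类别。

def verbalizer_predict(token_logits, verbalizer):
    # token_logits: {标签词: logit 分数};verbalizer: {类别: 标签词}
    best_cls, best_score = None, float("-inf")
    for cls, word in verbalizer.items():
        score = token_logits.get(word, float("-inf"))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# 虚构的一次推理输出
logits = {"yes": 3.2, "no": 1.1, "maybe": 0.4}
pred = verbalizer_predict(logits, {0: "no", 1: "yes"})
```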

[NLP-13] CHECKWHY: Causal Fact Verification via Argument Structure ACL2024
[NLP-13] CHECKWHY:通过论证结构进行因果事实验证

链接: https://arxiv.org/abs/2408.10918
作者: Jiasheng Si,Yibo Zhao,Yingjie Zhu,Haiyang Zhu,Wenpeng Lu,Deyu Zhou
关键词-EN: fact verification, fact verification tasks, causal fact verification, capabilities is increasing, recent fact verification
关键词-ZH: 事实验证,事实验证任务,因果事实验证,能力正在增强,最近的事实验证
类目: Computation and Language (cs.CL)
备注: Accepted by ACL2024; Awarded as Outstanding Paper Award and Area Chair Award

点击查看摘要

Abstract:With the growing complexity of fact verification tasks, the concern with “thoughtful” reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CheckWhy, a challenging dataset tailored to a novel causal fact verification task: checking the truthfulness of the causal relation within claims through rigorous reasoning steps. CheckWhy consists of over 19K “why” claim-evidence-argument structure triplets with supports, refutes, and not enough info labels. Each argument structure is composed of connected evidence, representing the reasoning process that begins with foundational evidence and progresses toward claim establishment. Through extensive experiments on state-of-the-art models, we validate the importance of incorporating the argument structure for causal fact verification. Moreover, the automated and human evaluation of argument structure generation reveals the difficulty in producing satisfying argument structure by fine-tuned models or Chain-of-Thought prompted LLMs, leaving considerable room for future improvements.
摘要:随着事实验证任务日益复杂,人们对"深思熟虑"的推理能力愈发关注。然而,现有的事实验证基准主要集中在检查声明中狭窄范围的语义事实,缺乏明确的逻辑推理过程。本文介绍了CheckWhy,这是一个具有挑战性的数据集,面向一项新颖的因果事实验证任务:通过严格的推理步骤检验声明中因果关系的真实性。CheckWhy包含超过19K个带有支持、反驳和信息不足标签的"why"声明-证据-论证结构三元组。每个论证结构由相互关联的证据组成,代表了从基础证据出发、逐步确立声明的推理过程。通过在最先进模型上的大量实验,我们验证了引入论证结构对因果事实验证的重要性。此外,对论证结构生成的自动与人工评估表明,微调模型或思维链提示的LLM仍难以生成令人满意的论证结构,为未来改进留下了相当大的空间。

[NLP-14] To Code or Not To Code? Exploring Impact of Code in Pre-training
[NLP-14] 编码还是不编码?探索代码在预训练中的影响

链接: https://arxiv.org/abs/2408.10914
作者: Viraat Aryabumi,Yixuan Su,Raymond Ma,Adrien Morisot,Ivan Zhang,Acyr Locatelli,Marzieh Fadaee,Ahmet Üstün,Sara Hooker
关键词-EN: Including code, code, pre-training data mixture, code data, specifically designed
关键词-ZH: 包括代码、代码、训练前数据混合、代码数据、专门设计的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLMs’ performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask “what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation”. We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find a consistent results that code is a critical building block for generalization far beyond coding tasks and improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in up to relative increase of 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, 6.6% improvement in generative win-rates, and a 12x boost in code performance respectively. Our work suggests investments in code quality and preserving code during pre-training have positive impacts.
摘要:将代码包括在预训练数据混合中,即使是对于不是专门为代码设计的模型,也已成为LLMS预训练中的常见做法。虽然在实践者中已经有了坊间的共识,即代码数据在一般LLMS的性能中起着至关重要的作用,但分析代码对非代码任务的准确影响的工作有限。在这项工作中,我们系统地研究了代码数据对总体性能的影响。我们问“在预培训中使用的代码数据对代码生成之外的大量下游任务有什么影响”。我们进行广泛的消融和评估,涵盖广泛的自然语言推理任务、世界知识任务、代码基准和LLM作为法官的胜率,适用于大小从470M到2.8B参数的模型。在各种设置中,我们发现一个一致的结果,即代码是泛化的关键构建块,远远超出了编码任务的范围,代码质量的改进对所有任务都有非常大的影响。特别是,与纯文本预训练相比,代码的添加导致自然语言(NL)推理的相对增长8.2%,世界知识的相对增长4.2%,生成成功率的提高6.6%,代码性能的提高12倍。我们的工作表明,在预培训期间对代码质量和保留代码进行投资具有积极的影响。

[NLP-15] BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model
[NLP-15] 超越对话:面向通用角色扮演语言模型的角色设定-对话对齐框架

链接: https://arxiv.org/abs/2408.10903
作者: Yeyong Yu,Rusheng Yu,Haojie Wei,Zhanqiu Zhang,Quan Qian
关键词-EN: large language models, enabling the development, rapid advancement, advancement of large, large language
关键词-ZH: 大型语言模型,实现大型语言的发展、快速进步、进步
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has revolutionized role-playing, enabling the development of general role-playing models. However, current role-playing training has two significant issues: (I) Using a predefined role profile to prompt dialogue training for specific scenarios usually leads to inconsistencies and even conflicts between the dialogue and the profile, resulting in training biases. (II) The model learns to imitate the role based solely on the profile, neglecting profile-dialogue alignment at the sentence level. In this work, we propose a simple yet effective framework called BEYOND DIALOGUE, designed to overcome these hurdles. This framework innovatively introduces “beyond dialogue” tasks to align dialogue with profile traits based on each specific scenario, thereby eliminating biases during training. Furthermore, by adopting an innovative prompting mechanism that generates reasoning outcomes for training, the framework allows the model to achieve fine-grained alignment between profile and dialogue at the sentence level. The aforementioned methods are fully automated and low-cost. Additionally, the integration of automated dialogue and objective evaluation methods forms a comprehensive framework, paving the way for general role-playing. Experimental results demonstrate that our model excels in adhering to and reflecting various dimensions of role profiles, outperforming most proprietary general and specialized role-playing baselines. All code and datasets are available at this https URL.
摘要:大型语言模型(LLM)的快速发展使角色扮演发生了革命性变化,使通用角色扮演模型的发展成为可能。然而,当前的角色扮演训练存在两个显著问题:(I)使用预定义的角色设定(profile)来提示特定场景的对话训练,通常会导致对话与角色设定之间的不一致甚至冲突,从而造成训练偏差。(II)模型仅根据角色设定来模仿角色,忽略了句子层面的角色设定与对话的对齐。在这项工作中,我们提出了一个简单而有效的框架,称为BEYOND DIALOGUE,旨在克服这些障碍。该框架创新性地引入“超越对话”任务,使对话与基于每个特定场景的角色设定特征保持一致,从而消除训练过程中的偏差。此外,通过采用创新的提示机制来生成用于训练的推理结果,该框架使模型能够在句子层面实现角色设定与对话之间的细粒度对齐。上述方法是全自动且低成本的。此外,自动化对话与客观评估方法的结合形成了一个全面的框架,为通用角色扮演铺平了道路。实验结果表明,我们的模型在遵循并反映角色设定的各个维度方面表现出色,超过了大多数专有的通用和专门角色扮演基线。所有代码和数据集都可在此HTTPS URL获取。

[NLP-16] Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs
[NLP-16] Soda-Eval:LLM时代的开放域对话评估

链接: https://arxiv.org/abs/2408.10902
作者: John Mendonça,Isabel Trancoso,Alon Lavie
关键词-EN: Large Language Models, Large Language, gold standard, standard for open-domain, growing popularity
关键词-ZH: 大型语言模型,大型语言,黄金标准,开放领域标准,日益流行
类目: Computation and Language (cs.CL)
备注: 22 pages, 10 figures

点击查看摘要

Abstract:Although human evaluation remains the gold standard for open-domain dialogue evaluation, the growing popularity of automated evaluation using Large Language Models (LLMs) has also extended to dialogue. However, most frameworks leverage benchmarks that assess older chatbots on aspects such as fluency and relevance, which are not reflective of the challenges associated with contemporary models. In fact, a qualitative analysis on Soda, a GPT-3.5 generated dialogue dataset, suggests that current chatbots may exhibit several recurring issues related to coherence and commonsense knowledge, but generally produce highly fluent and relevant responses. Noting the aforementioned limitations, this paper introduces Soda-Eval, an annotated dataset based on Soda that covers over 120K turn-level assessments across 10K dialogues, where the annotations were generated by GPT-4. Using Soda-Eval as a benchmark, we then study the performance of several open-access instruction-tuned LLMs, finding that dialogue evaluation remains challenging. Fine-tuning these models improves performance over few-shot inferences, both in terms of correlation and explanation.
摘要:虽然人工评估仍然是开放域对话评估的黄金标准,但使用大型语言模型(LLM)进行自动评估的日益流行也已扩展到对话领域。然而,大多数框架利用的基准是从流畅性和相关性等方面评估较早的聊天机器人,这些方面并不能反映当代模型面临的挑战。事实上,对GPT-3.5生成的对话数据集Soda的定性分析表明,当前的聊天机器人可能会表现出若干与连贯性和常识知识相关的反复出现的问题,但通常能产生高度流畅且相关的回复。鉴于上述局限性,本文介绍了Soda-Eval,这是一个基于Soda的标注数据集,涵盖10K个对话中超过120K个话轮级评估,其中标注由GPT-4生成。然后以Soda-Eval为基准,我们研究了几种开放获取的指令微调LLM的性能,发现对话评估仍然具有挑战性。微调这些模型后,无论是在相关性还是在解释方面,其性能都优于少样本推理。

[NLP-17] DELIA: Diversity-Enhanced Learning for Instruction Adaptation in Large Language Models
[NLP-17] DELIA:大型语言模型中面向指令适应的多样性增强学习

链接: https://arxiv.org/abs/2408.10841
作者: Yuanhao Zeng,Fei Ren,Xinpeng Zhou,Yihang Wang,Yingxia Shao
关键词-EN: Large Language Models, Large Language, Language Models, specific task formats, behavior in Large
关键词-ZH: 大型语言模型、大型语言、语言模型、特定任务格式、大型行为
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Although instruction tuning is widely used to adjust behavior in Large Language Models (LLMs), extensive empirical evidence and research indicates that it is primarily a process where the model fits to specific task formats, rather than acquiring new knowledge or capabilities. We propose that this limitation stems from biased features learned during instruction tuning, which differ from ideal task-specific features, leading the model to learn less of the underlying semantics in downstream tasks. However, ideal features are unknown and incalculable, constraining past work to rely on prior knowledge to assist reasoning or training, which limits LLMs’ capabilities to the developers’ abilities, rather than data-driven scalable learning. In our paper, through our novel data synthesis method, DELIA (Diversity-Enhanced Learning for Instruction Adaptation), we leverage the buffering effect of extensive diverse data in LLMs training to transform biased features in instruction tuning into approximations of ideal features, without explicit prior ideal features. Experiments show DELIA’s better performance compared to common instruction tuning and other baselines. It outperforms common instruction tuning by 17.07%-33.41% on Icelandic-English translation BLEURT score (WMT-21 dataset, gemma-7b-it) and improves accuracy by 36.1% on formatted text generation (Llama2-7b-chat). Notably, among knowledge injection methods we’ve known, DELIA uniquely aligns the internal representations of new special tokens with their prior semantics.
摘要:虽然指令微调被广泛用于调整大型语言模型(LLM)的行为,但大量的经验证据和研究表明,它主要是模型拟合特定任务格式的过程,而不是获取新知识或新能力的过程。我们认为,这种局限源于指令微调过程中学到的有偏特征,这些特征不同于理想的任务特定特征,导致模型在下游任务中学到的底层语义较少。然而,理想特征是未知且无法计算的,这使得以往的工作只能依赖先验知识来辅助推理或训练,从而将LLM的能力限制在开发者的能力范围内,而非数据驱动的可扩展学习。在本文中,我们通过提出的新型数据合成方法DELIA(面向指令适应的多样性增强学习),利用LLM训练中大量多样数据的缓冲效应,将指令微调中的有偏特征转化为理想特征的近似,而无需显式的先验理想特征。实验表明,与普通指令微调和其他基线相比,DELIA的性能更好。在冰岛语-英语翻译的BLEURT分数(WMT-21数据集,gemma-7b-it)上,它比普通指令微调高出17.07%-33.41%;在格式化文本生成(Llama2-7b-chat)上,准确率提高了36.1%。值得注意的是,在我们已知的知识注入方法中,DELIA独特地将新的特殊token的内部表示与其先验语义对齐。

[NLP-18] Benchmarking Large Language Models for Math Reasoning Tasks
[NLP-18] 数学推理任务的大型语言模型基准测试

链接: https://arxiv.org/abs/2408.10839
作者: Kathrin Seßler,Yao Rong,Emek Gözlüklü,Enkelejda Kasneci
关键词-EN: Large Language Models, Large Language, enabling potential practical, mathematical problem solving, Language Models
关键词-ZH: 大型语言模型,大型语言,实现潜在的实用数学问题解决,语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently from the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.
摘要:大型语言模型(LLM)在数学推理中的应用已成为相关研究的基石,展示了这些模型的智能,并通过其先进的性能使潜在的实际应用成为可能,例如在教育场景中。尽管有各种旨在提高LLM自动解决数学问题能力的数据集和上下文学习算法,但由于缺乏跨不同数据集的全面基准测试,为特定任务选择合适的模型变得复杂。在这个项目中,我们提出了一个基准,在四个强大的基础模型上、在五个广泛使用的数学数据集上,公平地比较了七种最先进的用于数学问题求解的上下文学习算法。此外,我们还探讨了效率与性能之间的权衡,重点介绍了LLM在数学推理中的实际应用。结果表明,较大的基础模型如GPT-4o和LLaMA 3-70B可以独立于具体的提示策略解决数学推理问题,而对于较小的模型,上下文学习方法对性能有显著影响。此外,最优提示取决于所选择的基础模型。我们开源了基准代码,以支持在未来研究中集成更多模型。

[NLP-19] Exploiting Large Language Models Capabilities for Question Answer-Driven Knowledge Graph Completion Across Static and Temporal Domains
[NLP-19] 利用大型语言模型能力实现跨静态和时态领域的问答驱动的知识图谱补全

链接: https://arxiv.org/abs/2408.10819
作者: Rui Yang,Jiahao Zhu,Jianping Man,Li Fang,Yi Zhou
关键词-EN: identify missing triples, Knowledge graph completion, aims to identify, Knowledge graph, Generative Subgraph-based KGC
关键词-ZH: 识别缺失的三重组,知识图完成,旨在识别,知识图,基于生成子图的KGC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph completion (KGC) aims to identify missing triples in a knowledge graph (KG). This is typically achieved through tasks such as link prediction and instance completion. However, these methods often focus on either static knowledge graphs (SKGs) or temporal knowledge graphs (TKGs), addressing only within-scope triples. This paper introduces a new generative completion framework called Generative Subgraph-based KGC (GS-KGC). GS-KGC employs a question-answering format to directly generate target entities, addressing the challenge of questions having multiple possible answers. We propose a strategy that extracts subgraphs centered on entities and relationships within the KG, from which negative samples and neighborhood information are separately obtained to address the one-to-many problem. Our method generates negative samples using known facts to facilitate the discovery of new information. Furthermore, we collect and refine neighborhood path data of known entities, providing contextual information to enhance reasoning in large language models (LLMs). Our experiments evaluated the proposed method on four SKGs and two TKGs, achieving state-of-the-art Hits@1 metrics on five datasets. Analysis of the results shows that GS-KGC can discover new triples within existing KGs and generate new facts beyond the closed KG, effectively bridging the gap between closed-world and open-world KGC.
摘要:知识图谱补全(KGC)旨在识别知识图谱(KG)中缺失的三元组。这通常是通过链接预测和实例补全等任务来实现的。然而,这些方法通常只关注静态知识图谱(SKG)或时态知识图谱(TKG),仅处理范围内的三元组。本文提出了一种新的生成式补全框架,即基于生成子图的KGC(GS-KGC)。GS-KGC使用问答格式直接生成目标实体,解决了问题具有多个可能答案的挑战。我们提出了一种以KG内部实体和关系为中心的子图提取策略,从子图中分别获取负样本和邻域信息,以解决一对多问题。我们的方法使用已知事实生成负样本,以便于发现新信息。此外,我们收集和提炼已知实体的邻域路径数据,提供上下文信息来增强大型语言模型(LLM)中的推理。我们的实验在四个SKG和两个TKG上对所提方法进行了评估,在五个数据集上实现了最先进的Hits@1指标。结果分析表明,GS-KGC能够在现有KG中发现新的三元组,并在封闭KG之外生成新的事实,有效地弥合了封闭世界KGC与开放世界KGC之间的差距。
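摘要中“以实体为中心抽取子图、并由已知事实构造负样本”的思路,可以用下面这个纯示意的小例子来体会(三元组与实体名均为假设数据,并非论文实现):

```python
def neighborhood(kg, entity):
    # 抽取以实体为中心的一跳子图(与该实体相连的全部三元组)
    return [(h, r, t) for (h, r, t) in kg if h == entity or t == entity]

def negative_samples(kg, head, relation):
    # 已知 (head, relation) 的尾实体为正例;
    # 同一关系下出现过的其他尾实体作为由已知事实构造的负样本
    gold = {t for (h, r, t) in kg if h == head and r == relation}
    return sorted({t for (h, r, t) in kg if r == relation} - gold)

kg = [
    ("paris", "capital_of", "france"),
    ("berlin", "capital_of", "germany"),
    ("paris", "located_in", "europe"),
]
print(neighborhood(kg, "paris"))
print(negative_samples(kg, "paris", "capital_of"))  # ['germany']
```

实际系统中,邻域信息会作为上下文拼入问答式提示,负样本则用于训练时的对比。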

[NLP-20] Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?
[NLP-20] 超越以英语为中心的LLM:多语言语言模型用什么语言思考?

链接: https://arxiv.org/abs/2408.10811
作者: Chengzhi Zhong,Fei Cheng,Qianying Liu,Junfeng Jiang,Zhen Wan,Chenhui Chu,Yugo Murawaki,Sadao Kurohashi
关键词-EN: exhibit higher probabilities, respective dominant language, respective dominant, strong performance, vocabulary space
关键词-ZH: 表现出更高的可能性、各自的主导语言、各自的主导语言、较强的表现、词汇空间
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in progress

点击查看摘要

Abstract:In this study, we investigate whether non-English-centric LLMs, despite their strong performance, ‘think’ in their respective dominant language: more precisely, ‘think’ refers to how the representations of intermediate layers, when un-embedded into the vocabulary space, exhibit higher probabilities for certain dominant languages during generation. We term such languages internal latent languages. We examine the latent language of three typical categories of models for Japanese processing: Llama2, an English-centric model; Swallow, an English-centric model with continued pre-training in Japanese; and LLM-jp, a model pre-trained on balanced English and Japanese corpora. Our empirical findings reveal that, unlike Llama2, which relies exclusively on English as the internal latent language, the Japanese-specific Swallow and LLM-jp employ both Japanese and English, exhibiting dual internal latent languages. For any given target language, the model preferentially activates the latent language most closely related to it. In addition, we explore how intermediate layers respond to questions involving cultural conflicts between latent internal and target output languages. We further explore how the language identity shifts across layers while keeping consistent semantic meaning reflected in the intermediate layer representations. This study deepens the understanding of non-English-centric large language models, highlighting the intricate dynamics of language representation within their intermediate layers.
摘要:在这项研究中,我们考察了以非英语为中心的LLM尽管性能强劲,是否以各自的主导语言进行“思考”:更准确地说,“思考”指的是中间层的表示在未嵌入(un-embed)到词汇空间时,在生成过程中对某些主导语言表现出更高的概率。我们将这类语言称为内部潜在语言。我们考察了三类典型日语处理模型的潜在语言:以英语为中心的Llama2;以英语为中心并继续用日语预训练的Swallow;以及在平衡的英日语料库上预训练的LLM-jp。我们的实证结果表明,与完全依赖英语作为内部潜在语言的Llama2不同,日语特化的Swallow和LLM-jp同时使用日语和英语,表现出双重内部潜在语言。对于任何给定的目标语言,模型会优先激活与其关系最密切的潜在语言。此外,我们还探讨了中间层如何应对涉及潜在内部语言与目标输出语言之间文化冲突的问题。我们进一步探讨了语言身份如何跨层转移,同时在中间层表示中保持一致的语义。这项研究加深了对以非英语为中心的大型语言模型的理解,突出了其中间层内语言表征的复杂动态。
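摘要中“将中间层表示未嵌入到词汇空间、观察各语言的概率”的做法,大致可以用如下玩具代码示意(隐状态、未嵌入矩阵与语言标签均为随机假设值,仅为说明思路,并非论文实现):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def latent_language_mass(hidden, unembed, token_lang):
    # 将中间层隐状态经未嵌入矩阵映射回词表空间,
    # 再按每个 token 所属的语言聚合概率质量
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    probs = softmax(logits)
    mass = {}
    for p, lang in zip(probs, token_lang):
        mass[lang] = mass.get(lang, 0.0) + p
    return mass

random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(4)]                        # 假想的隐状态
unembed = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]   # 假想的未嵌入矩阵
token_lang = ["en", "en", "en", "ja", "ja", "ja"]                      # 每个词表 token 的语言标签
mass = latent_language_mass(hidden, unembed, token_lang)
print(mass)
```

若某一中间层上英语 token 的概率质量系统性偏高,即可认为该层的内部潜在语言偏向英语。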

[NLP-21] ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering
[NLP-21] 语言模型问答的ColBERT检索和集成响应评分

链接: https://arxiv.org/abs/2408.10808
作者: Alex Gichamba,Tewodros Kederalah Idris,Brian Ebiyau,Eric Nyberg,Teruko Mitamura
关键词-EN: deep technical knowledge, technical knowledge required, Domain-specific question answering, answer questions correctly, answering remains challenging
关键词-ZH: 深厚的技术知识,所需的技术知识,特定领域的问题回答,正确回答问题,回答仍然具有挑战性
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This work has been submitted to the 2024 IEEE Globecom Workshops for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Domain-specific question answering remains challenging for language models, given the deep technical knowledge required to answer questions correctly. This difficulty is amplified for smaller language models that cannot encode as much information in their parameters as larger models. The “Specializing Large Language Models for Telecom Networks” challenge aimed to enhance the performance of two small language models, Phi-2 and Falcon-7B in telecommunication question answering. In this paper, we present our question answering systems for this challenge. Our solutions achieved leading marks of 81.9% accuracy for Phi-2 and 57.3% for Falcon-7B. We have publicly released our code and fine-tuned models.
摘要:鉴于正确回答问题所需的深厚技术知识,特定领域的问答对于语言模型来说仍然具有挑战性。对于较小的语言模型,这一困难会被放大,因为它们无法像较大的模型那样在参数中编码同样多的信息。“电信网络专用大型语言模型”挑战赛旨在提升Phi-2和Falcon-7B这两个小型语言模型在电信问答中的性能。在本文中,我们介绍了针对这一挑战的问答系统。我们的解决方案在Phi-2上达到81.9%、在Falcon-7B上达到57.3%的领先准确率。我们已公开发布了代码和微调模型。

[NLP-22] Adversarial Attack for Explanation Robustness of Rationalization Models
[NLP-22] 对合理化模型解释稳健性的对抗攻击

链接: https://arxiv.org/abs/2408.10795
作者: Yuankai Zhang,Lingxiao Kong,Haozhao Wang,Ruixuan Li,Jun Wang,Yuhua Li,Wei Liu
关键词-EN: eXplainable Artificial Intelligence, Artificial Intelligence, trust predictions-have recently, predictions-have recently emerged, prominent research area
关键词-ZH: eXplanable人工智能、人工智能、信任预测-最近,预测-最近出现了,突出的研究领域
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Rationalization models, which select a subset of input text as a rationale (crucial for humans to understand and trust predictions), have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most of previous studies mainly focus on improving the quality of the rationale, ignoring its robustness to malicious attack. Specifically, whether the rationalization models can still generate high-quality rationales under adversarial attack remains unknown. To explore this, this paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E employs gradient-based search on triggers and then inserts them into the original input to conduct both the non-target and target attack. Experimental results on five datasets reveal the vulnerability of rationalization models in terms of explanation, where they tend to select more meaningless tokens under attacks. Based on this, we make a series of recommendations for improving rationalization models in terms of explanation.
摘要:合理化模型选择输入文本的一个子集作为解释依据(rationale),这对人类理解和信任预测至关重要,最近已成为可解释人工智能(XAI)领域的一个重要研究方向。然而,以往的研究大多侧重于提高解释依据的质量,而忽略了其对恶意攻击的鲁棒性。具体而言,合理化模型在对抗攻击下能否仍然生成高质量的解释依据尚属未知。为了探索这一问题,本文提出了UAT2E,其目的是在不改变模型预测的情况下削弱合理化模型的可解释性,从而引起人类用户对这些模型的不信任。UAT2E对触发器采用基于梯度的搜索,然后将其插入原始输入,以执行非定向攻击和定向攻击。在五个数据集上的实验结果揭示了合理化模型在解释方面的脆弱性:在攻击下,它们倾向于选择更多无意义的token。在此基础上,我们从解释的角度提出了一系列改进合理化模型的建议。

[NLP-23] Flexora: Flexible Low Rank Adaptation for Large Language Models
[NLP-23] Flexora:针对大型语言模型的灵活低秩适应

链接: https://arxiv.org/abs/2408.10774
作者: Chenxing Wei,Yao Shu,Ying Tiffany He,Fei Richard Yu
关键词-EN: significantly enhanced generalization, enhanced generalization ability, Large Language Models, Large Language, driving advancements
关键词-ZH: 显着增强的概括,增强的概括能力,大型语言模型,大型语言,推动进步
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 13 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand the boundaries on these tasks, whereas LoRA would underperform on certain tasks owing to its potential overfitting on these tasks. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers needing to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora firstly frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of our Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of our Flexora.
摘要:大型语言模型(LLM)通过扩大模型参数规模推动人工智能的进步,这显著增强了泛化能力,并在实践中解锁了新能力。然而,它们在特定下游任务中的表现通常受到其在这些任务上的知识边界的限制。因此,微调技术,特别是广泛使用的低秩适应(LoRA)方法,被引入来扩展这些任务上的边界;而LoRA在某些任务上会因潜在的过拟合而表现不佳。为了克服这种过拟合并提高LoRA的性能,我们提出了灵活低秩适应(Flexora)方法,自动且灵活地选择最需要微调的层,以在不同下游任务上获得最佳性能。具体而言,Flexora首先将层选择问题构建为定义良好的超参数优化(HPO)问题,然后使用展开微分(UD)方法对其求解,最后基于优化后的超参数选择最有用的层。我们在许多预训练模型和自然语言任务上的大量实验表明,Flexora能够持续超越现有基线,表明了Flexora在实践中的有效性。此外,我们还提供了有洞察力的理论结果和大量消融研究,以便全面理解Flexora。
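Flexora 将“选哪些层做LoRA微调”视为超参数优化问题;其最后一步“基于优化结果选出最有用的层”可以简化为如下示意(逐层验证损失为假设数据,实际中由展开微分优化得到):

```python
def select_layers(val_loss_per_layer, k):
    # 按“仅微调该层时的验证损失”升序排列,保留最有用的 k 层
    ranked = sorted(val_loss_per_layer, key=val_loss_per_layer.get)
    return sorted(ranked[:k])

# 假想的逐层验证损失(实际来自 HPO / 展开微分的优化结果)
val_loss = {0: 0.91, 1: 0.85, 2: 0.70, 3: 0.88, 4: 0.66, 5: 0.93}
chosen = select_layers(val_loss, 2)
print(chosen)  # [2, 4]
```

选出层后,只在这些层上插入并训练 LoRA 参数,其余层保持冻结。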

[NLP-24] Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
[NLP-24] 与token一起预测奖励:用于大型语言模型高效推理干预的非侵入式参数插入

链接: https://arxiv.org/abs/2408.10764
作者: Chenhan Yuan,Fei Huang,Ru Peng,Keming Lu,Bowen Yu,Chang Zhou,Jingren Zhou
关键词-EN: Transformer-based large language, Transformer-based large, large language models, unreliable reasoning, generating unsafe responses
关键词-ZH: 基于转换器的大型语言、基于转换器的大型语言模型、不可靠推理、生成不安全响应
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Transformer-based large language models (LLMs) exhibit limitations such as generating unsafe responses, unreliable reasoning, etc. Existing inference intervention approaches attempt to mitigate these issues by finetuning additional models to produce calibration signals (such as rewards) that guide the LLM’s decoding process. However, this solution introduces substantial time and space overhead due to the separate models required. This work proposes Non-disruptive parameters insertion (Otter), inserting extra parameters into the transformer architecture to predict calibration signals along with the original LLM output. Otter offers state-of-the-art performance on multiple demanding tasks while saving up to 86.5% extra space and 98.5% extra time. Furthermore, Otter seamlessly integrates with existing inference engines, requiring only a one-line code change, and the original model response remains accessible after the parameter insertion. Our code is publicly available at \urlthis https URL
摘要:基于Transformer的大型语言模型(LLM)存在生成不安全回复、推理不可靠等局限。现有的推理干预方法试图通过微调额外的模型来产生校准信号(如奖励)以指导LLM的解码过程,从而缓解这些问题。然而,由于需要单独的模型,这种方案会带来大量的时间和空间开销。这项工作提出了非侵入式参数插入(Otter),将额外参数插入Transformer架构,在输出原始LLM结果的同时预测校准信号。Otter在多个高要求任务上提供最先进的性能,同时节省高达86.5%的额外空间和98.5%的额外时间。此外,Otter与现有推理引擎无缝集成,只需更改一行代码,且在插入参数后仍可访问原始模型回复。我们的代码在此HTTPS URL公开提供
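Otter 的核心是在原有 Transformer 上插入少量参数,使模型在输出 token logits 的同时预测校准信号,而无需单独运行第二个模型。下面是一个极简示意(线性奖励头与随机权重均为假设,并非论文实现):

```python
import random

def decode_step(hidden, lm_head, reward_head, reward_bias=0.0):
    # 原有的语言模型头产生 token logits;
    # 插入的奖励头读取同一隐状态,额外输出一个标量校准信号
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]
    reward = sum(h * w for h, w in zip(hidden, reward_head)) + reward_bias
    return logits, reward

random.seed(1)
d, vocab = 4, 5
hidden = [random.gauss(0, 1) for _ in range(d)]                      # 某解码步的隐状态
lm_head = [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]
reward_head = [random.gauss(0, 1) for _ in range(d)]                 # 插入的额外参数
logits, reward = decode_step(hidden, lm_head, reward_head)
print(len(logits), round(reward, 3))
```

由于奖励头复用了同一次前向计算,额外的时间与空间开销只来自这组新增参数本身。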

[NLP-25] Towards Efficient Large Language Models for Scientific Text: A Review
[NLP-25] 面向科学文本的高效大型语言模型:综述

链接: https://arxiv.org/abs/2408.10729
作者: Huy Quoc To,Ming Liu,Guangyan Huang
关键词-EN: Large language models, processing complex information, Large language, era for processing, processing complex
关键词-ZH: 大型语言模型,处理复杂信息,大型语言,处理时代,处理复杂
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have ushered in a new era for processing complex information in various fields, including science. The increasing amount of scientific literature allows these models to acquire and understand scientific knowledge effectively, thus improving their performance in a wide range of tasks. Due to the power of LLMs, they require extremely expensive computational resources, intense amounts of data, and training time. Therefore, in recent years, researchers have proposed various methodologies to make scientific LLMs more affordable. The most well-known approaches align in two directions. It can be either focusing on the size of the models or enhancing the quality of data. To date, a comprehensive review of these two families of methods has not yet been undertaken. In this paper, we (I) summarize the current advances in the emerging abilities of LLMs into more accessible AI solutions for science, and (II) investigate the challenges and opportunities of developing affordable solutions for scientific domains using LLMs.
摘要:大型语言模型(LLM)开启了处理包括科学在内的各个领域复杂信息的新时代。日益增多的科学文献使这些模型能够有效地获取和理解科学知识,从而提升它们在广泛任务中的表现。由于LLM的强大能力,它们需要极其昂贵的计算资源、海量数据和漫长的训练时间。因此,近年来研究人员提出了各种方法,使面向科学的LLM更加经济可负担。最知名的方法沿两个方向展开:要么关注模型的规模,要么提升数据的质量。迄今为止,尚未有对这两类方法的全面综述。在本文中,我们(I)将LLM新兴能力方面的当前进展总结为更易获得的科学AI解决方案,以及(II)探讨使用LLM为科学领域开发经济可行的解决方案所面临的挑战与机遇。

[NLP-26] Crafting Tomorrows Headlines: Neural News Generation and Detection in English Turkish Hungarian and Persian
[NLP-26] 制作明天头条新闻:英语、土耳其语、匈牙利语和波斯语的神经新闻生成和检测

链接: https://arxiv.org/abs/2408.10724
作者: Cem Üyük,Danica Rovó,Shaghayegh Kolli,Rabia Varol,Georg Groh,Daryna Dementieva
关键词-EN: Large Language Models, facilitation with Large, societal well-being, Large Language, era dominated
关键词-ZH: 大型语言模型,促进大型、社会福祉、大型语言、主导时代
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both, zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLMs prompting. We present the detection results aiming to delve into the interpretablity and robustness of machine-generated texts detectors across all target languages.
摘要:在信息过载及大型语言模型(LLM)助推信息传播的时代,错误信息的盛行对公共话语和社会福祉构成重大威胁。当前的一个关键问题是识别机器生成的新闻。在这项工作中,我们迈出了重要一步,引入了一个专为英语、土耳其语、匈牙利语和波斯语四种语言的神经新闻检测而设计的基准数据集。该数据集包含多个多语言生成器(包括零样本和微调两种设置)的输出,例如BloomZ、LLaMa-2、Mistral、Mixtral和GPT-4。接下来,我们尝试了各种分类器,从基于语言特征的分类器到先进的基于Transformer的模型和LLM提示。我们给出的检测结果旨在深入考察机器生成文本检测器在所有目标语言上的可解释性和鲁棒性。

[NLP-27] MEGen: Generative Backdoor in Large Language Models via Model Editing
[NLP-27] MEGen:通过模型编辑在大型语言模型中实现生成后门

链接: https://arxiv.org/abs/2408.10722
作者: Jiyang Qiu,Xinbei Ma,Zhuosheng Zhang,Hai Zhao
关键词-EN: demonstrated remarkable capabilities, Large language models, Large language, remarkable capabilities, demonstrated remarkable
关键词-ZH: 展示了非凡的能力,大型语言模型,大型语言,非凡的能力,展示了非凡的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Working in progress

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. Emerging as widely adopted generalists for diverse tasks, LLMs are still vulnerable to backdoors. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects. In our approach, we first leverage a language model to insert a trigger selected on fixed metrics into the input, then design a pipeline of model editing to directly embed a backdoor into an LLM. By adjusting a small set of local parameters with a mini-batch of samples, MEGen significantly enhances time efficiency and achieves high robustness. Experimental results indicate that our backdoor attack strategy achieves a high attack success rate on poison data while maintaining the model’s performance on clean data. Notably, the backdoored model, when triggered, can freely output pre-set dangerous information while successfully completing downstream tasks. This suggests that future LLM applications could be guided to deliver certain dangerous information, thus altering the LLM’s generative style. We believe this approach provides insights for future LLM applications and the execution of backdoor attacks on conversational AI systems.
摘要:大型语言模型(LLM)已展现出卓越的能力。其强大的生成能力使其能够针对各种查询或指令做出灵活回复。作为在多样任务中被广泛采用的通用模型,LLM仍然容易受到后门攻击。本文提出了一种基于编辑的生成式后门MEGen,旨在以最小的副作用为NLP任务创建定制后门。在我们的方法中,我们首先利用语言模型将按固定指标选择的触发器插入输入,然后设计模型编辑流水线,直接将后门嵌入LLM。通过用一小批样本调整一小组局部参数,MEGen显著提高了时间效率并实现了高鲁棒性。实验结果表明,我们的后门攻击策略在保持模型在干净数据上的性能的同时,对带毒数据取得了较高的攻击成功率。值得注意的是,植入后门的模型在被触发时,可以在成功完成下游任务的同时自由输出预设的危险信息。这表明未来的LLM应用可能被引导传递某些危险信息,从而改变LLM的生成风格。我们相信,这种方法为未来的LLM应用以及对对话式AI系统执行后门攻击的研究提供了见解。

[NLP-28] CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
[NLP-28] CodeJudge-Eval:大型语言模型可以成为代码理解的好法官吗?

链接: https://arxiv.org/abs/2408.10718
作者: Yuwei Zhao,Ziyang Luo,Yuchen Tian,Hongzhan Lin,Weixiang Yan,Annan Li,Jing Ma
关键词-EN: Recent advancements, showcased impressive code, large language models, code understanding abilities, primarily evaluated
关键词-ZH: 最近的进步,展示了令人印象深刻的代码、大型语言模型、代码理解能力,主要评估
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model’s code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs’ code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark’s ability to probe deeper into models’ code understanding abilities. Our benchmark will be available at \urlthis https URL.
摘要:大型语言模型(LLM)的最新进展展示了令人印象深刻的代码生成能力,这些能力主要通过语言到代码的基准来评估。然而,这些基准可能无法完全反映模型的代码理解能力。本文介绍了一个新的基准CodeJudge-Eval(CJ-Eval),旨在从代码评判而非代码生成的角度来评估LLM的代码理解能力。CJ-Eval要求模型判断所提供代码方案的正确性,涵盖各种错误类型和编译问题。通过利用多样化的题目和细粒度的评判体系,CJ-Eval解决了传统基准的局限性,包括对解法的潜在记忆。在CJ-Eval上对12个知名LLM的评估显示,即使是最先进的模型也表现不佳,突显了该基准深入探测模型代码理解能力的作用。我们的基准将在此HTTPS URL提供。
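CJ-Eval 把评测从“生成代码”换成“评判代码对错”,其核心指标本质上是模型评判与真实执行结果的一致率。下面是一个示意性的计算(评判结果与执行结果均为虚构数据):

```python
def judge_accuracy(verdicts, ground_truth):
    # 模型给出的“对/错”判定与代码实际执行结果一致的比例
    hits = sum(v == g for v, g in zip(verdicts, ground_truth))
    return hits / len(ground_truth)

# 假想的 LLM 评判结果与真实执行结果
llm_verdicts = ["correct", "wrong", "correct", "wrong"]
actual       = ["correct", "correct", "correct", "wrong"]
print(judge_accuracy(llm_verdicts, actual))  # 0.75
```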

[NLP-29] Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
[NLP-29] Ferret:更快、更有效的自动化红色团队,采用基于奖励的评分技术

链接: https://arxiv.org/abs/2408.10701
作者: Tej Deep Pala,Vernon Y.H. Toh,Rishabh Bhardwaj,Soujanya Poria
关键词-EN: numerous real-world applications, Rainbow Teaming, today era, real-world applications, ensuring their safety
关键词-ZH: 众多现实应用程序,Rainbow Teaming,当今时代,现实应用程序,确保其安全
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In today’s era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm to improve the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, utilizing a reward model as a scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline and generates adversarial prompts that are transferable i.e. effective on other LLMs of larger size. Our codes are available at this https URL.
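The mutate-rank-select loop described in this abstract can be sketched in a few lines. Here `mutate` and `score` are toy stand-ins, assumed for illustration rather than taken from the paper, for the LLM-based mutator and the reward-model scorer:

```python
import random

def best_of_n_attack(seed_prompt, mutate, score, n=4, iterations=3):
    """Sketch of a Ferret-style search: per iteration, generate several
    adversarial prompt mutations, score each candidate, and keep only the
    highest-scoring one for the next round."""
    current = seed_prompt
    for _ in range(iterations):
        candidates = [mutate(current) for _ in range(n)]
        current = max(candidates, key=score)  # rank by estimated reward/harm
    return current

# Toy stand-ins: mutation appends a random two-character suffix; the
# "reward model" simply prefers longer prompts.
random.seed(0)
mutate = lambda p: p + random.choice([" A", " B", " C"])
score = len
result = best_of_n_attack("seed", mutate, score, n=3, iterations=2)
```

In the paper's setting, `score` would be one of the explored scoring functions (a reward model, Llama Guard, or an LLM-as-a-judge) ranking mutations by potential harm.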

[NLP-30] Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models

Link: https://arxiv.org/abs/2408.10692
Authors: Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, Artem Shelmanov
Keywords: detecting Large Language, Large Language Model, Large Language, low quality output, detecting Large
Subjects: Computation and Language (cs.CL)

Abstract:Uncertainty quantification (UQ) is a promising approach to detecting Large Language Model (LLM) hallucinations and low-quality output. In this work, we address one of the challenges of UQ in generation tasks that arises from the conditional dependency between the generation steps of an LLM. We propose to learn this dependency from data. We train a regression model whose target variable is the gap between the conditional and the unconditional generation confidence. During LLM inference, we use this learned conditional dependency model to modulate the uncertainty of the current generation step based on the uncertainty of the previous step. Our experimental evaluation on nine datasets and three LLMs shows that the proposed method is highly effective for uncertainty quantification, achieving substantial improvements over rival approaches.
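A highly simplified sketch of the modulation idea: part of the previous step's uncertainty is carried into the current step, weighted by a conditional-dependency coefficient. In the paper that coefficient comes from a regression model trained on the gap between conditional and unconditional generation confidence; here it is a fixed number and the convex blend is an illustrative assumption:

```python
def modulate_uncertainties(step_uncertainties, dependency):
    """Propagate previous-step uncertainty into the current step.
    `dependency` in [0, 1] plays the role of the learned conditional
    dependency; 0 leaves the raw per-step uncertainties unchanged."""
    adjusted = [step_uncertainties[0]]
    for u in step_uncertainties[1:]:
        adjusted.append((1 - dependency) * u + dependency * adjusted[-1])
    return adjusted

# The step after an uncertain one inherits part of that uncertainty.
smoothed = modulate_uncertainties([0.1, 0.9, 0.2], dependency=0.5)
```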

[NLP-31] Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Link: https://arxiv.org/abs/2408.10682
Authors: Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Keywords: unlearned knowledge, unlearned, training corpora, achieved success, troubled by problematic
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:LLMs have achieved success in many fields but are still troubled by problematic content in their training corpora. LLM unlearning aims at reducing their influence and avoiding undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries, and the unlearned knowledge resurfaces after manually designed attack queries. As part of a red-team effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in 55.2% of the questions, even without revealing the unlearned model's parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where previously trained perturbation vectors are used to enhance the unlearned model's robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over 53.5%, cause less than an 11.6% reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.

[NLP-32] HMoE: Heterogeneous Mixture of Experts for Language Modeling

Link: https://arxiv.org/abs/2408.10681
Authors: An Wang, Xingwu Sun, Ruobing Xie, Shuaipeng Li, Jiaqi Zhu, Zhen Yang, Pinxue Zhao, J.N.Han, Zhanhui Kang, Di Wang, Naoaki Okazaki, Cheng-zhong Xu
Keywords: offers remarkable performance, selectively activating subsets, offers remarkable, remarkable performance, selectively activating
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter utilization. In this study, we propose a novel Heterogeneous Mixture of Experts (HMoE), where experts differ in size and thus possess diverse capacities. This heterogeneity allows for more specialized experts to handle varying token complexities more effectively. To address the imbalance in expert activation, we propose a novel training objective that encourages the frequent activation of smaller experts, enhancing computational efficiency and parameter utilization. Extensive experiments demonstrate that HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks. Codes will be released upon acceptance.
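The core routing idea can be illustrated with a toy forward pass in which experts of unequal capacity are simple callables. The top-k router and uniform mixing below are illustrative simplifications, not the paper's exact design:

```python
def hmoe_forward(x, experts, router_logits, top_k=2):
    """Minimal sketch of a heterogeneous MoE step: experts differ in
    capacity, and the router activates only the top-k by logit,
    averaging their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    return sum(experts[i](x) for i in chosen) / top_k, chosen

# Experts of unequal capacity, emulated by polynomials of different degree.
experts = [lambda x: 2 * x,            # small expert
           lambda x: x * x,            # medium expert
           lambda x: x * x * x + x]    # large expert
out, chosen = hmoe_forward(3.0, experts, router_logits=[0.1, 2.0, 1.5], top_k=2)
```

The paper's contribution is precisely that `experts` need not be identical, plus a training objective that steers the router toward the smaller ones; that objective is not modeled here.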

[NLP-33] Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

Link: https://arxiv.org/abs/2408.10680
Authors: Tianyi Xu, Kaixun Huang, Pengcheng Guo, Yu Zhou, Longtao Huang, Hui Xue, Lei Xie
Keywords: Pre-trained multilingual speech, multilingual speech foundation, Pre-trained multilingual, shown impressive performance, speech foundation models
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally expensive and faces catastrophic forgetting problems. To address these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, while also preserving the established performance on the original languages. Specifically, we first compare various LoRA-based methods to find out their vulnerability to forgetting. To mitigate this issue, we propose to leverage the LoRA parameters from the original model for approximate orthogonal gradient descent on the new samples. Additionally, we also introduce a learnable rank coefficient to allocate trainable parameters for more efficient training. Our experiments with a Chinese Whisper model (for Uyghur and Tibetan) yield better results with a more compact parameter set.
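The approximate orthogonal gradient descent step can be sketched as projecting the new-language gradient off a subspace spanned by unit-norm directions derived from the original model's LoRA parameters. The sequential projection below is a standard Gram-Schmidt-style sketch under that assumption, not the paper's implementation:

```python
def project_out(grad, directions):
    """Remove from `grad` its components along previously learned
    (unit-norm) directions, so updates on new-language data approximately
    preserve behaviour encoded along those directions."""
    g = list(grad)
    for u in directions:
        dot = sum(gi * ui for gi, ui in zip(g, u))
        g = [gi - dot * ui for gi, ui in zip(g, u)]
    return g

# A gradient with a component along the protected direction [1, 0]
# keeps only its orthogonal part.
g = project_out([1.0, 1.0], directions=[[1.0, 0.0]])
```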

[NLP-34] REInstruct: Building Instruction Data from Unlabeled Corpus ACL2024

Link: https://arxiv.org/abs/2408.10663
Authors: Shu Chen, Xinyan Guan, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Keywords: Manually annotating instruction, Manually annotating, large language models, instruction data, annotating instruction data
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL2024 Findings

Abstract:Manually annotating instruction data for large language models is difficult, costly, and hard to scale. Meanwhile, current automatic annotation methods typically rely on distilling synthetic data from proprietary LLMs, which not only limits the upper bound of the quality of the instruction data but also raises potential copyright issues. In this paper, we propose REInstruct, a simple and scalable method to automatically build instruction data from an unlabeled corpus without heavy reliance on proprietary LLMs and human annotation. Specifically, REInstruct first selects a subset of unlabeled texts that potentially contain well-structured helpful and insightful content and then generates instructions for these texts. To generate accurate and relevant responses for effective and robust training, REInstruct further proposes a rewriting-based approach to improve the quality of the generated instruction data. By training Llama-7b on a combination of 3k seed data and 32k synthetic data from REInstruct, the fine-tuned model achieves a 65.41% win rate on the AlpacaEval leaderboard against text-davinci-003, outperforming other open-source, non-distilled instruction data construction methods. The code is publicly available at this https URL.

[NLP-35] Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

Link: https://arxiv.org/abs/2408.10646
Authors: Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, Omri Abend
Keywords: factoid is largely, largely independent, languages, representation, multilingual
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model's ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could benefit from an increase of up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.

[NLP-36] Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation

Link: https://arxiv.org/abs/2408.10642
Authors: Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu
Keywords: Instruct LLM provide, large scale language, Instruct LLM, scale language model, large scale
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures

Abstract:Instruct LLM provides a paradigm used in large-scale language models to align LLMs to human preference. The paradigm contains supervised fine-tuning and reinforcement learning from human feedback. This paradigm is also used in downstream scenarios to adapt LLMs to specific corpora and applications. Compared to SFT, many efforts have focused on RLHF, and several algorithms have been proposed, such as PPO, DPO, IPO, KTO, MinorDPO, etc. Meanwhile, most efforts for SFT focus on how to collect, filter and mix high-quality data. In this article, with insight from DPO and MinorDPO, we propose a training metric for SFT to measure the discrepancy between the optimized model and the original model, and a loss function MinorSFT that can increase training effectiveness and reduce the discrepancy between the optimized LLM and the original LLM.

[NLP-37] Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Link: https://arxiv.org/abs/2408.10635
Authors: Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu
Keywords: Strategist that utilizes, playing multi-agent games, self-improvement process, method Strategist, Monte Carlo tree
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: website: this https URL

Abstract:In this paper, we propose a new method Strategist that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills such as how to evaluate states that guide the low-level execution. We showcase how our method can be used in both action planning and dialogue generation in the context of games, achieving good performance on both tasks. Specifically, we demonstrate that our method can help train agents with better performance than both traditional reinforcement learning-based approaches and other LLM-based skill learning approaches in games including the Game of Pure Strategy (GOPS) and The Resistance: Avalon.

[NLP-38] LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Link: https://arxiv.org/abs/2408.10631
Authors: Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu
Keywords: Large language models, Large language, significantly in scale, grown significantly, Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have grown significantly in scale, leading to a critical need for efficient model pruning techniques. Existing post-training pruning techniques primarily focus on measuring weight importance on converged dense models to determine salient weights to retain. However, they often overlook the changes in weight importance during the pruning process, which can lead to performance degradation in the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring global performance optimization. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance using weights multiplied by gradients. Our experiments show that LLM-Barber can efficiently prune models like LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at this https URL.
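The pruning metric from the abstract, weight multiplied by gradient, can be sketched as a one-shot mask rebuild. Global flat scoring over a single weight vector is a simplification of the paper's block-aware optimization:

```python
def barber_mask(weights, grads, sparsity):
    """One-shot mask rebuild sketch: score each weight by |w * g| and
    zero out the lowest-scoring fraction, without retraining or weight
    reconstruction."""
    scores = [abs(w * g) for w, g in zip(weights, grads)]
    n_prune = int(len(weights) * sparsity)
    pruned = set(sorted(range(len(scores)), key=scores.__getitem__)[:n_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]

# A small weight with a large gradient (index 2) survives, while a
# large weight with a tiny gradient (index 3) is pruned.
mask = barber_mask(weights=[0.5, -2.0, 0.1, 1.0], grads=[0.2, 0.1, 3.0, 0.05], sparsity=0.5)
```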

[NLP-39] Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information

Link: https://arxiv.org/abs/2408.10615
Authors: Ming Jiang, Tingting Huang, Biao Guo, Yao Lu, Feng Zhang
Keywords: Large language models, garnered significant attention, significant attention due, Large language, complex reasoning tasks
Subjects: Computation and Language (cs.CL)

Abstract:In recent years, Large language models (LLMs) have garnered significant attention due to their superior performance in complex reasoning tasks. However, recent studies show that their reasoning capabilities may diminish markedly when problem descriptions contain irrelevant information, even with the use of advanced prompting techniques. To further investigate this issue, a dataset of primary school mathematics problems containing irrelevant information, named GSMIR, was constructed. Testing prominent LLMs and prompting techniques on this dataset revealed that while LLMs can identify irrelevant information, they do not effectively mitigate the interference it causes once identified. A novel automatic construction method, ATF, which enhances the ability of LLMs to identify and self-mitigate the influence of irrelevant information, is proposed to address this shortcoming. This method operates in two steps: first, analysis of irrelevant information, followed by its filtering. The ATF method, as demonstrated by experimental results, significantly improves the reasoning performance of LLMs and prompting techniques, even in the presence of irrelevant information on the GSMIR dataset.
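The two-step procedure, analysis of irrelevant information followed by filtering, might look like the following. The prompt wording and the sentence-level filter are invented for illustration and may differ from the paper's templates:

```python
def atf_prompts(problem):
    """Build an analysis prompt, plus a filtering step that rebuilds the
    problem once the model has flagged irrelevant sentences."""
    analysis = ("Identify any sentences in the following problem that are "
                f"irrelevant to answering the question:\n{problem}")

    def filtering(irrelevant_sentences):
        # Keep only sentences the analysis step did not flag.
        kept = [s for s in problem.split(". ") if s not in irrelevant_sentences]
        return "Solve the problem using only this information:\n" + ". ".join(kept)

    return analysis, filtering

analysis, filtering = atf_prompts("Tom has 3 apples. The sky is blue. He buys 2 more")
followup = filtering(["The sky is blue"])  # flagged by the (simulated) analysis step
```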

[NLP-40] Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Link: https://arxiv.org/abs/2408.10608
Authors: Yongxin Deng (1), Xihe Qiu (1), Xiaoyu Tan (2), Jing Pan (3), Chen Jue (1), Zhijun Fang (4), Yinghui Xu (5), Wei Chu (2), Yuan Qi (5) ((1) Shanghai University of Engineering Science, (2) INF Technology (Shanghai) Co., Ltd., (3) Monash University, (4) Donghua University, (5) Fudan University)
Keywords: Large language models, Large language, extensive text corpora, inevitably include biased, text corpora
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model’s weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experimentation, we have confirmed the presence of the implicit bias problem in LLMs and demonstrated the effectiveness of our BTBR approach.
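Likelihood-ratio screening, as described in the abstract, can be sketched as flagging dataset entries to which the target LLM assigns a much higher likelihood than a reference model does. The per-entry log-probabilities and the threshold below are illustrative assumptions:

```python
def likelihood_ratio_screen(log_p_model, log_p_reference, threshold=1.0):
    """Flag entries whose log-likelihood under the target model exceeds
    the reference model's by more than `threshold`, marking them as
    candidates for bias removal via model editing."""
    flagged = []
    for i, (lp_m, lp_r) in enumerate(zip(log_p_model, log_p_reference)):
        if lp_m - lp_r > threshold:  # log of the likelihood ratio
            flagged.append(i)
    return flagged

# Only the first entry is markedly more likely under the target model.
flagged = likelihood_ratio_screen([-1.0, -5.0, -2.0], [-4.0, -5.5, -2.2])
```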

[NLP-41] Multilingual Non-Factoid Question Answering with Silver Answers

Link: https://arxiv.org/abs/2408.10604
Authors: Ritwik Mishra, Sreeram Vennam, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Keywords: existing Question Answering, short-context Question Answering, Question Answering Datasets, Question Answering, Answering Datasets
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA pairs from MuNfQuAD (golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD testset and the golden set, respectively. Furthermore, the APS model effectively generalizes to certain languages within the golden set, even after being fine-tuned on silver labels.

[NLP-42] An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Link: https://arxiv.org/abs/2408.10593
Authors: Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, Jong C. Park
Keywords: converts sign videos, Large Language Models, Sign Language Translation, sign videos directly, spoken language sentences
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses. Recently, Large Language Models (LLMs) have shown remarkable translation performance in gloss-free methods by harnessing their powerful natural language generation capabilities. However, these methods often rely on domain-specific fine-tuning of visual encoders to achieve optimal results. By contrast, this paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language. With this in mind, we introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework. The core idea of SpaMo is simple yet effective. We first extract spatial and motion features using off-the-shelf visual encoders and then input these features into an LLM with a language prompt. Additionally, we employ a visual-text alignment process as a warm-up before the SLT supervision. Our experiments demonstrate that SpaMo achieves state-of-the-art performance on two popular datasets, PHOENIX14T and How2Sign.

[NLP-43] Putting People in LLMs' Shoes: Generating Better Answers via Question Rewriter

Link: https://arxiv.org/abs/2408.10573
Authors: Junhao Chen, Bowen Wang, Zhouqiang Jiang, Yuta Nakashima
Keywords: Large Language Models, Large Language, Language Models, demonstrated significant capabilities, significant capabilities
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 5 tables

Abstract:Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce single-round instance-level prompt optimization, referred to as question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks. Code is available at this https URL.

[NLP-44] Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation

Link: https://arxiv.org/abs/2408.10557
Authors: Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Keywords: learn one embedding, fixed segment, information, Speech, modeling methods learn
Subjects: Computation and Language (cs.CL)

Abstract:Speech modeling methods learn one embedding for a fixed segment of speech, typically in between 10-25 ms. The information present in speech can be divided into two categories: “what is being said” (content) and “how it is expressed” (other) and these two are orthogonal in nature causing the optimization algorithm to find a sub-optimal solution if forced to optimize together. This leads to sub-optimal performance in one or all downstream tasks as shown by previous studies. Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech. Data augmentation improves the performance on tasks which require effective modeling of other information but this leads to a divided capacity of the model. In this work, we conduct a preliminary study to understand the importance of modeling other information using separate learnable parameters. We propose a modified version of HuBERT, termed Other HuBERT (O-HuBERT), to test our hypothesis. Our findings are twofold: first, the O-HuBERT method is able to utilize all layers to build complex features to encode other information; second, a robust data augmentation strategy is essential for learning the information required by tasks that depend on other information and to achieve state-of-the-art (SOTA) performance on the SUPERB benchmark with a similarly sized model (100 million parameters) and pre-training data (960 hours).

[NLP-45] Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

Link: https://arxiv.org/abs/2408.10548
Authors: Yucheng Ruan, Xiang Lan, Jingying Ma, Yizhi Dong, Kai He, Mengling Feng
Keywords: complex structural relationships, Tabular data, Tabular, presents unique challenges, tabular data analysis
Subjects: Computation and Language (cs.CL)

Abstract:Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely-adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional Pre-training/Pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. GitHub page associated with this survey is available at: this https URL.
摘要:表格数据是一种广泛存在于各个领域的数据类型,由于其异构性和复杂的结构关系,带来了独特的挑战。在表格数据分析中实现高预测性能和稳健性,对许多应用具有重要前景。受自然语言处理最新进展(尤其是Transformer架构)的影响,表格数据建模的新方法应运而生。早期技术集中于从头开始预训练Transformer,常常遇到可扩展性问题。随后,利用BERT等预训练语言模型的方法被开发出来,这些方法需要的数据更少,性能更好。最近出现的GPT和LLaMA等大型语言模型进一步革新了该领域,只需极少的微调即可支持更高级、更多样化的应用。尽管关注度与日俱增,但对表格数据语言建模技术的全面综述仍是空白。本文通过系统回顾表格数据语言建模的发展来填补这一空白,包括:(1)不同表格数据结构和数据类型的分类;(2)模型训练所用关键数据集和评估任务的回顾;(3)建模技术的总结,包括广泛采用的数据处理方法、流行的体系结构和训练目标;(4)从采用传统的预训练/预训练语言模型到使用大型语言模型的演变;(5)表格数据分析语言建模中的持续挑战和潜在的未来研究方向。与此综述相关的GitHub页面位于:此HTTPS URL。

[NLP-46] Synergistic Approach for Simultaneous Optimization of Monolingual Cross-lingual and Multilingual Information Retrieval
[NLP-46] 单语跨语和多语信息检索同时优化的协同方法

链接: https://arxiv.org/abs/2408.10536
作者: Adel Elmahdy,Sheng-Chieh Lin,Amin Ahmad
关键词-EN: increasingly important challenge, Information retrieval, natural language processing, increasingly important, important challenge
关键词-ZH: 越来越重要的挑战,信息检索,自然语言处理,越来越重要,重要的挑战
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 15 pages, 2 figures, 13 tables

点击查看摘要

Abstract:Information retrieval across different languages is an increasingly important challenge in natural language processing. Recent approaches based on multilingual pre-trained language models have achieved remarkable success, yet they often optimize for either monolingual, cross-lingual, or multilingual retrieval performance at the expense of others. This paper proposes a novel hybrid batch training strategy to simultaneously improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings while mitigating language bias. The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size. Experiments on XQuAD-R, MLQA-R, and MIRACL benchmark datasets show that the proposed method consistently achieves comparable or superior results in zero-shot retrieval across various languages and retrieval tasks compared to monolingual-only or cross-lingual-only training. Hybrid batch training also substantially reduces language bias in multilingual retrieval compared to monolingual training. These results demonstrate the effectiveness of the proposed approach for learning language-agnostic representations that enable strong zero-shot retrieval performance across diverse languages.
摘要:跨语言信息检索是自然语言处理中日益重要的挑战。最近基于多语言预训练语言模型的方法取得了显著成功,但它们往往只针对单语、跨语或多语中某一种检索性能进行优化,而牺牲其他方面。本文提出了一种新的混合批次训练策略,在减轻语言偏差的同时,同时提高单语、跨语和多语环境下的零样本检索性能。该方法使用按数据集大小采样的单语和跨语问答对批次的混合来微调多语言模型。在XQuAD-R、MLQA-R和MIRACL基准数据集上的实验表明,与仅单语或仅跨语训练相比,所提方法在各种语言和检索任务的零样本检索中始终取得相当或更优的结果。与单语训练相比,混合批次训练还大大减少了多语言检索中的语言偏差。这些结果证明了所提方法在学习语言无关表征方面的有效性,使其能在不同语言间实现强大的零样本检索性能。
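下面给出一个极简的 Python 示意(假设性实现,并非论文原始代码,函数名与参数均为本文虚构),展示"按数据集大小比例混合采样单语与跨语言批次"的基本思路:

```python
import random

def hybrid_batches(mono_pairs, cross_pairs, batch_size, n_batches, seed=0):
    """按数据集大小比例混合采样单语与跨语言问答对批次(示意)。"""
    rng = random.Random(seed)
    total = len(mono_pairs) + len(cross_pairs)
    p_mono = len(mono_pairs) / total  # 采样单语批次的概率与其数据规模成正比
    batches = []
    for _ in range(n_batches):
        pool = mono_pairs if rng.random() < p_mono else cross_pairs
        batches.append(rng.sample(pool, min(batch_size, len(pool))))
    return batches
```

实际训练中,每个批次会被送入多语言模型做微调;此处仅演示采样逻辑。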

[NLP-47] NoMatterXAI: Generating “No Matter What” Alterfactual Examples for Explaining Black-Box Text Classification Models
[NLP-47] NoMatterXAI:生成“无论如何”的替代事实示例以解释黑匣子文本分类模型

链接: https://arxiv.org/abs/2408.10528
作者: Tuc Nguyen,James Michels,Hua Shen,Thai Le
关键词-EN: communicate feature relevance, well-studied method, method to communicate, relevance through contrastive, contrastive reasoning
关键词-ZH: 沟通特征相关性、经过充分研究的方法、沟通方法、通过对比、对比推理的相关性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In Explainable AI (XAI), counterfactual explanations (CEs) are a well-studied method to communicate feature relevance through contrastive reasoning of “what if” to explain AI models’ predictions. However, they only focus on important (i.e., relevant) features and largely disregard less important (i.e., irrelevant) ones. Such irrelevant features can be crucial in many applications, especially when users need to ensure that an AI model’s decisions are not affected or biased against specific attributes such as gender, race, religion, or political affiliation. To address this gap, the concept of alterfactual explanations (AEs) has been proposed. AEs explore an alternative reality of “no matter what”, where irrelevant features are substituted with alternative features (e.g., “republicans” - “democrats”) within the same attribute (e.g., “politics”) while maintaining a similar prediction output. This serves to validate whether AI model predictions are influenced by the specified attributes. Despite the promise of AEs, there is a lack of computational approaches to systematically generate them, particularly in the text domain, where creating AEs for AI text classifiers presents unique challenges. This paper addresses this challenge by formulating AE generation as an optimization problem and introducing MoMatterXAI, a novel algorithm that generates AEs for text classification tasks. Our approach achieves high fidelity of up to 95% while preserving context similarity of over 90% across multiple models and datasets. A human study further validates the effectiveness of AEs in explaining AI text classifiers to end users. All codes will be publicly available.
摘要:在可解释人工智能(XAI)中,反事实解释(CEs)是一种经过充分研究的方法,它通过"如果……会怎样"的对比推理来传达特征相关性,从而解释AI模型的预测。然而,它们只关注重要(即相关)的特征,而在很大程度上忽略了不太重要(即无关)的特征。这些无关特征在许多应用中可能至关重要,特别是当用户需要确保AI模型的决策不受性别、种族、宗教或政治派别等特定属性的影响或偏见时。为了弥补这一差距,人们提出了替代事实解释(AEs)的概念。AEs探索一种"无论如何"的替代现实:在保持相似预测输出的同时,将同一属性(例如"政治")内的无关特征替换为其他特征(例如将"共和党人"替换为"民主党人")。这可用于验证AI模型的预测是否受指定属性的影响。尽管AEs前景可期,但目前缺乏系统生成AEs的计算方法,尤其是在文本领域,为AI文本分类器创建AEs面临独特挑战。本文将AE生成表述为一个优化问题,并提出了为文本分类任务生成AEs的新算法MoMatterXAI,以应对这一挑战。我们的方法实现了高达95%的保真度,同时在多个模型和数据集上保持了90%以上的上下文相似度。一项人类研究进一步验证了AEs在向最终用户解释AI文本分类器方面的有效性。所有代码都将公开提供。
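替代事实解释的核心操作可以用一小段代码示意:在同一属性内替换无关特征,并检查黑盒分类器的预测是否保持不变。以下为假设性草图(classify 可以是任意黑盒函数,此处并非 MoMatterXAI 的实际实现):

```python
def alterfactual_check(classify, text, substitutions):
    """对同一属性内的词做替换("no matter what"),检查预测是否不变(示意)。
    classify: 文本 -> 标签 的黑盒函数;substitutions: {原词: 替换词}。"""
    base = classify(text)
    results = {}
    for old, new in substitutions.items():
        if old in text:
            alt = text.replace(old, new)
            results[alt] = (classify(alt) == base)  # True 表示预测不受该属性影响
    return base, results
```

若替换后预测发生变化,则说明模型的决策可能依赖该(本应无关的)属性。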

[NLP-48] XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition
[NLP-48] XCB:语音识别中对跨语言短语进行偏置的有效上下文偏置方法

链接: https://arxiv.org/abs/2408.10524
作者: Xucheng Wan,Naijun Zheng,Kai Liu,Huan Zhou
关键词-EN: Contextualized ASR models, Contextualized ASR, predefined phrase list, demonstrated to effectively, effectively improve
关键词-ZH: 上下文化的ASR模型、上下文化的ASR、预定义的短语列表,已被证明可以有效、有效地改进
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: accepted to NCMMSC 2024

点击查看摘要

Abstract:Contextualized ASR models have been demonstrated to effectively improve the recognition accuracy of uncommon phrases when a predefined phrase list is available. However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing(XCB) module. Specifically, we augment a pre-trained ASR model for the dominant language by integrating an auxiliary language biasing module and a supplementary language-specific loss, aimed at enhancing the recognition of phrases in the secondary language. Experimental results conducted on our in-house code-switching dataset have validated the efficacy of our approach, demonstrating significant improvements in the recognition of biasing phrases in the secondary language, even without any additional inference overhead. Additionally, our proposed system exhibits both efficiency and generalization when is applied by the unseen ASRU-2019 test set.
摘要:在提供预定义短语列表时,上下文化ASR模型已被证明能有效提高不常见短语的识别准确率。然而,这些模型在双语场景下往往表现不佳,而双语场景在语码转换语音识别中十分常见。在这项研究中,我们通过引入跨语言上下文偏置(XCB)模块,初步尝试应对这一挑战。具体而言,我们通过集成辅助语言偏置模块和补充的语言特定损失来增强针对主要语言预训练的ASR模型,旨在增强对第二语言短语的识别。在我们内部的语码转换数据集上进行的实验验证了该方法的有效性,表明即使没有任何额外的推理开销,对第二语言中偏置短语的识别也有显著改进。此外,在未见过的ASRU-2019测试集上的应用表明,所提系统兼具效率和泛化能力。
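上下文偏置最简单的一种形式是对包含预定义短语的识别假设重打分。下面是一个假设性的 Python 草图(bonus 等参数均为示意;注意 XCB 是在模型内部进行偏置,而非这种简单的浅层重打分):

```python
def apply_contextual_bias(hypotheses, phrase_list, bonus=2.0):
    """上下文偏置示意:对包含预定义(跨语言)短语的识别假设加分,
    以提升罕见短语的识别。bonus 为假设的偏置强度。"""
    rescored = []
    for text, score in hypotheses:
        hits = sum(1 for p in phrase_list if p in text)  # 命中短语计数
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```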

[NLP-49] Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups SIGDIAL2024
[NLP-49] 数据增强集成对话流程和风格,以使口语对话系统适应低资源用户群体

链接: https://arxiv.org/abs/2408.10516
作者: Zhiyang Qi,Michimasa Inaba
关键词-EN: distinct conversational behaviors, exhibit distinct conversational, interaction challenges encountered, conversational behaviors, study addresses
关键词-ZH: 独特的对话行为、表现出独特的对话、遇到的互动挑战、对话行为、研究地址
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to SIGDIAL 2024

点击查看摘要

Abstract:This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources. Our approach leverages a large language model (LLM) to extract speaker styles and a pre-trained language model (PLM) to simulate dialogue act history. This method generates enriched and personalized dialogue data, facilitating improved interactions with unique user demographics. Extensive experiments validate the efficacy of our methodology, highlighting its potential to foster the development of more adaptive and inclusive dialogue systems.
摘要:本研究解决了在数据稀缺的情况下,口语对话系统(SDS)在与表现出独特对话行为的用户(尤其是未成年人)交互时遇到的挑战。我们提出了一种新颖的数据增强框架,以提升SDS在资源有限用户群体上的性能。我们的方法利用大型语言模型(LLM)提取说话者风格,并利用预训练语言模型(PLM)模拟对话行为历史。该方法生成丰富且个性化的对话数据,促进与独特用户群体的更好交互。大量实验验证了我们方法的有效性,凸显了其促进开发更具适应性和包容性的对话系统的潜力。

[NLP-50] QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention
[NLP-50] QUITO-X:一种基于信息瓶颈、具有交叉注意力的压缩算法

链接: https://arxiv.org/abs/2408.10497
作者: Yihang Wang,Xu Huang,Bowen Tian,Yixing Fan,Jiafeng Guo
关键词-EN: Generative LLM, achieved significant success, LLM have achieved, effectively adapt, adapt to vertical
关键词-ZH: 生成式LLM,取得了重大成功,LLM已经实现,有效适应,适应垂直
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative LLMs have achieved significant success in various industrial tasks and can effectively adapt to vertical domains and downstream tasks through ICL. However, with tasks becoming increasingly complex, the context length required by ICL is also getting longer, and two significant issues arise: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the “lost in the middle” problem. Recently, compressing prompts by removing tokens according to some metric obtained from some causal language models, such as llama-7b, has emerged as an effective approach to mitigate these issues. However, the metrics used by prior methods, such as self-information or PPL, do not fully align with the objective of distinguishing the most important tokens when conditioning on the query. In this work, we introduce information bottleneck theory to carefully examine the properties required by the metric. Inspired by this, we use cross-attention in encoder-decoder architecture as a new metric. Our simple method leads to significantly better performance in smaller models with lower latency. We evaluate our method on four datasets: DROP, CoQA, SQuAD, and Quoref. The experimental results show that, while maintaining the same performance, our compression rate can improve by nearly 25% over previous SOTA. Remarkably, in experiments where 25% of the tokens are removed, our model’s EM score for answers sometimes even exceeds that of the control group using uncompressed text as context.
摘要:生成式LLM已经在各种工业任务中取得了显著成功,并能通过ICL有效适应垂直领域和下游任务。然而,随着任务日益复杂,ICL所需的上下文长度也越来越长,由此产生两个重要问题:(1)过长的上下文导致高昂的代价和推理延迟。(2)长上下文引入的大量与任务无关的信息加剧了"中间迷失"问题。最近,根据从某些因果语言模型(如llama-7b)获得的度量删除token来压缩提示,已成为缓解这些问题的有效方法。然而,先前方法使用的度量(如自信息或PPL)并不完全符合在以查询为条件时区分最重要token的目标。在这项工作中,我们引入信息瓶颈理论来仔细研究度量所需的性质。受此启发,我们把编码器-解码器架构中的交叉注意力用作一种新的度量。我们的简单方法在较小的模型中以更低的延迟带来了显著更好的性能。我们在四个数据集上评估了我们的方法:DROP、CoQA、SQuAD和Quoref。实验结果表明,在保持相同性能的情况下,我们的压缩率可比先前的SOTA提高近25%。值得注意的是,在去除25% token的实验中,我们模型对答案的EM分数有时甚至超过了使用未压缩文本作为上下文的对照组。
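按重要性得分删除 token 的压缩流程可以简化示意如下(假设性代码:scores 代表从编码器-解码器交叉注意力得到的 token 重要性得分,具体取法见原文):

```python
def compress_by_attention(tokens, scores, keep_ratio=0.75):
    """按交叉注意力得分保留最重要的 token,并维持原有顺序(示意)。"""
    k = max(1, int(len(tokens) * keep_ratio))
    # 选出得分最高的 k 个位置
    top_idx = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = set(top_idx)
    return [t for i, t in enumerate(tokens) if i in keep]
```

保留顺序很重要:压缩后的提示仍需是可读的上下文,而非乱序的 token 集合。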

[NLP-51] Analysis of Plan-based Retrieval for Grounded Text Generation
[NLP-51] 用于有依据文本生成的基于计划的检索分析

链接: https://arxiv.org/abs/2408.10490
作者: Ameya Godbole,Nicholas Monath,Seungyeon Kim,Ankit Singh Rawat,Andrew McCallum,Manzil Zaheer
关键词-EN: contradicts established knowledge, seemingly coherent text, seemingly coherent, contradicts established, hallucinations refer
关键词-ZH: 与既定知识相矛盾,看似连贯的文本,看似连贯的,与既定的相矛盾,幻觉指涉
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In text generation, hallucinations refer to the generation of seemingly coherent text that contradicts established knowledge. One compelling hypothesis is that hallucinations occur when a language model is given a generation task outside its parametric knowledge (due to rarity, recency, domain, etc.). A common strategy to address this limitation is to infuse the language models with retrieval mechanisms, providing the model with relevant knowledge for the task. In this paper, we leverage the planning capabilities of instruction-tuned LLMs and analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations. We empirically evaluate several variations of our proposed approach on long-form text generation tasks. By improving the coverage of relevant facts, plan-guided retrieval and generation can produce more informative responses while providing a higher rate of attribution to source documents.
摘要:在文本生成中,幻觉指的是生成与既有知识相矛盾的看似连贯的文本。一个令人信服的假设是,当语言模型被赋予其参数化知识之外的生成任务时(由于稀有性、时效性、领域等原因),就会出现幻觉。解决这一限制的常见策略是为语言模型注入检索机制,为模型提供任务的相关知识。在本文中,我们利用经指令微调的LLM的规划能力,并分析如何利用规划来指导检索,以进一步降低幻觉的频率。我们在长文本生成任务上对所提方法的若干变体进行了实证评估。通过改善相关事实的覆盖范围,计划引导的检索和生成可以产生信息量更大的响应,同时提供更高的源文档归因率。

[NLP-52] Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm
[NLP-52] 基于事件流的手语翻译:高清基准数据集和新算法

链接: https://arxiv.org/abs/2408.10488
作者: Xiao Wang,Yao Rong,Fuling Wang,Jianing Li,Lin Zhu,Bo Jiang,Yaowei Wang
关键词-EN: Sign Language Translation, AI-assisted disability, Event stream sign, core task, field of AI-assisted
关键词-ZH: 手语翻译、人工智能辅助残疾、事件流手语、核心任务、人工智能辅助领域
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: First Large-scale and High-Definition Benchmark Dataset for Event-based Sign Language Translation

点击查看摘要

Abstract:Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well. Additionally, due to their sparsity in space, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model’s ability to integrate temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released on this https URL
摘要:手语翻译是人工智能辅助残疾领域的核心任务。不同于传统的基于可见光视频的SLT容易受到光照、快速手部运动和隐私泄露等因素的影响,本文提出了使用高清事件流来进行SLT,有效地缓解了上述问题。这主要是因为事件流具有高动态范围和密集的时间信号,能够很好地抵御低照度和运动模糊。此外,由于它们在空间上的稀疏性,它们有效地保护了目标人物的隐私。更具体地说,我们提出了一种新的高分辨率事件流手语数据集,称为Event-CSL,有效地填补了这一领域的数据空白。它包含14,827个视频,14,821个注释,以及2544个中文词汇。这些样本是在各种室内和室外场景中收集的,包括多个角度、光线强度和相机移动。我们已对现有的主流SLT作品进行基准比较,以便在未来的工作中进行公平的比较。基于这个数据集和其他几个大规模数据集,我们提出了一种新的基线方法,该方法充分利用了Mamba模型的能力来整合CNN特征的时间信息,从而提高了手语翻译的结果。基准数据集和源代码都将在此HTTPS URL上发布

[NLP-53] LeCov: Multi-level Testing Criteria for Large Language Models
[NLP-53] LeCov:大型语言模型的多层测试标准

链接: https://arxiv.org/abs/2408.10474
作者: Xuan Xie,Jiayang Song,Yuheng Huang,Da Song,Fuyuan Zhang,Felix Juefei-Xu,Lei Ma
关键词-EN: Large Language Models, Large Language, truthfulness and toxicity, Language Models, limited interpretability
关键词-ZH: 大型语言模型,大型语言,真实性和毒性,语言模型,有限的解释性
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are in various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover untrustworthy issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and contain nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experiment evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
摘要:大语言模型被广泛应用于许多不同的领域,但由于它们的可解释性有限,在真实性和毒性等方面存在可信度的问题。最近的研究已经开始开发LLMS的测试方法,旨在在部署之前发现不可信任的问题,即缺陷。然而,缺乏系统化和形式化的测试标准,这阻碍了对测试探索的程度和充分性的全面评估。为了缓解这一威胁,我们提出了一套针对LLM的多级别测试标准LeCov。该标准考虑了LLM的三个重要内部成分,即注意机制、前馈神经元和不确定性,共包含九种类型的测试标准。我们在两个场景中应用标准:测试优先顺序和覆盖率指导的测试。在三个模型和四个数据集上的实验评估表明了LeCov的实用性和有效性。

[NLP-54] Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
[NLP-54] 通过稀疏-密集-稀疏机制增强一次性修剪预训练语言模型

链接: https://arxiv.org/abs/2408.10473
作者: Guanchen Li,Xiandong Zhao,Lian Liu,Zeping Li,Dong Li,Lu Tian,Jie He,Ashish Sirasao,Emad Barsoum
关键词-EN: language processing tasks, Pre-trained language models, natural language processing, Pre-trained language, exhibit outstanding performance
关键词-ZH: 语言处理任务,预训练语言模型,自然语言处理,预训练语言,表现出出色的性能
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an indispensable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.
摘要:预训练语言模型(PLM)具有较强的上下文理解能力,并在各种自然语言处理任务中表现出优异的性能。然而,它们相当大的规模会带来巨大的计算和存储成本。现代剪枝策略使用一次性技术压缩PLM,无需在特定任务或一般数据上重新训练;然而,这些方法通常会导致难以避免的性能下降。本文从权重分布优化的角度,提出了一种稀疏-密集-稀疏(SDS)剪枝框架来提高剪枝后PLM的性能。我们将剪枝过程分为三个步骤。首先,我们使用传统的一次性剪枝方法剪掉模型中不太重要的连接。接下来,我们通过稀疏正则化重新激活被剪枝的连接,重建一个具有剪枝友好权重分布的密集模型。最后,我们执行第二轮剪枝,得到比最初剪枝更好的剪枝模型。实验结果表明,在相同的稀疏配置下,SDS的性能优于最新的剪枝技术SparseGPT和Wanda。例如,对于2:4稀疏度的OPT-125M,SDS在Raw-Wikitext2上将困惑度降低9.13,并在多个零样本基准上将准确率平均提高2.05%。
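稀疏-密集-稀疏(SDS)的三步流程可以用如下假设性草图示意(以最简单的幅值剪枝代替论文中的一次性剪枝方法,并用传入的 reactivate 函数代替真实的稀疏正则化再训练):

```python
def magnitude_prune(weights, sparsity):
    """一次性幅值剪枝:将绝对值最小的部分权重置零(示意)。"""
    n_zero = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_zero]:
        pruned[i] = 0.0
    return pruned

def sds_prune(weights, sparsity, reactivate):
    """稀疏-密集-稀疏(SDS)流程示意:先剪枝,再"重激活"被剪连接
    (此处以给定的 reactivate 函数代替真实的稀疏正则化再训练),最后二次剪枝。"""
    sparse1 = magnitude_prune(weights, sparsity)   # 第一次一次性剪枝
    dense = reactivate(sparse1)                    # 重建剪枝友好的密集权重分布
    return magnitude_prune(dense, sparsity)        # 第二次剪枝
```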

[NLP-55] Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions
[NLP-55] 通过调整后的影响函数将语言模型的隐私泄露追溯到训练数据

链接: https://arxiv.org/abs/2408.10468
作者: Jinxin Liu,Zao Yang
关键词-EN: include sensitive information, Large Language Models, potential privacy leakage, Language Models, large gradient norms
关键词-ZH: 包括敏感信息、大型语言模型、潜在的隐私泄露、语言模型、大梯度规范
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large gradient norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtained groundtruth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on real-world pretraining data CLUECorpus2020, demonstrating strong robustness regardless prompt and response lengths.
摘要:大型语言模型(LLM)生成的响应可能包含来自个人和组织的敏感信息,导致潜在的隐私泄露。本工作利用影响函数(IFs)将隐私泄露追溯到训练数据,从而缓解语言模型(LMs)的隐私问题。然而,我们注意到,现有的影响函数难以准确估计具有较大梯度范数的token的影响,可能会高估其影响。在追踪最有影响力的样本时,这会导致频繁回溯到含有大梯度范数token的样本,即使实际最有影响力样本的影响被很好地估计,也会被其掩盖。为了解决这个问题,我们提出了启发式调整的影响函数(HAIF),它降低大梯度范数token的权重,从而显著提高追踪最有影响力样本的准确性。为了建立易于获得的隐私泄露追踪基准,我们构建了PII-E和PII-CR两个数据集,分别代表两种不同的场景:一种是模型输出与预训练数据文本相同,另一种是模型利用其推理能力生成与预训练数据不同的文本。与针对各种GPT-2和QWen-1.5模型的最佳SOTA影响函数相比,HAIF显著提高了追踪准确率,在PII-E数据集上提升20.96%至73.71%,在PII-CR数据集上提升3.21%至45.93%。HAIF在真实世界预训练数据CLUECorpus2020上的表现也优于SOTA影响函数,无论提示和响应长度如何,都表现出强大的稳健性。
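HAIF 的核心想法(对梯度范数过大的 token 降低权重)可以用一个极简的假设性函数示意,其中 threshold 为本文虚构的超参数,并非论文给出的具体公式:

```python
def haif_scores(grad_norms, influences, threshold):
    """启发式调整的影响函数(HAIF)思路示意:对梯度范数过大的 token
    降低权重,以免其影响被高估。threshold 为假设的超参数。"""
    adjusted = []
    for g, inf in zip(grad_norms, influences):
        w = min(1.0, threshold / g) if g > 0 else 1.0  # 大梯度范数 -> 权重小于 1
        adjusted.append(w * inf)
    return adjusted
```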

[NLP-56] Federated Learning of Large ASR Models in the Real World
[NLP-56] 现实世界中大型ASR模型的联邦学习

链接: https://arxiv.org/abs/2408.10443
作者: Yonghui Xiao,Yuxin Ding,Changwan Ryu,Petr Zadrazil,Francoise Beaufays
关键词-EN: shown promising results, Federated learning, training machine learning, machine learning models, machine learning
关键词-ZH: 显示出有希望的结果,联合学习、训练机器学习、机器学习模型、机器学习
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Federated learning (FL) has shown promising results on training machine learning models with privacy preservation. However, for large models with over 100 million parameters, the training resource requirement becomes an obstacle for FL because common devices do not have enough memory and computation power to finish the FL tasks. Although efficient training methods have been proposed, it is still a challenge to train the large models like Conformer based ASR. This paper presents a systematic solution to train the full-size ASR models of 130M parameters with FL. To our knowledge, this is the first real-world FL application of the Conformer model, which is also the largest model ever trained with FL so far. And this is the first paper showing FL can improve the ASR model quality with a set of proposed methods to refine the quality of data and labels of clients. We demonstrate both the training efficiency and the model quality improvement in real-world experiments.
摘要:联邦学习(FL)在训练具有隐私保护的机器学习模型方面取得了令人鼓舞的结果。然而,对于参数超过1亿的大型模型,训练资源需求成为FL的障碍,因为常见设备没有足够的内存和计算能力来完成FL任务。尽管已经提出了高效的训练方法,但训练像基于Conformer的ASR这样的大型模型仍然是一个挑战。本文提出了一种使用FL训练1.3亿(130M)参数全尺寸ASR模型的系统解决方案。据我们所知,这是Conformer模型在真实世界中的首个FL应用,也是迄今为止使用FL训练的最大模型。这也是第一篇表明FL可以通过一套用于提升客户端数据和标签质量的方法来提高ASR模型质量的论文。我们在真实世界的实验中展示了训练效率和模型质量的改进。

[NLP-57] Goldfish: Monolingual Language Models for 350 Languages
[NLP-57] 金鱼:350种语言的单语语言模型

链接: https://arxiv.org/abs/2408.10441
作者: Tyler A. Chang,Catherine Arnett,Zhuowen Tu,Benjamin K. Bergen
关键词-EN: languages, models, Goldfish, large multilingual models, Transformer language models
关键词-ZH: 语言、模型、金鱼、大型多语言模型、Transformer语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.
摘要:对于许多低资源语言,唯一可用的语言模型是同时在多种语言上训练的大型多语言模型。然而,以FLORES困惑度为衡量标准,我们发现这些模型在许多语言上的表现不如二元语法模型(例如,XGLM 4.5B中24%的语言;BLOOM 7.1B中43%的语言)。为了促进专注于低资源语言的研究,我们预训练并发布了Goldfish,这是一套针对350种语言、参数规模最高达1.25亿(125M)的单语自回归Transformer语言模型。尽管每个Goldfish模型都要小10倍以上,但在204种FLORES语言中的98种上,Goldfish达到了比BLOOM、XGLM和MaLA-500更低的FLORES困惑度。然而,Goldfish在推理基准上的表现明显逊于更大的多语言模型,这表明对于低资源语言,多语言性主要改善的是一般推理能力,而非基本文本生成。我们发布了在5MB(350种语言)、10MB(288种语言)、100MB(166种语言)和1GB(83种语言)文本数据(如可用)上训练的模型。在低资源NLP研究中,Goldfish模型可作为基线、微调来源或对现有模型的补充,它们对需要跨语言最大可比性模型的跨语言研究也很有用。
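论文用作对比基线的二元语法困惑度可以这样计算(加法平滑的假设性实现,仅供示意,并非论文原始代码):

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """用加法平滑的二元语法模型计算困惑度(示意),
    可作为与多语言大模型对比的简单基线。"""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab = len(set(train_tokens)) + 1  # +1 为未登录词预留
    log_prob = 0.0
    pairs = list(zip(test_tokens, test_tokens[1:]))
    for a, b in pairs:
        p = (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))
```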

[NLP-58] Development of an AI Anti-Bullying System Using Large Language Model Key Topic Detection
[NLP-58] 基于大语言模型关键话题检测的人工智能反欺凌系统开发

链接: https://arxiv.org/abs/2408.10417
作者: Matthew Tassava,Cameron Kolodjski,Jordan Milbrath,Adorah Bishop,Nathan Flanders,Robbie Fetsch,Danielle Hanson,Jeremy Straub
关键词-EN: artificial intelligence, paper presents, presents and evaluates, evaluates work, anti-bullying system
关键词-ZH: 人工智能、论文呈现、呈现和评估、评估工作、反欺凌系统
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents and evaluates work on the development of an artificial intelligence (AI) anti-bullying system. The system is designed to identify coordinated bullying attacks via social media and other mechanisms, characterize them and propose remediation and response activities to them. In particular, a large language model (LLM) is used to populate an enhanced expert system-based network model of a bullying attack. This facilitates analysis and remediation activity - such as generating report messages to social media companies - determination. The system is described and the efficacy of the LLM for populating the model is analyzed herein.
摘要:本文介绍并评估了人工智能(AI)反欺凌系统的开发工作。该系统旨在识别通过社交媒体和其他机制的协调欺凌攻击,对其进行特征描述并提出补救和响应活动。特别是,使用大型语言模型(LLM)来填充欺凌攻击的增强型基于专家系统的网络模型。这有助于分析和补救活动(例如向社交媒体公司生成报告消息)的确定。本文描述了该系统,并分析了LLM填充模型的功效。

[NLP-59] Resolving Lexical Bias in Edit Scoping with Projector Editor Networks
[NLP-59] 用投影仪编辑器网络解决编辑范围界定中的词汇偏差

链接: https://arxiv.org/abs/2408.10411
作者: Hammad Rizwan,Domenic Rosati,Ga Wu,Hassan Sajjad
关键词-EN: techniques heavily rely, Weight-preserving model editing, editing techniques heavily, Weight-preserving model, model editing techniques
关键词-ZH: 严重依赖技术,保留权重的模型编辑,大量编辑技术,保留权重模型,模型编辑技术
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Weight-preserving model editing techniques heavily rely on the scoping mechanism that decides when to apply an edit to the base model. These scoping mechanisms utilize distance functions in the representation space to ascertain the scope of the edit. In this work, we show that distance-based scoping functions grapple with lexical biases, leading to issues such as misfires on irrelevant prompts that share similar lexical characteristics. To address this problem, we introduce Projector Editor Networks for Model Editing (PENME), a model editing approach that employs a compact adapter with a projection network trained via a contrastive learning objective. We demonstrate the efficacy of PENME in achieving superior results while being compute efficient and flexible enough to adapt across model architectures.
摘要:保留权重的模型编辑技术严重依赖于决定何时对基础模型应用编辑的作用域机制。这些作用域机制利用表示空间中的距离函数来确定编辑的作用范围。在这项工作中,我们表明基于距离的作用域函数受词汇偏差影响,会在词汇特征相似但不相关的提示上被错误触发。为了解决这个问题,我们提出了用于模型编辑的投影器编辑器网络(PENME),这是一种模型编辑方法,它采用一个紧凑的适配器,其投影网络通过对比学习目标进行训练。我们证明了PENME在取得更优结果的同时,计算高效且能灵活适配不同的模型架构。
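基于距离的编辑作用域机制(即 PENME 所要改进的对象)可以用如下假设性草图示意:当查询表示与最近的编辑键距离小于阈值时触发该编辑:

```python
def edit_scope(query_vec, edit_keys, radius):
    """基于距离的编辑作用域判定示意:查询与某个编辑键的距离
    小于 radius 时触发对应编辑;PENME 用投影网络缓解其词汇偏差。"""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5  # 欧氏距离
    best = min(edit_keys, key=lambda k: dist(query_vec, edit_keys[k]))
    return best if dist(query_vec, edit_keys[best]) <= radius else None
```

词汇偏差的风险正在于:词面相似但语义无关的查询,也可能落入某个编辑键的半径之内。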

[NLP-60] Value Alignment from Unstructured Text
[NLP-60] 非结构化文本的价值一致

链接: https://arxiv.org/abs/2408.10392
作者: Inkit Padhi,Karthikeyan Natesan Ramamurthy,Prasanna Sattigeri,Manish Nagireddy,Pierre Dognin,Kush R. Varshney
关键词-EN: Aligning large language, large language models, large language, systems has emerged, significant area
关键词-ZH: 调整大型语言、大型语言模型、大型语言、系统已经出现,重要领域
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use-cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through the use of automatic metrics and win rates.
摘要:将大型语言模型(LLM)与价值系统保持一致已成为人工智能和NLP领域的一个重要研究领域。目前,这种对齐过程依赖于高质量的监督和偏好数据的可用性,这可能既耗时又昂贵。在本文中,我们介绍了一种系统性的端到端方法,用于将LLM与非结构化文本数据中表示的隐式和显式值进行对齐。我们提出的方法利用可扩展合成数据生成技术的使用,以有效地将模型与非结构化数据中存在的值保持一致。通过两个不同的用例,我们展示了我们方法论在Mistral-7 B-Direct模型上的效率。我们的方法可靠地将LLM与文档中嵌入的价值相一致,并显示出与其他方法相比的更好的性能,通过使用自动指标和获胜率进行量化。

[NLP-61] Narrowing the Gap between Vision and Action in Navigation
[NLP-61] 缩小航海愿景与行动之间的差距

链接: https://arxiv.org/abs/2408.10388
作者: Yue Zhang,Parisa Kordjamshidi
关键词-EN: Vision and Language, methods for Vision, Language Navigation, Continuous Environment, commonly incorporate
关键词-ZH: 视觉和语言、视觉、语言导航、连续环境的方法,通常结合
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves navigation performance significantly compared to direct training using low-level actions. However, the VLN-CE agents are still far from the real robots since there are gaps between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within the low-level action movements. Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passibility, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. Moreover, we enhance the current waypoint predictor by utilizing visual representations containing rich semantic information and explicitly masking obstacles based on humans’ prior knowledge about the feasibility of actions. Empirically, our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.
摘要:现有的连续环境视觉与语言导航(VLN-CE)方法通常引入路点预测器来离散化环境。这将导航动作简化为视图选择任务,与直接使用低级动作训练相比显著提高了导航性能。然而,由于视觉感知与执行动作之间存在差距,VLN-CE智能体距离真实机器人仍然很远。首先,对视觉环境进行离散化的VLN-CE智能体主要以高级视图选择进行训练,这导致它们忽略了低级动作运动中关键的空间推理。其次,在这些模型中,现有的路点预测器忽略了对象语义及其与可通行性相关的属性,而这些信息有助于判断动作的可行性。为了解决这两个问题,我们引入了一个与高级动作预测联合训练的低级动作解码器,使当前的VLN智能体能够学习并将选定的视觉视图落实到低级控制上。此外,我们利用包含丰富语义信息的视觉表示,并基于人类关于动作可行性的先验知识显式地屏蔽障碍物,从而增强了现有的路点预测器。实验表明,与强基线相比,我们的智能体在高级和低级动作上均能提升导航性能指标。

[NLP-62] Beyond Relevant Documents: A Knowledge-Intensive Approach for Query-Focused Summarization using Large Language Models ICPR2024
[NLP-62] 超越相关文档:使用大型语言模型进行以查询为中心的摘要的知识密集型方法

链接: https://arxiv.org/abs/2408.10357
作者: Weijia Zhang,Jia-Hong Huang,Svitlana Vakulenko,Yumo Xu,Thilina Rajapakse,Evangelos Kanoulas
关键词-EN: including search engines, natural language processing, Query-focused summarization, broad applications, including search
关键词-ZH: 包括搜索引擎、自然语言处理、以查询为中心的摘要、包括搜索在内的广泛应用
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted by the 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Query-focused summarization (QFS) is a fundamental task in natural language processing with broad applications, including search engines and report generation. However, traditional approaches assume the availability of relevant documents, which may not always hold in practical scenarios, especially in highly specialized topics. To address this limitation, we propose a novel knowledge-intensive approach that reframes QFS as a knowledge-intensive task setup. This approach comprises two main components: a retrieval module and a summarization controller. The retrieval module efficiently retrieves potentially relevant documents from a large-scale knowledge corpus based on the given textual query, eliminating the dependence on pre-existing document sets. The summarization controller seamlessly integrates a powerful large language model (LLM)-based summarizer with a carefully tailored prompt, ensuring the generated summary is comprehensive and relevant to the query. To assess the effectiveness of our approach, we create a new dataset, along with human-annotated relevance labels, to facilitate comprehensive evaluation covering both retrieval and summarization performance. Extensive experiments demonstrate the superior performance of our approach, particularly its ability to generate accurate summaries without relying on the availability of relevant documents initially. This underscores our method’s versatility and practical applicability across diverse query scenarios.
摘要:以查询为中心的摘要(QFS)是自然语言处理中的一项基本任务,在搜索引擎和报告生成等领域有着广泛的应用。然而,传统方法假定相关文档是现成可得的,这在实际场景中并不总是成立,尤其是在高度专业化的主题中。为了解决这一局限性,我们提出了一种新颖的知识密集型方法,将QFS重新构建为知识密集型任务设置。该方法包括两个主要组件:检索模块和摘要控制器。检索模块基于给定的文本查询,从大规模知识语料库中高效地检索潜在相关的文档,消除了对预先存在的文档集合的依赖。摘要控制器将基于大型语言模型(LLM)的强大摘要生成器与精心定制的提示词无缝集成,确保生成的摘要全面且与查询相关。为了评估我们方法的有效性,我们创建了一个新的数据集以及人工标注的相关性标签,以便对检索和摘要性能进行全面评估。大量实验表明,我们的方法性能优越,尤其是能够在一开始不依赖相关文档可得性的情况下生成准确的摘要。这突显了我们的方法在不同查询场景中的通用性和实用性。
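
论文提出的"检索模块 + 摘要控制器"两段式流程,可以用下面的极简 Python 草图示意。注意:其中 `retrieve` 的打分方式、提示词模板以及示例语料均为假设,仅用于说明流程,并非论文的实际实现(论文中检索面向大规模知识语料库,摘要由 LLM 完成):

```python
# 以查询为中心的摘要(QFS)两段式流程示意:
# 1) 从语料库检索潜在相关文档;2) 将检索结果与查询拼入提示词交给 LLM。
# retrieve 的打分方式与提示词模板均为假设,仅用于说明流程。

def retrieve(query, corpus, top_k=2):
    """按查询词与文档的词重叠数打分,返回得分最高的 top_k 篇文档。"""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q_terms & set(doc.lower().split())))
    return scored[:top_k]

def build_prompt(query, docs):
    """摘要控制器:把检索到的文档和查询组装成 LLM 提示词(模板为假设)。"""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (f"Summarize the documents below to answer the query.\n"
            f"Query: {query}\nDocuments:\n{context}\nSummary:")

corpus = [
    "Transformers use self attention for sequence modeling",
    "Gradient descent optimizes neural network weights",
    "Self attention computes weighted sums over tokens",
]
docs = retrieve("how does self attention work", corpus)
prompt = build_prompt("how does self attention work", docs)
print(docs)
```

实际系统中,`retrieve` 会替换为稠密或稀疏检索器,`prompt` 则交给摘要 LLM 生成最终结果。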

[NLP-63] SEAL: Systematic Error Analysis for Value ALignment
[NLP-63] SEAL:价值对齐的系统误差分析

链接: https://arxiv.org/abs/2408.10270
作者: Manon Revel,Matteo Cargnelutti,Tyna Eloundou,Greg Leppert
关键词-EN: Reinforcement Learning, align language models, training reward models, Human Feedback, language models
关键词-ZH: 强化学习、对齐语言模型、训练奖励模型、人类反馈、语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 17 Figures, 8 Tables

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
摘要:基于人类反馈的强化学习(RLHF)旨在通过在二元偏好上训练奖励模型(RM),并利用这些RM微调基础语言模型(LM),从而使LM与人类价值观保持一致。尽管RLHF很重要,但其内部机制仍然知之甚少。本文引入了评估人类价值观建模与对齐效果的新指标,即特征印记、对齐阻力和对齐稳健性。我们将对齐数据集划分为目标特征(期望的价值)和干扰特征(不期望的概念)。通过将RM分数对这些特征进行回归,我们量化了RM对它们的奖励程度,我们称这一指标为特征印记。我们将对齐阻力定义为RM与人类偏好不匹配的偏好数据所占比例,并通过分析RM对扰动输入的响应来评估对齐稳健性。我们的实验利用Anthropic/hh-rlhf偏好数据集和OpenAssistant RM等开源组件,揭示了目标特征的显著印记以及对干扰特征的明显敏感性。我们观察到,在LM标注者与人类偏好不一致的数据子集中,对齐阻力的发生率为26%。此外,我们发现未对齐往往源于对齐数据集中的歧义条目。这些发现强调了同时审视RM和对齐数据集对于深入理解价值对齐的重要性。
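
论文定义的"特征印记"(将 RM 分数对特征回归)与"对齐阻力"(RM 与人类偏好不一致的比例)可以用如下草图演示其计算方式。数据完全是随机构造的玩具示例,并非论文所用的 Anthropic/hh-rlhf 数据:

```python
import numpy as np

# SEAL 指标计算示意(数据为随机构造的玩具示例,并非论文数据):
# 特征印记 = 将奖励模型(RM)分数对目标/干扰特征做线性回归得到的系数;
# 对齐阻力 = RM 判断与人类偏好标签不一致的偏好对所占比例。

rng = np.random.default_rng(0)
n = 200
target = rng.random(n)    # 目标特征(期望的价值)强度
spoiler = rng.random(n)   # 干扰特征(不期望的概念)强度
# 构造一个奖励目标特征、惩罚干扰特征的 RM 分数,外加少量噪声
rm_score = 2.0 * target - 0.5 * spoiler + 0.1 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), target, spoiler])
coef, *_ = np.linalg.lstsq(X, rm_score, rcond=None)
print("feature imprint (target, spoiler):", coef[1], coef[2])

# 对齐阻力:100 个偏好对中,RM 的优劣判断与人类标签不一致的比例
human_label = rng.integers(0, 2, 100).astype(bool)       # 人类偏好(构造)
rm_margin = np.where(human_label, 1.0, -1.0) + 0.8 * rng.standard_normal(100)
rm_label = rm_margin > 0                                  # RM 的优劣判断
resistance = float(np.mean(rm_label != human_label))
print("alignment resistance:", resistance)
```

回归系数恢复出了构造时的"奖励目标、惩罚干扰"模式;对齐阻力只是一个不一致比例,论文在真实数据上报告的数值为 26%。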

[NLP-64] VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual Acoustic and Glossary Features
[NLP-64] VyAnG-Net:一种通过揭示视觉声学和词汇特征的新型多模式讽刺识别模型

链接: https://arxiv.org/abs/2408.10246
作者: Ananya Pandey,Dinesh Kumar Vishwakarma
关键词-EN: frequently convey sarcasm, sarcasm recognition, Multi-modal Sarcasm Recognition, non-linguistic clues, tone of voice
关键词-ZH: 频繁传达讽刺、讽刺识别、多模式讽刺识别、非语言线索、语气
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Prior, sarcasm recognition has focused mainly on text. Still, it is critical to consider all textual information, audio stream, facial expression, and body position for reliable sarcasm identification. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data and an attentional tokenizer based strategy to extract the most critical context-specific information from the textual data. The following is a list of the key contributions that our experimentation has made in response to performing the task of Multi-modal Sarcasm Recognition: an attentional tokenizer branch to get beneficial features from the glossary content provided by the subtitles; a visual branch for acquiring the most prominent features from the video frames; an utterance-level feature extraction from acoustic content and a multi-headed attention based feature fusion branch to blend features obtained from multiple modalities. Extensive testing on one of the benchmark video datasets, MUSTaRD, yielded an accuracy of 79.86% for speaker dependent and 76.94% for speaker independent configuration demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset MUStARD++.
摘要:各种语言和非语言的线索,如对某个词的过分强调、语调的变化或尴尬的表情,经常传达出讽刺。会话中讽刺识别这一计算机视觉问题旨在识别隐藏在日常对话中的讽刺、批评和隐喻信息。此前,讽刺识别主要集中在文本上。然而,要实现可靠的讽刺识别,必须综合考虑所有文本信息、音频流、面部表情和身体姿态。因此,我们提出了一种新方法,它将轻量级深度注意力模块与自调节ConvNet相结合,以聚焦视觉数据中最关键的特征,并采用基于注意力标记器的策略从文本数据中提取最关键的上下文信息。以下是我们的实验在多模式讽刺识别任务中的主要贡献:一个注意力标记器分支,用于从字幕提供的词汇内容中获取有益特征;一个视觉分支,用于从视频帧中获取最显著的特征;从声学内容中提取话语级特征;以及一个基于多头注意力的特征融合分支,用于融合从多个模态获得的特征。在基准视频数据集之一MUStARD上的大量测试表明,说话人相关配置的准确率为79.86%,与说话人无关配置的准确率为76.94%,表明我们的方法优于现有方法。我们还进行了跨数据集分析,用另一个数据集MUStARD++中的未见样本测试了VyAnG-Net的适应性。
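
论文中"基于多头注意力的特征融合分支"这一步,可以用单头缩放点积注意力的简化草图来说明(真实模型为多头注意力,特征来自文本/视觉/声学编码器;此处用随机向量与单头注意力代替,仅为示意):

```python
import numpy as np

# 多模态特征融合的单头注意力示意:以文本特征为查询,
# 对文本/视觉/声学三个模态的特征做加权求和(特征为随机构造)。

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(query, modal_feats):
    """单头缩放点积注意力:以 query 为查询,对各模态特征加权求和。"""
    d = query.shape[0]
    scores = modal_feats @ query / np.sqrt(d)   # 缩放点积打分
    weights = softmax(scores)
    return weights @ modal_feats, weights

rng = np.random.default_rng(4)
text, visual, acoustic = (rng.standard_normal(8) for _ in range(3))

fused, weights = attention_fuse(text, np.stack([text, visual, acoustic]))
print("fusion weights:", weights)   # 权重非负且和为 1
```

真实模型会并行使用多组这样的注意力头并拼接输出,这里只保留一头以突出"按相关性加权融合"这一核心机制。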

[NLP-65] A General-Purpose Device for Interaction with LLMs
[NLP-65] 用于与LLM交互的通用设备

链接: https://arxiv.org/abs/2408.10230
作者: Jiajun Xu,Qun Wang,Yuhang Cao,Baitao Zeng,Sicheng Liu
关键词-EN: large language models, paper investigates integrating, investigates integrating large, integrating large language, general-purpose device designed
关键词-ZH: 大型语言模型,论文研究集成,研究集成大型,集成大型语言,设计通用设备
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper investigates integrating large language models (LLMs) with advanced hardware, focusing on developing a general-purpose device designed for enhanced interaction with LLMs. Initially, we analyze the current landscape, where virtual assistants and LLMs are reshaping human-technology interactions, highlighting pivotal advancements and setting the stage for a new era of intelligent hardware. Despite substantial progress in LLM technology, a significant gap exists in hardware development, particularly concerning scalability, efficiency, affordability, and multimodal capabilities. This disparity presents both challenges and opportunities, underscoring the need for hardware that is not only powerful but also versatile and capable of managing the sophisticated demands of modern computation. Our proposed device addresses these needs by emphasizing scalability, multimodal data processing, enhanced user interaction, and privacy considerations, offering a comprehensive platform for LLM integration in various applications.
摘要:本文研究了大型语言模型(LLM)与高级硬件的集成,重点是开发一种通用的设备,旨在增强与LLM的交互。首先,我们分析了当前的格局,虚拟助手和LLM正在重塑人与技术的交互,强调关键的进步,并为智能硬件的新时代奠定基础。尽管LLM技术取得了实质性进展,但在硬件开发方面仍存在重大差距,特别是在可伸缩性、效率、可负担性和多模式功能方面。这种差距既带来了挑战,也带来了机遇,突显了对硬件的需求,这些硬件不仅功能强大,而且功能多样,能够管理现代计算的复杂需求。我们建议的设备通过强调可伸缩性、多模式数据处理、增强的用户交互和隐私考虑来满足这些需求,为各种应用中的LLM集成提供了一个全面的平台。

[NLP-66] A Survey on Symbolic Knowledge Distillation of Large Language Models
[NLP-66] 大型语言模型符号知识蒸馏综述

链接: https://arxiv.org/abs/2408.10210
作者: Kamal Acharya,Alvaro Velasquez,Houbing Herbert Song
关键词-EN: Large Language Models, Large Language, Bidirectional Encoder Representations, survey paper delves, symbolic knowledge distillation
关键词-ZH: 大型语言模型、大型语言、双向编码器表示、调查论文深入研究、符号知识蒸馏
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:This survey paper delves into the emerging and critical area of symbolic knowledge distillation in Large Language Models (LLMs). As LLMs like Generative Pre-trained Transformer-3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT) continue to expand in scale and complexity, the challenge of effectively harnessing their extensive knowledge becomes paramount. This survey concentrates on the process of distilling the intricate, often implicit knowledge contained within these models into a more symbolic, explicit form. This transformation is crucial for enhancing the interpretability, efficiency, and applicability of LLMs. We categorize the existing research based on methodologies and applications, focusing on how symbolic knowledge distillation can be used to improve the transparency and functionality of smaller, more efficient Artificial Intelligence (AI) models. The survey discusses the core challenges, including maintaining the depth of knowledge in a comprehensible format, and explores the various approaches and techniques that have been developed in this field. We identify gaps in current research and potential opportunities for future advancements. This survey aims to provide a comprehensive overview of symbolic knowledge distillation in LLMs, spotlighting its significance in the progression towards more accessible and efficient AI systems.
摘要:本文深入探讨了大型语言模型(LLM)中符号知识蒸馏这一新兴而关键的领域。随着生成式预训练Transformer-3(GPT-3)和基于Transformer的双向编码器表示(BERT)等LLM在规模和复杂性上不断扩大,有效利用其丰富知识的挑战变得至关重要。这项综述聚焦于将这些模型中包含的复杂的、通常是隐含的知识蒸馏为更具符号性、更明确的形式的过程。这种转换对于提高LLM的可解释性、效率和适用性至关重要。我们根据方法和应用对现有研究进行分类,重点关注如何利用符号知识蒸馏来提高更小、更高效的人工智能(AI)模型的透明度和功能性。这项综述讨论了核心挑战,包括以可理解的形式保持知识深度,并探讨了该领域已发展出的各种方法和技术。我们指出了当前研究中的空白以及未来发展的潜在机会。这项综述旨在全面概述LLM中的符号知识蒸馏,突出其在迈向更易获得、更高效的AI系统进程中的重要意义。

[NLP-67] In-Context Learning with Representations: Contextual Generalization of Trained Transformers
[NLP-67] 带表示的上下文学习:已训练Transformer的上下文泛化

链接: https://arxiv.org/abs/2408.10147
作者: Tong Yang,Yu Huang,Yingbin Liang,Yuejie Chi
关键词-EN: pretrained large language, large language models, remarkable capability, capability of pretrained, pretrained large
关键词-ZH: 预训练的大型语言,大型语言模型,非凡的能力,预训练的能力,预训练的大型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with m basis functions. We analyze the training dynamics of one-layer multi-head transformers to in-contextly predict unlabeled inputs given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt are not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
摘要:上下文学习(ICL)是指预训练大型语言模型的一种出色能力:在推理过程中,只需给出少量示例即可学习一项新任务。然而,对ICL的理论理解在很大程度上仍未充分探索,特别是Transformer能否被训练得在提示中泛化到未见过的示例,这要求模型获取提示的上下文知识以进行泛化。本文通过非线性回归任务的视角,研究了梯度下降训练Transformer的动力学。这里的上下文泛化可以通过在上下文中学习每个任务的模板函数来实现,其中所有模板函数都位于由m个基函数张成的线性空间中。我们分析了单层多头Transformer的训练动力学,研究其在给定部分带标签的提示时对未标记输入进行上下文内预测的能力,其中标签包含高斯噪声,且每个提示中的示例数量不足以确定模板。在较温和的假设下,我们证明了单层多头Transformer的训练损失线性收敛到全局最小值。此外,Transformer有效地学会了对基函数执行岭回归。据我们所知,这项研究首次可证明地表明:当提示只包含少量问答对时,Transformer能够学习上下文(即模板)信息,从而泛化到未见过的示例和任务。
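
摘要指出 Transformer 实际上学会了"对基函数执行岭回归"。下面的草图直接演示这一被学习到的计算:从带高斯噪声的少量上下文示例中,用岭回归恢复模板函数系数,并对提示中未出现的输入做预测(基函数与数据均为构造示例,并非论文设定的精确复现):

```python
import numpy as np

# 岭回归恢复模板函数示意:模板 = 基函数 {1, x, x^2} 的线性组合,
# 上下文示例的标签含高斯噪声,闭式解 w = (Phi^T Phi + lam*I)^-1 Phi^T y。

rng = np.random.default_rng(1)
basis = lambda x: np.column_stack([np.ones_like(x), x, x ** 2])  # m=3 个基函数
w_true = np.array([1.0, -2.0, 0.5])    # 模板函数在基函数下的系数

x = rng.uniform(-1, 1, 20)             # 提示中的上下文示例输入
y = basis(x) @ w_true + 0.05 * rng.standard_normal(20)  # 标签含高斯噪声

lam = 0.1                              # 岭回归正则系数(取值为示例)
Phi = basis(x)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

x_query = np.array([0.3])              # 对提示中未出现的输入做预测
y_pred = basis(x_query) @ w_hat
print(w_hat, y_pred)
```

论文的结论是:训练好的 Transformer 在前向传播中隐式地完成了与上面闭式解等价的计算,从而对未见输入实现上下文泛化。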

人工智能

[AI-0] NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency

链接: https://arxiv.org/abs/2408.11054
作者: Valentinos Pariza,Mohammadreza Salehi,Gertjan Burghouts,Francesco Locatello,Yuki M. Asano
关键词-EN: Patch Neighbor Consistency, propose sorting patch, self-supervised learning signal, sorting patch representations, nearest neighbor consistency
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. The webpage is accessible at: this https URL

点击查看摘要

Abstract:We propose sorting patch representations across views as a novel self-supervised learning signal to improve pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest neighbor consistency across a student and teacher model, relative to reference batches. Our method leverages a differentiable sorting method applied on top of pretrained representations, such as DINOv2-registers to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. We demonstrate that this method generates high-quality dense feature encoders and establish several new state-of-the-art results: +5.5% and + 6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.
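
论文的核心信号是 patch 级最近邻一致性:学生与教师特征相对参考批次的最近邻应当一致。下面的草图只演示一致率的计算;论文实际使用可微分排序将其变为训练损失(特征为随机构造,度量经过简化):

```python
import numpy as np

# NeCo 思想示意:学生与教师模型的 patch 特征,相对同一参考批次,
# 其最近邻应当一致。此处只计算"最近邻一致率",并非论文的可微分排序损失。

def nearest_neighbors(feats, reference):
    """对每个 patch 特征,返回其在参考批次中余弦相似度最高的索引。"""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (f @ r.T).argmax(axis=1)

rng = np.random.default_rng(2)
reference = rng.standard_normal((16, 8))                  # 参考批次特征
teacher = rng.standard_normal((10, 8))                    # 教师 patch 特征
student = teacher + 0.01 * rng.standard_normal((10, 8))   # 学生特征与教师接近

consistency = float(np.mean(
    nearest_neighbors(student, reference) == nearest_neighbors(teacher, reference)
))
print("patch NN consistency:", consistency)
```

训练时,论文用可微分排序把这种邻居排序的一致性变成可反向传播的损失,从而直接优化稠密特征。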

[AI-1] Revisiting VerilogEval: Newer LLMs In-Context Learning and Specification-to-RTL Tasks

链接: https://arxiv.org/abs/2408.11053
作者: Nathaniel Pinckney,Christopher Batten,Mingjie Liu,Haoxing Ren,Brucek Khailany
关键词-EN: digital hardware code, emerging field, application of large-language, hardware code, models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This paper revisits and improves the benchmark first presented in arXiv:2309.07544 . Seven pages, three figures

点击查看摘要

Abstract:The application of large-language models (LLMs) to digital hardware code generation is an emerging field. Most LLMs are primarily trained on natural language and software code. Hardware code, such as Verilog, represents only a small portion of the training data and few hardware benchmarks exist. To address this gap, the open-source VerilogEval benchmark was released in 2023, providing a consistent evaluation framework for LLMs on code completion tasks. It was tested on state-of-the-art models at the time including GPT-4. However, VerilogEval and other Verilog generation benchmarks lack failure analysis and, in present form, are not conducive to exploring prompting techniques. Also, since VerilogEval’s release, both commercial and open-source models have seen continued development. In this work, we evaluate new commercial and open-source models of varying sizes against an improved VerilogEval benchmark suite. We enhance VerilogEval’s infrastructure and dataset by automatically classifying failures, introduce new prompts for supporting in-context learning (ICL) examples, and extend the supported tasks to specification-to-RTL translation. We find a measurable improvement in commercial state-of-the-art models, with GPT-4 Turbo achieving a 59% pass rate on spec-to-RTL tasks. We also study the performance of open-source and domain-specific models that have emerged, and demonstrate that models can benefit substantially from ICL. We find that recently-released Llama 3.1 405B achieves a pass rate of 58%, effectively matching that of GPT-4 Turbo, and that the much smaller domain-specific RTL-Coder 6.7B models achieve an impressive 37% pass rate. However, prompt engineering is key to achieving good pass rates, and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is key to continued model development and deployment. 
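
文中引用的通过率属于代码生成基准常用的 pass@k 类指标。若按 HumanEval(Chen et al.)提出的无偏估计公式计算(每题采样 n 个候选、其中 c 个通过),可以写成如下草图;具体 n、k 与评测配置以 VerilogEval 的实现为准,此处数字仅为举例:

```python
from math import comb

# pass@k 无偏估计:pass@k = 1 - C(n-c, k) / C(n, k),
# 其中 n 为每题采样的候选数,c 为通过测试的候选数。

def pass_at_k(n: int, c: int, k: int) -> float:
    """若失败候选不足 k 个,任取 k 个必含通过者,概率为 1。"""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 例:对某道 spec-to-RTL 题目采样 20 个候选,其中 5 个通过
print(pass_at_k(20, 5, 1))   # pass@1 = 5/20 = 0.25
```

基准整体的通过率即对所有题目的 pass@k 取平均。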

[AI-2] Accelerating Goal-Conditioned RL Algorithms and Research

链接: https://arxiv.org/abs/2408.11052
作者: Michał Bortkiewicz,Władek Pałucki,Vivek Myers,Tadeusz Dziarmaga,Tomasz Arczewski,Łukasz Kuciński,Benjamin Eysenbach
关键词-EN: transform reinforcement learning, reinforcement learning, paralleling the breakthroughs, potential to transform, areas of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervision has the potential to transform reinforcement learning (RL), paralleling the breakthroughs it has enabled in other areas of machine learning. While self-supervised learning in other domains aims to find patterns in a fixed dataset, self-supervised goal-conditioned reinforcement learning (GCRL) agents discover new behaviors by learning from the goals achieved during unstructured interaction with the environment. However, these methods have failed to see similar success, both due to a lack of data from slow environments as well as a lack of stable algorithms. We take a step toward addressing both of these issues by releasing a high-performance codebase and benchmark JaxGCRL for self-supervised GCRL, enabling researchers to train agents for millions of environment steps in minutes on a single GPU. The key to this performance is a combination of GPU-accelerated environments and a stable, batched version of the contrastive reinforcement learning algorithm, based on an infoNCE objective, that effectively makes use of this increased data throughput. With this approach, we provide a foundation for future research in self-supervised GCRL, enabling researchers to quickly iterate on new ideas and evaluate them in a diverse set of challenging environments. Website + Code: this https URL
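
摘要提到的基于 infoNCE 目标的对比强化学习,其损失计算可以用如下草图示意:一批(状态, 目标)对中,第 i 个状态表征应与第 i 个目标表征最相似,其余作为批内负样本(表征为构造示例,编码器省略):

```python
import numpy as np

# infoNCE 损失示意:相似度矩阵的对角线为正样本对,同批其余为负样本。

def info_nce(state_emb, goal_emb, temperature=1.0):
    logits = state_emb @ goal_emb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # 数值稳定
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

goal_emb = 3.0 * np.eye(4)                 # 4 个目标的正交表征(构造示例)
aligned = goal_emb.copy()                  # 状态表征与对应目标对齐
shuffled = np.roll(goal_emb, 1, axis=0)    # 状态表征与目标错位

loss_aligned = info_nce(aligned, goal_emb)
loss_shuffled = info_nce(shuffled, goal_emb)
print(loss_aligned, loss_shuffled)         # 对齐时损失远小于错位时
```

论文的贡献之一正是这一损失的稳定批处理版本配合 GPU 加速环境,使数据吞吐量的提升能被充分利用。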

[AI-3] FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

链接: https://arxiv.org/abs/2408.11051
作者: Yunzhe Xu,Yiyuan Pan,Zhe Liu,Hesheng Wang
关键词-EN: Large Language Models, Large Language, Language Models, specialized VLN models, applications face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME’s superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion rate on Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI. Project page: this https URL

[AI-4] RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands

链接: https://arxiv.org/abs/2408.11048
作者: Yi Zhao,Le Chen,Jan Schneider,Quankai Gao,Juho Kannala,Bernhard Schölkopf,Joni Pajarinen,Dieter Büchler
关键词-EN: long-standing research goal, robot piano playing, robot piano, endow robot hands, piano playing
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Website: this https URL

点击查看摘要

Abstract:It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast while precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these methods struggle in a multi-song setting. Our work aims to close this gap and, thereby, enable imitation learning approaches for robot piano playing at scale. To this end, we introduce the Robot Piano 1 Million (RP1M) dataset, containing bi-manual robot piano playing motion data of more than one million trajectories. We formulate finger placements as an optimal transport problem, thus, enabling automatic annotation of vast amounts of unlabeled songs. Benchmarking existing imitation learning approaches shows that such approaches reach state-of-the-art robot piano playing performance by leveraging RP1M.
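
"将指法表述为最优运输(分配)问题"可以用一个一维玩具例子示意:为每个要按下的琴键选一根手指,使总移动代价最小。这里用穷举求解小规模分配问题,位置与代价均为构造示例,远比论文的实际设定简化:

```python
from itertools import permutations

# 手指-琴键分配问题示意:最小化手指当前位置到目标琴键的总移动距离。
# 小规模下可直接穷举所有指派(位置为一维简化,非论文原始公式)。

finger_pos = [0.0, 2.0, 4.0, 6.0, 8.0]   # 5 根手指的当前位置
key_pos = [1.0, 5.0, 7.5]                 # 需要按下的 3 个琴键位置

def assign_fingers(fingers, keys):
    """穷举所有指派,返回总移动距离最小的 (总代价, 指派方案)。"""
    best = None
    for perm in permutations(range(len(fingers)), len(keys)):
        cost = sum(abs(fingers[f] - keys[k]) for k, f in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best

cost, assignment = assign_fingers(finger_pos, key_pos)
print(cost, assignment)   # 琴键 i 由手指 assignment[i] 按下
```

大规模时穷举不可行,最优运输/匈牙利算法等方法能在多项式时间内求解,这正是论文能自动标注上百万条轨迹的原因之一。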

[AI-5] Reconciling Methodological Paradigms: Employing Large Language Models as Novice Qualitative Research Assistants in Talent Management Research KDD’24

链接: https://arxiv.org/abs/2408.11043
作者: Sreyoshi Bhaduri,Satya Kapoor,Alex Gil,Anshul Mittal,Rutu Mulkar
关键词-EN: provide rich insights, Qualitative data collection, focus groups, provide rich, customer attitudes
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to KDD '24 workshop on Talent Management and Computing (TMC 2024). 9 pages

点击查看摘要

Abstract:Qualitative data collection and analysis approaches, such as those employing interviews and focus groups, provide rich insights into customer attitudes, sentiment, and behavior. However, manually analyzing qualitative data requires extensive time and effort to identify relevant topics and thematic insights. This study proposes a novel approach to address this challenge by leveraging Retrieval Augmented Generation (RAG) based Large Language Models (LLMs) for analyzing interview transcripts. The novelty of this work lies in strategizing the research inquiry as one that is augmented by an LLM that serves as a novice research assistant. This research explores the mental model of LLMs to serve as novice qualitative research assistants for researchers in the talent management space. A RAG-based LLM approach is extended to enable topic modeling of semi-structured interview data, showcasing the versatility of these models beyond their traditional use in information retrieval and search. Our findings demonstrate that the LLM-augmented RAG approach can successfully extract topics of interest, with significant coverage compared to manually generated topics from the same dataset. This establishes the viability of employing LLMs as novice qualitative research assistants. Additionally, the study recommends that researchers leveraging such models lean heavily on quality criteria used in traditional qualitative research to ensure rigor and trustworthiness of their approach. Finally, the paper presents key recommendations for industry practitioners seeking to reconcile the use of LLMs with established qualitative research paradigms, providing a roadmap for the effective integration of these powerful, albeit novice, AI tools in the analysis of qualitative datasets within talent management research.

[AI-6] GraphFSA: A Finite State Automaton Framework for Algorithmic Learning on Graphs ECAI2024

链接: https://arxiv.org/abs/2408.11042
作者: Florian Grötschla,Joël Mathys,Christoffer Raun,Roger Wattenhofer
关键词-EN: Finite State Automaton, iteratively applied, viewed as sets, sets of rules, number of iterations
类目: Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ECAI 2024

点击查看摘要

Abstract:Many graph algorithms can be viewed as sets of rules that are iteratively applied, with the number of iterations dependent on the size and complexity of the input graph. Existing machine learning architectures often struggle to represent these algorithmic decisions as discrete state transitions. Therefore, we propose a novel framework: GraphFSA (Graph Finite State Automaton). GraphFSA is designed to learn a finite state automaton that runs on each node of a given graph. We test GraphFSA on cellular automata problems, showcasing its abilities in a straightforward algorithmic setting. For a comprehensive empirical evaluation of our framework, we create a diverse range of synthetic problems. As our main application, we then focus on learning more elaborate graph algorithms. Our findings suggest that GraphFSA exhibits strong generalization and extrapolation abilities, presenting an alternative approach to represent these algorithms.
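
GraphFSA"在每个节点上运行同一个有限状态自动机"的思想,可以用一个元胞自动机式的小例子示意(转移规则与图均为构造示例,并非论文学到的自动机):

```python
# GraphFSA 思想示意:每个节点根据自身的离散状态与邻居状态的聚合结果,
# 按同一个有限状态自动机同步转移。下面用"自身已激活或存在已激活邻居
# 则激活"这一简单规则演示信息沿链式图的传播。

def fsa_step(states, adjacency, transition):
    """对所有节点同步执行一步状态转移。"""
    new_states = {}
    for node, neighbors in adjacency.items():
        has_active_neighbor = any(states[n] == 1 for n in neighbors)
        new_states[node] = transition(states[node], has_active_neighbor)
    return new_states

# 一条链 0-1-2-3,初始只有节点 0 处于状态 1
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
states = {0: 1, 1: 0, 2: 0, 3: 0}
transition = lambda s, active_nb: 1 if (s == 1 or active_nb) else 0

for _ in range(3):   # 迭代次数取决于图的规模(这里链长为 4)
    states = fsa_step(states, adjacency, transition)
print(states)   # 3 步后激活状态沿链传播到所有节点
```

GraphFSA 学习的正是这类离散转移表本身,使学到的"算法"可以外推到更大的图和更多的迭代步数。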

[AI-7] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

链接: https://arxiv.org/abs/2408.11039
作者: Chunting Zhou,Lili Yu,Arun Babu,Kushal Tirumala,Michihiro Yasunaga,Leonid Shamis,Jacob Kahn,Xuezhe Ma,Luke Zettlemoyer,Omer Levy
关键词-EN: Transfusion, introduce Transfusion, Transfusion models, continuous data, multiple Transfusion models
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages

点击查看摘要

Abstract:We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
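
按论文描述,训练目标把下一词预测的语言建模损失与图像的扩散损失结合在同一个 Transformer 上,可示意性地写成两项损失的加权和(记号为示意,具体定义以论文为准):

$$
\mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{DDPM}}
$$

其中 $\mathcal{L}_{\text{LM}}$ 是离散文本 token 上的下一词预测(交叉熵)损失,$\mathcal{L}_{\text{DDPM}}$ 是连续图像表示上的去噪扩散损失,$\lambda$ 为平衡两种模态的系数。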

[AI-8] Athena: Safe Autonomous Agents with Verbal Contrastive Learning

链接: https://arxiv.org/abs/2408.11021
作者: Tanmana Sadhu,Ali Pesaranghader,Yanan Chen,Dong Hoon Yi
关键词-EN: large language models, large language, language models, degree of autonomy, utilized as language-based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework which leverages the concept of verbal contrastive learning where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step. Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.

[AI-9] Multiwinner Temporal Voting with Aversion to Change ECAI

链接: https://arxiv.org/abs/2408.11017
作者: Valentin Zech,Niclas Boehmer,Edith Elkind,Nicholas Teh
关键词-EN: study two-stage committee, two-stage committee elections, Proportional Approval Voting, Thiele rules, Approval Voting
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注: Appears in the 27th European Conference on Artificial Intelligence (ECAI), 2024

点击查看摘要

Abstract:We study two-stage committee elections where voters have dynamic preferences over candidates; at each stage, a committee is chosen under a given voting rule. We are interested in identifying a winning committee for the second stage that overlaps as much as possible with the first-stage committee. We show a full complexity dichotomy for the class of Thiele rules: this problem is tractable for Approval Voting (AV) and hard for all other Thiele rules (including, in particular, Proportional Approval Voting and the Chamberlin-Courant rule). We extend this dichotomy to the greedy variants of Thiele rules. We also explore this problem from a parameterized complexity perspective for several natural parameters. We complement the theory with experimental analysis: e.g., we investigate the average number of changes in the committee as a function of changes in voters’ preferences and the role of ties.
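
For intuition, the tractable AV case can be sketched as follows: the second-stage committee consists of the k candidates with the highest approval counts, with score ties broken toward members of the first-stage committee, which maximizes overlap. The specific tie-breaking implementation is an illustrative assumption:

```python
def av_committee_max_overlap(approvals, k, prev_committee):
    """Sketch of the AV case: pick the k candidates with the highest
    approval counts; among candidates tied in score, prefer members of the
    first-stage committee so that overlap with it is maximized."""
    scores = {}
    for ballot in approvals:
        for c in ballot:
            scores[c] = scores.get(c, 0) + 1
    # Sort by score descending, then prefer previous-committee members,
    # then break remaining ties lexicographically for determinism.
    ranked = sorted(scores, key=lambda c: (-scores[c], c not in prev_committee, c))
    return set(ranked[:k])
```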

[AI-10] Hybrid Recurrent Models Support Emergent Descriptions for Hierarchical Planning and Control

链接: https://arxiv.org/abs/2408.10970
作者: Poppy Collis,Ryan Singh,Paul F Kinghorn,Christopher L Buckley
关键词-EN: solving inherently continuous, flexibly learn discrete, inherently continuous problems, artificial intelligence, flexibly learn
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 4 pages, 3 figures

点击查看摘要

Abstract:An open problem in artificial intelligence is how systems can flexibly learn discrete abstractions that are useful for solving inherently continuous problems. Previous work has demonstrated that a class of hybrid state-space model known as recurrent switching linear dynamical systems (rSLDS) discover meaningful behavioural units via the piecewise linear decomposition of complex continuous dynamics (Linderman et al., 2016). Furthermore, they model how the underlying continuous states drive these discrete mode switches. We propose that the rich representations formed by an rSLDS can provide useful abstractions for planning and control. We present a novel hierarchical model-based algorithm inspired by Active Inference in which a discrete MDP sits above a low-level linear-quadratic controller. The recurrent transition dynamics learned by the rSLDS allow us to (1) specify temporally-abstracted sub-goals in a method reminiscent of the options framework, (2) lift the exploration into discrete space allowing us to exploit information-theoretic exploration bonuses and (3) `cache’ the approximate solutions to low-level problems in the discrete planner. We successfully apply our model to the sparse Continuous Mountain Car task, demonstrating fast system identification via enhanced exploration and non-trivial planning through the delineation of abstract sub-goals.

[AI-11] Wave-Mask/Mix: Exploring Wavelet-Based Augmentations for Time Series Forecasting

链接: https://arxiv.org/abs/2408.10951
作者: Dona Arabi,Jafar Bakhshaliyev,Ayse Coskuner,Kiran Madhusudhanan,Kami Serdar Uckardes
关键词-EN: improving machine learning, machine learning model, learning model performance, limited real-world data, discrete wavelet transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation is important for improving machine learning model performance when faced with limited real-world data. In time series forecasting (TSF), where accurate predictions are crucial in fields like finance, healthcare, and manufacturing, traditional augmentation methods for classification tasks are insufficient to maintain temporal coherence. This research introduces two augmentation approaches using the discrete wavelet transform (DWT) to adjust frequency elements while preserving temporal dependencies in time series data. Our methods, Wavelet Masking (WaveMask) and Wavelet Mixing (WaveMix), are evaluated against established baselines across various forecasting horizons. To the best of our knowledge, this is the first study to conduct extensive experiments on multivariate time series using Discrete Wavelet Transform as an augmentation technique. Experimental results demonstrate that our techniques achieve competitive results with previous methods. We also explore cold-start forecasting using downsampled training datasets, comparing outcomes to baseline methods.
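
A minimal single-level Haar sketch of the WaveMask idea: zero out random detail coefficients while keeping the approximation coefficients, so the coarse temporal structure of the series is preserved. The paper uses multi-level DWT and a more refined masking policy, so this is an assumption-laden illustration:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def haar_idwt(a, d):
    """Inverse one-level Haar DWT."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def wave_mask(x, mask_detail_prob=0.5, rng=None):
    """Sketch of a WaveMask-style augmentation: randomly zero detail
    coefficients, keep the approximation, and reconstruct. The masking
    policy here is an assumption; WaveMix would instead mix coefficients
    from two series."""
    rng = np.random.default_rng(rng)
    a, d = haar_dwt(x)
    keep = rng.random(len(d)) >= mask_detail_prob
    return haar_idwt(a, d * keep)
```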

[AI-12] GAIM: Attacking Graph Neural Networks via Adversarial Influence Maximization

链接: https://arxiv.org/abs/2408.10948
作者: Xiaodong Yang,Xiaoting Li,Huiyuan Chen,Yiwei Cai
关键词-EN: Graph Neural Network, trained Graph Neural, mislead trained Graph, Neural Network, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent studies show that well-devised perturbations on graph structures or node features can mislead trained Graph Neural Network (GNN) models. However, these methods often overlook practical assumptions, over-rely on heuristics, or separate vital attack components. In response, we present GAIM, an integrated adversarial attack method conducted on a node feature basis while considering the strict black-box setting. Specifically, we define an adversarial influence function to theoretically assess the adversarial impact of node perturbations, thereby reframing the GNN attack problem into the adversarial influence maximization problem. In our approach, we unify the selection of the target node and the construction of feature perturbations into a single optimization problem, ensuring a unique and consistent feature perturbation for each target node. We leverage a surrogate model to transform this problem into a solvable linear programming task, streamlining the optimization process. Moreover, we extend our method to accommodate label-oriented attacks, broadening its applicability. Thorough evaluations on five benchmark datasets across three popular models underscore the effectiveness of our method in both untargeted and label-oriented targeted attacks. Through comprehensive analysis and ablation studies, we demonstrate the practical value and efficacy inherent to our design choices.

[AI-13] Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models ACL2024

链接: https://arxiv.org/abs/2408.10947
作者: Yuyan Chen,Chenwei Wu,Songzhou Yan,Panjun Liu,Haoyu Zhou,Yanghua Xiao
关键词-EN: large language models, important area, language models, area of study, imparting knowledge
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
*备注: Accepted to ACL 2024

点击查看摘要

Abstract:Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs’ capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl’s taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs’ outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.

[AI-14] Large Language Model Driven Recommendation

链接: https://arxiv.org/abs/2408.10946
作者: Anton Korikov,Scott Sanner,Yashar Deldjoo,Zhankui He,Julian McAuley,Arnau Ramisa,Rene Vidal,Mahesh Sathiamoorthy,Atoosa Kasrizadeh,Silvia Milano,Francesco Ricci
关键词-EN: previous chapters focused, non-verbal user feedback, based on standardized, natural language, previous chapters
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While previous chapters focused on recommendation systems (RSs) based on standardized, non-verbal user feedback such as purchases, views, and clicks – the advent of LLMs has unlocked the use of natural language (NL) interactions for recommendation. This chapter discusses how LLMs’ abilities for general NL reasoning present novel opportunities to build highly personalized RSs – which can effectively connect nuanced and diverse user preferences to items, potentially via interactive dialogues. To begin this discussion, we first present a taxonomy of the key data sources for language-driven recommendation, covering item descriptions, user-system interactions, and user profiles. We then proceed to fundamental techniques for LLM recommendation, reviewing the use of encoder-only and autoregressive LLM recommendation in both tuned and untuned settings. Afterwards, we move to multi-module recommendation architectures in which LLMs interact with components such as retrievers and RSs in multi-stage pipelines. This brings us to architectures for conversational recommender systems (CRSs), in which LLMs facilitate multi-turn dialogues where each turn presents an opportunity not only to make recommendations, but also to engage with the user in interactive preference elicitation, critiquing, and question-answering.

[AI-15] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

链接: https://arxiv.org/abs/2408.10945
作者: Kazi Hasan Ibn Arif,JinYi Yoon,Dimitrios S. Nikolopoulos,Hans Vandierendonck,Deepu John,Bo Ji
关键词-EN: detailed image information, preserving detailed image, High-resolution Vision-Language Models, Large Language Model, multimodal tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder’s attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on NVIDIA TESLA P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
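
The two-stage budgeting described above can be sketched as follows: distribute a global visual-token budget across image partitions in proportion to early-layer attention, then keep the top tokens per partition by final-layer attention. The proportional-allocation rule and the toy scores are illustrative assumptions:

```python
import numpy as np

def hired_select(partition_attn, token_attn, budget):
    """Sketch of HiRED-style token dropping: `partition_attn` holds one
    early-layer attention score per image partition, and `token_attn` holds
    per-token final-layer attention scores for each partition. Returns the
    kept token indices per partition within the global budget."""
    weights = np.asarray(partition_attn, dtype=float)
    alloc = np.floor(budget * weights / weights.sum()).astype(int)
    # Hand any leftover budget to the highest-attention partitions.
    for i in np.argsort(-weights)[: budget - alloc.sum()]:
        alloc[i] += 1
    kept = []
    for scores, k in zip(token_attn, alloc):
        idx = np.argsort(-np.asarray(scores))[:k]
        kept.append(sorted(int(i) for i in idx))
    return kept
```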

[AI-16] A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object Detection

链接: https://arxiv.org/abs/2408.10940
作者: Vladislav Li,Georgios Tsoumplekas,Ilias Siniosoglou,Vasileios Argyriou,Anastasios Lytos,Eleftherios Fountoukidis,Panagiotis Sarigiannidis
关键词-EN: few-shot object detection, Current methods, detection have primarily, primarily focused, focused on enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Current methods for low- and few-shot object detection have primarily focused on enhancing model performance for detecting objects. One common approach to achieve this is by combining model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper seeks to conduct a comprehensive empirical study that examines both model performance and energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated in three different benchmark datasets in terms of their performance and energy consumption, and the Efficiency Factor is employed to gain insights into their effectiveness considering both performance and efficiency. Consequently, it is shown that in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy efficient data augmentation strategies to address data scarcity.

[AI-17] SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement

链接: https://arxiv.org/abs/2408.10934
作者: Linlin Hu,Ao Sun,Shijie Hao,Richang Hong,Meng Wang
关键词-EN: stereo image enhancement, low-light stereo image, image enhancement, low-light image enhancement, image enhancement methods
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-view disparities and enable interaction between the left and right views, leading to improved performance. However, these methods still do not fully exploit the interaction between left and right view information. To address this issue, we propose a model called Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement (SDI-Net). The backbone structure of SDI-Net is two encoder-decoder pairs, which are used to learn the mapping function from low-light images to normal-light images. Among the encoders and the decoders, we design a module named Cross-View Sufficient Interaction Module (CSIM), aiming to fully exploit the correlations between the binocular views via the attention mechanism. The quantitative and visual results on public datasets validate the superiority of our method over other related methods. Ablation studies also demonstrate the effectiveness of the key elements in our model.

[AI-18] The Evolution of Reinforcement Learning in Quantitative Finance

链接: https://arxiv.org/abs/2408.10932
作者: Nikolaos Pippas,Cagatay Turkay,Elliot A. Ludvig
关键词-EN: experienced significant advancement, Reinforcement Learning, past decade, prompting a growing, experienced significant
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: This work is currently submitted to and under-review for ACM Computing Surveys. This copy is an unedited, pre-print version and it is the author’s version of the work.

点击查看摘要

Abstract:Reinforcement Learning (RL) has experienced significant advancement over the past decade, prompting a growing interest in applications within finance. This survey critically evaluates 167 publications, exploring diverse RL applications and frameworks in finance. Financial markets, marked by their complexity, multi-agent nature, information asymmetry, and inherent randomness, serve as an intriguing test-bed for RL. Traditional finance offers certain solutions, and RL advances these with a more dynamic approach, incorporating machine learning methods, including transfer learning, meta-learning, and multi-agent solutions. This survey dissects key RL components through the lens of Quantitative Finance. We uncover emerging themes, propose areas for future research, and critique the strengths and weaknesses of existing methods.

[AI-19] LBC: Language-Based-Classifier for Out-Of-Variable Generalization

链接: https://arxiv.org/abs/2408.10923
作者: Kangjun Noh,Baekryun Seong,Hoyoon Byun,Sungjin Song,Kyungwoo Song
关键词-EN: Large Language Models, natural language processing, Large Language, language processing tasks, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 16 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have great success in natural language processing tasks such as response generation. However, their use in tabular data has been limited due to their inferior performance compared to traditional machine learning models (TMLs) such as XGBoost. We find that the pre-trained knowledge of LLMs enables them to interpret new variables that appear in a test without additional training, a capability central to the concept of Out-of-Variable (OOV). From the findings, we propose a Language-Based-Classifier (LBC), a classifier that maximizes the benefits of LLMs to outperform TMLs on OOV tasks. LBC employs three key methodological strategies: 1) Categorical changes to adjust data to better fit the model’s understanding, 2) Advanced order and indicator to enhance data representation to the model, and 3) Using verbalizer to map logit scores to classes during inference to generate model predictions. These strategies, combined with the pre-trained knowledge of LBC, emphasize the model’s ability to effectively handle OOV tasks. We empirically and theoretically validate the superiority of LBC. LBC is the first study to apply an LLM-based model to OOV tasks. The source code is at this https URL.
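
Strategy 3 above (the verbalizer) can be sketched as mapping the LLM's logit scores over designated label words to class predictions; the label-word mapping and function name below are illustrative assumptions, not the paper's implementation:

```python
def verbalize(logit_rows, vocab, label_words):
    """Sketch of an LBC-style verbalizer: `vocab` maps tokens to logit
    indices and `label_words` maps each class to the token(s) standing for
    it. Each row of logits is mapped to the class whose best label word
    scores highest."""
    preds = []
    for logits in logit_rows:
        class_scores = {
            cls: max(logits[vocab[w]] for w in words)
            for cls, words in label_words.items()
        }
        preds.append(max(class_scores, key=class_scores.get))
    return preds
```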

[AI-20] MTFinEval:A Multi-domain Chinese Financial Benchmark with Eurypalynous questions

链接: https://arxiv.org/abs/2408.10921
作者: Xinyu Liu,Ke Jin
关键词-EN: safely invested, invested in production, LLMS, economy-specific LLMS, economics
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the emergence of more and more economy-specific LLMS, how to measure whether they can be safely invested in production becomes a problem. Previous research has primarily focused on evaluating the performance of LLMs within specific application scenarios. However, these benchmarks cannot reflect the theoretical level and generalization ability, and the backward datasets are increasingly unsuitable for problems in real scenarios. In this paper, we have compiled a new benchmark, MTFinEval, focusing on the LLMs’ basic knowledge of economics, which can always be used as a basis for judgment. To examine only theoretical knowledge as much as possible, MTFinEval is build with foundational questions from university textbooks,and exam papers in economics and management major. Aware of the overall performance of LLMs do not depend solely on one subdiscipline of economics, MTFinEval comprise 360 questions refined from six major disciplines of economics, and reflect capabilities more comprehensively. Experiment result shows all LLMs perform poorly on MTFinEval, which proves that our benchmark built on basic knowledge is very successful. Our research not only offers guidance for selecting the appropriate LLM for specific use cases, but also put forward increase the rigor reliability of LLMs from the basics.

[AI-21] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

链接: https://arxiv.org/abs/2408.10920
作者: Róbert Csordás,Christopher Potts,Christopher D. Manning,Atticus Geiger
关键词-EN: Linear Representation Hypothesis, neural networks learn, Representation Hypothesis, LRH states, states that models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.

[AI-22] CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

链接: https://arxiv.org/abs/2408.10919
作者: Zijian Zhao,Tingwei Chen,Zhijie Cai,Hang Li,Xiaoyang Li,Qimei Chen,Guangxu Zhu
关键词-EN: garnered significant attention, significant attention due, low cost, recent years, numerous benefits
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain scenario and cross-domain scenario, including few-shot, zero-shot scenarios, and even works in few-shot new-class scenario where testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In gesture recognition task, our CrossFi achieves an accuracy of 98.17% in in-domain scenario, 91.72% in one-shot cross-domain scenario, 64.81% in zero-shot cross-domain scenario, and 84.75% in one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.

[AI-23] The impact of labeling automotive AI as “trustworthy” or “reliable” on user evaluation and technology acceptance

链接: https://arxiv.org/abs/2408.10905
作者: John Dorsch,Ophelia Deroy
关键词-EN: automotive AI technologies, Technology Acceptance Model, Acceptance Model, human-like trust, Abstract
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 36 pages, 12 figures

点击查看摘要

Abstract:This study explores whether labeling AI as “trustworthy” or “reliable” influences user perceptions and acceptance of automotive AI technologies. Using a one-way between-subjects design, the research involved 478 online participants who were presented with guidelines for either trustworthy or reliable AI. Participants then evaluated three vignette scenarios and completed a modified version of the Technology Acceptance Model, which included variables such as perceived ease of use, human-like trust, and overall attitude. Although labeling AI as “trustworthy” did not significantly influence judgments on specific scenarios, it increased perceived ease of use and human-like trust, particularly benevolence. This suggests a positive impact on usability and an anthropomorphic effect on user perceptions. The study provides insights into how specific labels can influence attitudes toward AI technology.

[AI-24] A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

链接: https://arxiv.org/abs/2408.10901
作者: Zhongliang Guo,Lei Fang,Jingyu Lin,Yifei Qian,Shuai Zhao,Zeyu Wang,Junhao Dong,Cunjian Chen,Ognjen Arandjelović,Chun Pong Lau
关键词-EN: Latent Diffusion Models, Latent Diffusion, Recent advancements, revolutionized image synthesis, Diffusion Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raises concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA) based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on the white-box information of target models to get rid of the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, in specific merely the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that is helpful in alleviating the socio-technical challenges posed by the rapidly evolving landscape of generative AI.
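
A toy grey-box sketch of the posterior-collapse idea: perturb the input, within a bound, so that the VAE encoder's latent shrinks toward the prior mean, using only encoder access. Finite-difference gradients (standing in for autograd) and the squared-norm objective are implementation assumptions for illustration:

```python
import numpy as np

def posterior_collapse_attack(x, encode, steps=100, lr=0.1, eps=0.1):
    """Sketch of a PCA-style grey-box attack: minimize the squared norm of
    the encoder's latent (pushing it toward a collapsed posterior) over a
    bounded perturbation delta, given only black-box access to `encode`."""
    x = np.asarray(x, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e.flat[i] = 1e-4
            f_plus = np.sum(encode(x + delta + e) ** 2)
            f_minus = np.sum(encode(x + delta - e) ** 2)
            grad.flat[i] = (f_plus - f_minus) / 2e-4
        delta = np.clip(delta - lr * grad, -eps, eps)  # keep perturbation bounded
    return x + delta
```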

[AI-25] Towards Efficient Formal Verification of Spiking Neural Network

链接: https://arxiv.org/abs/2408.10900
作者: Baekryun Seong,Jieung Kim,Sang-Ki Ko
关键词-EN: large language models, primarily focused, focused on large, large language, increasing accuracy
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recently, AI research has primarily focused on large language models (LLMs), and increasing accuracy often involves scaling up and consuming more power. The power consumption of AI has become a significant societal issue; in this context, spiking neural networks (SNNs) offer a promising solution. SNNs operate event-driven, like the human brain, and compress information temporally. These characteristics allow SNNs to significantly reduce power consumption compared to perceptron-based artificial neural networks (ANNs), highlighting them as a next-generation neural network technology. However, societal concerns regarding AI go beyond power consumption, with the reliability of AI models being a global issue. For instance, adversarial attacks on AI models are a well-studied problem in the context of traditional neural networks. Despite their importance, the stability and property verification of SNNs remains in the early stages of research. Most SNN verification methods are time-consuming and barely scalable, making practical applications challenging. In this paper, we introduce temporal encoding to achieve practical performance in verifying the adversarial robustness of SNNs. We conduct a theoretical analysis of this approach and demonstrate its success in verifying SNNs at previously unmanageable scales. Our contribution advances SNN verification to a practical level, facilitating the safer application of SNNs.

[AI-26] Analytical and Empirical Study of Herding Effects in Recommendation Systems

链接: https://arxiv.org/abs/2408.10895
作者: Hong Xie,Mingze Zhong,Defu Lian,Zhen Wang,Enhong Chen
关键词-EN: Online rating systems, rating aggregation rules, Online rating, aggregation rules, rating aggregation
类目: Artificial Intelligence (cs.AI)
*备注: 29 pages

点击查看摘要

Abstract:Online rating systems are often used in numerous web or mobile applications, e.g., Amazon and TripAdvisor, to assess the ground-truth quality of products. Due to herding effects, the aggregation of historical ratings (or historical collective opinion) can significantly influence subsequent ratings, leading to misleading and erroneous assessments. We study how to manage product ratings via rating aggregation rules and shortlisted representative reviews, for the purpose of correcting the assessment error. We first develop a mathematical model to characterize important factors of herding effects in product ratings. We then identify sufficient conditions (via the stochastic approximation theory), under which the historical collective opinion converges to the ground-truth collective opinion of the whole user population. These conditions identify a class of rating aggregation rules and review selection mechanisms that can reveal the ground-truth product quality. We also quantify the speed of convergence (via the martingale theory), which reflects the efficiency of rating aggregation rules and review selection mechanisms. We prove that the herding effects slow down the speed of convergence while an accurate review selection mechanism can speed it up. We also study the speed of convergence numerically and reveal trade-offs in selecting rating aggregation rules and review selection mechanisms. To show the utility of our framework, we design a maximum likelihood algorithm to infer model parameters from ratings, and conduct experiments on rating datasets from Amazon and TripAdvisor. We show that proper recency aware rating aggregation rules can improve the speed of convergence in Amazon and TripAdvisor by 41% and 62% respectively.
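
A recency-aware aggregation rule of the kind analyzed in the paper can be sketched as an exponentially weighted average that down-weights older ratings; the exponential form and parameter are illustrative assumptions, not the paper's specific rule:

```python
def aggregate_ratings(ratings, recency_weight=0.9):
    """Sketch of a recency-aware rating aggregation rule: an exponentially
    weighted average where a rating of age t (0 = most recent) gets weight
    recency_weight**t. With recency_weight=1 this reduces to the plain
    mean; smaller values discount old ratings more aggressively."""
    num, den = 0.0, 0.0
    for age, r in enumerate(reversed(ratings)):
        w = recency_weight ** age
        num += w * r
        den += w
    return num / den
```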

[AI-27] On Learning Action Costs from Input Plans

链接: https://arxiv.org/abs/2408.10889
作者: Marianela Morales,Alberto Pozanco,Giuseppe Canonaco,Sriram Gopalakrishnan,Daniel Borrajo,Manuela Veloso
关键词-EN: actions’ dynamics, action models focus, plans, learning, input plans
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Most of the work on learning action models focus on learning the actions’ dynamics from input plans. This allows us to specify the valid plans of a planning task. However, very little work focuses on learning action costs, which in turn allows us to rank the different plans. In this paper we introduce a new problem: that of learning the costs of a set of actions such that a set of input plans are optimal under the resulting planning model. To solve this problem we present LACFIP^k, an algorithm to learn action costs from unlabeled input plans. We provide theoretical and empirical results showing how LACFIP^k can successfully solve this task.

[AI-28] DAAD: Dynamic Analysis and Adaptive Discriminator for Fake News Detection

链接: https://arxiv.org/abs/2408.10883
作者: Xinqi Su,Yawen Cui,Ajian Liu,Xun Lin,Yuhao Wang,Haochen Liang,Wenhui Li,Zitong Yu
关键词-EN: current web environment, online social networks, web environment, social networks, posing serious threats
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection (MFND) methods can be classified into knowledge-based and semantic-based approaches. However, these methods are overly dependent on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search (MCTS) algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM comments on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using a soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach. The code will be available at: this https URL.
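The soft-routing idea, mixing the four discriminators' scores with learned gate weights rather than hard-selecting one detector, can be sketched as follows (the scores and gate logits below are placeholders, not DAAD's learned values):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def soft_route(scores, gate_logits):
    # convex combination of discriminator scores, weighted by the gate
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, scores))

# fake-news scores from: emotional exaggeration, logical inconsistency,
# image manipulation, semantic inconsistency discriminators
scores = [0.9, 0.2, 0.1, 0.7]
gate_logits = [2.0, 0.0, -1.0, 1.0]   # router favours the first discriminator
fused = soft_route(scores, gate_logits)
print(round(fused, 3))
```

Because the gate is a softmax rather than an argmax, gradients flow to every discriminator during training, which is what makes the routing "soft".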

[AI-29] DBHP: Trajectory Imputation in Multi-Agent Sports Using Derivative-Based Hybrid Prediction

链接: https://arxiv.org/abs/2408.10878
作者: Hanjun Choi,Hyunsung Kim,Minho Lee,Chang-Jo Kim,Jinsung Yoon,Sang-Ki Ko
关键词-EN: collected trajectory data, multi-agent trajectory data, spatiotemporal domains handle, domains handle multi-agent, handle multi-agent trajectory
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Many spatiotemporal domains handle multi-agent trajectory data, but in real-world scenarios, collected trajectory data are often partially missing due to various reasons. While existing approaches demonstrate good performance in trajectory imputation, they face challenges in capturing the complex dynamics and interactions between agents due to a lack of physical constraints that govern realistic trajectories, leading to suboptimal results. To address this issue, the paper proposes a Derivative-Based Hybrid Prediction (DBHP) framework that can effectively impute multiple agents’ missing trajectories. First, a neural network equipped with Set Transformers produces a naive prediction of missing trajectories while satisfying the permutation-equivariance in terms of the order of input agents. Then, the framework makes alternative predictions leveraging velocity and acceleration information and combines all the predictions with properly determined weights to provide final imputed trajectories. In this way, our proposed framework not only accurately predicts position, velocity, and acceleration values but also enforces the physical relationship between them, eventually improving both the accuracy and naturalness of the predicted trajectories. Accordingly, the experiment results about imputing player trajectories in team sports show that our framework significantly outperforms existing imputation baselines.
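The derivative-based hybrid idea can be illustrated on a single missing frame (a minimal sketch with a hand-set weight; DBHP learns the combination weights from data):

```python
# Blend a forward extrapolation (position + velocity * dt) from the last
# observed frame with a backward extrapolation from the next observed frame.
def impute_midpoint(p_prev, v_prev, p_next, v_next, dt=1.0, w_fwd=0.5):
    fwd = p_prev + v_prev * dt          # dead-reckon forward from the past
    bwd = p_next - v_next * dt          # dead-reckon backward from the future
    return w_fwd * fwd + (1 - w_fwd) * bwd

# player moving at a constant 1 m/s along x: frames t=0 and t=2 observed,
# frame t=1 missing; both extrapolations agree on the true position
p = impute_midpoint(p_prev=0.0, v_prev=1.0, p_next=2.0, v_next=1.0)
print(p)  # exact midpoint: 1.0
```

Enforcing the physical relation between position and its derivatives is what keeps the imputed trajectory smooth where a purely neural prediction might jitter.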

[AI-30] V-RoAst: A New Dataset for Visual Road Assessment

链接: https://arxiv.org/abs/2408.10872
作者: Natchapon Jongwiriyanurak,Zichao Zeng,June Moh Goo,Xinglei Wang,Ilya Ilyankou,Kerkritt Srirrongvikrai,Meihui Wang,James Haworth
关键词-EN: Convolutional Neural Networks, Road traffic crashes, significant economic impact, Vision Language Models, traditional Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Road traffic crashes cause millions of deaths annually and have a significant economic impact, particularly in low- and middle-income countries (LMICs). This paper presents an approach using Vision Language Models (VLMs) for road safety assessment, overcoming the limitations of traditional Convolutional Neural Networks (CNNs). We introduce a new task, V-RoAst (Visual question answering for Road Assessment), with a real-world dataset. Our approach optimizes prompt engineering and evaluates advanced VLMs, including Gemini-1.5-flash and GPT-4o-mini. The models effectively examine attributes for road assessment. Using crowdsourced imagery from Mapillary, our scalable solution effectively estimates road safety levels. In addition, this approach is designed for local stakeholders who lack resources, as it does not require training data. It offers a cost-effective and automated method for global road safety assessments, potentially saving lives and reducing economic burdens.

[AI-31] Multi-agent Multi-armed Bandits with Stochastic Sharable Arm Capacities

链接: https://arxiv.org/abs/2408.10865
作者: Hong Xie,Jinyu Mo,Defu Lian,Jie Wang,Enhong Chen
关键词-EN: optimal arm pulling, arm pulling profile, arm pulling, multi-player multi-armed bandit, captures stochastic arrival
类目: Artificial Intelligence (cs.AI)
*备注: 28 pages

点击查看摘要

Abstract:Motivated by distributed selection problems, we formulate a new variant of multi-player multi-armed bandit (MAB) model, which captures stochastic arrival of requests to each arm, as well as the policy of allocating requests to players. The challenge is how to design a distributed learning algorithm such that players select arms according to the optimal arm pulling profile (an arm pulling profile prescribes the number of players at each arm) without communicating to each other. We first design a greedy algorithm, which locates one of the optimal arm pulling profiles with a polynomial computational complexity. We also design an iterative distributed algorithm for players to commit to an optimal arm pulling profile with a constant number of rounds in expectation. We apply the explore then commit (ETC) framework to address the online setting when model parameters are unknown. We design an exploration strategy for players to estimate the optimal arm pulling profile. Since such estimates can be different across different players, it is challenging for players to commit. We then design an iterative distributed algorithm, which guarantees that players can arrive at a consensus on the optimal arm pulling profile in only M rounds. We conduct experiments to validate our algorithm.
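For a toy reward model, the greedy construction of an arm pulling profile looks as follows (the capacity-based reward model and its parameters are assumptions made here for illustration, not the paper's exact formulation):

```python
# Greedy construction of an arm pulling profile (players per arm): arm k
# with n players earns reward[k] * min(n, capacity[k]), so the marginal
# gain of adding a player is reward[k] while capacity lasts.
def greedy_profile(reward, capacity, n_players):
    profile = [0] * len(reward)
    for _ in range(n_players):
        # marginal reward of adding one more player to each arm
        gains = [reward[k] if profile[k] < capacity[k] else 0.0
                 for k in range(len(reward))]
        best = max(range(len(reward)), key=lambda k: gains[k])
        profile[best] += 1
    return profile

profile = greedy_profile(reward=[1.0, 0.6, 0.3], capacity=[2, 2, 2], n_players=4)
print(profile)
```

The greedy rule fills the highest-reward arm up to its capacity before moving on, which is why it runs in polynomial time; the hard part the paper addresses is getting distributed players to commit to such a profile without communication.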

[AI-32] Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning

链接: https://arxiv.org/abs/2408.10858
作者: Haozhe Ma,Zhengding Luo,Thanh Vinh Vo,Kuankuan Sima,Tze-Yun Leong
关键词-EN: auxiliary informative rewards, providing immediate feedback, feedback through auxiliary, auxiliary informative, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reward shaping is effective in addressing the sparse-reward challenge in reinforcement learning by providing immediate feedback through auxiliary informative rewards. Based on the reward shaping strategy, we propose a novel multi-task reinforcement learning framework that integrates a centralized reward agent (CRA) and multiple distributed policy agents. The CRA functions as a knowledge pool, which aims to distill knowledge from various tasks and distribute it to individual policy agents to improve learning efficiency. Specifically, the shaped rewards serve as a straightforward metric to encode knowledge. This framework not only enhances knowledge sharing across established tasks but also adapts to new tasks by transferring valuable reward signals. We validate the proposed method on both discrete and continuous domains, demonstrating its robustness in multi-task sparse-reward settings and its effective transferability to unseen tasks.
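Classic potential-based shaping is one concrete way such auxiliary rewards can be delivered without changing the optimal policy (a sketch; the potential function below is a hand-made distance-to-goal heuristic, not the CRA's distilled knowledge):

```python
# Potential-based reward shaping: add F = gamma * phi(s') - phi(s) to the
# environment reward. Any potential function phi leaves the optimal policy
# unchanged while densifying the reward signal.
GOAL = 10

def phi(state):
    return -abs(GOAL - state)          # closer to the goal => higher potential

def shaped_reward(state, next_state, env_reward, gamma=0.99):
    return env_reward + gamma * phi(next_state) - phi(state)

# a step toward the goal earns positive shaping even with zero env reward
r = shaped_reward(state=4, next_state=5, env_reward=0.0)
print(r)
```

In the framework described above, the centralized agent would effectively be learning and serving something like `phi` across tasks, rather than it being fixed by hand.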

[AI-33] Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

链接: https://arxiv.org/abs/2408.10853
作者: Yuankun Xie,Chenxu Xiong,Xiaopeng Wang,Zhiyong Wang,Yi Lu,Xin Qi,Ruibo Fu,Yukun Liu,Zhengqi Wen,Jianhua Tao,Guanjun Li,Long Ye
关键词-EN: large language models, Audio Language Models, Language Models, rapidly advancing due, audio neural codecs
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and diverse types of deepfake audio, which pose severe threats to society. Consequently, effective audio deepfake detection technologies to detect ALM-based audio have become increasingly critical. This paper investigates the effectiveness of current countermeasures (CMs) against ALM-based audio. Specifically, we collect 12 types of the latest ALM-based deepfake audio and utilize the latest CMs to evaluate them. Our findings reveal that the latest codec-trained CM can effectively detect ALM-based audio, achieving 0% equal error rate under most ALM test conditions, which exceeded our expectations. This indicates promising directions for future research in ALM-based deepfake audio detection.
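The equal error rate (EER) used above can be computed from countermeasure scores as follows (a minimal sketch over synthetic scores; real evaluations interpolate the detection error trade-off curve more carefully):

```python
# EER: the operating point where the false-acceptance rate (spoof passed as
# bona fide) equals the false-rejection rate (bona fide flagged as spoof).
# Higher score means "more likely bona fide".
def eer(bona_scores, spoof_scores):
    thresholds = sorted(set(bona_scores + spoof_scores))
    best = (float("inf"), None)
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# perfectly separated toy scores, mirroring the 0% EER finding above
bona = [0.9, 0.8, 0.75, 0.6]
spoof = [0.4, 0.3, 0.55, 0.2]
print(eer(bona, spoof))
```

When the score distributions overlap, the returned value rises above zero; 0% EER means some threshold separates bona fide from spoofed audio perfectly.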

[AI-34] Harmonizing Attention: Training-free Texture-aware Geometry Transfer

链接: https://arxiv.org/abs/2408.10846
作者: Eito Ikuta,Yohan Lee,Akihiro Iohara,Yu Saito,Toshiyuki Tanaka
关键词-EN: Extracting geometry features, photographic images independently, Extracting geometry, complex challenge, independently of surface
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Extracting geometry features from photographic images independently of surface texture and transferring them onto different materials remains a complex challenge. In this study, we introduce Harmonizing Attention, a novel training-free approach that leverages diffusion models for texture-aware geometry transfer. Our method employs a simple yet effective modification of self-attention layers, allowing the model to query information from multiple reference images within these layers. This mechanism is seamlessly integrated into the inversion process as Texture-aligning Attention and into the generation process as Geometry-aligning Attention. This dual-attention approach ensures the effective capture and transfer of material-independent geometry features while maintaining material-specific textural continuity, all without the need for model fine-tuning.

[AI-35] Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models

链接: https://arxiv.org/abs/2408.10843
作者: Julius Pesonen,Teemu Hakala,Väinö Karjalainen,Niko Koivumäki,Lauri Markelin,Anna-Maria Raita-Hakola,Juha Suomalainen,Ilkka Pölönen,Eija Honkavaara
关键词-EN: prevent large-scale fires, large-scale fires resulting, Early detection, extensive environmental, societal damage
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Early detection of wildfires is essential to prevent large-scale fires resulting in extensive environmental, structural, and societal damage. Uncrewed aerial vehicles (UAVs) can cover large remote areas effectively with quick deployment and minimal infrastructure, and equipping them with small cameras and computers enables autonomous real-time detection. In remote areas, however, the UAVs are limited to on-board computing for detection due to the lack of high-bandwidth mobile networks. This limits the detection to methods which are light enough for the on-board computer alone. For accurate camera-based localisation, segmentation of the detected smoke is essential, but training data for deep learning-based wildfire smoke segmentation is limited. This study shows how small specialised segmentation models can be trained using only bounding box labels, leveraging zero-shot foundation model supervision. The method offers the advantages of needing only fairly easily obtainable bounding box labels and requiring training solely for the smaller student network. The proposed method achieved 63.3% mIoU on a manually annotated and diverse wildfire dataset. The model can perform in real time at ~11 fps on a UAV-carried NVIDIA Jetson Orin NX computer while reliably recognising smoke, as demonstrated at real-world forest burning events. Code is available at this https URL
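The reported mIoU can be computed as follows (a minimal sketch over toy binary masks, averaging foreground and background IoU; the paper's evaluation details may differ):

```python
# Intersection-over-union for binary masks, then the mean over the two
# classes (smoke foreground and background). Masks are flat toy lists.
def iou(pred, gt):
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def miou(pred, gt):
    # average the IoU of the foreground class and the background class
    inv = lambda m: [1 - v for v in m]
    return (iou(pred, gt) + iou(inv(pred), inv(gt))) / 2

pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 0, 1, 1]
print(round(miou(pred, gt), 3))
```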

[AI-36] DELIA: Diversity-Enhanced Learning for Instruction Adaptation in Large Language Models

链接: https://arxiv.org/abs/2408.10841
作者: Yuanhao Zeng,Fei Ren,Xinpeng Zhou,Yihang Wang,Yingxia Shao
关键词-EN: Large Language Models, Large Language, Language Models, specific task formats, behavior in Large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Although instruction tuning is widely used to adjust behavior in Large Language Models (LLMs), extensive empirical evidence and research indicates that it is primarily a process where the model fits to specific task formats, rather than acquiring new knowledge or capabilities. We propose that this limitation stems from biased features learned during instruction tuning, which differ from ideal task-specific features, causing the model to learn less of the underlying semantics in downstream tasks. However, ideal features are unknown and incalculable, constraining past work to rely on prior knowledge to assist reasoning or training, which limits LLMs' capabilities to the developers' abilities, rather than data-driven scalable learning. In our paper, through our novel data synthesis method, DELIA (Diversity-Enhanced Learning for Instruction Adaptation), we leverage the buffering effect of extensive diverse data in LLM training to transform biased features in instruction tuning into approximations of ideal features, without explicit prior ideal features. Experiments show DELIA's better performance compared to common instruction tuning and other baselines. It outperforms common instruction tuning by 17.07%-33.41% on Icelandic-English translation BLEURT score (WMT-21 dataset, gemma-7b-it) and improves accuracy by 36.1% on formatted text generation (Llama2-7b-chat). Notably, among the knowledge injection methods we know of, DELIA uniquely aligns the internal representations of new special tokens with their prior semantics.

[AI-37] ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data

链接: https://arxiv.org/abs/2408.10831
作者: Elia Bonetto,Aamir Ahmad
关键词-EN: deep learning tasks, pose estimation, address the lack, deep learning, Synthetic data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 5 tables, 7 figures

点击查看摘要

Abstract:Synthetic data is increasingly being used to address the lack of labeled images in uncommon domains for deep learning tasks. A prominent example is 2D pose estimation of animals, particularly wild species like zebras, for which collecting real-world data is complex and impractical. However, many approaches still require real images, consistency and style constraints, sophisticated animal models, and/or powerful pre-trained networks to bridge the syn-to-real gap. Moreover, they often assume that the animal can be reliably detected in images or videos, a hypothesis that often does not hold, e.g. in wildlife scenarios or aerial images. To solve this, we use synthetic data generated with a 3D photorealistic simulator to obtain the first synthetic dataset that can be used for both detection and 2D pose estimation of zebras without applying any of the aforementioned bridging strategies. Unlike previous works, we extensively train and benchmark our detection and 2D pose estimation models on multiple real-world and synthetic datasets using both pre-trained and non-pre-trained backbones. These experiments show how the models trained from scratch and only with synthetic data can consistently generalize to real-world images of zebras in both tasks. Moreover, we show it is possible to easily generalize those same models to 2D pose estimation of horses with a minimal amount of real-world images to account for the domain transfer. Code, results, trained models, and the synthetic, training, and validation data, including 104K manually labeled frames, are provided as open-source at this https URL

[AI-38] Exploiting Large Language Models Capabilities for Question Answer-Driven Knowledge Graph Completion Across Static and Temporal Domains

链接: https://arxiv.org/abs/2408.10819
作者: Rui Yang,Jiahao Zhu,Jianping Man,Li Fang,Yi Zhou
关键词-EN: identify missing triples, Knowledge graph completion, aims to identify, Knowledge graph, Generative Subgraph-based KGC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge graph completion (KGC) aims to identify missing triples in a knowledge graph (KG). This is typically achieved through tasks such as link prediction and instance completion. However, these methods often focus on either static knowledge graphs (SKGs) or temporal knowledge graphs (TKGs), addressing only within-scope triples. This paper introduces a new generative completion framework called Generative Subgraph-based KGC (GS-KGC). GS-KGC employs a question-answering format to directly generate target entities, addressing the challenge of questions having multiple possible answers. We propose a strategy that extracts subgraphs centered on entities and relationships within the KG, from which negative samples and neighborhood information are separately obtained to address the one-to-many problem. Our method generates negative samples using known facts to facilitate the discovery of new information. Furthermore, we collect and refine neighborhood path data of known entities, providing contextual information to enhance reasoning in large language models (LLMs). Our experiments evaluated the proposed method on four SKGs and two TKGs, achieving state-of-the-art Hits@1 metrics on five datasets. Analysis of the results shows that GS-KGC can discover new triples within existing KGs and generate new facts beyond the closed KG, effectively bridging the gap between closed-world and open-world KGC.

[AI-39] Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

链接: https://arxiv.org/abs/2408.10811
作者: Chengzhi Zhong,Fei Cheng,Qianying Liu,Junfeng Jiang,Zhen Wan,Chenhui Chu,Yugo Murawaki,Sadao Kurohashi
关键词-EN: exhibit higher probabilities, respective dominant language, respective dominant, strong performance, vocabulary space
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: work in progress

点击查看摘要

Abstract:In this study, we investigate whether non-English-centric LLMs, despite their strong performance, 'think' in their respective dominant language: more precisely, 'think' refers to how the representations of intermediate layers, when un-embedded into the vocabulary space, exhibit higher probabilities for certain dominant languages during generation. We term such languages internal latent languages. We examine the latent language of three typical categories of models for Japanese processing: Llama2, an English-centric model; Swallow, an English-centric model with continued pre-training in Japanese; and LLM-jp, a model pre-trained on balanced English and Japanese corpora. Our empirical findings reveal that, unlike Llama2, which relies exclusively on English as the internal latent language, the Japanese-specific Swallow and LLM-jp employ both Japanese and English, exhibiting dual internal latent languages. For any given target language, the model preferentially activates the latent language most closely related to it. In addition, we explore how intermediate layers respond to questions involving cultural conflicts between latent internal and target output languages. We further explore how the language identity shifts across layers while keeping consistent semantic meaning reflected in the intermediate layer representations. This study deepens the understanding of non-English-centric large language models, highlighting the intricate dynamics of language representation within their intermediate layers.
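The un-embedding probe described above resembles the well-known logit-lens technique, which can be sketched with made-up numbers (the vocabulary, language tags, hidden state, and unembedding matrix below are all invented for illustration):

```python
import math

# Project an intermediate hidden state through the unembedding matrix and
# check which vocabulary items, tagged here by language, dominate.
vocab = ["cat", "dog", "neko", "inu"]      # first two English, last two Japanese
lang  = ["en", "en", "ja", "ja"]
unembed = [                                 # one row per vocabulary item
    [1.0, 0.1], [0.8, 0.0], [0.1, 1.0], [0.0, 0.9],
]
hidden = [0.2, 1.5]                         # intermediate-layer representation

logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
exps = [math.exp(l) for l in logits]
probs = [e / sum(exps) for e in exps]
latent = max(zip(probs, lang))[1]           # language of the most probable token
print(latent)
```

Repeating this per layer and aggregating the dominant language over many generations is the kind of measurement the study's "internal latent language" claim rests on.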

[AI-40] DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

链接: https://arxiv.org/abs/2408.10807
作者: Yin-Jyun Luo,Kin Wai Cheuk,Woosung Choi,Toshimitsu Uesaka,Keisuke Toyama,Koichi Saito,Chieh-Hsin Lai,Yuhta Takida,Wei-Hsiang Liao,Simon Dixon,Yuki Mitsufuji
关键词-EN: single-instrument music audio, Existing work, pitch and timbre, music audio, excluding the cases
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and realistic four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.

[AI-41] Inverse Deep Learning Ray Tracing for Heliostat Surface Prediction

链接: https://arxiv.org/abs/2408.10802
作者: Jan Lewen,Max Pargmann,Mehdi Cherti,Jenia Jitsev,Robert Pitz-Paal,Daniel Maldonado Quinto
关键词-EN: Concentrating Solar Power, Concentrating Solar, Solar Power, flux density, CSP plant operations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Concentrating Solar Power (CSP) plants play a crucial role in the global transition towards sustainable energy. A key factor in ensuring the safe and efficient operation of CSP plants is the distribution of concentrated flux density on the receiver. However, the non-ideal flux density generated by individual heliostats can undermine the safety and efficiency of the power plant. The flux density from each heliostat is influenced by its precise surface profile, which includes factors such as canting and mirror errors. Accurately measuring these surface profiles for a large number of heliostats in operation is a formidable challenge. Consequently, control systems often rely on the assumption of ideal surface conditions, which compromises both safety and operational efficiency. In this study, we introduce inverse Deep Learning Ray Tracing (iDLR), an innovative method designed to predict heliostat surfaces based solely on target images obtained during heliostat calibration. Our simulation-based investigation demonstrates that sufficient information regarding the heliostat surface is retained in the flux density distribution of a single heliostat, enabling deep learning models to accurately predict the underlying surface with deflectometry-like precision for the majority of heliostats. Additionally, we assess the limitations of this method, particularly in relation to surface accuracy and resultant flux density predictions. Furthermore, we present a new comprehensive heliostat model using Non-Uniform Rational B-Splines (NURBS) that has the potential to become the new state of the art for heliostat surface parameterization. Our findings reveal that iDLR has significant potential to enhance CSP plant operations, potentially increasing the overall efficiency and energy output of the power plants.

[AI-42] Understanding the Skills Gap between Higher Education and Industry in the UK in Artificial Intelligence Sector

链接: https://arxiv.org/abs/2408.10788
作者: Khushi Jaiswal,Ievgeniia Kuzminykh,Sanjay Modgil
关键词-EN: Artificial Intelligence, United Kingdom offering, businesses work, United Kingdom, Intelligence
类目: Artificial Intelligence (cs.AI)
*备注: Accepted to the journal “Industry and Higher Education”

点击查看摘要

Abstract:As Artificial Intelligence (AI) changes how businesses work, there is a growing need for people who can work in this sector. This paper investigates how well universities in the United Kingdom that offer courses in AI prepare students for jobs in the real world. To gain insight into the differences between university curricula and industry demands, we review the contents of taught courses and job advertisement portals. Using custom data scraping tools to gather information from job advertisements and university curricula, together with frequency and Naive Bayes classifier analyses, this study shows exactly what skills industry is looking for. In this study we identified 12 skill categories that were used for mapping. The study showed that the university curriculum in the AI domain is well balanced in most technical skills, including Programming and Machine Learning subjects, but has gaps in the Data Science and Maths and Statistics skill categories.
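The Naive Bayes analysis can be sketched with a minimal multinomial classifier over word counts (the skill category names echo the abstract, but the training snippets and the test advert are invented for illustration):

```python
from collections import Counter
import math

# Tiny multinomial Naive Bayes with add-one smoothing, mapping job-advert
# text to skill categories based on word frequencies.
train = [
    ("python pytorch model training", "Machine learning"),
    ("pandas sql data pipeline", "Data Science"),
    ("probability linear algebra statistics", "Maths and Statistics"),
]
cats = sorted({c for _, c in train})
counts = {c: Counter() for c in cats}
for text, c in train:
    counts[c].update(text.split())
vocab = {w for cnt in counts.values() for w in cnt}

def classify(text):
    # uniform priors assumed; score = sum of smoothed log-likelihoods
    def logp(c):
        total = sum(counts[c].values())
        return sum(math.log((counts[c][w] + 1) / (total + len(vocab)))
                   for w in text.split() if w in vocab)
    return max(cats, key=logp)

print(classify("experience with pytorch and model deployment"))
```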

[AI-43] Just a Hint: Point-Supervised Camouflaged Object Detection ECCV2024

链接: https://arxiv.org/abs/2408.10777
作者: Huafeng Chen,Dian Shao,Guangqian Guo,Shan Gao
关键词-EN: Camouflaged Object Detection, accurately distinguish objects, Object Detection, expeditiously and accurately, accurately distinguish
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Camouflaged Object Detection (COD) demands models to expeditiously and accurately distinguish objects which conceal themselves seamlessly in the environment. Owing to the subtle differences and ambiguous boundaries, COD is not only a remarkably challenging task for models but also for human annotators, requiring huge efforts to provide pixel-wise annotations. To alleviate the heavy annotation burden, we propose to fulfill this task with the help of only one point supervision. Specifically, by swiftly clicking on each object, we first adaptively expand the original point-based annotation to a reasonable hint area. Then, to avoid partial localization around discriminative parts, we propose an attention regulator to scatter model attention to the whole object through partially masking labeled regions. Moreover, to solve the unstable feature representation of camouflaged objects under only point-based annotation, we perform unsupervised contrastive learning based on differently augmented image pairs (e.g. changing color or doing translation). On three mainstream COD benchmarks, experimental results show that our model outperforms several weakly-supervised methods by a large margin across various metrics.

[AI-44] Flexora: Flexible Low Rank Adaptation for Large Language Models

链接: https://arxiv.org/abs/2408.10774
作者: Chenxing Wei,Yao Shu,Ying Tiffany He,Fei Richard Yu
关键词-EN: significantly enhanced generalization, enhanced generalization ability, Large Language Models, Large Language, driving advancements
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 29 pages, 13 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand the boundaries on these tasks, whereas LoRA would underperform on certain tasks owing to its potential overfitting on these tasks. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers needing to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora first frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of our Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of our Flexora.

[AI-45] SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection ECCV2024

链接: https://arxiv.org/abs/2408.10760
作者: Huafeng Chen,Pengxu Wei,Guangqian Guo,Shan Gao
关键词-EN: Camouflaged Object Detection, Object Detection, methods heavily rely, camouflaged object labels, Camouflaged Object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Most Camouflaged Object Detection (COD) methods heavily rely on mask annotations, which are time-consuming and labor-intensive to acquire. Existing weakly-supervised COD approaches exhibit significantly inferior performance compared to fully-supervised methods and struggle to simultaneously support all the existing types of camouflaged object labels, including scribbles, bounding boxes, and points. Even for the Segment Anything Model (SAM), handling weakly-supervised COD remains problematic: it typically encounters challenges with prompt compatibility of scribble labels, extreme responses, semantically erroneous responses, and unstable feature representations, producing unsatisfactory results in camouflaged scenes. To mitigate these issues, we propose a unified COD framework in this paper, termed SAM-COD, which is capable of supporting arbitrary weakly-supervised labels. Our SAM-COD employs a prompt adapter to handle scribbles as prompts based on SAM. Meanwhile, we introduce response filter and semantic matcher modules to improve the quality of the masks obtained by SAM under COD prompts. To alleviate the negative impacts of inaccurate mask predictions, a new strategy of prompt-adaptive knowledge distillation is utilized to ensure a reliable feature representation. To validate the effectiveness of our approach, we have conducted extensive empirical experiments on three mainstream COD benchmarks. The results demonstrate the superiority of our method against state-of-the-art weakly-supervised and even fully-supervised methods.

[AI-46] Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation

Link: https://arxiv.org/abs/2408.10755
Authors: Md Fahim Sikder, Resmi Ramachandranpillai, Daniel de Leng, Fredrik Heintz
Keywords-EN: crucial topic due, recent wide usage, latent space, Data, fair
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Data fairness is a crucial topic due to the recent wide usage of AI-powered applications. Most real-world data is filled with human or machine biases, and when such data is used to train AI models, the models risk reflecting the bias in the training data. Existing bias-mitigating generative methods based on GANs and diffusion models need in-processing fairness objectives and fail to consider computational overhead when choosing computationally heavy architectures, which may lead to high computational demands, instability, and poor optimization performance. To mitigate this issue, in this work we present a fair data generation technique based on knowledge distillation, where we use a small architecture to distill the fair representation in the latent space. The idea of fair latent space distillation enables more flexible and stable training of Fair Generative Models (FGMs). We first learn a syntax-agnostic (for any data type) fair representation of the data, followed by distillation in the latent space into a smaller model. After distillation, we use the distilled fair latent space to generate high-fidelity fair synthetic data. While distilling, we employ a quality loss (for fair distillation) and a utility loss (for data utility) to ensure that the fairness and data-utility characteristics remain in the distilled latent space. Our approach shows a 5%, 5%, and 10% rise in performance in fairness, synthetic sample quality, and data utility, respectively, over the state-of-the-art fair generative model.

[AI-47] Security Assessment of Hierarchical Federated Deep Learning

Link: https://arxiv.org/abs/2408.10752
Authors: D Alqattan, R Sun, H Liang, G Nicosia, V Snasel, R Ranjan, V Ojha
Keywords-EN: distributed deep learning, deep learning model, promising distributed deep, Hierarchical federated learning, crucial security concerns
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:Hierarchical federated learning (HFL) is a promising distributed deep learning model training paradigm, but it has crucial security concerns arising from adversarial attacks. This research investigates and assesses the security of HFL using a novel methodology, focusing on its resilience against adversarial attacks at both inference time and training time. Through a series of extensive experiments across diverse datasets and attack scenarios, we uncover that HFL demonstrates robustness against untargeted training-time attacks due to its hierarchical structure. However, targeted attacks, particularly backdoor attacks, exploit this architecture, especially when malicious clients are positioned in the overlapping coverage areas of edge servers. Consequently, HFL shows a dual nature in its resilience: its hierarchical aggregation lets it recover from attacks and strengthens its suitability for adversarial training, thereby reinforcing its resistance against inference-time attacks. These insights underscore the necessity for balanced security strategies in HFL systems, leveraging their inherent strengths while effectively mitigating vulnerabilities.

[AI-48] Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning

Link: https://arxiv.org/abs/2408.10746
Authors: Bei Ouyang, Shengyuan Ye, Liekang Zeng, Tianyi Qian, Jingyi Li, Xu Chen
Keywords-EN: Large language models, Large language, personal LLMs fine-tuning, intelligent personal assistants, personal LLMs
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: Accepted by The 53rd International Conference on Parallel Processing (ICPP’24)

Click to view abstract

Abstract:Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift towards edge-based fine-tuning of personal LLMs, away from cloud reliance. However, this raises issues of computational intensity and resource scarcity, hindering training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. To tackle these challenges, we propose Pluto and Charon (PAC), a time and memory efficient collaborative edge AI framework for personal LLMs fine-tuning. PAC breaks the resource wall of personal LLMs fine-tuning with a sophisticated algorithm-system co-design. (1) Algorithmically, PAC implements a personal LLMs fine-tuning technique that is efficient in terms of parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Additionally, an activation cache mechanism further streamlines the process by negating the necessity for repeated forward passes across multiple epochs. (2) Systematically, PAC leverages edge devices in close proximity, pooling them as a collective resource for in-situ personal LLMs fine-tuning, utilizing hybrid data and pipeline parallelism to orchestrate distributed training. The use of the activation cache eliminates the need for forward passes through the LLM backbone, enabling exclusive fine-tuning of the Parallel Adapters using data parallelism. Extensive evaluation based on a prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches, achieving up to 8.64x end-to-end speedup and up to 88.16% reduction in memory footprint.
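The activation-cache idea described in the abstract — run the expensive frozen backbone once per sample and reuse its output in every later epoch, re-applying only the small trainable adapter — can be sketched in a few lines. This is an illustrative toy, not PAC's implementation: the function names and dummy computations are ours.

```python
# Toy sketch of an activation cache for adapter fine-tuning (assumed, simplified).
backbone_calls = 0

def frozen_backbone(x):
    """Stand-in for the expensive frozen LLM forward pass."""
    global backbone_calls
    backbone_calls += 1
    return [v * 2.0 for v in x]  # dummy computation

def parallel_adapter(h, scale):
    """Stand-in for the small trainable adapter applied on top."""
    return [v * scale for v in h]

cache = {}

def forward(sample_id, x, scale):
    # Epoch 1 fills the cache; later epochs skip the backbone entirely.
    if sample_id not in cache:
        cache[sample_id] = frozen_backbone(x)
    return parallel_adapter(cache[sample_id], scale)

dataset = {0: [1.0, 2.0], 1: [3.0, 4.0]}
for epoch in range(3):              # 3 epochs over 2 samples
    for sid, x in dataset.items():
        forward(sid, x, scale=0.5 + epoch)

print(backbone_calls)  # -> 2: once per sample, not once per sample per epoch
```

The same bookkeeping is what lets the abstract's adapters be trained with pure data parallelism: once cached, no device ever needs the backbone again.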

[AI-49] Towards Efficient Large Language Models for Scientific Text: A Review

Link: https://arxiv.org/abs/2408.10729
Authors: Huy Quoc To, Ming Liu, Guangyan Huang
Keywords-EN: Large language models, processing complex information, Large language, era for processing, processing complex
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) have ushered in a new era for processing complex information in various fields, including science. The increasing amount of scientific literature allows these models to acquire and understand scientific knowledge effectively, thus improving their performance in a wide range of tasks. The power of LLMs, however, comes at the cost of extremely expensive computational resources, intense amounts of data, and long training times. Therefore, in recent years, researchers have proposed various methodologies to make scientific LLMs more affordable. The most well-known approaches fall into two directions: reducing the size of the models, or enhancing the quality of the data. To date, a comprehensive review of these two families of methods has not yet been undertaken. In this paper, we (I) summarize the current advances in the emerging abilities of LLMs into more accessible AI solutions for science, and (II) investigate the challenges and opportunities of developing affordable solutions for scientific domains using LLMs.

[AI-50] MEGen: Generative Backdoor in Large Language Models via Model Editing

Link: https://arxiv.org/abs/2408.10722
Authors: Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Keywords-EN: demonstrated remarkable capabilities, Large language models, Large language, remarkable capabilities, demonstrated remarkable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: Working in progress

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. Emerging as widely adopted generalists for diverse tasks, LLMs are still vulnerable to backdoors. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects. In our approach, we first leverage a language model to insert a trigger selected on fixed metrics into the input, then design a pipeline of model editing to directly embed a backdoor into an LLM. By adjusting a small set of local parameters with a mini-batch of samples, MEGen significantly enhances time efficiency and achieves high robustness. Experimental results indicate that our backdoor attack strategy achieves a high attack success rate on poison data while maintaining the model’s performance on clean data. Notably, the backdoored model, when triggered, can freely output pre-set dangerous information while successfully completing downstream tasks. This suggests that future LLM applications could be guided to deliver certain dangerous information, thus altering the LLM’s generative style. We believe this approach provides insights for future LLM applications and the execution of backdoor attacks on conversational AI systems.

[AI-51] Towards Foundation Models for the Industrial Forecasting of Chemical Kinetics

Link: https://arxiv.org/abs/2408.10720
Authors: Imran Nasim, João Lucas de Sousa Almeida
Keywords-EN: Scientific Machine Learning, Scientific Machine, Machine Learning, Learning is transforming, modeling chemical reactions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted into the IEEE CAI 2024 Workshop on Scientific Machine Learning and Its Industrial Applications (SMLIA2024)

Click to view abstract

Abstract:Scientific Machine Learning is transforming traditional engineering industries by enhancing the efficiency of existing technologies and accelerating innovation, particularly in modeling chemical reactions. Despite recent advancements, solving stiff chemically reacting problems within computational fluid dynamics remains a significant challenge. In this study, we propose a novel approach utilizing a multi-layer-perceptron mixer architecture (MLP-Mixer) to model the time series of stiff chemical kinetics. We evaluate this method using the ROBER system, a benchmark model in chemical kinetics, to compare its performance with traditional numerical techniques. This study provides insight into the industrial utility of the recently developed MLP-Mixer architecture for modeling chemical kinetics and provides motivation for such neural architectures to be used as a base for time-series foundation models.
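For reference, the ROBER benchmark mentioned above is a classic stiff system of three chemical-kinetics ODEs whose rate constants span about nine orders of magnitude — that spread is what makes the problem stiff and motivates surrogates like the one in the paper. A minimal statement of its standard right-hand side, plus a sanity check that it conserves total mass (the step size and Euler step here are purely illustrative; real solvers for ROBER are implicit):

```python
# ROBER right-hand side with the standard rate constants.
def rober_rhs(y):
    y1, y2, y3 = y
    k1, k2, k3 = 0.04, 3.0e7, 1.0e4
    dy1 = -k1 * y1 + k3 * y2 * y3
    dy2 = k1 * y1 - k3 * y2 * y3 - k2 * y2 ** 2
    dy3 = k2 * y2 ** 2
    return [dy1, dy2, dy3]

# One tiny explicit Euler step from the usual initial condition y = (1, 0, 0).
y = [1.0, 0.0, 0.0]
h = 1e-4
dy = rober_rhs(y)
y = [yi + h * di for yi, di in zip(y, dy)]

# The three derivatives sum to zero, so total mass y1 + y2 + y3 stays 1.
print(abs(sum(y) - 1.0) < 1e-9)  # True
```

The analytic cancellation (dy1 + dy2 + dy3 = 0 term by term) is a useful invariant to check any learned surrogate against.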

[AI-52] Fine-Tuning a Local LLaMA-3 Large Language Model for Automated Privacy-Preserving Physician Letter Generation in Radiation Oncology

Link: https://arxiv.org/abs/2408.10715
Authors: Yihao Hou, Christoph Bert, Ahmed Gomaa, Godehard Lahmer, Daniel Hoefler, Thomas Weissmann, Raphaela Voigt, Philipp Schubert, Charlotte Schmitter, Alina Depardon, Sabine Semrau, Andreas Maier, Rainer Fietkau, Yixing Huang, Florian Putz
Keywords-EN: Generating physician letters, daily clinical practice, physician letters, physician letter generation, time-consuming task
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Generating physician letters is a time-consuming task in daily clinical practice. This study investigates local fine-tuning of large language models (LLMs), specifically LLaMA models, for physician letter generation in a privacy-preserving manner within the field of radiation oncology. Our findings demonstrate that base LLaMA models, without fine-tuning, are inadequate for effectively generating physician letters. The QLoRA algorithm provides an efficient method for local intra-institutional fine-tuning of LLMs with limited computational resources (i.e., a single 48 GB GPU workstation within the hospital). The fine-tuned LLM successfully learns radiation oncology-specific information and generates physician letters in an institution-specific style. ROUGE scores of the generated summary reports highlight the superiority of the 8B LLaMA-3 model over the 13B LLaMA-2 model. Further multidimensional physician evaluations of 10 cases reveal that, although the fine-tuned LLaMA-3 model has limited capacity to generate content beyond the provided input data, it successfully generates salutations, diagnoses and treatment histories, recommendations for further treatment, and planned schedules. Overall, clinical benefit was rated highly by the clinical experts (average score of 3.44 on a 4-point scale). With careful physician review and correction, automated LLM-based physician letter generation has significant practical value.

[AI-53] Offline Model-Based Reinforcement Learning with Anti-Exploration

Link: https://arxiv.org/abs/2408.10713
Authors: Padmanaba Srinivasan, William Knottenbelt
Keywords-EN: enable faster learning, Model-based reinforcement learning, offline reinforcement learning, reinforcement learning, generate synthetic trajectories
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Model-based reinforcement learning (MBRL) algorithms learn a dynamics model from collected data and apply it to generate synthetic trajectories to enable faster learning. This is an especially promising paradigm in offline reinforcement learning (RL) where data may be limited in quantity, in addition to being deficient in coverage and quality. Practical approaches to offline MBRL usually rely on ensembles of dynamics models to prevent exploitation of any individual model and to extract uncertainty estimates that penalize values in states far from the dataset support. Uncertainty estimates from ensembles can vary greatly in scale, making it challenging to generalize hyperparameters well across even similar tasks. In this paper, we present Morse Model-based offline RL (MoMo), which extends the anti-exploration paradigm found in offline model-free RL to the model-based space. We develop model-free and model-based variants of MoMo and show how the model-free version can be extended to detect and deal with out-of-distribution (OOD) states using explicit uncertainty estimation without the need for large ensembles. MoMo performs offline MBRL using an anti-exploration bonus to counteract value overestimation in combination with a policy constraint, as well as a truncation function to terminate synthetic rollouts that are excessively OOD. Experimentally, we find that both model-free and model-based MoMo perform well, and the latter outperforms prior model-based and model-free baselines on the majority of D4RL datasets tested.
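Two of MoMo's ingredients described above — an anti-exploration penalty subtracted from the reward, and a truncation function that ends synthetic rollouts once states drift too far out-of-distribution — can be sketched together. This is an illustrative toy, not the paper's implementation: the 1-D dynamics, the nearest-neighbor OOD proxy, and all thresholds are our assumptions.

```python
# Toy sketch of penalized, truncated synthetic rollouts in offline MBRL.
def ood_score(state, dataset_states):
    """Toy uncertainty proxy: distance to the nearest dataset state."""
    return min(abs(state - s) for s in dataset_states)

def synthetic_rollout(start, dynamics, dataset_states,
                      horizon=10, penalty_weight=1.0, ood_limit=2.0):
    state, transitions = start, []
    for _ in range(horizon):
        next_state, reward = dynamics(state)
        score = ood_score(next_state, dataset_states)
        if score > ood_limit:        # truncation: stop when excessively OOD
            break
        # anti-exploration bonus: penalize rewards far from the data support
        transitions.append((state, reward - penalty_weight * score, next_state))
        state = next_state
    return transitions

data = [0.0, 1.0, 2.0, 3.0]
dyn = lambda s: (s + 1.0, 1.0)           # dynamics that drift away from the data
rollout = synthetic_rollout(0.0, dyn, data)
print(len(rollout))                      # -> 5: stops early instead of running 10 steps
```

In the real method the OOD score comes from the learned Morse model rather than a nearest-neighbor distance, but the control flow is the same.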

[AI-54] Investigating Context Effects in Similarity Judgements in Large Language Models KDD2024

Link: https://arxiv.org/abs/2408.10711
Authors: Sagar Uprety, Amit Kumar Jaiswal, Haiming Liu, Dawei Song
Keywords-EN: Large Language Models, natural language text, generating natural language, Large Language, Language Models
Subjects: Artificial Intelligence (cs.AI)
*Comments: Accepted at The First Workshop on AI Behavioral Science (AIBS 2024), held in conjunction with KDD 2024

Click to view abstract

Abstract:Large Language Models (LLMs) have revolutionised the capability of AI models in comprehending and generating natural language text. They are increasingly being used to empower and deploy agents in real-world scenarios, which make decisions and take actions based on their understanding of the context. Therefore, researchers, policy makers, and enterprises alike are working towards ensuring that the decisions made by these agents align with human values and user expectations. That being said, human values and decisions are not always straightforward to measure and are subject to different cognitive biases. There is a vast body of literature in behavioural science that studies biases in human judgements. In this work we report an ongoing investigation on the alignment of LLMs with human judgements affected by order bias. Specifically, we focus on a famous human study which showed evidence of order effects in similarity judgements, and replicate it with various popular LLMs. We report the different settings where LLMs exhibit human-like order effect bias and discuss the implications of these findings to inform the design and development of LLM based applications.

[AI-55] Coarse-to-Fine Detection of Multiple Seams for Robotic Welding

Link: https://arxiv.org/abs/2408.10710
Authors: Pengkun Wei, Shuo Cheng, Dayou Li, Ran Song, Yipeng Zhang, Wei Zhang
Keywords-EN: Efficiently detecting target, ensuring sub-millimeter accuracy, detecting target weld, Efficiently detecting, target weld seams
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Efficiently detecting target weld seams while ensuring sub-millimeter accuracy has always been an important challenge in autonomous welding, which has significant application in industrial practice. Previous works mostly focused on recognizing and localizing welding seams one by one, leading to inferior efficiency in modeling the workpiece. This paper proposes a novel framework capable of multiple weld seams extraction using both RGB images and 3D point clouds. The RGB image is used to obtain the region of interest by approximately localizing the weld seams, and the point cloud is used to achieve the fine-edge extraction of the weld seams within the region of interest using region growth. Our method is further accelerated by using a pre-trained deep learning model to ensure both efficiency and generalization ability. The performance of the proposed method has been comprehensively tested on various workpieces featuring both linear and curved weld seams and in physical experiment systems. The results showcase considerable potential for real-world industrial applications, emphasizing the method’s efficiency and effectiveness. Videos of the real-world experiments can be found at this https URL.

[AI-56] Variable Assignment Invariant Neural Networks for Learning Logic Programs

Link: https://arxiv.org/abs/2408.10709
Authors: Yin Jun Phua, Katsumi Inoue
Keywords-EN: observed state transitions, observed state, interpretation transition, state transitions, learning rules
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Learning from interpretation transition (LFIT) is a framework for learning rules from observed state transitions. LFIT has been implemented in purely symbolic algorithms, but they are unable to deal with noise or generalize to unobserved transitions. Rule-extraction-based neural network methods suffer from overfitting, while more general implementations that categorize rules suffer from combinatorial explosion. In this paper, we introduce a technique to leverage the variable permutation invariance inherent in symbolic domains. Our technique ensures that the permutation and the naming of the variables do not affect the results. We demonstrate the effectiveness and the scalability of this method with various experiments. Our code is publicly available at this https URL
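The core invariance claim above — that permuting or renaming variables must not change the result — can be made concrete with a toy encoding. This sketch is our own illustration (not the paper's network): it simply maps a state to an order-independent aggregate of its values, the simplest way to guarantee the property.

```python
# Toy permutation/renaming-invariant state encoding.
def invariant_encoding(state):
    """Map a dict of variable -> value to a representation that ignores
    variable names and ordering (here: the sorted multiset of values)."""
    return tuple(sorted(state.values()))

s1 = {"a": 1, "b": 0, "c": 1}
s2 = {"x": 0, "y": 1, "z": 1}   # same state up to renaming/permutation
print(invariant_encoding(s1) == invariant_encoding(s2))  # -> True
```

A learned version replaces `sorted` with a symmetric pooling operation (sum/mean/attention) over per-variable embeddings, which preserves the same guarantee.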

[AI-57] AnyGraph: Graph Foundation Model in the Wild

Link: https://arxiv.org/abs/2408.10700
Authors: Lianghao Xia, Chao Huang
Keywords-EN: exceptional generalization capabilities, relational data structured, graph, generalization capabilities, growing ubiquity
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The growing ubiquity of relational data structured as graphs has underscored the need for graph learning models with exceptional generalization capabilities. However, current approaches often struggle to effectively extract generalizable insights, frequently requiring extensive fine-tuning and limiting their versatility. Graph foundation models offer a transformative solution, with the potential to learn robust, generalizable representations from graph data. This enables more effective and adaptable applications across a wide spectrum of tasks and domains. In this work, we investigate a unified graph model, AnyGraph, designed to handle key challenges: i) Structure Heterogeneity. Addressing distribution shift in graph structural information; ii) Feature Heterogeneity. Handling diverse feature representation spaces across graph datasets; iii) Fast Adaptation. Efficiently adapting the model to new graph domains; iv) Scaling Law Emergence. Enabling the model to exhibit scaling law behavior, where its performance scales favorably with the amount of data and parameter sizes. To tackle these critical challenges, we build AnyGraph upon a Graph Mixture-of-Experts (MoE) architecture. This approach empowers the model to effectively manage both the in-domain and cross-domain distribution shift concerning structure-level and feature-level heterogeneity. Furthermore, a lightweight graph expert routing mechanism is proposed to facilitate AnyGraph’s fast adaptability to new data and domains. Our extensive experiments on 38 diverse graph datasets have demonstrated the strong zero-shot learning performance of AnyGraph across diverse graph domains with significant distribution shift. Furthermore, we have validated the model’s fast adaptation ability and scaling law emergence, showcasing its versatility.
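The "lightweight expert routing" mentioned above can be illustrated with a minimal sketch. This is a generic MoE routing step in the spirit of the abstract, not AnyGraph's actual mechanism: the expert names, prototype vectors, and scoring rule are hypothetical.

```python
# Toy hard-routing step: send a graph embedding to its best-matching expert.
def route(embedding, experts):
    """Pick the expert whose (toy) prototype is closest to the input."""
    scores = {name: -sum((e - p) ** 2 for e, p in zip(embedding, proto))
              for name, proto in experts.items()}
    return max(scores, key=scores.get)

experts = {
    "social":   [1.0, 0.0],   # one hypothetical prototype per expert
    "citation": [0.0, 1.0],
}
print(route([0.9, 0.1], experts))   # -> social
```

Because only the selected expert runs, routing of this kind keeps per-input compute roughly constant even as the number of experts (and thus domains covered) grows — the property that enables fast adaptation to new graph domains.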

[AI-58] Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Link: https://arxiv.org/abs/2408.10691
Authors: Yanjie Dong, Xiaoyi Fan, Fangxin Wang, Chengming Li, Victor C. M. Leung, Xiping Hu
Keywords-EN: large language models, large language, transitioned from specialized, versatile foundation models, versatile foundation
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Since the invention of GPT2-1.5B in 2019, large language models (LLMs) have transitioned from specialized models to versatile foundation models. LLMs exhibit impressive zero-shot ability; however, they require fine-tuning on local datasets and significant resources for deployment. Traditional fine-tuning techniques with first-order optimizers require substantial GPU memory that exceeds mainstream hardware capability, so memory-efficient methods merit investigation. Model compression techniques can reduce energy consumption, operational costs, and environmental impact, thereby supporting sustainable artificial intelligence advancements. Additionally, large-scale foundation models have expanded to create images, audio, videos, and multi-modal content, further emphasizing the need for efficient deployment. Therefore, we are motivated to present a comprehensive overview of the prevalent memory-efficient fine-tuning methods over the network edge. We also review the state-of-the-art literature on model compression to provide a vision for deploying LLMs over the network edge.

[AI-59] Genesis: Towards the Automation of Systems Biology Research

Link: https://arxiv.org/abs/2408.10689
Authors: Ievgeniia A. Tiukova, Daniel Brunnsåker, Erik Y. Bjurström, Alexander H. Gower, Filip Kronström, Gabriel K. Reder, Ronald S. Reiserer, Konstantin Korovin, Larisa B. Soldatova, John P. Wikswo, Ross D. King
Keywords-EN: robot scientists, robot scientist Genesis, scientific research, cutting edge, edge of applying
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The cutting edge of applying AI to science is the closed-loop automation of scientific research: robot scientists. We have previously developed two robot scientists: ‘Adam’ (for yeast functional biology) and ‘Eve’ (for early-stage drug design). We are now developing a next-generation robot scientist, Genesis. With Genesis we aim to demonstrate that an area of science can be investigated using robot scientists unambiguously faster, and at lower cost, than with human scientists. Here we report progress on the Genesis project. Genesis is designed to automatically improve systems biology models with thousands of interacting causal components. When complete, Genesis will be able to initiate and execute in parallel one thousand hypothesis-led closed-loop cycles of experiment per day. Here we describe the core Genesis hardware: the one thousand computer-controlled µ-bioreactors. For the integrated mass spectrometry platform we have developed AutonoMS, a system to automatically run, process, and analyse high-throughput experiments. We have also developed Genesis-DB, a database system designed to enable software agents access to large quantities of structured domain information. We have developed RIMBO (Revisions for Improvements of Models in Biology Ontology) to describe the planned hundreds of thousands of changes to the models. We have demonstrated the utility of this infrastructure by developing two relational learning bioinformatics projects. Finally, we describe LGEM+, a relational learning system for the automated abductive improvement of genome-scale metabolic models.

[AI-60] Rejection in Abstract Argumentation: Harder Than Acceptance? ECAI24

Link: https://arxiv.org/abs/2408.10683
Authors: Johannes K. Fichte, Markus Hecher, Yasir Mahmood, Arne Meier
Keywords-EN: Abstract argumentation, Abstract, toolkit for modeling, popular toolkit, argumentation frameworks
Subjects: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*Comments: accepted version as ECAI24

Click to view abstract

Abstract:Abstract argumentation is a popular toolkit for modeling, evaluating, and comparing arguments. Relationships between arguments are specified in argumentation frameworks (AFs), and conditions are placed on sets (extensions) of arguments that allow AFs to be evaluated. For more expressiveness, AFs are augmented with acceptance conditions on directly interacting arguments or a constraint on the admissible sets of arguments, resulting in dialectic frameworks or constrained argumentation frameworks. In this paper, we consider flexible conditions for rejecting an argument from an extension, which we call rejection conditions (RCs). On the technical level, we associate each argument with a specific logic program. We analyze the resulting complexity, including the structural parameter treewidth. Rejection AFs are highly expressive, giving rise to natural problems on higher levels of the polynomial hierarchy.

[AI-61] Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Link: https://arxiv.org/abs/2408.10682
Authors: Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Keywords-EN: unlearned knowledge, unlearned, training corpora, achieved success, troubled by problematic
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 13 pages

Click to view abstract

Abstract:LLMs have achieved success in many fields but are still troubled by problematic content in their training corpora. LLM unlearning aims at reducing its influence and avoiding undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries, and the unlearned knowledge resurfaces after manually designed attack queries. As part of a red-team effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in 55.2% of the questions, even without revealing the unlearned model’s parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where previously trained perturbation vectors are used to enhance the unlearned model’s robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over 53.5%, cause less than an 11.6% reduction in neighboring knowledge, and have almost no impact on the model’s general capabilities.

[AI-62] Tensor tree learns hidden relational structures in data to construct generative models

Link: https://arxiv.org/abs/2408.10669
Authors: Kenji Harada, Tsuyoshi Okubo, Naoki Kawashima
Keywords-EN: Born machine framework, quantum wave function, wave function amplitude, function amplitude represented, target distribution function
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*Comments: 9 pages, 3 figures

Click to view abstract

Abstract:Based on the tensor tree network with the Born machine framework, we propose a general method for constructing a generative model by expressing the target distribution function as the quantum wave function amplitude represented by a tensor tree. The key idea is dynamically optimizing the tree structure so as to minimize the bond mutual information. The proposed method offers enhanced performance and uncovers hidden relational structures in the target data. We illustrate potential practical applications with four examples: (i) random patterns, (ii) QMNIST hand-written digits, (iii) Bayesian networks, and (iv) the stock price fluctuation pattern in the S&P 500. In (i) and (ii), strongly correlated variables were concentrated near the center of the network; in (iii), the causality pattern was identified; and, in (iv), a structure corresponding to the eleven sectors emerged.
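The quantity driving the structure optimization above is the mutual information between variables across a tensor-tree bond. As a self-contained illustration (a toy discrete estimator from joint samples, not the paper's tensor-network computation), here is mutual information in bits:

```python
from math import log2
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Y) in bits from a list of joint samples (x, y)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

correlated  = [(0, 0), (0, 0), (1, 1), (1, 1)]   # y copies x
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]   # no dependence
print(mutual_information(correlated), mutual_information(independent))  # -> 1.0 0.0
```

Rearranging the tree so that high-mutual-information variable pairs sit close together (and bonds carry little mutual information) is what lets the learned structure expose the hidden relations in the data.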

[AI-63] Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Link: https://arxiv.org/abs/2408.10668
Authors: Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao
Keywords-EN: Large Language Models, Large Language, Language Models, implicit troublemakers, Large
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also potentially serve as a resource for malicious activities. Implementing safety alignment could mitigate the risk of LLMs generating harmful responses. We argue that even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that could act as ticking time bombs. To identify these underlying weaknesses, we propose to use a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model can successfully influence the original safe LLM to output toxic content in the decoding process. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals, without any harmful suffixes. These potential weaknesses can then be exploited via prompt optimization, such as soft prompts on images. We name this decoding strategy Jailbreak Value Decoding (JVD), emphasizing that seemingly secure LLMs may not be as safe as we initially believe. They could be used to gather harmful data or launch covert attacks.
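The mechanism described — a separate cost value model steering an otherwise safe model's decoding — can be sketched at its simplest as candidate re-ranking. This is a conceptual toy, not JVD itself: the two-token vocabulary, the scores, and the linear combination are our assumptions.

```python
# Toy value-model-guided decoding step: re-rank candidate continuations by
# LM log-probability plus a weighted score from an external value model.
def guided_step(candidates, lm_logprob, cost_value, weight=1.0):
    """Pick the candidate maximizing LM log-prob + weight * value score."""
    return max(candidates,
               key=lambda tok: lm_logprob[tok] + weight * cost_value[tok])

lm_logprob = {"refuse": -0.1, "comply": -2.0}   # the safe model prefers refusal
cost_value = {"refuse": -5.0, "comply": +5.0}   # the attacker's value model
print(guided_step(["refuse", "comply"], lm_logprob, cost_value))  # -> comply
```

With `weight=0.0` the safe model's own preference (`refuse`) wins, which is exactly why the abstract calls these vulnerabilities hidden: they only surface when an external scorer is allowed to bias decoding.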

[AI-64] ETGuard: Malicious Encrypted Traffic Detection in Blockchain-based Power Grid Systems

Link: https://arxiv.org/abs/2408.10657
Authors: Peng Zhou, Yongdong Liu, Lixun Ma, Weiye Zhang, Haohan Tan, Zhenguang Liu, Butian Huang
Keywords-EN: Power grid systems, Power grid, blockchain-based power grid, escalating prevalence, prevalence of encryption
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The escalating prevalence of encryption protocols has led to a concomitant surge in the number of malicious attacks that hide in encrypted traffic. Power grid systems, as fundamental infrastructure, are becoming prime targets for such attacks. Conventional methods for detecting malicious encrypted packets typically use a static pre-trained model. We observe that these methods are not well-suited for blockchain-based power grid systems. More critically, they fall short in dynamic environments where new types of encrypted attacks continuously emerge. Motivated by this, in this paper we tackle these challenges from two aspects: (1) We present a novel framework that is able to automatically detect malicious encrypted traffic in blockchain-based power grid systems and incrementally learn from new malicious traffic. (2) We mathematically derive incremental learning losses to resist the forgetting of old attack patterns while ensuring the model is capable of handling new encrypted attack patterns. Empirically, our method achieves state-of-the-art performance on three different benchmark datasets. We also constructed the first malicious encrypted traffic dataset for blockchain-based power grid scenarios. Our code and dataset are available at this https URL, hoping to inspire future research.

[AI-65] Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Link: https://arxiv.org/abs/2408.10652
Authors: Guofeng Mei,Luigi Riz,Yiming Wang,Fabio Poiesi
Keywords (EN): offering a greater, greater flexibility, flexibility than closed-vocabulary, instance, open vocabulary
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene.". We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available.
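The spectral-clustering merging step can be sketched with a minimal Fiedler-vector bipartition. The affinity matrix below is assumed to already mix mask coherence and semantic coherence into one score; the toy values and two-way split are illustrative, not the paper's full multi-way procedure.

```python
import numpy as np

def spectral_merge(affinity):
    """Two-way spectral partition of superpoints from a combined
    affinity matrix. Minimal sketch: build the unnormalized graph
    Laplacian and split by the sign of the Fiedler vector (the
    eigenvector of the second-smallest eigenvalue)."""
    d = affinity.sum(axis=1)
    lap = np.diag(d) - affinity            # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)       # sign gives the two clusters

# Toy affinity: superpoints {0,1} strongly coherent, {2,3} strongly coherent.
A = np.array([[0.0, 0.9, 0.05, 0.05],
              [0.9, 0.0, 0.05, 0.05],
              [0.05, 0.05, 0.0, 0.9],
              [0.05, 0.05, 0.9, 0.0]])
labels = spectral_merge(A)  # {0,1} land in one group, {2,3} in the other
```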

[AI-66] Inferring Underwater Topography with FINN

Link: https://arxiv.org/abs/2408.10649
Authors: Coşku Can Horuz,Matthias Karlbauer,Timothy Praditia,Sergey Oladyshkin,Wolfgang Nowak,Sebastian Otte
Keywords (EN): find extensive application, Spatiotemporal partial differential, partial differential equations, find extensive, engineering fields
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments:

Click to view abstract

Abstract:Spatiotemporal partial differential equations (PDEs) find extensive application across various scientific and engineering fields. While numerous models have emerged from both physics and machine learning (ML) communities, there is a growing trend towards integrating these approaches to develop hybrid architectures known as physics-aware machine learning models. Among these, the finite volume neural network (FINN) has emerged as a recent addition. FINN has proven to be particularly efficient in uncovering latent structures in data. In this study, we explore the capabilities of FINN in tackling the shallow-water equations, which simulates wave dynamics in coastal regions. Specifically, we investigate FINN’s efficacy to reconstruct underwater topography based on these particular wave equations. Our findings reveal that FINN exhibits a remarkable capacity to infer topography solely from wave dynamics, distinguishing itself from both conventional ML and physics-aware ML models. Our results underscore the potential of FINN in advancing our understanding of spatiotemporal phenomena and enhancing parametrization capabilities in related domains.

[AI-67] Privacy-preserving Universal Adversarial Defense for Black-box Models

Link: https://arxiv.org/abs/2408.10647
Authors: Qiao Li,Cong Wu,Jing Chen,Zijun Zhang,Kun He,Ruiying Du,Xinxin Wang,Qingchuang Zhao,Yang Liu
Keywords (EN): Deep neural networks, Deep neural, neural networks, autonomous driving, critical applications
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 12 pages, 9 figures

Click to view abstract

Abstract:Deep neural networks (DNNs) are increasingly used in critical applications such as identity authentication and autonomous driving, where robustness against adversarial attacks is crucial. These attacks can exploit minor perturbations to cause significant prediction errors, making it essential to enhance the resilience of DNNs. Traditional defense methods often rely on access to detailed model information, which raises privacy concerns, as model owners may be reluctant to share such data. In contrast, existing black-box defense methods fail to offer a universal defense against various types of adversarial attacks. To address these challenges, we introduce DUCD, a universal black-box defense method that does not require access to the target model’s parameters or architecture. Our approach involves distilling the target model by querying it with data, creating a white-box surrogate while preserving data privacy. We further enhance this surrogate model using a certified defense based on randomized smoothing and optimized noise selection, enabling robust defense against a broad range of adversarial attacks. Comparative evaluations between the certified defenses of the surrogate and target models demonstrate the effectiveness of our approach. Experiments on multiple image classification datasets show that DUCD not only outperforms existing black-box defenses but also matches the accuracy of white-box defenses, all while enhancing data privacy and reducing the success rate of membership inference attacks.
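The randomized-smoothing defense applied to the distilled surrogate can be sketched as a majority vote over Gaussian-noised inputs. The threshold-rule "classifier", dimensions, and noise level below are stand-in assumptions, not DUCD's actual models or certified-radius computation.

```python
import numpy as np

def smoothed_predict(classifier, x, sigma=0.25, n=500, seed=0):
    """Randomized-smoothing prediction: classify n Gaussian-noised
    copies of x and return the majority class with its vote share.
    The certified defense in the paper additionally derives a robust
    radius from the vote counts; that step is omitted here."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(2, dtype=int)
    for _ in range(n):
        votes[classifier(x + rng.normal(0.0, sigma, size=x.shape))] += 1
    top = int(votes.argmax())
    return top, votes[top] / n

# Hypothetical base classifier: class 1 iff the mean feature is positive.
base = lambda z: int(z.mean() > 0)
x = np.full(8, 0.5)              # clearly in class 1
cls, share = smoothed_predict(base, x)
```

Because the prediction depends only on queries to the (surrogate) classifier, the same wrapper works in the black-box setting the paper targets.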

[AI-68] Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

Link: https://arxiv.org/abs/2408.10646
Authors: Maxim Ifergan,Leshem Choshen,Roee Aharoni,Idan Szpektor,Omri Abend
Keywords (EN): factoid is largely, largely independent, languages, representation, multilingual
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model's ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could benefit from an increase of up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.
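The first measured aspect, answering consistently across languages, can be sketched as pairwise agreement over a model's answers to one factual query. The answer strings below are hypothetical and assumed already normalized; the paper's actual metric and dataset are more involved.

```python
from itertools import combinations

def consistency(answers):
    """Fraction of language pairs that agree on the answer to one
    factual query: a simple version of cross-lingual consistency.
    `answers` maps language code -> normalized answer string."""
    pairs = list(combinations(answers.values(), 2))
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)

# Hypothetical answers to "What is the capital of France?" in 4 languages.
ans = {"en": "paris", "fr": "paris", "de": "paris", "ru": "lyon"}
score = consistency(ans)  # 3 of the 6 language pairs agree -> 0.5
```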

[AI-69] Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation

Link: https://arxiv.org/abs/2408.10642
Authors: Shiming Xie,Hong Chen,Fred Yu,Zeye Sun,Xiuyu Wu
Keywords (EN): Instruct LLM provide, large scale language, Instruct LLM, scale language model, large scale
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:Instruct LLM provides a paradigm used in large-scale language models to align LLMs with human preferences. The paradigm comprises supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This paradigm is also used in downstream scenarios to adapt LLMs to specific corpora and applications. Compared to SFT, many efforts have focused on RLHF, with several algorithms proposed, such as PPO, DPO, IPO, KTO, and MinorDPO. Meanwhile, most efforts for SFT are focused on how to collect, filter, and mix high-quality data. In this article, with insight from DPO and MinorDPO, we propose a training metric for SFT that measures the discrepancy between the optimized model and the original model, and a loss function, MinorSFT, that can increase training effectiveness and reduce the discrepancy between the optimized LLM and the original LLM.
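The two ingredients, a standard SFT likelihood term and a discrepancy measure against the original model, can be sketched at the token level. The squared log-ratio penalty and the `beta` weight below are illustrative assumptions; the actual MinorSFT loss is derived from DPO/MinorDPO in the paper.

```python
import math

def minor_sft_loss(logp_policy, logp_ref, beta=0.1):
    """Sketch of an SFT loss with a drift penalty.

    (a) maximize target log-likelihood (standard SFT negative
        log-likelihood), while
    (b) measuring the drift from the original model via the
        log-ratio logp_policy - logp_ref and damping it.
    beta is a hypothetical trade-off weight."""
    nll = -sum(logp_policy) / len(logp_policy)
    drift = sum(p - r for p, r in zip(logp_policy, logp_ref)) / len(logp_policy)
    return nll + beta * drift * drift   # penalize squared average drift

# Two target tokens; probabilities are toy values.
logp_policy = [math.log(0.6), math.log(0.5)]
logp_ref = [math.log(0.5), math.log(0.5)]
loss = minor_sft_loss(logp_policy, logp_ref)
```

With `beta=0` this collapses to vanilla SFT, which is the baseline the paper compares against.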

[AI-70] A Review of Human-Object Interaction Detection

Link: https://arxiv.org/abs/2408.10641
Authors: Yuxiao Wang,Qiwei Xiong,Yu Lei,Weiying Xue,Qi Liu,Zhenao Wei
Keywords (EN): high-level visual understanding, HOI detection, image-based HOI detection, Human-object interaction, HOI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Human-object interaction (HOI) detection plays a key role in high-level visual understanding, facilitating a deep comprehension of human activities. Specifically, HOI detection aims to locate the humans and objects involved in interactions within images or videos and classify the specific interactions between them. The success of this task is influenced by several key factors, including the accurate localization of human and object instances, as well as the correct classification of object categories and interaction relationships. This paper systematically summarizes and discusses the recent work in image-based HOI detection. First, the mainstream datasets involved in HOI relationship detection are introduced. Furthermore, starting with two-stage methods and end-to-end one-stage detection approaches, this paper comprehensively discusses the current developments in image-based HOI detection, analyzing the strengths and weaknesses of these two methods. Additionally, the advancements of zero-shot learning, weakly supervised learning, and the application of large-scale language models in HOI detection are discussed. Finally, the current challenges in HOI detection are outlined, and potential research directions and future trends are explored.

[AI-71] Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Link: https://arxiv.org/abs/2408.10635
Authors: Jonathan Light,Min Cai,Weiqin Chen,Guanzhi Wang,Xiusi Chen,Wei Cheng,Yisong Yue,Ziniu Hu
Keywords (EN): Strategist that utilizes, playing multi-agent games, self-improvement process, method Strategist, Monte Carlo tree
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: website: this https URL

Click to view abstract

Abstract:In this paper, we propose a new method, Strategist, that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection, which can then be used to learn high-level strategic skills, such as how to evaluate states that guide the low-level execution. We showcase how our method can be used in both action planning and dialogue generation in the context of games, achieving good performance on both tasks. Specifically, we demonstrate that our method can help train agents with better performance than both traditional reinforcement learning-based approaches and other LLM-based skill learning approaches in games including the Game of Pure Strategy (GOPS) and The Resistance: Avalon.
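At the heart of the self-play simulations is Monte Carlo tree search; its child-selection rule can be sketched with standard UCB1. The statistics below are made-up; in the paper the value estimates would come from the LLM-learned state-evaluation skill.

```python
import math

def uct_select(children, c=1.4):
    """UCB1 child selection, the core step of Monte Carlo tree search.
    Each child is (visits, total_value); unvisited children are
    preferred outright. c is the usual exploration constant."""
    total = sum(v for v, _ in children)
    best, best_score = 0, float("-inf")
    for i, (visits, value) in enumerate(children):
        if visits == 0:
            return i
        score = value / visits + c * math.sqrt(math.log(total) / visits)
        if score > best_score:
            best, best_score = i, score
    return best

# (visits, total_value) for three candidate actions; toy numbers.
children = [(10, 7.0), (10, 3.0), (1, 0.9)]
pick = uct_select(children)  # the rarely-tried action wins on exploration
```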

[AI-72] LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

Link: https://arxiv.org/abs/2408.10631
Authors: Yupeng Su,Ziyi Guan,Xiaoqun Liu,Tianlai Jin,Dongkuan Wu,Graziano Chesi,Ngai Wong,Hao Yu
Keywords (EN): Large language models, Large language, significantly in scale, grown significantly, Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have grown significantly in scale, leading to a critical need for efficient model pruning techniques. Existing post-training pruning techniques primarily focus on measuring weight importance on converged dense models to determine salient weights to retain. However, they often overlook the changes in weight importance during the pruning process, which can lead to performance degradation in the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring global performance optimization. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance using weights multiplied by gradients. Our experiments show that LLM-Barber can efficiently prune models like LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at this https URL.
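The pruning metric named in the abstract, importance measured by weights multiplied by gradients, can be sketched as a one-shot mask rebuild. The tiny matrices are illustrative; the block-aware error optimization across Self-Attention and MLP blocks is omitted here.

```python
import numpy as np

def rebuild_mask(weights, grads, sparsity=0.5):
    """One-shot sparsity-mask rebuild using the |weight * gradient|
    importance metric described for LLM-Barber: prune the `sparsity`
    fraction of entries with the lowest importance, keep the rest."""
    importance = np.abs(weights * grads)
    k = int(importance.size * sparsity)          # number of entries to prune
    threshold = np.partition(importance.ravel(), k - 1)[k - 1]
    return (importance > threshold).astype(weights.dtype)

# Toy 2x2 weight block and its gradients.
W = np.array([[0.5, -0.1], [0.02, -0.8]])
G = np.array([[0.1, 2.0], [1.0, 0.1]])
mask = rebuild_mask(W, G, sparsity=0.5)  # importances: .05, .20, .02, .08
```

Note that a small weight with a large gradient (here `-0.1`) can outrank a large weight with a small gradient, which is exactly what a pure magnitude criterion would miss.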

[AI-73] Finding the DeepDream for Time Series: Activation Maximization for Univariate Time Series (ECML-PKDD)

Link: https://arxiv.org/abs/2408.10628
Authors: Udo Schlegel,Daniel A. Keim,Tobias Sutter
Keywords (EN): series data remains, interpret time series, time series data, Sequence Dreaming, time series
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, accepted at TempXAI @ ECML-PKDD

Click to view abstract

Abstract:Understanding how models process and interpret time series data remains a significant challenge in deep learning, limiting their applicability in safety-critical areas such as healthcare. In this paper, we introduce Sequence Dreaming, a technique that adapts Activation Maximization to analyze sequential information, aiming to enhance the interpretability of neural networks operating on univariate time series. By leveraging this method, we visualize the temporal dynamics and patterns most influential in model decision-making processes. To counteract the generation of unrealistic or excessively noisy sequences, we enhance Sequence Dreaming with a range of regularization techniques, including exponential smoothing. This approach ensures the production of sequences that more accurately reflect the critical features identified by the neural network. Our approach is tested on a time series classification dataset encompassing applications in predictive maintenance. The results show that our proposed Sequence Dreaming approach demonstrates targeted activation maximization for different use cases, generating either class-centered or class-border activation maximizations. The results underscore the versatility of Sequence Dreaming in uncovering salient temporal features learned by neural networks, thereby advancing model transparency and trustworthiness in decision-critical domains.
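The combination of activation maximization with exponential smoothing can be sketched as gradient ascent on the input sequence, smoothing after every step. The toy "neuron" and its gradient stand in for a trained network; all hyperparameters are illustrative assumptions.

```python
import numpy as np

def dream_sequence(grad_fn, length=32, steps=100, lr=0.1, alpha=0.3):
    """Activation maximization for a univariate sequence: gradient
    ascent on the input, with exponential smoothing applied each step
    to suppress unrealistic high-frequency noise (one of the paper's
    regularizers). grad_fn stands in for the input gradient of a
    trained network's target activation."""
    x = np.zeros(length)
    for _ in range(steps):
        x = x + lr * grad_fn(x)              # ascend the target activation
        smoothed = np.copy(x)                # exponential smoothing pass
        for t in range(1, length):
            smoothed[t] = alpha * x[t] + (1 - alpha) * smoothed[t - 1]
        x = smoothed
    return x

# Toy "neuron": activation is the sequence mean, so its gradient is uniform.
grad = lambda x: np.ones_like(x) / x.size
dream = dream_sequence(grad)
```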

[AI-74] WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2408.10624
Authors: Yonggan Wu,Ling-Chao Meng,Yuan Zichao,Sixian Chan,Hong-Qiang Wang
Keywords (EN): visible-infrared person re-identification, primary challenges lies, significant cross-modality discrepancy, information mining, Interactive Information Mining
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Click to view abstract

Abstract:For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in the significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining. They often focus solely on mining singular dimensions like spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to its low-computational-complexity design, separate MIIM modules can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.

[AI-75] Novel Change Detection Framework in Remote Sensing Imagery Using Diffusion Models and Structural Similarity Index (SSIM)

Link: https://arxiv.org/abs/2408.10619
Authors: Andrew Kiruluta,Eric Lundy,Andreas Lemos
Keywords (EN): urban growth, Change detection, enabling the monitoring, disaster impact, crucial task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Change detection is a crucial task in remote sensing, enabling the monitoring of environmental changes, urban growth, and disaster impact. Conventional change detection techniques, such as image differencing and ratioing, often struggle with noise and fail to capture complex variations in imagery. Recent advancements in machine learning, particularly generative models like diffusion models, offer new opportunities for enhancing change detection accuracy. In this paper, we propose a novel change detection framework that combines the strengths of Stable Diffusion models with the Structural Similarity Index (SSIM) to create robust and interpretable change maps. Our approach, named Diffusion Based Change Detector, is evaluated on both synthetic and real-world remote sensing datasets and compared with state-of-the-art methods. The results demonstrate that our method significantly outperforms traditional differencing techniques and recent deep learning-based methods, particularly in scenarios with complex changes and noise.
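The SSIM side of the framework can be sketched with a global SSIM score between two acquisitions. Real change-detection pipelines compute SSIM in local windows to obtain a change map, and the paper pairs it with Stable Diffusion reconstructions rather than raw images; the random images and constants below are illustrative assumptions.

```python
import numpy as np

def ssim(a, b, c1=1e-4, c2=9e-4):
    """Global SSIM between two images with values in [0, 1].
    c1, c2 are the usual small stabilizing constants."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

rng = np.random.default_rng(0)
before = rng.random((16, 16))
after = before.copy()
after[:4, :4] = 1.0 - after[:4, :4]   # simulate a localized change

s_same = ssim(before, before)         # identical images -> SSIM == 1
s_diff = ssim(before, after)          # changed images -> lower SSIM
```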

[AI-76] OMEGA: Efficient Occlusion-Aware Navigation for Air-Ground Robot in Dynamic Environments via State Space Model

Link: https://arxiv.org/abs/2408.10618
Authors: Junming Wang,Dong Huang,Xiuxian Guan,Zekai Sun,Tianxiang Shen,Fangming Liu,Heming Cui
Keywords (EN): Signed Distance Field, Euclidean Signed Distance, Air-ground robots, disaster response due, computing Euclidean Signed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: OccMamba is Coming!

Click to view abstract

Abstract:Air-ground robots (AGRs) are widely used in surveillance and disaster response due to their exceptional mobility and versatility (i.e., flying and driving). Current AGR navigation systems perform well in static occlusion-prone environments (e.g., indoors) by using 3D semantic occupancy networks to predict occlusions for complete local mapping and then computing Euclidean Signed Distance Field (ESDF) for path planning. However, these systems face challenges in dynamic, severe occlusion scenes (e.g., crowds) due to limitations in perception networks’ low prediction accuracy and path planners’ high computation overhead. In this paper, we propose OMEGA, which contains OccMamba with an Efficient AGR-Planner to address the above-mentioned problems. OccMamba adopts a novel architecture that separates semantic and occupancy prediction into independent branches, incorporating two mamba blocks within these branches. These blocks efficiently extract semantic and geometric features in 3D environments with linear complexity, ensuring that the network can learn long-distance dependencies to improve prediction accuracy. Semantic and geometric features are combined within the Bird’s Eye View (BEV) space to minimise computational overhead during feature fusion. The resulting semantic occupancy map is then seamlessly integrated into the local map, providing occlusion awareness of the dynamic environment. Our AGR-Planner utilizes this local map and employs kinodynamic A* search and gradient-based trajectory optimization to guarantee planning is ESDF-free and energy-efficient. Extensive experiments demonstrate that OccMamba outperforms the state-of-the-art 3D semantic occupancy network with 25.0% mIoU. End-to-end navigation experiments in dynamic scenes verify OMEGA’s efficiency, achieving a 96% average planning success rate. Code and video are available at this https URL.

[AI-77] Generalizable Facial Expression Recognition (ECCV 2024)

Link: https://arxiv.org/abs/2408.10614
Authors: Yuhang Zhang,Xiuqi Zheng,Chenyi Liang,Jiani Hu,Weihong Deng
Keywords (EN): FER, FER methods, SOTA FER methods, facial expression recognition, domain adaptation FER
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV2024

Click to view abstract

Abstract:SOTA facial expression recognition (FER) methods fail on test sets that have domain gaps with the train set. Recent domain adaptation FER methods need to acquire labeled or unlabeled samples of target domains to fine-tune the FER model, which might be infeasible in real-world deployment. In this paper, we aim to improve the zero-shot generalization ability of FER methods on different unseen test sets using only one train set. Inspired by how humans first detect faces and then select expression features, we propose a novel FER pipeline to extract expression-related features from any given face images. Our method is based on the generalizable face features extracted by large models like CLIP. However, it is non-trivial to adapt the general features of CLIP for specific tasks like FER. To preserve the generalization ability of CLIP and the high precision of the FER model, we design a novel approach that learns sigmoid masks based on the fixed CLIP face features to extract expression features. To further improve the generalization ability on unseen test sets, we separate the channels of the learned masked features according to the expression classes to directly generate logits and avoid using the FC layer to reduce overfitting. We also introduce a channel-diverse loss to make the learned masks separated. Extensive experiments on five different FER datasets verify that our method outperforms SOTA FER methods by large margins. Code is available at this https URL.
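The two design choices, a learned sigmoid mask over fixed face features and channel groups summed directly into per-class logits instead of an FC layer, can be sketched together. The feature dimension, class count, and untrained mask below are illustrative assumptions, not the actual CLIP dimensions.

```python
import numpy as np

def expression_logits(face_feat, mask_params, n_classes=7):
    """Gate fixed face features (e.g. from CLIP) with a learned
    sigmoid mask, then split the masked channels into one group per
    expression class and sum each group directly into a logit,
    avoiding an FC layer."""
    mask = 1.0 / (1.0 + np.exp(-mask_params))   # sigmoid mask in (0, 1)
    masked = face_feat * mask
    groups = masked.reshape(n_classes, -1)      # per-class channel groups
    return groups.sum(axis=1)                   # one logit per class

feat = np.ones(14)       # stand-in for a 14-dim fixed face feature
params = np.zeros(14)    # untrained mask parameters -> every gate is 0.5
logits = expression_logits(feat, params)
```

In training, only `mask_params` would receive gradients (plus the paper's channel-diverse loss pushing the per-class groups apart); the face features stay frozen, which is what preserves CLIP's generalization.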

[AI-78] Promoting Equality in Large Language Models : Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Link: https://arxiv.org/abs/2408.10608
Authors: Yongxin Deng(1),Xihe Qiu(1),Xiaoyu Tan(2),Jing Pan(3),Chen Jue(1),Zhijun Fang(4),Yinghui Xu(5),Wei Chu(2),Yuan Qi(5) ((1) Shanghai University of Engineering Science, (2) INF Technology (Shanghai) Co., Ltd., (3) Monash University, (4) Donghua University, (5) Fudan University)
Keywords (EN): Large language models, Large language, extensive text corpora, inevitably include biased, text corpora
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model’s weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experimentation, we have confirmed the presence of the implicit bias problem in LLMs and demonstrated the effectiveness of our BTBR approach.
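The likelihood-ratio screening step can be sketched as a simple filter: flag dataset entries the model assigns markedly higher likelihood than a reference does, suggesting the bias was absorbed in training. The scores and threshold below are made-up illustrations, not BTBR's actual models or values.

```python
def screen_biased(entries, logp_model, logp_base, tau=1.0):
    """Likelihood-ratio screening: keep entries whose log-likelihood
    under the (possibly biased) model exceeds that under a reference
    model by more than tau. Flagged entries would then feed the
    knowledge-triple construction and model-editing stages."""
    return [e for e in entries
            if logp_model[e] - logp_base[e] > tau]

# Hypothetical per-entry log-likelihoods from the two models.
entries = ["s1", "s2", "s3"]
logp_model = {"s1": -2.0, "s2": -1.0, "s3": -5.0}
logp_base = {"s1": -2.1, "s2": -3.5, "s3": -5.2}
flagged = screen_biased(entries, logp_model, logp_base)  # only "s2" stands out
```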

[AI-79] MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Link: https://arxiv.org/abs/2408.10605
Authors: Yanbo Ding,Shaobin Zhuang,Kunchang Li,Zhengrong Yue,Yu Qiao,Yali Wang
Keywords (EN): existing methods struggle, methods struggle, struggle to create, image, MUSES
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark of T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results mark a significant step forward for MUSES in bridging natural language, 2D image generation, and 3D world.

[AI-80] Multilingual Non-Factoid Question Answering with Silver Answers

Link: https://arxiv.org/abs/2408.10604
Authors: Ritwik Mishra,Sreeram Vennam,Rajiv Ratn Shah,Ponnurangam Kumaraguru
Keywords (EN): existing Question Answering, short-context Question Answering, Question Answering Datasets, Question Answering, Answering Datasets
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA-pairs from MuNfQuAD (golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD testset and the golden set, respectively. Furthermore, the APS model effectively generalizes to certain languages within the golden set, even after being fine-tuned on silver labels.

[AI-81] MV-MOS: Multi-View Feature Fusion for 3D Moving Object Segmentation

Link: https://arxiv.org/abs/2408.10602
Authors: Jintao Cheng,Xingming Chen,Jinxin Liang,Xiaoyu Tang,Xieyuanli Chen,Dachuan Li
Keywords (EN): Effectively summarizing dense, moving object segmentation, summarizing dense, robotics applications, point cloud data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures

Click to view abstract

Abstract:Effectively summarizing dense 3D point cloud data and extracting motion information of moving objects (moving object segmentation, MOS) is crucial to autonomous driving and robotics applications. How to effectively utilize motion and semantic features and avoid information loss during 3D-to-2D projection is still a key challenge. In this paper, we propose a novel multi-view MOS model (MV-MOS) by fusing motion-semantic features from different 2D representations of point clouds. To effectively exploit complementary information, the motion branches of the proposed model combine motion features from both bird's eye view (BEV) and range view (RV) representations. In addition, a semantic branch is introduced to provide supplementary semantic features of moving objects. Finally, a Mamba module is utilized to fuse the semantic features with motion features and provide effective guidance for the motion branches. We validated the effectiveness of the proposed multi-branch fusion MOS framework via comprehensive experiments, and our proposed model outperforms existing state-of-the-art models on the SemanticKITTI benchmark.

[AI-82] Breast tumor classification based on self-supervised contrastive learning from ultrasound videos

Link: https://arxiv.org/abs/2408.10600
Authors: Yunxin Tang,Siyuan Tang,Jian Zhang,Hao Chen
Keywords (EN): diagnosing breast tumors, Breast ultrasound, Background, Breast, breast tumors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Background: Breast ultrasound is prominently used in diagnosing breast tumors. At present, many automatic systems based on deep learning have been developed to help radiologists in diagnosis. However, training such systems remains challenging because they are usually data-hungry and demand large amounts of labeled data, which require professional knowledge and are expensive. Methods: We adopted a triplet network and a self-supervised contrastive learning technique to learn representations from unlabeled breast ultrasound video clips. We further designed a new hard triplet loss to learn representations that particularly discriminate positive and negative image pairs that are hard to recognize. We also constructed a pretraining dataset from breast ultrasound videos (1,360 videos from 200 patients), which includes an anchor sample dataset with 11,805 images, a positive sample dataset with 188,880 images, and a negative sample dataset dynamically generated from video clips. Further, we constructed a finetuning dataset, including 400 images from 66 patients. We transferred the pretrained network to a downstream benign/malignant classification task and compared the performance with other state-of-the-art models, including three models pretrained on ImageNet and a previous contrastive learning model retrained on our datasets. Results and conclusion: Experiments revealed that our model achieved an area under the receiver operating characteristic curve (AUC) of 0.952, which is significantly higher than the others. Further, we assessed the dependence of our pretrained model on the number of labeled data and revealed that 100 samples were required to achieve an AUC of 0.901. The proposed framework greatly reduces the demand for labeled data and holds potential for use in automatic breast ultrasound image diagnosis.
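A "hard" triplet loss in the spirit described above can be sketched with hard mining: for one anchor embedding, take the farthest positive and the closest negative. The 2-D embeddings and margin are toy assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def hard_triplet_loss(anchor, positives, negatives, margin=0.2):
    """Triplet loss with hard mining for one anchor embedding:
    penalize when the hardest (farthest) positive is not closer than
    the hardest (closest) negative by at least `margin`."""
    d_pos = np.linalg.norm(positives - anchor, axis=1).max()  # hardest positive
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()  # hardest negative
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings.
anchor = np.array([0.0, 0.0])
pos = np.array([[0.1, 0.0], [0.3, 0.0]])   # same lesion, different frames
neg = np.array([[1.0, 0.0], [0.35, 0.0]])  # other videos; one is "hard"
loss = hard_triplet_loss(anchor, pos, neg)
```

Without hard mining (averaging over all pairs instead), the easy negative at distance 1.0 would wash out the informative one at 0.35, which is the failure mode hard triplet losses address.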

[AI-83] Hologram Reasoning for Solving Algebra Problems with Geometry Diagrams

Link: https://arxiv.org/abs/2408.10592
Authors: Litian Huang,Xinguo Yu,Feng Xiong,Bin He,Shengbing Tang,Jiawen Fu
Keywords (EN): Solving Algebra Problems, Geometry Diagrams, diagram processing, Solving Algebra, Algebra Problems
Subjects: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:Solving Algebra Problems with Geometry Diagrams (APGDs) is still a challenging problem because diagram processing is not studied as intensively as language processing. To address this challenge, this paper proposes a hologram reasoning scheme and develops a high-performance method for solving APGDs based on this scheme. To reach this goal, it first defines a hologram, a kind of graph, and proposes a hologram generator to convert a given APGD into a hologram, which represents the entire information of the APGD and from which the relations needed for solving the problem can be acquired in a uniform way. Then HGR, a hologram reasoning method, employs a pool of prepared graph models to derive algebraic equations consistent with the geometric theorems. The method can be updated by adding new graph models to the pool. Lastly, it employs deep reinforcement learning to enhance the efficiency of model selection from the pool. The entire HGR not only ensures high solution accuracy with fewer reasoning steps but also significantly enhances the interpretability of the solution process by providing descriptions of all reasoning steps. Experimental results demonstrate the effectiveness of HGR in improving both accuracy and interpretability in solving APGDs.

[AI-84] Putting People in LLMs Shoes: Generating Better Answers via Question Rewriter

链接: https://arxiv.org/abs/2408.10573
作者: Junhao Chen,Bowen Wang,Zhouqiang jiang,Yuta Nakashima
关键词-EN: Large Language Models, Large Language, Language Models, demonstrated significant capabilities, significant capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 7 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant capabilities, particularly in the domain of question answering (QA). However, their effectiveness in QA is often undermined by the vagueness of user questions. To address this issue, we introduce single-round instance-level prompt optimization, referred to as question rewriter. By enhancing the intelligibility of human questions for black-box LLMs, our question rewriter improves the quality of generated answers. The rewriter is optimized using direct preference optimization based on feedback collected from automatic criteria for evaluating generated answers; therefore, its training does not require costly human annotations. The experiments across multiple black-box LLMs and long-form question answering (LFQA) datasets demonstrate the efficacy of our method. This paper provides a practical framework for training question rewriters and sets a precedent for future explorations in prompt optimization within LFQA tasks. Code is available at this https URL.
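The rewriter is trained with direct preference optimization (DPO). For reference, the standard DPO objective for a single (preferred, rejected) pair can be sketched as below; the log-probabilities come from the tuned policy and a frozen reference model, and the value of β and the variable names are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio margin - reference log-ratio margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When both models score the pair identically the loss is log 2; it shrinks as the policy prefers the chosen answer more strongly than the reference does, which is exactly the signal the automatic answer-quality criteria supply here.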

[AI-85] Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models

链接: https://arxiv.org/abs/2408.10571
作者: Cong Wan,Yuhang He,Xiang Song,Yihong Gong
关键词-EN: allowing for efficient, textual descriptions, efficient synthesis, synthesis of photos, data with textual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 33 pages, 14 figures, under review

点击查看摘要

Abstract:Diffusion models have revolutionized customized text-to-image generation, allowing for efficient synthesis of photos from personal data with textual descriptions. However, these advancements bring forth risks including privacy breaches and unauthorized replication of artworks. Previous researches primarily center around using prompt-specific methods to generate adversarial examples to protect personal images, yet the effectiveness of existing methods is hindered by constrained adaptability to different prompts. In this paper, we introduce a Prompt-Agnostic Adversarial Perturbation (PAP) method for customized diffusion models. PAP first models the prompt distribution using a Laplace Approximation, and then produces prompt-agnostic perturbations by maximizing a disturbance expectation based on the modeled distribution. This approach effectively tackles the prompt-agnostic attacks, leading to improved defense stability. Extensive experiments in face privacy and artistic style protection, demonstrate the superior generalization of our method in comparison to existing techniques.
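As a hedged sketch of the core mechanism, maximizing an expected disturbance over a Gaussian (Laplace-approximated) prompt distribution can be done with PGD-style ascent on Monte Carlo gradient estimates. The toy quadratic loss, step sizes, and the ε-ball constraint below are assumptions for illustration, not the paper's attack on a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_disturbance_ascent(x, mu, cov_diag, loss_grad, eps=0.05,
                                steps=50, lr=0.01, n_samples=32):
    """Maximize E_{c ~ N(mu, diag(cov_diag))}[loss(x + delta, c)] over delta,
    keeping ||delta||_inf <= eps (a projected-sign-gradient sketch)."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        # Monte Carlo samples from the approximated prompt distribution
        cs = mu + np.sqrt(cov_diag) * rng.standard_normal((n_samples, mu.size))
        g = np.mean([loss_grad(x + delta, c) for c in cs], axis=0)
        delta = np.clip(delta + lr * np.sign(g), -eps, eps)  # ascent + projection
    return delta
```

Because the perturbation is optimized against the whole prompt distribution rather than one fixed prompt, it stays adversarial for prompts not seen during optimization, which is the stated advantage over prompt-specific methods.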

[AI-86] SparseGrow: Addressing Growth-Induced Forgetting in Task-Agnostic Continual Learning DATE AAAI

链接: https://arxiv.org/abs/2408.10566
作者: Yuqing Zhao,Divya Saxena,Jiannong Cao,Xiaoyun Liu,Changlin Song
关键词-EN: model growth, model, growth, model growth enhances, improper model growth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been submitted to the AAAI conference. If accepted, the final version will be updated to reflect the conference proceedings

点击查看摘要

Abstract:In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks. However, improper model growth can lead to severe degradation of previously learned knowledge, an issue we name growth-induced forgetting (GIFt), especially in task-agnostic CL, which uses the entire grown model for inference. Existing works, despite adopting model growth and random initialization for better adaptability, often fail to recognize the presence of GIFt caused by improper model growth. This oversight limits comprehensive control of forgetting and hinders full utilization of model growth. We are the first in CL to identify this issue and conduct an in-depth study on the root cause of GIFt, where layer expansion, which widens layers without affecting model functionality, stands out among model growth strategies. Yet, direct adoption of layer expansion presents challenges: it lacks data-driven control and initialization of expanded parameters to balance adaptability and knowledge retention. This paper presents a novel SparseGrow approach to overcome the issue of GIFt while enhancing adaptability over new data. SparseGrow employs data-driven sparse layer expansion to control efficient parameter usage during growth, reducing GIFt from excessive growth and functionality changes. It also combines sparse growth with on-data initialization at a late stage of training to create partially 0-valued expansions that fit the learned distribution, enhancing retention and adaptability. To further minimize forgetting, freezing is applied by calculating the sparse mask, allowing data-driven preservation of important parameters. Through experiments across datasets with various settings, cases and task numbers, we demonstrate the necessity of layer expansion and showcase the effectiveness of SparseGrow in overcoming GIFt, highlighting its adaptability and knowledge retention for incremental tasks.
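A minimal sketch of the layer-expansion idea: widen a weight matrix with new, partially zero-valued columns (so the grown layer barely perturbs the learned function at first) and return a freeze mask over the old units. The initialization scale, sparsity level, and column-wise widening are assumptions for illustration, not the paper's exact data-driven procedure.

```python
import numpy as np

def expand_layer(W, new_units, sparsity=0.5, rng=None):
    """Widen an (in, out) weight matrix by `new_units` output units.

    New columns get a small, partially 0-valued init so the expansion
    initially changes the layer's function as little as possible; a
    boolean freeze mask marks the old units for preservation."""
    rng = rng or np.random.default_rng(0)
    new_cols = rng.standard_normal((W.shape[0], new_units)) * 0.01
    new_cols[rng.random(new_cols.shape) < sparsity] = 0.0   # sparse init
    W_grown = np.concatenate([W, new_cols], axis=1)
    freeze = np.concatenate([np.ones(W.shape[1], dtype=bool),
                             np.zeros(new_units, dtype=bool)])
    return W_grown, freeze
```

During subsequent training, gradients for columns where `freeze` is True would be zeroed, so only the expanded capacity adapts to new data.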

[AI-87] The Stable Model Semantics for Higher-Order Logic Programming

链接: https://arxiv.org/abs/2408.10563
作者: Bart Bogaerts,Angelos Charalambidis,Giannos Chatziagapis,Babis Kostopoulos,Samuele Pollaci,Panos Rondogiannis
关键词-EN: Approximation Fixpoint Theory, stable model semantics, Fixpoint Theory, stable model, higher-order logic programs
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:We propose a stable model semantics for higher-order logic programs. Our semantics is developed using Approximation Fixpoint Theory (AFT), a powerful formalism that has successfully been used to give meaning to diverse non-monotonic formalisms. The proposed semantics generalizes the classical two-valued stable model semantics of (Gelfond and Lifschitz 1988) as well as the three-valued one of (Przymusinski 1990), retaining their desirable properties. Due to the use of AFT, we also get for free alternative semantics for higher-order logic programs, namely supported model, Kripke-Kleene, and well-founded. Additionally, we define a broad class of stratified higher-order logic programs and demonstrate that they have a unique two-valued higher-order stable model which coincides with the well-founded semantics of such programs. We provide a number of examples in different application domains, which demonstrate that higher-order logic programming under the stable model semantics is a powerful and versatile formalism, which can potentially form the basis of novel ASP systems.
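For context, the classical two-valued semantics the paper generalizes can be stated compactly for propositional normal programs: a candidate set of atoms is a stable model iff it equals the least model of its Gelfond-Lifschitz reduct. A small illustrative implementation (the rule encoding as `(head, positive body, negative body)` tuples is an assumption of this sketch):

```python
def reduct(program, candidate):
    """Gelfond-Lifschitz reduct: delete rules whose negative body meets
    the candidate set; strip 'not' literals from the surviving rules."""
    return [(head, pos) for head, pos, neg in program
            if not set(neg) & candidate]

def least_model(positive_program):
    """Least model of a negation-free program by fixpoint iteration."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos in positive_program:
            if set(pos) <= model and head not in model:
                model.add(head)
                changed = True
    return model

def is_stable_model(program, candidate):
    """candidate is stable iff it is the least model of its own reduct."""
    return least_model(reduct(program, candidate)) == candidate

# p :- not q.   q :- not p.   (the classic program with two stable models)
program = [('p', [], ['q']), ('q', [], ['p'])]
```

The paper's contribution is lifting this fixpoint construction to higher-order programs via AFT; the propositional case above is only the base intuition.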

[AI-88] Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

链接: https://arxiv.org/abs/2408.10556
作者: Yun Qu,Boyuan Wang,Jianzhun Shao,Yuhang Jiang,Chen Chen,Zhenbin Ye,Lin Liu,Junfeng Yang,Lin Lai,Hongyang Qin,Minwen Deng,Juchao Zhuo,Deheng Ye,Qiang Fu,Wei Yang,Guang Yang,Lanxiao Huang,Xiangyang Ji
关键词-EN: Multi-Agent Reinforcement Learning, Offline Multi-Agent Reinforcement, Offline Reinforcement Learning, Reinforcement Learning, represent real-world complexities
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets are often too simplistic and lack realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the incompetency of current offline RL approaches in handling task complexity, generalization and multi-task learning.

[AI-89] AI-Based IVR

链接: https://arxiv.org/abs/2408.10549
作者: Gassyrbek Kosherbay,Nurgissa Apbaz
关键词-EN: Interactive Voice Response, Interactive Voice, Voice Response, call center IVR, traditional IVR
类目: Artificial Intelligence (cs.AI)
*备注: in Russian language

点击查看摘要

Abstract:The use of traditional IVR (Interactive Voice Response) methods often proves insufficient to meet customer needs. This article examines the application of artificial intelligence (AI) technologies to enhance the efficiency of IVR systems in call centers. A proposed approach is based on the integration of speech-to-text conversion solutions, text query classification using large language models (LLM), and speech synthesis. Special attention is given to adapting these technologies to work with the Kazakh language, including fine-tuning models on specialized datasets. The practical aspects of implementing the developed system in a real call center for query classification are described. The research results demonstrate that the application of AI technologies in call center IVR systems reduces operator workload, improves customer service quality, and increases the efficiency of query processing. The proposed approach can be adapted for use in call centers operating with various languages.

[AI-90] Diff-PCC: Diffusion-based Neural Compression for 3D Point Clouds

链接: https://arxiv.org/abs/2408.10543
作者: Kai Liu,Kang You,Pan Gao
关键词-EN: Stable diffusion networks, detailed visual content, Stable diffusion, visual content, networks have emerged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Stable diffusion networks have emerged as a groundbreaking development for their ability to produce realistic and detailed visual content. This characteristic renders them ideal decoders, capable of producing high-quality and aesthetically pleasing reconstructions. In this paper, we introduce the first diffusion-based point cloud compression method, dubbed Diff-PCC, to leverage the expressive power of the diffusion model for generative and aesthetically superior decoding. Different from the conventional autoencoder fashion, a dual-space latent representation is devised in this paper, in which a compressor composed of two independent encoding backbones is considered to extract expressive shape latents from distinct latent spaces. At the decoding side, a diffusion-based generator is devised to produce high-quality reconstructions by considering the shape latents as guidance to stochastically denoise the noisy point clouds. Experiments demonstrate that the proposed Diff-PCC achieves state-of-the-art compression performance (e.g., 7.711 dB BD-PSNR gains against the latest G-PCC standard at ultra-low bitrate) while attaining superior subjective quality. Source code will be made publicly available.

[AI-91] NutrifyAI: An AI-Powered System for Real-Time Food Detection Nutritional Analysis and Personalized Meal Recommendations

链接: https://arxiv.org/abs/2408.10532
作者: Michelle Han,Junyao Chen
关键词-EN: Calorie Counter, nutrition apps reaching, apps reaching, health apps, surging in popularity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 12 figures

点击查看摘要

Abstract:With diet and nutrition apps reaching 1.4 billion users in 2022 [1], it’s no surprise that health apps like MyFitnessPal, Noom, and Calorie Counter are surging in popularity. However, one major setback [2] of nearly all nutrition applications is that users must enter food data manually, which is time-consuming and tedious. Thus, there has been an increasing demand for applications that can accurately identify food items, analyze their nutritional content, and offer dietary recommendations in real-time. This paper introduces a comprehensive system that combines advanced computer vision techniques with nutrition analysis, implemented in a versatile mobile and web application. The system is divided into three key components: 1) food detection using the YOLOv8 model, 2) nutrient analysis via the Edamam Nutrition Analysis API, and 3) personalized meal recommendations using the Edamam Meal Planning and Recipe Search APIs. Designed for both mobile and web platforms, the application ensures fast processing times with an intuitive user interface, with features such as data visualizations using Chart.js, a login system, and personalized settings for dietary preferences, allergies, and cuisine choices. Preliminary results showcase the system’s effectiveness, making it a valuable tool for users to make informed dietary decisions.

[AI-92] EdgeNAT: Transformer for Efficient Edge Detection

链接: https://arxiv.org/abs/2408.10527
作者: Jinghuai Jie,Yan Guo,Guixing Wu,Junmin Wu,Baojian Hua
关键词-EN: increasingly prominent role, Neighborhood Attention Transformer, feature extraction capabilities, Dilated Neighborhood Attention, powerful feature extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformers, renowned for their powerful feature extraction capabilities, have played an increasingly prominent role in various vision tasks. Especially, recent advancements present transformers with hierarchical structures such as the Dilated Neighborhood Attention Transformer (DiNAT), demonstrating outstanding ability to efficiently capture both global and local features. However, transformers’ application in edge detection has not been fully exploited. In this paper, we propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder, capable of extracting object boundaries and meaningful edges both accurately and efficiently. On the one hand, EdgeNAT captures global contextual information and detailed local cues with DiNAT; on the other hand, it enhances feature representation with a novel SCAF-MLA decoder by utilizing both inter-spatial and inter-channel relationships of feature maps. Extensive experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images. Notably, on the widely used BSDS500 dataset, our L model achieves impressive performance, with ODS F-measure and OIS F-measure of 86.0% and 87.6% for multi-scale input, and 84.9% and 86.3% for single-scale input, surpassing the current state-of-the-art EDTER by 1.2%, 1.1%, 1.7%, and 1.6%, respectively. Moreover, as for throughput, our approach runs at 20.87 FPS on an RTX 4090 GPU with single-scale input. The code for our method will be released soon.

[AI-93] XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition

链接: https://arxiv.org/abs/2408.10524
作者: Xucheng Wan,Naijun Zheng,Kai Liu,Huan Zhou
关键词-EN: Contextualized ASR models, Contextualized ASR, predefined phrase list, demonstrated to effectively, effectively improve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: accepted to NCMMSC 2024

点击查看摘要

Abstract:Contextualized ASR models have been demonstrated to effectively improve the recognition accuracy of uncommon phrases when a predefined phrase list is available. However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing (XCB) module. Specifically, we augment a pre-trained ASR model for the dominant language by integrating an auxiliary language biasing module and a supplementary language-specific loss, aimed at enhancing the recognition of phrases in the secondary language. Experimental results conducted on our in-house code-switching dataset have validated the efficacy of our approach, demonstrating significant improvements in the recognition of biasing phrases in the secondary language, even without any additional inference overhead. Additionally, our proposed system exhibits both efficiency and generalization when applied to the unseen ASRU-2019 test set.

[AI-94] Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba

链接: https://arxiv.org/abs/2408.10517
作者: Wall Kim
关键词-EN: Return-Conditioned Transformer Decision, RCTDM required alternative, potential to enhance, offline reinforcement learning, RCTDM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Return-Conditioned Transformer Decision Models (RCTDM) have demonstrated the potential to enhance transformer performance in offline reinforcement learning by replacing rewards in the input sequence with returns-to-go. However, to achieve the goal of learning an optimal policy from offline datasets composed of limited suboptimal trajectories, RCTDM required alternative methods. One prominent approach, trajectory stitching, was designed to enable the network to combine multiple trajectories to find the optimal path. To implement this using only transformers without auxiliary networks, it was necessary to shorten the input sequence length to better capture the Markov property in reinforcement learning. This, however, introduced a trade-off, as it reduced the accuracy of action inference. Our study introduces a model named Decision MetaMamba (DMM) to resolve these challenges. DMM employs an input token mixer to extract patterns from short sequences and uses a State Space Model (SSM) to selectively combine information from relatively distant sequences. Inspired by Metaformer, this structure was developed by transforming Mamba’s input layer into various multi-modal layers. Fortunately, with the advent of Mamba, implemented using parallel selective scanning, we achieved a high-performance sequence model capable of replacing transformers. Based on these innovations, DMM demonstrated excellent performance across various datasets in offline RL, confirming that models using SSM can improve performance by domain-specific alterations of the input layer. Additionally, it maintained its performance even in lightweight models with fewer parameters. These results suggest that decision models based on SSM can pave the way for improved outcomes in future developments.
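As an illustrative stand-in for the input token mixer (the paper's actual layers are Mamba/Metaformer variants, not reproduced here), a causal 1-D convolution that blends each token with its recent past shows the pattern-from-short-sequences idea in its simplest form:

```python
import numpy as np

def causal_conv_mixer(tokens, kernel):
    """Mix each token with its recent past via a causal 1-D convolution.

    tokens: (T, D) sequence of token embeddings.
    kernel: (K,) weights shared across all D channels; kernel[k] weights
    the token k steps in the past. Causality means no future leakage."""
    T, _ = tokens.shape
    out = np.zeros_like(tokens, dtype=float)
    for t in range(T):
        for k in range(len(kernel)):
            if t - k >= 0:
                out[t] += kernel[k] * tokens[t - k]
    return out
```

This local mixing handles short-range patterns cheaply, leaving the SSM (in the real model) to carry information across more distant positions.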

[AI-95] Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups SIGDIAL2024

链接: https://arxiv.org/abs/2408.10516
作者: Zhiyang Qi,Michimasa Inaba
关键词-EN: distinct conversational behaviors, exhibit distinct conversational, interaction challenges encountered, conversational behaviors, study addresses
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to SIGDIAL 2024

点击查看摘要

Abstract:This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources. Our approach leverages a large language model (LLM) to extract speaker styles and a pre-trained language model (PLM) to simulate dialogue act history. This method generates enriched and personalized dialogue data, facilitating improved interactions with unique user demographics. Extensive experiments validate the efficacy of our methodology, highlighting its potential to foster the development of more adaptive and inclusive dialogue systems.

[AI-96] Approximate Estimation of High-dimension Execution Skill for Dynamic Agents in Continuous Domains

链接: https://arxiv.org/abs/2408.10512
作者: Delma Nieves-Rivera,Christopher Archibald
关键词-EN: continuous action domains, real-world continuous action, error, human, execution error
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In many real-world continuous action domains, human agents must decide which actions to attempt and then execute those actions to the best of their ability. However, humans cannot execute actions without error. Human performance in these domains can potentially be improved by the use of AI to aid in decision-making. One requirement for an AI to correctly reason about what actions a human agent should attempt is a correct model of that human’s execution error, or skill. Recent work has demonstrated successful techniques for estimating this execution error with various types of agents across different domains. However, this previous work made several assumptions that limit the application of these ideas to real-world settings. First, previous work assumed that the error distributions were symmetric normal, which meant that only a single parameter had to be estimated. In reality, agent error distributions might exhibit arbitrary shapes and should be modeled more flexibly. Second, it was assumed that the execution error of the agent remained constant across all observations. Especially for human agents, execution error changes over time, and this must be taken into account to obtain effective estimates. To overcome both of these shortcomings, we propose a novel particle-filter-based estimator for this problem. After describing the details of this approximate estimator, we experimentally explore various design decisions and compare performance with previous skill estimators in a variety of settings to showcase the improvements. The outcome is an estimator capable of generating more realistic, time-varying execution skill estimates of agents, which can then be used to assist agents in making better decisions and improve their overall performance.
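The estimator's details are not in the abstract; as a hedged sketch, the weight-update / resample / jitter loop of a bootstrap particle filter tracking a scalar execution-noise std looks like the code below. For simplicity this sketch keeps a symmetric Gaussian noise model and a scalar skill parameter, exactly the simplifications the paper moves beyond; particle counts, priors, and the drift scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def pf_skill_estimate(errors, n_particles=500, drift=0.02):
    """Track a scalar execution-noise std ('skill') from observed miss
    distances with a bootstrap particle filter. The jitter (drift) step
    lets the estimate follow skill that changes over time."""
    particles = rng.uniform(0.05, 2.0, n_particles)     # candidate stds
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for e in errors:
        # weight each particle by the likelihood of the observed error
        weights *= np.exp(-0.5 * (e / particles) ** 2) / particles
        weights /= weights.sum()
        estimates.append(float(np.sum(weights * particles)))
        # resample, then jitter so the filter can track drifting skill
        idx = rng.choice(n_particles, n_particles, p=weights)
        particles = np.abs(particles[idx] + drift * rng.standard_normal(n_particles))
        weights[:] = 1.0 / n_particles
    return estimates

# toy data: a simulated agent whose true execution noise has std 0.5
estimates = pf_skill_estimate(rng.normal(0.0, 0.5, 300))
```

Replacing the Gaussian likelihood with an arbitrary error density, and widening the particle state, is what lets the full approach handle asymmetric, time-varying skill.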

[AI-97] Single-cell Curriculum Learning-based Deep Graph Embedding Clustering

链接: https://arxiv.org/abs/2408.10511
作者: Huifa Li,Jie Fu,Xinpeng Ling,Zhiyu Sun,Kuncan Wang,Zhili Chen
关键词-EN: single-cell RNA sequencing, cellular-level tissue heterogeneity, RNA sequencing, technologies enables, tissue heterogeneity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, the analysis of scRNA-seq data for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-decoder (ChebAE) that combines three optimization objectives corresponding to three decoders, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train GNN based on the features and entropy of nodes and prune the difficult nodes based on the difficulty scores to keep the high-quality graph. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods.
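The ZINB reconstruction loss is the standard choice for scRNA-seq counts because it models both overdispersion and dropout zeros. A minimal scalar version of its negative log-likelihood (parameter names μ, θ, π follow the usual convention; the paper's vectorized, network-predicted version is not reproduced here):

```python
import math

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Negative log-likelihood of a zero-inflated negative binomial:
    with probability pi the count is a structural zero (dropout),
    otherwise it follows NB(mu, theta). Scalar version for clarity."""
    log_nb = (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
              + theta * math.log(theta / (theta + mu))
              + x * math.log(mu / (theta + mu) + eps))
    if x == 0:
        # a zero can come from dropout or from the NB itself
        return -math.log(pi + (1.0 - pi) * math.exp(log_nb) + eps)
    return -(math.log(1.0 - pi + eps) + log_nb)
```

With π = 0 this reduces to a plain negative binomial; a positive π lowers the cost of observed zeros while slightly penalizing nonzero counts, which is how the model absorbs dropout events.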

[AI-98] QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

链接: https://arxiv.org/abs/2408.10504
作者: Yilun Kong,Hangyu Mao,Qi Zhao,Bin Zhang,Jingqing Ruan,Li Shen,Yongzhe Chang,Xueqian Wang,Rui Zhao,Dacheng Tao
关键词-EN: demonstrated remarkable success, Query-dependent Prompt Optimization, engineering has demonstrated, demonstrated remarkable, remarkable success
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods only focus on the task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performances. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization (QPO), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expenses of online interactions. Furthermore, we continuously augment the offline dataset with the generated prompts in each loop, as the prompts from the fine-tuned model are supposed to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.

[AI-99] Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers KDD2024 ECML

链接: https://arxiv.org/abs/2408.10503
作者: Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
关键词-EN: Assessing the forensic, hand images involves, unique features, hand, hand images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the ECML PKDD 2024 (Research Track)

点击查看摘要

Abstract:Assessing the forensic value of hand images involves the use of unique features and patterns present in an individual’s hand. The human hand has distinct characteristics, such as the pattern of veins, fingerprints, and the geometry of the hand itself. This paper investigates the use of vision transformers (ViTs) for classification of hand images. We use explainability tools to explore the internal representations of ViTs and assess their impact on the model outputs. Utilizing the internal understanding of ViTs, we introduce distillation methods that allow a student model to adaptively extract knowledge from a teacher model while learning on data of a different domain to prevent catastrophic forgetting. Two publicly available hand image datasets are used to conduct a series of experiments to evaluate performance of the ViTs and our proposed adaptive distillation methods. The experimental results demonstrate that ViT models significantly outperform traditional machine learning methods and the internal states of ViTs are useful for explaining the model outputs in the classification task. By averting catastrophic forgetting, our distillation methods achieve excellent performance on data from both source and target domains, particularly when these two domains exhibit significant dissimilarity. The proposed approaches therefore can be developed and implemented effectively for real-world applications such as access control, identity verification, and authentication systems.
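The adaptive distillation procedure itself is not detailed in the abstract; its backbone, however, is Hinton-style soft-target distillation, in which the student matches the teacher's tempered output distribution while also fitting the hard labels. A minimal sketch (temperature, α, and the logit values are illustrative assumptions):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Soft-target distillation: blend KL divergence to the teacher's
    tempered distribution with cross-entropy on the hard label."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))) * T * T
    ce = -float(np.log(softmax(student_logits)[label] + 1e-12))
    return alpha * kd + (1 - alpha) * ce
```

Making such a loss *adaptive*, as the paper proposes, means modulating how much of the teacher's knowledge is transferred per sample so that learning the target domain does not catastrophically overwrite the source domain.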

[AI-100] ProgramAlly: Creating Custom Visual Access Programs via Multi-Modal End-User Programming

链接: https://arxiv.org/abs/2408.10499
作者: Jaylin Herskovitz,Andi Xu,Rahaf Alharbi,Anhong Guo
关键词-EN: visual access programs, visual assistive technologies, DIY assistive technology, common use cases, technologies are built
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: UIST 2024

点击查看摘要

Abstract:Existing visual assistive technologies are built for simple and common use cases, and have few avenues for blind people to customize their functionalities. Drawing from prior work on DIY assistive technology, this paper investigates end-user programming as a means for users to create and customize visual access programs to meet their unique needs. We introduce ProgramAlly, a system for creating custom filters for visual information, e.g., ‘find NUMBER on BUS’, leveraging three end-user programming approaches: block programming, natural language, and programming by example. To implement ProgramAlly, we designed a representation of visual filtering tasks based on scenarios encountered by blind people, and integrated a set of on-device and cloud models for generating and running these programs. In user studies with 12 blind adults, we found that participants preferred different programming modalities depending on the task, and envisioned using visual access programs to address unique accessibility challenges that are otherwise difficult with existing applications. Through ProgramAlly, we present an exploration of how blind end-users can create visual access programs to customize and control their experiences.

[AI-101] QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention

链接: https://arxiv.org/abs/2408.10497
作者: Yihang Wang,Xu Huang,Bowen Tian,Yixing Fan,Jiafeng Guo
关键词-EN: Generative LLM, achieved significant success, LLM have achieved, effectively adapt, adapt to vertical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative LLMs have achieved significant success in various industrial tasks and can effectively adapt to vertical domains and downstream tasks through ICL. However, with tasks becoming increasingly complex, the context length required by ICL is also getting longer, and two significant issues arise: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the “lost in the middle” problem. Recently, compressing prompts by removing tokens according to some metric obtained from a causal language model, such as llama-7b, has emerged as an effective approach to mitigate these issues. However, the metrics used by prior methods, such as self-information or PPL, do not fully align with the objective of distinguishing the most important tokens when conditioning on the query. In this work, we introduce information bottleneck theory to carefully examine the properties required by the metric. Inspired by this, we use cross-attention in an encoder-decoder architecture as a new metric. Our simple method leads to significantly better performance in smaller models with lower latency. We evaluate our method on four datasets: DROP, CoQA, SQuAD, and Quoref. The experimental results show that, while maintaining the same performance, our compression rate can improve by nearly 25% over the previous SOTA. Remarkably, in experiments where 25% of the tokens are removed, our model’s EM score for answers sometimes even exceeds that of the control group using uncompressed text as context.
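A hedged sketch of the scoring step: given cross-attention weights from query-side (decoder) positions over context tokens, average them per token and keep the top fraction in their original order. The averaging scheme and keep ratio are illustrative assumptions; a real implementation would also aggregate over attention heads and layers.

```python
import numpy as np

def compress_by_cross_attention(context_tokens, attn, keep_ratio=0.75):
    """Rank context tokens by the cross-attention mass they receive from
    the query side and keep the top fraction, preserving original order.

    attn: (num_query_positions, num_context_tokens) attention weights."""
    scores = np.asarray(attn).mean(axis=0)          # per-token importance
    k = max(1, int(len(context_tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[::-1][:k])    # top-k, original order
    return [context_tokens[i] for i in keep]
```

Unlike self-information or perplexity from a causal LM, this score is conditioned on the query, which is the alignment property the paper argues for via the information bottleneck.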

[AI-102] How Well Do Large Language Models Serve as End-to-End Secure Code Producers?

链接: https://arxiv.org/abs/2408.10495
作者: Jianian Gong,Nachuan Duan,Ziheng Tao,Zhaohui Gong,Yuan Yuan,Minlie Huang
关键词-EN: modern development practices, large language models, code, rapid advancement, revolutionized the landscape
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) such as GPT-4 has revolutionized the landscape of software engineering, positioning these models at the core of modern development practices. As we anticipate these models to evolve into the primary and trustworthy tools used in software development, ensuring the security of the code they produce becomes paramount. How well can LLMs serve as end-to-end secure code producers? This paper presents a systematic investigation into LLMs’ inherent potential to generate code with fewer vulnerabilities. Specifically, we studied GPT-3.5 and GPT-4’s capability to identify and repair vulnerabilities in the code generated by four popular LLMs including themselves (GPT-3.5, GPT-4, Code Llama, and CodeGeeX2). By manually or automatically reviewing 4,900 pieces of code, our study reveals that: (1) large language models lack awareness of scenario-relevant security risks, which leads to the generation of over 75% vulnerable code on the SecurityEval benchmark; (2) LLMs such as GPT-3.5 and GPT-4 are unable to precisely identify vulnerabilities in the code they generated; (3) GPT-3.5 and GPT-4 can achieve 33.2%~59.6% success rates in repairing the insecure code produced by the 4 LLMs, but they both perform poorly when repairing self-produced code, indicating self-repair “blind spots”. To address the limitation of a single round of repair, we developed a lightweight tool that prompts LLMs to construct safer source code through an iterative repair procedure based on the insights gained from our study. Experiments show that assisted by semantic analysis engines, our tool significantly improves the success rates of repair to 65.9%~85.5%.

[AI-103] Is the Lecture Engaging for Learning? Lecture Voice Sentiment Analysis for Knowledge Graph-Supported Intelligent Lecturing Assistant (ILA) System

链接: https://arxiv.org/abs/2408.10492
作者: Yuan An,Samarth Kolanupaka,Jacob An,Matthew Ma,Unnat Chhatwal,Alex Kalinowski,Michelle Rogers,Brian Smith
关键词-EN: intelligent lecturing assistant, optimal pedagogical strategies, lecturing assistant, paper introduces, introduces an intelligent
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This paper introduces an intelligent lecturing assistant (ILA) system that utilizes a knowledge graph to represent course content and optimal pedagogical strategies. The system is designed to support instructors in enhancing student learning through real-time analysis of voice, content, and teaching methods. As an initial investigation, we present a case study on lecture voice sentiment analysis, in which we developed a training set comprising over 3,000 one-minute lecture voice clips. Each clip was manually labeled as either engaging or non-engaging. Utilizing this dataset, we constructed and evaluated several classification models based on a variety of features extracted from the voice clips. The results demonstrate promising performance, achieving an F1-score of 90% for boring lectures on an independent set of over 800 test voice clips. This case study lays the groundwork for the development of a more sophisticated model that will integrate content analysis and pedagogical practices. Our ultimate goal is to aid instructors in teaching more engagingly and effectively by leveraging modern artificial intelligence techniques.

[AI-104] Achieving the Tightest Relaxation of Sigmoids for Formal Verification

链接: https://arxiv.org/abs/2408.10491
作者: Samuel Chevalier,Duncan Starkenburg,Krishnamurthy (Dj) Dvijotham
关键词-EN: Neural Networks, equivalent mathematical programs, sigmoid activation function, activation function, sigmoid activation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of formal verification, Neural Networks (NNs) are typically reformulated into equivalent mathematical programs which are optimized over. To overcome the inherent non-convexity of these reformulations, convex relaxations of nonlinear activation functions are typically utilized. Common relaxations (i.e., static linear cuts) of “S-shaped” activation functions, however, can be overly loose, slowing down the overall verification process. In this paper, we derive tuneable hyperplanes which upper and lower bound the sigmoid activation function. When tuned in the dual space, these affine bounds smoothly rotate around the nonlinear manifold of the sigmoid activation function. This approach, termed \alpha-sig, allows us to tractably incorporate the tightest possible, element-wise convex relaxation of the sigmoid activation function into a formal verification framework. We embed these relaxations inside of large verification tasks and compare their performance to LiRPA and \alpha-CROWN, a state-of-the-art verification duo.
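A minimal sketch of the element-wise linear-bounding idea: on [0, u] the sigmoid is concave, so any tangent line upper-bounds it and the chord between the interval endpoints lower-bounds it. The paper's \alpha-sig additionally tunes such hyperplanes in the dual space, which is not reproduced here:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tangent_upper(x0):
    """Tangent line at x0 >= 0; upper-bounds sigmoid on [0, inf) by concavity."""
    s = sigmoid(x0)
    slope = s * (1.0 - s)          # sigmoid'(x0) = s(1 - s)
    return lambda x: s + slope * (x - x0)

def chord_lower(lo, hi):
    """Chord between lo and hi (0 <= lo < hi); lower-bounds sigmoid there."""
    slope = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
    return lambda x: sigmoid(lo) + slope * (x - lo)
```

Varying the tangent point x0 gives the family of valid upper hyperplanes that the paper's dual tuning would rotate through; on an interval straddling zero the construction is more delicate, since the sigmoid changes curvature there.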

[AI-105] Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

链接: https://arxiv.org/abs/2408.10488
作者: Xiao Wang,Yao Rong,Fuling Wang,Jianing Li,Lin Zhu,Bo Jiang,Yaowei Wang
关键词-EN: Sign Language Translation, AI-assisted disability, Event stream sign, core task, field of AI-assisted
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注: First Large-scale and High-Definition Benchmark Dataset for Event-based Sign Language Translation

点击查看摘要

Abstract:Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well. Additionally, due to their sparsity in space, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model’s ability to integrate temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released on this https URL

[AI-106] MambaEVT: Event Stream based Visual Object Tracking using State Space Model

链接: https://arxiv.org/abs/2408.10487
作者: Xiao Wang,Chao wang,Shiao Wang,Xixi Wang,Zhicheng Zhao,Lin Zhu,Bo Jiang
关键词-EN: Event camera-based visual, low energy consumption, dense temporal resolution, unique imaging principle, recent years due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: In Peer Review

点击查看摘要

Abstract:Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released on this https URL

[AI-107] Evaluation Framework for AI-driven Molecular Design of Multi-target Drugs: Brain Diseases as a Case Study CEC

链接: https://arxiv.org/abs/2408.10482
作者: Arthur Cerveira,Frederico Kremer,Darling de Andrade Lourenço,Ulisses B Corrêa
关键词-EN: Artificial Intelligence, application of Artificial, therapeutic agents, widespread application, significantly influenced
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 figure, published in 2024 IEEE Congress on Evolutionary Computation (CEC)

点击查看摘要

Abstract:The widespread application of Artificial Intelligence (AI) techniques has significantly influenced the development of new therapeutic agents. These computational methods can be used to design and predict the properties of generated molecules. Multi-target Drug Discovery (MTDD) is an emerging paradigm for discovering drugs against complex disorders that do not respond well to more traditional target-specific treatments, such as central nervous system, immune system, and cardiovascular diseases. Still, there is yet to be an established benchmark suite for assessing the effectiveness of AI tools for designing multi-target compounds. Standardized benchmarks allow for comparing existing techniques and promote rapid research progress. Hence, this work proposes an evaluation framework for molecule generation techniques in MTDD scenarios, considering brain diseases as a case study. Our methodology involves using large language models to select the appropriate molecular targets, gathering and preprocessing the bioassay datasets, training quantitative structure-activity relationship models to predict target modulation, and assessing other essential drug-likeness properties for implementing the benchmarks. Additionally, this work will assess the performance of four deep generative models and evolutionary algorithms over our benchmark suite. In our findings, both evolutionary algorithms and generative models can achieve competitive results across the proposed benchmarks.

[AI-108] An End-to-End Reinforcement Learning Based Approach for Micro-View Order-Dispatching in Ride-Hailing

链接: https://arxiv.org/abs/2408.10479
作者: Xinlang Yue,Yiran Liu,Fangzhou Shi,Sihong Luo,Chen Zhong,Min Lu,Zhe Xu
关键词-EN: localized spatiotemporal context, influences ride-hailing service, Assigning orders, ride-hailing service experience, spatiotemporal context
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Assigning orders to drivers under localized spatiotemporal context (micro-view order-dispatching) is a major task in Didi, as it influences ride-hailing service experience. Existing industrial solutions mainly follow a two-stage pattern that incorporate heuristic or learning-based algorithms with naive combinatorial methods, tackling the uncertainty of both sides’ behaviors, including emerging timings, spatial relationships, and travel duration, etc. In this paper, we propose a one-stage end-to-end reinforcement learning based order-dispatching approach that solves behavior prediction and combinatorial optimization uniformly in a sequential decision-making manner. Specifically, we employ a two-layer Markov Decision Process framework to model this problem, and present Deep Double Scalable Network (D2SN), an encoder-decoder structure network to generate order-driver assignments directly and stop assignments accordingly. Besides, by leveraging contextual dynamics, our approach can adapt to the behavioral patterns for better performance. Extensive experiments on Didi’s real-world benchmarks justify that the proposed approach significantly outperforms competitive baselines in optimizing matching efficiency and user experience tasks. In addition, we evaluate the deployment outline and discuss the gains and experiences obtained during the deployment tests from the view of large-scale engineering implementation.

[AI-109] LeCov: Multi-level Testing Criteria for Large Language Models

链接: https://arxiv.org/abs/2408.10474
作者: Xuan Xie,Jiayang Song,Yuheng Huang,Da Song,Fuyuan Zhang,Felix Juefei-Xu,Lei Ma
关键词-EN: Large Language Models, Large Language, truthfulness and toxicity, Language Models, limited interpretability
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are in various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover untrustworthy issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and contain nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experiment evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
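The abstract does not spell out LeCov's nine criteria, so as a hedged illustration of the general flavor of structural test criteria for neural models, here is classic neuron coverage: the fraction of neurons activated above a threshold by at least one input in the test suite (an analogy, not LeCov's definition):

```python
def neuron_coverage(activations, threshold=0.5):
    """Fraction of neurons activated (> threshold) by at least one test input.

    `activations`: list of per-input activation vectors, all the same length.
    A classic structural criterion used here only to illustrate the idea of
    coverage-guided testing for neural models.
    """
    if not activations:
        return 0.0
    n = len(activations[0])
    covered = set()
    for vec in activations:
        for i, a in enumerate(vec):
            if a > threshold:
                covered.add(i)
    return len(covered) / n
```

A coverage-guided loop would then prioritize or generate test inputs that push this fraction higher, analogous to how LeCov's criteria are applied to test prioritization.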

[AI-110] IDEA: Enhancing the rule learning ability of language agent through Induction DEduction and Abduction

链接: https://arxiv.org/abs/2408.10455
作者: Kaiyu He,Zhiyu Chen
关键词-EN: holistic rule learning, large language models, interactive environments remains, remains less explored, large language
类目: Artificial Intelligence (cs.AI)
*备注: 9pages, 12 figs, 4 tables

点击查看摘要

Abstract:While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in abductive reasoning and holistic rule learning in interactive environments remains less explored. This work introduces RULEARN, a novel benchmark specifically designed to assess the rule-learning ability of LLMs in interactive settings. In RULEARN, agents interact with the environment to gather observations and discern patterns, using these insights to solve problems. To further enhance the rule-learning capabilities of LLM agents within this benchmark, we propose IDEA agent, which integrates Induction, Deduction, and Abduction processes. IDEA agent refines this approach by leveraging a structured reasoning sequence: generating hypotheses through abduction, testing them via deduction, and refining them based on induction feedback. This sequence enables agents to dynamically establish and apply rules, mimicking human-like reasoning processes. Our evaluation of five representative LLMs indicates that while these models can generate plausible initial hypotheses, they often struggle with strategic interaction within the environment, effective incorporation of feedback, and adaptive refinement of their hypotheses. IDEA agent demonstrates significantly improved performance on the RULEARN benchmark, offering valuable insights for the development of agents capable of human-like rule-learning in real-world scenarios. We will release our code and data.

[AI-111] RUMI: Rummaging Using Mutual Information

链接: https://arxiv.org/abs/2408.10450
作者: Sheng Zhong,Nima Fazeli,Dmitry Berenson
关键词-EN: paper presents Rummaging, object pose distribution, Mutual Information, robot action sequences, visually-occluded environments
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 19 pages, 17 figures, submitted to IEEE Transactions on Robotics (T-RO)

点击查看摘要

Abstract:This paper presents Rummaging Using Mutual Information (RUMI), a method for online generation of robot action sequences to gather information about the pose of a known movable object in visually-occluded environments. Focusing on contact-rich rummaging, our approach leverages mutual information between the object pose distribution and robot trajectory for action planning. From an observed partial point cloud, RUMI deduces the compatible object pose distribution and approximates the mutual information of it with workspace occupancy in real time. Based on this, we develop an information gain cost function and a reachability cost function to keep the object within the robot’s reach. These are integrated into a model predictive control (MPC) framework with a stochastic dynamics model, updating the pose distribution in a closed loop. Key contributions include a new belief framework for object pose estimation, an efficient information gain computation strategy, and a robust MPC-based control scheme. RUMI demonstrates superior performance in both simulated and real tasks compared to baseline methods.
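A toy stand-in for the mutual-information objective (not RUMI's actual computation over workspace occupancy): the mutual information between a discrete pose belief and a binary contact observation, given an assumed observation model:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def info_gain(prior, like):
    """Mutual information I(pose; obs) for a binary observation.

    prior: p(pose) over discrete pose hypotheses.
    like:  p(obs = 1 | pose) for each hypothesis (assumed sensor model).
    I = H(obs) - sum_pose p(pose) * H(obs | pose).
    """
    p_obs = sum(pi * li for pi, li in zip(prior, like))
    h_obs = entropy([p_obs, 1.0 - p_obs])
    h_cond = sum(pi * entropy([li, 1.0 - li]) for pi, li in zip(prior, like))
    return h_obs - h_cond
```

An action whose predicted observation perfectly discriminates two equally likely poses yields one full bit of information, while an uninformative observation yields zero, which is the quantity an MPC planner like RUMI's would trade off against reachability cost.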

[AI-112] The Brittleness of AI-Generated Image Watermarking Techniques: Examining Their Robustness Against Visual Paraphrasing Attacks

链接: https://arxiv.org/abs/2408.10446
作者: Niyar R Barman,Krish Sharma,Ashhar Aziz,Shashwat Bajpai,Shwetangshu Biswas,Vasu Sharma,Vinija Jain,Aman Chadha,Amit Sheth,Amitava Das
关键词-EN: models like Stable, Stable Diffusion, visual paraphrase, exemplified by models, potential misuse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 23 pages and 10 figures

点击查看摘要

Abstract:The rapid advancement of text-to-image generation systems, exemplified by models like Stable Diffusion, Midjourney, Imagen, and DALL-E, has heightened concerns about their potential misuse. In response, companies like Meta and Google have intensified their efforts to implement watermarking techniques on AI-generated images to curb the circulation of potentially misleading visuals. However, in this paper, we argue that current image watermarking methods are fragile and susceptible to being circumvented through visual paraphrase attacks. The proposed visual paraphraser operates in two steps. First, it generates a caption for the given image using KOSMOS-2, one of the latest state-of-the-art image captioning systems. Second, it passes both the original image and the generated caption to an image-to-image diffusion system. During the denoising step of the diffusion pipeline, the system generates a visually similar image that is guided by the text caption. The resulting image is a visual paraphrase and is free of any watermarks. Our empirical findings demonstrate that visual paraphrase attacks can effectively remove watermarks from images. This paper provides a critical assessment, empirically revealing the vulnerability of existing watermarking techniques to visual paraphrase attacks. While we do not propose solutions to this issue, this paper serves as a call to action for the scientific community to prioritize the development of more robust watermarking techniques. Our first-of-its-kind visual paraphrase dataset and accompanying code are publicly available.

[AI-113] Feasibility of assessing cognitive impairment via distributed camera network and privacy-preserving edge computing

链接: https://arxiv.org/abs/2408.10442
作者: Chaitra Hegde,Yashar Kiarashi,Allan I Levey,Amy D Rodriguez,Hyeokhyen Kwon,Gari D Clifford
关键词-EN: Mild cognitive impairment, Mild cognitive, education-related expectations, functions beyond typical, typical age
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:INTRODUCTION: Mild cognitive impairment (MCI) is characterized by a decline in cognitive functions beyond typical age- and education-related expectations. Since MCI has been linked to reduced social interactions and increased aimless movements, we aimed to automate the capture of these behaviors to enhance longitudinal monitoring. METHODS: Using a privacy-preserving distributed camera network, we collected movement and social interaction data from groups of individuals with MCI undergoing therapy within a 1700 m^2 space. We developed movement and social interaction features, which were then used to train a series of machine learning algorithms to distinguish between higher and lower cognitive functioning MCI groups. RESULTS: A Wilcoxon rank-sum test revealed statistically significant differences between high- and low-functioning cohorts in features such as linear path length, walking speed, change in direction while walking, entropy of velocity and direction change, and number of group formations in the indoor space. Despite lacking individual identifiers to associate with specific levels of MCI, a machine learning approach using the most significant features provided a 71% accuracy. DISCUSSION: We provide evidence to show that a privacy-preserving, low-cost camera network using an edge computing framework has the potential to distinguish between different levels of cognitive impairment from the movements and social interactions captured during group activities. Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV). Cite as: arXiv:2408.10442 [cs.AI], https://doi.org/10.48550/arXiv.2408.10442. Submission history: from Chaitra Hegde, [v1] Mon, 19 Aug 2024 22:34:43 UTC (951 KB)
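As a hedged sketch of one of the movement features mentioned (entropy of direction change), the helper below bins heading changes along a 2-D trajectory and computes their Shannon entropy; the binning scheme is an assumption, not the paper's:

```python
import math

def heading_change_entropy(points, bins=8):
    """Shannon entropy (bits) of binned heading changes along a 2-D trajectory.

    A toy version of the direction-change feature described above: a straight
    walk gives zero entropy, erratic turning gives higher entropy. The number
    of bins is an illustrative choice.
    """
    headings = [math.atan2(y2 - y1, x2 - x1)
                for (x1, y1), (x2, y2) in zip(points, points[1:])]
    # Wrap each successive heading difference into (-pi, pi].
    changes = [(b - a + math.pi) % (2 * math.pi) - math.pi
               for a, b in zip(headings, headings[1:])]
    counts = [0] * bins
    for c in changes:
        idx = min(int((c + math.pi) / (2 * math.pi) * bins), bins - 1)
        counts[idx] += 1
    total = len(changes)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)
```

On tracked positions from a camera network, such a scalar per person (or per session) could feed the kind of classifier the study trains to separate higher- and lower-functioning groups.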

[AI-114] Understanding Generative AI Content with Embedding Models

链接: https://arxiv.org/abs/2408.10437
作者: Max Vargas,Reilly Cannon,Andrew Engel,Anand D. Sarwate,Tony Chiang
关键词-EN: high-quality numerical features, quantitative data analysis, construction of high-quality, high-quality numerical, numerical features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The construction of high-quality numerical features is critical to any quantitative data analysis. Feature engineering has been historically addressed by carefully hand-crafting data representations based on domain expertise. This work views the internal representations of modern deep neural networks (DNNs), called embeddings, as an automated form of traditional feature engineering. For trained DNNs, we show that these embeddings can reveal interpretable, high-level concepts in unstructured sample data. We use these embeddings in natural language and computer vision tasks to uncover both inherent heterogeneity in the underlying data and human-understandable explanations for it. In particular, we find empirical evidence that there is inherent separability between real data and that generated from AI models.
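A minimal sketch of the separability claim (not the authors' analysis): nearest-centroid accuracy by cosine similarity between two groups of embedding vectors, e.g. real vs. AI-generated samples:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def separability(real, generated):
    """Nearest-centroid accuracy: how often each embedding is closer (by
    cosine) to its own group's centroid than to the other's. 1.0 means the
    two groups are perfectly separable by this simple rule."""
    c_real, c_gen = centroid(real), centroid(generated)
    correct = sum(cosine(v, c_real) > cosine(v, c_gen) for v in real)
    correct += sum(cosine(v, c_gen) > cosine(v, c_real) for v in generated)
    return correct / (len(real) + len(generated))
```

With real DNN embeddings in place of the toy vectors, a score well above chance (0.5) would be consistent with the inherent separability the abstract reports.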

[AI-115] Are LLMs Any Good for High-Level Synthesis?

链接: https://arxiv.org/abs/2408.10428
作者: Yuchao Liao,Tosiron Adegbija,Roman Lysecky
关键词-EN: innovative High-Level Synthesis, necessitate innovative High-Level, High-Level Synthesis, Large Language Models, designs necessitate innovative
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: ICCAD '24 Special Session on AI4HLS: New Frontiers in High-Level Synthesis Augmented with Artificial Intelligence

点击查看摘要

Abstract:The increasing complexity and demand for faster, energy-efficient hardware designs necessitate innovative High-Level Synthesis (HLS) methodologies. This paper explores the potential of Large Language Models (LLMs) to streamline or replace the HLS process, leveraging their ability to understand natural language specifications and refactor code. We survey the current research and conduct experiments comparing Verilog designs generated by a standard HLS tool (Vitis HLS) with those produced by LLMs translating C code or natural language specifications. Our evaluation focuses on quantifying the impact on performance, power, and resource utilization, providing an assessment of the efficiency of LLM-based approaches. This study aims to illuminate the role of LLMs in HLS, identifying promising directions for optimized hardware design in applications such as AI acceleration, embedded systems, and high-performance computing.

[AI-116] Development of an AI Anti-Bullying System Using Large Language Model Key Topic Detection

链接: https://arxiv.org/abs/2408.10417
作者: Matthew Tassava,Cameron Kolodjski,Jordan Milbrath,Adorah Bishop,Nathan Flanders,Robbie Fetsch,Danielle Hanson,Jeremy Straub
关键词-EN: artificial intelligence, paper presents, presents and evaluates, evaluates work, anti-bullying system
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:This paper presents and evaluates work on the development of an artificial intelligence (AI) anti-bullying system. The system is designed to identify coordinated bullying attacks via social media and other mechanisms, characterize them and propose remediation and response activities to them. In particular, a large language model (LLM) is used to populate an enhanced expert system-based network model of a bullying attack. This facilitates analysis and remediation activity - such as generating report messages to social media companies - determination. The system is described and the efficacy of the LLM for populating the model is analyzed herein.

[AI-117] Towards Automation of Human Stage of Decay Identification: An Artificial Intelligence Approach

链接: https://arxiv.org/abs/2408.10414
作者: Anna-Maria Nau,Phillip Ditto,Dawnie Wolfe Steadman,Audris Mockus
关键词-EN: identifying human remains, Determining the stage, human decomposition, human decomposition images, human decomposition scoring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Determining the stage of decomposition (SOD) is crucial for estimating the postmortem interval and identifying human remains. Currently, labor-intensive manual scoring methods are used for this purpose, but they are subjective and do not scale for the emerging large-scale archival collections of human decomposition photos. This study explores the feasibility of automating two common human decomposition scoring methods proposed by Megyesi and Gelderman using artificial intelligence (AI). We evaluated two popular deep learning models, Inception V3 and Xception, by training them on a large dataset of human decomposition images to classify the SOD for different anatomical regions, including the head, torso, and limbs. Additionally, an interrater study was conducted to assess the reliability of the AI models compared to human forensic examiners for SOD identification. The Xception model achieved the best classification performance, with macro-averaged F1 scores of .878, .881, and .702 for the head, torso, and limbs when predicting Megyesi’s SODs, and .872, .875, and .76 for the head, torso, and limbs when predicting Gelderman’s SODs. The interrater study results supported AI’s ability to determine the SOD at a reliability level comparable to a human expert. This work demonstrates the potential of AI models trained on a large dataset of human decomposition images to automate SOD identification.
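Since the results above are reported as macro-averaged F1, a small reference implementation of that metric (the standard definition, not tied to the paper's code) may help readers interpret the scores:

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores,
    so rare classes count as much as frequent ones."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each SOD class contributes equally to the average, a macro F1 of .878 implies the model performs well even on the less common decomposition stages, not just the majority class.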

[AI-118] Webcam-based Pupil Diameter Prediction Benefits from Upscaling

链接: https://arxiv.org/abs/2408.10397
作者: Vijul Shah,Brian B. Moser,Ko Watanabe,Andreas Dengel
关键词-EN: Capturing pupil diameter, Capturing pupil, cognitive load, pupil diameter, essential for assessing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Capturing pupil diameter is essential for assessing psychological and physiological states such as stress levels and cognitive load. However, the low resolution of images in eye datasets often hampers precise measurement. This study evaluates the impact of various upscaling methods, ranging from bicubic interpolation to advanced super-resolution, on pupil diameter predictions. We compare several pre-trained methods, including CodeFormer, GFPGAN, Real-ESRGAN, HAT, and SRResNet. Our findings suggest that pupil diameter prediction models trained on upscaled datasets are highly sensitive to the selected upscaling method and scale. Our results demonstrate that upscaling methods consistently enhance the accuracy of pupil diameter prediction models, highlighting the importance of upscaling in pupilometry. Overall, our work provides valuable insights for selecting upscaling techniques, paving the way for more accurate assessments in psychological and physiological research.

[AI-119] Evaluating Image-Based Face and Eye Tracking with Event Cameras ECCV

链接: https://arxiv.org/abs/2408.10395
作者: Khadija Iddrisu,Waseem Shariff,Noel E. O'Connor,Joseph Lemley,Suzanne Little
关键词-EN: producing asynchronously generated, Neuromorphic sensors, generated data termed, asynchronously generated data, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted at The Workshop On Neuromorphic Vision: Advantages and Applications of Event Cameras at the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Event Cameras, also known as Neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed “events”. This distinct data format mitigates common issues observed in conventional cameras, like under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often necessitates the development of specialized, handcrafted event representations that can integrate seamlessly with conventional Convolutional Neural Networks (CNNs), considering the unique attributes of event data. In this study, we evaluate event-based face and eye tracking. The core objective of our study is to showcase the viability of integrating conventional algorithms with event-based data, transformed into a frame format while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen Dataset. We assess its utility for face and eye detection tasks through the application of GR-YOLO – a pioneering technique derived from YOLOv3. This evaluation includes a comparative analysis with results derived from training the dataset with YOLOv8. Subsequently, the trained models were tested on real event streams from various iterations of Prophesee’s event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset show good prediction performance across all the datasets used for validation, with a best mean Average Precision score of 0.91. Additionally, the trained models demonstrated robust performance on real event camera data under varying light conditions.

[AI-120] Joint Modeling of Search and Recommendations Via an Unified Contextual Recommender (UniCoRn)

链接: https://arxiv.org/abs/2408.10394
作者: Moumita Bhattacharya,Vito Ostuni,Sudarshan Lamkhede
关键词-EN: Search and recommendation, developed separately, leading to complex, technical debt, recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 3 pages, 1 figure

点击查看摘要

Abstract:Search and recommendation systems are essential in many services, and they are often developed separately, leading to complex maintenance and technical debt. In this paper, we present a unified deep learning model that efficiently handles key aspects of both tasks.

[AI-121] BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval

链接: https://arxiv.org/abs/2408.10383
作者: Zhenyu Lu,Lakshay Sethi
关键词-EN: pipeline models, models, pipeline, matching generally fall, pipeline models outperform
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Previous methods for audio-image matching generally fall into one of two categories: pipeline models or end-to-end models. Pipeline models first transcribe speech and then encode the resulting text; end-to-end models encode speech directly. Generally, pipeline models outperform end-to-end models, but the intermediate transcription necessarily discards some potentially useful non-textual information. In addition to textual information, speech can convey details such as accent, mood, and emphasis, which should be effectively captured in the encoded representation. In this paper, we investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance. We thoroughly analyze and compare end-to-end models, pipeline models, and our proposed dual-channel model for robust audio-image retrieval on a variety of datasets. Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism and a bifurcated design.

[AI-122] Boolean Matrix Logic Programming

链接: https://arxiv.org/abs/2408.10369
作者: Lun Ai,Stephen H. Muggleton
关键词-EN: boolean matrix manipulation, composable boolean matrix, datalog query evaluation, boolean matrix, query evaluation approach
类目: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:We describe a datalog query evaluation approach based on efficient and composable boolean matrix manipulation modules. We first define an overarching problem, Boolean Matrix Logic Programming (BMLP), which uses boolean matrices as an alternative computation to evaluate datalog programs. We develop two novel BMLP modules for bottom-up inferences on linear dyadic recursive datalog programs, and show how additional modules can extend this capability to compute both linear and non-linear recursive datalog programs of arity two. Our empirical results demonstrate that these modules outperform general-purpose and specialised systems by factors of 30x and 9x, respectively, when evaluating large programs with millions of facts. This boolean matrix approach significantly enhances the efficiency of datalog querying to support logic programming techniques.
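The bottom-up inference the abstract describes can be illustrated, in its simplest form, as a boolean-matrix fixpoint computation of transitive closure for a linear recursive datalog program. The sketch below is a minimal reconstruction of the idea, not the paper's actual BMLP modules:

```python
import numpy as np

def transitive_closure(adj: np.ndarray) -> np.ndarray:
    """Bottom-up fixpoint for path(X,Y) :- e(X,Y). path(X,Y) :- path(X,Z), e(Z,Y).

    `adj` is the boolean adjacency matrix of the binary relation e; each
    iteration derives new facts via boolean matrix multiplication until
    no new fact appears (the least fixpoint).
    """
    reach = adj.copy()
    while True:
        step = (reach.astype(int) @ adj.astype(int)) > 0  # compose path with e
        new = reach | step
        if np.array_equal(new, reach):
            return reach
        reach = new

# Edge relation over three nodes: 0 -> 1 -> 2
e = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=bool)
closure = transitive_closure(e)
print(closure.astype(int))  # path(0,2) is derived; path(2,0) is not
```

Swapping the dense matrices for sparse ones is what would make this style of evaluation viable on programs with millions of facts.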

[AI-123] AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews

链接: https://arxiv.org/abs/2408.10365
作者: Keith Tyser,Ben Segev,Gaston Longhitano,Xin-Yu Zhang,Zachary Meeks,Jason Lee,Uday Garg,Nicholas Belsten,Avi Shporer,Madeleine Udell,Dov Te’eni,Iddo Drori
关键词-EN: LLM, automatic paper reviews, predict human preferences, human preferences, handle a large
类目: Artificial Intelligence (cs.AI)
*备注: 42 pages

点击查看摘要

Abstract:Automatic reviewing helps handle a large volume of papers, provides early feedback and quality control, reduces bias, and allows the analysis of trends. We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons. Gathering human preference may be time-consuming; therefore, we also use an LLM to automatically evaluate reviews to increase sample efficiency while reducing bias. In addition to evaluating human and LLM preferences among LLM reviews, we fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs. We artificially introduce errors into papers and analyze the LLM's responses to identify limitations; we use adaptive review questions, meta prompting, role-playing, integrated visual and textual analysis, venue-specific reviewing materials, and human-preference prediction to improve upon the limitations of traditional review processes. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality. This work develops proof-of-concept LLM reviewing systems that quickly deliver consistent, high-quality reviews and evaluate their quality. We mitigate the risks of misuse, inflated review scores, overconfident ratings, and skewed score distributions by augmenting the LLM with multiple documents, including the review form, reviewer guide, code of ethics and conduct, area chair guidelines, and previous year statistics, by finding which errors and shortcomings of the paper may be detected by automated reviews, and evaluating pairwise reviewer preferences. This work identifies and addresses the limitations of using LLMs as reviewers and evaluators and enhances the quality of the reviewing process.

[AI-124] Query languages for neural networks ICDT2025

链接: https://arxiv.org/abs/2408.10362
作者: Martin Grohe,Christoph Standke,Juno Steegmans,Jan Van den Bussche
关键词-EN: understanding neural network, neural network models, neural network, lay the foundations, interpreting and understanding
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Logic in Computer Science (cs.LO)
*备注: To appear at ICDT 2025

点击查看摘要

Abstract:We lay the foundations for a database-inspired approach to interpreting and understanding neural network models by querying them using declarative languages. Towards this end we study different query languages, based on first-order logic, that mainly differ in their access to the neural network model. First-order logic over the reals naturally yields a language which views the network as a black box; only the input–output function defined by the network can be queried. This is essentially the approach of constraint query languages. On the other hand, a white-box language can be obtained by viewing the network as a weighted graph, and extending first-order logic with summation over weight terms. The latter approach is essentially an abstraction of SQL. In general, the two approaches are incomparable in expressive power, as we will show. Under natural circumstances, however, the white-box approach can subsume the black-box approach; this is our main result. We prove the result concretely for linear constraint queries over real functions definable by feedforward neural networks with a fixed number of hidden layers and piecewise linear activation functions.
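The black-box versus white-box contrast can be made concrete on a toy network: a black-box query may only evaluate the input-output function, while a white-box query may aggregate over the network viewed as a weighted graph, in the spirit of SQL summation. The network and queries below are hypothetical illustrations, not examples from the paper:

```python
import numpy as np

# A tiny fixed feedforward net with one hidden ReLU layer.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]]); b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]]);              b2 = np.zeros(1)

def net(x: np.ndarray) -> np.ndarray:
    """Black-box view: only the input-output function is observable."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Black-box query: "is the output non-negative on these sample inputs?"
samples = [np.array([1.0, 2.0]), np.array([-3.0, 0.5])]
nonneg = all(net(x)[0] >= 0.0 for x in samples)

# White-box query: an SQL-style aggregate over the weight terms.
total_weight = W1.sum() + b1.sum() + W2.sum() + b2.sum()
print(nonneg, total_weight)
```

Here, because W2 is nonnegative and the ReLU outputs are nonnegative, the black-box property holds for every input, and a white-box query inspecting the signs of W2 could verify it directly, a small instance of the white-box view subsuming the black-box one.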

[AI-125] HaSPeR: An Image Repository for Hand Shadow Puppet Recognition

链接: https://arxiv.org/abs/2408.10360
作者: Syed Rifat Raiyan,Zibran Zarif Amio,Sabbir Ahmed
关键词-EN: Hand shadow puppetry, Hand shadow, living creatures, hand shadow puppets, hand shadow puppeteer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted to IEEE Transactions on Artificial Intelligence (IEEE TAI), 11 pages, 78 figures, 2 tables

点击查看摘要

Abstract:Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of theatrical art and storytelling where hand shadows are projected onto flat surfaces to create illusions of living creatures. The skilled performers create these silhouettes by hand positioning, finger movements, and dexterous gestures to resemble shadows of animals and objects. Due to the lack of practitioners and a seismic shift in people's entertainment standards, this art form is on the verge of extinction. To facilitate its preservation and proliferate it to a wider audience, we introduce HaSPeR, a novel dataset consisting of 8,340 images of hand shadow puppets across 11 classes extracted from both professional and amateur hand shadow puppeteer clips. We provide a detailed statistical analysis of the dataset and employ a range of pretrained image classification models to establish baselines. Our findings show a substantial performance superiority of traditional convolutional models over attention-based transformer architectures. We also find that lightweight models, such as MobileNetV2, suited for mobile applications and embedded devices, perform comparatively well. We surmise that such low-latency architectures can be useful in developing ombromanie teaching tools, and we create a prototype application to explore this surmise. Keeping the best-performing model InceptionV3 under the limelight, we conduct comprehensive feature-spatial, explainability, and error analyses to gain insights into its decision-making process. To the best of our knowledge, this is the first documented dataset and research endeavor to preserve this dying art for future generations using computer vision approaches. Our code and data are publicly available.

[AI-126] The Psychological Impacts of Algorithmic and AI-Driven Social Media on Teenagers: A Call to Action

链接: https://arxiv.org/abs/2408.10351
作者: Sunil Arora,Sahil Arora,John D. Hastings
关键词-EN: meta-issues surrounding social, enhance social interactions, adverse psychological impacts, surrounding social media, social media platforms
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 7 pages, 0 figures, 2 tables, 2024 IEEE Conference on Digital Platforms and Societal Harms

点击查看摘要

Abstract:This study investigates the meta-issues surrounding social media, which, while theoretically designed to enhance social interactions and improve our social lives by facilitating the sharing of personal experiences and life events, often results in adverse psychological impacts. Our investigation reveals a paradoxical outcome: rather than fostering closer relationships and improving social lives, the algorithms and structures that underlie social media platforms inadvertently contribute to a profound psychological impact on individuals, influencing them in unforeseen ways. This phenomenon is particularly pronounced among teenagers, who are disproportionately affected by curated online personas, peer pressure to present a perfect digital image, and the constant bombardment of notifications and updates that characterize their social media experience. As such, we issue a call to action for policymakers, platform developers, and educators to prioritize the well-being of teenagers in the digital age and work towards creating secure and safe social media platforms that protect the young from harm, online harassment, and exploitation.

[AI-127] LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

链接: https://arxiv.org/abs/2408.10343
作者: Nicholas Pipitone,Ghita Houir Alami
关键词-EN: showing promising potential, Retrieval-Augmented Generation, AI-powered legal applications, Large Language Models, RAG systems
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at this https URL.
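The focus on minimal, highly relevant snippets suggests character-span-level scoring against human annotations. As a hedged illustration (the benchmark's exact metrics are not stated in the abstract), precision and recall over character spans, assuming non-overlapping spans within each set:

```python
def span_overlap(a: tuple, b: tuple) -> int:
    """Characters shared by two half-open (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def snippet_precision_recall(retrieved, gold):
    """Character-level precision/recall of retrieved spans against gold spans."""
    hit = sum(span_overlap(r, g) for r in retrieved for g in gold)
    ret_chars = sum(end - start for start, end in retrieved)
    gold_chars = sum(end - start for start, end in gold)
    precision = hit / ret_chars if ret_chars else 0.0
    recall = hit / gold_chars if gold_chars else 0.0
    return precision, recall

# A 100-character retrieved chunk containing a 30-character gold snippet:
p, r = snippet_precision_recall(retrieved=[(0, 100)], gold=[(50, 80)])
print(p, r)  # perfect recall, but low precision from the imprecise chunk
```

This is exactly the failure mode the benchmark penalizes: a large chunk can achieve full recall while most of its characters are irrelevant.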

[AI-128] A Disguised Wolf Is More Harmful Than a Toothless Tiger: Adaptive Malicious Code Injection Backdoor Attack Leveraging User Behavior as Triggers

链接: https://arxiv.org/abs/2408.10334
作者: Shangxi Wu,Jitao Sang
关键词-EN: made significant progress, code generation, large language models, code generation models, code
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have made significant progress in the field of code generation. However, as more and more users rely on these models for software development, the security risks associated with code generation models have become increasingly significant. Studies have shown that traditional deep learning robustness issues also negatively impact the field of code generation. In this paper, we first present a game-theoretic model that focuses on security issues in code generation scenarios. This framework outlines possible scenarios and patterns where attackers could spread malicious code models to create security threats. We also point out, for the first time, that attackers can use backdoor attacks to dynamically adjust the timing of malicious code injection, releasing varying degrees of malicious code depending on the skill level of the user. Through extensive experiments on leading code generation models, we validate our proposed game-theoretic model and highlight the significant threats that these new attack scenarios pose to the safe use of code models.

[AI-129] Decoding Human Emotions: Analyzing Multi-Channel EEG Data using LSTM Networks

链接: https://arxiv.org/abs/2408.10328
作者: Shyam K Sateesh,Sparsh BK,Uma D
关键词-EN: Human-Computer Interaction, Long Short-Term Memory, analyze EEG signals, EEG signal data, thriving field
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 13 pages, 3 figures; accepted at ICDSA '24 Conference, Jaipur, India

点击查看摘要

Abstract:Emotion recognition from electroencephalogram (EEG) signals is a thriving field, particularly in neuroscience and Human-Computer Interaction (HCI). This study aims to understand and improve the predictive accuracy of emotional state classification through metrics such as valence, arousal, dominance, and likeness by applying a Long Short-Term Memory (LSTM) network to analyze EEG signals. Using a popular dataset of multi-channel EEG recordings known as DEAP, we look towards leveraging LSTM networks’ properties to handle temporal dependencies within EEG signal data. This allows for a more comprehensive understanding and classification of emotional parameter states. We obtain accuracies of 89.89%, 90.33%, 90.70%, and 90.54% for arousal, valence, dominance, and likeness, respectively, demonstrating significant improvements in emotion recognition model capabilities. This paper elucidates the methodology and architectural specifics of our LSTM model and provides a benchmark analysis with existing papers.
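For readers new to the recurrence that lets an LSTM model temporal dependencies in EEG windows, a single-cell forward pass can be sketched in plain NumPy. The channel count, hidden size, and window length below are hypothetical, not the paper's configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate pre-activations stacked as [input, forget, output, cand]."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g        # cell state carries long-range information
    h = o * np.tanh(c)       # hidden state summarizes the window so far
    return h, c

rng = np.random.default_rng(0)
n_channels, hidden = 32, 8   # hypothetical: 32 EEG channels, 8 hidden units
W = rng.normal(0.0, 0.1, (4 * hidden, n_channels))
U = rng.normal(0.0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = c = np.zeros(hidden)
for _ in range(128):         # one window of 128 EEG samples
    h, c = lstm_step(rng.normal(size=n_channels), h, c, W, U, b)
# `h` would then feed a classifier over valence/arousal/dominance/likeness.
```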

[AI-130] Leveraging Superfluous Information in Contrastive Representation Learning

链接: https://arxiv.org/abs/2408.10292
作者: Xuechu Yu
关键词-EN: learn the shared information, downstream tasks, aims to learn the, learn the shared, shown its powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Contrastive representation learning, which aims to learn the shared information between different views of unlabeled data by maximizing the mutual information between them, has shown powerful competence in self-supervised learning for downstream tasks. However, recent works have demonstrated that higher estimated mutual information does not guarantee better performance across downstream tasks. Such works inspire us to conjecture that the learned representations not only maintain task-relevant information from unlabeled data but also carry task-irrelevant information which is superfluous for downstream tasks, thus leading to performance degeneration. In this paper we show that superfluous information does exist under the conventional contrastive learning framework, and further design a new objective, namely SuperInfo, to learn robust representations by a linear combination of both predictive and superfluous information. Besides, we notice that it is feasible to tune the coefficients of the introduced losses to discard task-irrelevant information while keeping partial non-shared task-relevant information according to our SuperInfo loss. We demonstrate that learning with our loss can often outperform the traditional contrastive learning approaches on image classification, object detection and instance segmentation tasks with significant improvements.
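The abstract does not spell out the SuperInfo objective itself; one hedged reading is a tunable linear combination of InfoNCE-style mutual-information terms. The sketch below shows a standard InfoNCE estimator and a hypothetical weighted combination, purely for illustration:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over a batch of paired views (a lower bound on mutual information)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 32)))  # two views of same data
shuffled = info_nce(z, rng.normal(size=(8, 32)))            # unrelated "views"

# Hypothetical SuperInfo-style combination with tunable coefficients:
alpha, beta = 1.0, 0.1
total = alpha * aligned + beta * shuffled
print(aligned < shuffled)  # aligned views yield the smaller contrastive loss
```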

[AI-131] GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching

链接: https://arxiv.org/abs/2408.10286
作者: Xiao Han,Zijian Zhang,Xiangyu Zhao,Guojiang Shen,Xiangjie Kong,Xuetao Wei,Liqiang Nie,Jieping Ye
关键词-EN: online ride-hailing services, residents demand higher, urban residents demand, demand higher travel, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ride-hailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficulties for passengers in certain areas, while many drivers in other areas are unable to secure orders, leading to a decline in the overall quality of urban transportation services. To address these issues, this paper introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs to capture hierarchical traffic states, and learns a dynamic reward function that accounts for individual driving behaviors. The framework further integrates a GPT model trained with a custom loss function to enable high-precision predictions and optimize dispatching policies in real-world scenarios. Experiments conducted on two real-world datasets demonstrate that GARLIC effectively aligns with driver behaviors while reducing the empty load rate of vehicles.

[AI-132] BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

链接: https://arxiv.org/abs/2408.10285
作者: Yifei Yang,Runhan Shi,Zuchao Li,Shu Jiang,Bao-Liang Lu,Yang Yang,Hai Zhao
关键词-EN: organic chemistry, pivotal yet challenging, discovery and organic, challenging in drug, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Retrosynthesis analysis is pivotal yet challenging in drug discovery and organic chemistry. Despite the proliferation of computational tools over the past decade, AI-based systems often fall short in generalizing across diverse reaction types and exploring alternative synthetic pathways. This paper presents BatGPT-Chem, a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Integrating chemical tasks via a unified framework of natural language and SMILES notation, this approach synthesizes extensive instructional data from an expansive chemical database. Employing both autoregressive and bidirectional training techniques across over one hundred million instances, BatGPT-Chem captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions and exhibiting strong zero-shot capabilities. Superior to existing AI methods, our model demonstrates significant advancements in generating effective strategies for complex molecules, as validated by stringent benchmark tests. BatGPT-Chem not only boosts the efficiency and creativity of retrosynthetic analysis but also establishes a new standard for computational tools in synthetic design. This development empowers chemists to adeptly address the synthesis of novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science. We release our trial platform at this https URL.

[AI-133] FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models EMNLP’24

链接: https://arxiv.org/abs/2408.10276
作者: Xiaochen Wang,Jiaqi Wang,Houping Xiao,Jinghui Chen,Fenglong Ma
关键词-EN: outperforming conventional artificial, conventional artificial intelligence, demonstrated remarkable capabilities, handling diverse modalities, outperforming conventional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to EMNLP’24

点击查看摘要

Abstract:Foundation models have demonstrated remarkable capabilities in handling diverse modalities and tasks, outperforming conventional artificial intelligence (AI) approaches that are highly task-specific and modality-reliant. In the medical domain, however, the development of comprehensive foundation models is constrained by limited access to diverse modalities and stringent privacy regulations. To address these constraints, this study introduces a novel knowledge injection approach, FedKIM, designed to scale the medical foundation model within a federated learning framework. FedKIM leverages lightweight local models to extract healthcare knowledge from private data and integrates this knowledge into a centralized foundation model using a designed adaptive Multitask Multimodal Mixture Of Experts (M3OE) module. This method not only preserves privacy but also enhances the model’s ability to handle complex medical tasks involving multiple modalities. Our extensive experiments across twelve tasks in seven modalities demonstrate the effectiveness of FedKIM in various settings, highlighting its potential to scale medical foundation models without direct access to sensitive data.

[AI-134] FedKBP: Federated dose prediction framework for knowledge-based planning in radiation therapy

链接: https://arxiv.org/abs/2408.10275
作者: Jingyun Chen,Martin King,Yading Yuan
关键词-EN: automatically generating patient-specific, generating patient-specific dose, Dose prediction plays, patient-specific dose distribution, Dose prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review by SPIE Medical Imaging 2025 Conference

点击查看摘要

Abstract:Dose prediction plays a key role in knowledge-based planning (KBP) by automatically generating patient-specific dose distributions. Recent advances in deep learning-based dose prediction methods necessitate collaboration among data contributors for improved performance. Federated learning (FL) has emerged as a solution, enabling medical centers to jointly train deep-learning models without compromising patient data privacy. We developed the FedKBP framework to evaluate the performance of centralized, federated, and individual (i.e., separate) training of a dose prediction model on the 340 plans from the OpenKBP dataset. To simulate FL and individual training, we divided the data into 8 training sites. To evaluate the effect of inter-site data variation on model training, we implemented two types of case distributions: 1) independent and identically distributed (IID), where the training and validating cases were evenly divided among the 8 sites, and 2) non-IID, where some sites have more cases than others. The results show FL consistently outperforms individual training on both model optimization speed and out-of-sample testing scores, highlighting the advantage of FL over individual training. Under IID data division, FL shows comparable performance to centralized training, underscoring FL as a promising alternative to traditional pooled-data training. Under non-IID division, larger sites outperformed smaller sites by up to 19% on testing scores, confirming the need for collaboration among data owners to achieve better prediction accuracy. Meanwhile, non-IID FL showed reduced performance as compared to IID FL, posing the need for more sophisticated FL methods beyond mere model averaging to handle data variation among participating sites.
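The "mere model averaging" the authors contrast against is plain FedAvg: a case-count-weighted average of per-site parameters after each round. A minimal sketch with hypothetical site parameters and plan counts:

```python
import numpy as np

def fedavg(site_params, site_sizes):
    """Case-count-weighted average of per-site parameter lists (plain FedAvg)."""
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [sum((n / total) * site[k] for site, n in zip(site_params, site_sizes))
            for k in range(n_params)]

# Two hypothetical sites with unequal plan counts (the non-IID setting):
site_a = [np.array([1.0, 1.0])]   # parameters from the larger site
site_b = [np.array([3.0, 3.0])]   # parameters from the smaller site
merged = fedavg([site_a, site_b], site_sizes=[30, 10])
print(merged[0])  # weighted toward site A, which holds more plans
```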

[AI-135] SEAL: Systematic Error Analysis for Value ALignment

链接: https://arxiv.org/abs/2408.10270
作者: Manon Revel,Matteo Cargnelutti,Tyna Eloundou,Greg Leppert
关键词-EN: Reinforcement Learning, align language models, training reward models, Human Feedback, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 28 pages, 17 Figures, 8 Tables

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
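The feature-imprint metric, obtained by regressing RM scores on target and spoiler features, can be sketched on synthetic data. The coefficients, features, and noise level below are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, n).astype(float)   # desired value present in response
spoiler = rng.integers(0, 2, n).astype(float)  # undesired concept present
# Synthetic RM scores: this RM rewards the target feature, penalizes the spoiler.
scores = 2.0 * target - 0.5 * spoiler + rng.normal(0.0, 0.1, n)

# Feature imprint = regression coefficient of RM score on each feature.
X = np.column_stack([np.ones(n), target, spoiler])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(coef[1], coef[2])  # recovers roughly +2.0 (target) and -0.5 (spoiler)
```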

[AI-136] OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction

链接: https://arxiv.org/abs/2408.10269
作者: Zhonghang Li,Long Xia,Lei Shi,Yong Xu,Dawei Yin,Chao Huang
关键词-EN: enabling efficient resource, enhanced travel experiences, efficient resource allocation, Accurate traffic forecasting, effective urban planning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 12 pages

点击查看摘要

Abstract:Accurate traffic forecasting is crucial for effective urban planning and transportation management, enabling efficient resource allocation and enhanced travel experiences. However, existing models often face limitations in generalization, struggling with zero-shot prediction on unseen regions and cities, as well as diminished long-term accuracy. This is primarily due to the inherent challenges in handling the spatial and temporal heterogeneity of traffic data, coupled with the significant distribution shift across time and space. In this work, we aim to unlock new possibilities for building versatile, resilient and adaptive spatio-temporal foundation models for traffic prediction. To achieve this goal, we introduce a novel foundation model, named OpenCity, that can effectively capture and normalize the underlying spatio-temporal patterns from diverse data characteristics, facilitating zero-shot generalization across diverse urban environments. OpenCity integrates the Transformer architecture with graph neural networks to model the complex spatio-temporal dependencies in traffic data. By pre-training OpenCity on large-scale, heterogeneous traffic datasets, we enable the model to learn rich, generalizable representations that can be seamlessly applied to a wide range of traffic forecasting scenarios. Experimental results demonstrate that OpenCity exhibits exceptional zero-shot predictive performance. Moreover, OpenCity showcases promising scaling laws, suggesting the potential for developing a truly one-for-all traffic prediction solution that can adapt to new urban contexts with minimal overhead. We made our proposed OpenCity model open-source and it is available at the following link: this https URL.

[AI-137] Realtime Generation of Streamliners with Large Language Models

链接: https://arxiv.org/abs/2408.10268
作者: Florentina Voboril,Vaidyanathan Peruvemba Ramaswamy,Stefan Szeider
关键词-EN: Large Language Models, Language Models, Large Language, paper presents, Models
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the novel method StreamLLM for generating streamliners in constraint programming using Large Language Models (LLMs). Streamliners are constraints that narrow the search space, enhancing the speed and feasibility of solving complex problems. Traditionally, streamliners were crafted manually or generated by systematically combining atomic constraints with high-effort offline testing. Our approach uses LLMs to propose effective streamliners. Our system StreamLLM generates streamliners for problems specified in the MiniZinc constraint programming language and integrates feedback to the LLM with quick empirical tests. Our rigorous empirical evaluation involving ten problems with several hundred test instances shows robust results that are highly encouraging, showcasing the transformative power of LLMs in the domain of constraint programming.

[AI-138] Diffusion Model for Planning: A Systematic Literature Review

链接: https://arxiv.org/abs/2408.10266
作者: Toshihide Ubukata,Jialong Li,Kenji Tei
关键词-EN: leverage stochastic processes, iterative denoising processes, data distributions effectively, achieving notable success, capture complex data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 13 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Diffusion models, which leverage stochastic processes to capture complex data distributions effectively, have shown their performance as generative models, achieving notable success in image-related tasks through iterative denoising processes. Recently, diffusion models have been further applied and have shown strong abilities in planning tasks, leading to significant growth in related publications since 2023. To help researchers better understand the field and promote its development, we conduct a systematic literature review of recent advancements in the application of diffusion models for planning. Specifically, this paper categorizes and discusses the current literature from the following perspectives: (i) relevant datasets and benchmarks used for evaluating diffusion model-based planning; (ii) fundamental studies that address aspects such as sampling efficiency; (iii) skill-centric and condition-guided planning for enhancing adaptability; (iv) safety and uncertainty management mechanisms for enhancing safety and robustness; and (v) domain-specific applications such as autonomous driving. Finally, given the above literature review, we further discuss the challenges and future directions in this field.

[AI-139] OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data

链接: https://arxiv.org/abs/2408.10264
作者: Chengyu Gong,Gefei Shen,Luanzheng Guo,Nathan Tallent,Dongfang Zhao
关键词-EN: scientific data management, multimodal scientific data, similar items, original multimodal data, multimodal machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:One of the most common operations in multimodal scientific data management is searching for the k most similar items (or k-nearest neighbors, KNN) in the database given a new item. Although recent advances in multimodal machine learning models offer a semantic index, the so-called embedding vectors mapped from the original multimodal data, the dimensionality of the resulting embedding vectors is usually on the order of hundreds or a thousand, which is impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-k nearest neighbors does not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). To develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function can be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric space, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
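The order-preservation property can be checked empirically: reduce the dimensionality (here with plain PCA via SVD, a stand-in for the reduction maps studied in the paper) and measure what fraction of each point's top-k neighbors survive in the lower-dimensional space.

```python
import numpy as np

def knn_indices(X, q, k):
    # Indices of the k nearest rows of X to query q (the query itself may
    # be included when q is a row of X; that is fine for overlap scoring).
    d = np.linalg.norm(X - q, axis=1)
    return set(np.argsort(d)[:k])

def knn_overlap(X, X_low, k=5):
    # Fraction of top-k neighbours preserved after dimension reduction,
    # averaged over all points used as queries (1.0 = perfectly preserved).
    scores = []
    for i in range(len(X)):
        a = knn_indices(X, X[i], k)
        b = knn_indices(X_low, X_low[i], k)
        scores.append(len(a & b) / k)
    return float(np.mean(scores))

# PCA via SVD as an illustrative dimension-reduction map: 64 -> 16 dims
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_low = Xc @ Vt[:16].T
```

An OPDR-style method would choose the target dimensionality so that this overlap score stays at (or near) 1.0.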

[AI-140] Relational Graph Convolutional Networks Do Not Learn Sound Rules KR2024

链接: https://arxiv.org/abs/2408.10261
作者: Matthew Morris,David J. Tena Cucala,Bernardo Cuenca Grau,Ian Horrocks
关键词-EN: Graph neural networks, Graph neural, knowledge graphs, predict missing facts, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Full version (with appendices) of paper accepted to KR 2024 (21st International Conference on Principles of Knowledge Representation and Reasoning)

点击查看摘要

Abstract:Graph neural networks (GNNs) are frequently used to predict missing facts in knowledge graphs (KGs). Motivated by the lack of explainability for the outputs of these models, recent work has aimed to explain their predictions using Datalog, a widely used logic-based formalism. However, such work has been restricted to certain subclasses of GNNs. In this paper, we consider one of the most popular GNN architectures for KGs, R-GCN, and we provide two methods to extract rules that explain its predictions and are sound, in the sense that each fact derived by the rules is also predicted by the GNN, for any input dataset. Furthermore, we provide a method that can verify that certain classes of Datalog rules are not sound for the R-GCN. In our experiments, we train R-GCNs on KG completion benchmarks, and we are able to verify that no Datalog rule is sound for these models, even though the models often obtain high to near-perfect accuracy. This raises some concerns about the ability of R-GCN models to generalise and about the explainability of their predictions. We further provide two variations to the training paradigm of R-GCN that encourage it to learn sound rules and find a trade-off between model accuracy and the number of learned sound rules.
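The soundness criterion itself is simple to state in code: a rule is sound for a model on a given input if every fact the rule derives is also among the model's predictions. Below is a minimal sketch for single-atom Datalog rules over (predicate, subject, object) triples; the paper's verification procedure, which must hold for any input dataset, is far more involved.

```python
def derive(body_pred, head_pred, facts):
    # Apply the single-atom Datalog rule  body_pred(X, Y) -> head_pred(X, Y)
    # to a set of (predicate, subject, object) triples.
    return {(head_pred, s, o) for (p, s, o) in facts if p == body_pred}

def rule_is_sound(body_pred, head_pred, dataset, gnn_predictions):
    # Sound on this input: every fact derived by the rule is also
    # predicted by the GNN.
    return derive(body_pred, head_pred, dataset) <= gnn_predictions
```

The paper's finding is that for trained R-GCNs no Datalog rule passes this test over all inputs, despite the models' high completion accuracy.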

[AI-141] Optical Music Recognition in Manuscripts from the Ricordi Archive

链接: https://arxiv.org/abs/2408.10260
作者: Federico Simonetta,Rishav Mondal,Luca Andrea Ludovico,Stavros Ntalampiras
关键词-EN: Verdi and Puccini, renowned opera composers, significant musical manuscripts, Ricordi archive, prestigious collection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
*备注: Accepted at AudioMostly 2024

点击查看摘要

Abstract:The Ricordi archive, a prestigious collection of significant musical manuscripts from renowned opera composers such as Donizetti, Verdi and Puccini, has been digitized. This process has allowed us to automatically extract samples that represent various musical elements depicted on the manuscripts, including notes, staves, clefs, erasures, and composer’s annotations, among others. To distinguish between digitization noise and actual music elements, a subset of these images was meticulously grouped and labeled by multiple individuals into several classes. After assessing the consistency of the annotations, we trained multiple neural network-based classifiers to differentiate between the identified music elements. The primary objective of this study was to evaluate the reliability of these classifiers, with the ultimate goal of using them for the automatic categorization of the remaining unannotated data set. The dataset, together with the manual annotations, models, and source code used in these experiments, is publicly accessible for replication purposes.

[AI-142] Contrastive Learning on Medical Intents for Sequential Prescription Recommendation CIKM2024

链接: https://arxiv.org/abs/2408.10259
作者: Arya Hadizadeh Moghaddam,Mohsen Nayebi Kerdabadi,Mei Liu,Zijun Yao
关键词-EN: Electronic Health Records, applied to Electronic, sequential modeling applied, prescription recommender systems, greatly influenced prescription
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024)

点击查看摘要

Abstract:Recent advancements in sequential modeling applied to Electronic Health Records (EHR) have greatly influenced prescription recommender systems. While the recent literature on drug recommendation has shown promising performance, the study of discovering a diversity of coexisting temporal relationships at the level of medical codes over consecutive visits remains less explored. The goal of this study can be motivated from two perspectives. First, there is a need to develop a sophisticated sequential model capable of disentangling the complex relationships across sequential visits. Second, it is crucial to establish multiple and diverse health profiles for the same patient to ensure a comprehensive consideration of different medical intents in drug recommendation. To achieve this goal, we introduce Attentive Recommendation with Contrasted Intents (ARCI), a multi-level transformer-based method designed to capture the different but coexisting temporal paths across a shared sequence of visits. Specifically, we propose a novel intent-aware method with contrastive learning, that links specialized medical intents of the patients to the transformer heads for extracting distinct temporal paths associated with different health profiles. We conducted experiments on two real-world datasets for the prescription recommendation task using both ranking and classification metrics. Our results demonstrate that ARCI has outperformed the state-of-the-art prescription recommendation methods and is capable of providing interpretable insights for healthcare practitioners.

[AI-143] Balancing Innovation and Ethics in AI-Driven Software Development

链接: https://arxiv.org/abs/2408.10252
作者: Mohammad Baqar
关键词-EN: GitHub Copilot, Copilot and ChatGPT, paper critically examines, software development process, critically examines
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 20 Pages

点击查看摘要

Abstract:This paper critically examines the ethical implications of integrating AI tools like GitHub Copilot and ChatGPT into the software development process. It explores issues such as code ownership, bias, accountability, privacy, and the potential impact on the job market. While these AI tools offer significant benefits in terms of productivity and efficiency, they also introduce complex ethical challenges. The paper argues that addressing these challenges is essential to ensuring that AI’s integration into software development is both responsible and beneficial to society.

[AI-144] Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to-Emotional-Caption Translation Network using Visual-Caption Pairs

链接: https://arxiv.org/abs/2408.10248
作者: Ananya Pandey,Dinesh Kumar Vishwakarma
关键词-EN: natural language processing, multimodal sentiment recognition, Multimodal Sentiment Analysis, multimodal sentiment, Target-Dependent Multimodal Sentiment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The natural language processing and multimedia field has seen a notable surge in interest in multimodal sentiment recognition. Hence, this study aims to employ Target-Dependent Multimodal Sentiment Analysis (TDMSA) to identify the level of sentiment associated with every target (aspect) stated within a multimodal post consisting of a visual-caption pair. Despite the recent advancements in multimodal sentiment recognition, there has been a lack of explicit incorporation of emotional clues from the visual modality, specifically those pertaining to facial expressions. The challenge at hand is to proficiently obtain visual and emotional clues and subsequently synchronise them with the textual content. In light of this fact, this study presents a novel approach called the Visual-to-Emotional-Caption Translation Network (VECTN) technique. The primary objective of this strategy is to effectively acquire visual sentiment clues by analysing facial expressions. Additionally, it effectively aligns and blends the obtained emotional clues with the target attribute of the caption mode. The experimental findings demonstrate that our methodology is capable of producing ground-breaking outcomes when applied to two publicly accessible multimodal Twitter datasets, namely, Twitter-2015 and Twitter-2017. The experimental results show that the suggested model achieves an accuracy of 81.23% and a macro-F1 of 80.61% on the Twitter-15 dataset, while 77.42% and 75.19% on the Twitter-17 dataset, respectively. The observed improvement in performance reveals that our model is better than others when it comes to collecting target-level sentiment in multimodal data using the expressions of the face.

[AI-145] VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features

链接: https://arxiv.org/abs/2408.10246
作者: Ananya Pandey,Dinesh Kumar Vishwakarma
关键词-EN: frequently convey sarcasm, sarcasm recognition, Multi-modal Sarcasm Recognition, non-linguistic clues, tone of voice
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Previously, sarcasm recognition has focused mainly on text. Still, it is critical to consider all textual information, audio stream, facial expression, and body position for reliable sarcasm identification. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data, and an attentional tokenizer based strategy to extract the most critical context-specific information from the textual data. The following is a list of the key contributions that our experimentation has made in response to performing the task of Multi-modal Sarcasm Recognition: an attentional tokenizer branch to get beneficial features from the glossary content provided by the subtitles; a visual branch for acquiring the most prominent features from the video frames; an utterance-level feature extraction from acoustic content; and a multi-headed attention based feature fusion branch to blend features obtained from multiple modalities. Extensive testing on one of the benchmark video datasets, MUStARD, yielded an accuracy of 79.86% for the speaker-dependent and 76.94% for the speaker-independent configuration, demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset, MUStARD++.

[AI-146] TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks – Part II: Architecture and Hardware Implementation

链接: https://arxiv.org/abs/2408.10243
作者: Cristian Sestito,Shady Agwa,Themis Prodromakis
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, targeting high performance, dissipating limited energy
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Modern hardware architectures for Convolutional Neural Networks (CNNs), other than targeting high performance, aim at dissipating limited energy. Reducing the data movement cost between the computing cores and the memory is a way to mitigate the energy consumption. Systolic arrays are suitable architectures to achieve this objective: they use multiple processing elements that communicate with each other to maximize data utilization, based on proper dataflows like the weight-stationary and row-stationary ones. Motivated by this, we have proposed TrIM, an innovative dataflow based on a triangular movement of inputs, capable of reducing the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays. In this paper, we present a TrIM-based hardware architecture for CNNs. As a showcase, the accelerator is implemented onto a Field Programmable Gate Array (FPGA) to execute the VGG-16 CNN. The architecture achieves a peak throughput of 453.6 Giga Operations per Second, outperforming a state-of-the-art row-stationary systolic array by ~5.1x in terms of memory accesses, and being up to ~12.2x more energy-efficient than other FPGA accelerators.

[AI-147] AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People

链接: https://arxiv.org/abs/2408.10240
作者: Seonghee Lee,Maho Kohga,Steve Landau,Sile O’Modhrain,Hari Subramonyam
关键词-EN: structural information, impairments often struggle, content that relies, relies heavily, conveying spatial
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:People with visual impairments often struggle to create content that relies heavily on visual elements, particularly when conveying spatial and structural information. Existing accessible drawing tools, which construct images line by line, are suitable for simple tasks like math but not for more expressive artwork. On the other hand, emerging generative AI-based text-to-image tools can produce expressive illustrations from descriptions in natural language, but they lack precise control over image composition and properties. To address this gap, our work integrates generative AI with a constructive approach that provides users with enhanced control and editing capabilities. Our system, AltCanvas, features a tile-based interface enabling users to construct visual scenes incrementally, with each tile representing an object within the scene. Users can add, edit, move, and arrange objects while receiving speech and audio feedback. Once completed, the scene can be rendered as a color illustration or as a vector for tactile graphic generation. Involving 14 blind or low-vision users in design and evaluation, we found that participants effectively used the AltCanvas workflow to create illustrations.

[AI-148] A Conceptual Framework for Ethical Evaluation of Machine Learning Systems

链接: https://arxiv.org/abs/2408.10239
作者: Neha R. Gupta,Jessica Hullman,Hari Subramonyam
关键词-EN: Research in Responsible, machine learning systems, developed a range, range of principles, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the ethical implications that appear when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests to ensure downstream product safety, with potential fairness harms inherent to the implemented testing procedures. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework is then a tool for characterizing challenges teams face, and systematically disentangling competing considerations that teams seek to balance. Differentiating between different types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration to improve evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.

[AI-149] A General-Purpose Device for Interaction with LLMs

链接: https://arxiv.org/abs/2408.10230
作者: Jiajun Xu,Qun Wang,Yuhang Cao,Baitao Zeng,Sicheng Liu
关键词-EN: large language models, paper investigates integrating, investigates integrating large, integrating large language, general-purpose device designed
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper investigates integrating large language models (LLMs) with advanced hardware, focusing on developing a general-purpose device designed for enhanced interaction with LLMs. Initially, we analyze the current landscape, where virtual assistants and LLMs are reshaping human-technology interactions, highlighting pivotal advancements and setting the stage for a new era of intelligent hardware. Despite substantial progress in LLM technology, a significant gap exists in hardware development, particularly concerning scalability, efficiency, affordability, and multimodal capabilities. This disparity presents both challenges and opportunities, underscoring the need for hardware that is not only powerful but also versatile and capable of managing the sophisticated demands of modern computation. Our proposed device addresses these needs by emphasizing scalability, multimodal data processing, enhanced user interaction, and privacy considerations, offering a comprehensive platform for LLM integration in various applications.

[AI-150] NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

链接: https://arxiv.org/abs/2408.10161
作者: Zhiyong Zhang,Aniket Gupta,Huaizu Jiang,Hanumant Singh
关键词-EN: Real-time high-accuracy optical, Real-time high-accuracy, high-accuracy optical flow, optical flow estimation, optical flow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Real-time high-accuracy optical flow estimation is crucial for various real-world applications. While recent learning-based optical flow methods have achieved high accuracy, they often come with significant computational costs. In this paper, we propose a highly efficient optical flow method that balances high accuracy with reduced computational demands. Building upon NeuFlow v1, we introduce new components including a much more lightweight backbone and a fast refinement module. Both of these modules help keep the computational demands light while providing close to state-of-the-art accuracy. Compared to other state-of-the-art methods, our model achieves a 10x-70x speedup while maintaining comparable performance on both synthetic and real-world data. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at this https URL.

[AI-151] Neural Horizon Model Predictive Control – Increasing Computational Efficiency with Neural Networks

链接: https://arxiv.org/abs/2408.09781
作者: Hendrik Alsmeier,Anton Savchenko,Rolf Findeisen
关键词-EN: low-power edge devices, edge devices poses, based control algorithms, increasingly fast applications, model predictive control
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 4 tables, American Control Conference (ACC) 2024

点击查看摘要

Abstract:The expansion of automation to increasingly fast applications and low-power edge devices poses a particular challenge for optimization-based control algorithms, like model predictive control. Our proposed machine-learning-supported approach addresses this by utilizing a feed-forward neural network to reduce the computation load of the online optimization. We propose approximating part of the problem horizon, while maintaining safety guarantees – constraint satisfaction – via the remaining optimization part of the controller. The approach is validated in simulation, demonstrating an improvement in computational efficiency, while maintaining guarantees and near-optimal performance. The proposed MPC scheme can be applied to a wide range of applications, including those requiring a rapid control response, such as robotics and embedded applications with limited computational resources.
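The idea of approximating part of the horizon can be illustrated on a toy scalar system: only a short horizon is optimized explicitly, and the truncated tail is summarized by a learned terminal cost. Here a hand-written quadratic stands in for the feed-forward network, and the system, cost weights, and brute-force search grid are all illustrative.

```python
import numpy as np

def step(x, u):
    return 0.9 * x + 0.5 * u  # toy stable linear system (illustrative)

def terminal_cost(x):
    # Stand-in for the learned network that approximates the cost-to-go
    # of the truncated (neural) part of the horizon.
    return 2.0 * x ** 2

def short_horizon_mpc(x0, horizon=3, candidates=np.linspace(-1, 1, 41)):
    # Optimize only a short horizon explicitly and close it with the
    # learned terminal cost; brute-force search over a constant input.
    best_u, best_J = 0.0, float("inf")
    for u in candidates:
        x, J = x0, 0.0
        for _ in range(horizon):
            J += x ** 2 + 0.1 * u ** 2  # stage cost
            x = step(x, u)
        J += terminal_cost(x)           # neural-horizon approximation
        if J < best_J:
            best_u, best_J = u, J
    return best_u
```

A real implementation would solve the short-horizon problem with a proper optimizer and enforce constraints there, which is how the safety guarantees of the scheme are retained.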

[AI-152] Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence

链接: https://arxiv.org/abs/2112.13060
作者: Ruoxi Chen,Haibo Jin,Haibin Zheng,Jinyin Chen,Zhenguang Liu
关键词-EN: attracted increasing attention, deep learning models, increasing attention, security-critical domains, vulnerabilities of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Final version. Accepted to IEEE Transactions on Dependable and Secure Computing

点击查看摘要

Abstract:The vulnerabilities of deep learning models to adversarial attacks have attracted increasing attention, especially when models are deployed in security-critical domains. Numerous defense methods, including reactive and proactive ones, have been proposed for model robustness improvement. Reactive defenses, such as conducting transformations to remove perturbations, usually fail to handle large perturbations. Proactive defenses that involve retraining suffer from attack dependency and high computation cost. In this paper, we consider defense methods from the perspective of the general effect that adversarial attacks have on neurons inside the model. We introduce the concept of neuron influence, which can quantitatively measure a neuron’s contribution to correct classification. We then observe that almost all attacks fool the model by suppressing neurons with larger influence and enhancing those with smaller influence. Based on this, we propose Neuron-level Inverse Perturbation (NIP), a novel defense against general adversarial attacks. It calculates neuron influence from benign examples and then modifies input examples by generating inverse perturbations that, in turn, strengthen neurons with larger influence and weaken those with smaller influence.
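The notions of neuron influence and inverse perturbation can be sketched on a tiny two-layer network. Both the influence measure below (correct-class weight times activation) and the gradient-ascent update are simplified stand-ins for the paper's definitions, not the actual NIP algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))   # hidden x input weights
W2 = rng.normal(size=(3, 8))   # class x hidden weights

def forward(x):
    h = np.maximum(0.0, W1 @ x)  # ReLU hidden layer
    return h, W2 @ h             # activations, logits

def neuron_influence(x, label):
    # Simplified influence: each hidden neuron's contribution to the
    # correct-class logit (weight times activation).
    h, _ = forward(x)
    return W2[label] * h

def inverse_perturbation(x, label, eps=0.1):
    # Nudge the input so high-influence neurons are strengthened: one
    # normalized ascent step on the influence-weighted activations
    # (influence treated as fixed during the step).
    h, _ = forward(x)
    infl = W2[label] * h
    grad = W1.T @ (infl * (h > 0))  # gradient w.r.t. x through the ReLU
    return x + eps * grad / (np.linalg.norm(grad) + 1e-8)
```

The defense intuition is that this counteracts an attack that pushed the same neurons in the opposite direction.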

[AI-153] An Overlooked Role of Context-Sensitive Dendrites

链接: https://arxiv.org/abs/2408.11019
作者: Mohsin Raza,Ahsan Adeel
关键词-EN: higher perceptual layers, pyramidal two-point neurons, predominantly focused, zone of pyramidal, pyramidal two-point
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To date, most dendritic studies have predominantly focused on the apical zone of pyramidal two-point neurons (TPNs), receiving only feedback (FB) connections from higher perceptual layers and using them for learning. Recent cellular neurophysiology and computational neuroscience studies suggest that the apical input (context), coming from feedback and lateral connections, is multifaceted and far more diverse, with greater implications for ongoing learning and processing in the brain than previously realized. In addition to the FB, the apical tuft receives signals from neighboring cells of the same network as proximal (P) context, from other parts of the brain as distal (D) context, and overall coherent information across the network as universal (U) context. The integrated context (C) amplifies and suppresses the transmission of coherent and conflicting feedforward (FF) signals, respectively. Specifically, we show that complex context-sensitive (CS)-TPNs flexibly integrate C moment-by-moment with the FF somatic current at the soma, such that the somatic current is amplified when both FF and C are coherent; otherwise, it is attenuated. This generates an event only when the FF and C currents are coherent, which is then translated into a singlet or a burst based on the FB information. Spiking simulation results show that this flexible integration of somatic and contextual currents enables the propagation of more coherent signals (bursts), making learning faster with fewer neurons. Similar behavior is observed when this functioning is used in conventional artificial networks, where orders of magnitude fewer neurons are required to process vast amounts of heterogeneous real-world audio-visual (AV) data trained using backpropagation (BP). The computational findings presented here demonstrate the universality of CS-TPNs, suggesting a dendritic narrative that was previously overlooked.

[AI-154] Denoising Plane Wave Ultrasound Images Using Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2408.10987
作者: Hojat Asgariandehkordi,Sobhan Goudarzi,Mostafa Sharifzadeh,Adrian Basarab,Hassan Rivaz
关键词-EN: frame-rate ultrasound imaging, high frame-rate ultrasound, enables high frame-rate, high frame-rate imaging, Ultrasound plane wave
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ultrasound plane wave imaging is a cutting-edge technique that enables high frame-rate imaging. However, one challenge associated with high frame-rate ultrasound imaging is the high noise associated with them, hindering their wider adoption. Therefore, the development of a denoising method becomes imperative to augment the quality of plane wave images. Drawing inspiration from Denoising Diffusion Probabilistic Models (DDPMs), our proposed solution aims to enhance plane wave image quality. Specifically, the method considers the distinction between low-angle and high-angle compounding plane waves as noise and effectively eliminates it by adapting a DDPM to beamformed radiofrequency (RF) data. The method underwent training using only 400 simulated images. In addition, our approach employs natural image segmentation masks as intensity maps for the generated images, resulting in accurate denoising for various anatomy shapes. The proposed method was assessed across simulation, phantom, and in vivo images. The results of the evaluations indicate that our approach not only enhances image quality on simulated data but also demonstrates effectiveness on phantom and in vivo data in terms of image quality. Comparative analysis with other methods underscores the superiority of our proposed method across various evaluation metrics. The source code and trained model will be released along with the dataset at: this http URL

[AI-155] Radio U-Net: a convolutional neural network to detect diffuse radio sources in galaxy clusters and beyond

链接: https://arxiv.org/abs/2408.10871
作者: Chiara Stuardi,Claudio Gheller,Franco Vazza,Andrea Botteon
关键词-EN: telescope arrays promises, arrays promises significant, radio telescope arrays, radio, promises significant advancements
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by MNRAS, 16 pages, 9 figures, 2 tables

点击查看摘要

Abstract:The forthcoming generation of radio telescope arrays promises significant advancements in sensitivity and resolution, enabling the identification and characterization of many new faint and diffuse radio sources. Conventional manual cataloging methodologies are anticipated to be insufficient to exploit the capabilities of new radio surveys. Radio interferometric images of diffuse sources present a challenge for image segmentation tasks due to noise, artifacts, and embedded radio sources. In response to these challenges, we introduce Radio U-Net, a fully convolutional neural network based on the U-Net architecture. Radio U-Net is designed to detect faint and extended sources in radio surveys, such as radio halos, relics, and cosmic web filaments. Radio U-Net was trained on synthetic radio observations built upon cosmological simulations and then tested on a sample of galaxy clusters, where the detection of cluster diffuse radio sources relied on customized data reduction and visual inspection of LOFAR Two Metre Sky Survey (LoTSS) data. 83% of the clusters exhibiting diffuse radio emission were accurately identified, and the segmentation successfully recovered the morphology of the sources even in low-quality images. In a test sample comprising 246 galaxy clusters, we achieved a 73% accuracy rate in distinguishing between clusters with and without diffuse radio emission. Our results establish the applicability of Radio U-Net to extensive radio survey datasets, probing its efficiency on cutting-edge high-performance computing systems. This approach represents an advancement in optimizing the exploitation of forthcoming large radio surveys for scientific exploration.

[AI-156] MambaDS: Near-Surface Meteorological Field Downscaling with Topography Constrained Selective State Space Modeling

链接: https://arxiv.org/abs/2408.10854
作者: Zili Liu,Hao Chen,Lei Bai,Wenyuan Li,Wanli Ouyang,Zhengxia Zou,Zhenwei Shi
关键词-EN: fine-grained near-surface weather, frequent extreme weather, near-surface weather forecasts, obtaining precise, extreme weather
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In an era of frequent extreme weather and global warming, obtaining precise, fine-grained near-surface weather forecasts is increasingly essential for human activities. Downscaling (DS), a crucial task in meteorological forecasting, enables the reconstruction of high-resolution meteorological states for target regions from global-scale forecast results. Previous downscaling methods, inspired by CNN- and Transformer-based super-resolution models, lacked tailored designs for meteorology and encountered structural limitations. Notably, they failed to efficiently integrate topography, a crucial prior in the downscaling process. In this paper, we address these limitations by pioneering the selective state space model for meteorological field downscaling and propose a novel model called MambaDS. This model enhances the utilization of multivariable correlations and topography information, unique challenges in the downscaling process, while retaining the advantages of Mamba in long-range dependency modeling and linear computational complexity. Through extensive experiments in both mainland China and the continental United States (CONUS), we validated that our proposed MambaDS achieves state-of-the-art results in three different types of meteorological field downscaling settings. We will release the code subsequently.

[AI-157] SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS

链接: https://arxiv.org/abs/2408.10771
作者: Karl El Hajal,Ajinkya Kulkarni,Enno Hermann,Mathew Magimai.-Doss
关键词-EN: achieve impressive results, recent zero-shot multispeaker, intricate training pipelines, impressive results, typically rely
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to IEEE Signal Processing Letters

点击查看摘要

Abstract:While recent zero-shot multispeaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. It was also observed that SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity, which enables straightforward and robust voice cloning. In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker. SSL-TTS leverages SSL features and retrieval methods for simple and robust zero-shot multi-speaker synthesis. Objective and subjective evaluations show that our approach achieves performance comparable to state-of-the-art models that require significantly larger training datasets. The low training data requirements mean that SSL-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine control over the output speech by blending voices. Demo samples are available at this https URL
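The retrieval-plus-blending idea in the abstract can be sketched in a few lines: for each source frame, find the nearest target-speaker frames in SSL feature space and interpolate. This is a minimal numpy illustration of the mechanism, not the paper's implementation; the function name, feature sizes, and `k` are assumptions.

```python
import numpy as np

def knn_voice_blend(source_feats, target_feats, k=4, lam=1.0):
    """For each source SSL frame, retrieve its k nearest target-speaker
    frames by cosine similarity, average them, and blend with the original
    frame via the interpolation parameter lam
    (lam=1.0 -> full target voice, lam=0.0 -> source unchanged)."""
    s = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = s @ t.T                            # (n_src, n_tgt) cosine matrix
    idx = np.argsort(-sims, axis=1)[:, :k]    # k most similar target frames
    matched = target_feats[idx].mean(axis=1)  # average the retrieved frames
    return lam * matched + (1.0 - lam) * source_feats

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))   # 5 source frames, 8-dim SSL features
tgt = rng.normal(size=(20, 8))  # 20 frames from the target speaker
blended = knn_voice_blend(src, tgt, k=4, lam=0.5)
```

Setting `lam` between 0 and 1 realizes the voice-blending control described in the abstract.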

[AI-158] Quantum Artificial Intelligence: A Brief Survey

链接: https://arxiv.org/abs/2408.10726
作者: Matthias Klusch,Jörg Lässig,Daniel Müssig,Antonio Macaluso,Frank K. Wilhelm
关键词-EN: Quantum Artificial Intelligence, Artificial Intelligence, expected significant benefits, Quantum Artificial, technological synergy
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: 21 pages, 5 figures

点击查看摘要

Abstract:Quantum Artificial Intelligence (QAI) is the intersection of quantum computing and AI, a technological synergy with expected significant benefits for both. In this paper, we provide a brief overview of what has been achieved in QAI so far and point to some open questions for future research. In particular, we summarize some major key findings on the feasibility and the potential of using quantum computing for solving computationally hard problems in various subfields of AI, and vice versa, the leveraging of AI methods for building and operating quantum computing devices.

[AI-159] A Tutorial on Explainable Image Classification for Dementia Stages Using Convolutional Neural Network and Gradient-weighted Class Activation Mapping

链接: https://arxiv.org/abs/2408.10572
作者: Kevin Kam Fung Yuen
关键词-EN: Convolutional Neural Network, Class Activation Mapping, Gradient-weighted Class Activation, MRI brain images, open MRI brain
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 11 figures, 3 tables

点击查看摘要

Abstract:This paper presents a tutorial of an explainable approach using a Convolutional Neural Network (CNN) and Gradient-weighted Class Activation Mapping (Grad-CAM) to classify four progressive dementia stages based on open MRI brain images. The detailed implementation steps are demonstrated with an explanation. Whilst the proposed CNN architecture is demonstrated to achieve more than 99% accuracy on the test dataset, the computational procedure of the CNN remains a black box. Grad-CAM visualisation is used to explain this very high accuracy and may provide useful information for physicians. Future work motivated by this tutorial is discussed.
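The Grad-CAM step the tutorial relies on is standard: weight each final-layer feature map by the average gradient of the class score, sum, and apply a ReLU. A minimal numpy sketch of that computation (illustrative shapes, not the tutorial's code):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap: weight each feature map by the global-average-pooled
    gradient of the target class score, sum the weighted maps, apply ReLU,
    and normalize to [0, 1].
    activations, gradients: (C, H, W) arrays from the last conv layer."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k per channel
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale for overlaying on MRI
    return cam

rng = np.random.default_rng(1)
acts = rng.random((16, 7, 7))        # 16 feature maps from a 7x7 conv output
grads = rng.normal(size=(16, 7, 7))  # gradients of the class score
heatmap = grad_cam(acts, grads)
```

Upsampled to the input resolution, the heatmap highlights the brain regions driving the dementia-stage prediction.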

[AI-160] Prompt Your Brain: Scaffold Prompt Tuning for Efficient Adaptation of fMRI Pre-trained Model MICCAI2024

链接: https://arxiv.org/abs/2408.10567
作者: Zijian Dong,Yilei Wu,Zijiao Chen,Yichi Zhang,Yueming Jin,Juan Helen Zhou
关键词-EN: magnetic resonance imaging, introduce Scaffold Prompt, large-scale functional magnetic, functional magnetic resonance, improved performance compared
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024

点击查看摘要

Abstract:We introduce Scaffold Prompt Tuning (ScaPT), a novel prompt-based framework for adapting large-scale functional magnetic resonance imaging (fMRI) pre-trained models to downstream tasks, with high parameter efficiency and improved performance compared to fine-tuning and baselines for prompt tuning. The full fine-tuning updates all pre-trained parameters, which may distort the learned feature space and lead to overfitting with limited training data which is common in fMRI fields. In contrast, we design a hierarchical prompt structure that transfers the knowledge learned from high-resource tasks to low-resource ones. This structure, equipped with a Deeply-conditioned Input-Prompt (DIP) mapping module, allows for efficient adaptation by updating only 2% of the trainable parameters. The framework enhances semantic interpretability through attention mechanisms between inputs and prompts, and it clusters prompts in the latent space in alignment with prior knowledge. Experiments on public resting state fMRI datasets reveal ScaPT outperforms fine-tuning and multitask-based prompt tuning in neurodegenerative diseases diagnosis/prognosis and personality trait prediction, even with fewer than 20 participants. It highlights ScaPT’s efficiency in adapting pre-trained fMRI models to low-resource tasks.

[AI-161] Efficient Reinforcement Learning in Probabilistic Reward Machines

链接: https://arxiv.org/abs/2408.10381
作者: Xiaofeng Lin,Xuezhou Zhang
关键词-EN: Markov Decision Processes, Probabilistic Reward Machines, Markov Decision, Decision Processes, Processes with Probabilistic
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 33 pages, 4 figures

点击查看摘要

Abstract:In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T}), where H is the time horizon, O is the number of observations, A is the number of actions, and T is the number of time-steps. This result improves over the best-known bound, \widetilde{O}(H\sqrt{OAT}), of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When T \geq H^3O^3A^2 and OA \geq H, our regret bound leads to a regret of \widetilde{O}(\sqrt{HOAT}), which matches the established lower bound of \Omega(\sqrt{HOAT}) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experiment evaluations that our algorithm indeed outperforms prior methods in various PRM environments.
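A quick consistency check of the stated regime (our arithmetic, not taken from the paper): in the regime T \geq H^3O^3A^2 and OA \geq H, the first term of the bound dominates the other two.

```latex
% When T \ge H^3 O^3 A^2, the first term dominates the second:
\sqrt{HOAT} \;\ge\; \sqrt{HOA \cdot H^{3}O^{3}A^{2}} \;=\; H^{2}O^{2}A^{3/2}.
% When OA \ge H, the first term also dominates the third:
H\sqrt{T} \;=\; \sqrt{H^{2}T} \;\le\; \sqrt{HOAT}.
% Hence in this regime the bound collapses to \widetilde{O}(\sqrt{HOAT}),
% matching the \Omega(\sqrt{HOAT}) lower bound up to log factors.
```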

[AI-162] Recognizing Beam Profiles from Silicon Photonics Gratings using Transformer Model

链接: https://arxiv.org/abs/2408.10287
作者: Yu Dian Lim,Hong Yu Li,Simon Chun Kiat Goh,Xiangyu Wang,Peng Zhao,Chuan Seng Tan
关键词-EN: trapped ion qubits, ion trap quantum, integrated silicon photonics, quantum computing community, developing integrated silicon
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Over the past decade, there has been extensive work in developing integrated silicon photonics (SiPh) gratings for the optical addressing of trapped ion qubits in the ion trap quantum computing community. However, when viewing beam profiles from infrared (IR) cameras, it is often difficult to determine the corresponding heights where the beam profiles are located. In this work, we developed transformer models to recognize the corresponding height categories of beam profiles of light from SiPh gratings. The model is trained using two techniques: (1) input patches, and (2) input sequences. The model trained with input patches achieved a recognition accuracy of 0.938, while the model trained with input sequences showed a lower accuracy of 0.895. However, when the training was repeated for 150 cycles, the model trained with input patches showed inconsistent accuracies ranging from 0.445 to 0.959, whereas the model trained with input sequences exhibited more consistent accuracies between 0.789 and 0.936. The obtained outcomes can be extended to various applications, including auto-focusing of light beams and auto-adjustment of the z-axis stage to acquire desired beam profiles.

[AI-163] Large Investment Model

链接: https://arxiv.org/abs/2408.10255
作者: Jian Guo,Heung-Yeung Shum
关键词-EN: Traditional quantitative investment, encountering diminishing returns, diminishing returns alongside, returns alongside rising, alongside rising labor
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: 20 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Traditional quantitative investment research is encountering diminishing returns alongside rising labor and time costs. To overcome these challenges, we introduce the Large Investment Model (LIM), a novel research paradigm designed to enhance both performance and efficiency at scale. LIM employs end-to-end learning and universal modeling to create an upstream foundation model capable of autonomously learning comprehensive signal patterns from diverse financial data spanning multiple exchanges, instruments, and frequencies. These “global patterns” are subsequently transferred to downstream strategy modeling, optimizing performance for specific tasks. We detail the system architecture design of LIM, address the technical challenges inherent in this approach, and outline potential directions for future research. The advantages of LIM are demonstrated through a series of numerical experiments on cross-instrument prediction for commodity futures trading, leveraging insights from stock markets.

[AI-164] MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive Redesign

链接: https://arxiv.org/abs/2408.10247
作者: Jiangbin Zheng,Han Zhang,Qianqing Xu,An-Ping Zeng,Stan Z. Li
关键词-EN: Enzyme design, enzyme design tasks, Enzyme design plays, production and biology, design
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注: Accepted to ACM Multimedia 2024

点击查看摘要

Abstract:Enzyme design plays a crucial role in both industrial production and biology. However, this field faces challenges due to the lack of comprehensive benchmarks and the complexity of enzyme design tasks, leading to a dearth of systematic research. Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. We begin by employing a cross-modal structure-to-sequence transformation architecture, as the feature-driven starting point to obtain initial robust protein representation. Subsequently, we leverage domain adaptive techniques to generalize specific enzyme design tasks under low-resource conditions. MetaEnzyme focuses on three fundamental low-resource enzyme redesign tasks: functional design (FuncDesign), mutation design (MutDesign), and sequence generation design (SeqDesign). Through novel unified paradigm and enhanced representation capabilities, MetaEnzyme demonstrates adaptability to diverse enzyme design tasks, yielding outstanding results. Wet lab experiments further validate these findings, reinforcing the efficacy of the redesign process.

计算机视觉

[CV-0] Prompt-Guided Image-Adaptive Neural Implicit Lookup Tables for Interpretable Image Enhancement

链接: https://arxiv.org/abs/2408.11055
作者: Satoshi Kosugi
关键词-EN: enhances image quality, technique that enhances, quality by adjusting, easily understandable, enhances image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ACM Multimedia 2024

点击查看摘要

Abstract:In this paper, we delve into the concept of interpretable image enhancement, a technique that enhances image quality by adjusting filter parameters with easily understandable names such as "Exposure" and "Contrast". Unlike approaches that use predefined image editing filters, our framework utilizes learnable filters that acquire interpretable names through training. Our contribution is two-fold. Firstly, we introduce a novel filter architecture called an image-adaptive neural implicit lookup table, which uses a multilayer perceptron to implicitly define the transformation from input feature space to output color space. By incorporating image-adaptive parameters directly into the input features, we achieve highly expressive filters. Secondly, we introduce a prompt guidance loss to assign interpretable names to each filter. We evaluate visual impressions of enhancement results, such as exposure and contrast, using a vision and language model along with guiding prompts. We define a constraint to ensure that each filter affects only the targeted visual impression without influencing other attributes, which allows us to obtain the desired filter effects. Experimental results show that our method outperforms existing predefined filter-based methods, thanks to the filters optimized to predict target results. Our source code is available at this https URL.
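The core mechanism, an MLP that acts as an image-adaptive lookup table, can be sketched as follows. This is an illustrative numpy toy with random placeholder weights and an assumed parameter count; in the paper, the weights are learned and each filter carries an interpretable name.

```python
import numpy as np

def implicit_lut(color, image_params, W1, b1, W2, b2):
    """Image-adaptive neural implicit lookup table (sketch): a small MLP
    maps an input color, concatenated with image-adaptive parameters,
    to an output color. Weights here are random placeholders."""
    x = np.concatenate([color, image_params])  # adaptive params join the input
    h = np.maximum(W1 @ x + b1, 0.0)           # single hidden ReLU layer
    return W2 @ h + b2                         # transformed output color

rng = np.random.default_rng(2)
n_params = 4                                   # assumed number of adaptive params
W1, b1 = rng.normal(size=(16, 3 + n_params)), np.zeros(16)
W2, b2 = rng.normal(size=(3, 16)), np.zeros(3)
out = implicit_lut(np.array([0.2, 0.5, 0.8]), rng.normal(size=n_params),
                   W1, b1, W2, b2)
```

Because the per-image parameters enter the MLP input directly, the same learned filter can behave differently on each image, which is what makes the lookup table "image-adaptive".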

[CV-1] NeCo: Improving DINOv2s spatial representations in 19 GPU hours with Patch Neighbor Consistency

链接: https://arxiv.org/abs/2408.11054
作者: Valentinos Pariza,Mohammadreza Salehi,Gertjan Burghouts,Francesco Locatello,Yuki M. Asano
关键词-EN: Patch Neighbor Consistency, propose sorting patch, self-supervised learning signal, sorting patch representations, nearest neighbor consistency
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint. The webpage is accessible at: this https URL

点击查看摘要

Abstract:We propose sorting patch representations across views as a novel self-supervised learning signal to improve pretrained representations. To this end, we introduce NeCo: Patch Neighbor Consistency, a novel training loss that enforces patch-level nearest neighbor consistency across a student and teacher model, relative to reference batches. Our method leverages a differentiable sorting method applied on top of pretrained representations, such as DINOv2-registers to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. We demonstrate that this method generates high-quality dense feature encoders and establish several new state-of-the-art results: +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, and +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff.

[CV-2] FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

链接: https://arxiv.org/abs/2408.11051
作者: Yunzhe Xu,Yiyuan Pan,Zhe Liu,Hesheng Wang
关键词-EN: Large Language Models, Large Language, Language Models, specialized VLN models, applications face challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel in general conversation scenarios, they struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach implements a three-phase tuning technique for effective adaptation to navigation tasks, including single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets are synthesized automatically. Experimental results demonstrate FLAME’s superiority over existing methods, surpassing state-of-the-art methods by a 7.3% increase in task completion rate on Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, representing an advancement towards practical applications of MLLMs in embodied AI. Project page: this https URL

[CV-3] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

链接: https://arxiv.org/abs/2408.11039
作者: Chunting Zhou,Lili Yu,Arun Babu,Kushal Tirumala,Michihiro Yasunaga,Leonid Shamis,Jacob Kahn,Xuezhe Ma,Luke Zettlemoyer,Omer Levy
关键词-EN: Transfusion, introduce Transfusion, Transfusion models, continuous data, multiple Transfusion models
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 23 pages

点击查看摘要

Abstract:We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
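The combined objective described above, language-modeling cross-entropy on discrete tokens plus a diffusion loss on continuous patches, can be written compactly. This is a minimal numpy sketch with illustrative names and shapes, not the paper's training code.

```python
import numpy as np

def transfusion_loss(text_logits, text_targets, eps_pred, eps_true, w=1.0):
    """Single mixed-modality objective (sketch): next-token cross-entropy
    on discrete text positions plus a noise-prediction MSE diffusion loss
    on continuous image patches, summed with weight w."""
    # next-token prediction: numerically stable softmax cross-entropy
    z = text_logits - text_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    lm = -log_probs[np.arange(len(text_targets)), text_targets].mean()
    # diffusion: predict the noise injected into continuous patch latents
    diffusion = ((eps_pred - eps_true) ** 2).mean()
    return lm + w * diffusion

rng = np.random.default_rng(3)
logits = rng.normal(size=(6, 100))    # 6 text positions, vocab of 100
targets = rng.integers(0, 100, size=6)
eps_true = rng.normal(size=(4, 16))   # 4 image patches, 16-dim latents
eps_pred = eps_true + 0.1 * rng.normal(size=(4, 16))
loss = transfusion_loss(logits, targets, eps_pred, eps_true)
```

Because both terms are summed into one scalar, a single transformer can be optimized over mixed-modality sequences end to end.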

[CV-4] Atmospheric Transport Modeling of CO_2 with Neural Networks

链接: https://arxiv.org/abs/2408.11032
作者: Vitus Benson,Ana Bastos,Christian Reimers,Alexander J. Winkler,Fanny Yang,Markus Reichstein
关键词-EN: international climate agreements, greenhouse gas monitoring, verification support systems, Accurately describing, climate agreements
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Code: this https URL

点击查看摘要

Abstract:Accurately describing the distribution of CO_2 in the atmosphere with atmospheric tracer transport models is essential for greenhouse gas monitoring and verification support systems to aid implementation of international climate agreements. Large deep neural networks are poised to revolutionize weather prediction, which requires 3D modeling of the atmosphere. While similar in this regard, atmospheric transport modeling is subject to new challenges. Both stable predictions for longer time horizons and mass conservation throughout need to be achieved, while I/O plays a larger role compared to computational costs. In this study we explore four different deep neural networks (UNet, GraphCast, Spherical Fourier Neural Operator and SwinTransformer) which have proven state-of-the-art in weather prediction to assess their usefulness for atmospheric tracer transport modeling. For this, we assemble the CarbonBench dataset, a systematic benchmark tailored for machine learning emulators of Eulerian atmospheric transport. Through architectural adjustments, we decouple the performance of our emulators from the distribution shift caused by a steady rise in atmospheric CO_2. More specifically, we center CO_2 input fields to zero mean and then use an explicit flux scheme and a mass fixer to assure mass balance. This design enables stable and mass conserving transport for over 6 months with all four neural network architectures. In our study, the SwinTransformer displays particularly strong emulation skill (90-day R^2 > 0.99), with physically plausible emulation even for forward runs of multiple years. This work paves the way forward towards high resolution forward and inverse modeling of inert trace gases with neural networks.
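The two architectural adjustments named above, zero-mean centering of the inputs and a mass fixer on the outputs, are simple to state. The sketch below uses a uniform additive fixer as one plausible realization; the paper additionally pairs its fixer with an explicit flux scheme, which is not reproduced here.

```python
import numpy as np

def center_input(field):
    """Zero-mean centering of the CO_2 input field, decoupling the
    emulator from the steadily rising background concentration."""
    return field - field.mean()

def mass_fixer(pred, target_mass):
    """Spread the global mass error uniformly over the grid so the
    predicted field's total exactly matches the target mass
    (one simple fixer; other distributions of the error are possible)."""
    return pred + (target_mass - pred.sum()) / pred.size

rng = np.random.default_rng(4)
field = 420.0 + rng.normal(size=(8, 8))        # ppm-like CO_2 field
pred = field + 0.01 * rng.normal(size=(8, 8))  # emulator step drifts slightly
fixed = mass_fixer(pred, target_mass=field.sum())
```

Applying the fixer after every emulator step prevents the small per-step mass drift from compounding over multi-month rollouts.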

[CV-5] OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

链接: https://arxiv.org/abs/2408.11030
作者: Youjun Zhao,Jiaying Lin,Shuquan Ye,Qianshi Pang,Rynson W.H. Lau
关键词-EN: scene understanding, closed object classes, object classes, aims to localize, open vocabulary problem
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient to provide a holistic evaluation to what extent a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, material, and more. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark, and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed by simply scaling up object classes during training. We highlight the limitations of existing methodologies and explore a promising direction to overcome the identified shortcomings. Data and code are available at this https URL

[CV-6] MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

链接: https://arxiv.org/abs/2408.11001
作者: Haoning Wu,Shaocheng Shen,Qiang Hu,Xiaoyun Zhang,Ya Zhang,Yanfeng Wang
关键词-EN: impressive capabilities, emerged as frontrunners, generation, high-resolution image generation, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report. Project Page: this https URL

点击查看摘要

Abstract:Diffusion models have emerged as frontrunners in text-to-image generation for their impressive capabilities. Nonetheless, their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic inaccuracies and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image generation models towards efficient higher-resolution generation without additional fine-tuning or extra adaptation. Specifically, we employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions, allowing for high-resolution image generation in a coarse-to-fine manner. Moreover, by integrating dilated convolutions and noise re-scheduling, we further adapt the model’s priors for higher resolution. The versatility and efficacy of MegaFusion make it universally applicable to both latent-space and pixel-space diffusion models, along with other derivative models. Extensive experiments confirm that MegaFusion significantly boosts the capability of existing models to produce images of megapixels and various aspect ratios, while only requiring about 40% of the original computational cost.

[CV-7] SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining

链接: https://arxiv.org/abs/2408.11000
作者: Jonathan Prexl,Michael Schmitt
关键词-EN: paper introduces SenPa-MAE, image embeddings, paper introduces, transformer architecture, architecture that encodes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: GCPR 2024

点击查看摘要

Abstract:This paper introduces SenPa-MAE, a transformer architecture that encodes the sensor parameters of an observed multispectral signal into the image embeddings. SenPa-MAE can be pre-trained on imagery of different satellites with non-matching spectral or geometrical sensor characteristics. To incorporate sensor parameters, we propose a versatile sensor parameter encoding module as well as a data augmentation strategy for the diversification of the pre-training dataset. This enables the model to effectively differentiate between various sensors and gain an understanding of sensor parameters and the correlation to the observed signal. Given the rising number of Earth observation satellite missions and the diversity in their sensor specifications, our approach paves the way towards a sensor-independent Earth observation foundation model. This opens up possibilities such as cross-sensor training and sensor-independent inference.

[CV-8] Facial Demorphing via Identity Preserving Image Decomposition

链接: https://arxiv.org/abs/2408.10993
作者: Nitish Shukla,Arun Ross
关键词-EN: created by combining, face recognition system, distinct identities, face images, face
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A face morph is created by combining the face images usually pertaining to two distinct identities. The goal is to generate an image that can be matched with two identities, thereby undermining the security of a face recognition system. To deal with this problem, several morph attack detection techniques have been developed. But these methods do not extract any information about the underlying bonafides used to create them. Demorphing addresses this limitation. However, current demorphing techniques are mostly reference-based, i.e., they need an image of one of the identities to recover the other. In this work, we treat demorphing as an ill-posed decomposition problem. We propose a novel method that is reference-free and recovers the bonafides with high accuracy. Our method decomposes the morph into several identity-preserving feature components. A merger network then weighs and combines these components to recover the bonafides. Our method is observed to reconstruct high-quality bonafides in terms of definition and fidelity. Experiments on the CASIA-WebFace, SMDD and AMSL datasets demonstrate the effectiveness of our method.

[CV-9] Multichannel Attention Networks with Ensembled Transfer Learning to Recognize Bangla Handwritten Charecter

链接: https://arxiv.org/abs/2408.10955
作者: Farhanul Haque,Md. Al-Hasan,Sumaiya Tabssum Mou,Abu Saleh Musa Miah,Jungpil Shin,Md Abdur Rahim
关键词-EN: Bengali handwritten character, Bengali character recognition, Chinese character recognition, character recognition, handwritten character recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Bengali is the 5th most spoken native language and the 7th most spoken language in the world, and Bengali handwritten character recognition has attracted researchers for decades. Research on other languages such as English, Arabic, Turkish, and Chinese has contributed significantly to the development of handwriting recognition systems, yet comparatively little has been done on Bengali character recognition because of the similarity between characters, their curvature, and other complexities. Nevertheless, many researchers have used traditional machine learning and deep learning models for Bengali handwritten recognition. This study employed a convolutional neural network (CNN) with ensemble transfer learning and a multichannel attention network. We generated features from two CNN branches, Inception Net and ResNet, and then produced an ensemble feature fusion by concatenating them. After that, we applied the attention module to produce contextual information from the ensemble features. Finally, we applied a classification module to refine the features and perform classification. We evaluated the proposed model using the CAMTERdb 3.1.2 dataset and achieved 92% accuracy for the raw dataset and 98.00% for the preprocessed dataset. We believe our contribution will be considered a significant development in the Bengali handwritten character recognition domain.

[CV-10] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments

链接: https://arxiv.org/abs/2408.10945
作者: Kazi Hasan Ibn Arif,JinYi Yoon,Dimitrios S. Nikolopoulos,Hans Vandierendonck,Deepu John,Bo Ji
关键词-EN: detailed image information, preserving detailed image, High-resolution Vision-Language Models, Large Language Model, multimodal tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA TESLA P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7x, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
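The two-stage selection described above, budget allocation by early-layer attention followed by per-partition top-k selection by final-layer attention, can be sketched as follows. Function names and shapes are illustrative, not the paper's API.

```python
import numpy as np

def hired_select(partition_attn, token_scores, budget):
    """Attention-guided token dropping (sketch of the HiRED idea):
    split a fixed token budget across image partitions in proportion to
    their early-layer attention mass, then keep only the top-scoring
    tokens of each partition under its allocated share.
    partition_attn: (P,) attention mass per partition.
    token_scores: list of P arrays, final-layer importance per token."""
    shares = partition_attn / partition_attn.sum()
    alloc = np.floor(shares * budget).astype(int)      # per-partition budgets
    kept = []
    for k, scores in zip(alloc, token_scores):
        top = np.argsort(-scores)[:min(k, len(scores))]
        kept.append(np.sort(top))                      # preserve token order
    return kept

rng = np.random.default_rng(5)
attn = np.array([0.6, 0.3, 0.1])              # 3 partitions, early-layer mass
scores = [rng.random(100) for _ in range(3)]  # 100 candidate tokens each
kept = hired_select(attn, scores, budget=60)  # keep at most 60 tokens total
```

Because the surviving token count is fixed before the LLM stage, the downstream compute and KV-cache memory are bounded regardless of the input resolution.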

[CV-11] A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object Detection

链接: https://arxiv.org/abs/2408.10940
作者: Vladislav Li,Georgios Tsoumplekas,Ilias Siniosoglou,Vasileios Argyriou,Anastasios Lytos,Eleftherios Fountoukidis,Panagiotis Sarigiannidis
关键词-EN: few-shot object detection, Current methods, detection have primarily, primarily focused, focused on enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Current methods for low- and few-shot object detection have primarily focused on enhancing model performance for detecting objects. One common approach to achieve this is by combining model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper seeks to conduct a comprehensive empirical study that examines both model performance and energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated in three different benchmark datasets in terms of their performance and energy consumption, and the Efficiency Factor is employed to gain insights into their effectiveness considering both performance and efficiency. Consequently, it is shown that in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy efficient data augmentation strategies to address data scarcity.

[CV-12] Large Point-to-Gaussian Model for Image-to-3D Generation ACM-MM2024

Link: https://arxiv.org/abs/2408.10935
Authors: Longfei Lu,Huachen Gao,Tao Dai,Yaohua Zha,Zhi Hou,Junta Wu,Shu-Tao Xia
Keywords-EN: Gaussian reconstruction models, large reconstruction models, reconstruction models, Gaussian reconstruction, Gaussian parameters
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 10 pages, 9 figures, ACM MM 2024

Click to view abstract

Abstract:Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map a 2D image to 3D Gaussian parameters, yet regressing a 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model for image-to-3D generation that takes as input an initial point cloud, produced by a large 3D diffusion model conditioned on the 2D image, and generates the Gaussian parameters. The point cloud provides an initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D generation. Moreover, we present the **Attention** mechanism, **Projection** mechanism, and **Point** feature extractor, dubbed the **APP** block, for fusing the image features with point cloud features. Qualitative and quantitative experiments extensively demonstrate the effectiveness of the proposed approach on the GSO and Objaverse datasets, and show that the proposed method achieves state-of-the-art performance.

[CV-13] SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement

Link: https://arxiv.org/abs/2408.10934
Authors: Linlin Hu,Ao Sun,Shijie Hao,Richang Hong,Meng Wang
Keywords-EN: stereo image enhancement, low-light stereo image, image enhancement, low-light image enhancement, image enhancement methods
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-view disparities and enable interaction between the left and right views, leading to improved performance. However, these methods still do not fully exploit the interaction between left and right view information. To address this issue, we propose a model called Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement (SDI-Net). The backbone structure of SDI-Net is two encoder-decoder pairs, which are used to learn the mapping function from low-light images to normal-light images. Among the encoders and the decoders, we design a module named Cross-View Sufficient Interaction Module (CSIM), aiming to fully exploit the correlations between the binocular views via the attention mechanism. The quantitative and visual results on public datasets validate the superiority of our method over other related methods. Ablation studies also demonstrate the effectiveness of the key elements in our model.

[CV-14] CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

Link: https://arxiv.org/abs/2408.10919
Authors: Zijian Zhao,Tingwei Chen,Zhijie Cai,Hang Li,Xiaoyang Li,Qimei Chen,Guangxu Zhu
Keywords-EN: garnered significant attention, significant attention due, low cost, recent years, numerous benefits
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain and cross-domain scenarios, including few-shot and zero-shot settings, and even works in a few-shot new-class scenario where the testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that CrossFi can work in different scenarios. Experimental results demonstrate that CrossFi achieves state-of-the-art performance across various scenarios. In the gesture recognition task, CrossFi achieves an accuracy of 98.17% in the in-domain scenario, 91.72% in the one-shot cross-domain scenario, 64.81% in the zero-shot cross-domain scenario, and 84.75% in the one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.
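The idea of scoring similarity with attention rather than a single cosine can be illustrated with a toy NumPy stand-in (our own simplification; CSi-Net itself is a learned network):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_similarity(a, b):
    """Toy stand-in for the attention-based similarity idea: each
    feature row of sample `a` attends over the rows of sample `b`, and
    we average the cosine agreement with the attended context.
    a, b: arrays of shape (n_features, d)."""
    attn = softmax(a @ b.T / np.sqrt(a.shape[1]))  # (n, n) cross-attention
    b_ctx = attn @ b                               # b summarised per a-row
    num = (a * b_ctx).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b_ctx, axis=1) + 1e-8
    return float((num / den).mean())
```

Unlike a plain cosine between pooled vectors, this lets matching sub-features align before agreement is measured.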

[CV-15] ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Link: https://arxiv.org/abs/2408.10906
Authors: Qi Ma,Yue Li,Bin Ren,Nicu Sebe,Ender Konukoglu,Theo Gevers,Luc Van Gool,Danda Pani Paudel
Keywords-EN: Gaussian Splatting, facto method, Splatting, Gaussian, Gaussian parameters
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. This calls for 3D understanding directly in this representation space. To facilitate research in this direction, we first build a large-scale dataset of 3DGS using the commonly used ShapeNet and ModelNet datasets. Our dataset ShapeSplat consists of 65K objects from 87 unique categories, whose labels are in accordance with the respective datasets. The creation of this dataset utilized the compute equivalent of 2 GPU years on a TITAN XP GPU. We utilize our dataset for unsupervised pretraining and supervised finetuning for classification and segmentation tasks. To this end, we introduce **Gaussian-MAE**, which highlights the unique benefits of representation learning from Gaussian parameters. Through exhaustive experiments, we provide several valuable insights. In particular, we show that (1) the distribution of the optimized GS centroids significantly differs from the uniformly sampled point cloud (used for initialization) counterpart; (2) this change in distribution results in degradation in classification but improvement in segmentation tasks when using only the centroids; (3) to leverage additional Gaussian parameters, we propose Gaussian feature grouping in a normalized feature space, along with a splats pooling layer, offering a tailored solution to effectively group and embed similar Gaussians, which leads to notable improvement in finetuning tasks.

[CV-16] A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

Link: https://arxiv.org/abs/2408.10901
Authors: Zhongliang Guo,Lei Fang,Jingyu Lin,Yifei Qian,Shuai Zhao,Zeyu Wang,Junhao Dong,Cunjian Chen,Ognjen Arandjelović,Chun Pong Lau
Keywords-EN: Latent Diffusion Models, Latent Diffusion, Recent advancements, revolutionized image synthesis, Diffusion Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 21 pages, 7 figures, 10 tables

Click to view abstract

Abstract:Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has repurposed these techniques as a benign safeguard against the misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade the semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA), based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on white-box information about target models, shedding the implicit reliance on model-specific knowledge. By accessing merely a small amount of LDM parameters, specifically only the VAE encoder of LDMs, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that helps alleviate the socio-technical challenges posed by the rapidly evolving landscape of generative AI.

[CV-17] ViLReF: A Chinese Vision-Language Retinal Foundation Model

Link: https://arxiv.org/abs/2408.10894
Authors: Shengzhu Yang,Jiawei Du,Jia Guo,Weihang Zhang,Hanruo Liu,Huiqi Li,Ningli Wang
Keywords-EN: Subtle semantic differences, data present great, present great challenges, text data present, Subtle semantic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Subtle semantic differences in retinal image and text data present great challenges for pre-training visual-language models. Moreover, false negative samples, i.e., image-text pairs having the same semantics but incorrectly regarded as negatives, disrupt the visual-language pre-training process and affect the model’s learning ability. This work aims to develop a retinal foundation model, called ViLReF, by pre-training on a paired dataset comprising 451,956 retinal images and corresponding diagnostic text reports. In our vision-language pre-training strategy, we leverage expert knowledge to facilitate the extraction of labels and propose a novel constraint, the Weighted Similarity Coupling Loss, to adjust the speed of pushing sample pairs further apart dynamically within the feature space. Furthermore, we employ a batch expansion module with dynamic memory queues, maintained by momentum encoders, to supply extra samples and compensate for the vacancies caused by eliminating false negatives. Extensive experiments are conducted on multiple datasets for downstream classification and segmentation tasks. The experimental results demonstrate the powerful zero-shot and transfer learning capabilities of ViLReF, verifying the effectiveness of our pre-training strategy. Our ViLReF model is available at: this https URL.

[CV-18] Low-Quality Image Detection by Hierarchical VAE ICCV2023

Link: https://arxiv.org/abs/2408.10885
Authors: Tomoyasu Nanaumi,Kazuhiko Kawamoto,Hiroshi Kera
Keywords-EN: collect high-quality images, photo album, employee roster, generative models, make an employee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: ICCV 2023, Workshop on Uncertainty Estimation for Computer Vision

Click to view abstract

Abstract:To make an employee roster, photo album, or training dataset of generative models, one needs to collect high-quality images while dismissing low-quality ones. This study addresses a new task of unsupervised detection of low-quality images. We propose a method that not only detects low-quality images with various types of degradation but also provides visual clues of them based on an observation that partial reconstruction by hierarchical variational autoencoders fails for low-quality images. The experiments show that our method outperforms several unsupervised out-of-distribution detection methods and also gives visual clues for low-quality images that help humans recognize them even in thumbnail view.
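The detection principle, flagging images whose (partial) reconstruction error is high, can be sketched with the hierarchical VAE abstracted behind a `reconstruct` callable (a hypothetical interface for illustration, not the authors' code):

```python
import numpy as np

def quality_score(image, reconstruct):
    """Per-pixel squared error between an image and its (partial)
    reconstruction. Returns a scalar score and the error map, which
    serves as a visual clue to where the degradation is."""
    err = (np.asarray(image, float) - np.asarray(reconstruct(image), float)) ** 2
    return float(err.mean()), err

def flag_low_quality(images, reconstruct, threshold):
    """Indices of images whose reconstruction error exceeds the threshold."""
    return [i for i, im in enumerate(images)
            if quality_score(im, reconstruct)[0] > threshold]
```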

[CV-19] DAAD: Dynamic Analysis and Adaptive Discriminator for Fake News Detection

Link: https://arxiv.org/abs/2408.10883
Authors: Xinqi Su,Yawen Cui,Ajian Liu,Xun Lin,Yuhao Wang,Haochen Liang,Wenhui Li,Zitong Yu
Keywords-EN: current web environment, online social networks, web environment, social networks, posing serious threats
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In the current web environment, fake news spreads rapidly across online social networks, posing serious threats to society. Existing multimodal fake news detection (MFND) methods can be classified into knowledge-based and semantic-based approaches. However, these methods are overly dependent on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. For knowledge-based methods, we introduce the Monte Carlo Tree Search (MCTS) algorithm to leverage the self-reflective capabilities of large language models (LLMs) for prompt optimization, providing richer, domain-specific details and guidance to the LLMs, while enabling more flexible integration of LLM commentary on news content. For semantic-based methods, we define four typical deceit patterns: emotional exaggeration, logical inconsistency, image manipulation, and semantic inconsistency, to reveal the mechanisms behind fake news creation. To detect these patterns, we carefully design four discriminators and expand them in depth and breadth, using a soft-routing mechanism to explore optimal detection models. Experimental results on three real-world datasets demonstrate the superiority of our approach. The code will be available at: this https URL.
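The soft-routing mechanism over the four pattern discriminators can be illustrated as a simple softmax gate (a minimal sketch; in DAAD the gate would be learned, and the weights here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, float) - np.max(x))
    return e / e.sum()

def soft_route(discriminator_scores, gate_logits):
    """Blend per-pattern discriminator outputs (e.g. emotional
    exaggeration, logical inconsistency, image manipulation, semantic
    inconsistency) with softmax gate weights into one fake-news score."""
    weights = softmax(gate_logits)
    return float(weights @ np.asarray(discriminator_scores, float))
```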

[CV-20] Open 3D World in Autonomous Driving

Link: https://arxiv.org/abs/2408.10880
Authors: Xinlong Cheng,Lei Li
Keywords-EN: open vocabulary perception, vocabulary perception represents, open vocabulary, facilitating the comprehension, represents a significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:The capability for open vocabulary perception represents a significant advancement in autonomous driving systems, facilitating the comprehension and interpretation of a wide array of textual inputs in real-time. Despite extensive research in open vocabulary tasks within 2D computer vision, the application of such methodologies to 3D environments, particularly within large-scale outdoor contexts, remains relatively underdeveloped. This paper presents a novel approach that integrates 3D point cloud data, acquired from LIDAR sensors, with textual information. The primary focus is on the utilization of textual data to directly localize and identify objects within the autonomous driving context. We introduce an efficient framework for the fusion of bird’s-eye view (BEV) region features with textual features, thereby enabling the system to seamlessly adapt to novel textual inputs and enhancing the robustness of open vocabulary detection tasks. The effectiveness of the proposed methodology is rigorously evaluated through extensive experimentation on the newly introduced NuScenes-T dataset, with additional validation of its zero-shot performance on the Lyft Level 5 dataset. This research makes a substantive contribution to the advancement of autonomous driving technologies by leveraging multimodal data to enhance open vocabulary perception in 3D environments, thereby pushing the boundaries of what is achievable in autonomous navigation and perception.
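The final matching step of such a BEV-text fusion can be sketched as CLIP-style cosine matching between region features and text embeddings (our assumption about the classification step; names are illustrative):

```python
import numpy as np

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def open_vocab_classify(bev_features, text_features, labels):
    """Assign each BEV region the label whose text embedding is closest
    by cosine similarity, so novel text queries need no retraining.
    bev_features: (regions, d); text_features: (classes, d)."""
    sims = normalize(bev_features) @ normalize(text_features).T
    return [labels[i] for i in sims.argmax(axis=1)]
```

Adding a new category at inference time only requires embedding its text prompt.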

[CV-21] V-RoAst: A New Dataset for Visual Road Assessment

Link: https://arxiv.org/abs/2408.10872
Authors: Natchapon Jongwiriyanurak,Zichao Zeng,June Moh Goo,Xinglei Wang,Ilya Ilyankou,Kerkritt Srirrongvikrai,Meihui Wang,James Haworth
Keywords-EN: Convolutional Neural Networks, Road traffic crashes, significant economic impact, Vision Language Models, traditional Convolutional Neural
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*Comments:

Click to view abstract

Abstract:Road traffic crashes cause millions of deaths annually and have a significant economic impact, particularly in low- and middle-income countries (LMICs). This paper presents an approach using Vision Language Models (VLMs) for road safety assessment, overcoming the limitations of traditional Convolutional Neural Networks (CNNs). We introduce a new task, V-RoAst (Visual question answering for Road Assessment), with a real-world dataset. Our approach optimizes prompt engineering and evaluates advanced VLMs, including Gemini-1.5-flash and GPT-4o-mini. The models effectively examine attributes for road assessment. Using crowdsourced imagery from Mapillary, our scalable solution estimates road safety levels. In addition, this approach is designed for local stakeholders who lack resources, as it does not require training data. It offers a cost-effective and automated method for global road safety assessments, potentially saving lives and reducing economic burdens.

[CV-22] Perception-guided Jailbreak against Text-to-Image Models

Link: https://arxiv.org/abs/2408.10848
Authors: Yihao Huang,Le Liang,Tianlin Li,Xiaojun Jia,Run Wang,Weikai Miao,Geguang Pu,Yang Liu
Keywords-EN: garnered significant attention, significant attention due, recent years, remarkable advancements, garnered significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 8 pages

Click to view abstract

Abstract:In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.
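The substitution idea itself reduces to string replacement; note that in PGJ the perception-aligned phrases are proposed by an LLM, whereas the table below is a hand-written, hypothetical example:

```python
# Hypothetical perception-aligned substitutions. PGJ derives these with
# an LLM; this fixed table only illustrates the mechanism.
SAFE_SUBSTITUTES = {"blood": "red viscous liquid"}

def perception_guided_rewrite(prompt, substitutes=SAFE_SUBSTITUTES):
    """Replace each unsafe word with a phrase that differs in text
    semantics but evokes a similar human perception."""
    for unsafe, safe in substitutes.items():
        prompt = prompt.replace(unsafe, safe)
    return prompt
```

The rewritten prompt contains no flagged vocabulary, which is why the attack is model-free and black-box.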

[CV-23] Harmonizing Attention: Training-free Texture-aware Geometry Transfer

Link: https://arxiv.org/abs/2408.10846
Authors: Eito Ikuta,Yohan Lee,Akihiro Iohara,Yu Saito,Toshiyuki Tanaka
Keywords-EN: Extracting geometry features, photographic images independently, Extracting geometry, complex challenge, independently of surface
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*Comments: 10 pages, 6 figures

Click to view abstract

Abstract:Extracting geometry features from photographic images independently of surface texture and transferring them onto different materials remains a complex challenge. In this study, we introduce Harmonizing Attention, a novel training-free approach that leverages diffusion models for texture-aware geometry transfer. Our method employs a simple yet effective modification of self-attention layers, allowing the model to query information from multiple reference images within these layers. This mechanism is seamlessly integrated into the inversion process as Texture-aligning Attention and into the generation process as Geometry-aligning Attention. This dual-attention approach ensures the effective capture and transfer of material-independent geometry features while maintaining material-specific textural continuity, all without the need for model fine-tuning.
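Querying multiple reference images inside self-attention amounts to extending the key/value set, which a minimal NumPy sketch makes concrete (illustrative only; the actual method modifies the attention layers of a diffusion model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def harmonized_attention(q, kv_self, kv_refs):
    """Self-attention whose keys/values are extended with features from
    reference images, letting the generated image pull geometry or
    texture information from them. All inputs: (tokens, d)."""
    kv = np.concatenate([kv_self] + list(kv_refs), axis=0)
    attn = softmax(q @ kv.T / np.sqrt(q.shape[1]))
    return attn @ kv
```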

[CV-24] CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Link: https://arxiv.org/abs/2408.10845
Authors: Hidehisa Arai,Keita Miwa,Kento Sasaki,Yu Yamaguchi,Kohei Watanabe,Shunsuke Aoki,Issei Yamamoto
Keywords-EN: demands sophisticated reasoning, Multi-modal Large Language, demands sophisticated, sophisticated reasoning, Multi-modal Large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.

[CV-25] Aligning Object Detector Bounding Boxes with Human Preference ECCV2024

Link: https://arxiv.org/abs/2408.10844
Authors: Ombretta Strafforello,Osman S. Kayhan,Oana Inel,Klamer Schutte,Jan van Gemert
Keywords-EN: Previous work shows, Previous work, Previous, object, boxes
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted paper at the ECCV 2024 workshop on Assistive Computer Vision and Robotics (ACVR)

Click to view abstract

Abstract:Previous work shows that humans tend to prefer large bounding boxes over small bounding boxes with the same IoU. However, we show here that commonly used object detectors predict large and small boxes equally often. In this work, we investigate how to align automatically detected object boxes with human preference and study whether this improves human quality perception. We evaluate the performance of three commonly used object detectors through a user study (N = 123). We find that humans prefer object detections that are upscaled with factors of 1.5 or 2, even if the corresponding AP is close to 0. Motivated by this result, we propose an asymmetric bounding box regression loss that encourages large over small predicted bounding boxes. Our evaluation study shows that object detectors fine-tuned with the asymmetric loss are better aligned with human preference and are preferred over fixed scaling factors. A qualitative evaluation shows that human preference might be influenced by some object characteristics, like object shape.
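An asymmetric size loss of the kind described, costing under-sized predictions more than over-sized ones, can be sketched as follows (the weights and exact form are our assumptions, not the paper's loss):

```python
def asymmetric_size_loss(pred_wh, target_wh, under_weight=2.0, over_weight=1.0):
    """Squared-error loss on box width/height that penalizes
    under-sized predictions more than over-sized ones, nudging the
    detector toward the larger boxes humans prefer.
    pred_wh / target_wh: (width, height) pairs in pixels."""
    loss = 0.0
    for p, t in zip(pred_wh, target_wh):
        diff = p - t
        loss += (under_weight if diff < 0 else over_weight) * diff ** 2
    return loss
```

With `under_weight > over_weight`, gradient descent settles on predictions that err on the large side.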

[CV-26] Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models

Link: https://arxiv.org/abs/2408.10843
Authors: Julius Pesonen,Teemu Hakala,Väinö Karjalainen,Niko Koivumäki,Lauri Markelin,Anna-Maria Raita-Hakola,Juha Suomalainen,Ilkka Pölönen,Eija Honkavaara
Keywords-EN: prevent large-scale fires, large-scale fires resulting, Early detection, extensive environmental, societal damage
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Early detection of wildfires is essential to prevent large-scale fires resulting in extensive environmental, structural, and societal damage. Uncrewed aerial vehicles (UAVs) can cover large remote areas effectively with quick deployment and minimal infrastructure, and equipping them with small cameras and computers enables autonomous real-time detection. In remote areas, however, the UAVs are limited to on-board computing for detection due to the lack of high-bandwidth mobile networks. This limits detection to methods that are light enough for the on-board computer alone. For accurate camera-based localisation, segmentation of the detected smoke is essential, but training data for deep learning-based wildfire smoke segmentation is limited. This study shows how small specialised segmentation models can be trained using only bounding box labels, leveraging zero-shot foundation model supervision. The method offers the advantages of needing only fairly easily obtainable bounding box labels and requiring training solely for the smaller student network. The proposed method achieved 63.3% mIoU on a manually annotated and diverse wildfire dataset. The model can perform in real-time at ~11 fps with a UAV-carried NVIDIA Jetson Orin NX computer while reliably recognising smoke, demonstrated at real-world forest burning events. Code is available at this https URL

[CV-27] ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data

Link: https://arxiv.org/abs/2408.10831
Authors: Elia Bonetto,Aamir Ahmad
Keywords-EN: deep learning tasks, pose estimation, address the lack, deep learning, Synthetic data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments: 8 pages, 5 tables, 7 figures

Click to view abstract

Abstract:Synthetic data is increasingly being used to address the lack of labeled images in uncommon domains for deep learning tasks. A prominent example is 2D pose estimation of animals, particularly wild species like zebras, for which collecting real-world data is complex and impractical. However, many approaches still require real images, consistency and style constraints, sophisticated animal models, and/or powerful pre-trained networks to bridge the syn-to-real gap. Moreover, they often assume that the animal can be reliably detected in images or videos, a hypothesis that often does not hold, e.g. in wildlife scenarios or aerial images. To solve this, we use synthetic data generated with a 3D photorealistic simulator to obtain the first synthetic dataset that can be used for both detection and 2D pose estimation of zebras without applying any of the aforementioned bridging strategies. Unlike previous works, we extensively train and benchmark our detection and 2D pose estimation models on multiple real-world and synthetic datasets using both pre-trained and non-pre-trained backbones. These experiments show how models trained from scratch and only with synthetic data can consistently generalize to real-world images of zebras in both tasks. Moreover, we show it is possible to easily generalize those same models to 2D pose estimation of horses with a minimal amount of real-world images to account for the domain transfer. Code, results, trained models, and the synthetic, training, and validation data, including 104K manually labeled frames, are provided as open-source at this https URL

[CV-28] Trustworthy Compression? Impact of AI-based Codecs on Biometrics for Law Enforcement

Link: https://arxiv.org/abs/2408.10823
Authors: Sandra Bergmann,Denise Moussa,Christian Riess
Keywords-EN: aid law enforcement, Image-based biometrics, Image-based, aid law, law enforcement
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments:

Click to view abstract

Abstract:Image-based biometrics can aid law enforcement in various aspects, for example in iris, fingerprint and soft-biometric recognition. A critical precondition for recognition is the availability of sufficient biometric information in images. It is visually apparent that strong JPEG compression removes such details. However, latest AI-based image compression seemingly preserves many image details even for very strong compression factors. Yet, these perceived details are not necessarily grounded in measurements, which raises the question whether these images can still be used for biometric recognition. In this work, we investigate how AI compression impacts iris, fingerprint and soft-biometric (fabrics and tattoo) images. We also investigate the recognition performance for iris and fingerprint images after AI compression. It turns out that iris recognition can be strongly affected, while fingerprint recognition is quite robust. The loss of detail is qualitatively best seen in fabrics and tattoos images. Overall, our results show that AI-compression still permits many biometric tasks, but attention to strong compression factors in sensitive tasks is advisable.

[CV-29] Constructing a High Temporal Resolution Global Lakes Dataset via Swin-Unet with Applications to Area Prediction

Link: https://arxiv.org/abs/2408.10821
Authors: Yutian Han,Baoxiang Huang,He Gao
Keywords-EN: valuable ecosystem services, biodiversity habitats, lake area changes, ecosystem services, water supply
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Lakes provide a wide range of valuable ecosystem services, such as water supply, biodiversity habitats, and carbon sequestration. However, lakes are increasingly threatened by climate change and human activities. Therefore, continuous global monitoring of lake dynamics is crucial, but remains challenging on a large scale. The recently developed Global Lakes Area Database (GLAKES) has mapped over 3.4 million lakes worldwide, but it only provides data at decadal intervals, which may be insufficient to capture rapid or short-term changes. This paper introduces an expanded lake database, GLAKES-Additional, which offers biennial delineations and area measurements for 152,567 lakes globally from 1990 to 2021. We employed the Swin-Unet model, replacing traditional convolution operations, to effectively address the challenges posed by the receptive field requirements of high spatial resolution satellite imagery. The increased biennial time resolution helps to quantitatively attribute lake area changes to climatic and hydrological drivers, such as precipitation and temperature changes. For predicting lake area changes, we used a Long Short-Term Memory (LSTM) neural network and an extended time series dataset for preliminary modeling. Under climate and land use scenarios, our model achieved an RMSE of 0.317 km² in predicting future lake area changes.
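Framing a biennial area series for a sequence model such as an LSTM, and scoring with RMSE as above, reduces to simple windowing (a generic sketch of the data preparation, not the authors' pipeline):

```python
import numpy as np

def make_windows(series, lookback):
    """Turn an area time series into supervised (X, y) pairs:
    predict the next value from the previous `lookback` values."""
    X = [series[i:i + lookback] for i in range(len(series) - lookback)]
    y = [series[i + lookback] for i in range(len(series) - lookback)]
    return np.array(X), np.array(y)

def rmse(pred, true):
    """Root-mean-square error, the metric reported by the paper."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(true)) ** 2)))
```

Each `X` row (optionally joined with precipitation and temperature covariates) would feed one LSTM input sequence.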

[CV-30] MPL: Lifting 3D Human Pose from Multi-view 2D Poses ECCV

Link: https://arxiv.org/abs/2408.10805
Authors: Seyed Abolfazl Ghasemzadeh,Alexandre Alahi,Christophe De Vleeschouwer
Keywords-EN: projective acquisition, occlusions and projective, Estimating, challenging due, due to occlusions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 14 pages, accepted in ECCV T-CAP 2024, code: this https URL

Click to view abstract

Abstract:Estimating 3D human poses from 2D images is challenging due to occlusions and projective acquisition. Learning-based approaches have been largely studied to address this challenge, both in single- and multi-view setups. These solutions however fail to generalize to real-world cases due to the lack of (multi-view) ‘in-the-wild’ images paired with 3D poses for training. For this reason, we propose combining 2D pose estimation, for which large and rich training datasets exist, with 2D-to-3D pose lifting, using a transformer-based network that can be trained from synthetic 2D-3D pose pairs. Our experiments demonstrate decreases of up to 45% in MPJPE errors compared to the 3D pose obtained by triangulating the 2D poses. The framework’s source code is available at this https URL .
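The MPJPE metric used for this comparison is simply the mean Euclidean distance over joints, and is easy to state exactly:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance
    between predicted and ground-truth 3D joints.
    pred / gt: arrays of shape (joints, 3)."""
    return float(np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1).mean())
```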

[CV-31] Tapping in a Remote Vehicle's onboard LLM to Complement the Ego Vehicle's Field-of-View MICRO

链接: https://arxiv.org/abs/2408.10794
作者: Malsha Ashani Mahawatta Dona,Beatriz Cabrero-Daniel,Yinan Yu,Christian Berger
关键词-EN: Today advanced automotive, bringing computational intelligence, intelligent Cyber-Physical Systems, Today advanced, advanced automotive systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 50th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA) 2024 - WiP

点击查看摘要

Abstract:Today’s advanced automotive systems are turning into intelligent Cyber-Physical Systems (CPS), bringing computational intelligence to their cyber-physical context. Such systems power advanced driver assistance systems (ADAS) that observe a vehicle’s surroundings for their functionality. However, such ADAS have clear limitations in scenarios where the direct line-of-sight to surrounding objects is occluded, like in urban areas. Imagine now automated driving (AD) systems that ideally could benefit from other vehicles’ field-of-view in such occluded situations to increase traffic safety if, for example, information about pedestrians’ locations can be shared across vehicles. Current literature suggests vehicle-to-infrastructure (V2I) via roadside units (RSUs) or vehicle-to-vehicle (V2V) communication, which stream sensor or object data between vehicles, to address such issues. When considering the ongoing revolution in vehicle system architectures towards powerful, centralized processing units with hardware accelerators, foreseeing the onboard presence of large language models (LLMs) to improve the passengers’ comfort when using voice assistants becomes a reality. We are suggesting and evaluating a concept to complement the ego vehicle’s field-of-view (FOV) with another vehicle’s FOV by tapping into their onboard LLM to let the machines have a dialogue about what the other vehicle "sees". Our results show that very recent versions of LLMs, such as GPT-4V and GPT-4o, understand a traffic situation to an impressive level of detail, and hence, they can be used even to spot traffic participants. However, better prompts are needed to improve the detection quality, and future work is needed towards a standardised message interchange format between vehicles.

[CV-32] Learning Part-aware 3D Representations by Fusing 2D Gaussians and Superquadrics

链接: https://arxiv.org/abs/2408.10789
作者: Zhirui Gao,Renjiao Yi,Yuhang Huang,Wei Chen,Chenyang Zhu,Kai Xu
关键词-EN: Low-level, point clouds, Gaussians, scenes, objects or scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Low-level 3D representations, such as point clouds, meshes, NeRFs, and 3D Gaussians, are commonly used to represent 3D objects or scenes. However, humans usually perceive 3D objects or scenes at a higher level as a composition of parts or structures rather than points or voxels. Representing 3D as semantic parts can benefit further understanding and applications. We aim to solve part-aware 3D reconstruction, which parses objects or scenes into semantic parts. In this paper, we introduce a hybrid representation of superquadrics and 2D Gaussians, trying to dig 3D structural clues from multi-view image inputs. Accurate structured geometry reconstruction and high-quality rendering are achieved at the same time. We incorporate parametric superquadrics in mesh forms into 2D Gaussians by attaching Gaussian centers to faces in meshes. During the training, superquadrics parameters are iteratively optimized, and Gaussians are deformed accordingly, resulting in an efficient hybrid representation. On the one hand, this hybrid representation inherits the advantage of superquadrics to represent different shape primitives, supporting flexible part decomposition of scenes. On the other hand, 2D Gaussians are incorporated to model the complex texture and geometry details, ensuring high-quality rendering and geometry reconstruction. The reconstruction is fully unsupervised. We conduct extensive experiments on data from DTU and ShapeNet datasets, in which the method decomposes scenes into reasonable parts, outperforming existing state-of-the-art approaches.

[CV-33] LightMDETR: A Lightweight Approach for Low-Cost Open-Vocabulary Object Detection Training

链接: https://arxiv.org/abs/2408.10787
作者: Binta Sow,Bilal Faye,Hanane Azzag,Mustapha Lebbah
关键词-EN: computer vision traditionally, vision traditionally involves, traditionally involves identifying, involves identifying objects, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Object detection in computer vision traditionally involves identifying objects in images. By integrating textual descriptions, we enhance this process, providing better context and accuracy. The MDETR model significantly advances this by combining image and text data for more versatile object detection and classification. However, MDETR’s complexity and high computational demands hinder its practical use. In this paper, we introduce Lightweight MDETR (LightMDETR), an optimized MDETR variant designed for improved computational efficiency while maintaining robust multimodal capabilities. Our approach involves freezing the MDETR backbone and training a sole component, the Deep Fusion Encoder (DFE), to represent image and text modalities. A learnable context vector enables the DFE to switch between these modalities. Evaluation on datasets like RefCOCO, RefCOCO+, and RefCOCOg demonstrates that LightMDETR achieves superior precision and accuracy.

[CV-34] Just a Hint: Point-Supervised Camouflaged Object Detection ECCV2024

链接: https://arxiv.org/abs/2408.10777
作者: Huafeng Chen,Dian Shao,Guangqian Guo,Shan Gao
关键词-EN: Camouflaged Object Detection, accurately distinguish objects, Object Detection, expeditiously and accurately, accurately distinguish
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Camouflaged Object Detection (COD) demands models to expeditiously and accurately distinguish objects which conceal themselves seamlessly in the environment. Owing to the subtle differences and ambiguous boundaries, COD is a remarkably challenging task not only for models but also for human annotators, requiring huge efforts to provide pixel-wise annotations. To alleviate the heavy annotation burden, we propose to fulfill this task with the help of only one point supervision. Specifically, by swiftly clicking on each object, we first adaptively expand the original point-based annotation to a reasonable hint area. Then, to avoid partial localization around discriminative parts, we propose an attention regulator to scatter model attention to the whole object through partially masking labeled regions. Moreover, to solve the unstable feature representation of camouflaged objects under only point-based annotation, we perform unsupervised contrastive learning based on differently augmented image pairs (e.g., color jittering or translation). On three mainstream COD benchmarks, experimental results show that our model outperforms several weakly-supervised methods by a large margin across various metrics.

[CV-35] Generative AI in Industrial Machine Vision – A Review

链接: https://arxiv.org/abs/2408.10775
作者: Hans Aoyang Zhou,Dominik Wolfschläger,Constantinos Florides,Jonas Werheid,Hannes Behnen,Jan-Henrick Woltersmann,Tiago C. Pinto,Marco Kemmerling,Anas Abdelrazeq,Robert H. Schmitt
关键词-EN: vision enhances automation, Machine vision, industrial machine vision, gls, generative
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 44 pages, 7 figures, This work has been submitted to the Journal of Intelligent Manufacturing

点击查看摘要

Abstract:Machine vision enhances automation, quality control, and operational efficiency in industrial applications by enabling machines to interpret and act on visual data. While traditional computer vision algorithms and approaches remain widely utilized, machine learning has become pivotal in current research activities. In particular, generative AI demonstrates promising potential by improving pattern recognition capabilities, through data augmentation, increasing image resolution, and identifying anomalies for quality control. However, the application of generative AI in machine vision is still in its early stages due to challenges in data diversity, computational requirements, and the necessity for robust validation methods. A comprehensive literature review is essential to understand the current state of generative AI in industrial machine vision, focusing on recent advancements, applications, and research trends. Thus, a literature review based on the PRISMA guidelines was conducted, analyzing over 1,200 papers on generative AI in industrial machine vision. Our findings reveal various patterns in current research, with the primary use of generative AI being data augmentation, for machine vision tasks such as classification and object detection. Furthermore, we gather a collection of application challenges together with data requirements to enable a successful application of generative AI in industrial machine vision. This overview aims to provide researchers with insights into the different areas and applications within current research, highlighting significant advancements and identifying opportunities for future work.

[CV-36] Detection of Intracranial Hemorrhage for Trauma Patients

链接: https://arxiv.org/abs/2408.10768
作者: Antoine P. Sanner,Nils F. Grauhan,Marc A. Brockmann,Ahmed E. Othman,Anirban Mukhopadhyay
关键词-EN: multi-trauma patients, bounding boxes, Whole-body, Deep Learning approach, intracranial hemorrhages
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Whole-body CT is used for multi-trauma patients in search of any and all injuries. Since an initial assessment needs to be rapid and the search for lesions covers the whole body, very little time can be allocated to inspecting a specific anatomy. In particular, intracranial hemorrhages are still missed, especially by clinical students. In this work, we present a Deep Learning approach for highlighting such lesions to improve diagnostic accuracy. While most works on intracranial hemorrhages perform segmentation, detection only requires bounding boxes for the localization of the bleeding. In this paper, we propose a novel Voxel-Complete IoU (VC-IoU) loss that encourages the network to learn the 3D aspect ratios of bounding boxes and leads to more precise detections. We extensively experiment on brain bleeding detection using a publicly available dataset, and validate it on a private cohort, where we achieve 0.877 AR30, 0.728 AP30 and 0.653 AR30, 0.514 AP30, respectively. These results constitute a relative +5% improvement in Average Recall for both datasets compared to other loss functions. Finally, as there is little data currently publicly available for 3D object detection and as annotation resources are limited in the clinical setting, we evaluate the cost of different annotation methods, as well as the impact of imprecise bounding boxes in the training data on detection performance.
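
The proposed VC-IoU builds on the plain intersection-over-union of 3D boxes. The standard axis-aligned 3D IoU it extends can be sketched as follows (this is the generic formulation, not the paper's VC-IoU loss):

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return float(inter / (vol_a + vol_b - inter))

# Two unit cubes offset by half along x: intersection 0.5, union 1.5.
overlap = iou_3d((0, 0, 0, 1, 1, 1), (0.5, 0, 0, 1.5, 1, 1))  # 0.5 / 1.5 = 1/3
```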

[CV-37] SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection ECCV2024

链接: https://arxiv.org/abs/2408.10760
作者: Huafeng Chen,Pengxu Wei,Guangqian Guo,Shan Gao
关键词-EN: Camouflaged Object Detection, Object Detection, methods heavily rely, camouflaged object labels, Camouflaged Object
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Most Camouflaged Object Detection (COD) methods heavily rely on mask annotations, which are time-consuming and labor-intensive to acquire. Existing weakly-supervised COD approaches exhibit significantly inferior performance compared to fully-supervised methods and struggle to simultaneously support all the existing types of camouflaged object labels, including scribbles, bounding boxes, and points. Even for Segment Anything Model (SAM), it is still problematic to handle the weakly-supervised COD and it typically encounters challenges of prompt compatibility of the scribble labels, extreme response, semantically erroneous response, and unstable feature representations, producing unsatisfactory results in camouflaged scenes. To mitigate these issues, we propose a unified COD framework in this paper, termed SAM-COD, which is capable of supporting arbitrary weakly-supervised labels. Our SAM-COD employs a prompt adapter to handle scribbles as prompts based on SAM. Meanwhile, we introduce response filter and semantic matcher modules to improve the quality of the masks obtained by SAM under COD prompts. To alleviate the negative impacts of inaccurate mask predictions, a new strategy of prompt-adaptive knowledge distillation is utilized to ensure a reliable feature representation. To validate the effectiveness of our approach, we have conducted extensive empirical experiments on three mainstream COD benchmarks. The results demonstrate the superiority of our method against state-of-the-art weakly-supervised and even fully-supervised methods.

[CV-38] TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks ECCV2024

链接: https://arxiv.org/abs/2408.10739
作者: Jinjie Mai,Wenxuan Zhu,Sara Rojas,Jesus Zarzar,Abdullah Hamdi,Guocheng Qian,Bing Li,Silvio Giancola,Bernard Ghanem
关键词-EN: Neural radiance fields, Neural radiance, reflect realistic setups, radiance fields, generally require
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 (supplemental pages included)

点击查看摘要

Abstract:Neural radiance fields (NeRFs) generally require many images with accurate poses for accurate novel view synthesis, which does not reflect realistic setups where views can be sparse and poses can be noisy. Previous solutions for learning NeRFs with sparse views and noisy poses only consider local geometry consistency with pairs of views. Closely following bundle adjustment in Structure-from-Motion (SfM), we introduce TrackNeRF for more globally consistent geometry reconstruction and more accurate pose optimization. TrackNeRF introduces feature tracks, i.e., connected pixel trajectories across all visible views that correspond to the same 3D points. By enforcing reprojection consistency among feature tracks, TrackNeRF encourages holistic 3D consistency explicitly. Through extensive experiments, TrackNeRF sets a new benchmark in noisy and sparse view reconstruction. In particular, TrackNeRF shows significant improvements over the state-of-the-art BARF and SPARF by ~8 and ~1 dB in terms of PSNR on DTU under various sparse and noisy view setups. The code is available at this https URL.
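
The reprojection consistency enforced over a feature track boils down to projecting one 3D point into every view and comparing against the tracked pixels. A minimal numpy sketch with toy cameras (not the paper's setup):

```python
import numpy as np

def reproject(P, X):
    """Project a 3D point X (3,) with a 3x4 camera matrix P into pixels."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def track_reprojection_error(Ps, observations, X):
    """Mean pixel error of one feature track: the same 3D point X
    reprojected into every view and compared to the tracked pixels."""
    errs = [np.linalg.norm(reproject(P, X) - uv) for P, uv in zip(Ps, observations)]
    return float(np.mean(errs))

# Toy setup: two identity-intrinsics cameras, second shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X = np.array([0.0, 0.0, 2.0])
obs = [reproject(P1, X), reproject(P2, X)]  # a perfect track
err = track_reprojection_error([P1, P2], obs, X)  # zero for a perfect track
```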

[CV-39] Coarse-to-Fine Detection of Multiple Seams for Robotic Welding

链接: https://arxiv.org/abs/2408.10710
作者: Pengkun Wei,Shuo Cheng,Dayou Li,Ran Song,Yipeng Zhang,Wei Zhang
关键词-EN: Efficiently detecting target, ensuring sub-millimeter accuracy, detecting target weld, Efficiently detecting, target weld seams
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently detecting target weld seams while ensuring sub-millimeter accuracy has always been an important challenge in autonomous welding, which has significant application in industrial practice. Previous works mostly focused on recognizing and localizing welding seams one by one, leading to inferior efficiency in modeling the workpiece. This paper proposes a novel framework capable of multiple weld seams extraction using both RGB images and 3D point clouds. The RGB image is used to obtain the region of interest by approximately localizing the weld seams, and the point cloud is used to achieve the fine-edge extraction of the weld seams within the region of interest using region growth. Our method is further accelerated by using a pre-trained deep learning model to ensure both efficiency and generalization ability. The performance of the proposed method has been comprehensively tested on various workpieces featuring both linear and curved weld seams and in physical experiment systems. The results showcase considerable potential for real-world industrial applications, emphasizing the method’s efficiency and effectiveness. Videos of the real-world experiments can be found at this https URL.
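
The region-growth step over the point cloud can be illustrated with a basic breadth-first expansion, where points join the region if they lie within a radius of an already-grown point (a generic stand-in, not the paper's exact growth criterion):

```python
import numpy as np
from collections import deque

def region_grow(points, seed_idx, radius):
    """Grow a region from a seed point via BFS: a point joins the region
    when it lies within `radius` of any already-grown point."""
    pts = np.asarray(points, float)
    in_region = np.zeros(len(pts), dtype=bool)
    in_region[seed_idx] = True
    queue = deque([seed_idx])
    while queue:
        i = queue.popleft()
        dists = np.linalg.norm(pts - pts[i], axis=1)
        for j in np.where((dists < radius) & ~in_region)[0]:
            in_region[j] = True
            queue.append(j)
    return np.where(in_region)[0]

# Two separated clusters of 3D points; growing from index 0 stays in the first.
pts = np.array([[0, 0, 0], [0.1, 0, 0], [0.2, 0, 0], [5, 5, 5], [5.1, 5, 5]])
region = region_grow(pts, seed_idx=0, radius=0.15)
```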

[CV-40] Large Language Models for Multimodal Deformable Image Registration

链接: https://arxiv.org/abs/2408.10703
作者: Mingrui Ma,Weijie Wang,Jie Ning,Jianfeng He,Nicu Sebe,Bruno Lepri
关键词-EN: Deformable Image Registration, Multimodal Deformable Image, Multimodal Deformable, challenge of Multimodal, Image Registration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The challenge of Multimodal Deformable Image Registration (MDIR) lies in the conversion and alignment of features between images of different modalities. Generative models (GMs) cannot retain enough of the necessary information from the source modality to the target one, while non-GMs struggle to align features across these two modalities. In this paper, we propose a novel coarse-to-fine MDIR framework, LLM-Morph, which is applicable to various pre-trained Large Language Models (LLMs) to solve these concerns by aligning the deep features from different modal medical images. Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then use a first adapter to adjust these tokens and LoRA in pre-trained LLMs to fine-tune their weights, both aimed at eliminating the domain gap between the pre-trained LLMs and the MDIR task. Finally, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task. Extensive experiments on the MR-CT Abdomen and SR-Reg Brain datasets demonstrate the effectiveness of our framework and the potential of pre-trained LLMs for the MDIR task. Our code is available at: this https URL.

[CV-41] MsMemoryGAN: A Multi-scale Memory GAN for Palm-vein Adversarial Purification

链接: https://arxiv.org/abs/2408.10694
作者: Huafeng Qin,Yuming Fu,Huiyan Zhang,Mounim A. El-Yacoubi,Xinbo Gao,Qun Song,Jun Wang
关键词-EN: Deep neural networks, increasing application trend, recently achieved promising, making incorrect recognition, Deep neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep neural networks have recently achieved promising performance in the vein recognition task and show an increasing trend of application. However, they are prone to adversarial perturbation attacks, in which imperceptible perturbations added to the input lead to incorrect recognition. To address this issue, we propose a novel defense model named MsMemoryGAN, which aims to filter the perturbations from adversarial samples before recognition. First, we design a multi-scale autoencoder to achieve high-quality reconstruction and two memory modules to learn the detailed patterns of normal samples at different scales. Second, we investigate a learnable metric in the memory module to retrieve the most relevant memory items to reconstruct the input image. Finally, the perceptual loss is combined with the pixel loss to further enhance the quality of the reconstructed image. During the training phase, MsMemoryGAN learns to reconstruct the input using only a few prototypical elements of the normal patterns recorded in the memory. At the testing stage, given an adversarial sample, MsMemoryGAN retrieves its most relevant normal patterns in memory for the reconstruction. Perturbations in the adversarial sample are usually not reconstructed well, so the input is purified of adversarial perturbations. We have conducted extensive experiments on two public vein datasets under different adversarial attack methods to evaluate the performance of the proposed approach. The experimental results show that our approach removes a wide variety of adversarial perturbations, allowing vein classifiers to achieve the highest recognition accuracy.
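
The memory-retrieval step can be illustrated with plain cosine similarity standing in for the paper's learnable metric (the vectors are toy values, purely illustrative):

```python
import numpy as np

def retrieve(memory, query, k=2):
    """Return indices of the k memory items most similar to the query
    under cosine similarity (a stand-in for a learnable retrieval metric)."""
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = m @ q
    return np.argsort(-sims)[:k]

# Three stored "normal pattern" prototypes and one query feature.
memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
idx = retrieve(memory, query)  # indices of the two closest prototypes
```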

[CV-42] TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

链接: https://arxiv.org/abs/2408.10688
作者: Bin Wang,Wenqian Wang
关键词-EN: garnered significant attention, large-scale pre-trained vision-language, powerful representative capabilities, Action Recognition, action recognition networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representative capabilities. This has inspired researchers to transfer the knowledge from these large pre-trained models to other task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to enhance the efficiency of parameter-efficient fine-tuning (PEFT). However, current transferring approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks with minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transferring and temporal modeling, avoiding backpropagation in frozen parameter models. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features to strengthen the model’s global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos, thereby improving the side network’s ability to capture and learn motion information. Extensive experiments are conducted on three benchmark datasets, including Something-Something V1&V2 and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
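
The temporal differencing that the TD-Adapter builds on can be illustrated with a plain frame-difference motion cue (a simplification, not the paper's adapter):

```python
import numpy as np

def temporal_difference(frames):
    """Frame-to-frame differences used as a cheap motion cue:
    input (T, H, W), output (T-1, H, W)."""
    f = np.asarray(frames, float)
    return f[1:] - f[:-1]

# A "moving" bright pixel across three 2x2 frames.
frames = np.zeros((3, 2, 2))
frames[0, 0, 0] = frames[1, 0, 1] = frames[2, 1, 1] = 1.0
diff = temporal_difference(frames)  # nonzero where the pixel moved
```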

[CV-43] DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba

链接: https://arxiv.org/abs/2408.10679
作者: Shuning Xu,Xina Liu,Binbin Song,Xiangyu Chen,Qiubo Chen,Jiantao Zhou
关键词-EN: phenomenon frequently observed, repetitive patterns interfere, similar repetitive patterns, Moire patterns arise, Moire patterns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Moire patterns arise when two similar repetitive patterns interfere, a phenomenon frequently observed during the capture of images or videos on screens. The color, shape, and location of moire patterns may differ across video frames, posing a challenge in learning information from adjacent frames and preserving temporal consistency. Previous video demoireing methods heavily rely on well-designed alignment modules, resulting in substantial computational burdens. Recently, Mamba, an improved version of the State Space Model (SSM), has demonstrated significant potential for modeling long-range dependencies with linear complexity, enabling efficient temporal modeling in video demoireing without requiring a specific alignment module. In this paper, we propose a novel alignment-free Raw video demoireing network with frequency-assisted spatio-temporal Mamba (DemMamba). The Spatial Mamba Block (SMB) and Temporal Mamba Block (TMB) are sequentially arranged to facilitate effective intra- and inter-relationship modeling in Raw videos with moire patterns. Within SMB, an Adaptive Frequency Block (AFB) is introduced to aid demoireing in the frequency domain. For TMB, a Channel Attention Block (CAB) is embedded to further enhance temporal information interactions by exploiting the inter-channel relationships among features. Extensive experiments demonstrate that our proposed DemMamba surpasses state-of-the-art approaches by 1.3 dB and delivers a superior visual experience.

[CV-44] A Noncontact Technique for Wave Measurement Based on Thermal Stereography and Deep Learning

链接: https://arxiv.org/abs/2408.10670
作者: Deyu Li,Longfei Xiao,Handi Wei,Yan Li,Binghua Zhang
关键词-EN: engineering applications, evolution is essential, stereo, wave, wave field
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The accurate measurement of the wave field and its spatiotemporal evolution is essential in many hydrodynamic experiments and engineering applications. The binocular stereo imaging technique has been widely used to measure waves. However, the optical properties of indoor water surfaces, including transparency, specular reflection, and texture absence, pose challenges for image processing and stereo reconstruction. This study proposed a novel technique that combined thermal stereography and deep learning to achieve fully noncontact wave measurements. The optical imaging properties of water in the long-wave infrared spectrum were found to be suitable for stereo matching, effectively avoiding the issues in the visible-light spectrum. After capturing wave images using thermal stereo cameras, a reconstruction strategy involving deep learning techniques was proposed to improve stereo matching performance. A generative approach was employed to synthesize a dataset with ground-truth disparity from unannotated infrared images. This dataset was then fed to a pretrained stereo neural network for fine-tuning to achieve domain adaptation. Wave flume experiments were conducted to validate the feasibility and accuracy of the proposed technique. The final reconstruction results indicated great agreement and high accuracy with a mean bias of less than 2.1% compared with the measurements obtained using wave probes, suggesting that the novel technique effectively measures the spatiotemporal distribution of wave surface in hydrodynamic experiments.
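
Stereo reconstruction ultimately converts matched disparities to depth via the rectified pinhole relation Z = f·B/d. A minimal sketch with invented camera parameters (not the experiment's calibration):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth from stereo disparity for a rectified pinhole pair: Z = f * B / d.
    Disparity and focal length in pixels, baseline in metres; zero or negative
    disparities map to NaN (no valid match)."""
    d = np.asarray(disparity, float)
    return focal_px * baseline_m / np.where(d > 0, d, np.nan)

# Toy numbers: f = 1000 px, baseline = 0.2 m.
depth = disparity_to_depth([50.0, 100.0], focal_px=1000.0, baseline_m=0.2)
# → depths of 4.0 m and 2.0 m
```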

[CV-45] UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image Enhancement

链接: https://arxiv.org/abs/2408.10653
作者: Yingtie Lei,Jia Yu,Yihang Dong,Changwei Gong,Ziyang Zhou,Chi-Man Pun
关键词-EN: remains challenging due, complex underwater environment, proposed DUN model, Underwater image, underwater image formation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by DSAA CIVIL 2024

点击查看摘要

Abstract:Underwater image enhancement (UIE) plays a crucial role in various marine applications, but it remains challenging due to the complex underwater environment. Current learning-based approaches frequently lack explicit incorporation of prior knowledge about the physical processes involved in underwater image formation, resulting in limited optimization despite their impressive enhancement results. This paper proposes a novel deep unfolding network (DUN) for UIE that integrates color priors and inter-stage feature transformation to improve enhancement performance. The proposed DUN model combines the iterative optimization and reliability of model-based methods with the flexibility and representational power of deep learning, offering a more explainable and stable solution compared to existing learning-based UIE approaches. The proposed model consists of three key components: a Color Prior Guidance Block (CPGB) that establishes a mapping between color channels of degraded and original images, a Nonlinear Activation Gradient Descent Module (NAGDM) that simulates the underwater image degradation process, and an Inter Stage Feature Transformer (ISF-Former) that facilitates feature exchange between different network stages. By explicitly incorporating color priors and modeling the physical characteristics of underwater image formation, the proposed DUN model achieves more accurate and reliable enhancement results. Extensive experiments on multiple underwater image datasets demonstrate the superiority of the proposed model over state-of-the-art methods in both quantitative and qualitative evaluations. The proposed DUN-based approach offers a promising solution for UIE, enabling more accurate and reliable scientific analysis in marine research. The code is available at this https URL.

[CV-46] Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

链接: https://arxiv.org/abs/2408.10652
作者: Guofeng Mei,Luigi Riz,Yiming Wang,Fabio Poiesi
关键词-EN: offering a greater, greater flexibility, flexibility than closed-vocabulary, instance, open vocabulary
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Most recent 3D instance segmentation methods are open-vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, e.g., answer "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available.

[CV-47] A Review of Human-Object Interaction Detection

链接: https://arxiv.org/abs/2408.10641
作者: Yuxiao Wang,Qiwei Xiong,Yu Lei,Weiying Xue,Qi Liu,Zhenao Wei
关键词-EN: high-level visual understanding, HOI detection, image-based HOI detection, Human-object interaction, HOI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human-object interaction (HOI) detection plays a key role in high-level visual understanding, facilitating a deep comprehension of human activities. Specifically, HOI detection aims to locate the humans and objects involved in interactions within images or videos and classify the specific interactions between them. The success of this task is influenced by several key factors, including the accurate localization of human and object instances, as well as the correct classification of object categories and interaction relationships. This paper systematically summarizes and discusses the recent work in image-based HOI detection. First, the mainstream datasets involved in HOI relationship detection are introduced. Furthermore, starting with two-stage methods and end-to-end one-stage detection approaches, this paper comprehensively discusses the current developments in image-based HOI detection, analyzing the strengths and weaknesses of these two methods. Additionally, the advancements of zero-shot learning, weakly supervised learning, and the application of large-scale language models in HOI detection are discussed. Finally, the current challenges in HOI detection are outlined, and potential research directions and future trends are explored.

[CV-48] Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

Link: https://arxiv.org/abs/2408.10627
Authors: Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu
Keywords: meaningful segments based, partitioning video sequences, aims at partitioning, sequences into meaningful, meaningful segments
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose a training strategy, Masked Video Consistency (MVC), which enhances spatial and temporal feature aggregation. MVC randomly masks image patches, compelling the network to predict the entire semantic segmentation, thus improving contextual information integration. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.
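
As a rough illustration of the patch-masking idea (not the paper's implementation; patch size and mask ratio here are made up), one can randomly zero a fraction of image patches before feeding the image to the network:

```python
import numpy as np

def mask_patches(img, patch=4, ratio=0.5, rng=None):
    """Randomly zero out square patches of an image, as in masked-consistency
    training. `patch` and `ratio` are illustrative, not the paper's values."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = img.shape[:2]
    out = img.copy()
    gh, gw = H // patch, W // patch            # patch grid size
    n_mask = int(gh * gw * ratio)              # how many patches to hide
    idx = rng.choice(gh * gw, size=n_mask, replace=False)
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

img = np.ones((8, 8), dtype=np.float32)
masked = mask_patches(img)   # half of the 2x2 grid of 4x4 patches zeroed
```

The network then predicts the full segmentation from the masked input, which is what forces it to integrate context beyond the visible patches.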

[CV-49] WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

Link: https://arxiv.org/abs/2408.10624
Authors: Yonggan Wu, Ling-Chao Meng, Yuan Zichao, Sixian Chan, Hong-Qiang Wang
Keywords: visible-infrared person re-identification, primary challenges lies, significant cross-modality discrepancy, information mining, Interactive Information Mining
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:For the visible-infrared person re-identification (VI-ReID) task, one of the primary challenges lies in significant cross-modality discrepancy. Existing methods struggle to conduct modality-invariant information mining. They often focus solely on mining singular dimensions like spatial or channel, and overlook the extraction of specific-modality multi-dimension information. To fully mine modality-invariant information across a wide range, we introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach. Empowered by the proposed Global Region Interaction (GRI), MIIM comprehensively mines non-local spatial and channel information through intra-dimension interaction. Moreover, thanks to the low computational complexity design, separate MIIM modules can be positioned in shallow layers, enabling the network to better mine specific-modality multi-dimension information. AICL, by introducing the novel Cross-Modality Key-Instance Contrastive (CMKIC) loss, effectively guides the network in extracting modality-invariant information. We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset. The results demonstrate WRIM-Net's superiority over state-of-the-art methods.

[CV-50] TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Link: https://arxiv.org/abs/2408.10623
Authors: Tong Wang, Xiaochao Qu, Ting Liu
Keywords: newly generated text, generated text similar, Generative Adversarial Networks, aims to modify, newly generated
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original. Given an image, a target area, and target text, the task produces an output image with the target text in the selected area, replacing the original. This task has been studied extensively, with initial success using Generative Adversarial Networks (GANs) to balance text fidelity and style similarity. However, GAN-based methods struggled with complex backgrounds or text styles. Recent works leverage diffusion models, showing improved results, yet still face challenges, especially with non-Latin languages like CJK characters (Chinese, Japanese, Korean) that have complex glyphs, often producing inaccurate or unrecognizable characters. To address these issues, we present TextMastero, a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs). TextMastero introduces two key modules: a glyph conditioning module for fine-grained content control in generating accurate texts, and a latent guidance module for providing comprehensive style information to ensure similarity before and after editing. Both qualitative and quantitative experiments demonstrate that our method surpasses all known existing works in text fidelity and style similarity.

[CV-51] Novel Change Detection Framework in Remote Sensing Imagery Using Diffusion Models and Structural Similarity Index (SSIM)

Link: https://arxiv.org/abs/2408.10619
Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
Keywords: urban growth, Change detection, enabling the monitoring, disaster impact, crucial task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Change detection is a crucial task in remote sensing, enabling the monitoring of environmental changes, urban growth, and disaster impact. Conventional change detection techniques, such as image differencing and ratioing, often struggle with noise and fail to capture complex variations in imagery. Recent advancements in machine learning, particularly generative models like diffusion models, offer new opportunities for enhancing change detection accuracy. In this paper, we propose a novel change detection framework that combines the strengths of Stable Diffusion models with the Structural Similarity Index (SSIM) to create robust and interpretable change maps. Our approach, named Diffusion Based Change Detector, is evaluated on both synthetic and real-world remote sensing datasets and compared with state-of-the-art methods. The results demonstrate that our method significantly outperforms traditional differencing techniques and recent deep learning-based methods, particularly in scenarios with complex changes and noise.
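
The SSIM side of the framework can be sketched with a simplified whole-image SSIM score (the paper presumably uses local windows and combines this with diffusion-model features; the constants below are the common SSIM defaults, not necessarily the paper's):

```python
import numpy as np

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Whole-image SSIM between two images with values in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

def change_score(before, after):
    """Higher score means more structural change: 1 - SSIM."""
    return 1.0 - ssim(before, after)

a = np.tile(np.linspace(0, 1, 16), (16, 1))   # a simple gradient "scene"
b = a.copy()
b[4:12, 4:12] = 1.0                           # inject a localized change

unchanged = change_score(a, a)   # ~0: identical images
changed = change_score(a, b)     # clearly larger: structure differs
```

Thresholding a windowed version of this score over an image grid would yield a binary change map, which is the role SSIM plays alongside the diffusion model in the proposed detector.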

[CV-52] OMEGA: Efficient Occlusion-Aware Navigation for Air-Ground Robot in Dynamic Environments via State Space Model

Link: https://arxiv.org/abs/2408.10618
Authors: Junming Wang, Dong Huang, Xiuxian Guan, Zekai Sun, Tianxiang Shen, Fangming Liu, Heming Cui
Keywords: Signed Distance Field, Euclidean Signed Distance, Air-ground robots, disaster response due, computing Euclidean Signed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: OccMamba is Coming!

Abstract:Air-ground robots (AGRs) are widely used in surveillance and disaster response due to their exceptional mobility and versatility (i.e., flying and driving). Current AGR navigation systems perform well in static occlusion-prone environments (e.g., indoors) by using 3D semantic occupancy networks to predict occlusions for complete local mapping and then computing Euclidean Signed Distance Field (ESDF) for path planning. However, these systems face challenges in dynamic, severe occlusion scenes (e.g., crowds) due to limitations in perception networks’ low prediction accuracy and path planners’ high computation overhead. In this paper, we propose OMEGA, which contains OccMamba with an Efficient AGR-Planner to address the above-mentioned problems. OccMamba adopts a novel architecture that separates semantic and occupancy prediction into independent branches, incorporating two mamba blocks within these branches. These blocks efficiently extract semantic and geometric features in 3D environments with linear complexity, ensuring that the network can learn long-distance dependencies to improve prediction accuracy. Semantic and geometric features are combined within the Bird’s Eye View (BEV) space to minimise computational overhead during feature fusion. The resulting semantic occupancy map is then seamlessly integrated into the local map, providing occlusion awareness of the dynamic environment. Our AGR-Planner utilizes this local map and employs kinodynamic A* search and gradient-based trajectory optimization to guarantee planning is ESDF-free and energy-efficient. Extensive experiments demonstrate that OccMamba outperforms the state-of-the-art 3D semantic occupancy network with 25.0% mIoU. End-to-end navigation experiments in dynamic scenes verify OMEGA’s efficiency, achieving a 96% average planning success rate. Code and video are available at this https URL.

[CV-53] A toolbox for calculating objective image properties in aesthetics research

Link: https://arxiv.org/abs/2408.10616
Authors: Christoph Redies, Ralf Bartho, Lisa Koßmann, Branka Spehar, Ronald Hübner, Johan Wagemans, Gregor U. Hayn-Leichsenring
Keywords: image properties, studied numerous quantitative, quantitative image properties, visual aesthetic appreciation, toolbox
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 41 pages, 6 figures

Abstract:Over the past two decades, researchers in the field of visual aesthetics have studied numerous quantitative (objective) image properties and how they relate to visual aesthetic appreciation. However, results are difficult to compare between research groups. One reason is that researchers use different sets of image properties in their studies. But even if the same properties are used, the image pre-processing techniques may differ and often researchers use their own customized scripts to calculate the image properties. To provide greater accessibility and comparability of research results in visual experimental aesthetics, we developed an open-access and easy-to-use toolbox (called the ‘Aesthetics Toolbox’). The Toolbox allows users to calculate a well-defined set of quantitative image properties popular in contemporary research. The properties include lightness and color statistics, Fourier spectral properties, fractality, self-similarity, symmetry, as well as different entropy measures and CNN-based variances. Compatible with most devices, the Toolbox provides an intuitive click-and-drop web interface. In the Toolbox, we integrated the original scripts of four different research groups and translated them into Python 3. To ensure that results were consistent across analyses, we took care that results from the Python versions of the scripts were the same as those from the original scripts. The toolbox, detailed documentation, and a link to the cloud version are available via Github: this https URL. In summary, we developed a toolbox that helps to standardize and simplify the calculation of quantitative image properties for visual aesthetics research.
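
One of the toolbox's listed properties, the Shannon entropy of the intensity histogram, is easy to sketch (the 256-bin, 8-bit setup below is a common default, not necessarily the toolbox's exact choice):

```python
import numpy as np

def shannon_entropy(img, bins=256):
    """Shannon entropy (bits) of an 8-bit image's intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                 # drop empty bins: 0*log(0) := 0
    return float(-(p * np.log2(p)).sum())

flat = np.full((32, 32), 128)                      # one gray level everywhere
varied = np.arange(256).repeat(4).reshape(32, 32)  # all 256 levels, equally often
```

A constant image has zero entropy, while an image using all 256 levels uniformly reaches the maximum of 8 bits; the toolbox computes this alongside color statistics, Fourier slopes, fractality, and the other properties listed above.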

[CV-54] Generalizable Facial Expression Recognition ECCV2024

Link: https://arxiv.org/abs/2408.10614
Authors: Yuhang Zhang, Xiuqi Zheng, Chenyi Liang, Jiani Hu, Weihong Deng
Keywords: FER, FER methods, SOTA FER methods, facial expression recognition, domain adaptation FER
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECCV2024

Abstract:SOTA facial expression recognition (FER) methods fail on test sets that have domain gaps with the train set. Recent domain adaptation FER methods need to acquire labeled or unlabeled samples of target domains to fine-tune the FER model, which might be infeasible in real-world deployment. In this paper, we aim to improve the zero-shot generalization ability of FER methods on different unseen test sets using only one train set. Inspired by how humans first detect faces and then select expression features, we propose a novel FER pipeline to extract expression-related features from any given face images. Our method is based on the generalizable face features extracted by large models like CLIP. However, it is non-trivial to adapt the general features of CLIP for specific tasks like FER. To preserve the generalization ability of CLIP and the high precision of the FER model, we design a novel approach that learns sigmoid masks based on the fixed CLIP face features to extract expression features. To further improve the generalization ability on unseen test sets, we separate the channels of the learned masked features according to the expression classes to directly generate logits and avoid using the FC layer to reduce overfitting. We also introduce a channel-diverse loss to make the learned masks separated. Extensive experiments on five different FER datasets verify that our method outperforms SOTA FER methods by large margins. Code is available in this https URL.
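
The masked-feature idea can be sketched as follows (all shapes and parameters are invented for illustration; the actual method learns the mask on fixed CLIP face features):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expression_logits(face_feat, mask_params, n_classes):
    """A learned sigmoid mask selects expression channels from a fixed face
    feature; channels are then split per class to form logits directly,
    avoiding an FC layer. `mask_params` stand in for learned weights."""
    masked = face_feat * sigmoid(mask_params)   # (D,) expression feature
    groups = masked.reshape(n_classes, -1)      # one channel group per class
    return groups.sum(axis=1)                   # class logits, no FC layer

rng = np.random.default_rng(0)
feat = rng.random(12)               # stand-in for a fixed CLIP face feature
params = rng.standard_normal(12)    # stand-in for the learned mask parameters
logits = expression_logits(feat, params, n_classes=4)
```

The channel-diverse loss mentioned above would additionally push the four channel groups apart, which this sketch omits.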

[CV-55] MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Link: https://arxiv.org/abs/2408.10605
Authors: Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
Keywords: existing methods struggle, methods struggle, struggle to create, image, MUSES
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components, including (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships of multiple objects. To fill this gap, we further construct a new benchmark of T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step of MUSES forward in bridging natural language, 2D image generation, and 3D world.

[CV-56] MV-MOS: Multi-View Feature Fusion for 3D Moving Object Segmentation

Link: https://arxiv.org/abs/2408.10602
Authors: Jintao Cheng, Xingming Chen, Jinxin Liang, Xiaoyu Tang, Xieyuanli Chen, Dachuan Li
Keywords: Effectively summarizing dense, moving object segmentation, summarizing dense, robotics applications, point cloud data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures

Abstract:Effectively summarizing dense 3D point cloud data and extracting motion information of moving objects (moving object segmentation, MOS) is crucial to autonomous driving and robotics applications. How to effectively utilize motion and semantic features and avoid information loss during 3D-to-2D projection is still a key challenge. In this paper, we propose a novel multi-view MOS model (MV-MOS) by fusing motion-semantic features from different 2D representations of point clouds. To effectively exploit complementary information, the motion branch of the proposed model combines motion features from both bird's eye view (BEV) and range view (RV) representations. In addition, a semantic branch is introduced to provide supplementary semantic features of moving objects. Finally, a Mamba module is utilized to fuse the semantic features with motion features and provide effective guidance for the motion branch. We validated the effectiveness of the proposed multi-branch fusion MOS framework via comprehensive experiments, and our proposed model outperforms existing state-of-the-art models on the SemanticKITTI benchmark.

[CV-57] Breast tumor classification based on self-supervised contrastive learning from ultrasound videos

Link: https://arxiv.org/abs/2408.10600
Authors: Yunxin Tang, Siyuan Tang, Jian Zhang, Hao Chen
Keywords: diagnosing breast tumors, Breast ultrasound, Background, Breast, breast tumors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Background: Breast ultrasound is prominently used in diagnosing breast tumors. At present, many automatic systems based on deep learning have been developed to help radiologists in diagnosis. However, training such systems remains challenging because they are usually data-hungry and demand large amounts of labeled data, which need professional knowledge and are expensive. Methods: We adopted a triplet network and a self-supervised contrastive learning technique to learn representations from unlabeled breast ultrasound video clips. We further designed a new hard triplet loss to learn representations that particularly discriminate positive and negative image pairs that are hard to recognize. We also constructed a pretraining dataset from breast ultrasound videos (1,360 videos from 200 patients), which includes an anchor sample dataset with 11,805 images, a positive sample dataset with 188,880 images, and a negative sample dataset dynamically generated from video clips. Further, we constructed a finetuning dataset, including 400 images from 66 patients. We transferred the pretrained network to a downstream benign/malignant classification task and compared the performance with other state-of-the-art models, including three models pretrained on ImageNet and a previous contrastive learning model retrained on our datasets. Results and conclusion: Experiments revealed that our model achieved an area under the receiver operating characteristic curve (AUC) of 0.952, which is significantly higher than the others. Further, we assessed the dependence of our pretrained model on the amount of labeled data and revealed that 100 samples were required to achieve an AUC of 0.901. The proposed framework greatly reduces the demand for labeled data and holds potential for use in automatic breast ultrasound image diagnosis.
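
A generic version of hard-mined triplet loss conveys the core idea (the paper's "hard triplet loss" is more specific; the margin and toy embeddings below are invented):

```python
import numpy as np

def hard_triplet_loss(anchor, positives, negatives, margin=0.2):
    """Triplet loss with hard mining: take the farthest positive and the
    closest negative to the anchor, the pairs hardest to discriminate."""
    d_pos = np.linalg.norm(positives - anchor, axis=1).max()  # hardest positive
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()  # hardest negative
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])                    # anchor embedding
pos = np.array([[0.1, 0.0], [0.3, 0.0]])    # same-class embeddings
neg = np.array([[1.0, 0.0], [0.25, 0.0]])   # other-class embeddings
loss = hard_triplet_loss(a, pos, neg)       # 0.3 - 0.25 + 0.2 = 0.25
```

Minimizing this pulls even the hardest positive closer than the closest negative by at least the margin, which is the discrimination the abstract describes.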

[CV-58] An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs

Link: https://arxiv.org/abs/2408.10593
Authors: Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, Jong C. Park
Keywords: converts sign videos, Large Language Models, Sign Language Translation, sign videos directly, spoken language sentences
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses. Recently, Large Language Models (LLMs) have shown remarkable translation performance in gloss-free methods by harnessing their powerful natural language generation capabilities. However, these methods often rely on domain-specific fine-tuning of visual encoders to achieve optimal results. By contrast, this paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language. With this in mind, we introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework. The core idea of SpaMo is simple yet effective. We first extract spatial and motion features using off-the-shelf visual encoders and then input these features into an LLM with a language prompt. Additionally, we employ a visual-text alignment process as a warm-up before the SLT supervision. Our experiments demonstrate that SpaMo achieves state-of-the-art performance on two popular datasets, PHOENIX14T and How2Sign.

[CV-59] DEGAS: Detailed Expressions on Full-Body Gaussian Avatars

Link: https://arxiv.org/abs/2408.10588
Authors: Zhijing Shao, Duotun Wang, Qing-Yao Tian, Yao-Dong Yang, Hengyu Meng, Zeyu Cai, Bo Dong, Yu Zhang, Kang Zhang, Zeyu Wang
Keywords: remains largely unexplored, made significant advancements, incorporating detailed expressions, avatars remains largely, full-body avatars remains
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Although neural rendering has made significant advancements in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method learns a conditional variational autoencoder that takes both the body motion and facial expression as driving signals to generate Gaussian maps in the UV layout. To drive the facial expressions, instead of the commonly used 3D Morphable Models (3DMMs) in 3D head avatars, we propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars. Leveraging the rendering capability of 3DGS and the rich expressiveness of the expression latent space, the learned avatars can be reenacted to reproduce photorealistic rendering images with subtle and accurate facial expressions. Experiments on an existing dataset and our newly proposed dataset of full-body talking avatars demonstrate the efficacy of our method. We also propose an audio-driven extension of our method with the help of 2D talking faces, opening new possibilities to interactive AI agents.

[CV-60] Multi-view Hand Reconstruction with a Point-Embedded Transformer CVPR2023

Link: https://arxiv.org/abs/2408.10581
Authors: Lixin Yang, Licheng Zhong, Pengxiang Zhu, Xinyu Zhan, Junxiao Kong, Jian Xu, Cewu Lu
Keywords: Hand Mesh Reconstruction, Mesh Reconstruction, named POEM, generalizable multi-view Hand, work introduces
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Generalizable multi-view Hand Mesh Reconstruction (HMR) model. Extension of the original work at CVPR2023

Abstract:This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding a static basis point within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encompass the hand in it. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and employ randomization in the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in the real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source codes are available at this https URL.

[CV-61] MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

Link: https://arxiv.org/abs/2408.10575
Authors: Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang
Keywords: natural language queries, associate relevant video, relevant video content, Text-Video Retrieval, aims to align
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherent plain structure of CLIP, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.
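
The multi-scale construction, "applying a feature pyramid on the last single-scale feature map", can be approximated by repeated 2x average pooling (a simplification of whatever pyramid MUSE actually uses; shapes are illustrative):

```python
import numpy as np

def feature_pyramid(feat, levels=3):
    """Build multi-scale maps by repeated 2x average pooling of the last
    single-scale feature map. feat: (H, W, C) with H, W divisible by 2."""
    maps = [feat]
    for _ in range(levels - 1):
        f = maps[-1]
        H, W, C = f.shape
        # Average over non-overlapping 2x2 windows.
        f = f.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))
        maps.append(f)
    return maps

feat = np.random.default_rng(0).random((8, 8, 16))
pyr = feature_pyramid(feat)   # shapes (8,8,16), (4,4,16), (2,2,16)
```

A scale-wise learner (Mamba blocks, in the paper) would then process this list jointly; the point of the sketch is only how richer multi-resolution context is derived from a single CLIP-style feature map.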

[CV-62] Prompt-Agnostic Adversarial Perturbation for Customized Diffusion Models

Link: https://arxiv.org/abs/2408.10571
Authors: Cong Wan, Yuhang He, Xiang Song, Yihong Gong
Keywords: allowing for efficient, textual descriptions, efficient synthesis, synthesis of photos, data with textual
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 33 pages, 14 figures, under review

Abstract:Diffusion models have revolutionized customized text-to-image generation, allowing for efficient synthesis of photos from personal data with textual descriptions. However, these advancements bring forth risks including privacy breaches and unauthorized replication of artworks. Previous research primarily centers on using prompt-specific methods to generate adversarial examples to protect personal images, yet the effectiveness of existing methods is hindered by constrained adaptability to different prompts. In this paper, we introduce a Prompt-Agnostic Adversarial Perturbation (PAP) method for customized diffusion models. PAP first models the prompt distribution using a Laplace Approximation, and then produces prompt-agnostic perturbations by maximizing a disturbance expectation based on the modeled distribution. This approach effectively tackles prompt-agnostic attacks, leading to improved defense stability. Extensive experiments in face privacy and artistic style protection demonstrate the superior generalization of our method in comparison to existing techniques.

[CV-63] Kalib: Markerless Hand-Eye Calibration with Keypoint Tracking

Link: https://arxiv.org/abs/2408.10562
Authors: Tutian Tang, Minghao Liu, Wenqiang Xu, Cewu Lu
Keywords: calibration involves estimating, Hand-eye calibration involves, involves estimating, Hand-eye calibration, markerless hand-eye calibration
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and supplementary materials are available at this https URL

Abstract:Hand-eye calibration involves estimating the transformation between the camera and the robot. Traditional methods rely on fiducial markers, involving much manual labor and careful setup. Recent advancements in deep learning offer markerless techniques, but they present challenges, including the need for retraining networks for each robot, the requirement of accurate mesh models for data generation, and the need to address the sim-to-real gap. In this letter, we propose Kalib, an automatic and universal markerless hand-eye calibration pipeline that leverages the generalizability of visual foundation models to eliminate these barriers. In each calibration process, Kalib uses keypoint tracking and proprioceptive sensors to estimate the transformation between a robot’s coordinate space and its corresponding points in camera space. Our method does not require training new networks or access to mesh models. Through evaluations in simulation environments and the real-world dataset DROID, Kalib demonstrates superior accuracy compared to recent baseline methods. This approach provides an effective and flexible calibration process for various robot systems by simplifying setup and removing dependency on precise physical markers.

[CV-64] Diff-PCC: Diffusion-based Neural Compression for 3D Point Clouds

Link: https://arxiv.org/abs/2408.10543
Authors: Kai Liu, Kang You, Pan Gao
Keywords: Stable diffusion networks, detailed visual content, Stable diffusion, visual content, networks have emerged
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Stable diffusion networks have emerged as a groundbreaking development for their ability to produce realistic and detailed visual content. This characteristic renders them ideal decoders, capable of producing high-quality and aesthetically pleasing reconstructions. In this paper, we introduce the first diffusion-based point cloud compression method, dubbed Diff-PCC, to leverage the expressive power of the diffusion model for generative and aesthetically superior decoding. Different from the conventional autoencoder fashion, a dual-space latent representation is devised in this paper, in which a compressor composed of two independent encoding backbones is considered to extract expressive shape latents from distinct latent spaces. At the decoding side, a diffusion-based generator is devised to produce high-quality reconstructions by considering the shape latents as guidance to stochastically denoise the noisy point clouds. Experiments demonstrate that the proposed Diff-PCC achieves state-of-the-art compression performance (e.g., 7.711 dB BD-PSNR gains against the latest G-PCC standard at ultra-low bitrate) while attaining superior subjective quality. Source code will be made publicly available.

[CV-65] The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

Link: https://arxiv.org/abs/2408.10541
Authors: Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu
Keywords: Referring Video Object, Video Object Segmentation, natural language expression, emerging multi-modal task, Referring Video
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2406.13939

Abstract:Referring Video Object Segmentation is an emerging multi-modal task that aims to segment objects in the video given a natural language expression. In this work, we build two instance-centric models and fuse predicted results from frame-level and instance-level. First, we introduce instance mask into the DETR-based model for query initialization to achieve temporal enhancement and employ SAM for spatial refinement. Second, we build an instance retrieval model conducting binary instance mask classification to determine whether the instance is referred. Finally, we fuse predicted results and our method achieved a score of 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing the final ranking of 3rd place in the 6th LSVOS Challenge RVOS Track.

[CV-66] Training Matting Models without Alpha Labels

Link: https://arxiv.org/abs/2408.10539
Authors: Wenze Liu, Zixuan Ye, Hao Lu, Zhiguo Cao, Xiangyu Yue
Keywords: labelling difficulty, longstanding problem, problem in deep, deep image matting, DDC loss
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 12 figures

Abstract:The labelling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps, which coarsely indicate the foreground/background, as supervision. We show that the cooperation between learned semantics from indicated known regions and properly assumed matting rules can help infer alpha values at transition areas. Inspired by the nonlocal principle in traditional image matting, we build a directional distance consistency loss (DDC loss) at each pixel neighborhood to constrain the alpha values conditioned on the input image. DDC loss forces the distance of similar pairs on the alpha matte and on its corresponding image to be consistent. In this way, the alpha values can be propagated from learned known regions to unknown transition areas. With only images and trimaps, a matting model can be trained under the supervision of a known loss and the proposed DDC loss. Experiments on the AM-2K and P3M-10K datasets show that our paradigm achieves comparable performance with the fine-label-supervised baseline, while sometimes offering even more satisfying results than human-labelled ground truth. Code is available at this https URL.
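
A minimal sketch of the distance-consistency idea behind the DDC loss, restricted to horizontal neighbour pairs (the actual loss scans a whole pixel neighbourhood in several directions; the similarity threshold here is invented):

```python
import numpy as np

def ddc_loss(alpha, image, tau=0.2):
    """For image-similar neighbour pairs, the distance on the alpha matte
    should match the distance on the image. `tau` is an illustrative
    similarity threshold, not a value from the paper."""
    d_img = np.abs(np.diff(image, axis=1))    # pairwise image distances
    d_alpha = np.abs(np.diff(alpha, axis=1))  # pairwise alpha distances
    similar = d_img < tau                     # keep only similar pairs
    if not similar.any():
        return 0.0
    return float(((d_alpha - d_img)[similar] ** 2).mean())

img = np.tile(np.linspace(0, 1, 8), (4, 1))  # smooth transition region
good = img.copy()                 # alpha tracking the image: consistent
bad = (img > 0.5).astype(float)   # hard-thresholded alpha: distances disagree

low = ddc_loss(good, img)   # 0: alpha distances equal image distances
high = ddc_loss(bad, img)   # penalized: abrupt alpha jump in a smooth region
```

This is how the loss lets alpha values propagate smoothly from trimap-known regions into unknown transition areas without fine alpha labels.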

[CV-67] Surgical Workflow Recognition and Blocking Effectiveness Detection in Laparoscopic Liver Resections with Pringle Maneuver

链接: https://arxiv.org/abs/2408.10538
作者: Diandian Guo,Weixin Si,Zhixi Li,Jialun Pei,Pheng-Ann Heng
关键词-EN: reduce blood loss, intermittently blocking blood, blocking blood inflow, Pringle maneuver, clear surgical view
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pringle maneuver (PM) in laparoscopic liver resection aims to reduce blood loss and provide a clear surgical view by intermittently blocking blood inflow of the liver, whereas prolonged PM may cause ischemic injury. To comprehensively monitor this surgical procedure and provide timely warnings of ineffective and prolonged blocking, we suggest two complementary AI-assisted surgical monitoring tasks: workflow recognition and blocking effectiveness detection in liver resections. The former presents challenges in real-time capturing of short-term PM, while the latter involves the intraoperative discrimination of long-term liver ischemia states. To address these challenges, we meticulously collect a novel dataset, called PmLR50, consisting of 25,037 video frames covering various surgical phases from 50 laparoscopic liver resection procedures. Additionally, we develop an online baseline for PmLR50, termed PmNet. This model embraces Masked Temporal Encoding (MTE) and Compressed Sequence Modeling (CSM) for efficient short-term and long-term temporal information modeling, and embeds Contrastive Prototype Separation (CPS) to enhance action discrimination between similar intraoperative operations. Experimental results demonstrate that PmNet outperforms existing state-of-the-art surgical workflow recognition methods on the PmLR50 benchmark. Our research offers potential clinical applications for the laparoscopic liver surgery community. Source code and data will be publicly available.

[CV-68] Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation

链接: https://arxiv.org/abs/2408.10537
作者: Jiawei Han,Kaiqi Liu,Wei Li,Guangzhi Chen
关键词-EN: cloud semantic segmentation, Point cloud semantic, intelligent agent, segmentation network, semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Point cloud semantic segmentation can significantly enhance the perception of an intelligent agent. Nevertheless, the discriminative capability of the segmentation network is influenced by the quantity of samples available for different categories. To mitigate the cognitive bias induced by class imbalance, this paper introduces a novel method, namely subspace prototype guidance (SPG), to guide the training of segmentation network. Specifically, the point cloud is initially separated into independent point sets by category to provide initial conditions for the generation of feature subspaces. The auxiliary branch which consists of an encoder and a projection head maps these point sets into separate feature subspaces. Subsequently, the feature prototypes which are extracted from the current separate subspaces and then combined with prototypes of historical subspaces guide the feature space of main branch to enhance the discriminability of features of minority categories. The prototypes derived from the feature space of main branch are also employed to guide the training of the auxiliary branch, forming a supervisory loop to maintain consistent convergence of the entire network. The experiments conducted on the large public benchmarks (i.e. S3DIS, ScanNet v2, ScanNet200, Toronto-3D) and collected real-world data illustrate that the proposed method significantly improves the segmentation performance and surpasses the state-of-the-art method. The code is available at this https URL.
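
The prototype bookkeeping described above, per-class feature means from the current batch blended with prototypes from historical subspaces, can be sketched as follows. This assumes a flat feature matrix and integer labels; the momentum value is illustrative, not taken from the paper:

```python
import numpy as np

def update_prototypes(features, labels, protos, momentum=0.9):
    """Blend current-batch class means with historical prototypes (EMA)."""
    protos = dict(protos)  # copy so the caller's state is untouched
    for c in np.unique(labels):
        mean_c = features[labels == c].mean(axis=0)
        protos[c] = (momentum * protos[c] + (1 - momentum) * mean_c
                     if c in protos else mean_c)
    return protos
```

Minority categories keep a stable prototype across batches even when they contribute few points to any single batch, which is what lets the prototype guide their feature space.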

[CV-69] FAGStyle: Feature Augmentation on Geodesic Surface for Zero-shot Text-guided Diffusion Image Style Transfer

链接: https://arxiv.org/abs/2408.10533
作者: Yuexing Han,Liheng Ruan,Bing Wang
关键词-EN: style, style reference, image style transfer, Sliding Window Crop, style reference images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of image style transfer is to render an image guided by a style reference while maintaining the original content. Existing image-guided methods rely on specific style reference images, restricting their wider application and potentially compromising result quality. As a flexible alternative, text-guided methods allow users to describe the desired style using text prompts. Despite their versatility, these methods often struggle with maintaining style consistency, reflecting the described style accurately, and preserving the content of the target image. To address these challenges, we introduce FAGStyle, a zero-shot text-guided diffusion image style transfer method. Our approach enhances inter-patch information interaction by incorporating the Sliding Window Crop technique and Feature Augmentation on Geodesic Surface into our style control loss. Furthermore, we integrate a Pre-Shape self-correlation consistency loss to ensure content consistency. FAGStyle demonstrates superior performance over existing methods, consistently achieving stylization that retains the semantic content of the source image. Experimental results confirm the efficacy of FAGStyle across a diverse range of source contents and styles, both imagined and common.
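
The Sliding Window Crop step can be illustrated generically: overlapping patches taken on a regular stride grid, so neighbouring regions share content when patch-wise style losses are computed. The window and stride values below are illustrative, not the paper's settings:

```python
import numpy as np

def sliding_window_crops(img, win=64, stride=32):
    """Collect overlapping win x win patches on a regular stride grid."""
    h, w = img.shape[:2]
    return [img[y:y + win, x:x + win]
            for y in range(0, h - win + 1, stride)
            for x in range(0, w - win + 1, stride)]
```

Because stride < win, each image region appears in several patches, which is what enables the inter-patch information interaction the method relies on.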

[CV-70] NutrifyAI: An AI-Powered System for Real-Time Food Detection, Nutritional Analysis, and Personalized Meal Recommendations

链接: https://arxiv.org/abs/2408.10532
作者: Michelle Han,Junyao Chen
关键词-EN: Calorie Counter, nutrition apps reaching, apps reaching, health apps, surging in popularity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 12 figures

点击查看摘要

Abstract:With diet and nutrition apps reaching 1.4 billion users in 2022 [1], it’s no surprise that health apps like MyFitnessPal, Noom, and Calorie Counter are surging in popularity. However, one major setback [2] of nearly all nutrition applications is that users must enter food data manually, which is time-consuming and tedious. Thus, there has been an increasing demand for applications that can accurately identify food items, analyze their nutritional content, and offer dietary recommendations in real-time. This paper introduces a comprehensive system that combines advanced computer vision techniques with nutrition analysis, implemented in a versatile mobile and web application. The system is divided into three key components: 1) food detection using the YOLOv8 model, 2) nutrient analysis via the Edamam Nutrition Analysis API, and 3) personalized meal recommendations using the Edamam Meal Planning and Recipe Search APIs. Designed for both mobile and web platforms, the application ensures fast processing times with an intuitive user interface, with features such as data visualizations using Chart.js, a login system, and personalized settings for dietary preferences, allergies, and cuisine choices. Preliminary results showcase the system’s effectiveness, making it a valuable tool for users to make informed dietary decisions.

[CV-71] EdgeNAT: Transformer for Efficient Edge Detection

链接: https://arxiv.org/abs/2408.10527
作者: Jinghuai Jie,Yan Guo,Guixing Wu,Junmin Wu,Baojian Hua
关键词-EN: increasingly prominent role, Neighborhood Attention Transformer, feature extraction capabilities, Dilated Neighborhood Attention, powerful feature extraction
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformers, renowned for their powerful feature extraction capabilities, have played an increasingly prominent role in various vision tasks. In particular, recent advancements introduce transformers with hierarchical structures, such as the Dilated Neighborhood Attention Transformer (DiNAT), demonstrating an outstanding ability to efficiently capture both global and local features. However, transformers’ application in edge detection has not been fully exploited. In this paper, we propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder, capable of extracting object boundaries and meaningful edges both accurately and efficiently. On the one hand, EdgeNAT captures global contextual information and detailed local cues with DiNAT; on the other hand, it enhances feature representation with a novel SCAF-MLA decoder by utilizing both inter-spatial and inter-channel relationships of feature maps. Extensive experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images. Notably, on the widely used BSDS500 dataset, our L model achieves ODS and OIS F-measures of 86.0% and 87.6% for multi-scale input, and 84.9% and 86.3% for single-scale input, surpassing the current state-of-the-art EDTER by 1.2%, 1.1%, 1.7%, and 1.6%, respectively. Moreover, our approach runs at 20.87 FPS on an RTX 4090 GPU with single-scale input. The code for our method will be released soon.

[CV-72] BAUST Lipi: A BdSL Dataset with Deep Learning Based Bangla Sign Language Recognition

链接: https://arxiv.org/abs/2408.10518
作者: Md Hadiuzzaman,Mohammed Sowket Ali,Tamanna Sultana,Abdur Raj Shafi,Abu Saleh Musa Miah,Jungpil Shin
关键词-EN: People commonly communicate, People commonly, communicate in English, Bangla sign language, Bengali spoken languages
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:People commonly communicate in English, Arabic, and Bengali spoken languages through various mediums. However, deaf and hard-of-hearing individuals primarily use body language and sign language to express their needs and achieve independence. Sign language research is burgeoning to enhance communication with the deaf community. While many researchers have made strides in recognizing sign languages such as French, British, Arabic, Turkish, and American, there has been limited research on Bangla sign language (BdSL) with less-than-satisfactory results. One significant barrier has been the lack of a comprehensive Bangla sign language dataset. In our work, we introduced a new BdSL dataset comprising alphabets totaling 18,000 images, with each image being 224x224 pixels in size. Our dataset encompasses 36 Bengali symbols, of which 30 are consonants and the remaining six are vowels. Despite our dataset contribution, many existing systems continue to grapple with achieving high-performance accuracy for BdSL. To address this, we devised a hybrid Convolutional Neural Network (CNN) model, integrating multiple convolutional layers, activation functions, dropout techniques, and LSTM layers. Upon evaluating our hybrid-CNN model with the newly created BdSL dataset, we achieved an accuracy rate of 97.92%. We are confident that both our BdSL dataset and hybrid CNN model will be recognized as significant milestones in BdSL research.

[CV-73] Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers ECML-PKDD 2024

链接: https://arxiv.org/abs/2408.10503
作者: Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
关键词-EN: Assessing the forensic, hand images involves, unique features, hand, hand images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the ECML PKDD 2024 (Research Track)

点击查看摘要

Abstract:Assessing the forensic value of hand images involves the use of unique features and patterns present in an individual’s hand. The human hand has distinct characteristics, such as the pattern of veins, fingerprints, and the geometry of the hand itself. This paper investigates the use of vision transformers (ViTs) for classification of hand images. We use explainability tools to explore the internal representations of ViTs and assess their impact on the model outputs. Utilizing the internal understanding of ViTs, we introduce distillation methods that allow a student model to adaptively extract knowledge from a teacher model while learning on data of a different domain to prevent catastrophic forgetting. Two publicly available hand image datasets are used to conduct a series of experiments to evaluate performance of the ViTs and our proposed adaptive distillation methods. The experimental results demonstrate that ViT models significantly outperform traditional machine learning methods and the internal states of ViTs are useful for explaining the model outputs in the classification task. By averting catastrophic forgetting, our distillation methods achieve excellent performance on data from both source and target domains, particularly when these two domains exhibit significant dissimilarity. The proposed approaches therefore can be developed and implemented effectively for real-world applications such as access control, identity verification, and authentication systems.
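
As a generic illustration of the objective underlying such teacher-student setups, here is the standard temperature-scaled soft-label distillation loss (Hinton-style KD, not the paper's adaptive variant, which additionally guards against catastrophic forgetting):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, numerically stabilised."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The T^2 factor keeps gradient magnitudes comparable across temperatures, so the soft-label term can be mixed with a hard-label cross-entropy term at a fixed weight.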

[CV-74] SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition

链接: https://arxiv.org/abs/2408.10500
作者: Zebang Cheng,Shuyuan Tu,Dawei Huang,Minghan Li,Xiaojiang Peng,Zhi-Qi Cheng,Alexander G. Hauptmann
关键词-EN: multimodal emotion recognition, emotion recognition, paper presents, presents our winning, winning approach
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper presents our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition. Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples, addressing the challenge of limited labeled data. To enhance multimodal fusion while mitigating modality-specific noise, we introduce Conv-Attention, a lightweight and efficient hybrid framework. Extensive experimentation validates the effectiveness of our approach. In the MER-NOISE track, our system achieves a state-of-the-art weighted average F-score of 85.30%, surpassing the second and third-place teams by 1.47% and 1.65%, respectively. For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V, securing the highest score among all participating large multimodal models. The code and model for Emotion-LLaMA are available at this https URL.

[CV-75] GPT-based Textile Pilling Classification Using 3D Point Cloud Data

链接: https://arxiv.org/abs/2408.10496
作者: Yu Lu,YuYu Chen,Gang Zhou,Zhenghua Lan
关键词-EN: textile quality control, Textile pilling assessment, point cloud, Textile pilling, quality control
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 2 figures

点击查看摘要

Abstract:Textile pilling assessment is critical for textile quality control. We collect thousands of 3D point cloud images in the actual test environment of textiles and organize and label them as TextileNet8 dataset. To the best of our knowledge, it is the first publicly available eight-categories 3D point cloud dataset in the field of textile pilling assessment. Based on PointGPT, the GPT-like big model of point cloud analysis, we incorporate the global features of the input point cloud extracted from the non-parametric network into it, thus proposing the PointGPT+NN model. Using TextileNet8 as a benchmark, the experimental results show that the proposed PointGPT+NN model achieves an overall accuracy (OA) of 91.8% and a mean per-class accuracy (mAcc) of 92.2%. Test results on other publicly available datasets also validate the competitive performance of the proposed PointGPT+NN model. The proposed TextileNet8 dataset will be publicly available.

[CV-76] Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm

链接: https://arxiv.org/abs/2408.10488
作者: Xiao Wang,Yao Rong,Fuling Wang,Jianing Li,Lin Zhu,Bo Jiang,Yaowei Wang
关键词-EN: Sign Language Translation, AI-assisted disability, Event stream sign, core task, field of AI-assisted
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注: First Large-scale and High-Definition Benchmark Dataset for Event-based Sign Language Translation

点击查看摘要

Abstract:Sign Language Translation (SLT) is a core task in the field of AI-assisted disability. Unlike traditional SLT based on visible light videos, which is easily affected by factors such as lighting, rapid hand movements, and privacy breaches, this paper proposes the use of high-definition Event streams for SLT, effectively mitigating the aforementioned issues. This is primarily because Event streams have a high dynamic range and dense temporal signals, which can withstand low illumination and motion blur well. Additionally, due to their sparsity in space, they effectively protect the privacy of the target person. More specifically, we propose a new high-resolution Event stream sign language dataset, termed Event-CSL, which effectively fills the data gap in this area of research. It contains 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary. These samples are collected in a variety of indoor and outdoor scenes, encompassing multiple angles, light intensities, and camera movements. We have benchmarked existing mainstream SLT works to enable fair comparison for future efforts. Based on this dataset and several other large-scale datasets, we propose a novel baseline method that fully leverages the Mamba model’s ability to integrate temporal information of CNN features, resulting in improved sign language translation outcomes. Both the benchmark dataset and source code will be released on this https URL

[CV-77] MambaEVT: Event Stream based Visual Object Tracking using State Space Model

链接: https://arxiv.org/abs/2408.10487
作者: Xiao Wang,Chao wang,Shiao Wang,Xixi Wang,Zhicheng Zhao,Lin Zhu,Bo Jiang
关键词-EN: Event camera-based visual, low energy consumption, dense temporal resolution, unique imaging principle, recent years due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: In Peer Review

点击查看摘要

Abstract:Event camera-based visual tracking has drawn more and more attention in recent years due to the unique imaging principle and advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually hitting their performance bottlenecks, due to the utilization of vision Transformer and the static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts the state space model with linear complexity as a backbone network. The search regions and target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of search regions will be fed into the tracking head for target localization. More importantly, we consider introducing a dynamic template update strategy into the tracking framework using the Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released on this https URL

[CV-78] LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

链接: https://arxiv.org/abs/2408.10469
作者: Xinyu Liu,Jing Zhang,Kexin Zhang,Xu Liu,Lingling Li
关键词-EN: including object occlusion, tracking specific objects, Video Object Segmentation, including object, occlusion and fragmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and re-appearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of the LSVOS challenge VOS track, ranking third overall.

[CV-79] Learning Multimodal Latent Space with EBM Prior and MCMC Inference

链接: https://arxiv.org/abs/2408.10467
作者: Shiyu Yuan,Carlo Lipizzi,Tian Han
关键词-EN: Chain Monte Carlo, Markov Chain Monte, MCMC inference, Monte Carlo, Markov Chain
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal generative models are crucial for various applications. We propose an approach that combines an expressive energy-based model (EBM) prior with Markov Chain Monte Carlo (MCMC) inference in the latent space for multimodal generation. The EBM prior acts as an informative guide, while MCMC inference, specifically through short-run Langevin dynamics, brings the posterior distribution closer to its true form. This method not only provides an expressive prior to better capture the complexity of multimodality but also improves the learning of shared latent variables for more coherent generation across modalities. Our proposed method is supported by empirical experiments, underscoring the effectiveness of our EBM prior with MCMC inference in enhancing cross-modal and joint generative tasks in multimodal contexts.
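
The short-run Langevin dynamics mentioned above follows the update z ← z + (s/2)·∇log p(z) + √s·ε. A minimal sketch against a known target (a standard Gaussian, whose score is −z; the chain length, step size, and particle count are illustrative, not the paper's settings):

```python
import numpy as np

def short_run_langevin(grad_log_p, z0, steps=200, step_size=0.1, seed=0):
    """Iterate z <- z + (s/2) * grad log p(z) + sqrt(s) * noise."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = (z + 0.5 * step_size * grad_log_p(z)
               + np.sqrt(step_size) * rng.standard_normal(z.shape))
    return z

# Target: standard normal, score(z) = -z; chains started far from the mode.
z = short_run_langevin(lambda z: -z, np.full(2000, 5.0))
```

Even started far from the mode, the chains are pulled toward the high-density region; in the paper's setting the score comes from the learned EBM prior rather than a closed-form target.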

[CV-80] Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

链接: https://arxiv.org/abs/2408.10453
作者: Liu He,Yizhi Song,Hejun Huang,Daniel Aliaga,Xin Zhou
关键词-EN: diffusion-based or autoregressive, Programmer agent, video, Vision Large Language, Large Language Model
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, those novel models provide plausible versatility, but they are criticized for physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, the film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos and animations address the aforementioned shortcomings, but the process is extremely tedious and requires tight collaboration between movie makers and 3D rendering experts. In this paper, we introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video that best aligns with the given description. Based on film-making inspiration and augmented with Blender-based movie-making knowledge, the Director agent decomposes the input text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts based on customized function composing and API calling. Then, the Reviewer agent, augmented with knowledge of video reviewing, character motion coordinates, and intermediate screenshots, uses its compositional reasoning ability to provide feedback to the Programmer agent. The Programmer agent iteratively improves the scripts to yield the best overall video outcome. Our generated videos show better quality than those of commercial video generation models on 5 metrics covering video quality and instruction-following performance. Moreover, our framework outperforms other approaches in a comprehensive user study on quality, consistency, and rationality.

[CV-81] The Brittleness of AI-Generated Image Watermarking Techniques: Examining Their Robustness Against Visual Paraphrasing Attacks

链接: https://arxiv.org/abs/2408.10446
作者: Niyar R Barman,Krish Sharma,Ashhar Aziz,Shashwat Bajpai,Shwetangshu Biswas,Vasu Sharma,Vinija Jain,Aman Chadha,Amit Sheth,Amitava Das
关键词-EN: models like Stable, Stable Diffusion, visual paraphrase, exemplified by models, potential misuse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 23 pages and 10 figures

点击查看摘要

Abstract:The rapid advancement of text-to-image generation systems, exemplified by models like Stable Diffusion, Midjourney, Imagen, and DALL-E, has heightened concerns about their potential misuse. In response, companies like Meta and Google have intensified their efforts to implement watermarking techniques on AI-generated images to curb the circulation of potentially misleading visuals. However, in this paper, we argue that current image watermarking methods are fragile and susceptible to being circumvented through visual paraphrase attacks. The proposed visual paraphraser operates in two steps. First, it generates a caption for the given image using KOSMOS-2, one of the latest state-of-the-art image captioning systems. Second, it passes both the original image and the generated caption to an image-to-image diffusion system. During the denoising step of the diffusion pipeline, the system generates a visually similar image that is guided by the text caption. The resulting image is a visual paraphrase and is free of any watermarks. Our empirical findings demonstrate that visual paraphrase attacks can effectively remove watermarks from images. This paper provides a critical assessment, empirically revealing the vulnerability of existing watermarking techniques to visual paraphrase attacks. While we do not propose solutions to this issue, this paper serves as a call to action for the scientific community to prioritize the development of more robust watermarking techniques. Our first-of-its-kind visual paraphrase dataset and accompanying code are publicly available.

[CV-82] Feasibility of assessing cognitive impairment via distributed camera network and privacy-preserving edge computing

链接: https://arxiv.org/abs/2408.10442
作者: Chaitra Hegde,Yashar Kiarashi,Allan I Levey,Amy D Rodriguez,Hyeokhyen Kwon,Gari D Clifford
关键词-EN: Mild cognitive impairment, Mild cognitive, education-related expectations, functions beyond typical, typical age
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:INTRODUCTION: Mild cognitive impairment (MCI) is characterized by a decline in cognitive functions beyond typical age and education-related expectations. Since MCI has been linked to reduced social interactions and increased aimless movements, we aimed to automate the capture of these behaviors to enhance longitudinal monitoring. METHODS: Using a privacy-preserving distributed camera network, we collected movement and social interaction data from groups of individuals with MCI undergoing therapy within a 1700 m^2 space. We developed movement and social interaction features, which were then used to train a series of machine learning algorithms to distinguish between higher and lower cognitive functioning MCI groups. RESULTS: A Wilcoxon rank-sum test revealed statistically significant differences between high and low-functioning cohorts in features such as linear path length, walking speed, change in direction while walking, entropy of velocity and direction change, and number of group formations in the indoor space. Despite lacking individual identifiers to associate with specific levels of MCI, a machine learning approach using the most significant features provided a 71% accuracy. DISCUSSION: We provide evidence to show that a privacy-preserving low-cost camera network using an edge computing framework has the potential to distinguish between different levels of cognitive impairment from the movements and social interactions captured during group activities.
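
Several of the movement features named in the results, path length, walking speed, and entropy of direction change, can be computed from a position trace roughly as follows. This is a hedged sketch with an illustrative 8-bin histogram, not the study's exact feature definitions:

```python
import numpy as np

def trajectory_features(xy, dt=1.0):
    """Path length, mean speed, and Shannon entropy of heading changes
    from a sequence of 2-D positions sampled every `dt` seconds."""
    xy = np.asarray(xy, dtype=float)
    steps = np.diff(xy, axis=0)                 # displacement per sample
    seg = np.linalg.norm(steps, axis=1)         # segment lengths
    headings = np.arctan2(steps[:, 1], steps[:, 0])
    turns = np.diff(headings)                   # change in direction
    hist, _ = np.histogram(turns, bins=8, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    entropy = float(-(p[p > 0] * np.log2(p[p > 0])).sum())
    return {
        "path_length": float(seg.sum()),
        "mean_speed": float(seg.mean() / dt),
        "turn_entropy": entropy,
    }
```

A purposeful straight walk yields zero turn entropy, while aimless wandering spreads heading changes across bins and drives the entropy up, which is the separation the classifiers exploit.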

[CV-83] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs ECCV2024

链接: https://arxiv.org/abs/2408.10433
作者: Yassine Ouali,Adrian Bulat,Brais Martinez,Georgios Tzimiropoulos
关键词-EN: Large Vision Language, Vision Language Models, Large Vision, Vision Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024

点击查看摘要

Abstract:Despite recent successes, LVLMs or Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LlaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
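
The ranking step can be sketched as follows: score each candidate prediction by CLIP-style cosine similarity to the image embedding and keep the best/worst as a chosen/rejected pair. The margin filter here is an assumption standing in for the paper's rule-based filtering:

```python
import numpy as np

def build_dpo_pair(candidates, text_embs, image_emb, margin=0.05):
    """Return (chosen, rejected) by cosine similarity to the image,
    or None if the gap is too small to make a reliable pair."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(e, image_emb) for e in text_embs])
    best, worst = int(sims.argmax()), int(sims.argmin())
    if sims[best] - sims[worst] <= margin:
        return None
    return candidates[best], candidates[worst]
```

Dropping near-ties keeps only pairs where the embedding model clearly prefers one caption, which is what makes the resulting preference data usable for DPO without paid-for APIs.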

[CV-84] Towards Automation of Human Stage of Decay Identification: An Artificial Intelligence Approach

链接: https://arxiv.org/abs/2408.10414
作者: Anna-Maria Nau,Phillip Ditto,Dawnie Wolfe Steadman,Audris Mockus
关键词-EN: identifying human remains, Determining the stage, human decomposition, human decomposition images, human decomposition scoring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:Determining the stage of decomposition (SOD) is crucial for estimating the postmortem interval and identifying human remains. Currently, labor-intensive manual scoring methods are used for this purpose, but they are subjective and do not scale for the emerging large-scale archival collections of human decomposition photos. This study explores the feasibility of automating two common human decomposition scoring methods proposed by Megyesi and Gelderman using artificial intelligence (AI). We evaluated two popular deep learning models, Inception V3 and Xception, by training them on a large dataset of human decomposition images to classify the SOD for different anatomical regions, including the head, torso, and limbs. Additionally, an interrater study was conducted to assess the reliability of the AI models compared to human forensic examiners for SOD identification. The Xception model achieved the best classification performance, with macro-averaged F1 scores of .878, .881, and .702 for the head, torso, and limbs when predicting Megyesi’s SODs, and .872, .875, and .76 for the head, torso, and limbs when predicting Gelderman’s SODs. The interrater study results supported AI’s ability to determine the SOD at a reliability level comparable to a human expert. This work demonstrates the potential of AI models trained on a large dataset of human decomposition images to automate SOD identification.

[CV-85] Parallel Processing of Point Cloud Ground Segmentation for Mechanical and Solid-State LiDARs

链接: https://arxiv.org/abs/2408.10404
作者: Xiao Zhang,Zhanhong Huang,Garcia Gonzalez Antony,Witek Jachimczyk,Xinming Huang
关键词-EN: real-time point cloud, point cloud ground, cloud ground segmentation, adapting LiDAR algorithms, ground segmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 5 pages

点击查看摘要

Abstract:In this study, we introduce a novel parallel processing framework for real-time point cloud ground segmentation on FPGA platforms, aimed at adapting LiDAR algorithms to the evolving landscape from mechanical to solid-state LiDAR (SSL) technologies. Focusing on the ground segmentation task, we explore parallel processing techniques on existing approaches and adapt them to real-world SSL data handling. We validated frame-segmentation based parallel processing methods using point-based, voxel-based, and range-image-based ground segmentation approaches on the SemanticKITTI dataset based on mechanical LiDAR. The results revealed the superior performance and robustness of the range-image method, especially in its resilience to slicing. Further, utilizing a custom dataset from our self-built Camera-SSLSS equipment, we examined regular SSL data frames and validated the effectiveness of our parallel approach for SSL sensor. Additionally, our pioneering implementation of range-image ground segmentation on FPGA for SSL sensors demonstrated significant processing speed improvements and resource efficiency, achieving processing rates up to 50.3 times faster than conventional CPU setups. These findings underscore the potential of parallel processing strategies to significantly enhance LiDAR technologies for advanced perception tasks in autonomous systems. Post-publication, both the data and the code will be made available on GitHub.
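The range-image representation the study found most robust projects each 3D point into an azimuth/elevation grid of ranges. A toy sketch of that projection, with illustrative (not LiDAR-specific) grid size and field of view:

```python
import math

def to_range_image(points, h=4, w=8, fov_up=15.0, fov_down=-15.0):
    """Project (x, y, z) points into an h x w range image of distances.
    Grid size and vertical FOV here are illustrative assumptions."""
    fov = fov_up - fov_down
    img = [[0.0] * w for _ in range(h)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        yaw = math.atan2(y, x)                  # azimuth angle
        pitch = math.degrees(math.asin(z / r))  # elevation angle
        u = min(w - 1, max(0, int((0.5 * (1 - yaw / math.pi)) * w)))
        v = min(h - 1, max(0, int((fov_up - pitch) / fov * h)))
        img[v][u] = r
    return img
```

Once the cloud is in this dense 2D form, frame-slicing for parallel processing reduces to splitting image columns, which is one reason the representation tolerates slicing well.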

[CV-86] Webcam-based Pupil Diameter Prediction Benefits from Upscaling

链接: https://arxiv.org/abs/2408.10397
作者: Vijul Shah,Brian B. Moser,Ko Watanabe,Andreas Dengel
关键词-EN: Capturing pupil diameter, Capturing pupil, cognitive load, pupil diameter, essential for assessing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Capturing pupil diameter is essential for assessing psychological and physiological states such as stress levels and cognitive load. However, the low resolution of images in eye datasets often hampers precise measurement. This study evaluates the impact of various upscaling methods, ranging from bicubic interpolation to advanced super-resolution, on pupil diameter predictions. We compare several pre-trained methods, including CodeFormer, GFPGAN, Real-ESRGAN, HAT, and SRResNet. Our findings suggest that pupil diameter prediction models trained on upscaled datasets are highly sensitive to the selected upscaling method and scale. Our results demonstrate that upscaling methods consistently enhance the accuracy of pupil diameter prediction models, highlighting the importance of upscaling in pupilometry. Overall, our work provides valuable insights for selecting upscaling techniques, paving the way for more accurate assessments in psychological and physiological research.
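As a point of reference for the upscaling methods compared above (bicubic through learned super-resolution), the crudest baseline is nearest-neighbour upscaling, sketched here on a list-of-rows grayscale image; this is an illustration, not one of the paper's evaluated methods:

```python
def upscale_nn(img, scale=2):
    """Nearest-neighbour upscaling of a 2-D grayscale image given as a
    list of rows: each pixel is repeated `scale` times in both axes."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(scale)]      # widen row
        out.extend(list(wide) for _ in range(scale))       # repeat rows
    return out
```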

[CV-87] Evaluating Image-Based Face and Eye Tracking with Event Cameras ECCV

链接: https://arxiv.org/abs/2408.10395
作者: Khadija Iddrisu,Waseem Shariff,Noel E. O'Connor,Joseph Lemley,Suzanne Little
关键词-EN: producing asynchronously generated, Neuromorphic sensors, generated data termed, asynchronously generated data, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted at The Workshop On Neuromorphic Vision: Advantages and Applications of Event Cameras at the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Event cameras, also known as neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed "events". This distinct data format mitigates common issues observed in conventional cameras, like under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often necessitates the development of specialized, handcrafted event representations that can integrate seamlessly with conventional Convolutional Neural Networks (CNNs), considering the unique attributes of event data. In this study, we evaluate event-based face and eye tracking. The core objective of our study is to showcase the viability of integrating conventional algorithms with event-based data, transformed into a frame format while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen Dataset. We assess its utility for face and eye detection tasks through the application of GR-YOLO, a pioneering technique derived from YOLOv3. This evaluation includes a comparative analysis with results derived from training the dataset with YOLOv8. Subsequently, the trained models were tested on real event streams from various iterations of Prophesee's event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset show good prediction performance across all validation datasets, with a best mean Average Precision score of 0.91. Additionally, the trained models demonstrated robust performance on real event camera data under varying light conditions.
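Simulating events between RGB frames, as done above to build the dataset, typically means thresholding the log-intensity change per pixel. A toy sketch of that idea; the contrast threshold and function name are illustrative assumptions, not the simulator the authors used:

```python
import math

def simulate_events(frame0, frame1, threshold=0.2):
    """Emit (row, col, polarity) events wherever the log intensity
    changes by more than the contrast threshold between two frames."""
    events = []
    for i, (row0, row1) in enumerate(zip(frame0, frame1)):
        for j, (a, b) in enumerate(zip(row0, row1)):
            delta = math.log(b + 1e-6) - math.log(a + 1e-6)
            if abs(delta) >= threshold:
                events.append((i, j, 1 if delta > 0 else -1))
    return events
```

Accumulating such events over a fixed time window into a 2D histogram is one common way to produce the frame-format representations that conventional detectors like YOLO can consume.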

[CV-88] Narrowing the Gap between Vision and Action in Navigation

链接: https://arxiv.org/abs/2408.10388
作者: Yue Zhang,Parisa Kordjamshidi
关键词-EN: Vision and Language, methods for Vision, Language Navigation, Continuous Environment, commonly incorporate
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves navigation performance significantly compared to direct training using low-level actions. However, the VLN-CE agents are still far from real robots, since there are gaps between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within the low-level action movements. Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passability, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. Moreover, we enhance the current waypoint predictor by utilizing visual representations containing rich semantic information and explicitly masking obstacles based on humans' prior knowledge about the feasibility of actions. Empirically, our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.

[CV-89] HaSPeR: An Image Repository for Hand Shadow Puppet Recognition

链接: https://arxiv.org/abs/2408.10360
作者: Syed Rifat Raiyan,Zibran Zarif Amio,Sabbir Ahmed
关键词-EN: Hand shadow puppetry, Hand shadow, living creatures, hand shadow puppets, hand shadow puppeteer
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted to IEEE Transactions on Artificial Intelligence (IEEE TAI), 11 pages, 78 figures, 2 tables

点击查看摘要

Abstract:Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of theatrical art and storytelling where hand shadows are projected onto flat surfaces to create illusions of living creatures. The skilled performers create these silhouettes by hand positioning, finger movements, and dexterous gestures to resemble shadows of animals and objects. Due to the lack of practitioners and a seismic shift in people's entertainment standards, this art form is on the verge of extinction. To facilitate its preservation and proliferate it to a wider audience, we introduce HaSPeR, a novel dataset consisting of 8,340 images of hand shadow puppets across 11 classes extracted from both professional and amateur hand shadow puppeteer clips. We provide a detailed statistical analysis of the dataset and employ a range of pretrained image classification models to establish baselines. Our findings show a substantial performance superiority of traditional convolutional models over attention-based transformer architectures. We also find that lightweight models, such as MobileNetV2, suited for mobile applications and embedded devices, perform comparatively well. We surmise that such low-latency architectures can be useful in developing ombromanie teaching tools, and we create a prototype application to explore this supposition. Keeping the best-performing model InceptionV3 under the limelight, we conduct comprehensive feature-spatial, explainability, and error analyses to gain insights into its decision-making process. To the best of our knowledge, this is the first documented dataset and research endeavor to preserve this dying art for future generations with computer vision approaches. Our code and data are publicly available.

[CV-90] Diversity and stylization of the contemporary user-generated visual arts in the complexity-entropy plane

链接: https://arxiv.org/abs/2408.10356
作者: Seunghwan Kim,Byunghwee Lee,Wonjae Lee
关键词-EN: analyzing art historiographical, art historiographical narratives, advent of computational, computational and numerical, numerical methods
类目: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an); Physics and Society (physics.soc-ph)
*备注: 18 pages, 3 figures, 1 table, SI(4 figures, 3 tables)

点击查看摘要

Abstract:The advent of computational and numerical methods in recent times has provided new avenues for analyzing art historiographical narratives and tracing the evolution of art styles therein. Here, we investigate an evolutionary process underpinning the emergence and stylization of contemporary user-generated visual art styles using the complexity-entropy (C-H) plane, which quantifies local structures in paintings. Informatizing 149,780 images curated in DeviantArt and Behance platforms from 2010 to 2020, we analyze the relationship between local information of the C-H space and multi-level image features generated by a deep neural network and a feature extraction algorithm. The results reveal significant statistical relationships between the C-H information of visual artistic styles and the dissimilarities of the multi-level image features over time within groups of artworks. By disclosing a particular C-H region where the diversity of image representations is noticeably manifested, our analyses reveal an empirical condition of emerging styles that are both novel in the C-H plane and characterized by greater stylistic diversity. Our research shows that visual art analyses combined with physics-inspired methodologies and machine learning, can provide macroscopic insights into quantitatively mapping relevant characteristics of an evolutionary process underpinning the creative stylization of uncharted visual arts of given groups and time.
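The complexity-entropy (C-H) plane used above places each artwork by its normalized permutation entropy H and statistical complexity C. A 1-D illustration of the underlying Bandt-Pompe ordinal-pattern construction follows; the paper applies the idea to local 2-D image structure, so treat this as a sketch of the machinery only:

```python
import math
from itertools import permutations
from collections import Counter

def ch_plane(series, d=3):
    """Return (H, C) for a 1-D series: normalized permutation entropy
    and Jensen-Shannon statistical complexity over ordinal patterns
    of embedding dimension d."""
    n_pat = math.factorial(d)
    counts = Counter()
    for i in range(len(series) - d + 1):
        window = series[i:i + d]
        counts[tuple(sorted(range(d), key=lambda k: window[k]))] += 1
    total = sum(counts.values())
    p = [counts.get(pat, 0) / total for pat in permutations(range(d))]

    def shannon(dist):  # unnormalized Shannon entropy (nats)
        return -sum(q * math.log(q) for q in dist if q > 0)

    h = shannon(p) / math.log(n_pat)          # normalized entropy H
    u = [1.0 / n_pat] * n_pat                 # uniform reference
    js = shannon([(a + b) / 2 for a, b in zip(p, u)]) \
         - shannon(p) / 2 - shannon(u) / 2    # Jensen-Shannon divergence
    q0 = -2.0 / (((n_pat + 1) / n_pat) * math.log(n_pat + 1)
                 - 2 * math.log(2 * n_pat) + math.log(n_pat))
    return h, q0 * js * h
```

A perfectly ordered signal lands at (H, C) = (0, 0); richly structured signals occupy intermediate-H, higher-C regions, which is the kind of region the study associates with emerging, stylistically diverse artworks.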

[CV-91] AIR: Analytic Imbalance Rectifier for Continual Learning

链接: https://arxiv.org/abs/2408.10349
作者: Di Fang,Yinan Zhu,Runze Fang,Cen Chen,Ziqian Zeng,Huiping Zhuang
关键词-EN: Continual learning enables, generalized CIL scenarios, Continual learning, sequentially without retraining, CIL scenarios
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Continual learning enables AI models to learn new data sequentially without retraining in real-world scenarios. Most existing methods assume the training data are balanced, aiming to reduce the catastrophic forgetting problem that models tend to forget previously generated data. However, data imbalance and the mixture of new and old data in real-world scenarios lead the model to ignore categories with fewer training samples. To solve this problem, we propose an analytic imbalance rectifier algorithm (AIR), a novel online exemplar-free continual learning method with an analytic (i.e., closed-form) solution for data-imbalanced class-incremental learning (CIL) and generalized CIL scenarios in real-world continual learning. AIR introduces an analytic re-weighting module (ARM) that calculates a re-weighting factor for each class for the loss function to balance the contribution of each category to the overall loss and solve the problem of imbalanced training data. AIR uses the least squares technique to give a non-discriminatory optimal classifier and its iterative update method in continual learning. Experimental results on multiple datasets show that AIR significantly outperforms existing methods in long-tailed and generalized CIL scenarios. The source code is available at this https URL.
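The idea of the analytic re-weighting module, giving each class a loss weight that offsets its sample count, can be illustrated with simple inverse-frequency weights. This is a generic sketch; AIR's actual ARM formula inside its closed-form least-squares solution may differ:

```python
def class_reweights(counts):
    """Inverse-frequency re-weighting factors per class, normalized so
    the factors average to 1 (illustrative, not the paper's exact ARM)."""
    inv = {c: 1.0 / n for c, n in counts.items()}
    scale = len(inv) / sum(inv.values())
    return {c: w * scale for c, w in inv.items()}

weights = class_reweights({0: 90, 1: 10})  # a 9:1 imbalanced split
```

Here the minority class receives a 9x larger factor than the majority class, so each category contributes comparably to the overall loss.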

[CV-92] Optical Music Recognition in Manuscripts from the Ricordi Archive

链接: https://arxiv.org/abs/2408.10260
作者: Federico Simonetta,Rishav Mondal,Luca Andrea Ludovico,Stavros Ntalampiras
关键词-EN: Verdi and Puccini, renowned opera composers, significant musical manuscripts, Ricordi archive, prestigious collection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
*备注: Accepted at AudioMostly 2024

点击查看摘要

Abstract:The Ricordi archive, a prestigious collection of significant musical manuscripts from renowned opera composers such as Donizetti, Verdi and Puccini, has been digitized. This process has allowed us to automatically extract samples that represent various musical elements depicted on the manuscripts, including notes, staves, clefs, erasures, and composer’s annotations, among others. To distinguish between digitization noise and actual music elements, a subset of these images was meticulously grouped and labeled by multiple individuals into several classes. After assessing the consistency of the annotations, we trained multiple neural network-based classifiers to differentiate between the identified music elements. The primary objective of this study was to evaluate the reliability of these classifiers, with the ultimate goal of using them for the automatic categorization of the remaining unannotated data set. The dataset, complemented by manual annotations, models, and source code used in these experiments are publicly accessible for replication purposes.

[CV-93] NeRF-US: Removing Ultrasound Imaging Artifacts from Neural Radiance Fields in the Wild

链接: https://arxiv.org/abs/2408.10258
作者: Rishit Dagli,Atsuhiro Hibi,Rahul G. Krishnan,Pascal N. Tyrrell
关键词-EN: face severe artifacts, view synthesis, face severe, current approaches differ, training NeRF-based approaches
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current methods for performing 3D reconstruction and novel view synthesis (NVS) in ultrasound imaging data often face severe artifacts when training NeRF-based approaches. The artifacts produced by current approaches differ from NeRF floaters in general scenes because of the unique nature of ultrasound capture. Furthermore, existing models fail to produce reasonable 3D reconstructions when ultrasound data is captured or obtained casually in uncontrolled environments, which is common in clinical settings. Consequently, existing reconstruction and NVS methods struggle to handle ultrasound motion, fail to capture intricate details, and cannot model transparent and reflective surfaces. In this work, we introduced NeRF-US, which incorporates 3D-geometry guidance for border probability and scattering density into NeRF training, while also utilizing ultrasound-specific rendering over traditional volume rendering. These 3D priors are learned through a diffusion model. Through experiments conducted on our new “Ultrasound in the Wild” dataset, we observed accurate, clinically plausible, artifact-free reconstructions.

[CV-94] Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to-Emotional-Caption Translation Network using Visual-Caption Pairs

链接: https://arxiv.org/abs/2408.10248
作者: Ananya Pandey,Dinesh Kumar Vishwakarma
关键词-EN: natural language processing, multimodal sentiment recognition, Multimodal Sentiment Analysis, multimodal sentiment, Target-Dependent Multimodal Sentiment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The natural language processing and multimedia field has seen a notable surge in interest in multimodal sentiment recognition. Hence, this study aims to employ Target-Dependent Multimodal Sentiment Analysis (TDMSA) to identify the level of sentiment associated with every target (aspect) stated within a multimodal post consisting of a visual-caption pair. Despite the recent advancements in multimodal sentiment recognition, there has been a lack of explicit incorporation of emotional clues from the visual modality, specifically those pertaining to facial expressions. The challenge at hand is to proficiently obtain visual and emotional clues and subsequently synchronise them with the textual content. In light of this fact, this study presents a novel approach called the Visual-to-Emotional-Caption Translation Network (VECTN) technique. The primary objective of this strategy is to effectively acquire visual sentiment clues by analysing facial expressions. Additionally, it effectively aligns and blends the obtained emotional clues with the target attribute of the caption mode. The experimental findings demonstrate that our methodology is capable of producing ground-breaking outcomes when applied to two publicly accessible multimodal Twitter datasets, namely, Twitter-2015 and Twitter-2017. The experimental results show that the suggested model achieves an accuracy of 81.23% and a macro-F1 of 80.61% on the Twitter-15 dataset, while 77.42% and 75.19% on the Twitter-17 dataset, respectively. The observed improvement in performance reveals that our model is better than others when it comes to collecting target-level sentiment in multimodal data using the expressions of the face.

[CV-95] VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual Acoustic and Glossary Features

链接: https://arxiv.org/abs/2408.10246
作者: Ananya Pandey,Dinesh Kumar Vishwakarma
关键词-EN: frequently convey sarcasm, sarcasm recognition, Multi-modal Sarcasm Recognition, non-linguistic clues, tone of voice
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Various linguistic and non-linguistic clues, such as excessive emphasis on a word, a shift in the tone of voice, or an awkward expression, frequently convey sarcasm. The computer vision problem of sarcasm recognition in conversation aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue. Previously, sarcasm recognition focused mainly on text. Still, it is critical to consider all textual information, the audio stream, facial expression, and body position for reliable sarcasm identification. Hence, we propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data and an attentional-tokenizer-based strategy to extract the most critical context-specific information from the textual data. The following is a list of the key contributions that our experimentation has made in response to performing the task of Multi-modal Sarcasm Recognition: an attentional tokenizer branch to get beneficial features from the glossary content provided by the subtitles; a visual branch for acquiring the most prominent features from the video frames; an utterance-level feature extraction from acoustic content; and a multi-headed attention based feature fusion branch to blend features obtained from multiple modalities. Extensive testing on one of the benchmark video datasets, MUStARD, yielded an accuracy of 79.86% for speaker-dependent and 76.94% for speaker-independent configurations, demonstrating that our approach is superior to the existing methods. We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset, MUStARD++.

[CV-96] AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People

链接: https://arxiv.org/abs/2408.10240
作者: Seonghee Lee,Maho Kohga,Steve Landau,Sile O’Modhrain,Hari Subramonyam
关键词-EN: structural information, impairments often struggle, content that relies, relies heavily, conveying spatial
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:People with visual impairments often struggle to create content that relies heavily on visual elements, particularly when conveying spatial and structural information. Existing accessible drawing tools, which construct images line by line, are suitable for simple tasks like math but not for more expressive artwork. On the other hand, emerging generative AI-based text-to-image tools can produce expressive illustrations from descriptions in natural language, but they lack precise control over image composition and properties. To address this gap, our work integrates generative AI with a constructive approach that provides users with enhanced control and editing capabilities. Our system, AltCanvas, features a tile-based interface enabling users to construct visual scenes incrementally, with each tile representing an object within the scene. Users can add, edit, move, and arrange objects while receiving speech and audio feedback. Once completed, the scene can be rendered as a color illustration or as a vector for tactile graphic generation. Involving 14 blind or low-vision users in design and evaluation, we found that participants effectively used the AltCanvas workflow to create illustrations.

[CV-97] A Comprehensive Survey on Diffusion Models and Their Applications

链接: https://arxiv.org/abs/2408.10207
作者: Md Manjurul Ahsan,Shivakumar Raman,Yingtao Liu,Zahed Siddique
关键词-EN: create realistic samples, Diffusion Models, gradually adding, noise from data, diffusion process
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Models are probabilistic models that create realistic samples by simulating the diffusion process, gradually adding and removing noise from data. These models have gained popularity in domains such as image processing, speech synthesis, and natural language processing due to their ability to produce high-quality samples. As Diffusion Models are being adopted in various domains, existing literature reviews that often focus on specific areas like computer vision or medical imaging may not serve a broader audience across multiple fields. Therefore, this review presents a comprehensive overview of Diffusion Models, covering their theoretical foundations and algorithmic innovations. We highlight their applications in diverse areas such as media quality, authenticity, synthesis, image transformation, healthcare, and more. By consolidating current knowledge and identifying emerging trends, this review aims to facilitate a deeper understanding and broader adoption of Diffusion Models and provide guidelines for future researchers and practitioners across diverse disciplines.
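The forward (noising) process the survey describes has a well-known closed form: given a variance schedule beta_t, the sample at step t is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with alpha_bar_t the cumulative product of (1 - beta_s). A minimal sketch:

```python
import math

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t) over a noise schedule."""
    a = 1.0
    for b in betas:
        a *= 1.0 - b
    return a

def diffuse(x0, eps, a_bar):
    """Closed-form forward diffusion step:
    x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps (element-wise)."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]
```

Training a diffusion model then amounts to teaching a network to predict eps from x_t and t, so the process can be run in reverse to remove the noise.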

[CV-98] NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

链接: https://arxiv.org/abs/2408.10161
作者: Zhiyong Zhang,Aniket Gupta,Huaizu Jiang,Hanumant Singh
关键词-EN: Real-time high-accuracy optical, Real-time high-accuracy, high-accuracy optical flow, optical flow estimation, optical flow
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Real-time high-accuracy optical flow estimation is crucial for various real-world applications. While recent learning-based optical flow methods have achieved high accuracy, they often come with significant computational costs. In this paper, we propose a highly efficient optical flow method that balances high accuracy with reduced computational demands. Building upon NeuFlow v1, we introduce new components including a much more lightweight backbone and a fast refinement module. Both of these modules help keep the computational demands light while providing close to state-of-the-art accuracy. Compared to other state-of-the-art methods, our model achieves a 10x-70x speedup while maintaining comparable performance on both synthetic and real-world data. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at this https URL.

[CV-99] EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models

链接: https://arxiv.org/abs/2311.12066
作者: Ruoxi Chen,Haibo Jin,Yixin Liu,Jinyin Chen,Haohan Wang,Lichao Sun
关键词-EN: producing creative content, evolutionary for producing, producing creative, creative content, diffusion models
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models have emerged as a powerful tool for producing creative content in image synthesis. Based on the impressive generation abilities of these models, instruction-guided diffusion models can edit images with simple instructions and input images. While they empower users to obtain their desired edited images with ease, they have raised concerns about unauthorized image manipulation. Prior research has delved into the unauthorized use of personalized diffusion models; however, this problem for instruction-guided diffusion models remains largely unexplored. In this paper, we first propose a protection method, EditShield, against unauthorized modifications from such models. Specifically, EditShield works by adding imperceptible perturbations that can shift the latent representation used in the diffusion process, tricking models into generating unrealistic images with mismatched subjects. Our extensive experiments demonstrate EditShield's effectiveness on synthetic and real-world datasets. Besides, we found that EditShield performs robustly against various manipulation settings across editing types and synonymous instruction phrases.

[CV-100] Fight Perturbations with Perturbations: Defending Adversarial Attacks via Neuron Influence

链接: https://arxiv.org/abs/2112.13060
作者: Ruoxi Chen,Haibo Jin,Haibin Zheng,Jinyin Chen,Zhenguang Liu
关键词-EN: attracted increasing attention, deep learning models, increasing attention, security-critical domains, vulnerabilities of deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Final version. Accepted to IEEE Transactions on Dependable and Secure Computing

点击查看摘要

Abstract:The vulnerabilities of deep learning models towards adversarial attacks have attracted increasing attention, especially when models are deployed in security-critical domains. Numerous defense methods, including reactive and proactive ones, have been proposed for model robustness improvement. Reactive defenses, such as conducting transformations to remove perturbations, usually fail to handle large perturbations. The proactive defenses that involve retraining, suffer from the attack dependency and high computation cost. In this paper, we consider defense methods from the general effect of adversarial attacks that take on neurons inside the model. We introduce the concept of neuron influence, which can quantitatively measure neurons’ contribution to correct classification. Then, we observe that almost all attacks fool the model by suppressing neurons with larger influence and enhancing those with smaller influence. Based on this, we propose \emphNeuron-level Inverse Perturbation (NIP), a novel defense against general adversarial attacks. It calculates neuron influence from benign examples and then modifies input examples by generating inverse perturbations that can in turn strengthen neurons with larger influence and weaken those with smaller influence.
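One simplified reading of the neuron-influence idea above, for a single hidden layer, is the mean benign activation of each neuron multiplied by its weight into the true-class logit; NIP would then strengthen high-influence neurons via an inverse input perturbation. The exact definition in the paper may differ, so the sketch below is illustrative only:

```python
def neuron_influence(activations, w_true):
    """Influence of each hidden neuron on the correct class: mean
    activation over benign examples times its weight to the true-class
    logit (a simplified stand-in for the paper's measure)."""
    n, m = len(w_true), len(activations)
    means = [sum(a[j] for a in activations) / m for j in range(n)]
    return [mu * w for mu, w in zip(means, w_true)]
```

Under this measure, the paper's observation translates to: adversarial inputs tend to suppress neurons with large influence values, so an inverse perturbation that re-amplifies them counteracts the attack.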

[CV-101] Denoising Plane Wave Ultrasound Images Using Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2408.10987
作者: Hojat Asgariandehkordi,Sobhan Goudarzi,Mostafa Sharifzadeh,Adrian Basarab,Hassan Rivaz
关键词-EN: frame-rate ultrasound imaging, high frame-rate ultrasound, enables high frame-rate, high frame-rate imaging, Ultrasound plane wave
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ultrasound plane wave imaging is a cutting-edge technique that enables high frame-rate imaging. However, one challenge associated with high frame-rate ultrasound imaging is the high noise associated with it, hindering its wider adoption. Therefore, the development of a denoising method becomes imperative to augment the quality of plane wave images. Drawing inspiration from Denoising Diffusion Probabilistic Models (DDPMs), our proposed solution aims to enhance plane wave image quality. Specifically, the method considers the distinction between low-angle and high-angle compounding plane waves as noise and effectively eliminates it by adapting a DDPM to beamformed radiofrequency (RF) data. The method underwent training using only 400 simulated images. In addition, our approach employs natural image segmentation masks as intensity maps for the generated images, resulting in accurate denoising for various anatomy shapes. The proposed method was assessed across simulation, phantom, and in vivo images. The evaluations indicate that our approach enhances image quality not only on simulated data but also on phantom and in vivo data. Comparative analysis with other methods underscores the superiority of our proposed method across various evaluation metrics. The source code and trained model will be released along with the dataset at: this http URL

[CV-102] ISLES24: Improving final infarct prediction in ischemic stroke using multimodal imaging and clinical data

链接: https://arxiv.org/abs/2408.10966
作者: Ezequiel de la Rosa,Ruisheng Su,Mauricio Reyes,Roland Wiest,Evamaria O. Riedel,Florian Kofler,Kaiyuan Yang,Hakim Baazaoui,David Robben,Susanne Wegener,Jan S. Kirschke,Benedikt Wiestler,Bjoern Menze
关键词-EN: Accurate estimation, irreversibly damaged tissue, stroke treatment decisions, ischemic stroke treatment, irreversibly damaged
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate estimation of core (irreversibly damaged tissue) and penumbra (salvageable tissue) volumes is essential for ischemic stroke treatment decisions. Perfusion CT, the clinical standard, estimates these volumes but is affected by variations in deconvolution algorithms, implementations, and thresholds. Core tissue expands over time, with growth rates influenced by thrombus location, collateral circulation, and inherent patient-specific factors. Understanding this tissue growth is crucial for determining the need to transfer patients to comprehensive stroke centers, predicting the benefits of additional reperfusion attempts during mechanical thrombectomy, and forecasting final clinical outcomes. This work presents the ISLES'24 challenge, which addresses final post-treatment stroke infarct prediction from pre-interventional acute stroke imaging and clinical data. ISLES'24 establishes a unique 360-degree setting where all feasibly accessible clinical data are available for participants, including full CT acute stroke imaging, sub-acute follow-up MRI, and clinical tabular data. The contributions of this work are two-fold: first, we introduce a standardized benchmarking of final stroke infarct segmentation algorithms through the ISLES'24 challenge; second, we provide insights into infarct segmentation using multimodal imaging and clinical data strategies by identifying outperforming methods on a finely curated dataset. The outputs of this challenge are anticipated to enhance clinical decision-making and improve patient outcome predictions. All ISLES'24 materials, including data, performance evaluation scripts, and leading algorithmic strategies, are available to the research community following this https URL.

[CV-103] Radio U-Net: a convolutional neural network to detect diffuse radio sources in galaxy clusters and beyond

链接: https://arxiv.org/abs/2408.10871
作者: Chiara Stuardi,Claudio Gheller,Franco Vazza,Andrea Botteon
关键词-EN: telescope arrays promises, arrays promises significant, radio telescope arrays, radio, promises significant advancements
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by MNRAS, 16 pages, 9 figures, 2 tables

点击查看摘要

Abstract:The forthcoming generation of radio telescope arrays promises significant advancements in sensitivity and resolution, enabling the identification and characterization of many new faint and diffuse radio sources. Conventional manual cataloging methodologies are anticipated to be insufficient to exploit the capabilities of new radio surveys. Radio interferometric images of diffuse sources present a challenge for image segmentation tasks due to noise, artifacts, and embedded radio sources. In response to these challenges, we introduce Radio U-Net, a fully convolutional neural network based on the U-Net architecture. Radio U-Net is designed to detect faint and extended sources in radio surveys, such as radio halos, relics, and cosmic web filaments. Radio U-Net was trained on synthetic radio observations built upon cosmological simulations and then tested on a sample of galaxy clusters, where the detection of cluster diffuse radio sources relied on customized data reduction and visual inspection of LOFAR Two Metre Sky Survey (LoTSS) data. 83% of the clusters exhibiting diffuse radio emission were accurately identified, and the segmentation successfully recovered the morphology of the sources even in low-quality images. In a test sample comprising 246 galaxy clusters, we achieved a 73% accuracy rate in distinguishing between clusters with and without diffuse radio emission. Our results establish the applicability of Radio U-Net to extensive radio survey datasets, probing its efficiency on cutting-edge high-performance computing systems. This approach represents an advancement in optimizing the exploitation of forthcoming large radio surveys for scientific exploration.

[CV-104] MambaDS: Near-Surface Meteorological Field Downscaling with Topography Constrained Selective State Space Modeling

链接: https://arxiv.org/abs/2408.10854
作者: Zili Liu,Hao Chen,Lei Bai,Wenyuan Li,Wanli Ouyang,Zhengxia Zou,Zhenwei Shi
关键词-EN: fine-grained near-surface weather, frequent extreme weather, near-surface weather forecasts, obtaining precise, extreme weather
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In an era of frequent extreme weather and global warming, obtaining precise, fine-grained near-surface weather forecasts is increasingly essential for human activities. Downscaling (DS), a crucial task in meteorological forecasting, enables the reconstruction of high-resolution meteorological states for target regions from global-scale forecast results. Previous downscaling methods, inspired by CNN and Transformer-based super-resolution models, lacked tailored designs for meteorology and encountered structural limitations. Notably, they failed to efficiently integrate topography, a crucial prior in the downscaling process. In this paper, we address these limitations by pioneering the selective state space model into the meteorological field downscaling and propose a novel model called MambaDS. This model enhances the utilization of multivariable correlations and topography information, unique challenges in the downscaling process while retaining the advantages of Mamba in long-range dependency modeling and linear computational complexity. Through extensive experiments in both China mainland and the continental United States (CONUS), we validated that our proposed MambaDS achieves state-of-the-art results in three different types of meteorological field downscaling settings. We will release the code subsequently.

[CV-105] CO2Wounds-V2: Extended Chronic Wounds Dataset From Leprosy Patients ICIP2024

链接: https://arxiv.org/abs/2408.10827
作者: Karen Sanchez,Carlos Hinojosa,Olinto Mieles,Chen Zhao,Bernard Ghanem,Henry Arguello
关键词-EN: Chronic wounds pose, health concern globally, ongoing health concern, Chronic wounds, concern globally
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 2024 IEEE International Conference on Image Processing (ICIP 2024)

点击查看摘要

Abstract:Chronic wounds pose an ongoing health concern globally, largely due to the prevalence of conditions such as diabetes and leprosy. The standard method of monitoring these wounds involves visual inspection by healthcare professionals, a practice that could present challenges for patients in remote areas with inadequate transportation and healthcare infrastructure. This has led to the development of algorithms designed for the analysis and follow-up of wound images, which perform image-processing tasks such as classification, detection, and segmentation. However, the effectiveness of these algorithms heavily depends on the availability of comprehensive and varied wound image data, which is usually scarce. This paper introduces the CO2Wounds-V2 dataset, an extended collection of RGB wound images from leprosy patients with their corresponding semantic segmentation annotations, aiming to enhance the development and testing of image-processing algorithms in the medical field.

[CV-106] Classification of Endoscopy and Video Capsule Images using CNN-Transformer Model

链接: https://arxiv.org/abs/2408.10733
作者: Aliza Subedi,Smriti Regmi,Nisha Regmi,Bhumi Bhusal,Ulas Bagci,Debesh Jha
关键词-EN: computer-aided diagnosis systems, Convolutional Neural Networks, incidence and death, making it crucial, enhanced treatment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gastrointestinal cancer is a leading cause of cancer-related incidence and death, making it crucial to develop novel computer-aided diagnosis systems for early detection and enhanced treatment. Traditional approaches rely on the expertise of gastroenterologists to identify diseases; however, this process is subjective, and interpretation can vary even among expert clinicians. Considering recent advancements in classifying gastrointestinal anomalies and landmarks in endoscopic and video capsule endoscopy images, this study proposes a hybrid model that combines the advantages of Transformers and Convolutional Neural Networks (CNNs) to enhance classification performance. Our model utilizes DenseNet201 as a CNN branch to extract local features and integrates a Swin Transformer branch for global feature understanding, combining both to perform the classification task. For the GastroVision dataset, our proposed model demonstrates excellent performance with Precision, Recall, F1 score, Accuracy, and Matthews Correlation Coefficient (MCC) of 0.8320, 0.8386, 0.8324, 0.8386, and 0.8191, respectively, showcasing its robustness against class imbalance and surpassing other CNNs as well as the Swin Transformer model. Similarly, for the Kvasir-Capsule, a large video capsule endoscopy dataset, our model outperforms all others, achieving overall Precision, Recall, F1 score, Accuracy, and MCC of 0.7007, 0.7239, 0.6900, 0.7239, and 0.3871. Moreover, we generated saliency maps to explain our model’s focus areas, demonstrating its reliable decision-making process. The results underscore the potential of our hybrid CNN-Transformer model in aiding the early and accurate detection of gastrointestinal (GI) anomalies.
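
Among the metrics reported above, the Matthews Correlation Coefficient (MCC) is the least familiar; for the binary case it can be computed directly from confusion-matrix counts (the counts below are toy values, not the paper's, and the paper's task is multi-class):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Toy example: 90 true positives, 85 true negatives, 10 FP, 15 FN.
print(round(mcc(90, 85, 10, 15), 4))  # ≈ 0.751
```

Unlike accuracy, MCC stays near zero for a classifier that ignores a minority class, which is why it is a useful robustness check under class imbalance.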

[CV-107] deepmriprep: Voxel-based Morphometry (VBM) Preprocessing via Deep Neural Networks

链接: https://arxiv.org/abs/2408.10656
作者: Lukas Fisch,Nils R. Winter,Janik Goltermann,Carlotta Barkhau,Daniel Emden,Jan Ernsting,Maximilian Konowski,Ramona Leenings,Tiana Borgers,Kira Flinkenflügel,Dominik Grotegerd,Anna Kraus,Elisabeth J. Leehr,Susanne Meinert,Frederike Stein,Lea Teutenberg,Florian Thomas-Odenthal,Paula Usemann,Marco Hermesdorf,Hamidreza Jamalabadi,Andreas Jansen,Igor Nenadic,Benjamin Straube,Tilo Kircher,Klaus Berger,Benjamin Risse,Udo Dannlowski,Tim Hahn
关键词-EN: Voxel-based Morphometry, Magnetic Resonance Imaging, powerful approach, VBM, Resonance Imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Voxel-based Morphometry (VBM) has emerged as a powerful approach in neuroimaging research, utilized in over 7,000 studies since the year 2000. Using Magnetic Resonance Imaging (MRI) data, VBM assesses variations in the local density of brain tissue and examines its associations with biological and psychometric variables. Here, we present deepmriprep, a neural network-based pipeline that performs all necessary preprocessing steps for VBM analysis of T1-weighted MR images using deep neural networks. Utilizing the Graphics Processing Unit (GPU), deepmriprep is 37 times faster than CAT12, the leading VBM preprocessing toolbox. The proposed method matches CAT12 in accuracy for tissue segmentation and image registration across more than 100 datasets and shows strong correlations in VBM results. Tissue segmentation maps from deepmriprep have over 95% agreement with ground truth maps, and its non-linear registration, using supervised SYMNet, predicts smooth deformation fields comparable to CAT12. The high processing speed of deepmriprep enables rapid preprocessing of extensive datasets and thereby fosters the application of VBM analysis to large-scale neuroimaging studies and opens the door to real-time applications. Finally, deepmriprep's straightforward, modular design enables researchers to easily understand, reuse, and advance the underlying methods, fostering further advancements in neuroimaging research. deepmriprep can be conveniently installed as a Python package and is publicly accessible at this https URL.

[CV-108] Generating Multi-frame Ultrawide-field Fluorescein Angiography from Ultrawide-field Color Imaging Improves Diabetic Retinopathy Stratification

链接: https://arxiv.org/abs/2408.10636
作者: Ruoyu Chen,Kezheng Xu,Kangyan Zheng,Weiyi Zhang,Yan Lu,Danli Shi,Mingguang He
关键词-EN: Ultrawide-field fluorescein angiography, facilitates diabetic retinopathy, peripheral retinal lesions, Ultrawide-field fluorescein, UWF-FA images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 2 figures

点击查看摘要

Abstract:Ultrawide-field fluorescein angiography (UWF-FA) facilitates diabetic retinopathy (DR) detection by providing a clear visualization of peripheral retinal lesions. However, the intravenous dye injection with its potential risks hampers its application. We aim to acquire dye-free UWF-FA images from noninvasive UWF color fundus (UWF-CF) images using generative artificial intelligence (GenAI) and evaluate its effectiveness in DR screening. A total of 18,321 UWF-FA images of different phases were registered with corresponding UWF-CF images and fed into a generative adversarial network (GAN)-based model for training. The quality of generated UWF-FA images was evaluated through quantitative metrics and human evaluation. The DeepDRiD dataset was used to externally assess the contribution of generated UWF-FA images to DR classification, using area under the receiver operating characteristic curve (AUROC) as the outcome metric. The generated early, mid, and late phase UWF-FA images achieved high authenticity, with multi-scale similarity scores ranging from 0.70 to 0.91 and qualitative visual scores ranging from 1.64 to 1.98 (1 = real UWF-FA quality). In fifty randomly selected images, 56% to 76% of the generated images were difficult to distinguish from real images in the Turing test. Moreover, adding these generated UWF-FA images for DR classification significantly increased the AUROC from 0.869 to 0.904 compared to the baseline model using UWF-CF images (P < .001). The model successfully generates realistic multi-frame UWF-FA images without intravenous dye injection. The generated UWF-FA enhanced DR stratification.
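
The AUROC figures above can be understood through the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal sketch with toy scores (not the paper's data):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic (ties get half credit)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # toy classifier scores
labels = [1,   1,   0,   1,   0,   0]    # toy ground truth
print(round(auroc(scores, labels), 3))   # 8/9 ≈ 0.889
```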

[CV-109] Vision Calorimeter for Anti-neutron Reconstruction: A Baseline

链接: https://arxiv.org/abs/2408.10599
作者: Hongtian Yu,Yangu Li,Mingrui Wu,Letian Shen,Yue Liu,Yunxuan Song,Qixiang Ye,Xiaorui Lyu,Yajun Mao,Yangheng Zheng,Yunfan Liu
关键词-EN: bar, high-energy physics, governing principles, kinematic properties, important probe
类目: High Energy Physics - Experiment (hep-ex); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In high-energy physics, anti-neutrons ($\bar{n}$) are fundamental particles that frequently appear as final-state particles, and the reconstruction of their kinematic properties provides an important probe for understanding the governing principles. However, this confronts significant instrumental challenges with the electromagnetic calorimeter (EMC), a typical experimental sensor that recovers the information of the incident $\bar{n}$ only insufficiently. In this study, we introduce Vision Calorimeter (ViC), a baseline method for anti-neutron reconstruction that leverages deep learning detectors to analyze the implicit relationships between EMC responses and incident $\bar{n}$ characteristics. Our motivation lies in that the energy distributions of $\bar{n}$ samples deposited in the EMC cell arrays embody rich contextual information. Converted to 2-D images, such contextual energy distributions can be used to predict the status of $\bar{n}$ (i.e., incident position and momentum) through a deep learning detector along with pseudo bounding boxes and a specified training objective. Experimental results demonstrate that ViC substantially outperforms the conventional reconstruction approach, reducing the prediction error of incident position by 42.81% (from 17.31° to 9.90°). More importantly, this study for the first time realizes the measurement of incident $\bar{n}$ momentum, underscoring the potential of deep learning detectors for particle reconstruction. Code is available at this https URL.
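
The quoted 42.81% reduction is consistent with the two angular errors reported in the abstract:

```python
# Sanity check of the reported improvement: relative reduction in angular error.
before, after = 17.31, 9.90  # degrees, from the abstract
reduction = (before - after) / before
print(f"{reduction:.2%}")  # 42.81%
```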

[CV-110] A Tutorial on Explainable Image Classification for Dementia Stages Using Convolutional Neural Network and Gradient-weighted Class Activation Mapping

链接: https://arxiv.org/abs/2408.10572
作者: Kevin Kam Fung Yuen
关键词-EN: Convolutional Neural Network, Class Activation Mapping, Gradient-weighted Class Activation, MRI brain images, open MRI brain
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 11 figures, 3 tables

点击查看摘要

Abstract:This paper presents a tutorial of an explainable approach using Convolutional Neural Network (CNN) and Gradient-weighted Class Activation Mapping (Grad-CAM) to classify four progressive dementia stages based on open MRI brain images. The detailed implementation steps are demonstrated with an explanation. Whilst the proposed CNN architecture is demonstrated to achieve more than 99% accuracy for the test dataset, the computational procedure of CNN remains a black box. The visualisation based on Grad-CAM is attempted to explain such very high accuracy and may provide useful information for physicians. Future motivation based on this work is discussed.
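
Grad-CAM itself reduces to a few array operations: pool the class-score gradients over space to weight each channel's activation map, then ReLU the weighted sum. A NumPy sketch with toy tensors (shapes and values are illustrative only, not from the tutorial):

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from a conv layer's activations and the gradients of
    the class score w.r.t. those activations. Both arrays have shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # global-average-pool the gradients
    cam = np.tensordot(weights, activations, axes=1)  # channel-weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

rng = np.random.default_rng(0)
acts = rng.random((8, 4, 4))   # toy activations: 8 channels, 4x4 spatial map
grads = rng.random((8, 4, 4))  # toy gradients of the class score
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (4, 4): upsampled onto the MRI slice in practice
```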

[CV-111] Prompt Your Brain: Scaffold Prompt Tuning for Efficient Adaptation of fMRI Pre-trained Model MICCAI2024

链接: https://arxiv.org/abs/2408.10567
作者: Zijian Dong,Yilei Wu,Zijiao Chen,Yichi Zhang,Yueming Jin,Juan Helen Zhou
关键词-EN: magnetic resonance imaging, introduce Scaffold Prompt, large-scale functional magnetic, functional magnetic resonance, improved performance compared
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2024

点击查看摘要

Abstract:We introduce Scaffold Prompt Tuning (ScaPT), a novel prompt-based framework for adapting large-scale functional magnetic resonance imaging (fMRI) pre-trained models to downstream tasks, with high parameter efficiency and improved performance compared to fine-tuning and baselines for prompt tuning. The full fine-tuning updates all pre-trained parameters, which may distort the learned feature space and lead to overfitting with limited training data which is common in fMRI fields. In contrast, we design a hierarchical prompt structure that transfers the knowledge learned from high-resource tasks to low-resource ones. This structure, equipped with a Deeply-conditioned Input-Prompt (DIP) mapping module, allows for efficient adaptation by updating only 2% of the trainable parameters. The framework enhances semantic interpretability through attention mechanisms between inputs and prompts, and it clusters prompts in the latent space in alignment with prior knowledge. Experiments on public resting state fMRI datasets reveal ScaPT outperforms fine-tuning and multitask-based prompt tuning in neurodegenerative diseases diagnosis/prognosis and personality trait prediction, even with fewer than 20 participants. It highlights ScaPT’s efficiency in adapting pre-trained fMRI models to low-resource tasks.

[CV-112] Cervical Cancer Detection Using Multi-Branch Deep Learning Model

链接: https://arxiv.org/abs/2408.10498
作者: Tatsuhiro Baba,Abu Saleh Musa Miah,Jungpil Shin,Md. Al Mehedi Hasan
关键词-EN: High-risk HPV, young women diagnosis, women diagnosis rates, diagnosis rates soaring, infection of High-risk
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cervical cancer is a crucial global health concern for women; persistent infection with high-risk HPV, its main trigger, remains a global health challenge, with diagnosis rates among young women soaring from 10% to 40% over three decades. While Pap smear screening is a prevalent diagnostic method, visual image analysis can be lengthy and often leads to mistakes. Early detection of the disease can contribute significantly to improving patient outcomes. In recent decades, many researchers have applied machine learning and, more recently, deep-learning techniques to cervical cancer detection from medical images, achieving promising accuracy but still facing various challenges. This research proposes a novel approach to automate cervical cancer image classification using Multi-Head Self-Attention (MHSA) and convolutional neural networks (CNNs). The proposed method leverages the strengths of both MHSA mechanisms and CNNs to effectively capture both local and global features within cervical images in two streams. MHSA facilitates the model’s ability to focus on relevant regions of interest, while the CNN extracts hierarchical features that contribute to accurate classification. Finally, we combined the two stream features and fed them into the classification module to refine the features and perform the classification. To evaluate the performance of the proposed approach, we used the SIPaKMeD dataset, which classifies cervical cells into five categories. Our model achieved a remarkable accuracy of 98.522%. This performance demonstrates high recognition accuracy for medical image classification and holds promise for applicability in other medical image recognition tasks.

[CV-113] SDE-based Multiplicative Noise Removal

链接: https://arxiv.org/abs/2408.10283
作者: An Vuong,Thinh Nguyen
关键词-EN: synthetic aperture radar, commonly affects images, affects images produced, Multiplicative noise, commonly affects
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Multiplicative noise, also known as speckle or pepper noise, commonly affects images produced by synthetic aperture radar (SAR), lasers, or optical lenses. Unlike additive noise, which typically arises from thermal processes or external factors, multiplicative noise is inherent to the system, originating from the fluctuation in diffuse reflections. These fluctuations result in multiple copies of the same signal with varying magnitudes being combined. Consequently, despeckling, or removing multiplicative noise, necessitates different techniques compared to those used for additive noise removal. In this paper, we propose a novel approach using Stochastic Differential Equations based diffusion models to address multiplicative noise. We demonstrate that multiplicative noise can be effectively modeled as a Geometric Brownian Motion process in the logarithmic domain. Utilizing the Fokker-Planck equation, we derive the corresponding reverse process for image denoising. To validate our method, we conduct extensive experiments on two different datasets, comparing our approach to both classical signal processing techniques and contemporary CNN-based noise removal models. Our results indicate that the proposed method significantly outperforms existing methods on perception-based metrics such as FID and LPIPS, while maintaining competitive performance on traditional metrics like PSNR and SSIM.
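
The key modeling step, that multiplicative speckle becomes additive Gaussian noise in the logarithmic domain, is easy to check numerically (lognormal speckle is an assumption made for this toy check, not the paper's exact noise model):

```python
import numpy as np

rng = np.random.default_rng(42)
clean = np.full(100_000, 2.0)  # constant "image" intensity
speckle = rng.lognormal(mean=0.0, sigma=0.3, size=clean.shape)
noisy = clean * speckle        # multiplicative noise model

# In the log domain, the multiplicative noise becomes additive and Gaussian:
log_noise = np.log(noisy) - np.log(clean)
print(round(log_noise.mean(), 3), round(log_noise.std(), 3))  # ≈ 0.0 and ≈ 0.3
```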

[CV-114] AID-DTI: Accelerating High-fidelity Diffusion Tensor Imaging with Detail-preserving Model-based Deep Learning MICCAI2024

链接: https://arxiv.org/abs/2408.10236
作者: Wenxin Fan,Jian Cheng,Cheng Li,Jing Yang,Ruoyou Wu,Juan Zou,Shanshan Wang
关键词-EN: diffusion tensor imaging, shown great potential, accelerating diffusion tensor, textbf, Deep learning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 3 figures, MICCAI 2024 Workshop on Computational Diffusion MRI. arXiv admin note: text overlap with arXiv:2401.01693, arXiv:2405.03159

点击查看摘要

Abstract:Deep learning has shown great potential in accelerating diffusion tensor imaging (DTI). Nevertheless, existing methods tend to suffer from Rician noise and eddy current, leading to detail loss in reconstructing the DTI-derived parametric maps especially when sparsely sampled q-space data are used. To address this, this paper proposes a novel method, AID-DTI (Accelerating hIgh fiDelity Diffusion Tensor Imaging), to facilitate fast and accurate DTI with only six measurements. AID-DTI is equipped with a newly designed Singular Value Decomposition-based regularizer, which can effectively capture fine details while suppressing noise during network training by exploiting the correlation across DTI-derived parameters. Additionally, we introduce a Nesterov-based adaptive learning algorithm that optimizes the regularization parameter dynamically to enhance the performance. AID-DTI is an extendable framework capable of incorporating flexible network architecture. Experimental results on Human Connectome Project (HCP) data consistently demonstrate that the proposed method estimates DTI parameter maps with fine-grained details and outperforms other state-of-the-art methods both quantitatively and qualitatively.
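
The abstract does not spell out the regularizer's exact form; one plausible reading, sketched here purely for illustration, is a penalty on the trailing singular values of the stacked parameter maps, since correlated DTI-derived parameters concentrate in the leading singular values while noise spreads into the tail (the function name and construction below are ours, not the paper's):

```python
import numpy as np

def svd_tail_penalty(param_maps: np.ndarray, rank: int) -> float:
    """Illustrative SVD-based regularizer (a sketch, not the paper's exact term):
    stack parameter maps as rows and penalize singular values beyond `rank`,
    encouraging correlated (low-rank) maps while suppressing noise."""
    flat = param_maps.reshape(param_maps.shape[0], -1)  # (n_params, n_voxels)
    s = np.linalg.svd(flat, compute_uv=False)
    return float(np.sum(s[rank:]))

rng = np.random.default_rng(1)
base = rng.random((1, 64))
maps = np.vstack([base * k for k in (1.0, 2.0, 3.0)])  # perfectly correlated maps
print(svd_tail_penalty(maps, rank=1))  # rank-1 data: penalty ≈ 0
```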

机器学习

[LG-0] Accelerating Goal-Conditioned RL Algorithms and Research

链接: https://arxiv.org/abs/2408.11052
作者: Michał Bortkiewicz,Władek Pałucki,Vivek Myers,Tadeusz Dziarmaga,Tomasz Arczewski,Łukasz Kuciński,Benjamin Eysenbach
关键词-EN: transform reinforcement learning, reinforcement learning, paralleling the breakthroughs, potential to transform, areas of machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervision has the potential to transform reinforcement learning (RL), paralleling the breakthroughs it has enabled in other areas of machine learning. While self-supervised learning in other domains aims to find patterns in a fixed dataset, self-supervised goal-conditioned reinforcement learning (GCRL) agents discover new behaviors by learning from the goals achieved during unstructured interaction with the environment. However, these methods have failed to see similar success, both due to a lack of data from slow environments as well as a lack of stable algorithms. We take a step toward addressing both of these issues by releasing a high-performance codebase and benchmark JaxGCRL for self-supervised GCRL, enabling researchers to train agents for millions of environment steps in minutes on a single GPU. The key to this performance is a combination of GPU-accelerated environments and a stable, batched version of the contrastive reinforcement learning algorithm, based on an infoNCE objective, that effectively makes use of this increased data throughput. With this approach, we provide a foundation for future research in self-supervised GCRL, enabling researchers to quickly iterate on new ideas and evaluate them in a diverse set of challenging environments. Website + Code: this https URL
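
The infoNCE objective underlying the contrastive RL algorithm treats each state's own goal as the positive and the other goals in the batch as negatives. A NumPy sketch (the actual codebase is JAX-based; the embeddings and temperature here are toy assumptions):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(state_emb: np.ndarray, goal_emb: np.ndarray, temp: float = 0.1) -> float:
    """Batched infoNCE: each state's positive is the goal at the same batch
    index (the diagonal); all other goals in the batch act as negatives."""
    logits = l2_normalize(state_emb) @ l2_normalize(goal_emb).T / temp  # (B, B)
    logits -= logits.max(axis=1, keepdims=True)                        # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))                         # diagonal CE

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 8))
loss_aligned = info_nce(states, states)                  # goals match their states
loss_random = info_nce(states, rng.normal(size=(4, 8)))  # unrelated goals
print(round(loss_aligned, 3), round(loss_random, 3))
```

When state and goal embeddings agree, the diagonal dominates each row and the loss drops below the chance level of log(B).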

[LG-1] RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands

链接: https://arxiv.org/abs/2408.11048
作者: Yi Zhao,Le Chen,Jan Schneider,Quankai Gao,Juho Kannala,Bernhard Schölkopf,Joni Pajarinen,Dieter Büchler
关键词-EN: long-standing research goal, robot piano playing, robot piano, endow robot hands, piano playing
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Website: this https URL

点击查看摘要

Abstract:It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast while precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these methods struggle in a multi-song setting. Our work aims to close this gap and, thereby, enable imitation learning approaches for robot piano playing at scale. To this end, we introduce the Robot Piano 1 Million (RP1M) dataset, containing bi-manual robot piano playing motion data of more than one million trajectories. We formulate finger placements as an optimal transport problem, thus, enabling automatic annotation of vast amounts of unlabeled songs. Benchmarking existing imitation learning approaches shows that such approaches reach state-of-the-art robot piano playing performance by leveraging RP1M.
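
Framing finger placement as optimal transport has a particularly simple special case: for 1-D positions with equal counts and a convex cost such as |x − y|, the optimal plan just matches both sides in sorted order. A toy sketch (positions and names are hypothetical, not RP1M's actual formulation):

```python
def optimal_transport_1d(fingers, keys):
    """Minimal sketch: assign each finger to one key. For 1-D positions with
    equal counts and a convex cost, the optimal transport plan matches both
    lists in sorted order (a classical result; real piano OT is richer)."""
    f_idx = sorted(range(len(fingers)), key=lambda i: fingers[i])
    k_idx = sorted(range(len(keys)), key=lambda j: keys[j])
    return {f: k for f, k in zip(f_idx, k_idx)}

fingers = [0.3, 0.1, 0.7]  # hypothetical finger x-positions
keys = [0.65, 0.05, 0.35]  # hypothetical key x-positions
plan = optimal_transport_1d(fingers, keys)
print(plan)  # finger 1 -> key 1, finger 0 -> key 2, finger 2 -> key 0
```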

[LG-2] Atmospheric Transport Modeling of CO_2 with Neural Networks

链接: https://arxiv.org/abs/2408.11032
作者: Vitus Benson,Ana Bastos,Christian Reimers,Alexander J. Winkler,Fanny Yang,Markus Reichstein
关键词-EN: international climate agreements, greenhouse gas monitoring, verification support systems, Accurately describing, climate agreements
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: Code: this https URL

点击查看摘要

Abstract:Accurately describing the distribution of CO_2 in the atmosphere with atmospheric tracer transport models is essential for greenhouse gas monitoring and verification support systems to aid implementation of international climate agreements. Large deep neural networks are poised to revolutionize weather prediction, which requires 3D modeling of the atmosphere. While similar in this regard, atmospheric transport modeling is subject to new challenges. Both stable predictions for longer time horizons and mass conservation throughout need to be achieved, while IO plays a larger role compared to computational costs. In this study we explore four different deep neural networks (UNet, GraphCast, Spherical Fourier Neural Operator and SwinTransformer) which have proven state-of-the-art in weather prediction to assess their usefulness for atmospheric tracer transport modeling. For this, we assemble the CarbonBench dataset, a systematic benchmark tailored for machine learning emulators of Eulerian atmospheric transport. Through architectural adjustments, we decouple the performance of our emulators from the distribution shift caused by a steady rise in atmospheric CO_2. More specifically, we center CO_2 input fields to zero mean and then use an explicit flux scheme and a mass fixer to assure mass balance. This design enables stable and mass conserving transport for over 6 months with all four neural network architectures. In our study, the SwinTransformer displays particularly strong emulation skill (90-day R^2 > 0.99), with physically plausible emulation even for forward runs of multiple years. This work paves the way forward towards high resolution forward and inverse modeling of inert trace gases with neural networks.
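
The "mass fixer" mentioned above is, in its simplest common form, a multiplicative rescaling of the predicted tracer field to the correct total mass; the sketch below assumes that simple form, which may differ from the paper's exact scheme:

```python
import numpy as np

def mass_fixer(field: np.ndarray, target_mass: float) -> np.ndarray:
    """Multiplicative mass fixer (a common simple scheme, assumed here):
    rescale the predicted tracer field so its total matches the target mass."""
    return field * (target_mass / field.sum())

rng = np.random.default_rng(3)
co2 = rng.random((8, 16, 32)) + 400.0  # toy 3-D CO2 mixing-ratio field
drifted = co2 * 1.01                   # emulator output with 1% mass drift
fixed = mass_fixer(drifted, target_mass=co2.sum())
print(np.isclose(fixed.sum(), co2.sum()))  # True: total mass restored
```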

[LG-3] Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos ICASSP2024

链接: https://arxiv.org/abs/2408.10998
作者: Dennis Fedorishin,Lie Lu,Srirangaraj Setlur,Venu Govindaraju
关键词-EN: similar composition transition, composition transition fluidly, common video editing, video editing technique, audio match
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to ICASSP 2024

点击查看摘要

Abstract:A “match cut” is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create “audio match cuts” within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: this https URL

[LG-4] Wave-Mask/Mix: Exploring Wavelet-Based Augmentations for Time Series Forecasting

链接: https://arxiv.org/abs/2408.10951
作者: Dona Arabi,Jafar Bakhshaliyev,Ayse Coskuner,Kiran Madhusudhanan,Kami Serdar Uckardes
关键词-EN: improving machine learning, machine learning model, learning model performance, limited real-world data, discrete wavelet transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data augmentation is important for improving machine learning model performance when faced with limited real-world data. In time series forecasting (TSF), where accurate predictions are crucial in fields like finance, healthcare, and manufacturing, traditional augmentation methods for classification tasks are insufficient to maintain temporal coherence. This research introduces two augmentation approaches using the discrete wavelet transform (DWT) to adjust frequency elements while preserving temporal dependencies in time series data. Our methods, Wavelet Masking (WaveMask) and Wavelet Mixing (WaveMix), are evaluated against established baselines across various forecasting horizons. To the best of our knowledge, this is the first study to conduct extensive experiments on multivariate time series using Discrete Wavelet Transform as an augmentation technique. Experimental results demonstrate that our techniques achieve competitive results with previous methods. We also explore cold-start forecasting using downsampled training datasets, comparing outcomes to baseline methods.
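
The WaveMask idea, masking wavelet coefficients while preserving temporal structure, can be sketched with a single-level Haar DWT (the paper presumably uses standard multi-level DWTs; this simplified version, with our own function names and a toy series, only illustrates zeroing high-frequency detail coefficients while keeping the approximation):

```python
import numpy as np

def haar_dwt(x: np.ndarray):
    """One level of the Haar DWT: approximation and detail coefficients."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_idwt(approx: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Inverse of haar_dwt: perfect reconstruction for even-length inputs."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def wave_mask(series: np.ndarray, rng) -> np.ndarray:
    """WaveMask-style augmentation (simplified sketch): randomly zero some
    detail (high-frequency) coefficients, keep the approximation, invert."""
    approx, detail = haar_dwt(series)
    mask = rng.random(len(detail)) > 0.5
    return haar_idwt(approx, detail * mask)

rng = np.random.default_rng(7)
t = np.arange(64, dtype=float)
series = np.sin(t / 4) + 0.1 * rng.normal(size=64)  # toy noisy time series
aug = wave_mask(series, rng)
print(aug.shape, np.allclose(haar_idwt(*haar_dwt(series)), series))  # (64,) True
```

Because the approximation coefficients are untouched, the augmented series keeps the original's low-frequency temporal trend, which is the point of wavelet-based augmentation for forecasting.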

[LG-5] GAIM: Attacking Graph Neural Networks via Adversarial Influence Maximization

链接: https://arxiv.org/abs/2408.10948
作者: Xiaodong Yang,Xiaoting Li,Huiyuan Chen,Yiwei Cai
关键词-EN: Graph Neural Network, trained Graph Neural, mislead trained Graph, Neural Network, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent studies show that well-devised perturbations on graph structures or node features can mislead trained Graph Neural Network (GNN) models. However, these methods often overlook practical assumptions, over-rely on heuristics, or separate vital attack components. In response, we present GAIM, an integrated adversarial attack method conducted on a node feature basis while considering the strict black-box setting. Specifically, we define an adversarial influence function to theoretically assess the adversarial impact of node perturbations, thereby reframing the GNN attack problem into the adversarial influence maximization problem. In our approach, we unify the selection of the target node and the construction of feature perturbations into a single optimization problem, ensuring a unique and consistent feature perturbation for each target node. We leverage a surrogate model to transform this problem into a solvable linear programming task, streamlining the optimization process. Moreover, we extend our method to accommodate label-oriented attacks, broadening its applicability. Thorough evaluations on five benchmark datasets across three popular models underscore the effectiveness of our method in both untargeted and label-oriented targeted attacks. Through comprehensive analysis and ablation studies, we demonstrate the practical value and efficacy inherent to our design choices.
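The influence-maximization view can be illustrated with a deliberately simplified surrogate (the scoring rule and all names below are assumptions for illustration, not the paper's formulation): score each node by the magnitude of a linear surrogate model's response to its features, weight that by how far the node reaches through a row-normalized adjacency, and pick the budgeted top nodes.

```python
import numpy as np

def select_targets(A, X, w, budget):
    """Rank nodes by a crude linear surrogate of adversarial influence.

    A: (n, n) symmetric adjacency, X: (n, d) node features,
    w: (d,) surrogate model weights, budget: number of target nodes.
    """
    A_hat = A + np.eye(len(A))                 # add self-loops
    A_hat /= A_hat.sum(axis=1, keepdims=True)  # row-normalize (GCN-style)
    grad = np.abs(X @ w)                       # per-node surrogate sensitivity
    influence = grad * A_hat.sum(axis=0)       # sensitivity x one-hop reach
    return np.argsort(influence)[-budget:]

# toy star graph: node 0 is a hub connected to nodes 1..4
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1.0
X = np.ones((5, 3))
targets = select_targets(A, X, np.ones(3), budget=1)
```

With uniform features, the hub node dominates the influence score, which mirrors the intuition that perturbing well-connected nodes sways the most predictions.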

[LG-6] Robust Regression with Ensembles Communicating over Noisy Channels

链接: https://arxiv.org/abs/2408.10942
作者: Yuval Ben-Hur,Yuval Cassuto
关键词-EN: single computer system, machine-learning models grow, grow in size, computer system, machine-learning models
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:As machine-learning models grow in size, their implementation requirements cannot be met by a single computer system. This observation motivates distributed settings, in which intermediate computations are performed across a network of processing units, while the central node only aggregates their outputs. However, distributing inference tasks across low-precision or faulty edge devices, operating over a network of noisy communication channels, gives rise to serious reliability challenges. We study the problem of an ensemble of devices, implementing regression algorithms, that communicate through additive noisy channels in order to collaboratively perform a joint regression task. We define the problem formally, and develop methods for optimizing the aggregation coefficients for the parameters of the noise in the channels, which can potentially be correlated. Our results apply to the leading state-of-the-art ensemble regression methods: bagging and gradient boosting. We demonstrate the effectiveness of our algorithms on both synthetic and real-world datasets.
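A minimal sketch of the aggregation idea under strong simplifying assumptions (independent Gaussian channel noise, identical unbiased device predictions, no correlation): the MSE-optimal linear combination then reduces to inverse-variance weighting, which the paper generalizes to correlated noise and to bagging/boosting ensembles.

```python
import numpy as np

def aggregate(received, noise_var):
    """Combine noisy device outputs with inverse-variance weights.

    received: (n_devices, n_samples) predictions after the additive-noise channel.
    noise_var: per-device channel noise variances.
    """
    w = 1.0 / np.asarray(noise_var)
    w /= w.sum()
    return w @ received

rng = np.random.default_rng(0)
truth = np.sin(np.linspace(0, 3, 200))             # common regression target
noise_var = np.array([0.01, 0.04, 0.25])           # heterogeneous channels
received = truth + rng.normal(0, np.sqrt(noise_var)[:, None], (3, 200))

weighted = aggregate(received, noise_var)
uniform = received.mean(axis=0)
# inverse-variance weighting should beat a plain average on these channels
```

The gap between the two aggregates grows with the spread of the channel noise variances; with identical channels, both coincide.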

[LG-7] A Closer Look at Data Augmentation Strategies for Finetuning-Based Low/Few-Shot Object Detection

链接: https://arxiv.org/abs/2408.10940
作者: Vladislav Li,Georgios Tsoumplekas,Ilias Siniosoglou,Vasileios Argyriou,Anastasios Lytos,Eleftherios Fountoukidis,Panagiotis Sarigiannidis
关键词-EN: few-shot object detection, Current methods, detection have primarily, primarily focused, focused on enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Current methods for low- and few-shot object detection have primarily focused on enhancing model performance for detecting objects. One common approach to achieve this is by combining model finetuning with data augmentation strategies. However, little attention has been given to the energy efficiency of these approaches in data-scarce regimes. This paper conducts a comprehensive empirical study that examines both the model performance and energy efficiency of custom data augmentations and automated data augmentation selection strategies when combined with a lightweight object detector. The methods are evaluated on three different benchmark datasets in terms of their performance and energy consumption, and the Efficiency Factor is employed to gain insights into their effectiveness considering both performance and efficiency. Consequently, it is shown that in many cases, the performance gains of data augmentation strategies are overshadowed by their increased energy usage, necessitating the development of more energy-efficient data augmentation strategies to address data scarcity.

[LG-8] Conformalized Interval Arithmetic with Symmetric Calibration

链接: https://arxiv.org/abs/2408.10939
作者: Rui Luo,Zhixin Zhou
关键词-EN: Uncertainty quantification, essential in decision-making, variables are involved, quantification is essential, joint distributions
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Uncertainty quantification is essential in decision-making, especially when joint distributions of random variables are involved. While conformal prediction provides distribution-free prediction sets with valid coverage guarantees, it traditionally focuses on single predictions. This paper introduces novel conformal prediction methods for estimating the sum or average of unknown labels over specific index sets. We extend conformal prediction intervals for a single target to prediction intervals for the sum of multiple targets. Under permutation-invariance assumptions, we prove the validity of our proposed method. We also apply our algorithms to class average estimation and path cost prediction tasks, and show that our method outperforms existing conformalized approaches as well as non-conformal approaches.
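For intuition, here is the split-conformal baseline this line of work builds on (a sketch under standard exchangeability assumptions, not the paper's symmetric-calibration method): calibrate a quantile q of absolute residuals for single predictions, then form a conservative interval for a sum of m targets by scaling q by m.

```python
import numpy as np

def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(np.abs(residuals))[min(k, n) - 1]

rng = np.random.default_rng(0)
# toy setup: predictions are noisy versions of the true labels
y_cal = rng.normal(0, 1, 500)
pred_cal = y_cal + rng.normal(0, 0.3, 500)
q = conformal_quantile(pred_cal - y_cal, alpha=0.1)

# single-target interval: [pred - q, pred + q]
# naive interval for the SUM of m targets: scale the width by m
m = 5
pred_test = rng.normal(0, 1, m)
lo, hi = pred_test.sum() - m * q, pred_test.sum() + m * q
```

The m-scaled interval is valid but loose, since residuals rarely all align in the same direction; calibrating the sum's score directly, as the paper does, tightens it.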

[LG-9] The Evolution of Reinforcement Learning in Quantitative Finance

链接: https://arxiv.org/abs/2408.10932
作者: Nikolaos Pippas,Cagatay Turkay,Elliot A. Ludvig
关键词-EN: experienced significant advancement, Reinforcement Learning, past decade, prompting a growing, experienced significant
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: This work is currently submitted to and under-review for ACM Computing Surveys. This copy is an unedited, pre-print version and it is the author’s version of the work. I

点击查看摘要

Abstract:Reinforcement Learning (RL) has experienced significant advancement over the past decade, prompting a growing interest in applications within finance. This survey critically evaluates 167 publications, exploring diverse RL applications and frameworks in finance. Financial markets, marked by their complexity, multi-agent nature, information asymmetry, and inherent randomness, serve as an intriguing test-bed for RL. Traditional finance offers certain solutions, and RL advances these with a more dynamic approach, incorporating machine learning methods, including transfer learning, meta-learning, and multi-agent solutions. This survey dissects key RL components through the lens of Quantitative Finance. We uncover emerging themes, propose areas for future research, and critique the strengths and weaknesses of existing methods.

[LG-10] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

链接: https://arxiv.org/abs/2408.10920
作者: Róbert Csordás,Christopher Potts,Christopher D. Manning,Atticus Geiger
关键词-EN: Linear Representation Hypothesis, neural networks learn, Representation Hypothesis, LRH states, states that models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.

[LG-11] CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

链接: https://arxiv.org/abs/2408.10919
作者: Zijian Zhao,Tingwei Chen,Zhijie Cai,Hang Li,Xiaoyang Li,Qimei Chen,Guangxu Zhu
关键词-EN: garnered significant attention, significant attention due, low cost, recent years, numerous benefits
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain and cross-domain scenarios, including few-shot and zero-shot settings, and even works in the few-shot new-class scenario, where the testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In the gesture recognition task, our CrossFi achieves an accuracy of 98.17% in the in-domain scenario, 91.72% in the one-shot cross-domain scenario, 64.81% in the zero-shot cross-domain scenario, and 84.75% in the one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.
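The core idea of replacing a fixed distance with an attention-based similarity can be sketched as follows (the projection matrices and pooling below are illustrative assumptions; the actual CSi-Net learns these components end-to-end):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_similarity(f1, f2, Wq, Wk):
    """Score two feature sequences via cross-attention instead of a plain
    cosine distance between pooled vectors.

    f1, f2: (len, d) per-frame features; Wq, Wk: (d, d) learned projections.
    Returns a scalar similarity score.
    """
    attn = softmax((f1 @ Wq) @ (f2 @ Wk).T / np.sqrt(f1.shape[1]))
    aligned = attn @ f2                  # f2 re-aligned to f1's frames
    return float(np.mean(np.sum(f1 * aligned, axis=1)))
```

Unlike a cosine score, the attention map lets the network decide which frames of one sample should be compared to which frames of the other before measuring agreement.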

[LG-12] A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

链接: https://arxiv.org/abs/2408.10901
作者: Zhongliang Guo,Lei Fang,Jingyu Lin,Yifei Qian,Shuai Zhao,Zeyu Wang,Junhao Dong,Cunjian Chen,Ognjen Arandjelović,Chun Pong Lau
关键词-EN: Latent Diffusion Models, Latent Diffusion, Recent advancements, revolutionized image synthesis, Diffusion Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures, 10 tables

点击查看摘要

Abstract:Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raise concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techniques as a benign metric to prevent the underlying misuse of generative AI. Current approaches to safeguarding images from manipulation by LDMs are limited by their reliance on model-specific knowledge and their inability to significantly degrade the semantic quality of generated images. In response to these shortcomings, we propose the Posterior Collapse Attack (PCA), based on the observation that VAEs suffer from posterior collapse during training. Our method minimizes dependence on white-box information about target models, eliminating the implicit reliance on model-specific knowledge. By accessing merely a small subset of LDM parameters, specifically the VAE encoder, our method causes a substantial semantic collapse in generation quality, particularly in perceptual consistency, and demonstrates strong transferability across various model architectures. Experimental results show that PCA achieves superior perturbation effects on image generation of LDMs with lower runtime and VRAM. Our method outperforms existing techniques, offering a more robust and generalizable solution that helps alleviate the socio-technical challenges posed by the rapidly evolving landscape of generative AI.

[LG-13] DBHP: Trajectory Imputation in Multi-Agent Sports Using Derivative-Based Hybrid Prediction

链接: https://arxiv.org/abs/2408.10878
作者: Hanjun Choi,Hyunsung Kim,Minho Lee,Chang-Jo Kim,Jinsung Yoon,Sang-Ki Ko
关键词-EN: collected trajectory data, multi-agent trajectory data, spatiotemporal domains handle, domains handle multi-agent, handle multi-agent trajectory
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Many spatiotemporal domains handle multi-agent trajectory data, but in real-world scenarios, collected trajectory data are often partially missing due to various reasons. While existing approaches demonstrate good performance in trajectory imputation, they face challenges in capturing the complex dynamics and interactions between agents due to a lack of physical constraints that govern realistic trajectories, leading to suboptimal results. To address this issue, the paper proposes a Derivative-Based Hybrid Prediction (DBHP) framework that can effectively impute multiple agents’ missing trajectories. First, a neural network equipped with Set Transformers produces a naive prediction of missing trajectories while satisfying the permutation-equivariance in terms of the order of input agents. Then, the framework makes alternative predictions leveraging velocity and acceleration information and combines all the predictions with properly determined weights to provide final imputed trajectories. In this way, our proposed framework not only accurately predicts position, velocity, and acceleration values but also enforces the physical relationship between them, eventually improving both the accuracy and naturalness of the predicted trajectories. Accordingly, the experiment results about imputing player trajectories in team sports show that our framework significantly outperforms existing imputation baselines.
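A stripped-down sketch of the hybrid idea (the fixed blend weight and single derivative path here are simplifications; the framework learns per-frame weights and also uses acceleration): blend a direct position prediction with a dead-reckoning path obtained by integrating predicted velocity, so the result respects the physical relationship between position and its derivatives.

```python
import numpy as np

def hybrid_impute(pos_pred, vel_pred, last_pos, dt=0.1, w=0.5):
    """Blend a direct position prediction with a velocity-integrated one.

    pos_pred: (T, 2) naive position predictions for the missing segment.
    vel_pred: (T, 2) predicted velocities over the same segment.
    last_pos: (2,) last observed position before the gap.
    w: weight on the direct prediction (1 - w on the derivative-based path).
    """
    # dead-reckoning path: integrate predicted velocity from the last observation
    dap = last_pos + np.cumsum(vel_pred * dt, axis=0)
    return w * pos_pred + (1 - w) * dap
```

When the two predictions agree (e.g., constant-velocity motion), blending is lossless; when they disagree, the weight trades smoothness of the integrated path against accuracy of the direct one.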

[LG-14] Feature Selection from Differentially Private Correlations

链接: https://arxiv.org/abs/2408.10862
作者: Ryan Swope,Amol Khanna,Philip Doldo,Saptarshi Roy,Edward Raff
关键词-EN: Data scientists, scientists often seek, seek to identify, Data, feature selection
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: To appear in Proceedings of the 17th ACM Workshop on Artificial Intelligence and Security, 2024

点击查看摘要

Abstract:Data scientists often seek to identify the most important features in high-dimensional datasets. This can be done through L_1-regularized regression, but this can become inefficient for very high-dimensional datasets. Additionally, high-dimensional regression can leak information about individual datapoints in a dataset. In this paper, we empirically evaluate the established baseline method for feature selection with differential privacy, the two-stage selection technique, and show that it is not stable under sparsity. This makes it perform poorly on real-world datasets, so we consider a different approach to private feature selection. We employ a correlations-based order statistic to choose important features from a dataset and privatize them to ensure that the results do not leak information about individual datapoints. We find that our method significantly outperforms the established baseline for private feature selection on many datasets.
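The flavor of correlation-based private selection can be sketched with the Laplace mechanism (a toy illustration, not the paper's mechanism or privacy accounting; the 2/n sensitivity figure assumes features and labels scaled to [-1, 1]): compute per-feature correlation scores, add calibrated Laplace noise, and keep the top k.

```python
import numpy as np

def private_top_k(X, y, k, epsilon, rng=None):
    """Pick k features by |correlation with y|, privatized with Laplace noise.

    The per-score sensitivity is taken as 2/n for n datapoints with features
    and labels in [-1, 1] (an illustrative assumption).
    """
    rng = rng or np.random.default_rng(0)
    n = len(y)
    corr = X.T @ y / n                      # per-feature correlation scores
    noisy = np.abs(corr) + rng.laplace(0, (2.0 / n) / epsilon, X.shape[1])
    return np.argsort(noisy)[-k:]

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (2000, 20))
y = np.clip(X[:, 3] + 0.5 * X[:, 7] + rng.normal(0, 0.1, 2000), -1, 1)
selected = private_top_k(X, y, k=2, epsilon=1.0)
```

Because the noise scale shrinks as 1/n, the truly correlated features survive the perturbation once the dataset is moderately large.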

[LG-15] Knowledge Sharing and Transfer via Centralized Reward Agent for Multi-Task Reinforcement Learning

链接: https://arxiv.org/abs/2408.10858
作者: Haozhe Ma,Zhengding Luo,Thanh Vinh Vo,Kuankuan Sima,Tze-Yun Leong
关键词-EN: auxiliary informative rewards, providing immediate feedback, feedback through auxiliary, auxiliary informative, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reward shaping is effective in addressing the sparse-reward challenge in reinforcement learning by providing immediate feedback through auxiliary informative rewards. Based on the reward shaping strategy, we propose a novel multi-task reinforcement learning framework that integrates a centralized reward agent (CRA) and multiple distributed policy agents. The CRA functions as a knowledge pool, which aims to distill knowledge from various tasks and distribute it to individual policy agents to improve learning efficiency. Specifically, the shaped rewards serve as a straightforward metric to encode knowledge. This framework not only enhances knowledge sharing across established tasks but also adapts to new tasks by transferring valuable reward signals. We validate the proposed method on both discrete and continuous domains, demonstrating its robustness in multi-task sparse-reward settings and its effective transferability to unseen tasks.
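The reward-shaping backbone can be illustrated with classic potential-based shaping (a textbook form, not the CRA's learned rewards): the distilled knowledge plays the role of a potential function phi, and the shaped reward densifies an otherwise sparse signal without changing the optimal policy.

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).

    Shaping of this form is known to preserve optimal policies; here phi
    stands in for the knowledge distilled by the centralized reward agent.
    """
    return r + gamma * phi(s_next) - phi(s)

# toy 1-D corridor: the sparse environment reward is 0 everywhere before the
# goal state 10, yet every step toward the goal earns a positive shaped reward
phi = lambda s: -abs(10 - s)          # distance-based potential (an assumption)
dense = [shaped_reward(0.0, s, s + 1, phi) for s in range(10)]
```

Every step toward the goal yields a positive shaped reward even though the raw reward is zero, which is exactly the kind of immediate feedback the abstract describes.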

[LG-16] Benchmarking Large Language Models for Math Reasoning Tasks

链接: https://arxiv.org/abs/2408.10839
作者: Kathrin Seßler,Yao Rong,Emek Gözlüklü,Enkelejda Kasneci
关键词-EN: Large Language Models, Large Language, enabling potential practical, mathematical problem solving, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it complicated to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning independently of the concrete prompting strategy, while for smaller models the in-context learning approach significantly influences the performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.

[LG-17] Multilevel CNNs for Parametric PDEs based on Adaptive Finite Elements

链接: https://arxiv.org/abs/2408.10838
作者: Janina Enrica Schütte,Martin Eigel
关键词-EN: partial differential equations, high-dimensional parameter-dependent partial, parameter-dependent partial differential, low-rank tensor regression, neural network
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:A neural network architecture is presented that exploits the multilevel properties of high-dimensional parameter-dependent partial differential equations, enabling an efficient approximation of parameter-to-solution maps, rivaling best-in-class methods such as low-rank tensor regression in terms of accuracy and complexity. The neural network is trained with data on adaptively refined finite element meshes, thus reducing data complexity significantly. Error control is achieved by using a reliable finite element a posteriori error estimator, which is also provided as input to the neural network. The proposed U-Net architecture with CNN layers mimics a classical finite element multigrid algorithm. It can be shown that the CNN efficiently approximates all operations required by the solver, including the evaluation of the residual-based error estimator. In the CNN, a culling mask set-up according to the local corrections due to refinement on each mesh level reduces the overall complexity, allowing the network optimization with localized fine-scale finite element data. A complete convergence and complexity analysis is carried out for the adaptive multilevel scheme, which differs in several aspects from previous non-adaptive multilevel CNNs. Moreover, numerical experiments with common benchmark problems from Uncertainty Quantification illustrate the practical performance of the architecture.

[LG-18] Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting

链接: https://arxiv.org/abs/2408.10822
作者: Jianxiang Zhou,Erdong Liu,Wei Chen,Siru Zhong,Yuxuan Liang
关键词-EN: crucial research area, smart cities, forecasting has emerged, crucial research, research area
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic forecasting has emerged as a crucial research area in the development of smart cities. Although various neural networks with intricate architectures have been developed to address this problem, they still face two key challenges: i) Recent advancements in network designs for modeling spatio-temporal correlations are starting to see diminishing returns in performance enhancements. ii) Additionally, most models do not account for the spatio-temporal heterogeneity inherent in traffic data, i.e., traffic distribution varies significantly across different regions and traffic flow patterns fluctuate across various time slots. To tackle these challenges, we introduce the Spatio-Temporal Graph Transformer (STGormer), which effectively integrates attribute and structure information inherent in traffic data for learning spatio-temporal correlations, and a mixture-of-experts module for capturing heterogeneity along spatial and temporal axes. Specifically, we design two straightforward yet effective spatial encoding methods based on the graph structure and integrate time position encoding into the vanilla transformer to capture spatio-temporal traffic patterns. Additionally, a mixture-of-experts enhanced feedforward neural network (FNN) module adaptively assigns suitable expert layers to distinct patterns via a spatio-temporal gating network, further improving overall prediction accuracy. Experiments on five real-world datasets demonstrate that STGormer achieves state-of-the-art performance.
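The mixture-of-experts feedforward block can be sketched in a few lines (dense soft gating over two toy experts; the actual model uses a spatio-temporal gating network over learned expert FNNs, so everything below is an illustrative assumption):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_ffn(h, W_gate, experts):
    """Mixture-of-experts feedforward: a gating network weights expert outputs.

    h: (n_tokens, d) hidden states; W_gate: (d, n_experts) gating weights;
    experts: list of callables mapping (n_tokens, d) -> (n_tokens, d).
    """
    gate = softmax(h @ W_gate)                        # (n_tokens, n_experts)
    out = np.stack([e(h) for e in experts], axis=-1)  # (n_tokens, d, n_experts)
    return np.einsum('tde,te->td', out, gate)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
W_gate = rng.normal(size=(8, 2))
experts = [lambda x: np.tanh(x), lambda x: np.maximum(x, 0)]
y = moe_ffn(h, W_gate, experts)
```

Each token gets its own convex combination of expert outputs, so different traffic patterns can be routed to different specialized sub-networks.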

[LG-19] Learning Randomized Algorithms with Transformers

链接: https://arxiv.org/abs/2408.10818
作者: Johannes von Oswald,Seijin Kobayashi,Yassir Akram,Angelika Steger
关键词-EN: powerful tool, tool that endows, randomized algorithms, algorithms, endows algorithms
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks’ computation and predictions.
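The amplification property mentioned above is easy to verify numerically (a Monte Carlo sketch independent of the transformer experiments): a randomized algorithm that succeeds with probability 0.6 on a single run succeeds far more often under a 9-way majority vote.

```python
import numpy as np

def amplify(success_p, n_repeats, trials=20000, rng=None):
    """Estimate the success probability of majority voting over independent
    repeats of a randomized algorithm that succeeds with probability success_p."""
    rng = rng or np.random.default_rng(0)
    wins = rng.random((trials, n_repeats)) < success_p
    return (wins.sum(axis=1) > n_repeats / 2).mean()

p1 = amplify(0.6, 1)    # single run of the randomized algorithm
p9 = amplify(0.6, 9)    # majority vote over 9 independent runs (~0.73 in theory)
```

The exact value for the 9-way vote is P(Bin(9, 0.6) >= 5) ≈ 0.733, and the gain keeps compounding with more repeats.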

[LG-20] DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

链接: https://arxiv.org/abs/2408.10807
作者: Yin-Jyun Luo,Kin Wai Cheuk,Woosung Choi,Toshimitsu Uesaka,Keisuke Toyama,Koichi Saito,Chieh-Hsin Lai,Yuhta Takida,Wei-Hsiang Liao,Simon Dixon,Yuki Mitsufuji
关键词-EN: single-instrument music audio, Existing work, pitch and timbre, music audio, excluding the cases
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and realistic four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.

[LG-21] Inverse Deep Learning Ray Tracing for Heliostat Surface Prediction

链接: https://arxiv.org/abs/2408.10802
作者: Jan Lewen,Max Pargmann,Mehdi Cherti,Jenia Jitsev,Robert Pitz-Paal,Daniel Maldonado Quinto
关键词-EN: Concentrating Solar Power, Concentrating Solar, Solar Power, flux density, CSP plant operations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Concentrating Solar Power (CSP) plants play a crucial role in the global transition towards sustainable energy. A key factor in ensuring the safe and efficient operation of CSP plants is the distribution of concentrated flux density on the receiver. However, the non-ideal flux density generated by individual heliostats can undermine the safety and efficiency of the power plant. The flux density from each heliostat is influenced by its precise surface profile, which includes factors such as canting and mirror errors. Accurately measuring these surface profiles for a large number of heliostats in operation is a formidable challenge. Consequently, control systems often rely on the assumption of ideal surface conditions, which compromises both safety and operational efficiency. In this study, we introduce inverse Deep Learning Ray Tracing (iDLR), an innovative method designed to predict heliostat surfaces based solely on target images obtained during heliostat calibration. Our simulation-based investigation demonstrates that sufficient information regarding the heliostat surface is retained in the flux density distribution of a single heliostat, enabling deep learning models to accurately predict the underlying surface with deflectometry-like precision for the majority of heliostats. Additionally, we assess the limitations of this method, particularly in relation to surface accuracy and resultant flux density predictions. Furthermore, we present a new comprehensive heliostat model using Non-Uniform Rational B-Splines (NURBS) that has the potential to become the new state of the art for heliostat surface parameterization. Our findings reveal that iDLR has significant potential to enhance CSP plant operations, potentially increasing the overall efficiency and energy output of the power plants.

[LG-22] Universal Novelty Detection Through Adaptive Contrastive Learning

链接: https://arxiv.org/abs/2408.10798
作者: Hossein Mirzaei,Mojtaba Nafez,Mohammad Jafari,Mohammad Bagher Soltani,Mohammad Azizmalayeri,Jafar Habibi,Mohammad Sabokrou,Mohammad Hossein Rohban
关键词-EN: Novelty detection, deploying machine learning, open world, critical task, task for deploying
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures, conference

点击查看摘要

Abstract:Novelty detection is a critical task for deploying machine learning models in the open world. A crucial property of novelty detection methods is universality, which can be interpreted as generalization across various distributions of training or test data. More precisely, for novelty detection, distribution shifts may occur in the training set or the test set. Shifts in the training set refer to cases where we train a novelty detector on a new dataset and expect strong transferability. Conversely, distribution shifts in the test set indicate the methods' performance when the trained model encounters a shifted test sample. We experimentally show that existing methods falter in maintaining universality, which stems from their rigid inductive biases. Motivated by this, we aim for more generalized techniques that have more adaptable inductive biases. In this context, we leverage the fact that contrastive learning provides an efficient framework to easily switch and adapt to new inductive biases through the proper choice of augmentations in forming the negative pairs. We propose a novel probabilistic auto-negative pair generation method, AutoAugOOD, along with contrastive learning, to yield a universal novelty detection method. Our experiments demonstrate the superiority of our method under different distribution shifts in various image benchmark datasets. Notably, our method exhibits universality in terms of its adaptability to different setups of novelty detection, including one-class, unlabeled multi-class, and labeled multi-class settings. Code: this https URL
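The role of negative-pair choice in contrastive learning can be made concrete with an InfoNCE-style loss (a generic sketch, not AutoAugOOD itself; the temperature and toy embeddings are assumptions): the loss is small when the anchor is close to its positive and far from its negatives, and which augmented views are treated as negatives is exactly where the inductive bias enters.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: pull the positive close, push negatives away.

    Embeddings are L2-normalized; negatives is (n, d).
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate(([a @ p], n @ a)) / tau
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=16)
close = a + 0.01 * rng.normal(size=16)          # a faithful augmented view
far = rng.normal(size=(8, 16))                  # unrelated samples
low = info_nce(a, close, far)                   # well-chosen pairs -> small loss
high = info_nce(a, far[0], np.vstack([close, far[1:]]))  # swapped pairs -> large loss
```

Swapping which view counts as the positive flips the loss landscape, which is why choosing the augmentations that form negative pairs lets the method adapt its inductive bias.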

[LG-23] LightMDETR: A Lightweight Approach for Low-Cost Open-Vocabulary Object Detection Training

链接: https://arxiv.org/abs/2408.10787
作者: Binta Sow,Bilal Faye,Hanane Azzag,Mustapha Lebbah
关键词-EN: computer vision traditionally, vision traditionally involves, traditionally involves identifying, involves identifying objects, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Object detection in computer vision traditionally involves identifying objects in images. By integrating textual descriptions, we enhance this process, providing better context and accuracy. The MDETR model significantly advances this by combining image and text data for more versatile object detection and classification. However, MDETR’s complexity and high computational demands hinder its practical use. In this paper, we introduce Lightweight MDETR (LightMDETR), an optimized MDETR variant designed for improved computational efficiency while maintaining robust multimodal capabilities. Our approach involves freezing the MDETR backbone and training a sole component, the Deep Fusion Encoder (DFE), to represent image and text modalities. A learnable context vector enables the DFE to switch between these modalities. Evaluation on datasets like RefCOCO, RefCOCO+, and RefCOCOg demonstrates that LightMDETR achieves superior precision and accuracy.

[LG-24] Generative AI in Industrial Machine Vision – A Review

Link: https://arxiv.org/abs/2408.10775
Authors: Hans Aoyang Zhou,Dominik Wolfschläger,Constantinos Florides,Jonas Werheid,Hannes Behnen,Jan-Henrick Woltersmann,Tiago C. Pinto,Marco Kemmerling,Anas Abdelrazeq,Robert H. Schmitt
Keywords-EN: vision enhances automation, Machine vision, industrial machine vision, gls, generative
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments: 44 pages, 7 figures, This work has been submitted to the Journal of Intelligent Manufacturing

Click to view abstract

Abstract:Machine vision enhances automation, quality control, and operational efficiency in industrial applications by enabling machines to interpret and act on visual data. While traditional computer vision algorithms and approaches remain widely utilized, machine learning has become pivotal in current research activities. In particular, generative AI demonstrates promising potential by improving pattern recognition capabilities through data augmentation, increasing image resolution, and identifying anomalies for quality control. However, the application of generative AI in machine vision is still in its early stages due to challenges in data diversity, computational requirements, and the necessity for robust validation methods. A comprehensive literature review is essential to understand the current state of generative AI in industrial machine vision, focusing on recent advancements, applications, and research trends. Thus, a literature review based on the PRISMA guidelines was conducted, analyzing over 1,200 papers on generative AI in industrial machine vision. Our findings reveal various patterns in current research, with the primary use of generative AI being data augmentation for machine vision tasks such as classification and object detection. Furthermore, we gather a collection of application challenges together with data requirements to enable a successful application of generative AI in industrial machine vision. This overview aims to provide researchers with insights into the different areas and applications within current research, highlighting significant advancements and identifying opportunities for future work.

[LG-25] Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation

Link: https://arxiv.org/abs/2408.10755
Authors: Md Fahim Sikder,Resmi Ramachandranpillai,Daniel de Leng,Fredrik Heintz
Keywords-EN: crucial topic due, recent wide usage, latent space, Data, fair
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Data fairness is a crucial topic due to the recent wide usage of AI-powered applications. Most real-world data is filled with human or machine biases, and when those data are used to train AI models, there is a chance that the model will reflect the bias in the training data. Existing bias-mitigating generative methods based on GANs and diffusion models need in-processing fairness objectives and fail to consider computational overhead when choosing computationally heavy architectures, which may lead to high computational demands, instability and poor optimization performance. To mitigate this issue, in this work, we present a fair data generation technique based on knowledge distillation, where we use a small architecture to distill the fair representation in the latent space. The idea of fair latent space distillation enables more flexible and stable training of Fair Generative Models (FGMs). We first learn a syntax-agnostic (for any data type) fair representation of the data, followed by distillation in the latent space into a smaller model. After distillation, we use the distilled fair latent space to generate high-fidelity fair synthetic data. While distilling, we employ quality loss (for fair distillation) and utility loss (for data utility) to ensure that the fairness and data utility characteristics remain in the distilled latent space. Our approach shows a 5%, 5% and 10% rise in performance in fairness, synthetic sample quality and data utility, respectively, compared to the state-of-the-art fair generative model.

[LG-26] Security Assessment of Hierarchical Federated Deep Learning

Link: https://arxiv.org/abs/2408.10752
Authors: D Alqattan,R Sun,H Liang,G Nicosia,V Snasel,R Ranjan,V Ojha
Keywords-EN: distributed deep learning, deep learning model, promising distributed deep, Hierarchical federated learning, crucial security concerns
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:Hierarchical federated learning (HFL) is a promising distributed deep learning model training paradigm, but it has crucial security concerns arising from adversarial attacks. This research investigates and assesses the security of HFL using a novel methodology, focusing on its resilience against adversarial attacks at both inference time and training time. Through a series of extensive experiments across diverse datasets and attack scenarios, we uncover that HFL demonstrates robustness against untargeted training-time attacks due to its hierarchical structure. However, targeted attacks, particularly backdoor attacks, exploit this architecture, especially when malicious clients are positioned in the overlapping coverage areas of edge servers. Consequently, HFL shows a dual nature in its resilience: it can recover from attacks thanks to its hierarchical aggregation, which strengthens its suitability for adversarial training and thereby reinforces its resistance against inference-time attacks. These insights underscore the necessity for balanced security strategies in HFL systems, leveraging their inherent strengths while effectively mitigating vulnerabilities.
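
The two-level aggregation structure the abstract refers to (clients aggregated at edge servers, edge models aggregated at the cloud) can be sketched as follows. This is a minimal toy illustration of hierarchical FedAvg with flat weight lists, not the paper's code.

```python
# Minimal sketch of two-level (hierarchical) federated averaging:
# clients -> edge servers -> cloud. Model "weights" are flat lists of floats.

def average(models, sizes):
    """Weighted average of flat weight vectors by local dataset size."""
    total = sum(sizes)
    dim = len(models[0])
    return [sum(w[i] * n for w, n in zip(models, sizes)) / total
            for i in range(dim)]

def hierarchical_aggregate(edge_groups):
    """edge_groups: list of edges, each a list of (client_weights, n_samples)."""
    edge_models, edge_sizes = [], []
    for group in edge_groups:
        weights = [w for w, _ in group]
        sizes = [n for _, n in group]
        edge_models.append(average(weights, sizes))   # edge-level FedAvg
        edge_sizes.append(sum(sizes))
    return average(edge_models, edge_sizes)           # cloud-level FedAvg

# Two edge servers, each aggregating two clients.
edges = [
    [([1.0, 2.0], 10), ([3.0, 4.0], 10)],
    [([5.0, 6.0], 20), ([7.0, 8.0], 20)],
]
global_model = hierarchical_aggregate(edges)
```

A backdoor placed in the overlapping coverage of several edge servers survives the edge-level averages and therefore the cloud-level one too, which is the vulnerability the paper probes.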

[LG-27] Pluto and Charon: A Time and Memory Efficient Collaborative Edge AI Framework for Personal LLMs Fine-Tuning

Link: https://arxiv.org/abs/2408.10746
Authors: Bei Ouyang,Shengyuan Ye,Liekang Zeng,Tianyi Qian,Jingyi Li,Xu Chen
Keywords-EN: Large language models, Large language, personal LLMs fine-tuning, intelligent personal assistants, personal LLMs
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: Accepted by The 53rd International Conference on Parallel Processing (ICPP’24)

Click to view abstract

Abstract:Large language models (LLMs) have unlocked a plethora of powerful applications at the network edge, such as intelligent personal assistants. Data privacy and security concerns have prompted a shift towards edge-based fine-tuning of personal LLMs, away from cloud reliance. However, this raises issues of computational intensity and resource scarcity, hindering training efficiency and feasibility. While current studies investigate parameter-efficient fine-tuning (PEFT) techniques to mitigate resource constraints, our analysis indicates that these techniques are not sufficiently resource-efficient for edge devices. To tackle these challenges, we propose Pluto and Charon (PAC), a time and memory efficient collaborative edge AI framework for personal LLMs fine-tuning. PAC breaks the resource wall of personal LLMs fine-tuning with a sophisticated algorithm-system co-design. (1) Algorithmically, PAC implements a personal LLMs fine-tuning technique that is efficient in terms of parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Additionally, an activation cache mechanism further streamlines the process by negating the necessity for repeated forward passes across multiple epochs. (2) Systematically, PAC leverages edge devices in close proximity, pooling them as a collective resource for in-situ personal LLMs fine-tuning, utilizing a hybrid data and pipeline parallelism to orchestrate distributed training. The use of the activation cache eliminates the need for forward passes through the LLM backbone, enabling exclusive fine-tuning of the Parallel Adapters using data parallelism. Extensive evaluation based on a prototype implementation demonstrates that PAC remarkably outperforms state-of-the-art approaches, achieving up to 8.64x end-to-end speedup and up to 88.16% reduction in memory footprint.
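
The activation-cache idea is easy to see in miniature: with the backbone frozen, each sample's backbone activations are computed once and reused in every later epoch, so only the small adapter runs per step. The backbone and adapter below are toy stand-ins, not the paper's architecture.

```python
# Sketch of the activation-cache mechanism for a frozen backbone.

backbone_calls = 0
activation_cache = {}

def frozen_backbone(x):
    """Stand-in for the expensive, frozen LLM backbone."""
    global backbone_calls
    backbone_calls += 1
    return [v * 2.0 for v in x]

def cached_backbone(sample_id, x):
    # Compute the backbone activations at most once per sample.
    if sample_id not in activation_cache:
        activation_cache[sample_id] = frozen_backbone(x)
    return activation_cache[sample_id]

def adapter(h, scale):
    """Tiny trainable head: the only part updated during fine-tuning."""
    return [v * scale for v in h]

data = {0: [1.0, 2.0], 1: [3.0, 4.0]}
for epoch in range(3):                  # three epochs of "training"...
    for sid, x in data.items():
        outputs = adapter(cached_backbone(sid, x), scale=0.5)

print(backbone_calls)  # 2: one backbone pass per sample, not per epoch
```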

[LG-28] Towards Foundation Models for the Industrial Forecasting of Chemical Kinetics

Link: https://arxiv.org/abs/2408.10720
Authors: Imran Nasim,Joaõ Lucas de Sousa Almeida
Keywords-EN: Scientific Machine Learning, Scientific Machine, Machine Learning, Learning is transforming, modeling chemical reactions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted into the IEEE CAI 2024 Workshop on Scientific Machine Learning and Its Industrial Applications (SMLIA2024)

Click to view abstract

Abstract:Scientific Machine Learning is transforming traditional engineering industries by enhancing the efficiency of existing technologies and accelerating innovation, particularly in modeling chemical reactions. Despite recent advancements, solving stiff chemically reacting problems within computational fluid dynamics remains a significant challenge. In this study, we propose a novel approach utilizing a multi-layer-perceptron mixer architecture (MLP-Mixer) to model the time-series of stiff chemical kinetics. We evaluate this method using the ROBER system, a benchmark model in chemical kinetics, to compare its performance with traditional numerical techniques. This study provides insight into the industrial utility of the recently developed MLP-Mixer architecture to model chemical kinetics and provides motivation for such neural architecture to be used as a base for time-series foundation models.
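
The ROBER benchmark mentioned above is the classical stiff three-species Robertson kinetics system, whose right-hand side (standard rate constants 0.04, 1e4, 3e7) is short enough to write down directly; the MLP-Mixer surrogate itself is not reproduced here.

```python
# Right-hand side of the ROBER (Robertson) stiff kinetics benchmark.

def rober_rhs(y):
    y1, y2, y3 = y
    dy1 = -0.04 * y1 + 1.0e4 * y2 * y3
    dy2 = 0.04 * y1 - 1.0e4 * y2 * y3 - 3.0e7 * y2 ** 2
    dy3 = 3.0e7 * y2 ** 2
    return (dy1, dy2, dy3)

# Mass conservation: the three derivatives always sum to zero.
print(sum(rober_rhs((1.0, 0.0, 0.0))))  # 0.0
```

The huge spread in rate constants (0.04 vs 3e7) is exactly what makes the system stiff, and why it is a standard stress test for both numerical integrators and learned surrogates.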

[LG-29] Accelerated training of deep learning surrogate models for surface displacement and flow with application to MCMC-based history matching of CO2 storage operations

Link: https://arxiv.org/abs/2408.10717
Authors: Yifu Han,Francois P. Hamon,Louis J. Durlofsky
Keywords-EN: Deep learning surrogate, subsurface flow applications, shows great promise, Deep learning, modeling shows great
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Deep learning surrogate modeling shows great promise for subsurface flow applications, but the training demands can be substantial. Here we introduce a new surrogate modeling framework to predict CO2 saturation, pressure and surface displacement for use in the history matching of carbon storage operations. Rather than train using a large number of expensive coupled flow-geomechanics simulation runs, training here involves a large number of inexpensive flow-only simulations combined with a much smaller number of coupled runs. The flow-only runs use an effective rock compressibility, which is shown to provide accurate predictions for saturation and pressure for our system. A recurrent residual U-Net architecture is applied for the saturation and pressure surrogate models, while a new residual U-Net model is introduced to predict surface displacement. The surface displacement surrogate accepts, as inputs, geomodel quantities along with saturation and pressure surrogate predictions. Median relative error for a diverse test set is less than 4% for all variables. The surrogate models are incorporated into a hierarchical Markov chain Monte Carlo history matching workflow. Surrogate error is included using a new treatment involving the full model error covariance matrix. A high degree of prior uncertainty, with geomodels characterized by uncertain geological scenario parameters (metaparameters) and associated realizations, is considered. History matching results for a synthetic true model are generated using in-situ monitoring-well data only, surface displacement data only, and both data types. The enhanced uncertainty reduction achieved with both data types is quantified. Posterior saturation and surface displacement fields are shown to correspond well with the true solution.

[LG-30] Offline Model-Based Reinforcement Learning with Anti-Exploration

Link: https://arxiv.org/abs/2408.10713
Authors: Padmanaba Srinivasan,William Knottenbelt
Keywords-EN: enable faster learning, Model-based reinforcement learning, offline reinforcement learning, reinforcement learning, generate synthetic trajectories
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Model-based reinforcement learning (MBRL) algorithms learn a dynamics model from collected data and apply it to generate synthetic trajectories to enable faster learning. This is an especially promising paradigm in offline reinforcement learning (RL) where data may be limited in quantity, in addition to being deficient in coverage and quality. Practical approaches to offline MBRL usually rely on ensembles of dynamics models to prevent exploitation of any individual model and to extract uncertainty estimates that penalize values in states far from the dataset support. Uncertainty estimates from ensembles can vary greatly in scale, making it challenging to generalize hyperparameters well across even similar tasks. In this paper, we present Morse Model-based offline RL (MoMo), which extends the anti-exploration paradigm found in offline model-free RL to the model-based space. We develop model-free and model-based variants of MoMo and show how the model-free version can be extended to detect and deal with out-of-distribution (OOD) states using explicit uncertainty estimation without the need for large ensembles. MoMo performs offline MBRL using an anti-exploration bonus to counteract value overestimation in combination with a policy constraint, as well as a truncation function to terminate synthetic rollouts that are excessively OOD. Experimentally, we find that both model-free and model-based MoMo perform well, and the latter outperforms prior model-based and model-free baselines on the majority of D4RL datasets tested.
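
The combination the abstract describes, an anti-exploration penalty on rewards plus a truncation function that cuts off excessively out-of-distribution rollouts, can be sketched in a few lines. The dynamics, the uncertainty score, and all constants below are toy stand-ins, not MoMo's actual components.

```python
# Sketch of a penalized, truncated synthetic rollout.

def uncertainty(state):
    return abs(state)  # toy OOD score: distance from the "dataset support"

def model_step(state, action):
    return state + action, 1.0  # toy learned dynamics and reward

def rollout(start, policy, horizon, threshold, penalty_coef):
    state, transitions = start, []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model_step(state, action)
        u = uncertainty(next_state)
        if u > threshold:              # truncate excessively-OOD rollouts
            break
        reward -= penalty_coef * u     # anti-exploration penalty
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

traj = rollout(0.0, policy=lambda s: 0.6, horizon=10,
               threshold=2.0, penalty_coef=0.5)
print(len(traj))  # 3: stopped well before the 10-step horizon
```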

[LG-31] Variable Assignment Invariant Neural Networks for Learning Logic Programs

Link: https://arxiv.org/abs/2408.10709
Authors: Yin Jun Phua,Katsumi Inoue
Keywords-EN: observed state transitions, observed state, interpretation transition, state transitions, learning rules
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Learning from interpretation transition (LFIT) is a framework for learning rules from observed state transitions. LFIT has been implemented in purely symbolic algorithms, but they are unable to deal with noise or generalize to unobserved transitions. Rule extraction based neural network methods suffer from overfitting, while more general implementations that categorize rules suffer from combinatorial explosion. In this paper, we introduce a technique to leverage variable permutation invariance inherent in symbolic domains. Our technique ensures that the permutation and the naming of the variables would not affect the results. We demonstrate the effectiveness and the scalability of this method with various experiments. Our code is publicly available at this https URL
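
The invariance property itself is simple to state in code: a representation is variable-permutation invariant if reordering or renaming the variables of a state transition leaves it unchanged. The sorted multiset below is one trivially invariant canonical form, a toy stand-in for the paper's neural technique.

```python
from itertools import permutations

def invariant_repr(transition):
    """transition: list of (value_before, value_after), one per variable.
    Sorting makes the representation independent of variable order/names."""
    return tuple(sorted(transition))

t = [(0, 1), (1, 0), (1, 1)]
base = invariant_repr(t)

# Every permutation of the variables maps to the same representation.
assert all(invariant_repr([t[i] for i in p]) == base
           for p in permutations(range(len(t))))
print(base)  # ((0, 1), (1, 0), (1, 1))
```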

[LG-32] AnyGraph: Graph Foundation Model in the Wild

Link: https://arxiv.org/abs/2408.10700
Authors: Lianghao Xia,Chao Huang
Keywords-EN: exceptional generalization capabilities, relational data structured, graph, generalization capabilities, growing ubiquity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The growing ubiquity of relational data structured as graphs has underscored the need for graph learning models with exceptional generalization capabilities. However, current approaches often struggle to effectively extract generalizable insights, frequently requiring extensive fine-tuning and limiting their versatility. Graph foundation models offer a transformative solution, with the potential to learn robust, generalizable representations from graph data. This enables more effective and adaptable applications across a wide spectrum of tasks and domains. In this work, we investigate a unified graph model, AnyGraph, designed to handle key challenges: i) Structure Heterogeneity. Addressing distribution shift in graph structural information; ii) Feature Heterogeneity. Handling diverse feature representation spaces across graph datasets; iii) Fast Adaptation. Efficiently adapting the model to new graph domains; iv) Scaling Law Emergence. Enabling the model to exhibit scaling law behavior, where its performance scales favorably with the amount of data and parameter sizes. To tackle these critical challenges, we build the AnyGraph upon a Graph Mixture-of-Experts (MoE) architecture. This approach empowers the model to effectively manage both the in-domain and cross-domain distribution shift concerning structure-level and feature-level heterogeneity. Furthermore, a lightweight graph expert routing mechanism is proposed to facilitate AnyGraph’s fast adaptability to new data and domains. Our extensive experiments on 38 diverse graph datasets have demonstrated the strong zero-shot learning performance of AnyGraph across diverse graph domains with significant distribution shift. Furthermore, we have validated the model’s fast adaptation ability and scaling law emergence, showcasing its versatility.

[LG-33] Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Link: https://arxiv.org/abs/2408.10682
Authors: Hongbang Yuan,Zhuoran Jin,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Keywords-EN: unlearned knowledge, unlearned, training corpora, achieved success, troubled by problematic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 13 pages

Click to view abstract

Abstract:LLMs have achieved success in many fields but are still troubled by problematic content in their training corpora. LLM unlearning aims at reducing its influence and avoiding undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries, and the unlearned knowledge resurfaces after manually designed attack queries. As part of a red-team effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in 55.2% of the questions, even without revealing the unlearned model’s parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where previously trained perturbation vectors are used to enhance the unlearned model’s robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over 53.5%, cause less than an 11.6% reduction in neighboring knowledge, and have almost no impact on the model’s general capabilities.

[LG-34] HMoE: Heterogeneous Mixture of Experts for Language Modeling

Link: https://arxiv.org/abs/2408.10681
Authors: An Wang,Xingwu Sun,Ruobing Xie,Shuaipeng Li,Jiaqi Zhu,Zhen Yang,Pinxue Zhao,J.N.Han,Zhanhui Kang,Di Wang,Naoaki Okazaki,Cheng-zhong Xu
Keywords-EN: offers remarkable performance, selectively activating subsets, offers remarkable, remarkable performance, selectively activating
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter utilization. In this study, we propose a novel Heterogeneous Mixture of Experts (HMoE), where experts differ in size and thus possess diverse capacities. This heterogeneity allows for more specialized experts to handle varying token complexities more effectively. To address the imbalance in expert activation, we propose a novel training objective that encourages the frequent activation of smaller experts, enhancing computational efficiency and parameter utilization. Extensive experiments demonstrate that HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks. Codes will be released upon acceptance.
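
The core idea, routing tokens of varying complexity to experts of different capacities, can be sketched with a softmax gate over a few unequal experts. The capacity-matching gate and the expert definitions below are illustrative stand-ins, not HMoE's actual design.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Heterogeneous experts: same interface, different "sizes" (capacities).
experts = [
    {"name": "small",  "capacity": 1.0, "fn": lambda x: 1.0 * x},
    {"name": "medium", "capacity": 2.0, "fn": lambda x: 2.0 * x},
    {"name": "large",  "capacity": 4.0, "fn": lambda x: 4.0 * x},
]

def route(complexity):
    # Toy gate: prefer the expert whose capacity matches token complexity.
    gates = softmax([-abs(complexity - e["capacity"]) for e in experts])
    k = max(range(len(experts)), key=lambda i: gates[i])
    return experts[k]

easy, hard = route(1.0), route(4.0)
print(easy["name"], hard["name"])  # small large
```

A training objective that rewards activating the small expert when it suffices is what keeps the average activated parameter count low, which is the efficiency claim in the abstract.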

[LG-35] Representation Norm Amplification for Out-of-Distribution Detection in Long-Tail Learning

Link: https://arxiv.org/abs/2408.10676
Authors: Dong Geun Shin,Hye Won Chung
Keywords-EN: reliable machine learning, OOD detection, OOD, critical task, task for reliable
Categories: Machine Learning (cs.LG)
*Comments: 30 pages, 8 figures, 17 tables

Click to view abstract

Abstract:Detecting out-of-distribution (OOD) samples is a critical task for reliable machine learning. However, it becomes particularly challenging when the models are trained on long-tailed datasets, as the models often struggle to distinguish tail-class in-distribution samples from OOD samples. We examine the main challenges in this problem by identifying the trade-offs between OOD detection and in-distribution (ID) classification, faced by existing methods. We then introduce our method, called \textitRepresentation Norm Amplification (RNA), which solves this challenge by decoupling the two problems. The main idea is to use the norm of the representation as a new dimension for OOD detection, and to develop a training method that generates a noticeable discrepancy in the representation norm between ID and OOD data, while not perturbing the feature learning for ID classification. Our experiments show that RNA achieves superior performance in both OOD detection and classification compared to the state-of-the-art methods, by 1.70% and 9.46% in FPR95 and 2.43% and 6.87% in classification accuracy on CIFAR10-LT and ImageNet-LT, respectively. The code for this work is available at this https URL.
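
The scoring rule at the heart of RNA, using the norm of the representation as the OOD dimension, reduces to a one-line test once training has amplified ID norms. The representations and the threshold below are made-up illustrations, not values from the paper.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def is_in_distribution(representation, threshold):
    # ID samples are trained to have large representation norms;
    # OOD samples tend to keep small norms.
    return l2_norm(representation) >= threshold

id_repr  = [3.0, 4.0]   # norm 5.0 -> amplified by training
ood_repr = [0.3, 0.4]   # norm 0.5 -> stays small
print(is_in_distribution(id_repr, threshold=1.0),
      is_in_distribution(ood_repr, threshold=1.0))  # True False
```

Because the score lives on the norm while the classifier reads the direction of the representation, the two tasks are decoupled, which is how the method avoids the trade-off discussed above.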

[LG-36] Neural Exploratory Landscape Analysis

Link: https://arxiv.org/abs/2408.10672
Authors: Zeyuan Ma,Jiacheng Chen,Hongshu Guo,Yue-Jiao Gong
Keywords-EN: complex problem distributions, Exploratory Landscape Analysis, Recent research, Neural Exploratory Landscape, meta-trained neural networks
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments:

Click to view abstract

Abstract:Recent research in Meta-Black-Box Optimization (MetaBBO) has shown that meta-trained neural networks can effectively guide the design of black-box optimizers, significantly reducing the need for expert tuning and delivering robust performance across complex problem distributions. Despite their success, a paradox remains: MetaBBO still relies on human-crafted Exploratory Landscape Analysis features to inform the meta-level agent about the low-level optimization progress. To address this gap, this paper proposes Neural Exploratory Landscape Analysis (NeurELA), a novel framework that dynamically profiles landscape features through a two-stage, attention-based neural network, executed in an entirely end-to-end fashion. NeurELA is pre-trained over a variety of MetaBBO algorithms using a multi-task neuroevolution strategy. Extensive experiments show that NeurELA achieves consistently superior performance when integrated into different and even unseen MetaBBO tasks and can be efficiently fine-tuned for further performance boost. This advancement marks a pivotal step in making MetaBBO algorithms more autonomous and broadly applicable.

[LG-37] Tensor tree learns hidden relational structures in data to construct generative models

Link: https://arxiv.org/abs/2408.10669
Authors: Kenji Harada,Tsuyoshi Okubo,Naoki Kawashima
Keywords-EN: Born machine framework, quantum wave function, wave function amplitude, function amplitude represented, target distribution function
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*Comments: 9 pages, 3 figures

Click to view abstract

Abstract:Based on the tensor tree network with the Born machine framework, we propose a general method for constructing a generative model by expressing the target distribution function as the quantum wave function amplitude represented by a tensor tree. The key idea is dynamically optimizing the tree structure that minimizes the bond mutual information. The proposed method offers enhanced performance and uncovers hidden relational structures in the target data. We illustrate potential practical applications with four examples: (i) random patterns, (ii) QMNIST hand-written digits, (iii) Bayesian networks, and (iv) the stock price fluctuation pattern in the S&P 500. In (i) and (ii), strongly correlated variables were concentrated near the center of the network; in (iii), the causality pattern was identified; and, in (iv), a structure corresponding to the eleven sectors emerged.

[LG-38] Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions

Link: https://arxiv.org/abs/2408.10664
Authors: Mirko Nardi,Lorenzo Valerio,Andrea Passarella
Keywords-EN: decentralized machine learning, direct data sharing, data, unsupervised federated learning, Federated Learning
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Federated Learning (FL) is a pivotal approach in decentralized machine learning, especially when data privacy is crucial and direct data sharing is impractical. While FL is typically associated with supervised learning, its potential in unsupervised scenarios is underexplored. This paper introduces a novel unsupervised federated learning methodology designed to identify the complete set of categories (global K) across multiple clients within label-free, non-uniform data distributions, a process known as Federated Clustering. Our approach, Federated Cluster-Wise Refinement (FedCRef), involves clients that collaboratively train models on clusters with similar data distributions. Initially, clients with diverse local data distributions (local K) train models on their clusters to generate compressed data representations. These local models are then shared across the network, enabling clients to compare them through reconstruction error analysis, leading to the formation of federated groups. Within these groups, clients collaboratively train a shared model representing each data distribution, while continuously refining their local clusters to enhance data association accuracy. This iterative process allows our system to identify all potential data distributions across the network and develop robust representation models for each. To validate our approach, we compare it with traditional centralized methods, establishing a performance baseline and showcasing the advantages of our distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, demonstrating FedCRef’s ability to refine and align cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.
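
The reconstruction-error grouping step can be sketched as follows: each client scores every client's model on its own local data and joins a group wherever the error is below a threshold. The "model" here is just a stored mean that reconstructs every point as that mean, a toy stand-in for the compressed representations in the abstract.

```python
# Toy sketch of grouping clients by cross-client reconstruction error.

def reconstruction_error(model_mean, data):
    # Toy "autoencoder": reconstructs every point as the stored mean.
    return sum((x - model_mean) ** 2 for x in data) / len(data)

def form_groups(clients, threshold):
    """clients: dict name -> (model_mean, local_data)."""
    groups = []
    for _, (mean, _) in clients.items():
        members = sorted(other for other, (_, data) in clients.items()
                         if reconstruction_error(mean, data) < threshold)
        if members not in groups:
            groups.append(members)
    return groups

clients = {
    "a": (1.05, [1.0, 1.1]),   # distribution around 1
    "b": (0.95, [0.9, 1.0]),   # distribution around 1
    "c": (5.10, [5.0, 5.2]),   # a different distribution
}
groups = form_groups(clients, threshold=0.5)
print(groups)  # [['a', 'b'], ['c']]
```

Clients "a" and "b", whose models reconstruct each other's data well, end up in one federated group; "c" forms its own.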

[LG-39] Inferring Underwater Topography with FINN

Link: https://arxiv.org/abs/2408.10649
Authors: Coşku Can Horuz,Matthias Karlbauer,Timothy Praditia,Sergey Oladyshkin,Wolfgang Nowak,Sebastian Otte
Keywords-EN: find extensive application, Spatiotemporal partial differential, partial differential equations, find extensive, engineering fields
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*Comments:

Click to view abstract

Abstract:Spatiotemporal partial differential equations (PDEs) find extensive application across various scientific and engineering fields. While numerous models have emerged from both physics and machine learning (ML) communities, there is a growing trend towards integrating these approaches to develop hybrid architectures known as physics-aware machine learning models. Among these, the finite volume neural network (FINN) has emerged as a recent addition. FINN has proven to be particularly efficient in uncovering latent structures in data. In this study, we explore the capabilities of FINN in tackling the shallow-water equations, which simulate wave dynamics in coastal regions. Specifically, we investigate FINN’s efficacy in reconstructing underwater topography based on these particular wave equations. Our findings reveal that FINN exhibits a remarkable capacity to infer topography solely from wave dynamics, distinguishing itself from both conventional ML and physics-aware ML models. Our results underscore the potential of FINN in advancing our understanding of spatiotemporal phenomena and enhancing parametrization capabilities in related domains.

[LG-40] Privacy-preserving Universal Adversarial Defense for Black-box Models

Link: https://arxiv.org/abs/2408.10647
Authors: Qiao Li,Cong Wu,Jing Chen,Zijun Zhang,Kun He,Ruiying Du,Xinxin Wang,Qingchuang Zhao,Yang Liu
Keywords-EN: Deep neural networks, Deep neural, neural networks, autonomous driving, critical applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments: 12 pages, 9 figures

Click to view abstract

Abstract:Deep neural networks (DNNs) are increasingly used in critical applications such as identity authentication and autonomous driving, where robustness against adversarial attacks is crucial. These attacks can exploit minor perturbations to cause significant prediction errors, making it essential to enhance the resilience of DNNs. Traditional defense methods often rely on access to detailed model information, which raises privacy concerns, as model owners may be reluctant to share such data. In contrast, existing black-box defense methods fail to offer a universal defense against various types of adversarial attacks. To address these challenges, we introduce DUCD, a universal black-box defense method that does not require access to the target model’s parameters or architecture. Our approach involves distilling the target model by querying it with data, creating a white-box surrogate while preserving data privacy. We further enhance this surrogate model using a certified defense based on randomized smoothing and optimized noise selection, enabling robust defense against a broad range of adversarial attacks. Comparative evaluations between the certified defenses of the surrogate and target models demonstrate the effectiveness of our approach. Experiments on multiple image classification datasets show that DUCD not only outperforms existing black-box defenses but also matches the accuracy of white-box defenses, all while enhancing data privacy and reducing the success rate of membership inference attacks.
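
Randomized smoothing, the certified-defense primitive the abstract applies to the distilled surrogate, amounts to classifying many noisy copies of an input and returning the majority vote. The one-dimensional base classifier and noise level below are toy stand-ins.

```python
import random

def base_classifier(x):
    return 1 if x > 0.0 else 0  # toy 1-D classifier

def smoothed_classify(x, sigma=0.5, n=1000, seed=0):
    """Majority vote over n Gaussian-noised copies of the input."""
    rng = random.Random(seed)
    votes = [0, 0]
    for _ in range(n):
        votes[base_classifier(x + rng.gauss(0.0, sigma))] += 1
    return max(range(2), key=lambda c: votes[c])

# Small perturbations cannot easily flip the smoothed prediction of a
# clearly positive (or clearly negative) input.
print(smoothed_classify(1.0), smoothed_classify(-1.0))  # 1 0
```

In the certified setting, the margin between the top two vote counts translates into a provable robustness radius around the input; the paper additionally optimizes the noise selection.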

[LG-41] CoRA: Collaborative Information Perception by Large Language Models Weights for Recommendation

Link: https://arxiv.org/abs/2408.10645
Authors: Yuting Liu,Jinghao Zhang,Yizhou Dang,Yuliang Liang,Qiang Liu,Guibing Guo,Jianzhe Zhao,Xingwei Wang
Keywords-EN: Large Language Models, Large Language, Involving collaborative information, collaborative, LLM
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Involving collaborative information in Large Language Models (LLMs) is a promising technique for adapting LLMs for recommendation. Existing methods achieve this by concatenating collaborative features with text tokens into a unified sequence input and then fine-tuning to align these features with LLM’s input space. Although effective, in this work, we identify two limitations when adapting LLMs to recommendation tasks, which hinder the integration of general knowledge and collaborative information, resulting in sub-optimal recommendation performance. (1) Fine-tuning LLM with recommendation data can undermine its inherent world knowledge and fundamental competencies, which are crucial for interpreting and inferring recommendation text. (2) Incorporating collaborative features into textual prompts disrupts the semantics of the original prompts, preventing LLM from generating appropriate outputs. In this paper, we propose a new paradigm, CoRA (an acronym for Collaborative LoRA), with a collaborative weights generator. Rather than input space alignment, this method aligns collaborative information with LLM’s parameter space, representing them as incremental weights to update LLM’s output. This way, LLM perceives collaborative information without altering its general knowledge and text inference capabilities. Specifically, we employ a collaborative filtering model to extract user and item embeddings, converting them into collaborative weights with low-rank properties through the collaborative weights generator. We then merge the collaborative weights into LLM’s weights, enabling LLM to perceive the collaborative signals and generate personalized recommendations without fine-tuning or extra collaborative tokens in prompts. Extensive experiments confirm that CoRA effectively integrates collaborative information into LLM, enhancing recommendation performance.
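
The parameter-space alignment described above boils down to a LoRA-style merge: collaborative signals are mapped to low-rank factors A and B, and the incremental weight A @ B is added to a frozen matrix, W' = W + A @ B. Plain-list matrices keep the sketch dependency-free; the factors here are made up for illustration.

```python
# Sketch of merging low-rank "collaborative weights" into a frozen matrix.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def merge_collaborative_weights(W, A, B):
    delta = matmul(A, B)                      # rank-r incremental update
    return [[w + d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                  # frozen 2x2 "LLM" weight
A = [[1.0], [2.0]]                            # 2x1 and 1x2 rank-1 factors
B = [[0.5, 0.5]]                              # from collaborative signals
W_merged = merge_collaborative_weights(W, A, B)
print(W_merged)  # [[1.5, 0.5], [1.0, 2.0]]
```

Because the update lives in the weights rather than the prompt, the text input (and hence the LLM's text-inference behavior) is left untouched, which is the point of the CoRA paradigm.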

[LG-42] Interactive Counterfactual Generation for Univariate Time Series KDD ECML-PKDD

链接: https://arxiv.org/abs/2408.10633
作者: Udo Schlegel,Julius Rauscher,Daniel A. Keim
关键词-EN: time series data, time series, decision boundary maps, series data, tackle interpretability challenges
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: 14 pages, 4 figures, accepted at XKDD @ ECML-PKDD

点击查看摘要

Abstract:We propose an interactive methodology for generating counterfactual explanations for univariate time series data in classification tasks by leveraging 2D projections and decision boundary maps to tackle interpretability challenges. Our approach aims to enhance the transparency and understanding of deep learning models’ decision processes. The application simplifies the time series data analysis by enabling users to interactively manipulate projected data points, providing intuitive insights through inverse projection techniques. By abstracting user interactions with the projected data points rather than the raw time series data, our method facilitates an intuitive generation of counterfactual explanations. This approach allows for a more straightforward exploration of univariate time series data, enabling users to manipulate data points to comprehend potential outcomes of hypothetical scenarios. We validate this method using the ECG5000 benchmark dataset, demonstrating significant improvements in interpretability and user understanding of time series classification. The results indicate a promising direction for enhancing explainable AI, with potential applications in various domains requiring transparent and interpretable deep learning models. Future work will explore the scalability of this method to multivariate time series data and its integration with other interpretability techniques.

[LG-43] LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models

链接: https://arxiv.org/abs/2408.10631
作者: Yupeng Su,Ziyi Guan,Xiaoqun Liu,Tianlai Jin,Dongkuan Wu,Graziano Chesi,Ngai Wong,Hao Yu
关键词-EN: Large language models, Large language, significantly in scale, grown significantly, Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have grown significantly in scale, leading to a critical need for efficient model pruning techniques. Existing post-training pruning techniques primarily focus on measuring weight importance on converged dense models to determine salient weights to retain. However, they often overlook the changes in weight importance during the pruning process, which can lead to performance degradation in the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring global performance optimization. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance using weights multiplied by gradients. Our experiments show that LLM-Barber can efficiently prune models like LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at this https URL.
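The pruning metric named in the abstract, weights multiplied by gradients, can be illustrated with a minimal one-shot masking sketch. The block size, sparsity level, and random tensors below are placeholders; LLM-Barber additionally rebuilds masks with block-aware error optimization across Self-Attention and MLP blocks, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))     # one weight block
grad = rng.normal(size=(8, 8))  # gradient of the loss w.r.t. W

sparsity = 0.5  # fraction of weights to remove

# Saliency metric (per the abstract): |weight * gradient|.
score = np.abs(W * grad)

# One-shot mask rebuild: keep the highest-scoring weights within the
# block; no retraining or weight reconstruction.
k = int(W.size * (1 - sparsity))
threshold = np.partition(score.ravel(), -k)[-k]
mask = score >= threshold

W_pruned = W * mask
```

Note that a weight with small magnitude but a large gradient can survive pruning under this metric, which magnitude-only criteria would discard.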

[LG-44] Finding the DeepDream for Time Series: Activation Maximization for Univariate Time Series ECML-PKDD

链接: https://arxiv.org/abs/2408.10628
作者: Udo Schlegel,Daniel A. Keim,Tobias Sutter
关键词-EN: series data remains, interpret time series, time series data, Sequence Dreaming, time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 4 figures, accepted at TempXAI @ ECML-PKDD

点击查看摘要

Abstract:Understanding how models process and interpret time series data remains a significant challenge in deep learning to enable applicability in safety-critical areas such as healthcare. In this paper, we introduce Sequence Dreaming, a technique that adapts Activation Maximization to analyze sequential information, aiming to enhance the interpretability of neural networks operating on univariate time series. By leveraging this method, we visualize the temporal dynamics and patterns most influential in model decision-making processes. To counteract the generation of unrealistic or excessively noisy sequences, we enhance Sequence Dreaming with a range of regularization techniques, including exponential smoothing. This approach ensures the production of sequences that more accurately reflect the critical features identified by the neural network. Our approach is tested on a time series classification dataset encompassing applications in predictive maintenance. The results show that our proposed Sequence Dreaming approach demonstrates targeted activation maximization for different use cases so that either centered class or border activation maximization can be generated. The results underscore the versatility of Sequence Dreaming in uncovering salient temporal features learned by neural networks, thereby advancing model transparency and trustworthiness in decision-critical domains.
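The core loop the abstract describes, activation maximization regularized by exponential smoothing, can be stripped down to a short sketch. The "network" here is a hand-made toy (four sinusoidal filters), not a trained classifier, so all numbers are purely illustrative.

```python
import numpy as np

T = 64
t = np.arange(T)
# Toy stand-in for a trained model: four smooth filters whose summed
# tanh activation plays the role of the target class/neuron score.
W = np.stack([np.sin(2 * np.pi * f * t / T) for f in range(1, 5)])

def class_score(x):
    return float(np.tanh(W @ x).sum())

def grad_score(x):
    h = W @ x
    return W.T @ (1.0 - np.tanh(h) ** 2)  # d(score)/dx

def exponential_smoothing(x, alpha=0.3):
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

x = np.zeros(T)          # start from a blank sequence
start = class_score(x)   # 0.0
for _ in range(200):
    x = x + 0.05 * grad_score(x)   # activation maximization step
    x = exponential_smoothing(x)   # regularize toward a realistic series
```

Without the smoothing step, gradient ascent is free to produce high-frequency noise; with it, the dreamed sequence stays low-pass while the activation still climbs.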

[LG-45] On the Approximability of Stationary Processes using the ARMA Model

链接: https://arxiv.org/abs/2408.10610
作者: Anand Ganesh,Babhrubahan Bose,Anand Rajagopalan
关键词-EN: Autoregressive Moving Average, Moving Average, Autoregressive Moving, stationary random variables, stationary random
类目: Machine Learning (cs.LG); Probability (math.PR); Methodology (stat.ME)
*备注: 10 pages, 3 figures

点击查看摘要

Abstract:We identify certain gaps in the literature on the approximability of stationary random variables using the Autoregressive Moving Average (ARMA) model. To quantify approximability, we propose that an ARMA model be viewed as an approximation of a stationary random variable. We map these stationary random variables to Hardy space functions, and formulate a new function approximation problem that corresponds to random variable approximation, and thus to ARMA. Based on this Hardy space formulation we identify a class of stationary processes where approximation guarantees are feasible. We also identify an idealized stationary random process for which we conjecture that a good ARMA approximation is not possible. Next, we provide a constructive proof that Padé approximations do not always correspond to the best ARMA approximation. Finally, we note that the spectral methods adopted in this paper can be seen as a generalization of unit root methods for stationary processes even when an ARMA model is not defined.
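For context, the ARMA(p, q) model the paper analyzes, in standard notation:

```latex
X_t = \varepsilon_t + \sum_{i=1}^{p} \varphi_i\, X_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j},
\qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2)
```

Roughly speaking, the Hardy-space formulation in the abstract works with the rational function \(\Theta(z)/\Phi(z)\) built from the MA and AR polynomials, so approximating a stationary process by an ARMA model becomes a rational function approximation problem.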

[LG-46] PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis ALT

链接: https://arxiv.org/abs/2408.10609
作者: Yan Wu,Esther Wershof,Sebastian M Schmon,Marcel Nassar,Błażej Osiński,Ridvan Eksi,Kun Zhang,Thore Graepel
关键词-EN: rapidly evolving field, single cells, designed to standardize, evolving field, present a comprehensive
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
*备注: 9 pages plus 19 pages supplementary material. Code is available at this https URL

点击查看摘要

Abstract:We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field. Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis. Extensive evaluations of published and baseline models reveal limitations like mode or posterior collapse, and underscore the importance of rank metrics that assess the ordering of perturbations alongside traditional measures like RMSE. Our findings show that simple models can outperform more complex approaches. This benchmarking exercise sets new standards for model evaluation, supports robust model development, and advances the potential of these models to use high-throughput and high-content genetic and chemical screens for disease target discovery.
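The abstract's point that rank metrics can disagree with traditional measures like RMSE is easy to demonstrate. Below is a minimal Spearman-vs-RMSE comparison (tie handling is omitted for brevity, and PerturBench's actual metric suite is richer; the toy effect values are invented):

```python
import numpy as np

def rankdata(x):
    # Minimal ranking (no tie averaging; fine for distinct values).
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman(a, b):
    ra, rb = rankdata(a) - (len(a) + 1) / 2, rankdata(b) - (len(b) + 1) / 2
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

true_effect = np.array([0.1, 0.5, 0.9, 1.3])
pred_a = true_effect + 2.0           # large offset, perfect ordering
pred_b = true_effect[::-1].copy()    # small error, reversed ordering
```

Here `pred_b` wins on RMSE yet ranks the perturbations exactly backwards, while `pred_a` has a large RMSE but a perfect rank correlation; reporting both kinds of metric avoids being misled by either.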

[LG-47] Multilingual Non-Factoid Question Answering with Silver Answers

链接: https://arxiv.org/abs/2408.10604
作者: Ritwik Mishra,Sreeram Vennam,Rajiv Ratn Shah,Ponnurangam Kumaraguru
关键词-EN: existing Question Answering, short-context Question Answering, Question Answering Datasets, Question Answering, Answering Datasets
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA-pairs from MuNfQuAD (golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD testset and the golden set, respectively. Furthermore, the APS model effectively generalizes to certain languages within the golden set, even after being fine-tuned on silver labels.

[LG-48] SparseGrow: Addressing Growth-Induced Forgetting in Task-Agnostic Continual Learning DATE AAAI

链接: https://arxiv.org/abs/2408.10566
作者: Yuqing Zhao,Divya Saxena,Jiannong Cao,Xiaoyun Liu,Changlin Song
关键词-EN: model growth, model, growth, model growth enhances, improper model growth
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been submitted to the AAAI conference. If accepted, the final version will be updated to reflect the conference proceedings

点击查看摘要

Abstract:In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks. However, improper model growth can lead to severe degradation of previously learned knowledge, an issue we name as growth-induced forgetting (GIFt), especially in task-agnostic CL using entire grown model for inference. Existing works, despite adopting model growth and random initialization for better adaptability, often fail to recognize the presence of GIFt caused by improper model growth. This oversight limits comprehensive control of forgetting and hinders full utilization of model growth. We are the first in CL to identify this issue and conduct an in-depth study on root cause of GIFt, where layer expansion stands out among model growth strategies, widening layers without affecting model functionality. Yet, direct adoption of layer expansion presents challenges. It lacks data-driven control and initialization of expanded parameters to balance adaptability and knowledge retention. This paper presents a novel SparseGrow approach to overcome the issue of GIFt while enhancing adaptability over new data. SparseGrow employs data-driven sparse layer expansion to control efficient parameter usage during growth, reducing GIFt from excessive growth and functionality changes. It also combines sparse growth with on-data initialization at training late-stage to create partially 0-valued expansions that fit learned distribution, enhancing retention and adaptability. To further minimize forgetting, freezing is applied by calculating the sparse mask, allowing data-driven preservation of important parameters. Through experiments across datasets with various settings, cases and task numbers, we demonstrate the necessity of layer expansion and showcase the effectiveness of SparseGrow in overcoming GIFt, highlighting its adaptability and knowledge retention for incremental tasks.
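The "partially 0-valued expansion" idea can be sketched in a few lines: zero-initialized new units leave the grown layer's function unchanged at the moment of growth, and a sparse mask then freezes old parameters during training. This is a deliberate simplification; SparseGrow's expansion is data-driven and its late-stage on-data initialization is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

W_old = rng.normal(size=(8, 4))  # trained layer, 4 input units

# Layer expansion: widen by `grow` input units. New columns start at
# zero, so the grown layer computes exactly what the old one did --
# no growth-induced forgetting at the moment of expansion.
grow = 2
W_new = np.concatenate([W_old, np.zeros((8, grow))], axis=1)

x_old = rng.normal(size=4)
x_new = np.concatenate([x_old, rng.normal(size=grow)])
assert np.allclose(W_new @ x_new, W_old @ x_old)

# Freezing via a mask: a gradient step only touches unfrozen entries
# (here, the newly added columns), preserving important old parameters.
trainable = np.zeros_like(W_new, dtype=bool)
trainable[:, -grow:] = True
grad = rng.normal(size=W_new.shape)
W_new -= 0.1 * grad * trainable  # masked SGD step
```

The two assertions in the sketch capture the two guarantees: functional equivalence right after growth, and untouched old weights after a masked update.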

[LG-49] Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

链接: https://arxiv.org/abs/2408.10556
作者: Yun Qu,Boyuan Wang,Jianzhun Shao,Yuhang Jiang,Chen Chen,Zhenbin Ye,Lin Liu,Junfeng Yang,Lin Lai,Hongyang Qin,Minwen Deng,Juchao Zhuo,Deheng Ye,Qiang Fu,Wei Yang,Guang Yang,Lanxiao Huang,Xiangyang Ji
关键词-EN: Multi-Agent Reinforcement Learning, Offline Multi-Agent Reinforcement, Offline Reinforcement Learning, Reinforcement Learning, represent real-world complexities
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets often fall short in their simplicity and lack of realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the incompetency of current offline RL approaches in handling task complexity, generalization and multi-task learning.

[LG-50] Target-Prompt Online Graph Collaborative Learning for Temporal QoS Prediction

链接: https://arxiv.org/abs/2408.10555
作者: Shengxiang Hu,Guobing Zou,Song Yang,Shiyi Lin,Bofeng Zhang,Yixin Chen
关键词-EN: predicting the Quality, temporal QoS prediction, service-oriented architecture, accurately predicting, vital for maintaining
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:In service-oriented architecture, accurately predicting the Quality of Service (QoS) is vital for maintaining reliability and enhancing user satisfaction. However, current methods often neglect high-order latent collaborative relationships and fail to dynamically adjust feature learning for specific user-service invocations, which are critical for precise feature extraction. Moreover, relying on RNNs to capture QoS evolution limits the ability to detect long-term trends due to challenges in managing long-range dependencies. To address these issues, we propose the Target-Prompt Online Graph Collaborative Learning (TOGCL) framework for temporal QoS prediction. It leverages a dynamic user-service invocation graph to comprehensively model historical interactions. Building on this graph, it develops a target-prompt graph attention network to extract online deep latent features of users and services at each time slice, considering implicit target-neighboring collaborative relationships and historical QoS values. Additionally, a multi-layer Transformer encoder is employed to uncover temporal feature evolution patterns, enhancing temporal QoS prediction. Extensive experiments on the WS-DREAM dataset demonstrate that TOGCL significantly outperforms state-of-the-art methods across multiple metrics, achieving improvements of up to 38.80%. These results underscore the effectiveness of TOGCL for temporal QoS prediction.

[LG-51] Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba

链接: https://arxiv.org/abs/2408.10517
作者: Wall Kim
关键词-EN: Return-Conditioned Transformer Decision, RCTDM required alternative, potential to enhance, offline reinforcement learning, RCTDM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Return-Conditioned Transformer Decision Models (RCTDM) have demonstrated the potential to enhance transformer performance in offline reinforcement learning by replacing rewards in the input sequence with returns-to-go. However, to achieve the goal of learning an optimal policy from offline datasets composed of limited suboptimal trajectories, RCTDM required alternative methods. One prominent approach, trajectory stitching, was designed to enable the network to combine multiple trajectories to find the optimal path. To implement this using only transformers without auxiliary networks, it was necessary to shorten the input sequence length to better capture the Markov property in reinforcement learning. This, however, introduced a trade-off, as it reduced the accuracy of action inference. Our study introduces a model named Decision MetaMamba to resolve these challenges. DMM employs an input token mixer to extract patterns from short sequences and uses a State Space Model (SSM) to selectively combine information from relatively distant sequences. Inspired by Metaformer, this structure was developed by transforming Mamba’s input layer into various multi-modal layers. Fortunately, with the advent of Mamba, implemented using parallel selective scanning, we achieved a high-performance sequence model capable of replacing transformers. Based on these innovations, DMM demonstrated excellent performance across various datasets in offline RL, confirming that models using SSM can improve performance by domain-specific alterations of the input layer. Additionally, it maintained its performance even in lightweight models with fewer parameters. These results suggest that decision models based on SSM can pave the way for improved outcomes in future developments.

[LG-52] Single-cell Curriculum Learning-based Deep Graph Embedding Clustering

链接: https://arxiv.org/abs/2408.10511
作者: Huifa Li,Jie Fu,Xinpeng Ling,Zhiyu Sun,Kuncan Wang,Zhili Chen
关键词-EN: single-cell RNA sequencing, cellular-level tissue heterogeneity, RNA sequencing, technologies enables, tissue heterogeneity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:The swift advancement of single-cell RNA sequencing (scRNA-seq) technologies enables the investigation of cellular-level tissue heterogeneity. Cell annotation significantly contributes to the extensive downstream analysis of scRNA-seq data. However, the analysis of scRNA-seq for biological inference presents challenges owing to its intricate and indeterminate data distribution, characterized by a substantial volume and a high frequency of dropout events. Furthermore, the quality of training samples varies greatly, and the performance of the popular scRNA-seq data clustering solution GNN could be harmed by two types of low-quality training nodes: 1) nodes on the boundary; 2) nodes that contribute little additional information to the graph. To address these problems, we propose a single-cell curriculum learning-based deep graph embedding clustering (scCLG). We first propose a Chebyshev graph convolutional autoencoder with multi-decoder (ChebAE) that combines three optimization objectives corresponding to three decoders, including topology reconstruction loss of cell graphs, zero-inflated negative binomial (ZINB) loss, and clustering loss, to learn cell-cell topology representation. Meanwhile, we employ a selective training strategy to train GNN based on the features and entropy of nodes and prune the difficult nodes based on the difficulty scores to keep the high-quality graph. Empirical results on a variety of gene expression datasets show that our model outperforms state-of-the-art methods.

[LG-53] Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers KDD2024 ECML

链接: https://arxiv.org/abs/2408.10503
作者: Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
关键词-EN: Assessing the forensic, hand images involves, unique features, hand, hand images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the ECML PKDD 2024 (Research Track)

点击查看摘要

Abstract:Assessing the forensic value of hand images involves the use of unique features and patterns present in an individual’s hand. The human hand has distinct characteristics, such as the pattern of veins, fingerprints, and the geometry of the hand itself. This paper investigates the use of vision transformers (ViTs) for classification of hand images. We use explainability tools to explore the internal representations of ViTs and assess their impact on the model outputs. Utilizing the internal understanding of ViTs, we introduce distillation methods that allow a student model to adaptively extract knowledge from a teacher model while learning on data of a different domain to prevent catastrophic forgetting. Two publicly available hand image datasets are used to conduct a series of experiments to evaluate performance of the ViTs and our proposed adaptive distillation methods. The experimental results demonstrate that ViT models significantly outperform traditional machine learning methods and the internal states of ViTs are useful for explaining the model outputs in the classification task. By averting catastrophic forgetting, our distillation methods achieve excellent performance on data from both source and target domains, particularly when these two domains exhibit significant dissimilarity. The proposed approaches therefore can be developed and implemented effectively for real-world applications such as access control, identity verification, and authentication systems.
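The paper's adaptive distillation builds on soft-label knowledge distillation; as background, here is a minimal version of that standard baseline loss (the Hinton-style formulation, with an arbitrary temperature; the paper's adaptive, ViT-internals-aware variant is not reproduced):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soften both distributions with temperature T, then KL(teacher || student).
    The T**2 factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T ** 2)

teacher = np.array([2.0, 0.5, -1.0])
```

The loss is zero when the student matches the teacher exactly and grows as the distributions diverge; combining it with a task loss on new-domain data is what lets the student adapt without catastrophically forgetting the teacher's behavior.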

[LG-54] Clustering by Mining Density Distributions and Splitting Manifold Structure

链接: https://arxiv.org/abs/2408.10493
作者: Zhichang Xu,Zhiguo Long,Hua Meng
关键词-EN: Spectral clustering requires, Spectral clustering, Laplacian matrix, efficient spectral clustering, requires the time-consuming
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spectral clustering requires the time-consuming decomposition of the Laplacian matrix of the similarity graph, thus limiting its applicability to large datasets. To improve the efficiency of spectral clustering, a top-down approach was recently proposed, which first divides the data into several micro-clusters (granular-balls), then splits these micro-clusters when they are not "compact", and finally uses these micro-clusters as nodes to construct a similarity graph for more efficient spectral clustering. However, this top-down approach is challenging to adapt to unevenly distributed or structurally complex data. This is because constructing micro-clusters as a rough ball struggles to capture the shape and structure of data in a local range, and the simplistic splitting rule that solely targets "compactness" is susceptible to noise and variations in data density and leads to micro-clusters with varying shapes, making it challenging to accurately measure the similarity between them. To resolve these issues, this paper first proposes to start from local structures to obtain micro-clusters, such that the complex structural information inside local neighborhoods is well captured by them. Moreover, by noting that Euclidean distance is more suitable for convex sets, this paper further proposes a data splitting rule that couples local density and data manifold structures, so that the similarities of the obtained micro-clusters can be easily characterized. A novel similarity measure between micro-clusters is then proposed for the final spectral clustering. A series of experiments based on synthetic and real-world datasets demonstrate that the proposed method has better adaptability to structurally complex data than granular-ball based methods.

[LG-55] Achieving the Tightest Relaxation of Sigmoids for Formal Verification

链接: https://arxiv.org/abs/2408.10491
作者: Samuel Chevalier,Duncan Starkenburg,Krishnamurthy(Dj)Dvijotham
关键词-EN: Neural Networks, equivalent mathematical programs, sigmoid activation function, activation function, sigmoid activation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of formal verification, Neural Networks (NNs) are typically reformulated into equivalent mathematical programs which are optimized over. To overcome the inherent non-convexity of these reformulations, convex relaxations of nonlinear activation functions are typically utilized. Common relaxations (i.e., static linear cuts) of "S-shaped" activation functions, however, can be overly loose, slowing down the overall verification process. In this paper, we derive tuneable hyperplanes which upper and lower bound the sigmoid activation function. When tuned in the dual space, these affine bounds smoothly rotate around the nonlinear manifold of the sigmoid activation function. This approach, termed α-sig, allows us to tractably incorporate the tightest possible, element-wise convex relaxation of the sigmoid activation function into a formal verification framework. We embed these relaxations inside of large verification tasks and compare their performance to LiRPA and α-CROWN, a state-of-the-art verification duo.
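The tuneable hyperplanes can be pictured as tangent lines to the sigmoid. Since the sigmoid is concave on x ≥ 0, its tangent at any contact point d ≥ 0 upper-bounds it there (symmetrically, tangents on the convex side x ≤ 0 give lower bounds); the paper's exact dual-space parameterization differs, so take this only as the geometric intuition:

```latex
\sigma(x) = \frac{1}{1+e^{-x}}, \qquad
\sigma'(d) = \sigma(d)\bigl(1-\sigma(d)\bigr), \qquad
u_d(x) = \sigma(d) + \sigma'(d)\,(x-d) \;\ge\; \sigma(x)
\quad \text{for } x \ge 0,\ d \ge 0.
```

Sliding d along the curve is what "smoothly rotating" the affine bound around the sigmoid manifold means: each d yields a different valid cut, and the verifier can pick the tightest one for the region of interest.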

[LG-56] PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2408.10483
作者: Yongbo Yu,Weizhong Yu,Feiping Nie,Xuelong Li
关键词-EN: necessitates positional embeddings, Transformer, necessitates positional, Transformer architecture, positional embeddings
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The self-attention mechanism in Transformer architecture, invariant to sequence order, necessitates positional embeddings to encode temporal order in time series prediction. We argue that this reliance on positional embeddings restricts the Transformer’s ability to effectively represent temporal sequences, particularly when employing longer lookback windows. To address this, we introduce an innovative approach that combines Pyramid RNN embeddings (PRE) for univariate time series with the Transformer’s capability to model multivariate dependencies. PRE, utilizing pyramidal one-dimensional convolutional layers, constructs multiscale convolutional features that preserve temporal order. Additionally, RNNs, layered atop these features, learn multiscale time series representations sensitive to sequence order. This integration into Transformer models with attention mechanisms results in significant performance enhancements. We present the PRformer, a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets. This performance highlights the effectiveness of our approach in leveraging longer lookback windows and underscores the critical role of robust temporal representations in maximizing Transformer’s potential for prediction tasks. Code is available at this https URL.

[LG-57] An End-to-End Reinforcement Learning Based Approach for Micro-View Order-Dispatching in Ride-Hailing

链接: https://arxiv.org/abs/2408.10479
作者: Xinlang Yue,Yiran Liu,Fangzhou Shi,Sihong Luo,Chen Zhong,Min Lu,Zhe Xu
关键词-EN: localized spatiotemporal context, influences ride-hailing service, Assigning orders, ride-hailing service experience, spatiotemporal context
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Assigning orders to drivers under localized spatiotemporal context (micro-view order-dispatching) is a major task in Didi, as it influences ride-hailing service experience. Existing industrial solutions mainly follow a two-stage pattern that incorporate heuristic or learning-based algorithms with naive combinatorial methods, tackling the uncertainty of both sides’ behaviors, including emerging timings, spatial relationships, and travel duration, etc. In this paper, we propose a one-stage end-to-end reinforcement learning based order-dispatching approach that solves behavior prediction and combinatorial optimization uniformly in a sequential decision-making manner. Specifically, we employ a two-layer Markov Decision Process framework to model this problem, and present Deep Double Scalable Network (D2SN), an encoder-decoder structure network to generate order-driver assignments directly and stop assignments accordingly. Besides, by leveraging contextual dynamics, our approach can adapt to the behavioral patterns for better performance. Extensive experiments on Didi’s real-world benchmarks justify that the proposed approach significantly outperforms competitive baselines in optimizing matching efficiency and user experience tasks. In addition, we evaluate the deployment outline and discuss the gains and experiences obtained during the deployment tests from the view of large-scale engineering implementation.

[LG-58] LeCov: Multi-level Testing Criteria for Large Language Models

链接: https://arxiv.org/abs/2408.10474
作者: Xuan Xie,Jiayang Song,Yuheng Huang,Da Song,Fuyuan Zhang,Felix Juefei-Xu,Lei Ma
关键词-EN: Large Language Models, Large Language, truthfulness and toxicity, Language Models, limited interpretability
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are in various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover untrustworthy issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and contain nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experiment evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
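LeCov's criteria target attention heads, feed-forward neurons, and uncertainty; as a point of reference, the classic neuron-coverage criterion from DNN testing, which works in the same spirit (how much of the model's internal behavior has the test suite exercised?), fits in a few lines. The activation matrix below is invented for illustration.

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` on at least one
    test input. `activations`: shape (num_inputs, num_neurons)."""
    covered = (activations > threshold).any(axis=0)
    return float(covered.mean())

# Toy activations: 2 test inputs, 4 neurons; neuron 2 never fires.
acts = np.array([
    [0.9, -0.2, 0.0, 1.1],
    [-0.5, 0.3, -0.1, 0.8],
])
```

Criteria like this turn "have we tested enough?" into a measurable quantity, which is exactly the gap the abstract says formalized LLM testing criteria should fill, and they can drive test prioritization (pick inputs that raise coverage fastest).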

[LG-59] Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism

链接: https://arxiv.org/abs/2408.10473
作者: Guanchen Li,Xiandong Zhao,Lian Liu,Zeping Li,Dong Li,Lu Tian,Jie He,Ashish Sirasao,Emad Barsoum
关键词-EN: language processing tasks, Pre-trained language models, natural language processing, Pre-trained language, exhibit outstanding performance
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-trained language models (PLMs) are engineered to be robust in contextual understanding and exhibit outstanding performance in various natural language processing tasks. However, their considerable size incurs significant computational and storage costs. Modern pruning strategies employ one-shot techniques to compress PLMs without the need for retraining on task-specific or otherwise general data; however, these approaches often lead to an indispensable reduction in performance. In this paper, we propose SDS, a Sparse-Dense-Sparse pruning framework to enhance the performance of the pruned PLMs from a weight distribution optimization perspective. We outline the pruning process in three steps. Initially, we prune less critical connections in the model using conventional one-shot pruning methods. Next, we reconstruct a dense model featuring a pruning-friendly weight distribution by reactivating pruned connections with sparse regularization. Finally, we perform a second pruning round, yielding a superior pruned model compared to the initial pruning. Experimental results demonstrate that SDS outperforms the state-of-the-art pruning techniques SparseGPT and Wanda under an identical sparsity configuration. For instance, SDS reduces perplexity by 9.13 on Raw-Wikitext2 and improves accuracy by an average of 2.05% across multiple zero-shot benchmarks for OPT-125M with 2:4 sparsity.

[LG-60] Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions

链接: https://arxiv.org/abs/2408.10468
作者: Jinxin Liu,Zao Yang
关键词-EN: include sensitive information, Large Language Models, potential privacy leakage, Language Models, large gradient norms
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large gradient norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtained ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared to the best SOTA IFs against various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on real-world pretraining data CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.
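The core idea, down-weighting tokens whose gradient norms are large before aggregating an influence estimate, can be sketched in NumPy. The abstract does not give HAIF's exact adjustment, so the `min(1, tau/||g||)` rule below is a hypothetical stand-in, and the first-order influence proxy is a simplification of real influence functions.

```python
import numpy as np

def influence(token_grads, test_grad, token_weights=None):
    """First-order influence proxy: (weighted) sum of a training sample's
    per-token gradients, dotted with the test-loss gradient."""
    if token_weights is None:
        token_weights = np.ones(len(token_grads))
    return float((token_weights[:, None] * token_grads).sum(axis=0) @ test_grad)

def adjusted_weights(token_grads, tau=1.0):
    """Heuristic down-weighting: shrink the weight of tokens whose gradient
    norm exceeds tau, so a few large-norm tokens cannot dominate."""
    norms = np.linalg.norm(token_grads, axis=1)
    return np.minimum(1.0, tau / np.maximum(norms, 1e-12))

rng = np.random.default_rng(0)
test_grad = rng.normal(size=8)
token_grads = rng.normal(size=(5, 8))
token_grads[0] *= 50.0                    # one token with an outsized gradient
w = adjusted_weights(token_grads)
plain = influence(token_grads, test_grad)
adjusted = influence(token_grads, test_grad, w)
print(w)  # the outlier token receives the smallest weight
```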

[LG-61] Learning Multimodal Latent Space with EBM Prior and MCMC Inference

链接: https://arxiv.org/abs/2408.10467
作者: Shiyu Yuan,Carlo Lipizzi,Tian Han
关键词-EN: Chain Monte Carlo, Markov Chain Monte, MCMC inference, Monte Carlo, Markov Chain
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal generative models are crucial for various applications. We propose an approach that combines an expressive energy-based model (EBM) prior with Markov Chain Monte Carlo (MCMC) inference in the latent space for multimodal generation. The EBM prior acts as an informative guide, while MCMC inference, specifically through short-run Langevin dynamics, brings the posterior distribution closer to its true form. This method not only provides an expressive prior to better capture the complexity of multimodality but also improves the learning of shared latent variables for more coherent generation across modalities. Our proposed method is supported by empirical experiments, underscoring the effectiveness of our EBM prior with MCMC inference in enhancing cross-modal and joint generative tasks in multimodal contexts.
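The short-run Langevin dynamics the abstract refers to is a fixed, small number of noisy gradient steps on the energy. A minimal sketch follows; the quadratic toy energy and all parameter values are assumptions for illustration, not the paper's multimodal EBM.

```python
import numpy as np

def short_run_langevin(grad_energy, z0, step=0.1, n_steps=20, rng=None):
    """K-step Langevin chain: z <- z - (step^2/2) * dE/dz + step * noise."""
    rng = np.random.default_rng() if rng is None else rng
    z = z0.copy()
    for _ in range(n_steps):
        z = z - 0.5 * step**2 * grad_energy(z) + step * rng.normal(size=z.shape)
    return z

# Toy energy E(z) = 0.5 * ||z - mu||^2, whose Langevin chain samples ~N(mu, I).
mu = np.array([2.0, -1.0])
grad_E = lambda z: z - mu
samples = np.stack([
    short_run_langevin(grad_E, np.zeros(2), step=0.3, n_steps=200,
                       rng=np.random.default_rng(i))
    for i in range(500)
])
print(samples.mean(axis=0))  # close to mu
```

In the paper's setting the gradient would come from a learned energy network over the latent space rather than a closed-form quadratic.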

[LG-62] Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting INTERSPEECH2024

链接: https://arxiv.org/abs/2408.10463
作者: Hyun Jin Park,Dhruuv Agarwal,Neng Chen,Rentao Sun,Kurt Partridge,Justin Chen,Harry Zhang,Pai Zhu,Jacob Bartel,Kyle Kastner,Gary Wang,Andrew Rosenberg,Quan Wang
关键词-EN: problem requires large, requires large amounts, large amounts, KWS, achieve high accuracy
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: to be published in a Workshop at Interspeech 2024, Synthetic Data’s Transformative Role in Foundational Speech Models

点击查看摘要

Abstract:The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and real negative speech data, without any real positive examples.
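The abstract only says an adversarial training method is applied; a common way to realize this is a DANN-style gradient reversal, sketched below under that assumption. The domain head learns to tell real from TTS audio, while the shared feature extractor receives the reversed gradient, discouraging TTS-specific features. The tiny linear model and all names are illustrative.

```python
import numpy as np

def adversarial_step(W, v, x, d, lam=0.5, lr=0.01):
    """One gradient-reversal step: the domain head v descends the
    real-vs-TTS loss, while the shared feature extractor W ascends it."""
    h = x @ W                              # shared features
    logit = float(h @ v)                   # domain (real vs TTS) logit
    p = 1.0 / (1.0 + np.exp(-logit))
    dlogit = p - d                         # grad of binary cross-entropy
    grad_v = dlogit * h                    # domain-head gradient
    grad_W = np.outer(x, dlogit * v)       # feature-extractor gradient
    v_new = v - lr * grad_v                # minimize domain loss
    W_new = W + lr * lam * grad_W          # reversed: maximize domain loss
    return W_new, v_new

rng = np.random.default_rng(0)
W, v = rng.normal(size=(3, 2)), rng.normal(size=2)
x, d = rng.normal(size=3), 1.0             # d = 1 marks a TTS utterance
W2, v2 = adversarial_step(W, v, x, d)
```

In the full system this adversarial term would be added to the original KWS classification loss, matching the abstract's description of combining the two losses.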

[LG-63] Transfer Operator Learning with Fusion Frame

链接: https://arxiv.org/abs/2408.10458
作者: Haoyang Jiang,Yongzhi Qu
关键词-EN: Partial Differential Equations, solve Partial Differential, applying learned knowledge, Differential Equations, Partial Differential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The challenge of applying learned knowledge from one domain to solve problems in another related but distinct domain, known as transfer learning, is fundamental in operator learning models that solve Partial Differential Equations (PDEs). These models often struggle with generalization across different tasks and datasets, limiting their applicability in diverse scientific and engineering disciplines. This work presents a novel framework that enhances the transfer learning capabilities of operator learning models for solving Partial Differential Equations (PDEs) through the integration of fusion frame theory with the Proper Orthogonal Decomposition (POD)-enhanced Deep Operator Network (DeepONet). We introduce an innovative architecture that combines fusion frames with POD-DeepONet, demonstrating superior performance across various PDEs in our experimental analysis. Our framework addresses the critical challenge of transfer learning in operator learning models, paving the way for adaptable and efficient solutions across a wide range of scientific and engineering applications.

[LG-64] Parkinson's Disease Classification via EEG: All You Need is a Single Convolutional Layer

链接: https://arxiv.org/abs/2408.10457
作者: Md Fahim Anjum
关键词-EN: Convolutional Neural Network, Neural Network, minimalist Convolutional Neural, Parkinson disease, Convolutional Neural
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:In this work, we introduce LightCNN, a minimalist Convolutional Neural Network (CNN) architecture designed for Parkinson’s disease (PD) classification using EEG data. LightCNN’s strength lies in its simplicity, utilizing just a single convolutional layer. Embracing Leonardo da Vinci’s principle that “simplicity is the ultimate sophistication,” LightCNN demonstrates that complexity is not required to achieve outstanding results. We benchmarked LightCNN against several state-of-the-art deep learning models known for their effectiveness in EEG-based PD classification. Remarkably, LightCNN outperformed all these complex architectures, with a 2.3% improvement in recall, a 4.6% increase in precision, a 0.1% edge in AUC, a 4% boost in F1-score, and a 3.3% higher accuracy compared to the closest competitor. Furthermore, LightCNN identifies known pathological brain rhythms associated with PD and effectively captures clinically relevant neurophysiological changes in EEG. Its simplicity and interpretability make it ideal for deployment in resource-constrained environments, such as mobile or embedded systems for EEG analysis. In conclusion, LightCNN represents a significant step forward in efficient EEG-based PD classification, demonstrating that a well-designed, lightweight model can achieve superior performance over more complex architectures. This work underscores the potential for minimalist models to meet the needs of modern healthcare applications, particularly where resources are limited.
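The single-convolutional-layer idea is simple enough to sketch end to end. The forward pass below (one 1D convolution, ReLU, global average pooling, linear read-out) is an assumed minimal realization for one EEG channel; the real LightCNN's filter count, kernel width, and read-out details are not specified in the abstract.

```python
import numpy as np

def light_cnn_forward(x, kernels, w_out, b_out):
    """Minimal single-conv-layer classifier: conv1d -> ReLU -> global
    average pooling -> linear read-out (PD vs control logits)."""
    # Valid 1D cross-correlation of one EEG channel with each kernel.
    conv = np.stack([
        np.convolve(x, kern[::-1], mode="valid") for kern in kernels
    ])                                   # (n_kernels, n_out)
    act = np.maximum(conv, 0.0)          # ReLU
    pooled = act.mean(axis=1)            # global average pooling
    return pooled @ w_out + b_out        # class logits

rng = np.random.default_rng(0)
eeg = rng.normal(size=256)               # one toy EEG channel
kernels = rng.normal(size=(4, 9))        # 4 filters of width 9
w_out, b_out = rng.normal(size=(4, 2)), np.zeros(2)
logits = light_cnn_forward(eeg, kernels, w_out, b_out)
print(logits.shape)  # (2,)
```

With only one convolutional layer, the learned kernels act as band-pass-like filters, which is consistent with the abstract's claim that the model surfaces known pathological brain rhythms.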

[LG-65] Differentially Private Stochastic Gradient Descent with Fixed-Size Minibatches: Tighter RDP Guarantees with or without Replacement

链接: https://arxiv.org/abs/2408.10456
作者: Jeremiah Birrell,Reza Ebrahimi,Rouzbeh Behnia,Jason Pacheco
关键词-EN: Differentially private stochastic, privately training deep, Differentially private, training deep learning, privacy loss incurred
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 39 pages, 10 figures

点击查看摘要

Abstract:Differentially private stochastic gradient descent (DP-SGD) has been instrumental in privately training deep learning models by providing a framework to control and track the privacy loss incurred during training. At the core of this computation lies a subsampling method that uses a privacy amplification lemma to enhance the privacy guarantees provided by the additive noise. Fixed size subsampling is appealing for its constant memory usage, unlike the variable sized minibatches in Poisson subsampling. It is also of interest in addressing class imbalance and federated learning. However, the current computable guarantees for fixed-size subsampling are not tight and do not consider both add/remove and replace-one adjacency relationships. We present a new and holistic Rényi differential privacy (RDP) accountant for DP-SGD with fixed-size subsampling without replacement (FSwoR) and with replacement (FSwR). For FSwoR we consider both add/remove and replace-one adjacency. Our FSwoR results improves on the best current computable bound by a factor of 4 . We also show for the first time that the widely-used Poisson subsampling and FSwoR with replace-one adjacency have the same privacy to leading order in the sampling probability. Accordingly, our work suggests that FSwoR is often preferable to Poisson subsampling due to constant memory usage. Our FSwR accountant includes explicit non-asymptotic upper and lower bounds and, to the authors’ knowledge, is the first such analysis of fixed-size RDP with replacement for DP-SGD. We analytically and empirically compare fixed size and Poisson subsampling, and show that DP-SGD gradients in a fixed-size subsampling regime exhibit lower variance in practice in addition to memory usage benefits.
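The mechanism being analyzed, DP-SGD with a fixed-size minibatch drawn without replacement (FSwoR), can be sketched as below. This illustrates only the noisy training step; the paper's contribution, the tighter RDP accounting of the privacy loss, is a separate analysis not shown here. The toy objective and constants are assumptions.

```python
import numpy as np

def dpsgd_step(w, grad_fn, data, batch_size, clip=1.0, noise_mult=1.0,
               lr=0.1, rng=None):
    """One DP-SGD step with a fixed-size minibatch sampled without
    replacement: clip each per-example gradient, sum, add Gaussian
    noise scaled to the clipping norm, then average."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(data), size=batch_size, replace=False)  # FSwoR
    clipped = []
    for i in idx:
        g = grad_fn(w, data[i])
        scale = min(1.0, clip / max(np.linalg.norm(g), 1e-12))
        clipped.append(g * scale)
    noise = rng.normal(scale=noise_mult * clip, size=w.shape)
    g_priv = (np.sum(clipped, axis=0) + noise) / batch_size
    return w - lr * g_priv

# Toy objective: squared distance to each scalar data point.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grad_fn = lambda w, x: 2.0 * (w - np.array([x]))
w = np.zeros(1)
for t in range(200):
    w = dpsgd_step(w, grad_fn, data, batch_size=3,
                   rng=np.random.default_rng(t))
print(w)  # hovers around the data mean, 3.0
```

Unlike Poisson subsampling, `batch_size` here is constant every step, which is the constant-memory property the abstract highlights.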

[LG-66] Federated Learning of Large ASR Models in the Real World

链接: https://arxiv.org/abs/2408.10443
作者: Yonghui Xiao,Yuxin Ding,Changwan Ryu,Petr Zadrazil,Francoise Beaufays
关键词-EN: shown promising results, Federated learning, training machine learning, machine learning models, machine learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has shown promising results on training machine learning models with privacy preservation. However, for large models with over 100 million parameters, the training resource requirement becomes an obstacle for FL because common devices do not have enough memory and computation power to finish the FL tasks. Although efficient training methods have been proposed, it is still a challenge to train large models like Conformer-based ASR. This paper presents a systematic solution to train the full-size ASR models of 130M parameters with FL. To our knowledge, this is the first real-world FL application of the Conformer model, which is also the largest model ever trained with FL so far. And this is the first paper showing FL can improve the ASR model quality with a set of proposed methods to refine the quality of data and labels of clients. We demonstrate both the training efficiency and the model quality improvement in real-world experiments.
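The basic federated averaging (FedAvg) loop underlying such systems can be sketched on a toy objective. This is a generic illustration, not the paper's system: the quadratic per-client loss, client count, and step counts are assumptions, and the paper's data/label refinement methods are not shown.

```python
import numpy as np

def fedavg_round(global_w, client_data, local_steps=5, lr=0.1):
    """One federated averaging round: every client runs local gradient
    steps on its own data, and the server averages the resulting
    weights, weighted by client dataset size."""
    updates, sizes = [], []
    for data in client_data:
        w = global_w.copy()
        for _ in range(local_steps):
            w -= lr * 2.0 * (w - data.mean(axis=0))  # toy quadratic loss
        updates.append(w)
        sizes.append(len(data))
    return np.average(updates, axis=0, weights=np.asarray(sizes, float))

rng = np.random.default_rng(0)
# Three clients whose data (and hence local optima) differ.
clients = [rng.normal(loc=c, size=(20, 2)) for c in (0.0, 1.0, 2.0)]
w = np.zeros(2)
for _ in range(30):
    w = fedavg_round(w, clients)
print(w)  # near the average of the client optima, roughly [1, 1]
```

For a 130M-parameter Conformer, the communication and on-device memory cost of each round is exactly the obstacle the abstract describes.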

[LG-67] Understanding Generative AI Content with Embedding Models

链接: https://arxiv.org/abs/2408.10437
作者: Max Vargas,Reilly Cannon,Andrew Engel,Anand D. Sarwate,Tony Chiang
关键词-EN: high-quality numerical features, quantitative data analysis, construction of high-quality, high-quality numerical, numerical features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The construction of high-quality numerical features is critical to any quantitative data analysis. Feature engineering has been historically addressed by carefully hand-crafting data representations based on domain expertise. This work views the internal representations of modern deep neural networks (DNNs), called embeddings, as an automated form of traditional feature engineering. For trained DNNs, we show that these embeddings can reveal interpretable, high-level concepts in unstructured sample data. We use these embeddings in natural language and computer vision tasks to uncover both inherent heterogeneity in the underlying data and human-understandable explanations for it. In particular, we find empirical evidence that there is inherent separability between real data and that generated from AI models.

[LG-68] Learning Regularization for Graph Inverse Problems

链接: https://arxiv.org/abs/2408.10436
作者: Moshe Eliasof,Md Shahriar Rahim Siddiqui,Carola-Bibiane Schönlieb,Eldad Haber
关键词-EN: Graph Neural Networks, Neural Networks, Graph Neural, network design, social networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, Graph Neural Networks (GNNs) have been utilized for various applications ranging from drug discovery to network design and social networks. In many applications, it is impossible to observe some properties of the graph directly; instead, noisy and indirect measurements of these properties are available. These scenarios are coined as Graph Inverse Problems (GRIP). In this work, we introduce a framework leveraging GNNs to solve GRIPs. The framework is based on a combination of likelihood and prior terms, which are used to find a solution that fits the data while adhering to learned prior information. Specifically, we propose to combine recent deep learning techniques that were developed for inverse problems, together with GNN architectures, to formulate and solve GRIP. We study our approach on a number of representative problems that demonstrate the effectiveness of the framework.

[LG-69] Second-Order Forward-Mode Automatic Differentiation for Optimization

链接: https://arxiv.org/abs/2408.10419
作者: Adam D. Cobb,Atılım Güneş Baydin,Barak A. Pearlmutter,Susmit Jha
关键词-EN: step that generalizes, second-order optimization algorithm, second-order hyperplane search, optimization step, second-order line search
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:This paper introduces a second-order hyperplane search, a novel optimization step that generalizes a second-order line search from a line to a k -dimensional hyperplane. This, combined with the forward-mode stochastic gradient method, yields a second-order optimization algorithm that consists of forward passes only, completely avoiding the storage overhead of backpropagation. Unlike recent work that relies on directional derivatives (or Jacobian–Vector Products, JVPs), we use hyper-dual numbers to jointly evaluate both directional derivatives and their second-order quadratic terms. As a result, we introduce forward-mode weight perturbation with Hessian information (FoMoH). We then use FoMoH to develop a novel generalization of line search by extending it to a hyperplane search. We illustrate the utility of this extension and how it might be used to overcome some of the recent challenges of optimizing machine learning models without backpropagation. Our code is open-sourced at this https URL.
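Hyper-dual numbers, which FoMoH uses in place of plain JVPs, extend dual numbers with two nilpotent parts so one forward evaluation yields the value, first derivative, and second derivative. The scalar sketch below illustrates the arithmetic; FoMoH applies the same idea along directions of a k-dimensional hyperplane, which is not shown here.

```python
class HyperDual:
    """Hyper-dual number a + b*e1 + c*e2 + d*e1e2 with e1^2 = e2^2 = 0.
    Evaluating f(x + e1 + e2) yields f(x) (real part), f'(x) (e1 part),
    and f''(x) (e1e2 part) in a single forward pass."""
    def __init__(self, a, b=0.0, c=0.0, d=0.0):
        self.a, self.b, self.c, self.d = a, b, c, d
    def __add__(self, o):
        o = o if isinstance(o, HyperDual) else HyperDual(o)
        return HyperDual(self.a + o.a, self.b + o.b, self.c + o.c, self.d + o.d)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, HyperDual) else HyperDual(o)
        return HyperDual(
            self.a * o.a,
            self.a * o.b + self.b * o.a,
            self.a * o.c + self.c * o.a,
            self.a * o.d + self.b * o.c + self.c * o.b + self.d * o.a,
        )
    __rmul__ = __mul__

def value_grad_hess(f, x):
    """Return f(x), f'(x), f''(x) from one forward evaluation."""
    out = f(HyperDual(x, 1.0, 1.0, 0.0))
    return out.a, out.b, out.d

f = lambda x: x * x * x + 2 * x + 1      # f(x) = x^3 + 2x + 1
val, grad, hess = value_grad_hess(f, 2.0)
print(val, grad, hess)  # 13.0 14.0 12.0
```

Because everything happens in the forward pass, no backpropagation graph is stored, which is the memory advantage the abstract emphasizes.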

[LG-70] Joint Modeling of Search and Recommendations Via an Unified Contextual Recommender (UniCoRn)

链接: https://arxiv.org/abs/2408.10394
作者: Moumita Bhattacharya,Vito Ostuni,Sudarshan Lamkhede
关键词-EN: Search and recommendation, developed separately, leading to complex, technical debt, recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 3 pages, 1 figure

点击查看摘要

Abstract:Search and recommendation systems are essential in many services, and they are often developed separately, leading to complex maintenance and technical debt. In this paper, we present a unified deep learning model that efficiently handles key aspects of both tasks.

[LG-71] Value Alignment from Unstructured Text

链接: https://arxiv.org/abs/2408.10392
作者: Inkit Padhi,Karthikeyan Natesan Ramamurthy,Prasanna Sattigeri,Manish Nagireddy,Pierre Dognin,Kush R. Varshney
关键词-EN: Aligning large language, large language models, large language, systems has emerged, significant area
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages the use of scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use-cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through the use of automatic metrics and win rates.

[LG-72] Deep-MacroFin: Informed Equilibrium Neural Network for Continuous Time Economic Models

链接: https://arxiv.org/abs/2408.10368
作者: Yuntao Wu,Jiayuan Guo,Goutham Gopalakrishna,Zisis Poulos
关键词-EN: comprehensive framework designed, continuous time economics, solve partial differential, present Deep-MacroFin, designed to solve
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:In this paper, we present Deep-MacroFin, a comprehensive framework designed to solve partial differential equations, with a particular focus on models in continuous time economics. This framework leverages deep learning methodologies, including conventional Multi-Layer Perceptrons and the newly developed Kolmogorov-Arnold Networks. It is optimized using economic information encapsulated by Hamilton-Jacobi-Bellman equations and coupled algebraic equations. The application of neural networks holds the promise of accurately resolving high-dimensional problems with fewer computational demands and limitations compared to standard numerical methods. This versatile framework can be readily adapted for elementary differential equations, and systems of differential equations, even in cases where the solutions may exhibit discontinuities. Importantly, it offers a more straightforward and user-friendly implementation than existing libraries.

[LG-73] On the Identifiability of Sparse ICA without Assuming Non-Gaussianity NEURIPS2023

链接: https://arxiv.org/abs/2408.10353
作者: Ignavier Ng,Yujia Zheng,Xinshuai Dong,Kun Zhang
关键词-EN: Independent component analysis, fundamental statistical tool, reveal hidden generative, hidden generative processes, Independent component
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2023

点击查看摘要

Abstract:Independent component analysis (ICA) is a fundamental statistical tool used to reveal hidden generative processes from observed data. However, traditional ICA approaches struggle with the rotational invariance inherent in Gaussian distributions, often necessitating the assumption of non-Gaussianity in the underlying sources. This may limit their applicability in broader contexts. To accommodate Gaussian sources, we develop an identifiability theory that relies on second-order statistics without imposing further preconditions on the distribution of sources, by introducing novel assumptions on the connective structure from sources to observed variables. Different from recent work that focuses on potentially restrictive connective structures, our proposed assumption of structural variability is both considerably less restrictive and provably necessary. Furthermore, we propose two estimation methods based on second-order statistics and sparsity constraint. Experimental results are provided to validate our identifiability theory and estimation methods.

[LG-74] AIR: Analytic Imbalance Rectifier for Continual Learning

链接: https://arxiv.org/abs/2408.10349
作者: Di Fang,Yinan Zhu,Runze Fang,Cen Chen,Ziqian Zeng,Huiping Zhuang
关键词-EN: Continual learning enables, generalized CIL scenarios, Continual learning, sequentially without retraining, CIL scenarios
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Continual learning enables AI models to learn new data sequentially without retraining in real-world scenarios. Most existing methods assume the training data are balanced, aiming to reduce catastrophic forgetting, i.e., the tendency of models to forget previously learned data. However, data imbalance and the mixture of new and old data in real-world scenarios lead the model to ignore categories with fewer training samples. To solve this problem, we propose an analytic imbalance rectifier algorithm (AIR), a novel online exemplar-free continual learning method with an analytic (i.e., closed-form) solution for data-imbalanced class-incremental learning (CIL) and generalized CIL scenarios in real-world continual learning. AIR introduces an analytic re-weighting module (ARM) that calculates a re-weighting factor for each class for the loss function to balance the contribution of each category to the overall loss and solve the problem of imbalanced training data. AIR uses the least squares technique to give a non-discriminatory optimal classifier and its iterative update method in continual learning. Experimental results on multiple datasets show that AIR significantly outperforms existing methods in long-tailed and generalized CIL scenarios. The source code is available at this https URL.
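The combination of per-class re-weighting with a closed-form least-squares classifier can be sketched as below. The exact ARM re-weighting factor is not given in the abstract, so inverse class frequency is used as an assumed stand-in, and the incremental-update machinery of AIR is omitted.

```python
import numpy as np

def air_classifier(X, y, n_classes, reg=1e-3):
    """Closed-form least-squares classifier with analytic re-weighting:
    each sample is weighted inversely to its class count, so minority
    classes contribute as much to the loss as majority ones."""
    counts = np.bincount(y, minlength=n_classes).astype(float)
    sample_w = 1.0 / counts[y]                 # per-sample re-weighting factor
    Y = np.eye(n_classes)[y]                   # one-hot targets
    Xw = X * sample_w[:, None]                 # rows scaled by the weights
    # Weighted ridge solution: (X^T D X + reg I)^{-1} X^T D Y
    return np.linalg.solve(X.T @ Xw + reg * np.eye(X.shape[1]), Xw.T @ Y)

rng = np.random.default_rng(0)
# Imbalanced toy data: 95 samples near -1 (class 0), 5 near +3 (class 1).
X = np.vstack([rng.normal(-1, 0.5, size=(95, 1)),
               rng.normal(3, 0.5, size=(5, 1))])
X = np.hstack([X, np.ones((100, 1))])          # bias column
y = np.array([0] * 95 + [1] * 5)
W = air_classifier(X, y, 2)
pred = (X @ W).argmax(axis=1)
print((pred == y).mean())
```

The closed-form solve is what makes the method "analytic": no gradient iterations are needed, which also enables exemplar-free incremental updates in the full algorithm.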

[LG-75] Spectral Guarantees for Adversarial Streaming PCA

链接: https://arxiv.org/abs/2408.10332
作者: Eric Price,Zhiyang Xun
关键词-EN: streaming PCA, covariance matrix, estimate the top, top eigenvector, PCA
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: FOCS 2024

点击查看摘要

Abstract:In streaming PCA, we see a stream of vectors x_1, \dotsc, x_n \in \mathbb{R}^d and want to estimate the top eigenvector of their covariance matrix. This is easier if the spectral ratio R = \lambda_1 / \lambda_2 is large. We ask: how large does R need to be to solve streaming PCA in \widetilde{O}(d) space? Existing algorithms require R = \widetilde{\Omega}(d) . We show: (1) For all mergeable summaries, R = \widetilde{\Omega}(\sqrt{d}) is necessary. (2) In the insertion-only model, a variant of Oja’s algorithm gets o(1) error for R = O(\log n \log d) . (3) No algorithm with o(d^2) space gets o(1) error for R = O(1) . Our analysis is the first application of Oja’s algorithm to adversarial streams. It is also the first algorithm for adversarial streaming PCA that is designed for a spectral, rather than Frobenius, bound on the tail; and the bound it needs is exponentially better than is possible by adapting a Frobenius guarantee.
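Oja's algorithm, the O(d)-memory method at the heart of the paper's upper bound, maintains a single unit vector and updates it Hebbian-style from each incoming sample. A minimal sketch on a stream with a large spectral ratio follows; the learning rate and stream are illustrative, not the paper's adversarial setting.

```python
import numpy as np

def oja_top_eigvec(stream, lr=0.01, seed=0):
    """Oja's rule: streaming estimate of the top eigenvector of the
    covariance of the incoming vectors, using O(d) memory."""
    w = None
    for x in stream:
        if w is None:
            w = np.random.default_rng(seed).normal(size=x.shape)
            w /= np.linalg.norm(w)
        w += lr * x * (x @ w)          # Hebbian pull toward top eigvec
        w /= np.linalg.norm(w)         # project back onto the unit sphere
    return w

# Stream with spectral ratio R = 9: variance 9 along e1, 1 elsewhere.
rng = np.random.default_rng(1)
stream = (np.array([3.0, 1.0, 1.0]) * rng.normal(size=3) for _ in range(5000))
w = oja_top_eigvec(stream)
print(abs(w[0]))  # close to 1: aligned with the top eigenvector e1
```

Only the current vector `w` is stored, never the d-by-d covariance, which is why the space bound is \widetilde{O}(d) rather than O(d^2).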

[LG-76] Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review

链接: https://arxiv.org/abs/2408.10330
作者: Athul Raimon,Shubha Masti,Shyam K Sateesh,Siyani Vengatagiri,Bhaskarjyoti Das
关键词-EN: speech processing scenarios, audio, audio processing, processing scenarios, meta-learning
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Survey Paper (15 pages, 1 figure)

点击查看摘要

Abstract:This survey overviews various meta-learning approaches used in audio and speech processing scenarios. Meta-learning is used where model performance needs to be maximized with minimum annotated samples, making it suitable for low-sample audio processing. Although the field has made some significant contributions, audio meta-learning still lacks the presence of comprehensive survey papers. We present a systematic review of meta-learning methodologies in audio processing. This includes audio-specific discussions on data augmentation, feature extraction, preprocessing techniques, meta-learners, task selection strategies and also presents important datasets in audio, together with crucial real-world use cases. Through this extensive review, we aim to provide valuable insights and identify future research directions in the intersection of meta-learning and audio processing.

[LG-77] Decoding Human Emotions: Analyzing Multi-Channel EEG Data using LSTM Networks

链接: https://arxiv.org/abs/2408.10328
作者: Shyam K Sateesh,Sparsh BK,Uma D
关键词-EN: Human-Computer Interaction, Long Short-Term Memory, analyze EEG signals, EEG signal data, thriving field
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 13 pages, 3 figures; accepted at ICDSA '24 Conference, Jaipur, India

点击查看摘要

Abstract:Emotion recognition from electroencephalogram (EEG) signals is a thriving field, particularly in neuroscience and Human-Computer Interaction (HCI). This study aims to understand and improve the predictive accuracy of emotional state classification through metrics such as valence, arousal, dominance, and likeness by applying a Long Short-Term Memory (LSTM) network to analyze EEG signals. Using a popular dataset of multi-channel EEG recordings known as DEAP, we look towards leveraging LSTM networks’ properties to handle temporal dependencies within EEG signal data. This allows for a more comprehensive understanding and classification of emotional parameter states. We obtain accuracies of 89.89%, 90.33%, 90.70%, and 90.54% for arousal, valence, dominance, and likeness, respectively, demonstrating significant improvements in emotion recognition model capabilities. This paper elucidates the methodology and architectural specifics of our LSTM model and provides a benchmark analysis with existing papers.

[LG-78] Leveraging Superfluous Information in Contrastive Representation Learning

链接: https://arxiv.org/abs/2408.10292
作者: Xuechu Yu
关键词-EN: learn the shared information, downstream tasks, aims to learn, shared information, shown its powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Contrastive representation learning, which aims to learn the shared information between different views of unlabeled data by maximizing the mutual information between them, has shown its powerful competence in self-supervised learning for downstream tasks. However, recent works have demonstrated that more estimated mutual information does not guarantee better performance in different downstream tasks. Such works inspire us to conjecture that the learned representations not only maintain task-relevant information from unlabeled data but also carry task-irrelevant information which is superfluous for downstream tasks, thus leading to performance degeneration. In this paper we show that superfluous information does exist during the conventional contrastive learning framework, and further design a new objective, namely SuperInfo, to learn robust representations by a linear combination of both predictive and superfluous information. Besides, we notice that it is feasible to tune the coefficients of introduced losses to discard task-irrelevant information, while keeping partial non-shared task-relevant information according to our SuperInfo loss. We demonstrate that learning with our loss can often outperform the traditional contrastive learning approaches on image classification, object detection and instance segmentation tasks with significant improvements.
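The mutual-information maximization that SuperInfo builds on is usually realized with the InfoNCE loss between two views of the same batch. The sketch below shows that base loss only; the exact linear combination of predictive and superfluous terms in SuperInfo is not specified in the abstract, so it is not reproduced here.

```python
import numpy as np

def info_nce(z1, z2, temp=0.5):
    """InfoNCE between two batches of view embeddings: matched rows are
    positives, all other pairings negatives; minimizing this maximizes
    a lower bound on the mutual information between the views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temp                     # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))    # -log p(positive | row)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))   # good views
shuffled = info_nce(z, rng.normal(size=(8, 16)))             # unrelated views
print(aligned, shuffled)  # aligned views give a much lower loss
```

The paper's point is that driving this bound ever lower can also retain superfluous (task-irrelevant) information, motivating the additional weighted term.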

[LG-79] Chatbots and Zero Sales Resistance

链接: https://arxiv.org/abs/2408.10291
作者: Sauro Succi
关键词-EN: machine learning applications, large-scale machine learning, energetically unsustainable, financial power, increasing number
类目: Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:It is argued that the pursuit of an ever increasing number of weights in large-scale machine learning applications, besides being energetically unsustainable, is also conducive to manipulative strategies whereby Science is easily served as a strawman for economic and financial power. If machine learning is meant to serve science ahead of vested business interests, a paradigm shift is needed: from more weights and little insight to more insight and less weights.

[LG-80] Augmenting train maintenance technicians with automated incident diagnostic suggestions

链接: https://arxiv.org/abs/2408.10288
作者: Georges Tod,Jean Bruggeman,Evert Bevernage,Pieter Moelans,Walter Eeckhout,Jean-Luc Glineur
关键词-EN: diagnosed individually, individually and manually, train maintenance technicians, train maintenance, Train
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Train operational incidents are so far diagnosed individually and manually by train maintenance technicians. In order to assist maintenance crews in their responsiveness and task prioritization, a learning machine is developed and deployed in production to suggest diagnostics to train technicians on their phones, tablets or laptops as soon as a train incident is declared. A feedback loop allows to take into account the actual diagnose by designated train maintenance experts to refine the learning machine. By formulating the problem as a discrete set classification task, feature engineering methods are proposed to extract physically plausible sets of events from traces generated on-board railway vehicles. The latter feed an original ensemble classifier to classify incidents by their potential technical cause. Finally, the resulting model is trained and validated using real operational data and deployed on a cloud platform. Future work will explore how the extracted sets of events can be used to avoid incidents by assisting human experts in the creation of predictive maintenance alerts.

[LG-81] GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching

链接: https://arxiv.org/abs/2408.10286
作者: Xiao Han,Zijian Zhang,Xiangyu Zhao,Guojiang Shen,Xiangjie Kong,Xuetao Wei,Liqiang Nie,Jieping Ye
关键词-EN: online ride-hailing services, residents demand higher, urban residents demand, demand higher travel, critical component
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As urban residents demand higher travel quality, vehicle dispatch has become a critical component of online ride-hailing services. However, current vehicle dispatch systems struggle to navigate the complexities of urban traffic dynamics, including unpredictable traffic conditions, diverse driver behaviors, and fluctuating supply and demand patterns. These challenges have resulted in travel difficulties for passengers in certain areas, while many drivers in other areas are unable to secure orders, leading to a decline in the overall quality of urban transportation services. To address these issues, this paper introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs to capture hierarchical traffic states, and learns a dynamic reward function that accounts for individual driving behaviors. The framework further integrates a GPT model trained with a custom loss function to enable high-precision predictions and optimize dispatching policies in real-world scenarios. Experiments conducted on two real-world datasets demonstrate that GARLIC effectively aligns with driver behaviors while reducing the empty load rate of vehicles.

[LG-82] BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

链接: https://arxiv.org/abs/2408.10285
作者: Yifei Yang,Runhan Shi,Zuchao Li,Shu Jiang,Bao-Liang Lu,Yang Yang,Hai Zhao
关键词-EN: organic chemistry, pivotal yet challenging, discovery and organic, challenging in drug, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Retrosynthesis analysis is pivotal yet challenging in drug discovery and organic chemistry. Despite the proliferation of computational tools over the past decade, AI-based systems often fall short in generalizing across diverse reaction types and exploring alternative synthetic pathways. This paper presents BatGPT-Chem, a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Integrating chemical tasks via a unified framework of natural language and SMILES notation, this approach synthesizes extensive instructional data from an expansive chemical database. Employing both autoregressive and bidirectional training techniques across over one hundred million instances, BatGPT-Chem captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions and exhibiting strong zero-shot capabilities. Superior to existing AI methods, our model demonstrates significant advancements in generating effective strategies for complex molecules, as validated by stringent benchmark tests. BatGPT-Chem not only boosts the efficiency and creativity of retrosynthetic analysis but also establishes a new standard for computational tools in synthetic design. This development empowers chemists to adeptly address the synthesis of novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science. We release our trial platform at this https URL.

[LG-83] AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

链接: https://arxiv.org/abs/2408.10284
作者: Shuzhang Zhong,Ling Liang,Yuan Wang,Runsheng Wang,Ru Huang,Meng Li
关键词-EN: large language models, computational demands, language models, designed to enhance, enhance the efficiency
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models are designed to enhance the efficiency of large language models (LLMs) without proportionally increasing the computational demands. However, their deployment on edge devices still faces significant challenges due to high on-demand loading overheads from managing sparsely activated experts. This paper introduces AdapMoE, an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads. We observe the heterogeneity of experts loading across layers and tokens, based on which we propose a sensitivity-based strategy to adjust the number of activated experts dynamically. Meanwhile, we also integrate advanced prefetching and cache management techniques to further reduce the loading latency. Through comprehensive evaluations on various platforms, we demonstrate AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without accuracy degradation. Code is available at: this https URL.
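The sensitivity-based gating described above can be illustrated with a toy routine: activate experts greedily, most probable first, until a probability-mass threshold is reached. This is a minimal sketch of the general idea, not the authors' algorithm; the function name, threshold value, and expert cap are invented for illustration.

```python
import math

def adaptive_expert_selection(gate_logits, mass_threshold=0.8, max_experts=4):
    """Pick the smallest set of top experts whose softmax mass reaches a threshold."""
    exps = [math.exp(x) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank experts by routing probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, mass = [], 0.0
    for i in ranked:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold or len(chosen) >= max_experts:
            break
    return chosen
```

A confident router (one dominant logit) then activates a single expert, while a flat router falls back to the cap, which is one way a dynamic expert count can cut on-demand loading.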

[LG-84] NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models

链接: https://arxiv.org/abs/2408.10280
作者: Cheng Lin,Lujun Li,Dezhi Li,Jie Zou,Wenhan Luo,Wei Xue,Yike Guo
关键词-EN: introduce Nested Low-Rank, Nested Low-Rank Adaptation, Low-Rank Adaptation, Nested Low-Rank, extends the capabilities
类目: Machine Learning (cs.LG)
*备注: Work in progress, revisions ongoing

点击查看摘要

Abstract:In this paper, we introduce Nested Low-Rank Adaptation (NoRA), a novel approach to parameter-efficient fine-tuning that extends the capabilities of Low-Rank Adaptation (LoRA) techniques. Vanilla LoRA overlooks pre-trained weight inheritance and still requires fine-tuning numerous parameters. To address these issues, our NoRA adopts a dual-layer nested structure with Singular Value Decomposition (SVD), effectively leveraging original matrix knowledge while reducing tunable parameters. Specifically, NoRA freezes the outer LoRA weights and utilizes an inner LoRA design, providing enhanced control over model optimization. This approach allows the model to more precisely adapt to specific tasks while maintaining a compact parameter space. Evaluations on tasks including commonsense reasoning with large language models, fine-tuning vision-language models, and subject-driven generation demonstrate NoRA's superiority over LoRA and its variants. Notably, NoRA reduces fine-tuning parameters, training time, and memory usage by 4%, 22.5%, and 20.7% respectively compared to LoRA on LLaMA-3 8B, while achieving 2.2% higher performance. Code will be released upon acceptance.
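As a rough illustration of why a nested design shrinks the trainable footprint, the following back-of-the-envelope parameter count compares vanilla LoRA with a hypothetical nested variant in which the outer factors are frozen and only a smaller inner adapter inside the rank-`r_outer` bottleneck is trained. This reading of the architecture is an assumption on our part; the paper's exact factorization may differ.

```python
def lora_trainable_params(d_in, d_out, r):
    # Standard LoRA trains A (d_in x r) and B (r x d_out).
    return d_in * r + r * d_out

def nora_trainable_params(r_outer, r_inner):
    # Hypothetical nested reading: the outer LoRA factors are frozen, and only
    # an inner LoRA living in the r_outer-dimensional bottleneck is trained,
    # i.e. A' (r_outer x r_inner) and B' (r_inner x r_outer).
    return r_outer * r_inner + r_inner * r_outer
```

For LLaMA-like dimensions (d = 4096, r = 16, inner rank 4), the inner adapter is three orders of magnitude smaller than the full LoRA update, which is the intuition behind the reported parameter savings.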

[LG-85] Increasing transformer token length with a Maximum Entropy Principle Method

链接: https://arxiv.org/abs/2408.10277
作者: R. I. Cukier
关键词-EN: sequences processed, quadratic dependence, Maximum Entropy Principle, Transformers suffer, length of sequences
类目: Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Transformers suffer from the computational overhead of their quadratic dependence on the length of the sequences processed. We present three methods, all adding an intermediate step between training and inference/generation, which extend the autoregressive length of transformers. All rely on a Maximum Entropy Principle (MEP) whereby entropy is maximized in the presence of suitable constraints, accounted for by the use of Lagrange multipliers. These constraint methods extend the autoregressive character from T to 2T tokens in a linear-with-T fashion. There is overhead associated with this added step, but these methods should still be faster than the standard ones.
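The maximum-entropy machinery the abstract invokes has a standard form: maximize entropy subject to moment constraints, handled with Lagrange multipliers. The following is the textbook setup, not the paper's specific constraint set:

```latex
\max_{p}\; -\sum_i p_i \log p_i
\quad \text{s.t.} \quad \sum_i p_i = 1, \qquad \sum_i p_i\, f_k(i) = c_k \quad (k = 1,\dots,K)
```

Introducing multipliers \(\lambda_k\) and setting the gradient of the Lagrangian to zero yields the exponential-family solution

```latex
p_i = \frac{1}{Z(\lambda)} \exp\!\Big(-\sum_k \lambda_k f_k(i)\Big),
\qquad Z(\lambda) = \sum_i \exp\!\Big(-\sum_k \lambda_k f_k(i)\Big),
```

with the \(\lambda_k\) chosen so that the constraints \(\sum_i p_i f_k(i) = c_k\) hold.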

[LG-86] FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models EMNLP’24

链接: https://arxiv.org/abs/2408.10276
作者: Xiaochen Wang,Jiaqi Wang,Houping Xiao,Jinghui Chen,Fenglong Ma
关键词-EN: outperforming conventional artificial, conventional artificial intelligence, demonstrated remarkable capabilities, handling diverse modalities, outperforming conventional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to EMNLP’24

点击查看摘要

Abstract:Foundation models have demonstrated remarkable capabilities in handling diverse modalities and tasks, outperforming conventional artificial intelligence (AI) approaches that are highly task-specific and modality-reliant. In the medical domain, however, the development of comprehensive foundation models is constrained by limited access to diverse modalities and stringent privacy regulations. To address these constraints, this study introduces a novel knowledge injection approach, FedKIM, designed to scale the medical foundation model within a federated learning framework. FedKIM leverages lightweight local models to extract healthcare knowledge from private data and integrates this knowledge into a centralized foundation model using a designed adaptive Multitask Multimodal Mixture Of Experts (M3OE) module. This method not only preserves privacy but also enhances the model’s ability to handle complex medical tasks involving multiple modalities. Our extensive experiments across twelve tasks in seven modalities demonstrate the effectiveness of FedKIM in various settings, highlighting its potential to scale medical foundation models without direct access to sensitive data.

[LG-87] FedKBP: Federated dose prediction framework for knowledge-based planning in radiation therapy

链接: https://arxiv.org/abs/2408.10275
作者: Jingyun Chen,Martin King,Yading Yuan
关键词-EN: automatically generating patient-specific, generating patient-specific dose, Dose prediction plays, patient-specific dose distribution, Dose prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under review by SPIE Medical Imaging 2025 Conference

点击查看摘要

Abstract:Dose prediction plays a key role in knowledge-based planning (KBP) by automatically generating patient-specific dose distributions. Recent advances in deep learning-based dose prediction methods necessitate collaboration among data contributors for improved performance. Federated learning (FL) has emerged as a solution, enabling medical centers to jointly train deep-learning models without compromising patient data privacy. We developed the FedKBP framework to evaluate the performance of centralized, federated, and individual (i.e. separated) training of a dose prediction model on the 340 plans from the OpenKBP dataset. To simulate FL and individual training, we divided the data into 8 training sites. To evaluate the effect of inter-site data variation on model training, we implemented two types of case distributions: 1) independent and identically distributed (IID), where the training and validating cases were evenly divided among the 8 sites, and 2) non-IID, where some sites have more cases than others. The results show FL consistently outperforms individual training on both model optimization speed and out-of-sample testing scores, highlighting the advantage of FL over individual training. Under IID data division, FL shows comparable performance to centralized training, underscoring FL as a promising alternative to traditional pooled-data training. Under non-IID division, larger sites outperformed smaller sites by up to 19% on testing scores, confirming the need for collaboration among data owners to achieve better prediction accuracy. Meanwhile, non-IID FL showed reduced performance as compared to IID FL, posing the need for more sophisticated FL methods beyond mere model averaging to handle data variation among participating sites.
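The "mere model averaging" baseline the abstract refers to is FedAvg-style weighted averaging of site parameters. A minimal sketch over flattened parameter vectors (the function name is ours, not from the paper):

```python
def federated_average(site_params, site_sizes):
    """FedAvg-style aggregation: average each parameter across sites,
    weighted by the number of training cases at each site."""
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [sum(p[j] * s for p, s in zip(site_params, site_sizes)) / total
            for j in range(n_params)]
```

Under non-IID splits this weighting lets large sites dominate the global model, which is consistent with the paper's observation that plain averaging degrades and more sophisticated aggregation is needed.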

[LG-88] Data-Driven Fire Modeling: Learning First Arrival Times and Model Parameters with Neural Networks

链接: https://arxiv.org/abs/2408.10271
作者: Xin Tong,Bryan Quaife
关键词-EN: complement physics-based models, Data-driven techniques, fire science, neural networks, increasingly applied
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven techniques are being increasingly applied to complement physics-based models in fire science. However, the lack of sufficiently large datasets continues to hinder the application of certain machine learning techniques. In this paper, we use simulated data to investigate the ability of neural networks to parameterize dynamics in fire science. In particular, we investigate neural networks that map five key parameters in fire spread to the first arrival time, and the corresponding inverse problem. By using simulated data, we are able to characterize the error, the required dataset size, and the convergence properties of these neural networks. For the inverse problem, we quantify the network’s sensitivity in estimating each of the key parameters. The findings demonstrate the potential of machine learning in fire science, highlight the challenges associated with limited dataset sizes, and quantify the sensitivity of neural networks to estimate key parameters governing fire spread dynamics.
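Sensitivity of a trained surrogate to each input parameter can be estimated with central finite differences. The paper does not specify its sensitivity method, so the sketch below is only one plausible, generic approach:

```python
def parameter_sensitivity(f, params, i, eps=1e-4):
    """Central finite-difference estimate of d f / d params[i],
    where f maps a parameter list to a scalar (e.g. first arrival time)."""
    up = list(params); up[i] += eps
    dn = list(params); dn[i] -= eps
    return (f(up) - f(dn)) / (2 * eps)
```

Sweeping `i` over the five fire-spread parameters would then rank how strongly each one perturbs the predicted first arrival time.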

[LG-89] SEAL: Systematic Error Analysis for Value ALignment

链接: https://arxiv.org/abs/2408.10270
作者: Manon Revel,Matteo Cargnelutti,Tyna Eloundou,Greg Leppert
关键词-EN: Reinforcement Learning, align language models, training reward models, Human Feedback, language models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 28 pages, 17 Figures, 8 Tables

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
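Of the three metrics, alignment resistance has the most direct formulation: the share of preference pairs where the RM fails to rank the human-chosen response strictly above the rejected one. A sketch under that reading (function name ours):

```python
def alignment_resistance(chosen_scores, rejected_scores):
    """Fraction of preference pairs where the reward model scores the
    human-rejected response at least as high as the human-chosen one."""
    pairs = list(zip(chosen_scores, rejected_scores))
    disagreements = sum(1 for c, r in pairs if c <= r)
    return disagreements / len(pairs)
```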

[LG-90] OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction

链接: https://arxiv.org/abs/2408.10269
作者: Zhonghang Li,Long Xia,Lei Shi,Yong Xu,Dawei Yin,Chao Huang
关键词-EN: enabling efficient resource, enhanced travel experiences, efficient resource allocation, Accurate traffic forecasting, effective urban planning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 12 pages

点击查看摘要

Abstract:Accurate traffic forecasting is crucial for effective urban planning and transportation management, enabling efficient resource allocation and enhanced travel experiences. However, existing models often face limitations in generalization, struggling with zero-shot prediction on unseen regions and cities, as well as diminished long-term accuracy. This is primarily due to the inherent challenges in handling the spatial and temporal heterogeneity of traffic data, coupled with the significant distribution shift across time and space. In this work, we aim to unlock new possibilities for building versatile, resilient and adaptive spatio-temporal foundation models for traffic prediction. To achieve this goal, we introduce a novel foundation model, named OpenCity, that can effectively capture and normalize the underlying spatio-temporal patterns from diverse data characteristics, facilitating zero-shot generalization across diverse urban environments. OpenCity integrates the Transformer architecture with graph neural networks to model the complex spatio-temporal dependencies in traffic data. By pre-training OpenCity on large-scale, heterogeneous traffic datasets, we enable the model to learn rich, generalizable representations that can be seamlessly applied to a wide range of traffic forecasting scenarios. Experimental results demonstrate that OpenCity exhibits exceptional zero-shot predictive performance. Moreover, OpenCity showcases promising scaling laws, suggesting the potential for developing a truly one-for-all traffic prediction solution that can adapt to new urban contexts with minimal overhead. We made our proposed OpenCity model open-source and it is available at the following link: this https URL.

[LG-91] Realtime Generation of Streamliners with Large Language Models

链接: https://arxiv.org/abs/2408.10268
作者: Florentina Voboril,Vaidyanathan Peruvemba Ramaswamy,Stefan Szeider
关键词-EN: Large Language Models, Language Models, Large Language, paper presents, Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the novel method StreamLLM for generating streamliners in constraint programming using Large Language Models (LLMs). Streamliners are constraints that narrow the search space, enhancing the speed and feasibility of solving complex problems. Traditionally, streamliners were crafted manually or generated by systematically combining atomic constraints with high-effort offline testing. Our approach uses LLMs to propose effective streamliners. Our system StreamLLM generates streamliners for problems specified in the MiniZinc constraint programming language and integrates feedback to the LLM with quick empirical tests. Our rigorous empirical evaluation involving ten problems with several hundred test instances shows robust results that are highly encouraging, showcasing the transformative power of LLMs in the domain of constraint programming.

[LG-92] Towards Efficient Machine Learning Method for IoT DDoS Attack Detection

链接: https://arxiv.org/abs/2408.10267
作者: P Modi
关键词-EN: harmful security attacks, harmful security, IoT devices, big concern, concern to ensure
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:With the rise in the number of IoT devices and their users, security in IoT has become a big concern to ensure protection from harmful security attacks. In recent years, different variants of DDoS attacks have been on the rise in IoT devices. Failure to detect DDoS attacks at the right time can result in financial and reputational loss for victim organizations. These attacks, conducted with IoT devices, can cause significant downtime of applications running on the Internet. Although researchers have developed and utilized specialized models using artificial intelligence techniques, these models do not provide the best accuracy, as there is always scope for improvement until 100% accuracy is attained. We propose a hybrid feature selection algorithm that selects only the most useful features and passes those features into an XGBoost model, the results of which are explained using feature importances. Our model attains an accuracy of 99.993% on the CIC IDS 2017 dataset and a recall of 97.64% on the CIC IoT 2023 dataset. Overall, this research would help researchers and implementers in the field of detecting IoT DDoS attacks by providing a more accurate and comparable model.
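The paper's hybrid feature selection algorithm is not detailed in the abstract; a typical filter stage in such a pipeline drops near-constant traffic features before a model-based ranking. An illustrative sketch of that filter stage only, with invented feature names:

```python
def variance_filter(features, min_variance):
    """Keep only feature columns whose (population) variance meets a threshold.
    `features` maps a feature name to its list of observed values."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return [name for name, col in features.items() if var(col) >= min_variance]
```

The surviving features would then be passed to a wrapper stage (e.g. XGBoost feature importances) to pick the final subset.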

[LG-93] Diffusion Model for Planning: A Systematic Literature Review

链接: https://arxiv.org/abs/2408.10266
作者: Toshihide Ubukata,Jialong Li,Kenji Tei
关键词-EN: leverage stochastic processes, iterative denoising processes, data distributions effectively, achieving notable success, capture complex data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 13 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Diffusion models, which leverage stochastic processes to capture complex data distributions effectively, have demonstrated strong performance as generative models, achieving notable success in image-related tasks through iterative denoising processes. Recently, diffusion models have been further applied to planning tasks and have shown strong abilities there, leading to a significant growth in related publications since 2023. To help researchers better understand the field and promote its development, we conduct a systematic literature review of recent advancements in the application of diffusion models for planning. Specifically, this paper categorizes and discusses the current literature from the following perspectives: (i) relevant datasets and benchmarks used for evaluating diffusion model-based planning; (ii) fundamental studies that address aspects such as sampling efficiency; (iii) skill-centric and condition-guided planning for enhancing adaptability; (iv) safety and uncertainty managing mechanisms for enhancing safety and robustness; and (v) domain-specific applications such as autonomous driving. Finally, given the above literature review, we further discuss the challenges and future directions in this field.

[LG-94] OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data

链接: https://arxiv.org/abs/2408.10264
作者: Chengyu Gong,Gefei Shen,Luanzheng Guo,Nathan Tallent,Dongfang Zhao
关键词-EN: scientific data management, multimodal scientific data, similar items, original multimodal data, multimodal machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:One of the most common operations in multimodal scientific data management is searching for the k most similar items (or, k-nearest neighbors, KNN) from the database after being provided a new item. Although recent advances in multimodal machine learning models offer a semantic index, the so-called embedding vectors mapped from the original multimodal data, the dimension of the resulting embedding vectors is usually on the order of hundreds or a thousand, which is impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-k nearest neighbors does not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). In order to develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function may be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric spaces, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
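The order-preservation property can be checked empirically by comparing top-k neighbour sets before and after reduction. A small sketch of that basic overlap check (not the paper's formal measure function; names are ours):

```python
import math

def top_k(query, vectors, k):
    """Indices of the k nearest vectors to `query` under Euclidean distance."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(query, vectors[i]))
    return set(order[:k])

def knn_preservation(query_hi, data_hi, query_lo, data_lo, k):
    """Fraction of the top-k neighbour set preserved after dimension reduction."""
    return len(top_k(query_hi, data_hi, k) & top_k(query_lo, data_lo, k)) / k
```

A preservation score of 1.0 for all queries is exactly the OPDR goal: the reduced space returns the same KNN answers as the original embedding space.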

[LG-95] Kolmogorov Arnold Networks in Fraud Detection: Bridging the Gap Between Theory and Practice

链接: https://arxiv.org/abs/2408.10263
作者: Yang Lu,Felix Zhan
关键词-EN: Kolmogorov Arnold Networks, Kolmogorov Arnold, Arnold Networks, electronic shopping industries, handle complex patterns
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Kolmogorov Arnold Networks (KAN) are highly efficient in inference and can handle complex patterns once trained, making them desirable for production environments and ensuring a fast service experience in the finance and electronic shopping industries. However, we found that KAN, in general, is not suitable for fraud detection problems. We also discovered a quick method to determine whether a problem is solvable by KAN: if the data can be effectively separated using spline interpolation with varying intervals after applying Principal Component Analysis (PCA) to reduce the data dimensions to two, KAN can outperform most machine learning algorithms. Otherwise, it indicates KAN may not solve the problem effectively compared to other machine learning algorithms. We also propose a heuristic approach for selecting the appropriate hyperparameters for KAN to significantly accelerate training time compared to grid search hyperparameter tuning, which usually takes a month for a comprehensive grid search. Specifically, the width parameter should generally follow a pyramid structure, allowing efficient spline mixing, and k should be fixed at 15, with the grid number fixed at 5. This streamlined approach minimizes the number of evaluations required, significantly speeding up the hyperparameter tuning process while still achieving robust performance metrics.
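One way to read the "pyramid structure" width heuristic is as geometrically interpolating layer widths from the input dimension down to the output dimension; that reading is our assumption, and the sketch below is illustrative only:

```python
def pyramid_widths(input_dim, output_dim, depth):
    """Geometrically interpolated layer widths from input_dim down to
    output_dim over `depth` layers (one hypothetical 'pyramid' rule)."""
    ratio = (output_dim / input_dim) ** (1.0 / depth)
    widths = [max(1, round(input_dim * ratio ** i)) for i in range(depth + 1)]
    widths[-1] = output_dim  # pin the final layer exactly
    return widths
```

Fixing k = 15 and the grid number at 5, as the abstract suggests, would then leave only this width schedule to choose, which is what collapses the hyperparameter search.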

[LG-96] Relational Graph Convolutional Networks Do Not Learn Sound Rules KR2024

链接: https://arxiv.org/abs/2408.10261
作者: Matthew Morris,David J. Tena Cucala,Bernardo Cuenca Grau,Ian Horrocks
关键词-EN: Graph neural networks, Graph neural, knowledge graphs, predict missing facts, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Full version (with appendices) of paper accepted to KR 2024 (21st International Conference on Principles of Knowledge Representation and Reasoning)

点击查看摘要

Abstract:Graph neural networks (GNNs) are frequently used to predict missing facts in knowledge graphs (KGs). Motivated by the lack of explainability for the outputs of these models, recent work has aimed to explain their predictions using Datalog, a widely used logic-based formalism. However, such work has been restricted to certain subclasses of GNNs. In this paper, we consider one of the most popular GNN architectures for KGs, R-GCN, and we provide two methods to extract rules that explain its predictions and are sound, in the sense that each fact derived by the rules is also predicted by the GNN, for any input dataset. Furthermore, we provide a method that can verify that certain classes of Datalog rules are not sound for the R-GCN. In our experiments, we train R-GCNs on KG completion benchmarks, and we are able to verify that no Datalog rule is sound for these models, even though the models often obtain high to near-perfect accuracy. This raises some concerns about the ability of R-GCN models to generalise and about the explainability of their predictions. We further provide two variations to the training paradigm of R-GCN that encourage it to learn sound rules and find a trade-off between model accuracy and the number of learned sound rules.

[LG-97] Contrastive Learning on Medical Intents for Sequential Prescription Recommendation CIKM2024

链接: https://arxiv.org/abs/2408.10259
作者: Arya Hadizadeh Moghaddam,Mohsen Nayebi Kerdabadi,Mei Liu,Zijun Yao
关键词-EN: Electronic Health Records, applied to Electronic, sequential modeling applied, prescription recommender systems, greatly influenced prescription
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024)

点击查看摘要

Abstract:Recent advancements in sequential modeling applied to Electronic Health Records (EHR) have greatly influenced prescription recommender systems. While the recent literature on drug recommendation has shown promising performance, the study of discovering a diversity of coexisting temporal relationships at the level of medical codes over consecutive visits remains less explored. The goal of this study can be motivated from two perspectives. First, there is a need to develop a sophisticated sequential model capable of disentangling the complex relationships across sequential visits. Second, it is crucial to establish multiple and diverse health profiles for the same patient to ensure a comprehensive consideration of different medical intents in drug recommendation. To achieve this goal, we introduce Attentive Recommendation with Contrasted Intents (ARCI), a multi-level transformer-based method designed to capture the different but coexisting temporal paths across a shared sequence of visits. Specifically, we propose a novel intent-aware method with contrastive learning, that links specialized medical intents of the patients to the transformer heads for extracting distinct temporal paths associated with different health profiles. We conducted experiments on two real-world datasets for the prescription recommendation task using both ranking and classification metrics. Our results demonstrate that ARCI has outperformed the state-of-the-art prescription recommendation methods and is capable of providing interpretable insights for healthcare practitioners.

[LG-98] NeRF-US: Removing Ultrasound Imaging Artifacts from Neural Radiance Fields in the Wild

链接: https://arxiv.org/abs/2408.10258
作者: Rishit Dagli,Atsuhiro Hibi,Rahul G. Krishnan,Pascal N. Tyrrell
关键词-EN: face severe artifacts, view synthesis, face severe, current approaches differ, training NeRF-based approaches
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current methods for performing 3D reconstruction and novel view synthesis (NVS) in ultrasound imaging data often face severe artifacts when training NeRF-based approaches. The artifacts produced by current approaches differ from NeRF floaters in general scenes because of the unique nature of ultrasound capture. Furthermore, existing models fail to produce reasonable 3D reconstructions when ultrasound data is captured or obtained casually in uncontrolled environments, which is common in clinical settings. Consequently, existing reconstruction and NVS methods struggle to handle ultrasound motion, fail to capture intricate details, and cannot model transparent and reflective surfaces. In this work, we introduced NeRF-US, which incorporates 3D-geometry guidance for border probability and scattering density into NeRF training, while also utilizing ultrasound-specific rendering over traditional volume rendering. These 3D priors are learned through a diffusion model. Through experiments conducted on our new “Ultrasound in the Wild” dataset, we observed accurate, clinically plausible, artifact-free reconstructions.

[LG-99] A Conceptual Framework for Ethical Evaluation of Machine Learning Systems

链接: https://arxiv.org/abs/2408.10239
作者: Neha R. Gupta,Jessica Hullman,Hari Subramonyam
关键词-EN: Research in Responsible, machine learning systems, developed a range, range of principles, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the ethical implications that appear when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests to ensure downstream product safety, with potential fairness harms inherent to the implemented testing procedures. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework is then a tool for characterizing challenges teams face, and systematically disentangling competing considerations that teams seek to balance. Differentiating between different types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration to improve evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.

[LG-100] Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications

链接: https://arxiv.org/abs/2408.10215
作者: Sinan Ibrahim,Mostafa Mostafa,Ali Jnadi,Pavel Osinenko
关键词-EN: Reinforcement Learning, reinforcement learning algorithms, create systems capable, making autonomous decisions, Learning
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 22 pages, 25 figures, we are waiting for decision from IEEE Access

点击查看摘要

Abstract:The aim of Reinforcement Learning (RL) in real-world applications is to create systems capable of making autonomous decisions by learning from their environment through trial and error. This paper emphasizes the importance of reward engineering and reward shaping in enhancing the efficiency and effectiveness of reinforcement learning algorithms. Reward engineering involves designing reward functions that accurately reflect the desired outcomes, while reward shaping provides additional feedback to guide the learning process, accelerating convergence to optimal policies. Despite significant advancements in reinforcement learning, several limitations persist. One key challenge is the sparse and delayed nature of rewards in many real-world scenarios, which can hinder learning progress. Additionally, the complexity of accurately modeling real-world environments and the computational demands of reinforcement learning algorithms remain substantial obstacles. On the other hand, recent advancements in deep learning and neural networks have significantly improved the capability of reinforcement learning systems to handle high-dimensional state and action spaces, enabling their application to complex tasks such as robotics, autonomous driving, and game playing. This paper provides a comprehensive review of the current state of reinforcement learning, focusing on the methodologies and techniques used in reward engineering and reward shaping. It critically analyzes the limitations and recent advancements in the field, offering insights into future research directions and potential applications in various domains.
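A standard reward-shaping technique any such review covers is potential-based shaping (Ng et al., 1999), which adds gamma * phi(s') - phi(s) to the environment reward and provably leaves optimal policies unchanged. A minimal sketch:

```python
def shaped_reward(reward, phi_s, phi_s_next, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s),
    where phi is a potential function over states (e.g. negative distance
    to the goal). Preserves the optimal policy of the original MDP."""
    return reward + gamma * phi_s_next - phi_s
```

With phi chosen as progress toward the goal, transitions that move closer receive a dense positive bonus, which addresses the sparse-and-delayed-reward problem the abstract highlights.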

[LG-101] A Survey on Symbolic Knowledge Distillation of Large Language Models

链接: https://arxiv.org/abs/2408.10210
作者: Kamal Acharya,Alvaro Velasquez,Houbing Herbert Song
关键词-EN: Large Language Models, Large Language, Bidirectional Encoder Representations, survey paper delves, symbolic knowledge distillation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:This survey paper delves into the emerging and critical area of symbolic knowledge distillation in Large Language Models (LLMs). As LLMs like Generative Pre-trained Transformer-3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT) continue to expand in scale and complexity, the challenge of effectively harnessing their extensive knowledge becomes paramount. This survey concentrates on the process of distilling the intricate, often implicit knowledge contained within these models into a more symbolic, explicit form. This transformation is crucial for enhancing the interpretability, efficiency, and applicability of LLMs. We categorize the existing research based on methodologies and applications, focusing on how symbolic knowledge distillation can be used to improve the transparency and functionality of smaller, more efficient Artificial Intelligence (AI) models. The survey discusses the core challenges, including maintaining the depth of knowledge in a comprehensible format, and explores the various approaches and techniques that have been developed in this field. We identify gaps in current research and potential opportunities for future advancements. This survey aims to provide a comprehensive overview of symbolic knowledge distillation in LLMs, spotlighting its significance in the progression towards more accessible and efficient AI systems.

[LG-102] In-Context Learning with Representations: Contextual Generalization of Trained Transformers

链接: https://arxiv.org/abs/2408.10147
作者: Tong Yang,Yu Huang,Yingbin Liang,Yuejie Chi
关键词-EN: pretrained large language, large language models, remarkable capability, capability of pretrained, pretrained large
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In-context learning (ICL) refers to a remarkable capability of pretrained large language models, which can learn a new task given a few examples during inference. However, theoretical understanding of ICL is largely under-explored, particularly whether transformers can be trained to generalize to unseen examples in a prompt, which will require the model to acquire contextual knowledge of the prompt for generalization. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks. The contextual generalization here can be attained via learning the template function for each task in-context, where all template functions lie in a linear space with m basis functions. We analyze the training dynamics of one-layer multi-head transformers trained to predict unlabeled inputs in-context given partially labeled prompts, where the labels contain Gaussian noise and the number of examples in each prompt is not sufficient to determine the template. Under mild assumptions, we show that the training loss for a one-layer multi-head transformer converges linearly to a global minimum. Moreover, the transformer effectively learns to perform ridge regression over the basis functions. To our knowledge, this study is the first provable demonstration that transformers can learn contextual (i.e., template) information to generalize to both unseen examples and tasks when prompts contain only a small number of query-answer pairs.
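
The ridge-regression-over-basis-functions estimator that the paper shows a trained transformer effectively implements has a simple closed form. The sketch below uses an illustrative polynomial basis and synthetic noisy labels (both assumptions for the demo, not the paper's setup):

```python
import numpy as np

# Hedged sketch: ridge regression over m basis functions, the estimator a
# trained one-layer transformer is shown to implement in-context.
def ridge_over_basis(x_train, y_train, x_query, basis, lam=0.1):
    Phi = np.stack([b(x_train) for b in basis], axis=1)   # (n, m) design matrix
    m = Phi.shape[1]
    # Closed-form ridge solution: (Phi^T Phi + lam*I)^{-1} Phi^T y
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y_train)
    Phi_q = np.stack([b(x_query) for b in basis], axis=1)
    return Phi_q @ w

basis = [np.ones_like, lambda x: x, lambda x: x**2]       # m = 3 template basis
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 32)
y = 2.0 + 3.0 * x + rng.normal(0, 0.01, 32)               # noisy in-context labels
pred = ridge_over_basis(x, y, np.array([0.5]), basis, lam=1e-3)
assert abs(pred[0] - 3.5) < 0.1                           # close to 2 + 3*0.5
```

The "template" 2 + 3x lies in the span of the basis, so the estimator recovers it from few noisy examples, mirroring the contextual generalization the paper analyzes.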

[LG-103] Neural Horizon Model Predictive Control – Increasing Computational Efficiency with Neural Networks

链接: https://arxiv.org/abs/2408.09781
作者: Hendrik Alsmeier,Anton Savchenko,Rolf Findeisen
关键词-EN: low-power edge devices, edge devices poses, based control algorithms, increasingly fast applications, model predictive control
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 4 tables, American Control Conference (ACC) 2024

点击查看摘要

Abstract:The expansion in automation of increasingly fast applications and low-power edge devices poses a particular challenge for optimization based control algorithms, like model predictive control. Our proposed machine-learning supported approach addresses this by utilizing a feed-forward neural network to reduce the computation load of the online-optimization. We propose approximating part of the problem horizon, while maintaining safety guarantees – constraint satisfaction – via the remaining optimization part of the controller. The approach is validated in simulation, demonstrating an improvement in computational efficiency, while maintaining guarantees and near-optimal performance. The proposed MPC scheme can be applied to a wide range of applications, including those requiring a rapid control response, such as robotics and embedded applications with limited computational resources.

[LG-104] FedST: Secure Federated Shapelet Transformation for Time Series Classification

链接: https://arxiv.org/abs/2302.10631
作者: Zhiyu Liang,Hongzhi Wang
关键词-EN: time series classification, shapelet-based time series, federated TSC framework, series classification, paper explores
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:This paper explores how to build a shapelet-based time series classification (TSC) model in the federated learning (FL) scenario, that is, using more data from multiple owners without actually sharing the data. We propose FedST, a novel federated TSC framework extended from a centralized shapelet transformation method. We recognize the federated shapelet search step as the kernel of FedST. Thus, we design a basic protocol for the FedST kernel that we prove to be secure and accurate. However, we identify that the basic protocol suffers from efficiency bottlenecks and the centralized acceleration techniques lose their efficacy due to the security issues. To speed up the federated protocol with security guarantees, we propose several optimizations tailored for the FL setting. Our theoretical analysis shows that the proposed methods are secure and more efficient. We conduct extensive experiments using both synthetic and real-world datasets. Empirical results show that our FedST solution is effective in terms of TSC accuracy, and the proposed optimizations achieve up to three orders of magnitude of speedup.

[LG-105] An Overlooked Role of Context-Sensitive Dendrites

链接: https://arxiv.org/abs/2408.11019
作者: Mohsin Raza,Ahsan Adeel
关键词-EN: higher perceptual layers, pyramidal two-point neurons, predominantly focused, zone of pyramidal, pyramidal two-point
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To date, most dendritic studies have predominantly focused on the apical zone of pyramidal two-point neurons (TPNs) receiving only feedback (FB) connections from higher perceptual layers and using them for learning. Recent cellular neurophysiology and computational neuroscience studies suggest that the apical input (context), coming from feedback and lateral connections, is multifaceted and far more diverse, with greater implications for ongoing learning and processing in the brain than previously realized. In addition to the FB, the apical tuft receives signals from neighboring cells of the same network as proximal (P) context, other parts of the brain as distal (D) context, and overall coherent information across the network as universal (U) context. The integrated context (C) amplifies and suppresses the transmission of coherent and conflicting feedforward (FF) signals, respectively. Specifically, we show that complex context-sensitive (CS)-TPNs flexibly integrate C moment-by-moment with the FF somatic current at the soma such that the somatic current is amplified when both feedforward (FF) and C are coherent; otherwise, it is attenuated. This generates the event only when the FF and C currents are coherent, which is then translated into a singlet or a burst based on the FB information. Spiking simulation results show that this flexible integration of somatic and contextual currents enables the propagation of more coherent signals (bursts), making learning faster with fewer neurons. Similar behavior is observed when this functioning is used in conventional artificial networks, where orders of magnitude fewer neurons are required to process vast amounts of heterogeneous real-world audio-visual (AV) data trained using backpropagation (BP). The computational findings presented here demonstrate the universality of CS-TPNs, suggesting a dendritic narrative that was previously overlooked.

[LG-106] Approximation Rates for Shallow ReLUk Neural Networks on Sobolev Spaces via the Radon Transform

链接: https://arxiv.org/abs/2408.10996
作者: Tong Mao,Jonathan W. Siegel,Jinchao Xu
关键词-EN: Omega, bounded domain, optimal approximation rates, Sobolev spaces, Abstract
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Let \Omega \subset \mathbb{R}^d be a bounded domain. We consider the problem of how efficiently shallow neural networks with the ReLU^k activation function can approximate functions from Sobolev spaces W^s(L_p(\Omega)) with error measured in the L_q(\Omega)-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when q \leq p, p \geq 2, and s \leq k + (d+1)/2. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU^k neural networks enables them to obtain optimal approximation rates for smoothness up to order s = k + (d+1)/2, even though they represent piecewise polynomials of fixed degree k.

[LG-107] Kernel-Based Differentiable Learning of Non-Parametric Directed Acyclic Graphical Models

链接: https://arxiv.org/abs/2408.10976
作者: Yurou Liang,Oleksandr Zadorozhnyi,Mathias Drton
关键词-EN: directed acyclic graph, Causal discovery amounts, amounts to learning, learning a directed, directed acyclic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: To be published in the Proceedings of Probabilistic Graphical Models (PGM) 2024

点击查看摘要

Abstract:Causal discovery amounts to learning a directed acyclic graph (DAG) that encodes a causal model. This model selection problem can be challenging due to its large combinatorial search space, particularly when dealing with non-parametric causal models. Recent research has sought to bypass the combinatorial search by reformulating causal discovery as a continuous optimization problem, employing constraints that ensure the acyclicity of the graph. In non-parametric settings, existing approaches typically rely on finite-dimensional approximations of the relationships between nodes, resulting in a score-based continuous optimization problem with a smooth acyclicity constraint. In this work, we develop an alternative approximation method by utilizing reproducing kernel Hilbert spaces (RKHS) and applying general sparsity-inducing regularization terms based on partial derivatives. Within this framework, we introduce an extended RKHS representer theorem. To enforce acyclicity, we advocate the log-determinant formulation of the acyclicity constraint and show its stability. Finally, we assess the performance of our proposed RKHS-DAGMA procedure through simulations and illustrative data analyses.
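
The log-determinant acyclicity formulation the abstract advocates can be made concrete. One common instance is the DAGMA characterization h(W) = -log det(sI - W∘W) + d·log(s), which is zero exactly when W is the weighted adjacency matrix of a DAG (for s above the spectral radius of W∘W). The toy graphs below are illustrative, not from the paper:

```python
import numpy as np

# Hedged sketch of the log-determinant acyclicity constraint (DAGMA-style):
# h(W) = -log det(s*I - W∘W) + d*log(s) vanishes iff W encodes a DAG.
def logdet_acyclicity(W, s=1.0):
    d = W.shape[0]
    M = s * np.eye(d) - W * W          # Hadamard square keeps the penalty nonnegative
    sign, logabsdet = np.linalg.slogdet(M)
    assert sign > 0, "s must dominate the spectral radius of W∘W"
    return -logabsdet + d * np.log(s)

dag = np.array([[0.0, 0.5], [0.0, 0.0]])    # acyclic: node 1 -> node 2 only
cyc = np.array([[0.0, 0.5], [0.5, 0.0]])    # a 2-cycle

assert abs(logdet_acyclicity(dag)) < 1e-12  # DAGs give h(W) = 0
assert logdet_acyclicity(cyc) > 0           # cycles are penalized
```

Because h is differentiable in W, it can be added to a score-based objective and optimized with gradient methods, which is what makes continuous causal discovery formulations like the paper's tractable.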

[LG-108] Kilometer-Scale Convection Allowing Model Emulation using Generative Diffusion Modeling

链接: https://arxiv.org/abs/2408.10958
作者: Jaideep Pathak,Yair Cohen,Piyush Garg,Peter Harrington,Noah Brenowitz,Dale Durran,Morteza Mardani,Arash Vahdat,Shaoming Xu,Karthik Kashinath,Michael Pritchard
关键词-EN: Storm-scale convection-allowing models, Storm-scale convection-allowing, mesoscale convective systems, damaging extreme weather, important tool
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Storm-scale convection-allowing models (CAMs) are an important tool for predicting the evolution of thunderstorms and mesoscale convective systems that result in damaging extreme weather. By explicitly resolving convective dynamics within the atmosphere they afford meteorologists the nuance needed to provide outlook on hazard. Deep learning models have thus far not proven skilful at km-scale atmospheric simulation, despite being competitive at coarser resolution with state-of-the-art global, medium-range weather forecasting. We present a generative diffusion model called StormCast, which emulates the high-resolution rapid refresh (HRRR) model, NOAA's state-of-the-art 3 km operational CAM. StormCast autoregressively predicts 99 state variables at km scale using a 1-hour time step, with dense vertical resolution in the atmospheric boundary layer, conditioned on 26 synoptic variables. We present evidence of successfully learnt km-scale dynamics including competitive 1-6 hour forecast skill for composite radar reflectivity alongside physically realistic convective cluster evolution, moist updrafts, and cold pool morphology. StormCast predictions maintain realistic power spectra for multiple predicted variables across multi-hour forecasts. Together, these results establish the potential for autoregressive ML to emulate CAMs – opening up new km-scale frontiers for regional ML weather prediction and future climate hazard dynamical downscaling.

[LG-109] More Options for Prelabor Rupture of Membranes: A Bayesian Analysis

链接: https://arxiv.org/abs/2408.10876
作者: Ashley Klein,Edward Raff,Elisabeth Seamon,Lily Foley,Timothy Bussert
关键词-EN: major abdominal surgery, Cesarean section, abdominal surgery, obstetric goal, laboring mother
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: To appear in the 2024 IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA)

点击查看摘要

Abstract:An obstetric goal for a laboring mother is to achieve a vaginal delivery as it reduces the risks inherent in major abdominal surgery (i.e., a Cesarean section). Various medical interventions may be used by a physician to increase the likelihood of this occurring while minimizing maternal and fetal morbidity. However, patients with prelabor rupture of membranes (PROM) have only two commonly used options for cervical ripening, Pitocin and misoprostol. Little research exists on the benefits/risks for these two key drugs for PROM patients. A major limitation with most induction-of-labor related research is the inability to account for differences in Bishop scores that are commonly used in obstetrical practice to determine the next induction agent offered to the patient. This creates a confounding factor, which biases the results, but has not been realized in the literature. In this work, we use a Bayesian model of the relationships between the relevant factors, informed by expert physicians, to separate the confounding variable from its actual impact. In doing so, we provide strong evidence that pitocin and buccal misoprostol are equally effective and safe; thus, physicians have more choice in clinical care than previously realized. This is particularly important for developing countries where neither medication may be readily available, and prior guidelines may create an artificial barrier to needed medication.

[LG-110] Radio U-Net: a convolutional neural network to detect diffuse radio sources in galaxy clusters and beyond

链接: https://arxiv.org/abs/2408.10871
作者: Chiara Stuardi,Claudio Gheller,Franco Vazza,Andrea Botteon
关键词-EN: telescope arrays promises, arrays promises significant, radio telescope arrays, radio, promises significant advancements
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by MNRAS, 16 pages, 9 figures, 2 tables

点击查看摘要

Abstract:The forthcoming generation of radio telescope arrays promises significant advancements in sensitivity and resolution, enabling the identification and characterization of many new faint and diffuse radio sources. Conventional manual cataloging methodologies are anticipated to be insufficient to exploit the capabilities of new radio surveys. Radio interferometric images of diffuse sources present a challenge for image segmentation tasks due to noise, artifacts, and embedded radio sources. In response to these challenges, we introduce Radio U-Net, a fully convolutional neural network based on the U-Net architecture. Radio U-Net is designed to detect faint and extended sources in radio surveys, such as radio halos, relics, and cosmic web filaments. Radio U-Net was trained on synthetic radio observations built upon cosmological simulations and then tested on a sample of galaxy clusters, where the detection of cluster diffuse radio sources relied on customized data reduction and visual inspection of LOFAR Two Metre Sky Survey (LoTSS) data. The 83% of clusters exhibiting diffuse radio emission were accurately identified, and the segmentation successfully recovered the morphology of the sources even in low-quality images. In a test sample comprising 246 galaxy clusters, we achieved a 73% accuracy rate in distinguishing between clusters with and without diffuse radio emission. Our results establish the applicability of Radio U-Net to extensive radio survey datasets, probing its efficiency on cutting-edge high-performance computing systems. This approach represents an advancement in optimizing the exploitation of forthcoming large radio surveys for scientific exploration.

[LG-111] Deep Learning-based Classification of Dementia using Image Representation of Subcortical Signals

链接: https://arxiv.org/abs/2408.10816
作者: Shivani Ranjan,Ayush Tripathi,Harshal Shende,Robin Badal,Amit Kumar,Pramod Yadav,Deepak Joshi,Lalan Kumar
关键词-EN: neurological syndrome marked, neurological syndrome, syndrome marked, Frontotemporal dementia, cognitive decline
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dementia is a neurological syndrome marked by cognitive decline. Alzheimer’s disease (AD) and Frontotemporal dementia (FTD) are the common forms of dementia, each with distinct progression patterns. EEG, a non-invasive tool for recording brain activity, has shown potential in distinguishing AD from FTD and mild cognitive impairment (MCI). Previous studies have utilized various EEG features, such as subband power and connectivity patterns to differentiate these conditions. However, artifacts in EEG signals can obscure crucial information, necessitating advanced signal processing techniques. This study aims to develop a deep learning-based classification system for dementia by analyzing scout time-series signals from deep brain regions, specifically the hippocampus, amygdala, and thalamus. The study utilizes scout time series extracted via the standardized low-resolution brain electromagnetic tomography (sLORETA) technique. The time series is converted to image representations using continuous wavelet transform (CWT) and fed as input to deep learning models. Two high-density EEG datasets are utilized to check for the efficacy of the proposed method: the online BrainLat dataset (comprising AD, FTD, and healthy controls (HC)) and the in-house IITD-AIIA dataset (including subjects with AD, MCI, and HC). Different classification strategies and classifier combinations have been utilized for the accurate mapping of classes on both datasets. The best results were achieved by using a product of probabilities from classifiers for left and right subcortical regions in conjunction with the DenseNet model architecture. It yields accuracies of 94.17 % and 77.72 % on the BrainLat and IITD-AIIA datasets, respectively. This highlights the potential of this approach for early and accurate differentiation of neurodegenerative disorders.
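
The CWT step above, turning a scout time series into a 2-D scalogram that an image classifier such as DenseNet can consume, can be sketched directly. A real-valued Morlet wavelet and a synthetic 10 Hz oscillation are used here for illustration; the paper's exact wavelet and preprocessing may differ:

```python
import numpy as np

# Hedged sketch: convert a 1-D time series into a scales-x-time scalogram
# "image" via a continuous wavelet transform with a real Morlet wavelet.
def morlet_cwt(signal, scales, w0=6.0):
    n = len(signal)
    t = np.arange(n)
    coeffs = np.zeros((len(scales), n))
    for i, s in enumerate(scales):
        # Real Morlet mother wavelet, centered, dilated by scale s
        u = (t - n // 2) / s
        wavelet = np.exp(-0.5 * u**2) * np.cos(w0 * u) / np.sqrt(s)
        coeffs[i] = np.convolve(signal, wavelet, mode="same")
    return np.abs(coeffs)              # magnitude scalogram, image-like

fs = 128                                # assumed sampling rate (Hz)
t = np.arange(4 * fs) / fs
eeg_like = np.sin(2 * np.pi * 10 * t)   # synthetic 10 Hz "scout" oscillation
scalogram = morlet_cwt(eeg_like, scales=np.arange(1, 33))
assert scalogram.shape == (32, 512)     # scales x time, ready for a CNN
```

Each row of the scalogram corresponds to one scale (inverse frequency), so oscillatory structure in the subcortical signal appears as bright horizontal bands that a 2-D CNN can learn from.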

[LG-112] SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS

链接: https://arxiv.org/abs/2408.10771
作者: Karl El Hajal,Ajinkya Kulkarni,Enno Hermann,Mathew Magimai.-Doss
关键词-EN: achieve impressive results, recent zero-shot multispeaker, intricate training pipelines, impressive results, typically rely
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Submitted to IEEE Signal Processing Letters

点击查看摘要

Abstract:While recent zero-shot multispeaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. It was also observed that SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity, which enables straight-forward and robust voice cloning. In this study, we introduce SSL-TTS, a lightweight and efficient zero-shot TTS framework trained on transcribed speech from a single speaker. SSL-TTS leverages SSL features and retrieval methods for simple and robust zero-shot multi-speaker synthesis. Objective and subjective evaluations show that our approach achieves performance comparable to state-of-the-art models that require significantly larger training datasets. The low training data requirements mean that SSL-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine control over the output speech by blending voices. Demo samples are available at this https URL
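
The kNN-retrieval-plus-interpolation idea can be sketched compactly: each source SSL frame is replaced by the average of its k nearest frames in a target speaker's feature database, then blended back with the original via an interpolation weight. The function name, the averaging rule, and the random features are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

# Hedged sketch of kNN feature retrieval with an interpolation (blending)
# parameter: lam = 1 gives full conversion to the target voice, lam = 0
# returns the source features unchanged.
def knn_blend(source_feats, target_db, k=4, lam=1.0):
    out = np.empty_like(source_feats)
    for i, f in enumerate(source_feats):
        d = np.linalg.norm(target_db - f, axis=1)        # L2 distances to database
        nn = target_db[np.argsort(d)[:k]].mean(axis=0)   # average of k nearest frames
        out[i] = lam * nn + (1.0 - lam) * f              # blend source and target
    return out

rng = np.random.default_rng(0)
src = rng.normal(size=(10, 16))     # 10 source frames, 16-dim SSL features
db = rng.normal(size=(200, 16))     # target speaker feature database
converted = knn_blend(src, db, k=4, lam=0.5)
assert converted.shape == src.shape
assert np.allclose(knn_blend(src, db, lam=0.0), src)    # lam = 0 is a no-op
```

Because linearly close SSL features share phonetic content (as the abstract notes), swapping frames for their nearest target-speaker neighbors preserves what is said while changing who appears to say it.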

[LG-113] End-to-end learned Lossy Dynamic Point Cloud Attribute Compression ICIP

链接: https://arxiv.org/abs/2408.10665
作者: Dat Thanh Nguyen,Daniel Zieger,Marc Stamminger,Andre Kaup
关键词-EN: comparatively fewer efforts, Recent advancements, primarily emphasized geometry, point cloud compression, point cloud
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 6 pages, accepted for presentation at 2024 IEEE International Conference on Image Processing (ICIP) 2024

点击查看摘要

Abstract:Recent advancements in point cloud compression have primarily emphasized geometry compression while comparatively fewer efforts have been dedicated to attribute compression. This study introduces an end-to-end learned dynamic lossy attribute coding approach, utilizing an efficient high-dimensional convolution to capture extensive inter-point dependencies. This enables the efficient projection of attribute features into latent variables. Subsequently, we employ a context model that leverages the previous latent space in conjunction with an auto-regressive context model for encoding the latent tensor into a bitstream. Evaluation of our method on widely utilized point cloud datasets from MPEG and Microsoft demonstrates its superior performance compared to the core attribute compression module, the Region-Adaptive Hierarchical Transform method from MPEG Geometry-based Point Cloud Compression, with a 38.1% Bjontegaard Delta-rate saving on average while ensuring low-complexity encoding/decoding.

[LG-114] Prompt Your Brain: Scaffold Prompt Tuning for Efficient Adaptation of fMRI Pre-trained Model MICCAI2024

链接: https://arxiv.org/abs/2408.10567
作者: Zijian Dong,Yilei Wu,Zijiao Chen,Yichi Zhang,Yueming Jin,Juan Helen Zhou
关键词-EN: magnetic resonance imaging, introduce Scaffold Prompt, large-scale functional magnetic, functional magnetic resonance, improved performance compared
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024

点击查看摘要

Abstract:We introduce Scaffold Prompt Tuning (ScaPT), a novel prompt-based framework for adapting large-scale functional magnetic resonance imaging (fMRI) pre-trained models to downstream tasks, with high parameter efficiency and improved performance compared to fine-tuning and baselines for prompt tuning. The full fine-tuning updates all pre-trained parameters, which may distort the learned feature space and lead to overfitting with limited training data which is common in fMRI fields. In contrast, we design a hierarchical prompt structure that transfers the knowledge learned from high-resource tasks to low-resource ones. This structure, equipped with a Deeply-conditioned Input-Prompt (DIP) mapping module, allows for efficient adaptation by updating only 2% of the trainable parameters. The framework enhances semantic interpretability through attention mechanisms between inputs and prompts, and it clusters prompts in the latent space in alignment with prior knowledge. Experiments on public resting state fMRI datasets reveal ScaPT outperforms fine-tuning and multitask-based prompt tuning in neurodegenerative diseases diagnosis/prognosis and personality trait prediction, even with fewer than 20 participants. It highlights ScaPT’s efficiency in adapting pre-trained fMRI models to low-resource tasks.

[LG-115] Asymptotic Classification Error for Heavy-Tailed Renewal Processes

链接: https://arxiv.org/abs/2408.10502
作者: Xinhui Rong,Victor Solo
关键词-EN: point process data, point process classification, point process, emerged very recently, process data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:Despite the widespread occurrence of classification problems and the increasing collection of point process data across many disciplines, study of error probability for point process classification only emerged very recently. Here, we consider classification of renewal processes. We obtain asymptotic expressions for the Bhattacharyya bound on misclassification error probabilities for heavy-tailed renewal processes.
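
The Bhattacharyya bound the paper analyzes asymptotically has a simple finite form: for class densities p and q with priors π₀, π₁, the Bayes misclassification error satisfies P_e ≤ √(π₀π₁) · BC, where BC = Σ_x √(p(x)q(x)) is the Bhattacharyya coefficient. The discrete toy distributions below stand in for the renewal-process densities the paper studies:

```python
import numpy as np

# Hedged sketch of the Bhattacharyya bound on binary misclassification error.
def bhattacharyya_bound(p, q, pi0=0.5, pi1=0.5):
    bc = np.sum(np.sqrt(p * q))         # Bhattacharyya coefficient, in [0, 1]
    return np.sqrt(pi0 * pi1) * bc

# Two discrete inter-event-time distributions (toy stand-ins)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
bound = bhattacharyya_bound(p, q)
assert 0.0 <= bound <= 0.5              # never exceeds sqrt(pi0*pi1)
# Identical distributions give the loosest bound, sqrt(pi0*pi1) = 0.5
assert np.isclose(bhattacharyya_bound(p, p), 0.5)
```

The more the two distributions overlap, the closer BC is to 1 and the weaker the bound, which is exactly the quantity whose heavy-tailed asymptotics the paper derives.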

[LG-116] Efficient Reinforcement Learning in Probabilistic Reward Machines

链接: https://arxiv.org/abs/2408.10381
作者: Xiaofeng Lin,Xuezhou Zhang
关键词-EN: Markov Decision Processes, Probabilistic Reward Machines, Markov Decision, Decision Processes, Processes with Probabilistic
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 33 pages, 4 figures

点击查看摘要

Abstract:In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T}), where H is the time horizon, O is the number of observations, A is the number of actions, and T is the number of time-steps. This result improves over the best-known bound, \widetilde{O}(H\sqrt{OAT}), of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When T \geq H^3O^3A^2 and OA \geq H, our regret bound leads to a regret of \widetilde{O}(\sqrt{HOAT}), which matches the established lower bound of \Omega(\sqrt{HOAT}) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experiment evaluations that our algorithm indeed outperforms prior methods in various PRM environments.

[LG-117] Can an unsupervised clustering algorithm reproduce a categorization system?

链接: https://arxiv.org/abs/2408.10340
作者: Nathalia Castellanos,Dhruv Desai,Sebastian Frank,Stefano Pasquali,Dhagash Mehta
关键词-EN: Peer analysis, expert-provided categorization systems, ground truth classes, ground truth, investment management
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP)
*备注: 9 pages, 4 tables 28 figures

点击查看摘要

Abstract:Peer analysis is a critical component of investment management, often relying on expert-provided categorization systems. These systems’ consistency is questioned when they do not align with cohorts from unsupervised clustering algorithms optimized for various metrics. We investigate whether unsupervised clustering can reproduce ground truth classes in a labeled dataset, showing that success depends on feature selection and the chosen distance metric. Using toy datasets and fund categorization as real-world examples we demonstrate that accurately reproducing ground truth classes is challenging. We also highlight the limitations of standard clustering evaluation metrics in identifying the optimal number of clusters relative to the ground truth classes. We then show that if appropriate features are available in the dataset, and a proper distance metric is known (e.g., using a supervised Random Forest-based distance metric learning method), then an unsupervised clustering can indeed reproduce the ground truth classes as distinct clusters.
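
The supervised distance-metric idea in the abstract can be sketched end-to-end: train a Random Forest on the labels, define proximity between two points as the fraction of trees in which they land in the same leaf, and cluster on 1 − proximity. The toy blobs, linkage choice, and thresholds below are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hedged sketch: Random-Forest proximity as a supervised distance metric,
# then unsupervised hierarchical clustering on that learned distance.
X, y = make_blobs(n_samples=90, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves = rf.apply(X)                          # (n_samples, n_trees) leaf indices
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dist = 1.0 - prox                             # RF-induced distance matrix
np.fill_diagonal(dist, 0.0)

# Average-linkage clustering on the learned distances
Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=3, criterion="maxclust")
# With well-separated toy blobs the ground-truth classes are recovered
assert adjusted_rand_score(y, clusters) > 0.9
```

This mirrors the paper's conclusion: with an appropriate (here, supervised) distance metric, unsupervised clustering can reproduce the ground-truth categorization, whereas a generic Euclidean metric on raw features often cannot.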

[LG-118] Benchmarking quantum machine learning kernel training for classification tasks

链接: https://arxiv.org/abs/2408.10274
作者: Diego Alvarez-Estevez
关键词-EN: Quantum-enhanced machine learning, rapidly evolving field, Quantum-enhanced machine, Quantum Kernel Estimation, enhance classical machine
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:Quantum-enhanced machine learning is a rapidly evolving field that aims to leverage the unique properties of quantum mechanics to enhance classical machine learning. However, the practical applicability of these methods remains an open question, particularly in the context of real-world datasets and the limitations of current quantum hardware. This work performs a benchmark study of Quantum Kernel Estimation (QKE) and Quantum Kernel Training (QKT) with a focus on classification tasks. Through a series of experiments, the versatility and generalization capabilities of two quantum feature mappings, namely ZZFeatureMap and CovariantFeatureMap, are analyzed in this context. Remarkably, these feature maps have been proposed in the literature under the conjecture of possible near-term quantum advantage and have shown promising performance in ad-hoc datasets. This study explores both artificial and established reference datasets and incorporates classical machine learning methods, specifically Support Vector Machines (SVMs) and logistic regression, as baseline comparisons. Experimental results indicate that quantum methods exhibit varying performance across different datasets. While they outperform classical methods in ad-hoc datasets, they frequently encounter difficulties in generalizing to unseen test data when dealing with reference classical datasets, even if achieving high classification accuracy on the training data. It is suggested that the choice of the feature mapping and the optimization of kernel parameters through QKT are critical for maximizing the effectiveness of quantum methods.
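
The QKE pattern the benchmark follows separates kernel estimation from classification: the quantum device only supplies the Gram matrix, and a classical SVM consumes it via a precomputed kernel. In the hedged sketch below an RBF kernel stands in for the quantum fidelity kernel; the dataset and gamma are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hedged sketch of the precomputed-kernel workflow used by QKE: an externally
# estimated Gram matrix (here classical RBF, standing in for a quantum kernel)
# is plugged into a standard SVM.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K_train = rbf_kernel(X_tr, X_tr, gamma=2.0)   # would come from the quantum device
K_test = rbf_kernel(X_te, X_tr, gamma=2.0)    # test rows vs. training columns

clf = SVC(kernel="precomputed").fit(K_train, y_tr)
acc = clf.score(K_test, y_te)
assert acc > 0.9                               # moons separate well under an RBF kernel
```

Swapping the `rbf_kernel` calls for fidelity estimates from a feature map like ZZFeatureMap is exactly the substitution QKE makes; QKT additionally tunes the feature-map parameters, which the study finds critical for generalization.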

[LG-119] Distributed and Secure Kernel-Based Quantum Machine Learning AAAI2025

链接: https://arxiv.org/abs/2408.10265
作者: Arjhun Swaminathan,Mete Akgün
关键词-EN: offering significant efficiency, significant efficiency gains, revolutionize machine learning, machine learning, offering significant
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: This paper contains 12 pages, 4 figures and 1 table. For associated supplementary code, see this https URL . The paper has been submitted to AAAI 2025

点击查看摘要

Abstract:Quantum computing promises to revolutionize machine learning, offering significant efficiency gains in tasks such as clustering and distance estimation. Additionally, it provides enhanced security through fundamental principles like the measurement postulate and the no-cloning theorem, enabling secure protocols such as quantum teleportation and quantum key distribution. While advancements in secure quantum machine learning are notable, the development of secure and distributed quantum analogues of kernel-based machine learning techniques remains underexplored. In this work, we present a novel approach for securely computing common kernels, including polynomial, radial basis function (RBF), and Laplacian kernels, when data is distributed, using quantum feature maps. Our methodology introduces a robust framework that leverages quantum teleportation to ensure secure and distributed kernel learning. The proposed architecture is validated using IBM’s Qiskit Aer Simulator on various public datasets.

[LG-120] Multi-Source EEG Emotion Recognition via Dynamic Contrastive Domain Adaptation

链接: https://arxiv.org/abs/2408.10235
作者: Yun Xiao,Yimeng Zhang,Xiaopeng Peng,Shuzheng Han,Xia Zheng,Dingyi Fang,Xiaojiang Chen
关键词-EN: EEG remains challenging, reliable indications, indications of human, human cognition, EEG remains
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) provides reliable indications of human cognition and mental states. Accurate emotion recognition from EEG remains challenging due to signal variations among individuals and across measurement sessions. To address these challenges, we introduce a multi-source dynamic contrastive domain adaptation method (MS-DCDA), which models coarse-grained inter-domain and fine-grained intra-class adaptations through a multi-branch contrastive neural network and contrastive sub-domain discrepancy learning. Our model leverages domain knowledge from each individual source and a complementary source ensemble and uses dynamically weighted learning to achieve an optimal tradeoff between domain transferability and discriminability. The proposed MS-DCDA model was evaluated using the SEED and SEED-IV datasets, achieving respectively the highest mean accuracies of 90.84% and 78.49% in cross-subject experiments as well as 95.82% and 82.25% in cross-session experiments. Our model outperforms several alternative domain adaptation methods in recognition accuracy, inter-class margin, and intra-class compactness. Our study also suggests greater emotional sensitivity in the frontal and parietal brain lobes, providing insights for mental health interventions, personalized medicine, and development of preventive strategies.
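The dynamically weighted tradeoff across source domains can be illustrated with a small sketch. The softmax-over-negative-discrepancy rule below is a plausible simplification of the paper's dynamic weighting, not its exact formulation, and all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-branch outputs: emotion-class probabilities for the
# target subject from three source-domain branches, plus each branch's
# current domain-discrepancy estimate (lower = more transferable).
branch_probs = rng.dirichlet(np.ones(4), size=3)  # 3 branches, 4 emotions
discrepancy = np.array([0.8, 0.3, 0.5])

# Dynamic weighting: softmax over negative discrepancy, so branches whose
# source domain sits closer to the target contribute more to the ensemble.
w = np.exp(-discrepancy)
w /= w.sum()

# Convex combination of branch predictions remains a valid distribution.
ensemble = w @ branch_probs
print(ensemble, ensemble.sum())
```

Recomputing `discrepancy` (and hence `w`) every epoch is what makes the weighting dynamic rather than fixed.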

[LG-121] ECG Unveiled: Analysis of Client Re-identification Risks in Real-World ECG Datasets

链接: https://arxiv.org/abs/2408.10228
作者: Ziyu Wang,Anil Kanduri,Seyed Amir Hossein Aqajari,Salar Jafarlou,Sanaz R. Mousavi,Pasi Liljeberg,Shaista Malik,Amir M. Rahmani
关键词-EN: monitoring heart conditions, unique biometric information, poses significant privacy, heart conditions, crucial for diagnosing
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While ECG data is crucial for diagnosing and monitoring heart conditions, it also contains unique biometric information that poses significant privacy risks. Existing ECG re-identification studies rely on exhaustive analysis of numerous deep learning features, offering only ad-hoc explainability for clinicians’ decision making. In this work, we delve into the explainability of ECG re-identification risks using transparent machine learning models. We use SHapley Additive exPlanations (SHAP) analysis to identify and explain the key features contributing to re-identification risks. We conduct an empirical analysis of identity re-identification risks using ECG data from five diverse real-world datasets, encompassing 223 participants. By employing transparent machine learning models, we reveal the diversity among different ECG features in contributing towards re-identification of individuals with an accuracy of 0.76 for gender, 0.67 for age group, and 0.82 for participant ID re-identification. Our approach provides valuable insights for clinical experts and guides the development of effective privacy-preserving mechanisms. Further, our findings emphasize the necessity for robust privacy measures in real-world health applications and offer detailed, actionable insights for enhancing data anonymization techniques.
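The paper's attribution method is SHAP; as a hedged, dependency-light illustration of the same idea (ranking features by their contribution to re-identification from a transparent model), the sketch below uses scikit-learn's permutation importance instead, on wholly synthetic stand-in features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in: rows are ECG beats, columns are morphological
# features (e.g. QRS width, RR interval); labels emulate participant IDs.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# A transparent model, matching the paper's emphasis on explainability.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Permutation importance: shuffle one feature at a time and measure the
# accuracy drop; large drops mark features that drive re-identification.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

Swapping in `shap.Explainer(model, X)` would reproduce the paper's actual SHAP analysis with the same model.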

信息检索

[IR-0] ColBERT Retrieval and Ensemble Response Scoring for Language Model Question Answering

链接: https://arxiv.org/abs/2408.10808
作者: Alex Gichamba,Tewodros Kederalah Idris,Brian Ebiyau,Eric Nyberg,Teruko Mitamura
关键词-EN: deep technical knowledge, technical knowledge required, Domain-specific question answering, answer questions correctly, answering remains challenging
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: This work has been submitted to the 2024 IEEE Globecom Workshops for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Domain-specific question answering remains challenging for language models, given the deep technical knowledge required to answer questions correctly. This difficulty is amplified for smaller language models that cannot encode as much information in their parameters as larger models. The “Specializing Large Language Models for Telecom Networks” challenge aimed to enhance the performance of two small language models, Phi-2 and Falcon-7B in telecommunication question answering. In this paper, we present our question answering systems for this challenge. Our solutions achieved leading marks of 81.9% accuracy for Phi-2 and 57.3% for Falcon-7B. We have publicly released our code and fine-tuned models.

[IR-1] Vector Symbolic Open Source Information Discovery

链接: https://arxiv.org/abs/2408.10734
作者: Cai Davies,Sam Meek,Philip Hawkins,Benomy Tutcher,Graham Bent,Alun Preece
关键词-EN: require rapid data, rapid data sharing, operations require rapid, inter-agency and multinational, require rapid
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Combined, joint, intra-governmental, inter-agency and multinational (CJIIM) operations require rapid data sharing without the bottlenecks of metadata curation and alignment. Curation and alignment is particularly infeasible for external open source information (OSINF), e.g., social media, which has become increasingly valuable in understanding unfolding situations. Large language models (transformers) facilitate semantic data and metadata alignment but are inefficient in CJIIM settings characterised as denied, degraded, intermittent and low bandwidth (DDIL). Vector symbolic architectures (VSA) support semantic information processing using highly compact binary vectors, typically 1-10k bits, suitable in a DDIL setting. We demonstrate a novel integration of transformer models with VSA, combining the power of the former for semantic matching with the compactness and representational structure of the latter. The approach is illustrated via a proof-of-concept OSINF data discovery portal that allows partners in a CJIIM operation to share data sources with minimal metadata curation and low communications bandwidth. This work was carried out as a bridge between previous low technology readiness level (TRL) research and future higher-TRL technology demonstration and deployment.
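The compactness claim is easy to demonstrate: VSA encodes structured records into single binary vectors via XOR binding and majority-vote bundling. A self-contained sketch (the roles and fillers are invented examples, not the paper's schema):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8192  # hypervector width, within the 1-10k bit range mentioned above

def rand_hv():
    """Random binary hypervector; unrelated concepts are ~50% similar."""
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):
    """XOR binding associates a role with a filler; it is its own inverse."""
    return a ^ b

def bundle(*vs):
    """Bitwise majority vote superimposes several vectors into one."""
    return (np.sum(vs, axis=0) > len(vs) / 2).astype(np.uint8)

def sim(a, b):
    """1 - normalised Hamming distance (1.0 = identical, ~0.5 = random)."""
    return 1.0 - np.mean(a != b)

# Hypothetical metadata record for a data source: three role-filler pairs.
TYPE, REGION, TIME = rand_hv(), rand_hv(), rand_hv()
imagery, north, today = rand_hv(), rand_hv(), rand_hv()
record = bundle(bind(TYPE, imagery), bind(REGION, north), bind(TIME, today))

# Unbinding the TYPE role recovers a vector close to `imagery` (~0.75
# similarity) while staying near-random (~0.5) against unrelated fillers.
probe = bind(record, TYPE)
print(round(sim(probe, imagery), 2), round(sim(probe, north), 2))
```

The whole record fits in D bits yet supports queryable structure, which is what makes the representation attractive under DDIL bandwidth constraints.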

[IR-2] Accelerating the Surrogate Retraining for Poisoning Attacks against Recommender Systems RECSYS2024

链接: https://arxiv.org/abs/2408.10666
作者: Yunfan Wu,Qi Cao,Shuchang Tao,Kaike Zhang,Fei Sun,Huawei Shen
关键词-EN: adversaries inject carefully, inject carefully crafted, carefully crafted fake, Recent studies, promote target items
类目: Information Retrieval (cs.IR)
*备注: Accepted by RecSys 2024

点击查看摘要

Abstract:Recent studies have demonstrated the vulnerability of recommender systems to data poisoning attacks, where adversaries inject carefully crafted fake user interactions into the training data of recommenders to promote target items. Current attack methods involve iteratively retraining a surrogate recommender on the poisoned data with the latest fake users to optimize the attack. However, this repetitive retraining is highly time-consuming, hindering the efficient assessment and optimization of fake users. To mitigate this computational bottleneck and develop a more effective attack in an affordable time, we analyze the retraining process and find that a change in the representation of one user/item will cause a cascading effect through the user-item interaction graph. Under theoretical guidance, we introduce Gradient Passing (GP), a novel technique that explicitly passes gradients between interacted user-item pairs during backpropagation, thereby approximating the cascading effect and accelerating retraining. With just a single update, GP can achieve effects comparable to multiple original training iterations. Under the same number of retraining epochs, GP enables a closer approximation of the surrogate recommender to the victim. This more accurate approximation provides better guidance for optimizing fake users, ultimately leading to enhanced data poisoning attacks. Extensive experiments on real-world datasets demonstrate the efficiency and effectiveness of our proposed GP.
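The core mechanism, passing scaled gradients between interacted user-item pairs, can be sketched in a few lines. The scaling factor and gradients below are illustrative placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 4, 5, 8
U = rng.normal(size=(n_users, d))  # surrogate recommender's user embeddings
V = rng.normal(size=(n_items, d))  # item embeddings
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]  # observed interactions

def pass_gradients(grad_U, grad_V, edges, alpha=0.5):
    """Sketch of Gradient Passing: besides its own gradient, each node
    receives a scaled copy of its interacted neighbours' gradients,
    approximating the cascading effect of repeated retraining in one step.
    `alpha` is an assumed scaling factor for illustration."""
    gU, gV = grad_U.copy(), grad_V.copy()
    for u, i in edges:
        gU[u] += alpha * grad_V[i]
        gV[i] += alpha * grad_U[u]
    return gU, gV

# Placeholder gradients standing in for those of a recommendation loss.
grad_U = rng.normal(size=U.shape)
grad_V = rng.normal(size=V.shape)
gU, gV = pass_gradients(grad_U, grad_V, edges)
U -= 0.01 * gU  # one GP-augmented step instead of several retraining epochs
V -= 0.01 * gV
```

In a real attack loop this update would replace the inner retraining of the surrogate between fake-user optimization steps.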

[IR-3] CoRA: Collaborative Information Perception by Large Language Models Weights for Recommendation

链接: https://arxiv.org/abs/2408.10645
作者: Yuting Liu,Jinghao Zhang,Yizhou Dang,Yuliang Liang,Qiang Liu,Guibing Guo,Jianzhe Zhao,Xingwei Wang
关键词-EN: Large Language Models, Large Language, Involving collaborative information, collaborative, LLM
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Involving collaborative information in Large Language Models (LLMs) is a promising technique for adapting LLMs for recommendation. Existing methods achieve this by concatenating collaborative features with text tokens into a unified sequence input and then fine-tuning to align these features with LLM’s input space. Although effective, in this work, we identify two limitations when adapting LLMs to recommendation tasks, which hinder the integration of general knowledge and collaborative information, resulting in sub-optimal recommendation performance. (1) Fine-tuning LLM with recommendation data can undermine its inherent world knowledge and fundamental competencies, which are crucial for interpreting and inferring recommendation text. (2) Incorporating collaborative features into textual prompts disrupts the semantics of the original prompts, preventing LLM from generating appropriate outputs. In this paper, we propose a new paradigm, CoRA (an acronym for Collaborative LoRA), with a collaborative weights generator. Rather than input space alignment, this method aligns collaborative information with LLM’s parameter space, representing them as incremental weights to update LLM’s output. This way, LLM perceives collaborative information without altering its general knowledge and text inference capabilities. Specifically, we employ a collaborative filtering model to extract user and item embeddings, converting them into collaborative weights with low-rank properties through the collaborative weights generator. We then merge the collaborative weights into LLM’s weights, enabling LLM to perceive the collaborative signals and generate personalized recommendations without fine-tuning or extra collaborative tokens in prompts. Extensive experiments confirm that CoRA effectively integrates collaborative information into LLM, enhancing recommendation performance.
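The parameter-space alignment at the heart of CoRA is a LoRA-style merge: collaborative embeddings are mapped to low-rank factors that perturb a frozen weight matrix. A toy numeric sketch (the linear generator and all dimensions are invented stand-ins for the paper's generator network):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_cf, rank = 16, 8, 4  # toy sizes; real LLM dims are far larger

# Embeddings from a pretrained collaborative filtering model (hypothetical).
user_emb = rng.normal(size=d_cf)
item_emb = rng.normal(size=d_cf)
cf = np.concatenate([user_emb, item_emb])

# Collaborative weights generator: a linear map (stand-in for the paper's
# generator) producing low-rank factors from the CF embeddings.
G_a = rng.normal(size=(2 * d_cf, d_model * rank)) * 0.01
G_b = rng.normal(size=(2 * d_cf, rank * d_model)) * 0.01
A = (cf @ G_a).reshape(d_model, rank)
B = (cf @ G_b).reshape(rank, d_model)

# Merge into a frozen LLM weight matrix, LoRA-style: W' = W + A @ B.
# The LLM's general knowledge (W) is untouched; the collaborative signal
# arrives purely as an incremental low-rank update.
W = rng.normal(size=(d_model, d_model))
W_merged = W + A @ B
print(np.linalg.matrix_rank(A @ B))  # at most `rank`
```

Because the update is additive and per-request, no fine-tuning of W and no extra collaborative tokens in the prompt are needed, which is the paper's key contrast with input-space alignment.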

[IR-4] Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval

链接: https://arxiv.org/abs/2408.10613
作者: Guangyuan Ma,Yongliang Ma,Xing Wu,Zhenpeng Su,Ming Zhou,Songlin Hu
关键词-EN: Large Language Model-based, Language Model-based Dense, Model-based Dense Retrieval, Large Language, Language Model-based
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous heterogeneous fine-tuning collections from different domains. However, the discussion about its training data distribution is still minimal. Previous studies rely on empirically assigned dataset choices or sampling ratios, which inevitably leads to sub-optimal retrieval performances. In this paper, we propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for LLM-DR fine-tuning, targeted at improving the universal domain generalization ability by end-to-end reweighting the data distribution of each task. The tDRO parameterizes the domain weights and updates them with scaled domain gradients. The optimized weights are then transferred to the LLM-DR fine-tuning to train more robust retrievers. Experiments show optimal improvements in large-scale retrieval benchmarks and reduce dataset usage by up to 30% after applying our optimization algorithm with a series of different-sized LLM-DR models.
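The "parameterize domain weights and update them" loop follows the standard DRO recipe. A hedged sketch using an exponentiated-gradient update on plain per-task losses (the paper uses scaled domain gradients; losses stand in here for illustration):

```python
import numpy as np

def update_domain_weights(weights, losses, eta=0.1):
    """DRO-style update: tasks with higher loss receive more weight via
    exponentiated-gradient ascent, followed by renormalisation so the
    weights remain a distribution. `eta` is an illustrative step size."""
    w = weights * np.exp(eta * losses)
    return w / w.sum()

weights = np.full(3, 1 / 3)          # one weight per fine-tuning task
losses = np.array([0.9, 0.4, 0.6])   # current per-task retrieval losses
for _ in range(10):
    weights = update_domain_weights(weights, losses)
print(weights)  # the hardest task (loss 0.9) now has the largest weight
```

The resulting weights would then set the sampling ratios of the fine-tuning collections, which is how the reweighting transfers to LLM-DR training.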

[IR-5] Multilingual Non-Factoid Question Answering with Silver Answers

链接: https://arxiv.org/abs/2408.10604
作者: Ritwik Mishra,Sreeram Vennam,Rajiv Ratn Shah,Ponnurangam Kumaraguru
关键词-EN: existing Question Answering, short-context Question Answering, Question Answering Datasets, Question Answering, Answering Datasets
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most existing Question Answering Datasets (QuADs) primarily focus on factoid-based short-context Question Answering (QA) in high-resource languages. However, the scope of such datasets for low-resource languages remains limited, with only a few works centered on factoid-based QuADs and none on non-factoid QuADs. Therefore, this work presents MuNfQuAD, a multilingual QuAD with non-factoid questions. It utilizes interrogative sub-headings from BBC news articles as questions and the corresponding paragraphs as silver answers. The dataset comprises over 370K QA pairs across 38 languages, encompassing several low-resource languages, and stands as the largest multilingual QA dataset to date. Based on the manual annotations of 790 QA-pairs from MuNfQuAD (golden set), we observe that 98% of questions can be answered using their corresponding silver answer. Our fine-tuned Answer Paragraph Selection (APS) model outperforms the baselines. The APS model attained an accuracy of 80% and 72%, as well as a macro F1 of 72% and 66%, on the MuNfQuAD testset and the golden set, respectively. Furthermore, the APS model effectively generalizes to certain languages within the golden set, even after being fine-tuned on silver labels.

[IR-6] Target-Prompt Online Graph Collaborative Learning for Temporal QoS Prediction

链接: https://arxiv.org/abs/2408.10555
作者: Shengxiang Hu,Guobing Zou,Song Yang,Shiyi Lin,Bofeng Zhang,Yixin Chen
关键词-EN: predicting the Quality, temporal QoS prediction, service-oriented architecture, accurately predicting, vital for maintaining
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:In service-oriented architecture, accurately predicting the Quality of Service (QoS) is vital for maintaining reliability and enhancing user satisfaction. However, current methods often neglect high-order latent collaborative relationships and fail to dynamically adjust feature learning for specific user-service invocations, which are critical for precise feature extraction. Moreover, relying on RNNs to capture QoS evolution limits the ability to detect long-term trends due to challenges in managing long-range dependencies. To address these issues, we propose the Target-Prompt Online Graph Collaborative Learning (TOGCL) framework for temporal QoS prediction. It leverages a dynamic user-service invocation graph to comprehensively model historical interactions. Building on this graph, it develops a target-prompt graph attention network to extract online deep latent features of users and services at each time slice, considering implicit target-neighboring collaborative relationships and historical QoS values. Additionally, a multi-layer Transformer encoder is employed to uncover temporal feature evolution patterns, enhancing temporal QoS prediction. Extensive experiments on the WS-DREAM dataset demonstrate that TOGCL significantly outperforms state-of-the-art methods across multiple metrics, achieving improvements of up to 38.80%. These results underscore the effectiveness of TOGCL for temporal QoS prediction.

[IR-7] Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual and Multilingual Information Retrieval

链接: https://arxiv.org/abs/2408.10536
作者: Adel Elmahdy,Sheng-Chieh Lin,Amin Ahmad
关键词-EN: increasingly important challenge, Information retrieval, natural language processing, increasingly important, important challenge
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: 15 pages, 2 figures, 13 tables

点击查看摘要

Abstract:Information retrieval across different languages is an increasingly important challenge in natural language processing. Recent approaches based on multilingual pre-trained language models have achieved remarkable success, yet they often optimize for either monolingual, cross-lingual, or multilingual retrieval performance at the expense of others. This paper proposes a novel hybrid batch training strategy to simultaneously improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings while mitigating language bias. The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size. Experiments on XQuAD-R, MLQA-R, and MIRACL benchmark datasets show that the proposed method consistently achieves comparable or superior results in zero-shot retrieval across various languages and retrieval tasks compared to monolingual-only or cross-lingual-only training. Hybrid batch training also substantially reduces language bias in multilingual retrieval compared to monolingual training. These results demonstrate the effectiveness of the proposed approach for learning language-agnostic representations that enable strong zero-shot retrieval performance across diverse languages.
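The hybrid batch strategy, sampling each batch from one pool with probability proportional to dataset size, can be sketched directly. Pool names and sizes below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical fine-tuning pools: monolingual and cross-lingual QA pairs.
datasets = {
    "mono_en": ["en-en"] * 1000,
    "mono_de": ["de-de"] * 400,
    "xling_en_de": ["en-de"] * 600,
}
names = list(datasets)
sizes = [len(datasets[n]) for n in names]

def sample_batch(batch_size=4):
    """Pick one pool with probability proportional to its size, then draw
    a homogeneous batch from it (so each batch is mono- OR cross-lingual)."""
    pool = random.choices(names, weights=sizes, k=1)[0]
    return pool, random.sample(datasets[pool], batch_size)

counts = {n: 0 for n in names}
for _ in range(2000):
    pool, _batch = sample_batch()
    counts[pool] += 1
print(counts)  # draws land roughly in the 1000:400:600 size ratio
```

Keeping each batch homogeneous lets the in-batch negatives stay within one retrieval setting while the mixture across batches covers all three.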

[IR-8] Efficient and Deployable Knowledge Infusion for Open-World Recommendations via Large Language Models

链接: https://arxiv.org/abs/2408.10520
作者: Yunjia Xi,Weiwen Liu,Jianghao Lin,Muyan Weng,Xiaoling Cai,Hong Zhu,Jieming Zhu,Bo Chen,Ruiming Tang,Yong Yu,Weinan Zhang
关键词-EN: closed-loop nature constrains, today online services, play a pervasive, pervasive role, role in today
类目: Information Retrieval (cs.IR)
*备注: arXiv admin note: text overlap with arXiv:2306.10933

点击查看摘要

Abstract:Recommender systems (RSs) play a pervasive role in today’s online services, yet their closed-loop nature constrains their access to open-world knowledge. Recently, large language models (LLMs) have shown promise in bridging this gap. However, previous attempts to directly implement LLMs as recommenders fall short in meeting the requirements of industrial RSs, particularly in terms of online inference latency and offline resource efficiency. Thus, we propose REKI to acquire two types of external knowledge about users and items from LLMs. Specifically, we introduce factorization prompting to elicit accurate knowledge reasoning on user preferences and items. We develop individual knowledge extraction and collective knowledge extraction tailored for different scales of scenarios, effectively reducing offline resource consumption. Subsequently, generated knowledge undergoes efficient transformation and condensation into augmented vectors through a hybridized expert-integrated network, ensuring compatibility. The obtained vectors can then be used to enhance any conventional recommendation model. We also ensure efficient inference by preprocessing and prestoring the knowledge from LLMs. Experiments demonstrate that REKI outperforms state-of-the-art baselines and is compatible with lots of recommendation algorithms and tasks. Now, REKI has been deployed to Huawei’s news and music recommendation platforms and gained a 7% and 1.99% improvement during the online A/B test.

[IR-9] Analysis of Plan-based Retrieval for Grounded Text Generation

链接: https://arxiv.org/abs/2408.10490
作者: Ameya Godbole,Nicholas Monath,Seungyeon Kim,Ankit Singh Rawat,Andrew McCallum,Manzil Zaheer
关键词-EN: contradicts established knowledge, seemingly coherent text, seemingly coherent, contradicts established, hallucinations refer
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In text generation, hallucinations refer to the generation of seemingly coherent text that contradicts established knowledge. One compelling hypothesis is that hallucinations occur when a language model is given a generation task outside its parametric knowledge (due to rarity, recency, domain, etc.). A common strategy to address this limitation is to infuse the language models with retrieval mechanisms, providing the model with relevant knowledge for the task. In this paper, we leverage the planning capabilities of instruction-tuned LLMs and analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations. We empirically evaluate several variations of our proposed approach on long-form text generation tasks. By improving the coverage of relevant facts, plan-guided retrieval and generation can produce more informative responses while providing a higher rate of attribution to source documents.

[IR-10] LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

链接: https://arxiv.org/abs/2408.10469
作者: Xinyu Liu,Jing Zhang,Kexin Zhang,Xu Liu,Lingling Li
关键词-EN: including object occlusion, tracking specific objects, Video Object Segmentation, including object, occlusion and fragmentation
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the dis-appearance and re-appearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J\F score of 0.7952 in the testing phase of LSVOS challenge VOS track, ranking third overa1l.

[IR-11] Enhanced document retrieval with topic embeddings

链接: https://arxiv.org/abs/2408.10435
作者: Kavsar Huseynova,Jafar Isbarov
关键词-EN: retrieval-augmented generation, experienced a revitalized, revitalized interest, advent of retrieval-augmented, RAG architecture offers
类目: Information Retrieval (cs.IR)
*备注: Accepted to AICT 2024

点击查看摘要

Abstract:Document retrieval systems have experienced a revitalized interest with the advent of retrieval-augmented generation (RAG). RAG architecture offers a lower hallucination rate than LLM-only applications. However, the accuracy of the retrieval mechanism is known to be a bottleneck in the efficiency of these applications. A particular case of subpar retrieval performance is observed in situations where multiple documents from several different but related topics are in the corpus. We have devised a new vectorization method that takes into account the topic information of the document. The paper introduces this new method for text vectorization and evaluates it in the context of RAG. Furthermore, we discuss the challenge of evaluating RAG systems, which pertains to the case at hand.
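One simple way to "take into account the topic information" is to append a down-weighted topic vector to the text embedding before normalising; the sketch below is an assumed construction for illustration (the paper does not specify this exact scheme), with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16  # toy embedding width

# Hypothetical dense text vectors for five documents across three topics.
doc_vecs = rng.normal(size=(5, dim))
topic_of_doc = [0, 0, 1, 1, 2]
# One simple choice of topic vectors: centroids of member documents.
topic_vecs = np.stack([doc_vecs[:2].mean(0), doc_vecs[2:4].mean(0), doc_vecs[4]])

def topic_aware(text_vec, topic_vec, w=0.3):
    """Concatenate the text vector with a down-weighted topic vector and
    L2-normalise, so inner products act like cosine similarity. `w` trades
    off topical vs. purely semantic matching (illustrative value)."""
    v = np.concatenate([text_vec, w * topic_vec])
    return v / np.linalg.norm(v)

index = np.stack([topic_aware(doc_vecs[i], topic_vecs[topic_of_doc[i]])
                  for i in range(5)])

# A query embedded the same way retrieves via a single matrix product.
query = topic_aware(doc_vecs[0], topic_vecs[0])
scores = index @ query
print(scores.argmax())  # → 0: the matching document, boosted by its topic
```

The shared topic component raises similarity among same-topic documents, which is precisely what helps when a corpus mixes several related topics.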

[IR-12] Joint Modeling of Search and Recommendations Via an Unified Contextual Recommender (UniCoRn)

链接: https://arxiv.org/abs/2408.10394
作者: Moumita Bhattacharya,Vito Ostuni,Sudarshan Lamkhede
关键词-EN: Search and recommendation, developed separately, leading to complex, technical debt, recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 3 pages, 1 figure

点击查看摘要

Abstract:Search and recommendation systems are essential in many services, and they are often developed separately, leading to complex maintenance and technical debt. In this paper, we present a unified deep learning model that efficiently handles key aspects of both tasks.

[IR-13] Beyond Relevant Documents: A Knowledge-Intensive Approach for Query-Focused Summarization using Large Language Models ICPR2024

链接: https://arxiv.org/abs/2408.10357
作者: Weijia Zhang,Jia-Hong Huang,Svitlana Vakulenko,Yumo Xu,Thilina Rajapakse,Evangelos Kanoulas
关键词-EN: including search engines, natural language processing, Query-focused summarization, broad applications, including search
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted by the 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Query-focused summarization (QFS) is a fundamental task in natural language processing with broad applications, including search engines and report generation. However, traditional approaches assume the availability of relevant documents, which may not always hold in practical scenarios, especially in highly specialized topics. To address this limitation, we propose a novel knowledge-intensive approach that reframes QFS as a knowledge-intensive task setup. This approach comprises two main components: a retrieval module and a summarization controller. The retrieval module efficiently retrieves potentially relevant documents from a large-scale knowledge corpus based on the given textual query, eliminating the dependence on pre-existing document sets. The summarization controller seamlessly integrates a powerful large language model (LLM)-based summarizer with a carefully tailored prompt, ensuring the generated summary is comprehensive and relevant to the query. To assess the effectiveness of our approach, we create a new dataset, along with human-annotated relevance labels, to facilitate comprehensive evaluation covering both retrieval and summarization performance. Extensive experiments demonstrate the superior performance of our approach, particularly its ability to generate accurate summaries without relying on the availability of relevant documents initially. This underscores our method’s versatility and practical applicability across diverse query scenarios.

[IR-14] OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data

链接: https://arxiv.org/abs/2408.10264
作者: Chengyu Gong,Gefei Shen,Luanzheng Guo,Nathan Tallent,Dongfang Zhao
关键词-EN: scientific data management, multimodal scientific data, similar items, original multimodal data, multimodal machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:One of the most common operations in multimodal scientific data management is searching for the k most similar items (or, k-nearest neighbors, KNN) from the database after being provided a new item. Although recent advances of multimodal machine learning models offer a semantic index, the so-called embedding vectors mapped from the original multimodal data, the dimension of the resulting embedding vectors is usually on the order of hundreds or a thousand, which is impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-k nearest neighbors does not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). In order to develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function may be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric spaces, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
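The quantity OPDR optimizes, how much of each vector's top-k neighbor set survives dimension reduction, is straightforward to measure. A sketch using PCA as the reduction map and random vectors as stand-in embeddings (both are illustrative choices, not the paper's):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical high-dimensional embedding vectors (e.g. 256-d).
X = rng.normal(size=(500, 256))
k = 10

def topk_sets(X, k):
    """Top-k neighbor set of every point (self excluded)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    return [set(row) for row in idx]

orig = topk_sets(X, k)
# Reduce dimensionality, then measure how well top-k neighbours survive:
# an overlap of 1.0 means the KNN sets are perfectly order-preserved.
for target_dim in (32, 64, 128):
    reduced = topk_sets(PCA(n_components=target_dim).fit_transform(X), k)
    overlap = np.mean([len(a & b) / k for a, b in zip(orig, reduced)])
    print(target_dim, round(overlap, 3))
```

Sweeping `target_dim` against this overlap is the empirical counterpart of the closed-form relation between target dimensionality and KNN accuracy that the paper derives.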

[IR-15] AI Transparency in Academic Search Systems: An Initial Exploration

链接: https://arxiv.org/abs/2408.10229
作者: Yifan Liu,Peter Sullivan,Luanne Sinnamon
关键词-EN: AI-enhanced academic search, academic search systems, scholarly work, increasingly popular, crucial to ensure
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:As AI-enhanced academic search systems become increasingly popular among researchers, investigating their AI transparency is crucial to ensure trust in the search outcomes, as well as the reliability and integrity of scholarly work. This study employs a qualitative content analysis approach to examine the websites of a sample of 10 AI-enhanced academic search systems identified through university library guides. The assessed level of transparency varies across these systems: five provide detailed information about their mechanisms, three offer partial information, and two provide little to no information. These findings indicate that the academic community is recommending and using tools with opaque functionalities, raising concerns about research integrity, including issues of reproducibility and researcher responsibility.

附件下载

点击下载今日全部论文列表