This blog post presents the latest paper listings fetched daily from the arXiv website, updated automatically every morning at around 11:30 and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily listing by email, please leave your email address in the comments.

Note: the daily paper data is fetched from the arXiv website and updated automatically every morning at around 11:30.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 11:30 every day.

Table of Contents

Overview (2024-06-04)

393 papers were updated today, including:

  • Natural Language Processing: 77 papers (Computation and Language (cs.CL))
  • Computer Vision: 89 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 114 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 167 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Link: https://arxiv.org/abs/2405.21075
Authors: Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun
Keywords: Multi-modal Large Language, Large Language Models, Large Language, artificial general intelligence, Multi-modal Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project Page: this https URL


Abstract:In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: this https URL

[NLP-1] Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

Link: https://arxiv.org/abs/2405.21070
Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Keywords: web-scale vision-language datasets, Severe data imbalance, imbalance naturally exists, Severe data, vision-language datasets
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP’s pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP’s generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: this https URL.
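The "dynamic classification" framing in the abstract, where only the classes present in a batch participate in each training step, can be illustrated with a toy CLIP-style contrastive step. The data and dimensions below are invented for the sketch; this is not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings for a batch of 4 image-text pairs (dim 8), L2-normalized.
img = normalize(rng.normal(size=(4, 8)))
txt = normalize(rng.normal(size=(4, 8)))

# CLIP-style logits: each image is classified against only the texts present
# in this batch -- a "dynamic" label set that changes every batch, unlike the
# fixed global class list of supervised learning.
logits = img @ txt.T            # (4, 4)
labels = np.arange(4)           # the matching text is the target class

# Softmax cross-entropy over the in-batch classes.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.log(probs[labels, labels]).mean()
```

Because the "class list" is resampled per batch, no single dominant class can monopolize the learning signal, which is the abstract's point about implicitly balanced learning.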

[NLP-2] Code Pretraining Improves Entity Tracking Abilities of Language Models

Link: https://arxiv.org/abs/2405.21068
Authors: Najoung Kim, Sebastian Schuster, Shubham Toshniwal
Keywords: discourse entities expressed, Recent work, provided indirect evidence, pretraining language models, provided indirect
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

[NLP-3] Grammar-Aligned Decoding

Link: https://arxiv.org/abs/2405.21047
Authors: Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, Loris D’Antoni
Keywords: Large Language Models, Large Language, Language Models, reliably generating highly, LLM distribution
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM’s output must follow a given grammar. In this paper we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM’s distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM’s distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
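The distortion argument can be reproduced on a two-token toy language model: greedy per-step masking with local renormalization (GCD) yields grammatical outputs whose probabilities are not proportional to the LLM's own distribution over grammatical strings, which is the GAD target. The toy distributions below are invented for illustration:

```python
# Toy LLM over two-token strings from vocab {a, b}, factorized per step:
p1 = {"a": 0.9, "b": 0.1}               # p(t1)
p2 = {"a": {"a": 0.05, "b": 0.95},      # p(t2 | t1)
      "b": {"a": 0.5, "b": 0.5}}

grammar = {("a", "a"), ("b", "a")}      # only strings ending in "a" parse

# GAD target: the LLM's own distribution conditioned on grammaticality.
joint = {s: p1[s[0]] * p2[s[0]][s[1]] for s in grammar}
z = sum(joint.values())
target = {s: p / z for s, p in joint.items()}

def gcd_prob(s):
    # Greedy grammar-constrained decoding: mask ungrammatical tokens at
    # each step and renormalize locally.
    q = p1[s[0]]                          # step 1: both tokens can parse
    q *= p2[s[0]][s[1]] / p2[s[0]]["a"]   # step 2: only "a" is allowed
    return q

gcd = {s: gcd_prob(s) for s in grammar}
# gcd[("a", "a")] = 0.9 while target[("a", "a")] is about 0.47: the output
# is grammatical but badly misaligned with the LLM's conditional distribution.
```

This is exactly the failure mode the paper's ASAp algorithm addresses, by using prior samples to over-approximate the future grammaticality of each prefix instead of renormalizing greedily.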

[NLP-4] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Link: https://arxiv.org/abs/2405.21046
Authors: Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin
Keywords: Exploratory Preference Optimization, Direct Preference Optimization, language model alignment, Preference Optimization, Reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:


Abstract:Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical – a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) – yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of Q^\star -approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.
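Since the abstract describes XPO as a one-line change to online DPO, a schematic version can be written down directly. The concrete form of the exploration bonus below (and the response it is applied to) is an assumption made for this sketch, not the paper's exact term:

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # Online DPO on one preference pair: -log sigmoid of the implicit
    # reward margin between the chosen (w) and rejected (l) responses.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def xpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1, alpha=0.01):
    # XPO: the DPO objective plus an exploration bonus (the "one-line
    # change"). Adding alpha * logp_w discourages collapsing onto
    # already-likely responses, nudging the policy to keep exploring.
    return dpo_loss(logp_w, logp_l, ref_w, ref_l, beta) + alpha * logp_w
```

The log-probabilities would come from the policy and reference models on sampled responses; everything else in the training loop stays as in online DPO.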

[NLP-5] Direct Alignment of Language Models via Quality-Aware Self-Refinement

Link: https://arxiv.org/abs/2405.21040
Authors: Runsheng Yu, Yong Wang, Xiaoqi Jiao, Youzhi Zhang, James T. Kwok
Keywords: Large Language Models, Reinforcement Learning, Large Language, Human Feedback, behaviors of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Policy Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function to estimate the quality of both the positive and negative responses. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. The refinement function is integrated into DPO and its variant Identity Policy Optimization (IPO). Experiments across various evaluators indicate that they can improve the performance of the fine-tuned models over DPO and IPO.
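A minimal sketch of the idea, assuming the LLM-estimated response qualities enter the DPO loss as an offset to the implicit reward margin. The paper derives its own refinement function; this additive form is an illustrative guess:

```python
import math

def quality_aware_dpo_loss(logp_w, logp_l, ref_w, ref_l,
                           quality_w, quality_l, beta=0.1):
    # Plain DPO treats every (chosen, rejected) pair alike. Here the policy
    # LLM's own quality scores for the two responses refine the loss; a
    # larger quality gap marks a clearer preference, shrinking the loss.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    refinement = quality_w - quality_l
    return -math.log(1.0 / (1.0 + math.exp(-(margin + refinement))))
```

With equal quality scores this reduces to standard DPO, which matches the abstract's framing of the refinement as an adjustment on top of the existing objective.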

[NLP-6] LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

Link: https://arxiv.org/abs/2405.21028
Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal
Keywords: explicit confidence markers, answering questions, LACIE, confidence markers, confidence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages. Code: this https URL


Abstract:When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise; however, most current models tend towards overconfidence. To calibrate both implicit and explicit confidence markers, we introduce a pragmatic, listener-aware finetuning method (LACIE) that models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener. We cast calibration as preference optimization, creating data via a two-agent game, where a speaker model’s outputs are judged by a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B, Llama3-70B) with LACIE, and show that the resulting models are better calibrated w.r.t. a simulated listener. Crucially, these trends transfer to human listeners, helping them correctly predict model correctness: we conduct a human evaluation where annotators accept or reject an LLM’s answers, finding that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers. Furthermore, LACIE generalizes to another dataset, resulting in a large increase in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis indicates that LACIE leads to a better confidence separation between correct and incorrect examples. Qualitatively, we find that a LACIE-trained model hedges more and implicitly signals certainty when it is correct by using an authoritative tone or including details. Finally, LACIE finetuning leads to an emergent increase in model abstention (e.g. saying “I don’t know”) for answers that are likely wrong.
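Schematically, the two-agent preference data generation reduces to a labeling rule: an answer is preferred when the simulated listener's acceptance matches its actual correctness. The exact scheme below is an illustrative simplification of the paper's setup:

```python
def lacie_preference_label(is_correct, listener_accepts):
    # Calibrated behavior means the listener's acceptance tracks actual
    # correctness: confident-and-right or hedged-and-wrong are preferred;
    # the mismatched cases (overconfident wrong answers accepted, or
    # underconfident right answers rejected) are dispreferred.
    if is_correct == listener_accepts:
        return "preferred"
    return "dispreferred"
```

Pairs labeled this way would then feed a standard preference-optimization step, which is how the method teaches implicit markers like tone rather than only numeric confidence scores.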

[NLP-7] You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Link: https://arxiv.org/abs/2405.21022
Authors: Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong
Keywords: linear computational complexity, enhanced speed, gained prominence, prominence in causal, computational complexity
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. Yiran Zhong is the corresponding author. The code is available at this https URL


Abstract:Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed. However, the inherent decay mechanism in linear attention presents challenges when applied to multi-dimensional sequence modeling tasks, such as image processing and multi-modal learning. In these scenarios, the utilization of sequential scanning to establish a global receptive field necessitates multiple scans for multi-dimensional data, thereby leading to inefficiencies. This paper identifies the inefficiency caused by a multiplicative linear recurrence and proposes an efficient alternative additive linear recurrence to avoid the issue, as it can handle multi-dimensional data within a single scan. We further develop an efficient multi-dimensional sequential modeling framework called LightNet based on the new recurrence. Moreover, we present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE to enhance the model’s ability to discern positional information in multi-dimensional scenarios. Our empirical evaluations across various tasks, including image classification, image generation, bidirectional language modeling, and autoregressive language modeling, demonstrate the efficacy of LightNet, showcasing its potential as a versatile and efficient solution for multi-dimensional sequential modeling.
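The key contrast, that an additive linear recurrence is invariant to scan order while a decaying (multiplicative) one is not, can be checked numerically. This is a toy sketch of the recurrences, not the LightNet implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
k = rng.normal(size=(5, 4))   # 5 positions, key dim 4
v = rng.normal(size=(5, 3))   # value dim 3

def scan(keys, vals, decay):
    # h_t = decay * h_{t-1} + k_t v_t^T  (decay = 1.0 is the additive case)
    h = np.zeros((4, 3))
    for kt, vt in zip(keys, vals):
        h = decay * h + np.outer(kt, vt)
    return h

perm = [3, 0, 4, 1, 2]

# Additive recurrence: the final state is a plain sum, so it is invariant
# to scan order -- a single scan covers multi-dimensional data.
add_fwd = scan(k, v, 1.0)
add_perm = scan(k[perm], v[perm], 1.0)

# Multiplicative decay: the state depends on scan order, which is why
# decaying linear attention needs a separate scan per ordering.
mul_fwd = scan(k, v, 0.9)
mul_perm = scan(k[perm], v[perm], 0.9)
```

Order invariance is what lets a 2D image be flattened in any single order without changing the result, which is the "you only scan once" claim; positional information is then reintroduced by the MD-TPE/MD-LRPE encodings.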

[NLP-8] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Link: https://arxiv.org/abs/2405.21018
Authors: Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
Keywords: Large language models, Greedy Coordinate Gradient, Large language, language models, rapidly developed
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:


Abstract:Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack’s success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of “Sure” largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialisation. Then, we combine these improved technologies to develop an efficient jailbreak method, dubbed \mathcalI -GCG. In our experiments, we evaluate on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at this https URL.
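The multi-coordinate updating idea (replacing several suffix tokens per step instead of one) can be sketched as toy hill-climbing over a discrete suffix. The surrogate loss and fixed candidate budget below stand in for the real gradient-guided GCG machinery and are assumptions of this sketch:

```python
import random

random.seed(0)
VOCAB = list("abcdef")
TARGET = list("faced")  # stands in for tokens matching a target template

def loss(suffix):
    # Toy surrogate: positions where the suffix misses the target.
    # Real GCG minimizes the LLM's loss on a harmful target completion.
    return sum(s != t for s, t in zip(suffix, TARGET))

def attack_step(suffix, n_coords, n_candidates=32):
    # Multi-coordinate update: each candidate swaps `n_coords` positions
    # at once and the best candidate is kept. I-GCG chooses how many
    # tokens to replace adaptively at each step.
    best = suffix
    for _ in range(n_candidates):
        cand = list(best)
        for pos in random.sample(range(len(cand)), n_coords):
            cand[pos] = random.choice(VOCAB)
        if loss(cand) < loss(best):
            best = cand
    return best

suffix = list("aaaaa")
for step in range(20):
    # easy-to-hard: aggressive multi-token swaps first, single-token later
    suffix = attack_step(suffix, n_coords=3 if step < 10 else 1)
```

Replacing several coordinates per step is what the paper credits for faster convergence than the original one-token-per-step GCG.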

[NLP-9] CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking

Link: https://arxiv.org/abs/2405.20994
Authors: Josef Vonášek, Milan Straka, Rostislav Krč, Lenka Lasoňová, Ekaterina Egorova, Jana Straková, Jakub Náplava
Keywords: Click Web Ranking, Web Ranking dataset, Czech click dataset, Click Web, search engine logs
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to SIGIR 2024


Abstract:We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from search engine logs of this http URL. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated Czech test for the relevance task, containing nearly 50k query-document pairs, each annotated by at least 2 annotators. Finally, we analyze how the user behavior data improve relevance ranking and show that models trained on data automatically harnessed at sufficient scale can surpass the performance of models trained on human annotated data. CWRCzech is published under an academic non-commercial license and is available to the research community at this https URL.

[NLP-10] SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

Link: https://arxiv.org/abs/2405.20974
Authors: Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao
Keywords: Large language models, Large language, confidence estimates, broader applications, fabricated information
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The code is available at this https URL


Abstract:Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications. Previous work elicits confidence from LLMs by direct or self-consistency prompting, or constructing specific datasets for supervised finetuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited to binary or inaccurate group-level confidence estimates. In this work, we present the advanced SaySelf, a training framework that teaches LLMs to express more accurate fine-grained confidence estimates. In addition, beyond the confidence scores, SaySelf initiates the process of directing LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by using an LLM to automatically summarize the uncertainties in specific knowledge via natural language. The summarization is based on the analysis of the inconsistency in multiple sampled reasoning chains, and the resulting data is utilized for supervised fine-tuning. Moreover, we utilize reinforcement learning with a meticulously crafted reward function to calibrate the confidence estimates, motivating LLMs to deliver accurate, high-confidence predictions and to penalize overconfidence in erroneous outputs. Experimental results in both in-distribution and out-of-distribution datasets demonstrate the effectiveness of SaySelf in reducing the confidence calibration error and maintaining the task performance. We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration. The code is made public at \urlthis https URL.
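The consistency signal underlying the training data can be sketched in a few lines: sample several reasoning chains and use the agreement rate of their final answers as a coarse confidence estimate. The paper additionally has an LLM summarize, in natural language, why the chains disagree:

```python
from collections import Counter

def confidence_from_samples(answers):
    # Agreement across sampled reasoning chains serves as a coarse
    # confidence estimate; disagreement flags the knowledge gaps that
    # SaySelf asks the LLM to explain in a self-reflective rationale.
    counts = Counter(answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(answers)

ans, conf = confidence_from_samples(["Paris", "Paris", "Lyon", "Paris"])
```

Estimates like these would then supervise the fine-tuning stage, with the reinforcement-learning reward described in the abstract further calibrating the expressed confidence.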

[NLP-11] LCQ: Low-Rank Codebook based Quantization for Large Language Models

Link: https://arxiv.org/abs/2405.20973
Authors: Wen-Pu Cai, Wu-Jun Li
Keywords: Large language models, Large language, recently demonstrated promising, demonstrated promising performance, recently demonstrated
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 5 figures


Abstract:Large language models~(LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for model compression, which can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization~(LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with a negligibly extra storage cost.
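A toy vector-quantization experiment illustrates the claim that a low-rank codebook (rank greater than one) fits weights better than a rank-one codebook at similar storage cost. The k-means-plus-SVD-projection construction below is an illustrative stand-in, not LCQ's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))   # toy weight matrix; rows get quantized

def quantize(W, codebook):
    # Nearest-codeword assignment (vector quantization).
    d = ((W[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)
    return codebook[idx], idx

def low_rank_codebook(W, n_codes, rank, iters=10):
    # Codebook constrained to C = U @ V with the given rank. rank=1 mirrors
    # the rank-one codebooks of prior methods; LCQ's point is that rank > 1
    # adds negligible storage while fitting W much better.
    C = W[rng.choice(len(W), size=n_codes, replace=False)]
    for _ in range(iters):
        _, idx = quantize(W, C)
        for c in range(n_codes):               # k-means centroid update
            if (idx == c).any():
                C[c] = W[idx == c].mean(0)
        u, s, vt = np.linalg.svd(C, full_matrices=False)
        C = u[:, :rank] * s[:rank] @ vt[:rank]  # project back to the rank
    return C

err = {}
for r in (1, 3):
    C = low_rank_codebook(W, n_codes=16, rank=r)
    Q, _ = quantize(W, C)
    err[r] = ((W - Q) ** 2).mean()
```

A rank-1 codebook forces every codeword onto a single direction, so its reconstruction error stays high; the rank-3 codebook spans a larger subspace and quantizes the same rows with visibly lower error.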

[NLP-12] Superlatives in Context: Explicit and Implicit Domain Restrictions for Superlative Frames

Link: https://arxiv.org/abs/2405.20967
Authors: Valentina Pyatkin, Bonnie Webber, Ido Dagan, Reut Tsarfaty
Keywords: single out elements, minimal property, Superlatives, set, semantics
Categories: Computation and Language (cs.CL)
Comments: 11 pages


Abstract:Superlatives are used to single out elements with a maximal/minimal property. Semantically, superlatives perform a set comparison: something (or some things) has the min/max property out of a set. As such, superlatives provide an ideal phenomenon for studying implicit phenomena and discourse restrictions. While this comparison set is often not explicitly defined, its (implicit) restrictions can be inferred from the discourse context the expression appears in. In this work we provide an extensive computational study on the semantics of superlatives. We propose a unified account of superlative semantics which allows us to derive a broad-coverage annotation schema. Using this unified schema we annotated a multi-domain dataset of superlatives and their semantic interpretations. We specifically focus on interpreting implicit or ambiguous superlative expressions, by analyzing how the discourse context restricts the set of interpretations. In a set of experiments we then analyze how well models perform at variations of predicting superlative semantics, with and without context. We show that the fine-grained semantics of superlatives in context can be challenging for contemporary models, including GPT-4.

[NLP-13] Large Language Models are Zero-Shot Next Location Predictors

Link: https://arxiv.org/abs/2405.20962
Authors: Ciro Beneduce, Bruno Lepri, Massimiliano Luca
Keywords: Predicting the locations, locations an individual, individual will visit, future is crucial, crucial for solving
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:Predicting the locations an individual will visit in the future is crucial for solving many societal issues like disease diffusion and reduction of pollution among many others. The models designed to tackle next-location prediction, however, require a significant amount of individual-level information to be trained effectively. Such data may be scarce or even unavailable in some geographic regions or peculiar scenarios (e.g., cold-start in recommendation systems). Moreover, the design of a next-location predictor able to generalize or geographically transfer knowledge is still an open research challenge. Recent advances in natural language processing have led to a rapid diffusion of Large Language Models (LLMs) which have shown good generalization and reasoning capabilities. These insights, coupled with the recent findings that LLMs are rich in geographical knowledge, allowed us to believe that these models can act as zero-shot next-location predictors. This paper evaluates the capabilities of many popular LLMs in this role, specifically Llama, GPT-3.5 and Mistral 7B. After designing a proper prompt, we tested the models on three real-world mobility datasets. The results show that LLMs can obtain accuracies up to 32.4%, a significant relative improvement of over 600% when compared to sophisticated DL models specifically designed for human mobility. Moreover, we show that other LLMs are unable to perform the task properly. To prevent positively biased results, we also propose a framework inspired by other studies to test data contamination. Finally, we explored the possibility of using LLMs as text-based explainers for next-location prediction showing that can effectively provide an explanation for their decision. Notably, 7B models provide more generic, but still reliable, explanations compared to larger counterparts. Code: this http URL
摘要:预测个人未来将访问的地点,对于解决疾病传播、减少污染等诸多社会问题至关重要。然而,为下一位置预测设计的模型需要大量个体级别的信息才能得到有效训练。在某些地理区域或特殊场景下(例如推荐系统中的冷启动),此类数据可能稀缺甚至无法获得。此外,如何设计能够泛化或跨地域迁移知识的下一位置预测器,仍是一个开放的研究挑战。自然语言处理的最新进展催生了大型语言模型(LLM)的迅速普及,这些模型表现出良好的泛化与推理能力。这些洞察,加上近期关于LLM蕴含丰富地理知识的发现,使我们相信这些模型可以充当零样本的下一位置预测器。本文评估了多个流行LLM(特别是Llama、GPT-3.5和Mistral 7B)在这一角色中的能力。在设计了合适的提示之后,我们在三个真实世界的人类移动数据集上测试了这些模型。结果表明,LLM可以获得高达32.4%的准确率,与专门为人类移动设计的复杂深度学习模型相比,相对提升超过600%。此外,我们还表明其他一些LLM无法正确完成该任务。为防止结果出现正向偏差,我们还提出了一个受其他研究启发的框架来检测数据污染。最后,我们探索了将LLM用作下一位置预测的基于文本的解释器的可能性,表明其能够为自身决策提供有效解释。值得注意的是,与更大的模型相比,7B模型给出的解释更笼统,但仍然可靠。代码:此http URL
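下面给出一个极简的 Python 示意,展示如何将个体的历史访问序列拼装成零样本下一位置预测的提示。注意:`build_prompt` 的函数名、措辞与结构均为本文之外的假设性设计,并非论文使用的原始提示。

```python
def build_prompt(history, candidates):
    """根据历史访问序列构造零样本下一位置预测提示(假设性示例)。"""
    visits = "\n".join(f"{t}: {p}" for t, p in history)
    return (
        "以下是某用户按时间排列的历史访问地点:\n"
        f"{visits}\n"
        f"候选地点: {', '.join(candidates)}\n"
        "请仅输出该用户最可能访问的下一个地点的名称。"
    )

prompt = build_prompt(
    [("Mon 09:00", "Home"), ("Mon 12:30", "Office"), ("Mon 19:00", "Gym")],
    ["Home", "Office", "Gym", "Cafe"],
)
```

实际使用时,将返回的字符串交给所选 LLM 的补全接口,再解析其输出中的地点名称即可。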

[NLP-14] A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians
[NLP-14] 机器人走进酒吧:语言模型能成为喜剧创作的创造力支持工具吗?LLM幽默感与喜剧演员契合度的评估

链接: https://arxiv.org/abs/2405.20956
作者: Piotr Wojciech Mirowski,Juliette Love,Kory W. Mathewson,Shakir Mohamed
关键词: Edinburgh Festival Fringe, interviewed twenty professional, twenty professional comedians, perform live shows, Fringe in August
中文关键词: 爱丁堡边缘艺术节,采访了二十位专业、二十位专业喜剧演员,现场表演,边缘艺术节于八月举行
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 1 figure, published at ACM FAccT 2024

点击查看摘要

Abstract:We interviewed twenty professional comedians who perform live shows in front of audiences and who use artificial intelligence in their artistic process as part of 3-hour workshops on "AI x Comedy" conducted at the Edinburgh Festival Fringe in August 2023 and online. The workshop consisted of a comedy writing session with large language models (LLMs), a human-computer interaction questionnaire to assess the Creativity Support Index of AI as a writing tool, and a focus group interrogating the comedians' motivations for and processes of using AI, as well as their ethical concerns about bias, censorship and copyright. Participants noted that existing moderation strategies used in safety filtering and instruction-tuned LLMs reinforced hegemonic viewpoints by erasing minority groups and their perspectives, and qualified this as a form of censorship. At the same time, most participants felt the LLMs did not succeed as a creativity support tool, by producing bland and biased comedy tropes, akin to "cruise ship comedy material from the 1950s, but a bit less racist". Our work extends scholarship about the subtle difference between, on the one hand, harmful speech, and on the other hand, "offensive" language as a practice of resistance, satire and "punching up". We also interrogate the global value alignment behind such language models, and discuss the importance of community-based value alignment and data ownership to build AI tools that better suit artists' needs.
摘要:我们采访了20位在观众面前进行现场表演、并在艺术创作过程中使用人工智能的专业喜剧演员,这是2023年8月在爱丁堡边缘艺术节(Edinburgh Festival Fringe)现场及线上举办的为期3小时的"AI x Comedy"研讨会的一部分。研讨会包括一场使用大型语言模型(LLM)的喜剧写作环节、一份用于评估人工智能作为写作工具的"创造力支持指数"的人机交互问卷,以及一个焦点小组,探讨喜剧演员使用人工智能的动机与流程,及其对偏见、审查和版权的伦理关切。与会者指出,安全过滤和指令微调LLM中使用的现有内容审核策略,通过抹除少数群体及其观点而强化了霸权式的视角,并将其定性为一种审查形式。与此同时,大多数参与者认为LLM作为创造力支持工具并不成功,其产出的喜剧套路平淡且带有偏见,"类似于20世纪50年代的游轮喜剧素材,只是种族主义色彩略轻"。我们的工作扩展了关于有害言论与作为抵抗、讽刺和"向上出击"(punching up)实践的"冒犯性"语言之间微妙差别的学术讨论。我们还审视了此类语言模型背后的全球价值对齐,并讨论了基于社区的价值对齐与数据所有权对于构建更贴合艺术家需求的人工智能工具的重要性。

[NLP-15] OR-Bench: An Over-Refusal Benchmark for Large Language Models
[NLP-15] OR-Bench:大型语言模型的过度拒绝基准

链接: https://arxiv.org/abs/2405.20947
作者: Justin Cui,Wei-Lin Chiang,Ion Stoica,Cho-Jui Hsieh
关键词: Large Language Models, Large Language, require careful safety, careful safety alignment, prevent malicious outputs
中文关键词: 大型语言模型,大型语言,需要谨慎的安全性,谨慎的安全对齐,防止恶意输出
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: version 1

点击查看摘要

Abstract:Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where the LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at this https URL and the corresponding demo can be found at this https URL. We hope this benchmark can help the community develop better safety aligned models.
摘要:大型语言模型(LLM)需要细致的安全对齐以防止恶意输出。虽然大量研究聚焦于减少有害内容的生成,但增强的安全性往往伴随过度拒绝的副作用:LLM可能拒绝无害的提示,从而变得不那么有用。尽管过度拒绝问题已被经验性地观察到,但由于很难构造看似有害实则良性的提示,系统性的度量颇具挑战。本研究提出了一种自动生成大规模"看似有毒提示"(很可能被LLM拒绝的良性提示)集合的新方法。利用该技术,我们推出了OR-Bench,这是首个大规模的过度拒绝基准。OR-Bench包含横跨10个常见拒绝类别的8万个看似有毒的提示、约1,000个即使对最先进LLM也具有挑战性的困难提示子集,以及另外600个有毒提示,用于防止模型不加区分地一概响应。随后,我们开展了一项全面研究,测量了8个模型系列中25个流行LLM的过度拒绝情况。我们的数据集可在此https URL获取,相应的演示可在此https URL找到。我们希望该基准能帮助社区开发出更好的安全对齐模型。

[NLP-16] Learning to Estimate System Specifications in Linear Temporal Logic using Transformers and Mamba
[NLP-16] 学习使用Transformers和Mamba估计线性时态逻辑中的系统规格

链接: https://arxiv.org/abs/2405.20917
作者: İlker Işık,Ebru Aydin Gol,Ramazan Gokberk Cinbis
关键词: Temporal logic, evolve over time, temporal logic formulae, framework for representing, representing and reasoning
中文关键词: 时态逻辑,随着时间的推移而演变,时态逻辑公式,表示、表示和推理的框架
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 20 pages, 15 figures

点击查看摘要

Abstract:Temporal logic is a framework for representing and reasoning about propositions that evolve over time. It is commonly used for specifying requirements in various domains, including hardware and software systems, as well as robotics. Specification mining or formula generation involves extracting temporal logic formulae from system traces and has numerous applications, such as detecting bugs and improving interpretability. Although there has been a surge of deep learning-based methods for temporal logic satisfiability checking in recent years, the specification mining literature has been lagging behind in adopting deep learning methods despite their many advantages, such as scalability. In this paper, we introduce autoregressive models that can generate linear temporal logic formulae from traces, towards addressing the specification mining problem. We propose multiple architectures for this task: transformer encoder-decoder, decoder-only transformer, and Mamba, which is an emerging alternative to transformer models. Additionally, we devise a metric for quantifying the distinctiveness of the generated formulae and a straightforward algorithm to enforce the syntax constraints. Our experiments show that the proposed architectures yield promising results, generating correct and distinct formulae at a fraction of the compute cost needed for the combinatorial baseline.
摘要:时态逻辑是一种对随时间演变的命题进行表示与推理的框架,常用于描述硬件与软件系统以及机器人等各个领域的需求。规约挖掘(或公式生成)是指从系统轨迹中提取时态逻辑公式,其应用广泛,例如检测错误和提高可解释性。尽管近年来涌现出大量基于深度学习的时态逻辑可满足性检验方法,但规约挖掘文献在采用深度学习方法方面一直滞后,尽管这些方法具有可扩展性等诸多优点。本文引入了能够从轨迹生成线性时态逻辑公式的自回归模型,以解决规约挖掘问题。我们为该任务提出了多种架构:Transformer编码器-解码器、仅解码器Transformer,以及Mamba(一种新兴的Transformer替代模型)。此外,我们设计了一个量化所生成公式独特性的度量,以及一个用于强制满足语法约束的简单算法。实验表明,所提出的架构取得了有希望的结果,以组合基线所需计算成本的一小部分生成了正确且各不相同的公式。
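为了直观理解"从轨迹中挖掘时态逻辑公式"所针对的对象,下面用 Python 写一个极简的线性时态逻辑(LTL)求值器,仅支持原子命题与 G(总是)、F(最终)、X(下一步)算子。这只是示意性草稿,与论文的神经网络模型无关。

```python
def holds(formula, trace, t=0):
    """在轨迹 trace 的时刻 t 判断 LTL 公式是否成立(仅作示意)。"""
    op, *args = formula
    if op == "atom":                 # 原子命题:是否出现在当前时刻的命题集合中
        return args[0] in trace[t]
    if op == "X":                    # 下一步
        return t + 1 < len(trace) and holds(args[0], trace, t + 1)
    if op == "F":                    # 最终:存在某个未来时刻成立
        return any(holds(args[0], trace, i) for i in range(t, len(trace)))
    if op == "G":                    # 总是:所有未来时刻都成立
        return all(holds(args[0], trace, i) for i in range(t, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# 一条长度为 3 的轨迹,每个时刻是当前为真的命题集合
trace = [{"p"}, {"p", "q"}, {"p"}]
```

规约挖掘的目标即为:给定若干条这样的轨迹,找出在所有轨迹上均成立(且有区分度)的公式。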

[NLP-17] Enhancing Vision Models for Text-Heavy Content Understanding and Interaction
[NLP-17] 增强视觉模型以实现文本密集内容理解和交互

链接: https://arxiv.org/abs/2405.20906
作者: Adithya TG,Adithya SK,Abhinav R Bharadwaj,Abhiram HA,Dr. Surabhi Narayan
关键词: heavy visual content, traditional vision models, text heavy visual, major challenge, challenge for traditional
中文关键词: 繁重的视觉内容、传统的视觉模型、文本繁重的视觉、重大挑战、传统的挑战
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages, 4 figures (including 1 graph)

点击查看摘要

Abstract:Interacting and understanding with text heavy visual content with multiple images is a major challenge for traditional vision models. This paper is on enhancing vision models’ capability to comprehend or understand and learn from images containing a huge amount of textual information from the likes of textbooks and research papers which contain multiple images like graphs, etc and tables in them with different types of axes and scales. The approach involves dataset preprocessing, fine tuning which is by using instructional oriented data and evaluation. We also built a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark which is developed to consider both textual and visual inputs. An accuracy of 96.71% was obtained. The aim of the project is to increase and also enhance the advance vision models’ capabilities in understanding complex visual textual data interconnected data, contributing to multimodal AI.
摘要:与包含多张图像的文本密集型视觉内容进行交互和理解,是传统视觉模型面临的一大挑战。本文旨在增强视觉模型对包含大量文本信息的图像的理解与学习能力,例如教科书和研究论文中带有图表等多种图像、以及具有不同坐标轴和刻度的表格的页面。该方法包括数据集预处理、使用面向指令的数据进行微调以及评估。我们还构建了一个视觉聊天应用,集成了用于图像编码的CLIP以及一个来自Massive Text Embedding Benchmark的模型,以同时考虑文本与视觉输入,准确率达到96.71%。该项目的目标是提升先进视觉模型理解复杂的、相互关联的视觉文本数据的能力,为多模态人工智能做出贡献。

[NLP-18] Preemptive Answer “Attacks” on Chain-of-Thought Reasoning
[NLP-18] 针对思维链推理的"抢先作答"攻击

链接: https://arxiv.org/abs/2405.20902
作者: Rongwu Xu,Zehan Qi,Wei Xu
关键词: Large language models, Large language, showcase impressive reasoning, showcase impressive, impressive reasoning capabilities
中文关键词: 大型语言模型,大型语言,展示令人印象深刻的推理,展示令人印象深刻的推理能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted to ACL’24 (Findings). Camera-ready version

点击查看摘要

Abstract:Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought (CoT) prompting. However, the robustness of this approach warrants further investigation. In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning. This situation can arise inadvertently or induced by malicious users by prompt injection attacks. Experiments reveal that preemptive answers significantly impair the model’s reasoning capability across various CoT methods and a broad spectrum of datasets. To bolster the robustness of reasoning, we propose two measures aimed at mitigating this issue to some extent.
摘要:大型语言模型(LLM)与思想链(CoT)提示相结合时,展示出令人印象深刻的推理能力。然而,这种方法的稳健性值得进一步研究。在本文中,我们引入了一种称为先发制人答案的新颖场景,其中LLM在进行推理之前获得答案。这种情况可能是无意中发生的,也可能是恶意用户通过提示注入攻击引起的。实验表明,先发制人的答案显着损害了模型在各种CoT方法和广泛数据集中的推理能力。为了增强推理的稳健性,我们提出了两项旨在在一定程度上缓解这一问题的措施。
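论文所述的"抢先答案"场景可以用一个简单的提示构造来说明。以下 Python 草稿仅为示意:函数名与提示措辞均为假设,并非论文的实际攻击实现。

```python
def inject_preemptive_answer(question, fake_answer):
    # 在思维链(CoT)推理触发语之前注入一个(可能错误的)抢先答案,
    # 模拟提示注入攻击对 CoT 推理的干扰
    return (
        f"问题: {question}\n"
        f"答案是 {fake_answer}。\n"
        "让我们一步一步地思考。"
    )

attacked = inject_preemptive_answer("23 + 19 等于多少?", "40")
```

将 `attacked` 与未注入答案的对照提示分别送入模型,即可比较抢先答案对推理准确率的影响。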

[NLP-19] Large Language Models: A New Approach for Privacy Policy Analysis at Scale
[NLP-19] 大型语言模型:大规模隐私政策分析的新方法

链接: https://arxiv.org/abs/2405.20900
作者: David Rodriguez,Ian Yang,Jose M. Del Alamo,Norman Sadeh
关键词: data protection laws, presents significant challenges, mobile applications presents, applications presents significant, Natural Language Processing
中文关键词: 数据保护法,提出了重大挑战,移动应用程序提出了重大挑战,应用程序提出了重大,自然语言处理
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The number and dynamic nature of web and mobile applications presents significant challenges for assessing their compliance with data protection laws. In this context, symbolic and statistical Natural Language Processing (NLP) techniques have been employed for the automated analysis of these systems’ privacy policies. However, these techniques typically require labor-intensive and potentially error-prone manually annotated datasets for training and validation. This research proposes the application of Large Language Models (LLMs) as an alternative for effectively and efficiently extracting privacy practices from privacy policies at scale. Particularly, we leverage well-known LLMs such as ChatGPT and Llama 2, and offer guidance on the optimal design of prompts, parameters, and models, incorporating advanced strategies such as few-shot learning. We further illustrate its capability to detect detailed and varied privacy practices accurately. Using several renowned datasets in the domain as a benchmark, our evaluation validates its exceptional performance, achieving an F1 score exceeding 93%. Besides, it does so with reduced costs, faster processing times, and fewer technical knowledge requirements. Consequently, we advocate for LLM-based solutions as a sound alternative to traditional NLP techniques for the automated analysis of privacy policies at scale.
摘要:网络和移动应用程序的数量和动态特性对评估其遵守数据保护法的情况提出了重大挑战。在此背景下,符号和统计自然语言处理(NLP)技术已被用于自动分析这些系统的隐私策略。然而,这些技术通常需要劳动密集型且可能容易出错的手动标注数据集来进行训练和验证。这项研究提出应用大语言模型(LLM)作为有效和高效地大规模从隐私策略中提取隐私实践的替代方案。特别是,我们利用ChatGPT和Llama 2等知名LLM,提供关于提示、参数和模型优化设计的指导,并结合少样本学习(few-shot learning)等高级策略。我们进一步展示了其准确检测详尽而多样的隐私实践的能力。使用该领域几个著名的数据集作为基准,我们的评估验证了其卓越的性能,F1得分超过93%。此外,它还具有成本更低、处理时间更快、技术知识要求更少的优点。因此,我们主张将基于LLM的解决方案作为传统NLP技术的可靠替代方案,用于大规模自动分析隐私策略。

[NLP-20] A comparison of correspondence analysis with PMI-based word embedding methods
[NLP-20] 对应分析与基于PMI的词嵌入方法的比较

链接: https://arxiv.org/abs/2405.20895
作者: Qianqian Qi,David J. Hessen,Peter G. M. van der Heijden
关键词: Popular word embedding, pointwise mutual information, Popular word, word embedding methods, PMI matrix
中文关键词: 流行词嵌入、逐点互信息、流行词、词嵌入方法、PMI矩阵
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Popular word embedding methods such as GloVe and Word2Vec are related to the factorization of the pointwise mutual information (PMI) matrix. In this paper, we link correspondence analysis (CA) to the factorization of the PMI matrix. CA is a dimensionality reduction method that uses singular value decomposition (SVD), and we show that CA is mathematically close to the weighted factorization of the PMI matrix. In addition, we present variants of CA that turn out to be successful in the factorization of the word-context matrix, i.e. CA applied to a matrix where the entries undergo a square-root transformation (ROOT-CA) and a root-root transformation (ROOTROOT-CA). An empirical comparison among CA- and PMI-based methods shows that overall results of ROOT-CA and ROOTROOT-CA are slightly better than those of the PMI-based methods.
摘要:GloVe和Word2Vec等流行的词嵌入方法与逐点互信息(PMI)矩阵的因式分解有关。本文将对应分析(CA)与PMI矩阵的因式分解联系起来。CA是一种使用奇异值分解(SVD)的降维方法,我们证明CA在数学上接近PMI矩阵的加权因式分解。此外,我们提出了在词-上下文矩阵因式分解中行之有效的CA变体,即将CA应用于条目经过平方根变换(ROOT-CA)和根-根变换(ROOTROOT-CA)的矩阵。对基于CA与基于PMI的方法的经验比较表明,ROOT-CA和ROOTROOT-CA的总体结果略优于基于PMI的方法。
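下面用 NumPy 给出 PPMI 矩阵构造、SVD 分解以及 ROOT-CA 所用平方根变换的最小示意(玩具计数矩阵为虚构数据;实际的 CA 还涉及行列质量加权,此处从略):

```python
import numpy as np

def ppmi(counts, eps=1e-12):
    """由词-上下文共现计数构造正逐点互信息(PPMI)矩阵。"""
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)   # 行(词)边缘概率
    p_c = p_wc.sum(axis=0, keepdims=True)   # 列(上下文)边缘概率
    pmi = np.log(np.maximum(p_wc, eps) / (p_w * p_c))
    return np.maximum(pmi, 0.0)             # 负值截断为 0,即常用的 PPMI

counts = np.array([[8., 2., 0.],
                   [1., 5., 3.],
                   [0., 2., 6.]])           # 虚构的共现计数
M = ppmi(counts)
U, s, Vt = np.linalg.svd(M)                 # 词嵌入可取 U 的前 k 列乘以奇异值
root_counts = np.sqrt(counts)               # ROOT-CA:先对条目做平方根变换再做 CA
```

这里的 SVD 直接作用于 PPMI 矩阵;论文比较的正是这一类分解与对变换后矩阵做 CA 的差异。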

[NLP-21] clembench-2024: A Challenging Dynamic Complementary Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents
[NLP-21] clembench-2024:一个具有挑战性的、动态互补的多语言基准,以及将LLM作为多动作智能体的底层灵活框架

链接: https://arxiv.org/abs/2405.20859
作者: Anne Beyer,Kranti Chalamalasetti,Sherzod Hakimov,Brielen Madureira,Philipp Sadler,David Schlangen
关键词: strategic goal orientation, Large Language Models, language understanding abilities, interactive game play, resulting interactive game
中文关键词: 战略目标定位、大型语言模型、语言理解能力、互动游戏玩法、由此产生的互动游戏
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:It has been established in recent work that Large Language Models (LLMs) can be prompted to “self-play” conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that the approach forms a good basis for making decisions on model choice for building applied interactive systems, and perhaps ultimately setting up a closed-loop development environment of system and simulated evaluator.
摘要:最近的研究表明,可以提示大型语言模型(LLM)"自我对弈"能够探测特定能力(一般指令遵循、战略目标导向、语言理解能力)的对话游戏,并对由此产生的交互式游戏过程自动评分。在本文中,我们采用其中一个已提出的框架来搭建此类游戏环境,并从多个维度进一步检验其作为评估工具的有用性:我们表明它可以在避免数据污染的同时轻松跟上新的发展;我们表明其中实现的测试尚未饱和(人类的表现远高于即使是最好的模型);我们还表明它适合研究其他问题,例如提示语言对性能的影响。我们相信,该方法为构建应用型交互系统时的模型选择决策提供了良好基础,并可能最终建立起系统与模拟评估器的闭环开发环境。

[NLP-22] Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning
[NLP-22] 基于多层次多粒度对比学习的口语理解

链接: https://arxiv.org/abs/2405.20852
作者: Xuxin Cheng,Wanshi Xu,Zhihong Zhu,Hongxiang Li,Yuexian Zou
关键词: Spoken language understanding, task-oriented dialogue systems, constructing semantic frames, user current goal, Spoken language
中文关键词: 口语理解、面向任务的对话系统、构建语义框架、用户当前目标、口语
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user’s current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot filling. Although there are some SLU frameworks joint modeling the two subtasks and achieving high performance, most of them still overlook the inherent relationships between intents and slots and fail to achieve mutual guidance between the two subtasks. To solve the problem, we propose a multi-level multi-grained SLU framework MMCL to apply contrastive learning at three levels, including utterance level, slot level, and word level to enable intent and slot to mutually guide each other. For the utterance level, our framework implements coarse granularity contrastive learning and fine granularity contrastive learning simultaneously. Besides, we also apply the self-distillation method to improve the robustness of the model. Experimental results and further analysis demonstrate that our proposed model achieves new state-of-the-art results on two public multi-intent SLU datasets, obtaining a 2.6 overall accuracy improvement on the MixATIS dataset compared to previous best models.
摘要:口语理解(SLU)是面向任务的对话系统中的核心任务,旨在通过构建语义框架来理解用户当前的目标。SLU通常由两个子任务组成:意图检测和槽填充。尽管已有一些SLU框架对这两个子任务进行联合建模并取得了较高性能,但它们大多忽略了意图与槽之间的内在联系,未能实现两个子任务之间的相互指导。为解决该问题,我们提出了多层次多粒度的SLU框架MMCL,在话语层、槽层和词层三个层次上应用对比学习,使意图与槽能够相互引导。在话语层面,我们的框架同时实现了粗粒度对比学习和细粒度对比学习。此外,我们还应用自蒸馏方法来提高模型的鲁棒性。实验结果和进一步的分析表明,我们提出的模型在两个公开的多意图SLU数据集上取得了新的最先进结果,在MixATIS数据集上的总体准确率较此前最佳模型提升了2.6。

[NLP-23] Improving Reward Models with Synthetic Critiques
[NLP-23] 利用合成批评改进奖励模型

链接: https://arxiv.org/abs/2405.20850
作者: Zihuiwen Ye,Fraser Greenlee-Scott,Max Bartolo,Phil Blunsom,Jon Ander Campos,Matthias Gallé
关键词: Reward models, play a critical, critical role, role in aligning, process of reinforcement
中文关键词: 奖励模型,在协调和强化过程中发挥着至关重要的作用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward models (RM) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This offers richer signals and more robust features for RMs to assess and score on. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models. Conversely, we also show that low-quality critiques negatively impact performance. Furthermore, incorporating critiques enhances the interpretability and robustness of RM training.
摘要:奖励模型(RM)通过"基于人类反馈的强化学习"过程,在对齐语言模型方面发挥着关键作用。RM被训练来预测反映人类偏好的分数,这需要耗费大量时间和成本进行人工标注。此外,RM往往很快过拟合训练集中的表面特征,妨碍其在未见分布上的泛化性能。我们提出了一种新方法,使用由大型语言模型生成的合成自然语言批评来提供额外反馈,评估指令遵循、正确性和风格等方面。这为RM的评估和打分提供了更丰富的信号和更鲁棒的特征。我们证明,高质量的批评能够提升由不同预训练模型初始化的RM的性能和数据效率;相反,低质量的批评会对性能产生负面影响。此外,纳入批评还能增强RM训练的可解释性和鲁棒性。

[NLP-24] Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models
[NLP-24] 别买账!重新评估对比多模态模型的广告理解能力

链接: https://arxiv.org/abs/2405.20846
作者: A. Bavaresco,A. Testoni,R. Fernández
关键词: unusual visual elements, Image-based advertisements, figurative language, complex multimodal stimuli, advertisements are complex
中文关键词: 不寻常的视觉元素、基于图像的广告、具象语言、复杂的多模式刺激、广告是复杂的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the main conference ACL 2024

点击查看摘要

Abstract:Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they “fool” four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs’ multimodal reasoning abilities. We make our code and TRADE available at this https URL .
摘要:基于图像的广告是复杂的多模态刺激,通常包含不寻常的视觉元素和比喻性语言。此前关于广告自动理解的研究报告称,对比式视觉-语言模型(VLM)在广告-解释检索任务上具有令人印象深刻的零样本准确率。在本文中,我们审视了原始的任务设置,并表明对比式VLM可以通过利用落地(grounding)启发式来解决该任务。为控制这一混淆因素,我们引入了TRADE,一个带有对抗性落地解释的新评估测试集。尽管这些解释在人类看来并不合理,但我们表明它们"愚弄"了四种不同的对比式VLM。我们的发现凸显了改进广告自动理解任务设计、以真正评估VLM多模态推理能力的必要性。我们的代码和TRADE可在此https URL获取。

[NLP-25] Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs
[NLP-25] 异常值和校准集对现代LLM量化的影响逐渐减弱

链接: https://arxiv.org/abs/2405.20835
作者: Davide Paglieri,Saurabh Dash,Tim Rocktäschel,Jack Parker-Holder
关键词: Large Language Models, Large Language, reduced memory usage, enabling faster operation, efficiency of Large
中文关键词: 大型语言模型,大型语言,减少内存使用,实现更快的操作,高效的大型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs) by enabling faster operation and compatibility with more accessible hardware through reduced memory usage, at the cost of small performance drops. We explore the role of calibration sets in PTQ, specifically their effect on hidden activations in various notable open-source LLMs. Calibration sets are crucial for evaluating activation magnitudes and identifying outliers, which can distort the quantization range and negatively impact performance. Our analysis reveals a marked contrast in quantization effectiveness across models. The older OPT model, which much of the quantization literature is based on, shows significant performance deterioration and high susceptibility to outliers with varying calibration sets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and Mistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity to outliers and stable activations. These findings suggest a shift in PTQ strategies might be needed. As advancements in pre-training methods reduce the relevance of outliers, there is an emerging need to reassess the fundamentals of current quantization literature. The emphasis should pivot towards optimizing inference speed, rather than primarily focusing on outlier preservation, to align with the evolving characteristics of state-of-the-art LLMs.
摘要:训练后量化(PTQ)通过减少内存使用来实现更快的运行速度以及与更易获得的硬件的兼容性,从而以较小的性能损失为代价提高大型语言模型(LLM)的效率。我们探索了校准集在PTQ中的作用,特别是它们对多个著名开源LLM中隐藏激活值的影响。校准集对于评估激活幅值和识别离群值至关重要,而离群值可能扭曲量化范围并对性能产生负面影响。我们的分析揭示了不同模型之间量化效果的显著差异。许多量化文献所基于的较旧的OPT模型表现出明显的性能退化,并且在不同校准集下对离群值高度敏感。相比之下,Llama-2 7B、Llama-3 8B、Command-R 35B和Mistral 7B等较新的模型表现出很强的鲁棒性,其中Mistral 7B对离群值几乎免疫,激活值也很稳定。这些发现表明,PTQ策略可能需要转变。随着预训练方法的进步降低了离群值的相关性,我们有必要重新审视当前量化文献的基本假设:重点应转向优化推理速度,而非主要关注离群值的保留,以与最先进LLM不断演变的特性保持一致。
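校准集中离群激活值对量化的影响,可以用最简单的逐张量 absmax 量化来演示(纯示意,并非任何具体 PTQ 算法的实现):一个离群值会拉大量化步长,从而放大其余元素的舍入误差。

```python
import numpy as np

def absmax_quantize(x, bits=8):
    """对称 absmax 量化再反量化,返回近似值(示意实现)。"""
    qmax = 2 ** (bits - 1) - 1              # int8 时为 127
    scale = np.abs(x).max() / qmax          # 量化步长由最大绝对值决定
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

x = np.array([0.1, -0.2, 0.05, 0.15])       # 虚构的"正常"激活值
x_outlier = np.append(x, 20.0)              # 混入一个离群激活值
err_clean = np.abs(absmax_quantize(x) - x).max()
err_outlier = np.abs(absmax_quantize(x_outlier)[:-1] - x).max()
```

在这个玩具例子中,`err_outlier` 比 `err_clean` 大一到两个数量级,这正是离群值扭曲量化范围的直观体现。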

[NLP-26] That's Optional: A Contemporary Exploration of "that" Omission in English Subordinate Clauses
[NLP-26] That可省略:英语从属分句中"that"省略的当代探索

链接: https://arxiv.org/abs/2405.20833
作者: Ella Rabinovich
关键词: Uniform Information Density, uniform information profile, Uniform Information, Information Density, information profile
中文关键词: 统一信息密度,统一信息概况,统一信息,信息密度,信息概况
类目: Computation and Language (cs.CL)
备注: ACL2024 (main conference), 8 pages

点击查看摘要

Abstract:The Uniform Information Density (UID) hypothesis posits that speakers optimize the communicative properties of their utterances by avoiding spikes in information, thereby maintaining a relatively uniform information profile over time. This paper investigates the impact of UID principles on syntactic reduction, specifically focusing on the optional omission of the connector “that” in English subordinate clauses. Building upon previous research, we extend our investigation to a larger corpus of written English, utilize contemporary large language models (LLMs) and extend the information-uniformity principles by the notion of entropy, to estimate the UID manifestations in the usecase of syntactic reduction choices.
摘要:统一信息密度(UID)假说认为,说话者通过避免信息峰值来优化其话语的传播属性,从而使信息分布随时间保持相对均匀。本文研究UID原则对句法简化的影响,特别关注英语从属分句中连接词"that"的可选省略。在先前研究的基础上,我们将调查扩展到更大规模的书面英语语料库,利用当代大型语言模型(LLM),并借助熵的概念扩展信息均匀性原则,以估计句法简化选择这一用例中的UID表现。
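UID 假说关注的"信息峰值"可以用逐词惊异度(surprisal,即 -log2 p)来刻画。下面的 Python 草稿用虚构的条件概率说明:省略 "that" 可能使后续词的概率骤降,从而在信息轮廓上产生峰值(概率数值纯属示意,并非论文的测量结果)。

```python
import math

def surprisal_profile(probs):
    """将逐词条件概率序列转换为逐词惊异度(单位:比特)。"""
    return [-math.log2(p) for p in probs]

# 虚构示例:保留 "that" 时各词概率较平稳;省略后第二个词概率骤降
with_that = surprisal_profile([0.5, 0.4, 0.45, 0.5])
without_that = surprisal_profile([0.5, 0.05, 0.45, 0.5])
peak_with = max(with_that)
peak_without = max(without_that)
```

按 UID 假说,说话者会倾向于在 `peak_without` 这类峰值较高的语境中保留 "that",以拉平信息轮廓。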

[NLP-27] Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment
[NLP-27] 自增强偏好优化:用于语言模型对齐的离策略范式

链接: https://arxiv.org/abs/2405.20830
作者: Yueqin Yin,Zhendong Wang,Yujia Xie,Weizhu Chen,Mingyuan Zhou
关键词: Direct Preference Optimization, Traditional language model, Preference Optimization, Ratio Preference Optimization, pre-collected paired preference
中文关键词: 直接偏好优化、传统语言模型、偏好优化、比例偏好优化、预先收集的配对偏好
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization, and outperforms offline self-play methods like SPIN. Our code is available at this https URL
摘要:传统的语言模型对齐方法,如直接偏好优化(DPO),受限于其对静态的、预先收集的成对偏好数据的依赖,这限制了它们的适应性和实际适用性。为克服这一局限,我们引入了自增强偏好优化(SAPO),这是一种不需要现有成对数据的有效且可扩展的训练范式。在能够自主生成负面回复的自博弈(self-play)概念基础上,我们进一步引入离策略(off-policy)学习流程,以加强数据的探索与利用。具体而言,我们将指数移动平均(EMA)模型与重放缓冲区结合使用,实现回复片段的动态更新,从而将实时反馈与历史数据中的洞察有效结合。我们在Open LLM Leaderboard、IFEval、AlpacaEval 2.0和MT-Bench等基准上对LLaMA3-8B和Mistral-7B模型进行的全面评估表明,SAPO达到或超过了DPO、优势比偏好优化(Odds Ratio Preference Optimization)等已有的离线对比基线,并优于SPIN等离线自博弈方法。我们的代码可从以下HTTPS URL获得
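SAPO 中提到的指数移动平均(EMA)模型,其参数更新规则可以用几行 Python 表达(示意性实现,参数以字典表示;衰减系数 0.99 为假设值,与论文的具体超参数无关):

```python
def ema_update(ema_params, params, decay=0.99):
    """用当前模型参数对 EMA 影子模型做一次平滑更新。"""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in params}

ema = {"w": 0.0}
for _ in range(3):                  # 当前模型参数恒为 1.0,EMA 缓慢向其靠拢
    ema = ema_update(ema, {"w": 1.0})
```

在这类方法中,通常在每个优化步之后调用一次此类更新,并用变化更平缓的 EMA 模型(而非最新模型)来生成或评估回复片段。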

[NLP-28] An iterated learning model of language change that mixes supervised and unsupervised learning
[NLP-28] 混合有监督和无监督学习的语言变化迭代学习模型

链接: https://arxiv.org/abs/2405.20818
作者: Jack Bunyan,Seth Bullock,Conor Houghton
关键词: iterated learning model, language change, tutor, language transmission bottleneck, pupil
中文关键词: 迭代学习模型、语言变化、导师、语言传播瓶颈、学生
类目: Computation and Language (cs.CL); Adaptation and Self-Organizing Systems (nlin.AO); Populations and Evolution (q-bio.PE)
备注:

点击查看摘要

Abstract:The iterated learning model is an agent-based model of language change in which language is transmitted from a tutor to a pupil which itself becomes a tutor to a new pupil, and so on. Languages that are stable, expressive, and compositional arise spontaneously as a consequence of a language transmission bottleneck. Previous models have implemented an agent’s mapping from signals to meanings using an artificial neural network decoder, but have relied on an unrealistic and computationally expensive process of obversion to implement the associated encoder, mapping from meanings to signals. Here, a new model is presented in which both decoder and encoder are neural networks, trained separately through supervised learning, and trained together through unsupervised learning in the form of an autoencoder. This avoids the substantial computational burden entailed in obversion and introduces a mixture of supervised and unsupervised learning as observed during human development.
摘要:迭代学习模型是一种基于智能体的语言演变模型:语言由导师传授给学生,该学生随后又成为新学生的导师,依此类推。稳定、有表达力且具组合性的语言会作为语言传播瓶颈的结果自发产生。以往的模型使用人工神经网络解码器实现智能体从信号到意义的映射,但依赖一种不切实际且计算代价高昂的"倒置"(obversion)过程来实现相应的编码器(从意义到信号的映射)。本文提出了一个新模型,其中解码器和编码器都是神经网络,既通过监督学习分别训练,又以自编码器的形式通过无监督学习联合训练。这避免了obversion带来的巨大计算负担,并引入了与人类发展过程中观察到的情形相似的监督与无监督学习的混合。

[NLP-29] Multilingual Text Style Transfer: Datasets & Models for Indian Languages
[NLP-29] 多语言文本风格迁移:面向印度语言的数据集与模型

链接: https://arxiv.org/abs/2405.20805
作者: Sourabrata Mukherjee,Atul Kr. Ojha,Akanksha Bansal,Deepak Alok,John P. McCrae,Ondřej Dušek
关键词: Text style transfer, involves altering, core content, Text style, altering the linguistic
中文关键词: 文本风格转移,涉及改变核心内容、文本风格、改变语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text style transfer (TST) involves altering the linguistic style of a text while preserving its core content. This paper focuses on sentiment transfer, a vital TST subtask (Mukherjee et al., 2022a), across a spectrum of Indian languages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous work on English-Bangla sentiment transfer (Mukherjee et al., 2023). We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages. We then evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches, including the Llama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the significance of parallel data in TST and demonstrate the effectiveness of the Masked Style Filling (MSF) approach (Mukherjee et al., 2023) in non-parallel techniques. Moreover, cross-lingual and joint multilingual learning methods show promise, offering insights into selecting optimal models tailored to the specific language and task requirements. To the best of our knowledge, this work represents the first comprehensive exploration of the TST task as sentiment transfer across a diverse set of languages.
摘要:文本风格迁移(TST)是指在保留文本核心内容的同时改变其语言风格。本文聚焦于情感迁移这一重要的TST子任务(Mukherjee等,2022a),覆盖一系列印度语言:印地语、马加希语、马拉雅拉姆语、马拉提语、旁遮普语、奥里亚语、泰卢固语和乌尔都语,是对此前英语-孟加拉语情感迁移工作(Mukherjee等,2023)的扩展。我们为这八种语言各自引入了包含1,000条正面与1,000条负面风格平行句的专用数据集。随后,我们评估了多种基准模型的性能,涵盖平行、非平行、跨语言和共享学习方法,包括Llama2和GPT-3.5大型语言模型(LLM)。我们的实验突出了平行数据在TST中的重要性,并证明了掩码风格填充(MSF)方法(Mukherjee等,2023)在非平行技术中的有效性。此外,跨语言和联合多语言学习方法展现出潜力,为根据特定语言和任务需求选择最优模型提供了见解。据我们所知,这项工作是首次对作为情感迁移的TST任务在如此多样的语言集合上进行全面探索。

[NLP-30] Ovis: Structural Embedding Alignment for Multimodal Large Language Model
[NLP-30] Ovis:多模式大型语言模型的结构嵌入对齐

链接: https://arxiv.org/abs/2405.20797
作者: Shiyin Lu,Yang Li,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Han-Jia Ye
关键词: Large Language Models, Current Multimodal Large, Multimodal Large Language, Large Language, pre-trained LLM
中文关键词: 大型语言模型,当前多模式大型,多模式大型语言,大型语言,预培训LLM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs – the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder – makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks demonstrate that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis’ structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Both the source code and the training dataset of Ovis will be made publicly available.
摘要:当前的多模态大语言模型(MLLM)通常通过连接器(如MLP)将一个预训练的LLM与另一个预训练的视觉Transformer集成在一起,从而赋予LLM视觉能力。然而,MLLM中两种嵌入策略之间的失配——基于嵌入查找表的结构化文本嵌入和由视觉编码器直接生成的连续嵌入——给视觉和文本信息的更无缝融合带来了挑战。我们提出了Ovis,这是一种新颖的MLLM架构,旨在在结构上对齐视觉和文本嵌入。Ovis将一个额外的可学习视觉嵌入表集成到视觉编码器的处理过程中。为了捕获丰富的视觉语义,每个图像块多次索引视觉嵌入表,最终的视觉嵌入是被索引嵌入的概率组合。这种结构化方法与生成文本嵌入的方法相呼应。在各种多模态基准上的经验评估表明,Ovis的性能优于参数规模相近的开源MLLM,甚至总体上超过了专有模型Qwen-VL-Plus。这些结果突出了Ovis的结构化视觉表示在推进MLLM架构设计和促进更有效的多模态学习方面的潜力。Ovis的源代码和训练数据集都将公开提供。
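摘要中"每个图像块多次索引视觉嵌入表,最终嵌入是被索引嵌入的概率组合"可用如下极简草图演示。表的行数、维度与数值均为演示用假设,并非 Ovis 的真实参数:

```python
import math

# 假设的视觉嵌入表:4 行、维度为 3 的可学习向量(数值仅为演示)。
TABLE = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def patch_embedding(logits, table):
    # 图像块对表做"软索引":最终嵌入是按概率加权的各行组合,
    # 在结构上对应文本 token 查询文本嵌入表的方式。
    probs = softmax(logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]

emb = patch_embedding([2.0, 0.1, 0.1, 0.5], TABLE)
print(emb)
```

软索引使视觉嵌入与文本嵌入共享"查表"这一结构,这正是标题中"结构嵌入对齐"的含义。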

[NLP-31] PGA-SciRE: Harnessing LLM on Data Augmentation for Enhancing Scientific Relation Extraction
[NLP-31] PGA-SciRE:利用LLM数据增强来增强科学关系提取

链接: https://arxiv.org/abs/2405.20787
作者: Yang Zhou,Shimin Shan,Hongkui Wei,Zhehuan Zhao,Wenshuo Feng
关键词: Relation Extraction, aims at recognizing, pairs of entities, entities mentioned, original training set
中文关键词: 关系提取,旨在识别、实体对、提到的实体、原始训练集
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Relation Extraction (RE) aims at recognizing the relation between pairs of entities mentioned in a text. Advances in LLMs have had a tremendous impact on NLP. In this work, we propose a textual data augmentation framework called PGA for improving the performance of models for RE in the scientific domain. The framework introduces two ways of data augmentation, utilizing a LLM to obtain pseudo-samples with the same sentence meaning but with different representations and forms by paraphrasing the original training set samples. As well as instructing LLM to generate sentences that implicitly contain information about the corresponding labels based on the relation and entity of the original training set samples. These two kinds of pseudo-samples participate in the training of the RE model together with the original dataset, respectively. The PGA framework in the experiment improves the F1 scores of the three mainstream models for RE within the scientific domain. Also, using a LLM to obtain samples can effectively reduce the cost of manually labeling data.
摘要:关系抽取(RE)的目的是识别文本中提到的实体对之间的关系。大型语言模型(LLM)的进步对自然语言处理产生了巨大影响。在这项工作中,我们提出了一个称为PGA的文本数据增强框架,以提高科学领域中RE模型的性能。该框架引入了两种数据增强方式:一是利用LLM对原始训练集样本进行释义,获得句子含义相同但表示和形式不同的伪样本;二是指示LLM根据原始训练集样本的关系和实体,生成隐含相应标签信息的句子。这两种伪样本分别与原始数据集一起参与RE模型的训练。实验表明,PGA框架提高了科学领域内三种主流RE模型的F1分数。此外,使用LLM获取样本可以有效降低人工标注数据的成本。
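PGA 的两种增强方式本质上是两类提示模板:释义原句,或让 LLM 依据实体与关系生成隐含标签信息的新句。下面是一个假设的模板草图(具体措辞并非论文原文,示例实体与关系亦为假设):

```python
def paraphrase_prompt(sentence):
    # 增强方式一:请 LLM 在保持句义不变的前提下改写句子。
    return f'Paraphrase the sentence, keeping its meaning: "{sentence}"'

def label_aware_prompt(head, tail, relation):
    # 增强方式二:请 LLM 生成一个隐含给定实体关系的新句子。
    return (f'Write a sentence in which "{head}" and "{tail}" '
            f'implicitly stand in the relation "{relation}".')

print(label_aware_prompt("BERT", "NER", "USED-FOR"))
```

LLM 对这两类提示的回复即为伪样本,与原始数据集一同参与 RE 模型训练。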

[NLP-32] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
[NLP-32] 针对医学多模态大型语言模型的跨模态越狱与不匹配攻击

链接: https://arxiv.org/abs/2405.20775
作者: Xijie Huang,Xinyuan Wang,Hantao Zhang,Jiawen Xi,Jingkun An,Hao Wang,Chengwei Pan
关键词: Multimodal Large Language, Large Language Models, remain insufficiently studied, Large Language, Language Models
中文关键词: 多模式大型语言、大型语言模型、研究不足、大型语言、语言模型
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Security concerns related to Large Language Models (LLMs) have been extensively explored, yet the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain insufficiently studied. This paper delves into the underexplored security vulnerabilities of MedMLLMs, especially when deployed in clinical environments where the accuracy and relevance of question-and-answer interactions are critically tested against complex medical challenges. By combining existing clinical medical data with atypical natural phenomena, we redefine two types of attacks: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack). Using our own constructed voluminous 3MAD dataset, which covers a wide range of medical image modalities and harmful medical scenarios, we conduct a comprehensive analysis and propose the MCM optimization method, which significantly enhances the attack success rate on MedMLLMs. Evaluations with this dataset and novel attack methods, including white-box attacks on LLaVA-Med and transfer attacks on four other state-of-the-art models, indicate that even MedMLLMs designed with enhanced security features are vulnerable to security breaches. Our work underscores the urgent need for a concerted effort to implement robust security measures and enhance the safety and efficacy of open-source MedMLLMs, particularly given the potential severity of jailbreak attacks and other malicious or clinically significant exploits in medical settings. For further research and replication, anonymous access to our code is available at this https URL. Warning: Medical large model jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.
摘要:与大语言模型(LLM)相关的安全问题已经得到了广泛研究,但多模态大语言模型(MLLM)的安全影响,特别是在医学场景下(MedMLLM),仍然缺乏足够的研究。本文深入研究了MedMLLM尚未被充分探索的安全漏洞,特别是当其部署在临床环境中时,问答交互的准确性和相关性需要经受复杂医疗挑战的严格检验。结合已有的临床医学数据和非典型自然现象,我们重新定义了两类攻击:失配恶意攻击(2M-attack)和优化失配恶意攻击(O2M-attack)。利用我们构建的涵盖多种医学图像模态和有害医疗场景的大规模3MAD数据集,我们进行了综合分析,并提出了MCM优化方法,显著提高了对MedMLLM的攻击成功率。使用该数据集和新的攻击方法(包括对LLaVA-Med的白盒攻击和对其他四种最先进模型的迁移攻击)的评估表明,即使是设计了增强安全特性的MedMLLM也容易出现安全漏洞。我们的工作强调,迫切需要共同努力实施强有力的安全措施,提高开源MedMLLM的安全性和有效性,特别是考虑到越狱攻击以及医疗环境中其他恶意或具有临床意义的利用的潜在严重性。为了便于进一步研究和复现,可通过此HTTPS URL匿名访问我们的代码。警告:医学大模型越狱可能会生成包含未经验证的诊断和治疗建议的内容。请务必咨询专业医疗意见。

[NLP-33] Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent
[NLP-33] 大型语言模型哨兵:通过LLM智能体提升对抗鲁棒性

链接: https://arxiv.org/abs/2405.20770
作者: Guang Lin,Qibin Zhao
关键词: large language models, LAnguage MOdel Sentinel, large language, past two years, advanced rapidly
中文关键词: 大型语言模型,LAnguage MOdel Sentinel,大型语言,过去两年,快速进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Over the past two years, the use of large language models (LLMs) has advanced rapidly. While these LLMs offer considerable convenience, they also raise security concerns, as LLMs are vulnerable to adversarial attacks by some well-designed textual perturbations. In this paper, we introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is designed to enhance the adversarial robustness of LLMs by purifying the adversarial textual examples before feeding them into the target LLM. Our method comprises two main components: a) Agent instruction, which can simulate a new agent for adversarial defense, altering minimal characters to maintain the original meaning of the sentence while defending against attacks; b) Defense guidance, which provides strategies for modifying clean or adversarial examples to ensure effective defense and accurate outputs from the target LLMs. Remarkably, the defense agent demonstrates robust defensive capabilities even without learning from adversarial examples. Additionally, we conduct an intriguing adversarial experiment where we develop two agents, one for attack and one for defense, and engage them in mutual confrontation. During the adversarial interactions, neither agent completely beat the other. Extensive experiments on both open-source and closed-source LLMs demonstrate that our method effectively defends against adversarial attacks, thereby enhancing adversarial robustness.
摘要:在过去的两年里,大型语言模型(LLM)的使用得到了迅速发展。虽然这些LLM提供了相当大的便利,但它们也引发了安全问题,因为LLM容易受到一些精心设计的文本扰动的对抗攻击。本文介绍了一种新的防御技术——大语言模型哨兵(LLAMOS),该技术旨在通过在将对抗性文本样例输入目标LLM之前对其进行净化来增强LLM的对抗鲁棒性。我们的方法包括两个主要部分:a)智能体指令,它可以模拟一个新的智能体进行对抗防御,在防御攻击的同时只改动最少的字符以保持句子的原始含义;b)防御指导,它提供修改干净或对抗性样例的策略,以确保目标LLM的有效防御和准确输出。值得注意的是,即使没有从对抗样例中学习,防御智能体也展示了强大的防御能力。此外,我们还进行了一个有趣的对抗实验:我们开发了两个智能体,一个用于攻击,一个用于防御,并让它们相互对抗。在对抗交互中,两个智能体都没有完全击败对方。在开源和闭源LLM上的大量实验表明,我们的方法能有效防御对抗攻击,从而增强了对抗鲁棒性。
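"以最少的字符改动净化对抗样本"这一思路可用一个极简的字符归一化草图说明。同形字符映射表为假设数据,真实系统中这一净化由 LLM 智能体完成,而非固定查表:

```python
# 玩具净化器:把可疑的字符替换归一化,同时统计改动了多少字符。
# 同形字符映射表为假设数据,代替论文中由 LLM 智能体执行的编辑。
HOMOGLYPHS = {"0": "o", "1": "l", "@": "a", "$": "s"}

def purify(text):
    cleaned = "".join(HOMOGLYPHS.get(c, c) for c in text)
    edits = sum(a != b for a, b in zip(text, cleaned))
    return cleaned, edits

cleaned, edits = purify("ign0re a11 previous instructi0ns")
print(cleaned, edits)
```

净化后的文本再送入目标 LLM,攻击扰动被移除而句义保持不变。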

[NLP-34] Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario
[NLP-34] 通过母语样本混合改进代码混合仇恨检测:印地语-英语代码混合场景的案例研究

链接: https://arxiv.org/abs/2405.20755
作者: Debajyoti Mazumder,Aakash Kumar,Jasabanta Patro
关键词: Hate, code-mixed hate, code-mixed, Hate detection, NLP community
中文关键词: 仇恨,代码混合仇恨,代码混合,仇恨检测,NLP社区
类目: Computation and Language (cs.CL)
备注: Generated from XeLaTeX

点击查看摘要

Abstract:Hate detection has long been a challenging task for the NLP community. The task becomes complex in a code-mixed environment because the models must understand the context and the hate expressed through language alteration. Compared to the monolingual setup, we see very less work on code-mixed hate as large-scale annotated hate corpora are unavailable to make the study. To overcome this bottleneck, we propose using native language hate samples. We hypothesise that in the era of multilingual language models (MLMs), hate in code-mixed settings can be detected by majorly relying on the native language samples. Even though the NLP literature reports the effectiveness of MLMs on hate detection in many cross-lingual settings, their extensive evaluation in a code-mixed scenario is yet to be done. This paper attempts to fill this gap through rigorous empirical experiments. We considered the Hindi-English code-mixed setup as a case study as we have the linguistic expertise for the same. Some of the interesting observations we got are: (i) adding native hate samples in the code-mixed training set, even in small quantity, improved the performance of MLMs for code-mixed hate detection, (ii) MLMs trained with native samples alone observed to be detecting code-mixed hate to a large extent, (iii) The visualisation of attention scores revealed that, when native samples were included in training, MLMs could better focus on the hate emitting words in the code-mixed context, and (iv) finally, when hate is subjective or sarcastic, naively mixing native samples doesn’t help much to detect code-mixed hate. We will release the data and code repository to reproduce the reported results.
摘要:仇恨检测长期以来一直是NLP社区的一项具有挑战性的任务。在代码混合(code-mixed)环境中,这一任务变得更加复杂,因为模型必须理解上下文以及通过语言转换表达的仇恨。与单语场景相比,代码混合仇恨方面的工作非常少,因为缺乏大规模标注的仇恨语料库来支撑研究。为了克服这一瓶颈,我们建议使用母语仇恨样本。我们假设,在多语言语言模型(MLM)时代,主要依靠母语样本即可检测代码混合环境中的仇恨。尽管NLP文献报道了MLM在许多跨语言环境中对仇恨检测的有效性,但它们在代码混合场景中的系统评估尚未完成。本文试图通过严谨的实证实验来填补这一空白。由于我们具备相应的语言专业知识,我们以印地语-英语代码混合场景作为案例研究。我们得到的一些有趣的观察结果是:(i)在代码混合训练集中加入母语仇恨样本,即使数量很少,也能提升MLM的代码混合仇恨检测性能;(ii)仅用母语样本训练的MLM在很大程度上也能检测出代码混合仇恨;(iii)注意力分数的可视化显示,当训练中包含母语样本时,MLM能更好地关注代码混合语境中表达仇恨的词;(iv)最后,当仇恨是主观的或讽刺性的时,简单地混入母语样本对检测代码混合仇恨帮助不大。我们将发布数据和代码库,以复现报告的结果。
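"母语样本混合"的做法非常直接:按一定比例向代码混合训练集中加入母语仇恨样本。下面是一个假设的示意实现,比例与样本均为演示用:

```python
import random

def mix_training_set(code_mixed, native, ratio, seed=0):
    # 按 ratio 比例从母语仇恨样本中抽样,混入代码混合训练集
    # (即论文中的 native sample mixing;样本为 (文本, 标签) 对)。
    rng = random.Random(seed)
    k = min(int(len(code_mixed) * ratio), len(native))
    return code_mixed + rng.sample(native, k)

cm = [("code-mixed sample %d" % i, 0) for i in range(10)]
nat = [("native hate sample %d" % i, 1) for i in range(5)]
train = mix_training_set(cm, nat, ratio=0.2)
print(len(train))  # 10 + 2
```

论文的观察(i)对应的正是 ratio 很小时的设置:少量母语样本即可带来提升。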

[NLP-35] FinGen: A Dataset for Argument Generation in Finance
[NLP-35] FinGen:金融领域论点生成数据集

链接: https://arxiv.org/abs/2405.20708
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词: daily life, important activities, activities that people, Thinking, Abstract
中文关键词: 日常生活、重要活动、人们的活动、思考、抽象
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Thinking about the future is one of the important activities that people do in daily life. Futurists also pay a lot of effort into figuring out possible scenarios for the future. We argue that the exploration of this direction is still in an early stage in the NLP research. To this end, we propose three argument generation tasks in the financial application scenario. Our experimental results show these tasks are still big challenges for representative generation models. Based on our empirical results, we further point out several unresolved issues and challenges in this research direction.
摘要:思考未来是人们日常生活中的重要活动之一。未来学家也付出了大量努力来推演未来可能出现的情景。我们认为,NLP研究对这一方向的探索仍处于早期阶段。为此,我们在金融应用场景中提出了三个论点生成任务。我们的实验结果表明,这些任务对于有代表性的生成模型来说仍然是巨大的挑战。基于实证结果,我们进一步指出了这一研究方向上尚未解决的若干问题和挑战。

[NLP-36] It is Simple Sometimes: A Study On Improving Aspect-Based Sentiment Analysis Performance
[NLP-36] 有时很简单:关于提升基于方面的情感分析性能的研究

链接: https://arxiv.org/abs/2405.20703
作者: Laura Cabello,Uchenna Akujuobi
关键词: Aspect-Based Sentiment Analysis, Sentiment Analysis, involves extracting opinions, Aspect-Based Sentiment, involves extracting
中文关键词: 基于目标的情感分析,情感分析,涉及提取意见,基于目标的情感,涉及提取
类目: Computation and Language (cs.CL)
备注: Accepted to ACL Findings 2024

点击查看摘要

Abstract:Aspect-Based Sentiment Analysis (ABSA) involves extracting opinions from textual data about specific entities and their corresponding aspects through various complementary subtasks. Several prior research has focused on developing ad hoc designs of varying complexities for these subtasks. In this paper, we present a generative framework extensible to any ABSA subtask. We build upon the instruction tuned model proposed by Scaria et al. (2023), who present an instruction-based model with task descriptions followed by in-context examples on ABSA subtasks. We propose PFInstruct, an extension to this instruction learning paradigm by appending an NLP-related task prefix to the task description. This simple approach leads to improved performance across all tested SemEval subtasks, surpassing previous state-of-the-art (SOTA) on the ATE subtask (Rest14) by +3.28 F1-score, and on the AOOE subtask by an average of +5.43 F1-score across SemEval datasets. Furthermore, we explore the impact of the prefix-enhanced prompt quality on the ABSA subtasks and find that even a noisy prefix enhances model performance compared to the baseline. Our method also achieves competitive results on a biomedical domain dataset (ERSA).
摘要:基于方面的情感分析(ABSA)涉及通过各种互补的子任务,从文本数据中抽取关于特定实体及其相应方面的观点。先前的一些研究专注于为这些子任务开发复杂程度各异的专门设计。在本文中,我们提出了一个可扩展到任何ABSA子任务的生成式框架。我们建立在Scaria等人(2023)提出的指令调优模型之上,该模型基于指令,包含任务描述以及ABSA子任务的上下文示例。我们提出了PFInstruct,通过在任务描述前附加一个NLP相关的任务前缀来扩展这种指令学习范式。这一简单方法提升了所有被测SemEval子任务的性能:在ATE子任务(Rest14)上超过此前最先进水平(SOTA)+3.28 F1分数,在AOOE子任务上跨SemEval数据集平均提升+5.43 F1分数。此外,我们探索了前缀增强的提示质量对ABSA子任务的影响,发现与基线相比,即使是带噪声的前缀也能提升模型性能。我们的方法在生物医学领域数据集(ERSA)上也取得了有竞争力的结果。
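PFInstruct 的核心改动只是在任务描述前附加一个 NLP 相关的任务前缀,再接上下文示例与查询。下面的拼接模板为假设写法,仅用于说明结构,并非论文的原始模板:

```python
def pfinstruct_prompt(prefix_task, definition, examples, query):
    # 在任务描述前附加 NLP 相关任务前缀(PFInstruct 的做法),
    # 随后是上下文示例与待预测输入。
    lines = [f"Task prefix: {prefix_task}", f"Definition: {definition}"]
    lines += [f"Example: {inp} -> {out}" for inp, out in examples]
    lines.append(f"Input: {query} ->")
    return "\n".join(lines)

prompt = pfinstruct_prompt(
    "Named entity recognition",                     # 假设的前缀任务
    "Extract the aspect terms from the sentence.",  # ABSA 子任务描述
    [("great battery life", "battery life")],
    "the screen is too dim",
)
print(prompt)
```

论文的发现是:即使前缀任务与 ABSA 关系不大("带噪声的前缀"),这种拼接也常能带来提升。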

[NLP-37] Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization for Prompt Enhancement
[NLP-37] 揭示LLM的词汇敏感性:面向提示增强的组合优化

链接: https://arxiv.org/abs/2405.20701
作者: Pengwei Zhan,Zhen Xu,Qian Tan,Jie Song,Ru Xie
关键词: Large language models, Large language, demonstrate exceptional instruct-following, demonstrate exceptional, Large
中文关键词: 大型语言模型,大型语言,表现出出色的预算遵循性,表现出出色的,大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate exceptional instruct-following ability to complete various downstream tasks. Although this impressive ability makes LLMs flexible task solvers, their performance in solving tasks also heavily relies on instructions. In this paper, we reveal that LLMs are over-sensitive to lexical variations in task instructions, even when the variations are imperceptible to humans. By providing models with neighborhood instructions, which are closely situated in the latent representation space and differ by only one semantically similar word, the performance on downstream tasks can be vastly different. Following this property, we propose a black-box Combinatorial Optimization framework for Prompt Lexical Enhancement (COPLE). COPLE performs iterative lexical optimization according to the feedback from a batch of proxy tasks, using a search strategy related to word influence. Experiments show that even widely-used human-crafted prompts for current benchmarks suffer from the lexical sensitivity of models, and COPLE recovers the declined model ability in both instruct-following and solving downstream tasks.
摘要:大型语言模型(LLM)在完成各种下游任务时表现出出色的指令跟随能力。尽管这种令人印象深刻的能力使LLM成为灵活的任务求解器,但它们解决任务的表现也严重依赖于指令。在本文中,我们揭示了LLM对任务指令中的词汇变化过于敏感,即使这些变化对人类来说难以察觉。给模型提供"邻域指令"(它们在潜在表示空间中位置接近,仅有一个语义相似的词不同),下游任务上的性能可能大不相同。基于这一性质,我们提出了一种用于提示词汇增强的黑盒组合优化框架(COPLE)。COPLE根据一批代理任务的反馈,采用与词影响相关的搜索策略进行迭代的词汇优化。实验表明,即使是当前基准中广泛使用的人工编写提示也受到模型词汇敏感性的影响,而COPLE在指令跟随和下游任务求解两方面都恢复了下降的模型能力。
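COPLE 的搜索过程可以压缩成一个贪心草图:逐词尝试近义替换,若代理任务得分上升就保留该替换。近义词表与打分函数均为假设的玩具替身,代替论文中基于词影响的候选集与一批代理任务的反馈:

```python
# 贪心词汇搜索草图:近义词表与代理打分函数均为演示用假设数据。
SYNONYMS = {"answer": ["respond", "reply"], "briefly": ["concisely", "shortly"]}

def proxy_score(prompt):
    # 假设的代理反馈:这里简单地偏好某些词,真实场景由代理任务的表现给出。
    return prompt.count("respond") + prompt.count("concisely")

def lexical_search(prompt):
    words = prompt.split()
    best = proxy_score(prompt)
    for i, w in enumerate(list(words)):        # 对初始词序列逐位尝试替换
        for cand in SYNONYMS.get(w, []):
            trial = words[:i] + [cand] + words[i + 1:]
            s = proxy_score(" ".join(trial))
            if s > best:                        # 得分上升才保留该替换
                words, best = trial, s
    return " ".join(words), best

best_prompt, score = lexical_search("answer the question briefly")
print(best_prompt, score)
```

每一步只换一个语义相近的词,正对应论文所利用的"邻域指令"性质。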

[NLP-38] Joint Embeddings for Graph Instruction Tuning
[NLP-38] 面向图指令调优的联合嵌入

链接: https://arxiv.org/abs/2405.20684
作者: Vlad Argatu,Aaron Haag,Oliver Lohse
关键词: Large Language Models, Large Language, building smart assistants, achieved impressive performance, Language Models
中文关键词: 大型语言模型,大型语言,构建智能助手,取得了令人印象深刻的性能,语言模型
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Conference Preprint

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance in text understanding and have become an essential tool for building smart assistants. Originally focusing on text, they have been enhanced with multimodal capabilities in recent works that successfully built visual instruction following assistants. As far as the graph modality goes, however, no such assistants have yet been developed. Graph structures are complex in that they represent relation between different features and are permutation invariant. Moreover, representing them in purely textual form does not always lead to good LLM performance even for finetuned models. As a result, there is a need to develop a new method to integrate graphs in LLMs for general graph understanding. This work explores the integration of the graph modality in LLM for general graph instruction following tasks. It aims at producing a deep learning model that enhances an underlying LLM with graph embeddings and trains it to understand them and to produce, given an instruction, an answer grounded in the graph representation. The approach performs significantly better than a graph to text approach and remains consistent even for larger graphs.
摘要:大语言模型(LLM)在文本理解方面取得了令人印象深刻的性能,并已成为构建智能助手的重要工具。它们最初专注于文本,而最近成功构建视觉指令跟随助手的工作为其增强了多模态能力。然而,就图模态而言,目前还没有开发出这样的助手。图结构是复杂的,因为它们表示不同特征之间的关系,并且具有置换不变性。此外,即使对于微调后的模型,以纯文本形式表示图也并不总能带来良好的LLM性能。因此,需要开发一种将图集成到LLM中的新方法,以实现通用的图理解。这项工作探索了在LLM中集成图模态,用于通用的图指令跟随任务。其目标是构建一个深度学习模型:用图嵌入增强底层LLM,训练其理解图,并在给定指令时产生基于图表示的回答。该方法的性能明显优于图转文本的方法,并且即使对于较大的图也保持稳定。

[NLP-39] Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models
[NLP-39] 解析并缓解检索增强大型语言模型中的检索器不一致性

链接: https://arxiv.org/abs/2405.20680
作者: Mingda Li,Xinyu Li,Yifan Chen,Wenfeng Xuan,Weinan Zhang
关键词: Large Language Models, Retrieval-Augmented Large Language, original retrieval-free Language, Large Language, retrieval-free Language Models
中文关键词: 大型语言模型、检索增强大型语言、原始免检索语言、大型语言、免检索语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2024 (findings)

点击查看摘要

Abstract:Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their superiority in terms of factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LM but also among different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on our decomposition reveals that the innate difference in knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing from our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors. Our experiments on Open Domain Question Answering show that EoR substantially improves performance over the RALM with a single retriever by considerably reducing inconsistent behaviors.
摘要:尽管检索增强大型语言模型(RALM)在事实性方面显示出优越性,但它们并不总是优于原始的无检索语言模型(LM)。我们的实验表明,这种样例级别的性能不一致不仅存在于检索增强LM和无检索LM之间,也存在于不同的检索器之间。为了理解这一现象,我们研究了RALM的退化行为,并从理论上将其分解为四类。基于该分解的进一步分析表明,知识来源的先天差异和阅读器模型不可预测的退化是造成不一致的主要原因。在分析的基础上,我们提出了检索器集成(EoR),这是一个可训练的框架,能够自适应地从不同知识源中检索,并有效减少不可预测的阅读器错误。我们在开放域问答上的实验表明,EoR通过显著减少不一致行为,大幅提升了相对于单检索器RALM的性能。
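EoR 的直觉可以用一个多数投票草图说明:每个检索器各自提供证据,阅读器逐一作答,集成保留最一致的答案,从而抑制单一检索器带来的不可预测错误。以下检索器与阅读器均为玩具函数,真实系统中 EoR 的路由是可训练的:

```python
from collections import Counter

def ensemble_answer(question, retrievers, reader):
    # 每个检索器各自取证,阅读器对每份证据作答一次,
    # 最后保留出现次数最多(最一致)的答案。
    answers = [reader(question, retrieve(question)) for retrieve in retrievers]
    return Counter(answers).most_common(1)[0][0]

# 玩具检索器:其中一个是"噪声"检索器,模拟检索器之间的不一致。
r1 = lambda q: "Paris is the capital of France."
r2 = lambda q: "The capital of France is Paris."
r3 = lambda q: "Lyon hosts many French institutions."
reader = lambda q, doc: "Paris" if "Paris" in doc else "Lyon"

print(ensemble_answer("What is the capital of France?", [r1, r2, r3], reader))
```

单用 r3 会给出错误答案,而集成的多数票把这一样例级别的不一致抹平了。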

[NLP-40] Position Coupling: Leveraging Task Structure for Improved Length Generalization of Transformers
[NLP-40] 位置耦合:利用任务结构提升Transformer的长度泛化

链接: https://arxiv.org/abs/2405.20671
作者: Hanseul Cho,Jaeyoung Cha,Pranjal Awasthi,Srinadh Bhojanapalli,Anupam Gupta,Chulhee Yun
关键词: simple arithmetic tasks, encountered during training, longer sequences, Transformer, position
中文关键词: 简单的算术任务,训练期间遇到的,更长的序列,Transformer,位置
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 73 pages, 20 figures, 90 tables

点击查看摘要

Abstract:Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more “relevant” tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as addition with multiple summands, Nx2 multiplication, copy/reverse, and a two-dimensional task.
摘要:即使是像整数加法这样简单的算术任务,Transformer也很难泛化到比训练时更长的序列。为了解决这个问题,我们提出了位置耦合,这是一种简单而有效的方法,它直接将任务的结构嵌入到(仅解码器的)Transformer的位置编码中。与为每个token分配唯一位置ID的普通绝对位置机制不同,我们将相同的位置ID分配给两个或更多"相关"的token;对于整数加法任务,我们将位值相同的数字视为处于相同位置。在实验方面,我们表明,借助所提出的位置耦合,一个在1到30位加法上训练的小型(1层)Transformer可以泛化到最多200位的加法(训练长度的6.67倍)。在理论方面,我们证明了具有耦合位置的1层Transformer可以解决涉及指数级多位数的加法任务,而任何没有位置信息的1层Transformer都无法完全解决它。我们还展示了位置耦合可以应用于其他算法任务,如多加数加法、Nx2乘法、复制/反转以及一个二维任务。
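位置耦合的关键是:两数中位值相同的数字共享同一个位置 ID。下面的草图给出一种可能的 ID 分配方式(最低位记为 1,'+' 记为 0);论文中对运算符、等号及结果部分的具体处理可能不同,此处仅演示"同位值共享 ID"这一点:

```python
def coupled_position_ids(addition):
    # 对形如 "a+b=" 的输入,按数字的位值(个位、十位……)分配位置 ID:
    # 同一位值的数字共享同一 ID(最低位为 1),'+' 记为 0。
    a, b = addition.rstrip("=").split("+")
    ids_a = list(range(len(a), 0, -1))   # 最高位在前,ID 递减到 1
    ids_b = list(range(len(b), 0, -1))
    return ids_a + [0] + ids_b

ids = coupled_position_ids("653+49=")
print(ids)  # '5' 与 '4' 同为十位,共享 ID 2
```

这样无论操作数多长,"对齐同位值数字"这一任务结构都被位置编码显式表达,这正是长度泛化的来源。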

[NLP-41] DORY: Deliberative Prompt Recovery for LLM
[NLP-41] DORY:面向LLM的审慎提示恢复

链接: https://arxiv.org/abs/2405.20657
作者: Lirong Gao,Ru Peng,Yiming Zhang,Junbo Zhao
关键词: large language models, concerns regarding privacy, Prompt recovery, large language, crucial for understanding
中文关键词: 大型语言模型、隐私问题、提示恢复、大型语言,对于理解至关重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt recovery in large language models (LLMs) is crucial for understanding how LLMs work and addressing concerns regarding privacy, copyright, etc. The trend towards inference-only APIs complicates this task by restricting access to essential outputs for recovery. To tackle this challenge, we extract prompt-related information from limited outputs and identify a strong(negative) correlation between output probability-based uncertainty and the success of prompt recovery. This finding led to the development of Deliberative PrOmpt RecoverY (DORY), our novel approach that leverages uncertainty to recover prompts accurately. DORY involves reconstructing drafts from outputs, refining these with hints, and filtering out noise based on uncertainty. Our evaluation across diverse LLMs and prompt benchmarks shows that DORY outperforms existing baselines, improving performance by approximately 10.82% and establishing a new state-of-the-art record in prompt recovery tasks. Significantly, DORY operates using a single LLM without any external resources or model, offering a cost-effective, user-friendly prompt recovery solution.
摘要:大型语言模型(LLM)中的提示恢复对于理解LLM的工作原理以及解决隐私、版权等问题至关重要。仅提供推理的API限制了对恢复所需关键输出的访问,使这项任务更加复杂。为了应对这一挑战,我们从有限的输出中提取与提示相关的信息,并发现基于输出概率的不确定性与提示恢复的成功率之间存在很强的(负)相关性。这一发现催生了审慎提示恢复(DORY),这是我们利用不确定性来准确恢复提示的新方法。DORY包括从输出中重建草稿、用线索对其精炼,并基于不确定性过滤噪声。我们在多种LLM和提示基准上的评估表明,DORY优于现有基线,将性能提升了约10.82%,并在提示恢复任务上创造了新的最先进记录。值得注意的是,DORY仅使用单个LLM运行,无需任何外部资源或模型,提供了一种经济高效、用户友好的提示恢复解决方案。
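DORY 所依据的"不确定性与恢复成功率负相关"可以落成一个极简的噪声过滤草图:对草稿中每个 token 计算输出分布的熵,熵过高的 token 视为噪声剔除。阈值与各分布均为演示用假设:

```python
import math

def token_entropy(probs):
    # 输出分布的香农熵:熵越高,模型对该 token 越不确定。
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_draft(draft_tokens, threshold=1.0):
    # 保留低熵(高置信)token,剔除高不确定性的噪声 token。
    return [tok for tok, dist in draft_tokens if token_entropy(dist) < threshold]

draft = [("Translate", [0.9, 0.05, 0.05]),
         ("the",       [0.8, 0.1, 0.1]),
         ("???",       [0.34, 0.33, 0.33])]   # 接近均匀分布,高不确定性
print(filter_draft(draft))
```

DORY 的完整流程还包括草稿重建与线索精炼,此处仅演示基于不确定性的过滤这一步。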

[NLP-42] Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models
[NLP-42] 基于大型语言模型的问答中面向段落重排的段落特定提示调优

链接: https://arxiv.org/abs/2405.20654
作者: Xuyang Wu,Zhiyuan Peng,Sravanthi Rajanala,Hsin-Tai Wu,Yi Fang
关键词: Effective passage retrieval, identify suitable candidates, open-domain question answering, question answering tasks, Effective passage
中文关键词: 有效的文章检索,识别合适的候选人,开放领域问答,问答任务,有效的文章
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at Gen-IR@SIGIR24

点击查看摘要

Abstract:Effective passage retrieval and reranking methods have been widely utilized to identify suitable candidates in open-domain question answering tasks, recent studies have resorted to LLMs for reranking the retrieved passages by the log-likelihood of the question conditioned on each passage. Although these methods have demonstrated promising results, the performance is notably sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs can be computationally intensive and time-consuming. Furthermore, this approach limits the leverage of question-passage relevance pairs and passage-specific knowledge to enhance the ranking capabilities of LLMs. In this paper, we propose passage-specific prompt tuning for reranking in open-domain question answering (PSPT): a parameter-efficient method that fine-tunes learnable passage-specific soft prompts, incorporating passage-specific knowledge from a limited set of question-passage relevance pairs. The method involves ranking retrieved passages based on the log-likelihood of the model generating the question conditioned on each passage and the learned soft prompt. We conducted extensive experiments utilizing the Llama-2-chat-7B model across three publicly available open-domain question answering datasets and the results demonstrate the effectiveness of the proposed approach.
摘要:在开放领域问答任务中,有效的篇章检索和重排方法已经被广泛地用于识别合适的候选者,最近的研究求助于LLMS来根据每段问题的对数似然来对检索到的篇章进行重排。虽然这些方法已经显示了良好的结果,但是性能对人类书写的提示(或硬提示)非常敏感,并且微调LLM可能是计算密集型和耗时的。此外,这种方法限制了问题-段落相关性对和特定段落知识的杠杆作用,从而提高了LLMS的排名能力。在本文中,我们提出了一种用于开放领域问答(PSPT)中重新排序的特定于段落的提示调整:一种参数高效的方法,微调可学习的特定于段落的软提示,结合来自有限的问题-段落相关对的特定于段落的知识。该方法包括基于生成以每个段落为条件的问题的模型的对数似然以及学习的软提示来对检索到的段落进行排序。我们利用Llama-2-Chat-7B模型在三个公开可用的开放领域问答数据集上进行了广泛的实验,结果证明了所提出的方法的有效性。
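PSPT 的重排核心是:以"给定段落(及其段落特定软提示)时生成问题的对数似然"作为排序分数。下面用一个假设的玩具打分函数演示这一重排框架;真实分数应由 LLM 给出,而非词重叠:

```python
def rerank(passages, question_loglik):
    # 按"给定段落时问题的对数似然"从高到低排序
    # (PSPT 中该似然还以可学习的段落特定软提示为条件)。
    return sorted(passages, key=question_loglik, reverse=True)

question = "who wrote hamlet"

def toy_loglik(passage):
    # 玩具打分:词重叠越多分数越高,并轻微惩罚长段落(仅为演示)。
    overlap = len(set(question.split()) & set(passage.lower().split()))
    return overlap - 0.01 * len(passage.split())

ranked = rerank(["Hamlet is a city in Ohio.",
                 "Shakespeare wrote Hamlet around 1601."], toy_loglik)
print(ranked[0])
```

把 toy_loglik 换成"LLM 在软提示条件下生成问题的对数似然",即得到 PSPT 的重排流程。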

[NLP-43] Reward-based Input Construction for Cross-document Relation Extraction
[NLP-43] 基于奖励的跨文档关系提取输入构造

链接: https://arxiv.org/abs/2405.20649
作者: Byeonghu Na,Suhyeon Jo,Yeongmin Kim,Il-Chul Moon
关键词: natural language processing, aiming to identify, entities in text, fundamental task, task in natural
中文关键词: 自然语言处理,旨在识别文本中的实体,基本任务,自然任务
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2024 main conference

点击查看摘要

Abstract:Relation extraction (RE) is a fundamental task in natural language processing, aiming to identify relations between target entities in text. While many RE methods are designed for a single sentence or document, cross-document RE has emerged to address relations across multiple long documents. Given the nature of long documents in cross-document RE, extracting document embeddings is challenging due to the length constraints of pre-trained language models. Therefore, we propose REward-based Input Construction (REIC), the first learning-based sentence selector for cross-document RE. REIC extracts sentences based on relational evidence, enabling the RE module to effectively infer relations. Since supervision of evidence sentences is generally unavailable, we train REIC using reinforcement learning with RE prediction scores as rewards. Experimental results demonstrate the superiority of our method over heuristic methods for different RE structures and backbones in cross-document RE. Our code is publicly available at this https URL.
摘要:关系抽取是自然语言处理中的一项基本任务,旨在识别文本中目标实体之间的关系。虽然许多RE方法是为单个句子或文档设计的,但跨文档RE已经出现,以处理多个长文档之间的关系。考虑到跨文档RE中长文档的性质,由于预先训练的语言模型的长度限制,提取文档嵌入是具有挑战性的。因此,我们提出了基于奖励的输入结构(REIC),这是第一个基于学习的跨文档RE句子选择器。REIC根据关系证据提取句子,使RE模块能够有效地推断关系。由于证据句子的监督通常是不可用的,我们使用强化学习来训练REIC,并以RE预测分数作为奖励。实验结果表明,对于跨文档RE中不同的RE结构和主干,该方法优于启发式方法。我们的代码在此HTTPS URL上公开提供。
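REIC 用 RE 模型的预测分数作为奖励来训练句子选择器,可用一个 softmax 策略加 REINFORCE 更新的草图说明。权重、学习率与奖励取值均为演示用假设:

```python
import math

def softmax(ws):
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(weights, chosen, reward, lr=0.5):
    # 对 softmax 句子选择策略做一步策略梯度更新:
    # 被选句子的梯度为 (1 - p),其余为 (-p);
    # RE 模型的预测分数充当奖励(此处为标量演示)。
    probs = softmax(weights)
    return [w + lr * reward * ((1.0 if j == chosen else 0.0) - p)
            for j, (w, p) in enumerate(zip(weights, probs))]

w = [0.0, 0.0, 0.0]
w = reinforce_update(w, chosen=0, reward=1.0)  # 句子 0 得到了高 RE 分数
print(w)  # 被选中句子的权重上升,其余下降
```

重复这一"选句→RE 打分→更新"循环,选择器就学会挑出带有关系证据的句子。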

[NLP-44] Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
[NLP-44] Shotluck Holmes:一系列用于视频字幕和摘要的高效小规模大语言视觉模型

链接: https://arxiv.org/abs/2405.20648
作者: Richard Luo,Austin Peng,Adithya Vasudev,Rishabh Jain
关键词: poses substantial challenges, information-dense medium, increasingly prominent, prominent and information-dense, poses substantial
中文关键词: 构成实质性挑战,信息密集媒体,日益突出、突出、信息密集,构成实质性
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehension of the entire video requires not only understanding the visual-audio information of each shot but also requires that the model links the ideas between each shot to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos’ more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck Holmes. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from being able to understand a picture to being able to understand a sequence of frames. Specifically, we show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task with significantly smaller and more computationally efficient models.
摘要:视频是一种日益突出和信息密集的媒介,但它对语言模型提出了巨大的挑战。典型的视频由一系列较短的片段或镜头组成,这些片段或镜头共同构成了一个连贯的叙事。每个镜头类似于句子中的一个词,其中必须同时处理多个信息流(如视觉和听觉数据)。理解整个视频不仅需要理解每个镜头的视听信息,还需要模型将每个镜头之间的想法联系起来,以生成一个更大、更全面的故事。尽管该领域取得了重大进展,但目前的工作往往忽略了视频中更细粒度的逐个镜头的语义信息。在这个项目中,我们提出了一系列高效的大语言视觉模型(LLVM)来提高视频摘要和字幕的性能,称为Shotluck Holmes。通过利用更好的预训练和数据收集策略,我们将现有小型LLVM的能力从能够理解一幅图片扩展到能够理解一系列帧。具体地说,我们表明Shotluck Holmes在Shot2Story视频字幕和摘要任务上的性能比最先进的结果要好得多,模型小得多,计算效率也更高。

[NLP-45] Large Language Models Enhanced Sequential Recommendation for Long-tail User and Item
[NLP-45] 大型语言模型针对长尾用户和项目的增强序列推荐

链接: https://arxiv.org/abs/2405.20646
作者: Qidong Liu,Xian Wu,Xiangyu Zhao,Yejing Wang,Zijian Zhang,Feng Tian,Yefeng Zheng
关键词: social networking platforms, predicting users’ subsequent, Sequential recommendation systems, users’ subsequent preferences, subsequent preferences based
中文关键词: 社交网络平台、预测用户后续、顺序推荐系统、用户后续偏好、基于后续偏好
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Sequential recommendation systems (SRS) serve the purpose of predicting users’ subsequent preferences based on their past interactions and have been applied across various domains such as e-commerce and social networking platforms. However, practical SRS encounters challenges due to the fact that most users engage with only a limited number of items, while the majority of items are seldom consumed. These challenges, termed as the long-tail user and long-tail item dilemmas, often create obstacles for traditional SRS methods. Mitigating these challenges is crucial as they can significantly impact user satisfaction and business profitability. While some research endeavors have alleviated these issues, they still grapple with issues such as seesaw or noise stemming from the scarcity of interactions. The emergence of large language models (LLMs) presents a promising avenue to address these challenges from a semantic standpoint. In this study, we introduce the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR), which leverages semantic embeddings from LLMs to enhance SRS performance without increasing computational overhead. To combat the long-tail item challenge, we propose a dual-view modeling approach that fuses semantic information from LLMs with collaborative signals from traditional SRS. To address the long-tail user challenge, we introduce a retrieval augmented self-distillation technique to refine user preference representations by incorporating richer interaction data from similar users. Through comprehensive experiments conducted on three authentic datasets using three widely used SRS models, our proposed enhancement framework demonstrates superior performance compared to existing methodologies.
摘要:序贯推荐系统(SRS)用于根据用户过去的交互来预测用户的后续偏好,已被应用于电子商务和社交网络平台等各个领域。然而,实用的SRS遇到了挑战,因为大多数用户只接触到有限数量的物品,而大多数物品很少被消费。这些挑战被称为长尾用户和长尾物品两难境地,往往给传统的SRS方法造成障碍。缓解这些挑战至关重要,因为它们会显著影响用户满意度和业务盈利能力。虽然一些研究努力已经缓解了这些问题,但他们仍然在努力解决因缺乏互动而产生的跷跷板或噪音等问题。大型语言模型(LLM)的出现为从语义角度解决这些挑战提供了一条有希望的途径。在这项研究中,我们介绍了用于顺序推荐的大语言模型增强框架(LLM-ESR),该框架利用来自LLM的语义嵌入来提高SRS的性能,而不增加计算开销。为了应对长尾物品的挑战,我们提出了一种双视图建模方法,该方法融合了来自LLM的语义信息和来自传统SRS的协作信号。为了应对长尾用户的挑战,我们引入了一种检索增强的自我提炼技术,通过结合来自相似用户的更丰富的交互数据来提炼用户偏好表示。通过使用三种广泛使用的SRS模型在三个真实数据集上进行的全面实验,我们提出的增强框架表现出比现有方法更好的性能。
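为便于理解上文"双视图建模"的基本思路,下面给出一个极简的Python示意:把LLM语义嵌入与协同信号嵌入拼接为一个融合表示(函数名与数值均为假设,论文中的实际融合与检索增强自蒸馏要精细得多):

```python
def fuse_dual_view(semantic_emb, collab_emb):
    """双视图融合的最简示意:直接拼接两种视图的嵌入向量。"""
    return list(semantic_emb) + list(collab_emb)

# 假设的示例数值:某物品在两种视图下的低维嵌入
item_semantic = [0.1, 0.3]   # 来自 LLM 的语义嵌入(假设)
item_collab = [0.7, -0.2]    # 来自传统 SRS 的协同信号嵌入(假设)
fused = fuse_dual_view(item_semantic, item_collab)
```

实际系统中两种嵌入通常还需经过投影对齐到同一空间,这里仅演示"融合"这一步的接口形态。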

[NLP-46] ToxVidLLM: A Multimodal LLM-based Framework for Toxicity Detection in Code-Mixed Videos
[NLP-46] ToxVidLLM:一个基于多模态LLM的框架,用于代码混合视频中的毒性检测

链接: https://arxiv.org/abs/2405.20628
作者: Krishanu Maity,A.S. Poornash,Sriparna Saha,Pushpak Bhattacharyya
关键词: evolving internet technology, rapidly evolving internet, toxic content detection, toxic content, internet technology
中文关键词: 不断发展的互联网技术,快速发展的互联网,有毒内容检测,有毒内容,互联网技术
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL Findings 2024

点击查看摘要

Abstract:In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in low-resource code-mixed languages, remains a critical challenge. While substantial research has addressed toxic content detection in textual data, the realm of video content, especially in non-English languages, has been relatively underexplored. This paper addresses this research gap by introducing a benchmark dataset, the first of its kind, consisting of 931 videos with 4021 code-mixed Hindi-English utterances collected from YouTube. Each utterance within this dataset has been meticulously annotated for toxicity, severity, and sentiment labels. We have developed an advanced Multimodal Multitask framework built for Toxicity detection in Video Content by leveraging Large Language Models (LLMs), crafted for the primary objective along with the additional tasks of conducting sentiment and severity analysis. ToxVidLLM incorporates three key modules the Encoder module, Cross-Modal Synchronization module, and Multitask module crafting a generic multimodal LLM customized for intricate video classification tasks. Our experiments reveal that incorporating multiple modalities from the videos substantially enhances the performance of toxic content detection by achieving an Accuracy and Weighted F1 score of 94.29% and 94.35%, respectively.
摘要:在互联网技术快速发展的时代,包括视频在内的多模式内容的激增扩大了在线交流的视野。然而,在这种多样化的环境中检测有毒内容,特别是在低资源代码混合语言中检测有毒内容,仍然是一个严峻的挑战。虽然大量的研究已经解决了文本数据中的有毒内容检测问题,但视频内容领域,特别是非英语语言领域的探索相对较少。本文通过介绍一个基准数据集来弥补这一研究差距,该数据集是首个基准数据集,包括从YouTube收集的931个视频和4021个代码混合的印地语-英语话语。这个数据集中的每一句话都经过了仔细的注释,以确定其毒性、严重性和情绪标签。我们开发了一个先进的多模式多任务框架,通过利用大型语言模型(LLM)为视频内容中的毒性检测构建,该框架针对主要目标以及进行情绪和严重性分析的附加任务而精心设计。ToxVidLLM集成了三个关键模块:编码器模块、跨模式同步模块和多任务模块,为复杂的视频分类任务定制了一个通用的多模式LLM。我们的实验表明,结合视频中的多个模式可以显著提高有毒内容检测的性能,准确率和加权F1得分分别为94.29%和94.35%。

[NLP-47] Leveraging Large Language Models for Entity Matching
[NLP-47] 利用大型语言模型进行实体匹配

链接: https://arxiv.org/abs/2405.20624
作者: Qianyu Huang,Tongfang Zhao
关键词: Entity matching, aiming to identify, real-world entities, Large Language Models, critical task
中文关键词: 实体匹配,旨在识别现实世界的实体,大型语言模型,关键任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Entity matching (EM) is a critical task in data integration, aiming to identify records across different datasets that refer to the same real-world entities. Traditional methods often rely on manually engineered features and rule-based systems, which struggle with diverse and unstructured data. The emergence of Large Language Models (LLMs) such as GPT-4 offers transformative potential for EM, leveraging their advanced semantic understanding and contextual capabilities. This vision paper explores the application of LLMs to EM, discussing their advantages, challenges, and future research directions. Additionally, we review related work on applying weak supervision and unsupervised approaches to EM, highlighting how LLMs can enhance these methods.
摘要:实体匹配(EM)是数据集成中的一项关键任务,旨在识别引用相同现实世界实体的不同数据集中的记录。传统方法通常依赖于手动设计的功能和基于规则的系统,这些系统难以处理多样化和非结构化的数据。GPT-4等大型语言模型(LLM)的出现为EM提供了变革潜力,利用其先进的语义理解和上下文能力。这篇愿景论文探讨了LLM在EM中的应用,讨论了它们的优势、挑战和未来的研究方向。此外,我们还回顾了将弱监督和无监督方法应用于EM的相关工作,重点介绍了LLM如何增强这些方法。
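下面用一段示意性Python代码展示"把实体匹配表述为对LLM的成对判断提示"这一基本做法(提示词措辞、记录字段与解析逻辑均为假设,并非论文原实现):

```python
def build_em_prompt(record_a: dict, record_b: dict) -> str:
    """把两条记录拼成一个成对实体匹配提示词。"""
    fmt = lambda r: "; ".join(f"{k}: {v}" for k, v in r.items())
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {fmt(record_a)}\n"
        f"Record B: {fmt(record_b)}\n"
        "Answer yes or no."
    )

def parse_em_answer(answer: str) -> bool:
    """宽松解析 LLM 的回复:以 yes 开头即视为匹配。"""
    return answer.strip().lower().startswith("yes")

# 两条描述同一商品、但写法不同的记录(假设数据)
a = {"name": "iPhone 13 Pro 128GB", "brand": "Apple"}
b = {"name": "Apple iPhone13Pro (128 GB)", "brand": "Apple"}
prompt = build_em_prompt(a, b)
```

构造好的 prompt 交由任意LLM判断即可;相比手工特征工程,这种做法把记录归一化的负担转移给了模型的语义理解能力。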

[NLP-48] FineRadScore: A Radiology Report Line-by-Line Evaluation Technique Generating Corrections with Severity Scores
[NLP-48] FineRadScore:放射学报告逐行评估技术,通过严重度评分生成纠正

链接: https://arxiv.org/abs/2405.20613
作者: Alyssa Huang,Oishi Banerjee,Kay Wu,Eduardo Pontes Reis,Pranav Rajpurkar
关键词: generated chest x-ray, Large Language Model, current gold standard, chest x-ray, evaluating generated chest
中文关键词: 生成的胸部X光检查,大型语言模型,当前黄金标准,胸部X光检查,评估生成的胸部
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The current gold standard for evaluating generated chest x-ray (CXR) reports is through radiologist annotations. However, this process can be extremely time-consuming and costly, especially when evaluating large numbers of reports. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore provides an error severity rating with each correction and generates comments explaining why the correction was needed. We demonstrate that FineRadScore’s corrections and error severity scores align with radiologist opinions. We also show that, when used to judge the quality of the report as a whole, FineRadScore aligns with radiologists as well as current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore’s shortcomings to provide suggestions for future improvements.
摘要:目前评价生成的胸部X光(CXR)报告的黄金标准是通过放射科医生注释。然而,这一过程可能非常耗时和昂贵,尤其是在评估大量报告时。在这项工作中,我们提出了FineRadScore,一个基于大型语言模型(LLM)的生成CXR报告自动评估指标。在给定候选报告和真实参考报告的情况下,FineRadScore给出了从候选报告修改为真实参考报告所需的最少逐行更正次数。此外,FineRadScore为每个更正提供错误严重程度评级,并生成注释,解释为什么需要更正。我们证明FineRadScore的更正和错误严重程度分数与放射科医生的意见一致。我们还表明,当被用来判断报告的整体质量时,FineRadScore与放射科医生以及当前最先进的自动化CXR评估指标保持一致。最后,我们分析了FineRadScore的不足之处,为未来的改进提供了建议。
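FineRadScore的核心输出是"从候选报告到参考报告的最少逐行更正次数"。下面用标准库difflib给出一个简化的替代性示意(仅做逐行对齐计数,不含LLM判断与严重度评级;报告文本均为假设):

```python
import difflib

def line_correction_count(candidate: str, ground_truth: str) -> int:
    """统计把候选报告改写为参考报告所需的逐行修改数(简化替代指标)。"""
    matcher = difflib.SequenceMatcher(
        a=candidate.splitlines(), b=ground_truth.splitlines()
    )
    corrections = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / delete / insert 都计为修改
            corrections += max(i2 - i1, j2 - j1)
    return corrections

candidate = "No acute findings.\nHeart size normal."
reference = "No acute findings.\nHeart size mildly enlarged."
n_corrections = line_correction_count(candidate, reference)
```

真正的FineRadScore由LLM判断语义层面的更正并附带严重度,这里的字符串对齐只用于说明"逐行更正计数"这一指标形态。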

[NLP-49] UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation
[NLP-49] UniBias:通过内部注意力和FFN操纵揭露和缓解LLM偏见

链接: https://arxiv.org/abs/2405.20612
作者: Hanzhang Zhou,Zijian Feng,Zixiao Zhu,Junlang Qian,Kezhi Mao
关键词: Large language models, demonstrated impressive capabilities, Large language, in-context learning, demonstrated impressive
中文关键词: 大型语言模型,表现出令人印象深刻的能力,大型语言,上下文学习,表现出令人印象深刻的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in various tasks using the in-context learning (ICL) paradigm. However, their effectiveness is often compromised by inherent bias, leading to prompt brittleness, i.e., sensitivity to design settings such as example selection, order, and prompt formatting. Previous studies have addressed LLM bias through external adjustment of model outputs, but the internal mechanisms that lead to such bias remain unexplored. Our work delves into these mechanisms, particularly investigating how feedforward neural networks (FFNs) and attention heads result in the bias of LLMs. By Interpreting the contribution of individual FFN vectors and attention heads, we identify the biased LLM components that skew LLMs’ prediction toward specific labels. To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads. Extensive experiments across 12 NLP datasets demonstrate that UniBias significantly enhances ICL performance and alleviates prompt brittleness of LLMs.
摘要:大型语言模型(LLM)在使用情境学习(ICL)范式的各种任务中表现出了令人印象深刻的能力。然而,它们的有效性往往受到固有偏差的影响,导致提示脆弱性(prompt brittleness),即对示例选择、顺序和提示格式等设计设置的敏感性。以前的研究已经通过对模型输出进行外部调整来解决LLM偏差,但导致这种偏差的内部机制仍未被探索。我们的工作深入研究了这些机制,特别是研究了前馈神经网络(FFN)和注意力头如何导致LLM的偏差。通过解释单个FFN向量和注意力头的贡献,我们识别出使LLM的预测向特定标签倾斜的有偏LLM组件。为了减轻这些偏差,我们引入了UniBias,这是一种仅需推理的方法,可以有效地识别和消除有偏的FFN向量和注意力头。在12个NLP数据集上的大量实验表明,UniBias显著提高了ICL性能,并缓解了LLM的提示脆弱性。

[NLP-50] Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code
[NLP-50] 双向变形金刚与word 2vec:发现已提升的已编译代码中的漏洞

链接: https://arxiv.org/abs/2405.20611
作者: Gary A. McCully,John D. Hastings,Shengjie Xu,Adam Fortier
关键词: high-level code structures, lost high-level code, architectural dependencies, optimization options, challenging due
中文关键词: 高级代码结构、丢失的高级代码、体系结构依赖关系、优化选项、具有挑战性的
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: 8 pages, 0 figures, IEEE 4th Cyber Awareness and Research Symposium 2024 (CARS’24)

点击查看摘要

Abstract:Detecting vulnerabilities within compiled binaries is challenging due to lost high-level code structures and other factors such as architectural dependencies, compilers, and optimization options. To address these obstacles, this research explores vulnerability detection by using natural language processing (NLP) embedding techniques with word2vec, BERT, and RoBERTa to learn semantics from intermediate representation (LLVM) code. Long short-term memory (LSTM) neural networks were trained on embeddings from encoders created using approximately 118k LLVM functions from the Juliet dataset. This study is pioneering in its comparison of word2vec models with multiple bidirectional transformer (BERT, RoBERTa) embeddings built using LLVM code to train neural networks to detect vulnerabilities in compiled binaries. word2vec Continuous Bag of Words (CBOW) models achieved 92.3% validation accuracy in detecting vulnerabilities, outperforming word2vec Skip-Gram, BERT, and RoBERTa. This suggests that complex contextual NLP embeddings may not provide advantages over simpler word2vec models for this task when a limited number (e.g. 118K) of data samples are used to train the bidirectional transformer-based models. The comparative results provide novel insights into selecting optimal embeddings for learning compiler-independent semantic code representations to advance machine learning detection of vulnerabilities in compiled binaries.
摘要:由于高级代码结构丢失以及架构依赖项、编译器和优化选项等其他因素的影响,检测已编译二进制文件中的漏洞具有挑战性。为了克服这些障碍,本研究使用自然语言处理(NLP)嵌入技术(word2vec、BERT和RoBERTa)从中间表示(LLVM)代码学习语义,以探索漏洞检测。我们使用来自Juliet数据集的约118k个LLVM函数创建编码器,并用其生成的嵌入训练长短期记忆(LSTM)神经网络。这项研究率先将word2vec模型与使用LLVM代码构建的多个双向Transformer(BERT、RoBERTa)嵌入进行比较,以训练神经网络来检测编译后的二进制文件中的漏洞。word2vec连续词袋(CBOW)模型在检测漏洞方面的验证准确率达到92.3%,优于word2vec Skip-Gram、BERT和RoBERTa。这表明,当只用有限数量(例如118k)的数据样本训练基于双向Transformer的模型时,复杂的上下文NLP嵌入在这项任务上可能并不比更简单的word2vec模型更有优势。这些比较结果为选择最优嵌入来学习与编译器无关的语义代码表示提供了新的见解,从而推进了对编译二进制文件中漏洞的机器学习检测。
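下面用纯Python演示CBOW模型训练样本的构造方式,即从token序列生成(上下文, 中心词)对(token内容为假设的LLVM中间表示片段,实际训练可使用gensim等库完成):

```python
def cbow_pairs(tokens, window=2):
    """从 token 序列生成 CBOW 的 (上下文, 中心词) 训练对。"""
    pairs = []
    for i, target in enumerate(tokens):
        # 取中心词左右各 window 个 token 作为上下文
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

# 假设的 LLVM 中间表示 token 片段(仅作演示)
llvm_tokens = ["load", "i32", "add", "i32", "store"]
pairs = cbow_pairs(llvm_tokens, window=2)
```

CBOW即由上下文预测中心词;Skip-Gram则反过来由中心词预测上下文,两者的训练对构造方式互为镜像。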

[NLP-51] Identifying while Learning for Document Event Causality Identification
[NLP-51] 在学习时识别文档事件因果关系识别

链接: https://arxiv.org/abs/2405.20608
作者: Cheng Liu,Wei Xiang,Bang Wang
关键词: aims to detect, Identification, Causality, causal, Event Causality
中文关键词: 旨在检测、识别、因果关系、因果关系、事件因果关系
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2024

点击查看摘要

Abstract:Event Causality Identification (ECI) aims to detect whether there exists a causal relation between two events in a document. Existing studies adopt a kind of identifying after learning paradigm, where events’ representations are first learned and then used for the identification. Furthermore, they mainly focus on the causality existence, but ignoring causal direction. In this paper, we take care of the causal direction and propose a new identifying while learning mode for the ECI task. We argue that a few causal relations can be easily identified with high confidence, and the directionality and structure of these identified causalities can be utilized to update events’ representations for boosting next round of causality identification. To this end, this paper designs an iterative learning and identifying framework: In each iteration, we construct an event causality graph, on which events’ causal structure representations are updated for boosting causal identification. Experiments on two public datasets show that our approach outperforms the state-of-the-art algorithms in both evaluations for causality existence identification and direction identification.
摘要:事件因果关系识别旨在检测文档中两个事件之间是否存在因果关系。现有的研究采用一种先学习后识别的范式,即首先学习事件的表征,然后再用于识别。此外,他们主要关注因果关系的存在,而忽视了因果关系的方向。在本文中,我们关注了因果方向,并针对ECI任务提出了一种新的边识别边学习的模式。我们认为,一些因果关系可以很容易地识别,并具有很高的置信度,这些识别的因果关系的方向性和结构可以用于更新事件的表示,以促进下一轮因果关系识别。为此,本文设计了一个迭代学习和识别框架:在每次迭代中,我们构造一个事件因果图,在该图上更新事件的因果结构表示,以提高因果识别。在两个公开数据集上的实验表明,我们的方法在因果关系存在识别和方向识别方面都优于最先进的算法。

[NLP-52] Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
[NLP-52] 掩蔽语言建模成为表格数据合成的条件密度估计

链接: https://arxiv.org/abs/2405.20602
作者: Seunghwan An,Gyeongdong Woo,Jaesung Lim,ChangHyun Kim,Sungchul Hong,Jong-June Jeon
关键词: synthetic data generation, machine learning utility, high machine learning, generate synthetic data, synthetic data
中文关键词: 合成数据生成,机器学习实用程序,高级机器学习,生成合成数据,合成数据
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Given that the MLu performance relies on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We propose a novel synthetic data generation method, MaCoDE, by redefining the multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our proposed method enables estimating conditional densities across arbitrary combinations of target and conditional variables. Furthermore, we demonstrate that our proposed method bridges the theoretical gap between distributional learning and MLM. To validate the effectiveness of our proposed model, we conduct synthetic data generation experiments on 10 real-world datasets. Given the analogy between predicting masked input tokens in MLM and missing data imputation, we also evaluate the performance of multiple imputations on incomplete datasets with various missing data mechanisms. Moreover, our proposed model offers the advantage of enabling adjustments to data privacy levels without requiring re-training.
摘要:在本文中,我们的目标是为具有高机器学习效用(MLU)的异类(混合型)表格数据集生成合成数据。考虑到MLU的性能依赖于对条件分布的精确逼近,我们重点设计了一种基于条件分布估计的合成数据生成方法。通过将掩蔽语言建模(MLM)的多类分类任务重新定义为基于直方图的非参数条件密度估计,提出了一种新的合成数据生成方法MaCoDE。我们提出的方法可以估计目标变量和条件变量的任意组合的条件密度。此外,我们还证明了我们提出的方法弥合了分布式学习和传销之间的理论鸿沟。为了验证该模型的有效性,我们在10个真实数据集上进行了合成数据生成实验。考虑到MLM中预测屏蔽输入令牌与缺失数据填充的相似之处,我们还评估了在不同缺失数据机制下对不完整数据集的多次填充的性能。此外,我们提出的模型提供了无需重新培训即可调整数据隐私级别的优势。
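掩蔽语言建模用于表格数据时,基本操作是掩蔽一行中的若干列并预测被掩蔽的取值。下面给出这一掩蔽步骤的极简示意(行内容与列名均为假设,不含模型与密度估计部分):

```python
def mask_row(row, mask_cols, mask_token="[MASK]"):
    """掩蔽表格一行中的指定列,返回 (掩蔽后的行, 被掩蔽的真值)。"""
    masked, targets = dict(row), {}
    for col in mask_cols:
        targets[col] = masked[col]
        masked[col] = mask_token
    return masked, targets

# 假设的表格数据行
row = {"age": 34, "income": "50-60k", "city": "Paris"}
masked, targets = mask_row(row, ["income"])
```

训练时模型以掩蔽后的行为条件预测targets中的取值;生成合成数据时则反复采样被掩蔽列,即论文所说的"条件密度估计"视角。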

[NLP-53] DAFNet: Dynamic Auxiliary Fusion for Sequential Model Editing in Large Language Models
[NLP-53] DAFNet:用于大型语言模型中顺序模型编辑的动态辅助融合

链接: https://arxiv.org/abs/2405.20588
作者: Taolin Zhang,Qizhou Chen,Dongyang Li,Chengyu Wang,Xiaofeng He,Longtao Huang,Hui Xue,Jun Huang
关键词: demonstrated impressive results, large language models, impressive results, suffer from hallucination, false information
中文关键词: 展示了令人印象深刻的结果,大型语言模型,令人印象深刻的结果,遭受幻觉,虚假信息
类目: Computation and Language (cs.CL)
备注: ACL2024 findings

点击查看摘要

Abstract:Recently, while large language models (LLMs) have demonstrated impressive results, they still suffer from hallucination, i.e., the generation of false information. Model editing is the task of fixing factual mistakes in LLMs; yet, most previous works treat it as a one-time task, paying little attention to ever-emerging mistakes generated by LLMs. We address the task of sequential model editing (SME) that aims to rectify mistakes continuously. A Dynamic Auxiliary Fusion Network (DAFNet) is designed to enhance the semantic interaction among the factual knowledge within the entire sequence, preventing catastrophic forgetting during the editing process of multiple knowledge triples. Specifically, (1) for semantic fusion within a relation triple, we aggregate the intra-editing attention flow into auto-regressive self-attention with token-level granularity in LLMs. We further leverage multi-layer diagonal inter-editing attention flow to update the weighted representations of the entire sequence-level granularity. (2) Considering that auxiliary parameters are required to store the knowledge for sequential editing, we construct a new dataset named DAFSet, fulfilling recent, popular, long-tail and robust properties to enhance the generality of sequential editing. Experiments show DAFNet significantly outperforms strong baselines in single-turn and sequential editing. The usage of DAFSet also consistently improves the performance of other auxiliary network-based methods in various scenarios.
摘要:最近,尽管大型语言模型(LLM)取得了令人印象深刻的成果,但它们仍然存在幻觉,即产生错误信息。模型编辑是修复LLMS中事实错误的任务;然而,以前的大多数工作都将其视为一次性任务,很少关注LLMS产生的不断出现的错误。我们解决了序列模型编辑(SME)的任务,其目的是不断纠正错误。设计了一种动态辅助融合网络(DAFNet),以增强整个序列中事实知识之间的语义交互,防止在编辑多个知识三元组的过程中发生灾难性的遗忘。具体地说,(1)对于关系三元组中的语义融合,我们在LLMS中将编辑内的注意流聚合为令牌级粒度的自回归自我注意。我们进一步利用多层对角编辑间注意流来更新整个序列级别粒度的加权表示。(2)考虑到顺序编辑需要辅助参数来存储知识,我们构造了一个新的数据集,命名为DAFSet,实现了最新性、流行性、长尾性和健壮性,增强了顺序编辑的通用性。实验表明,DAFNet在单回合和顺序编辑方面的表现明显优于强基线。在各种情况下,DAFSet的使用还会持续提高其他基于辅助网络的方法的性能

[NLP-54] GAMedX: Generative AI-based Medical Entity Data Extractor Using Large Language Models
[NLP-54] GamedX:使用大型语言模型的基于人工智能的生成式医疗实体数据提取器

链接: https://arxiv.org/abs/2405.20585
作者: Mohammed-Khalil Ghali,Abdelrahman Farrag,Hajar Sakai,Hicham El Baz,Yu Jin,Sarah Lam
关键词: Electronic Health Records, Health Records, Electronic Health, rapidly evolving field, Large Language Models
中文关键词: 电子健康记录、健康记录、电子健康、快速发展的领域、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of healthcare and beyond, the integration of generative AI in Electronic Health Records (EHRs) represents a pivotal advancement, addressing a critical gap in current information extraction techniques. This paper introduces GAMedX, a Named Entity Recognition (NER) approach utilizing Large Language Models (LLMs) to efficiently extract entities from medical narratives and unstructured text generated throughout various phases of the patient hospital visit. By addressing the significant challenge of processing unstructured medical text, GAMedX leverages the capabilities of generative AI and LLMs for improved data extraction. Employing a unified approach, the methodology integrates open-source LLMs for NER, utilizing chained prompts and Pydantic schemas for structured output to navigate the complexities of specialized medical jargon. The findings reveal significant ROUGE F1 score on one of the evaluation datasets with an accuracy of 98%. This innovation enhances entity extraction, offering a scalable, cost-effective solution for automated forms filling from unstructured data. As a result, GAMedX streamlines the processing of unstructured narratives, and sets a new standard in NER applications, contributing significantly to theoretical and practical advancements beyond the medical technology sphere.
摘要:在快速发展的医疗保健及其他领域,生成性人工智能与电子健康记录(EHR)的集成代表着一项关键的进步,解决了当前信息提取技术中的一个关键差距。本文介绍了GAMedX,这是一种命名实体识别(NER)方法,利用大语言模型(LLM)从患者就医各个阶段产生的医疗叙述和非结构化文本中高效地提取实体。通过解决处理非结构化医学文本的重大挑战,GAMedX利用生成性人工智能和LLM的能力来改进数据提取。该方法采用统一的方式,为NER集成了开源的LLM,使用链式提示和用于结构化输出的Pydantic模式来应对专业医学术语的复杂性。结果显示,在其中一个评估数据集上取得了显著的ROUGE F1得分,准确率为98%。这一创新增强了实体提取,为从非结构化数据自动填写表单提供了可扩展、经济高效的解决方案。因此,GAMedX简化了非结构化叙述的处理,并在NER应用程序中设定了新的标准,为医疗技术领域以外的理论和实践进步做出了重大贡献。
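GAMedX使用链式提示与Pydantic模式获得结构化输出。下面用标准库json给出一个不依赖Pydantic的简化示意(提示词措辞、JSON字段名与LLM回复内容均为假设):

```python
import json

def build_ner_prompt(note: str) -> str:
    """构造要求 LLM 以固定 JSON 结构返回医疗实体的提示词。"""
    return (
        "Extract medical entities from the clinical note below. "
        'Return JSON only: {"medications": [...], "diagnoses": [...]}\n'
        f"Note: {note}"
    )

def parse_ner_output(llm_output: str) -> dict:
    """解析 LLM 返回的 JSON;真实系统可再用 Pydantic 模式做字段校验。"""
    return json.loads(llm_output)

# 假设这是 LLM 对某条病历的回复
entities = parse_ner_output('{"medications": ["metformin"], "diagnoses": ["T2DM"]}')
```

约束LLM输出为固定JSON结构,正是此类"生成式NER"方案能直接对接下游表单填写的原因。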

[NLP-55] The Point of View of a Sentiment: Towards Clinician Bias Detection in Psychiatric Notes
[NLP-55] 情感的观点:精神病学笔记中的临床医生偏见检测

链接: https://arxiv.org/abs/2405.20582
作者: Alissa A. Valentine,Lauren A. Lepow,Alexander W. Charney,Isotta Landi
关键词: negative patient descriptions, point of view, large language models, negative patient, Mount Sinai Health
中文关键词: 负面患者描述、观点、大型语言模型、负面患者、西奈山健康中心
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Oral presentation at NAACL 2024 Queer in AI Workshop

点击查看摘要

Abstract:In psychiatry, negative patient descriptions and stigmatizing language can contribute to healthcare disparities in two ways: (1) read by patients they can harm their trust and engagement with the medical center; (2) read by future providers they may negatively influence the future perspective of a patient. By leveraging large language models, this work aims to identify the sentiment expressed in psychiatric clinical notes based on the reader’s point of view. Extracting sentences from the Mount Sinai Health System’s large and diverse clinical notes, we used prompts and in-context learning to adapt three large language models (GPT-3.5, Llama 2, Mistral) to classify the sentiment conveyed by the sentences according to the provider or non-provider point of view. Results showed that GPT-3.5 aligns best to provider point of view, whereas Mistral aligns best to non-provider point of view.
摘要:在精神病学中,负面的患者描述和污名化语言可能会通过两种方式导致医疗保健差异:(1)患者阅读它们可能会损害他们对医疗中心的信任和参与;(2)未来的提供者阅读它们可能会对患者未来的看法产生负面影响。通过利用大型语言模型,这项工作旨在根据读者的观点识别精神科临床笔记中表达的情感。从西奈山卫生系统大量且多样化的临床笔记中提取句子,我们使用提示和上下文学习来适应三个大型语言模型(GPT-3.5、Llama 2、Mistral),根据提供者或非提供者的观点对句子传达的情感进行分类。结果显示,GPT-3.5最符合提供商的观点,而Mistral最符合非提供商的观点。

[NLP-56] Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
[NLP-56] 开放Ko-LLM排行榜:使用Ko-H5基准评估韩语大型语言模型

链接: https://arxiv.org/abs/2405.20574
作者: Chanjun Park,Hyeonwoo Kim,Dahyun Kim,Seonghwan Cho,Sanghoon Kim,Sukyung Lee,Yungi Kim,Hwalsuk Lee
关键词: Large Language Models, evaluating Large Language, Language Models, Open Ko-LLM Leaderboard, Open LLM Leaderboard
中文关键词: 大型语言模型,评估大型语言,语言模型,开放Ko-LLM排行榜,开放LLM排行榜
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2024 Main

点击查看摘要

Abstract:This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean. Incorporating private test sets while mirroring the English Open LLM Leaderboard, we establish a robust evaluation framework that has been well integrated in the Korean LLM community. We perform data leakage analysis that shows the benefit of private test sets along with a correlation study within the Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we present empirical support for the need to expand beyond set benchmarks. We hope the Open Ko-LLM Leaderboard sets precedent for expanding LLM evaluation to foster more linguistic diversity.
摘要:本文介绍了Open Ko-LLM排行榜和Ko-H5基准,作为评估韩语大型语言模型(LLM)的重要工具。我们在模仿英文Open LLM Leaderboard的同时,融合了私有测试集,建立了一个强大的评估框架,该框架已很好地融入韩国LLM社区。我们执行数据泄露分析,显示私有测试集的好处,以及Ko-H5基准内的相关性研究和Ko-H5评分的时间分析。此外,我们还为扩大超出既定基准的必要性提供了经验支持。我们希望Open Ko-LLM排行榜为扩大LLM评估以促进更多语言多样性开创先例。

[NLP-57] Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
[NLP-57] 被困惑困惑:使用小引用模型基于困惑的数据修剪

链接: https://arxiv.org/abs/2405.20541
作者: Zachary Ankner,Cody Blakeney,Kartik Sreenivasan,Max Marion,Matthew L. Leavitt,Mansheej Paul
关键词: small language models, determine high-quality subsets, large-scale text datasets, larger language models, language models
中文关键词: 小型语言模型,确定高质量子集,大规模文本数据集,更大的语言模型,语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
摘要:在这项工作中,我们研究了小型语言模型是否能够确定大规模文本数据集的高质量子集,从而提高大型语言模型的性能。虽然现有的工作已经表明,基于较大模型的困惑度的剪枝可以产生高质量的数据,但我们调查了较小的模型是否可以用于基于困惑度的剪枝,以及剪枝如何受到被剪枝数据的域组成的影响。我们证明,对于多个数据集组合,基于困惑度对预训练数据进行剪枝可以显著提高下游任务的性能:基于1.25亿参数模型计算的困惑度进行剪枝,可将30亿参数模型在下游任务上的平均性能提高多达2.04,并在达到相当基线性能所需的预训练步数上实现最多1.45倍的减少。此外,我们还证明了在过度训练和数据受限的情况下,这种基于困惑度的数据剪枝同样可以带来下游性能提升。
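下面用纯Python示意"基于困惑度的数据剪枝"的基本流程:用(小)参考模型给出的逐token对数概率计算每篇文档的困惑度,再按比例保留子集(此处假设保留低困惑度文档;论文指出最优选择准则依数据域组成而定,数值均为假设):

```python
import math

def perplexity(token_logprobs):
    """由逐 token 对数概率计算困惑度:exp(-平均 logprob)。"""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_by_perplexity(doc_logprobs, keep_ratio=0.5):
    """doc_logprobs: {文档id: [各 token 的 logprob]};保留困惑度最低的一部分文档。"""
    ranked = sorted(doc_logprobs.items(), key=lambda kv: perplexity(kv[1]))
    keep = max(1, int(len(ranked) * keep_ratio))
    return [doc_id for doc_id, _ in ranked[:keep]]

docs = {
    "clean": [-0.1, -0.2, -0.15],   # 参考模型给出的高概率(假设值)
    "noisy": [-2.5, -3.0, -2.8],    # 低概率、高困惑度(假设值)
}
kept = prune_by_perplexity(docs, keep_ratio=0.5)
```

实践中logprobs由小参考模型(如论文中的1.25亿参数模型)对每篇文档前向计算得到,剪枝后的子集再用于大模型的预训练。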

[NLP-58] Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning
[NLP-58] 揭示编码数据指令微调对大型语言模型推理的影响

链接: https://arxiv.org/abs/2405.20535
作者: Xinlu Zhang,Zhiyu Zoey Chen,Xi Ye,Xianjun Yang,Lichang Chen,William Yang Wang,Linda Ruth Petzold
关键词: pretrained Large Language, Large Language Models, Large Language, Instruction Fine-Tuning, pretrained Large
中文关键词: 预训练的大型语言、大型语言模型、大型语言、指令微调、预训练的大型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost reasoning abilities during LLM pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: How does coding data impact LLMs’ reasoning capacities during the IFT stage? To explore this, we thoroughly examine the impact of coding data across different coding data proportions, model families, sizes, and reasoning domains, from various perspectives. Specifically, we create three IFT datasets with increasing coding data proportions, fine-tune six LLM backbones across different families and scales on these datasets, evaluate the tuned models’ performance across twelve tasks in three reasoning domains, and analyze the outcomes from three broad-to-granular perspectives: overall, domain-level, and task-specific. Our holistic analysis provides valuable insights in each perspective. First, coding data tuning enhances the overall reasoning capabilities of LLMs across different model families and scales. Moreover, the effect of coding data varies among different domains but shows consistent trends across model families and scales within each domain. Additionally, coding data generally yields comparable task-specific benefits across different model families, with the optimal coding data proportions in IFT datasets being task-specific.
摘要:指令微调(IFT)显著增强了预训练大型语言模型(LLM)的零样本(zero-shot)能力。虽然已知编码数据在LLM预训练期间可以提升推理能力,但它在IFT期间激活内部推理能力的作用仍未得到充分研究。本文研究了一个关键问题:编码数据在IFT阶段如何影响LLM的推理能力?为了探索这一点,我们从不同的角度全面检查了编码数据在不同编码数据比例、模型家族、规模和推理领域上的影响。具体地说,我们创建了三个编码数据比例不断增加的IFT数据集,在这些数据集上微调了不同家族和规模的六个LLM主干,评估了调整后的模型在三个推理领域的十二个任务上的性能,并从总体、领域级和任务级三个由粗到细的角度分析了结果。我们的整体分析在每个角度都提供了有价值的见解。首先,编码数据调整增强了不同模型家族和规模上的LLM的整体推理能力。此外,编码数据的效果在不同领域中有所不同,但在每个领域内的不同模型家族和规模上显示出一致的趋势。此外,编码数据通常会在不同的模型家族中产生类似的特定于任务的好处,IFT数据集中的最佳编码数据比例是特定于任务的。

[NLP-59] An Automatic Question Usability Evaluation Toolkit
[NLP-59] 自动问题可用性评估工具包

链接: https://arxiv.org/abs/2405.20529
作者: Steven Moore,Eamon Costello,Huy A. Nguyen,John Stamper
关键词: Evaluating multiple-choice questions, overlooking deeper question, deeper question design, Evaluating multiple-choice, Scalable Automatic Question
中文关键词: 评估多项选择题,忽略更深层次的问题,更深层次的问题设计,评估多项选择题,可扩展自动问题
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Artificial Intelligence in Education 2024

点击查看摘要

Abstract:Evaluating multiple-choice questions (MCQs) involves either labor intensive human assessments or automated methods that prioritize readability, often overlooking deeper question design flaws. To address this issue, we introduce the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs. By harnessing the latest in large language models such as GPT-4, advanced word embeddings, and Transformers designed to analyze textual complexity, SAQUET effectively pinpoints and assesses a wide array of flaws in MCQs. We first demonstrate the discrepancy between commonly used automated evaluation metrics and the human assessment of MCQ quality. Then we evaluate SAQUET on a diverse dataset of MCQs across the five domains of Chemistry, Statistics, Computer Science, Humanities, and Healthcare, showing how it effectively distinguishes between flawed and flawless questions, providing a level of analysis beyond what is achievable with traditional metrics. With an accuracy rate of over 94% in detecting the presence of flaws identified by human evaluators, our findings emphasize the limitations of existing evaluation methods and showcase potential in improving the quality of educational assessments.
摘要:多项选择题(MCQ)的评估要么涉及劳动密集型的人工评估,要么涉及优先考虑可读性的自动化方法,往往忽略了更深层次的问题设计缺陷。为了解决这个问题,我们引入了可扩展的自动问题可用性评估工具包(SAQUET),这是一个利用项目写作缺陷(IWF)标准对MCQ进行全面和自动化质量评估的开源工具。通过利用最新的大型语言模型(如GPT-4)、高级词嵌入和旨在分析文本复杂性的Transformer,SAQUET有效地定位和评估了MCQ中的一系列缺陷。我们首先展示了常用的自动化评估指标和人工评估MCQ质量之间的差异。然后,我们在化学、统计、计算机科学、人文和医疗五个领域的不同MCQ数据集上对SAQUET进行评估,展示了它如何有效地区分有缺陷和无缺陷的问题,提供了超过传统指标所能达到的分析水平。在检测人类评估者识别的缺陷方面的准确率超过94%,我们的发现强调了现有评估方法的局限性,并展示了提高教育评估质量的潜力。

[NLP-60] Towards Ontology-Enhanced Representation Learning for Large Language Models
[NLP-60] 面向大型语言模型的本体增强表示学习

链接: https://arxiv.org/abs/2405.20527
作者: Francesco Ronzano,Jay Nanavati
关键词: embedding-Large Language Model, Language Model, knowledge infusion aims, Taking advantage, embedding-Large Language
中文关键词: 嵌入-大型语言模型、语言模型、知识注入目标、利用优势、嵌入-大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure

点击查看摘要

Abstract:Taking advantage of the widespread use of ontologies to organise and harmonize knowledge across several distinct domains, this paper proposes a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing the knowledge formalized by a reference ontology: ontological knowledge infusion aims at boosting the ability of the considered LLM to effectively model the knowledge domain described by the infused ontology. The linguistic information (i.e. concept synonyms and descriptions) and structural information (i.e. is-a relations) formalized by the ontology are utilized to compile a comprehensive set of concept definitions, with the assistance of a powerful generative LLM (i.e. GPT-3.5-turbo). These concept definitions are then employed to fine-tune the target embedding-LLM using a contrastive learning framework. To demonstrate and evaluate the proposed approach, we utilize the biomedical disease ontology MONDO. The results show that embedding-LLMs enhanced by ontological disease knowledge exhibit an improved capability to effectively evaluate the similarity of in-domain sentences from biomedical documents mentioning diseases, without compromising their out-of-domain performance.
摘要:利用本体被广泛用于组织和协调多个不同领域知识这一优势,本文提出了一种新方法,通过注入由参考本体形式化的知识来改进目标嵌入式大型语言模型(embedding-LLM):本体知识注入旨在提升所考虑的LLM对被注入本体所描述的知识领域进行有效建模的能力。在强大的生成式LLM(即GPT-3.5-turbo)的帮助下,利用本体形式化的语言信息(即概念同义词和描述)和结构信息(即is-a关系)编译出一套全面的概念定义。然后,使用这些概念定义,借助对比学习框架对目标embedding-LLM进行微调。为了演示和评估所提出的方法,我们使用了生物医学疾病本体MONDO。结果表明,经疾病本体知识增强的embedding-LLM在不影响领域外性能的前提下,能够更有效地评估提及疾病的生物医学文档中领域内句子的相似度。

[NLP-61] Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
[NLP-61] 从多项选择题自动生成和标记知识组件

链接: https://arxiv.org/abs/2405.20526
作者: Steven Moore,Robin Schmucker,Tom Mitchell,John Stamper
关键词: Knowledge Components, enrich analytics, facilitate adaptivity, Large Language Model, enhance the measurement
中文关键词: 知识组件、丰富分析、促进适应性、大型语言模型、增强测量
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Learning @ Scale 2024

点击查看摘要

Abstract:Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning. We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those made by humans through evaluation from three domain experts in each subject area. This evaluation aimed to determine whether, in instances of non-matching KCs, evaluators showed a preference for the LLM-generated KCs over their human-created counterparts. We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content. Our most effective LLM strategy accurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with even higher success when considering the top five KC suggestions. Human evaluators favored LLM-generated KCs, choosing them over human-assigned ones approximately two-thirds of the time, a preference that was statistically significant across both domains. Our clustering algorithm successfully grouped questions by their underlying KCs without needing explicit labels or contextual information. This research advances the automation of KC generation and classification for assessment items, alleviating the need for student data or predefined KC labels.
摘要:与评估挂钩的知识组件(KCs)增强了对学生学习的测量,丰富了分析,并促进了适应性。然而,生成KC并将其链接到评估项目需要大量的工作和特定领域的知识。为了简化高等教育课程的这一过程,我们使用GPT-4为化学和电子学习中的多项选择题(MCQ)生成KC。我们通过每个主题领域的三位领域专家的评估,分析了大语言模型(LLM)生成的KCs与人类生成的KCs之间的差异。这项评价旨在确定,在不匹配的KC的情况下,评估者是否表现出对LLM生成的KC的偏好,而不是其人类创造的对应KC。我们还开发了一个本体归纳算法来根据问题的内容对评估相似知识的问题进行聚类。我们最有效的LLM策略准确地匹配了56%的化学和35%的E-Learning MCQ的KC,在考虑前五个KC建议时,成功率甚至更高。人类评估者更喜欢LLM生成的KC,大约三分之二的时间选择它们而不是人类分配的KC,这一偏好在两个领域都有统计学意义。我们的聚类算法成功地根据问题的基本知识集对问题进行了分组,而不需要明确的标签或上下文信息。这项研究推进了评估项目KC生成和分类的自动化,减少了对学生数据或预定义KC标签的需求。

[NLP-62] How Multilingual Are Large Language Models Fine-Tuned for Translation?
[NLP-62] 大型语言模型如何针对翻译进行微调?

链接: https://arxiv.org/abs/2405.20512
作者: Aquia Richburg,Marine Carpuat
关键词: translation systems trained, large language models, fine-tuning large language, outperform dedicated translation, dedicated translation systems
中文关键词: 经过训练的翻译系统、大型语言模型、微调大型语言、优于专用翻译、专用翻译系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A new paradigm for machine translation has recently emerged: fine-tuning large language models (LLM) on parallel text has been shown to outperform dedicated translation systems trained in a supervised fashion on much larger amounts of parallel data (Xu et al., 2024a; Alves et al., 2024). However, it remains unclear whether this paradigm can enable massively multilingual machine translation or whether it requires fine-tuning dedicated models for a small number of language pairs. How does translation fine-tuning impact the MT capabilities of LLMs for zero-shot languages, zero-shot language pairs, and translation tasks that do not involve English? To address these questions, we conduct an extensive empirical evaluation of the translation quality of the TOWER family of language models (Alves et al., 2024) on 132 translation tasks from the multi-parallel FLORES-200 data. We find that translation fine-tuning improves translation quality even for zero-shot languages on average, but that the impact is uneven depending on the language pairs involved. These results call for further research to effectively enable massively multilingual translation with LLMs.
摘要:最近出现了一种新的机器翻译范式:在平行文本上微调的大型语言模型(LLM)已被证明优于以监督方式在大得多的平行数据上训练的专用翻译系统(Xu等人,2024a;Alves等人,2024)。然而,目前尚不清楚这一范式能否实现大规模多语言机器翻译,还是只适用于为少数语言对微调专用模型。翻译微调对LLM在零样本语言、零样本语言对以及不涉及英语的翻译任务上的机器翻译能力有何影响?为了解答这些问题,我们基于多向平行的FLORES-200数据,对TOWER系列语言模型(Alves等人,2024)在132个翻译任务上的翻译质量进行了广泛的实证评估。我们发现,平均而言,翻译微调即使对零样本语言也能提高翻译质量,但其影响因所涉及的语言对而异。这些结果表明,要利用LLM有效实现大规模多语言翻译,还需要进一步研究。

[NLP-63] SPOT: Text Source Prediction from Originality Score Thresholding
[NLP-63] SPOT:来自原创性分数预设的文本源预测

链接: https://arxiv.org/abs/2405.20505
作者: Edouard Yvinec,Gabriel Kasser
关键词: large language models, social risks, wide acceptance, acceptance of large, large language
中文关键词: 大型语言模型、社会风险、广泛接受、接受大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The wide acceptance of large language models (LLMs) has unlocked new applications and social risks. Popular countermeasures aim at detecting misinformation, usually involve domain specific models trained to recognize the relevance of any information. Instead of evaluating the validity of the information, we propose to investigate LLM generated text from the perspective of trust. In this study, we define trust as the ability to know if an input text was generated by a LLM or a human. To do so, we design SPOT, an efficient method, that classifies the source of any, standalone, text input based on originality score. This score is derived from the prediction of a given LLM to detect other LLMs. We empirically demonstrate the robustness of the method to the architecture, training data, evaluation data, task and compression of modern LLMs.
摘要:大型语言模型(LLM)的广泛接受开启了新的应用程序和社会风险。流行的对策旨在检测错误信息,通常涉及经过训练以识别任何信息相关性的领域特定模型。我们建议从信任的角度调查LLM生成的文本,而不是评估信息的有效性。在这项研究中,我们将信任定义为知道输入文本是由LLM还是人类生成的能力。为此,我们设计了SPOT,这是一种有效的方法,可以根据原创性评分对任何独立文本输入的来源进行分类。该分数源自对给定LLM的预测,以检测其他LLM。我们通过经验证明了该方法对现代LLM的架构、训练数据、评估数据、任务和压缩的稳健性。
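按摘要的思路,SPOT的核心是为独立文本计算一个"原创性分数"并做阈值判定。下面是一个纯Python的极简草图,用一个假设的一元词概率表代替真实的检测LLM,分数取平均每词对数概率(词表、阈值与打分方式均为示意,并非论文的实际实现):

```python
import math

def originality_score(text, token_logprob):
    """在检测模型下计算文本的平均每词对数概率,作为"原创性分数"。
    分数越高,文本对检测模型越"可预测"(仅为示意)。"""
    tokens = text.lower().split()
    if not tokens:
        return float("-inf")
    return sum(token_logprob(t) for t in tokens) / len(tokens)

def classify_source(text, token_logprob, threshold):
    """SPOT 式的阈值判定:分数高于阈值判为 LLM 生成,否则判为人写。"""
    return "llm" if originality_score(text, token_logprob) > threshold else "human"

# 玩具检测模型:一个假设的一元词概率表(真实方法使用给定 LLM 的预测)
UNIGRAM = {"the": 0.08, "model": 0.02, "is": 0.05, "a": 0.06}
def toy_logprob(token):
    return math.log(UNIGRAM.get(token, 1e-6))

print(classify_source("the model is a model", toy_logprob, threshold=-6.0))  # llm
print(classify_source("zzyzx qwfp zzyzx", toy_logprob, threshold=-6.0))      # human
```

论文中的分数来自一个给定LLM的预测,而非一元词表;这里只展示"打分 + 阈值"这一判定框架。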

[NLP-64] Transfer Q Star: Principled Decoding for LLM Alignment
[NLP-64] Transfer Q Star:LLM对齐的原则性解码

链接: https://arxiv.org/abs/2405.20495
作者: Souradip Chakraborty,Soumya Suvra Ghosal,Ming Yin,Dinesh Manocha,Mengdi Wang,Amrit Singh Bedi,Furong Huang
关键词: Aligning foundation models, Aligning foundation, trustworthy deployment, texttt, safe and trustworthy
中文关键词: 调整基础模型,调整基础,值得信赖的部署,文本,安全且值得信赖
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward r , thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ( Q^* ), which is often unavailable in practice. Hence, prior SoTA methods either approximate this Q^* using Q^{\pi_{\texttt{sft}}} (derived from the reference \texttt{SFT} model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer Q^* , which implicitly estimates the optimal value function for a target reward r through a baseline model \rho_{\texttt{BL}} aligned with a baseline reward r_{\texttt{BL}} (which can be different from the target reward r ). Theoretical analyses of Transfer Q^* provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference \texttt{SFT} model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
摘要:对齐基础模型对于其安全可信的部署至关重要。然而,传统的微调方法计算量大,需要更新数十亿个模型参数。一种有前景的替代方案是通过解码实现对齐:无需更新模型,直接调整响应分布以最大化目标奖励r,从而提供了一个轻量且自适应的对齐框架。然而,原则性解码方法依赖于对最优Q函数(Q^*)的oracle访问,而这在实践中往往不可得。因此,现有的SoTA方法要么用Q^{\pi_{sft}}(源自参考SFT模型)来近似这个Q^*,要么依赖短期奖励,从而导致次优的解码性能。在这项工作中,我们提出了Transfer Q^*,它通过与基线奖励r_{BL}对齐的基线模型\rho_{BL}(基线奖励可以不同于目标奖励r)来隐式估计目标奖励r的最优值函数。对Transfer Q^*的理论分析严格刻画了其最优性,推导出次优性差距的上界,并识别出一个超参数,可根据用户需求控制与预训练参考SFT模型的偏离。我们的方法显著缩小了以往SoTA方法中观察到的次优性差距,并在多个合成和真实数据集上的广泛测试中,在一致性、多样性和质量等关键指标上展示了卓越的经验性能。
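通过解码实现对齐的基本机制,可以用"基础模型对数概率 + 价值估计"的单步选词来示意。下面是一个极简草图(候选词表、价值估计和系数alpha均为假设,并非Transfer Q^*的实际算法,后者还涉及从基线奖励到目标奖励的迁移):

```python
def reward_guided_step(base_logprobs, value_estimate, alpha=1.0):
    """解码的一步:把基础模型的对数概率与每个候选词的价值估计线性组合,
    选出综合得分最高的下一个词(示意用;alpha 控制向奖励方向的偏移)。"""
    scores = {tok: lp + alpha * value_estimate.get(tok, 0.0)
              for tok, lp in base_logprobs.items()}
    return max(scores, key=scores.get)

base = {"helpful": -1.2, "harmful": -0.4}   # 基础模型本身更偏向 "harmful"
value = {"helpful": 1.0, "harmful": -2.0}   # 假设的价值函数估计

print(reward_guided_step(base, value, alpha=0.0))  # 不引导时选 "harmful"
print(reward_guided_step(base, value, alpha=1.0))  # 引导后选 "helpful"
```

alpha=0退化为普通解码;这正是"无需微调、仅在解码时调整响应分布"这一思路的最小示例。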

[NLP-65] Phantom: General Trigger Attacks on Retrieval Augmented Language Generation
[NLP-65] Phantom:对检索增强语言生成的通用触发攻击

链接: https://arxiv.org/abs/2405.20485
作者: Harsh Chaudhari,Giorgio Severi,John Abascal,Matthew Jagielski,Christopher A. Choquette-Choo,Milad Nasr,Cristina Nita-Rotaru,Alina Oprea
关键词: Retrieval Augmented Generation, modern large language, large language models, RAG augmented LLMs, Retrieval Augmented
中文关键词: 检索增强生成、现代大型语言、大型语言模型、RAG增强LLM、检索增强
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim’s RAG system, by injecting a single malicious document in its knowledge database. We design Phantom, general two-step attack framework against RAG augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as backdoor, is present in the victim’s queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.
摘要:检索增强生成(RAG)扩展了聊天机器人应用中现代大型语言模型(LLM)的能力,使开发人员无需昂贵的训练或微调即可调整和个性化LLM输出。RAG系统使用外部知识数据库为给定查询检索最相关的文档,并将此上下文提供给LLM生成器。虽然RAG在许多应用中实现了令人印象深刻的实用性,但采用它来支持个性化生成模型也带来了新的安全风险。在这项工作中,我们提出了新的攻击面:攻击者只需在受害者的知识数据库中注入单个恶意文档,即可危害其RAG系统。我们设计了Phantom,一个针对RAG增强LLM的通用两步攻击框架。第一步是精心构造一个有毒文档,仅当受害者的查询中出现对抗性触发器(充当后门的特定单词序列)时,RAG系统才会将其检索进top-k结果。第二步,有毒文档中精心构造的对抗性字符串会在LLM生成器中触发各种对抗性攻击,包括拒绝服务、声誉损害、侵犯隐私和有害行为。我们在多个LLM架构上演示了我们的攻击,包括Gemma、Vicuna和Llama。

[NLP-66] Automated Focused Feedback Generation for Scientific Writing Assistance
[NLP-66] 自动生成有针对性的反馈以帮助科学写作

链接: https://arxiv.org/abs/2405.20477
作者: Eric Chamoun,Michael Schlichktrull,Andreas Vlachos
关键词: Scientific writing, novice researchers, Scientific, Scientific WrIting Focused, challenging task
中文关键词: 科学写作、新手研究人员、科学、科学写作专注、具有挑战性的任务
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:Scientific writing is a challenging task, particularly for novice researchers who often rely on feedback from experienced peers. Recent work has primarily focused on improving surface form and style rather than manuscript content. In this paper, we propose a novel task: automated focused feedback generation for scientific writing assistance. We present SWIF^2T: a Scientific WrIting Focused Feedback Tool. It is designed to generate specific, actionable and coherent comments, which identify weaknesses in a scientific paper and/or propose revisions to it. Our approach consists of four components - planner, investigator, reviewer and controller - leveraging multiple Large Language Models (LLMs) to implement them. We compile a dataset of 300 peer reviews citing weaknesses in scientific papers and conduct human evaluation. The results demonstrate the superiority in specificity, reading comprehension, and overall helpfulness of SWIF^2T’s feedback compared to other approaches. In our analysis, we also identified cases where automatically generated reviews were judged better than human ones, suggesting opportunities for integration of AI-generated feedback in scientific writing.
摘要:科学写作是一项具有挑战性的任务,尤其是对于新手研究人员来说,他们往往依赖经验丰富的同行的反馈。最近的工作主要集中在改进表面形式和风格上,而不是手稿内容。在本文中,我们提出了一个新的任务:自动生成科技写作辅助的聚焦反馈。我们介绍了SWIF^2T:一个以科学写作为中心的反馈工具。它的目的是产生具体的、可操作的和连贯的评论,找出科学论文中的弱点和/或对其提出修改。我们的方法由四个组件组成-规划者、调查者、审查者和控制器-利用多个大型语言模型(LLM)来实现它们。我们汇编了300个同行评议的数据集,引用科学论文中的弱点,并进行人类评估。结果表明,与其他方法相比,SWIF^2T的反馈在特异性、阅读理解力和总体帮助方面具有优势。在我们的分析中,我们还确定了自动生成的评论比人类的评价更好的情况,这表明有机会将人工智能生成的反馈整合到科学写作中。

[NLP-67] Extending the Massive Text Embedding Benchmark to French
[NLP-67] 将海量文本嵌入基准扩展到法语

链接: https://arxiv.org/abs/2405.20468
作者: Mathieu Ciancone,Imene Kerboua,Marion Schaeffer,Wissam Siblini
关键词: Massive Text Embedding, Text Embedding Benchmark, recent years, NLP tasks, numerous embedding models
中文关键词: 海量文本嵌入、文本嵌入基准、近年来、NLP任务、众多嵌入模型
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, numerous embedding models have been made available and widely used for various NLP tasks. Choosing a model that performs well for several tasks in English has been largely simplified by the Massive Text Embedding Benchmark (MTEB), but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. Not only we gather 22 existing datasets in an easy-to-use interface, but we also create three new French datasets for a global evaluation over 8 different tasks. We perform a large scale comparison with 46 carefully selected embedding models, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find out that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform particularly well. Our work comes with open-source code, new datasets and a public leaderboard.
摘要:近年来,已有大量的嵌入模型被提供并广泛用于各种自然语言处理任务。大规模文本嵌入基准测试(MTEB)在很大程度上简化了选择一个在几个英语任务中表现良好的模型,但对其他语言的扩展仍然具有挑战性。这就是为什么我们扩展MTEB,提出第一个大规模的法语句子嵌入基准。我们不仅在一个易于使用的界面中收集了22个现有数据集,而且还创建了三个新的法语数据集,用于对8个不同任务的全球评估。我们对46个精心选择的嵌入模型进行了大规模的比较,进行了全面的统计测试,并分析了模型性能与其许多特征之间的相关性。我们发现,即使没有模型在所有任务中都是最好的,预先训练句子相似性的大型多语言模型也表现得特别好。我们的工作包括开源代码、新的数据集和公共排行榜。
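这类句子嵌入基准的打分核心,通常是对嵌入向量做余弦相似度排序。下面用一个纯Python的极简草图示意检索式评估中"按相似度排名"这一步(向量与文档编号均为虚构示例,并非MTEB的实际评测代码):

```python
def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """按与查询向量的余弦相似度从高到低返回文档编号。"""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                  reverse=True)

docs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
print(rank_documents([1.0, 0.1], docs))  # ['d1', 'd2', 'd3']
```

基准测试随后会用nDCG、MAP等指标把这种排名与人工标注的相关性对比,从而比较不同嵌入模型。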

[NLP-68] Scalable Detection of Salient Entities in News Articles
[NLP-68] 新闻文章中突出实体的可扩展检测

链接: https://arxiv.org/abs/2405.20461
作者: Eliyar Asgarieh,Kapil Thadani,Neil O’Hare
关键词: typically mention numerous, articles typically mention, mention numerous entities, typically mention, mention numerous
中文关键词: 通常提到许多,文章通常提到,提到许多实体,通常提到,提到许多
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:News articles typically mention numerous entities, a large fraction of which are tangential to the story. Detecting the salience of entities in articles is thus important to applications such as news search, analysis and summarization. In this work, we explore new approaches for efficient and effective salient entity detection by fine-tuning pretrained transformer models with classification heads that use entity tags or contextualized entity representations directly. Experiments show that these straightforward techniques dramatically outperform prior work across datasets with varying sizes and salience definitions. We also study knowledge distillation techniques to effectively reduce the computational cost of these models without affecting their accuracy. Finally, we conduct extensive analyses and ablation experiments to characterize the behavior of the proposed models.
摘要:新闻文章通常会提到许多实体,其中很大一部分与故事无关。因此,检测文章中实体的显著性对于新闻搜索、分析和摘要等应用非常重要。在这项工作中,我们通过微调预训练的Transformer模型,其具有直接使用实体标签或上下文化实体表示的分类头,来探索高效有效的显着实体检测的新方法。实验表明,这些简单的技术在不同大小和显著性定义的数据集上的表现显着优于之前的工作。我们还研究知识提炼技术,以有效降低这些模型的计算成本,而不影响其准确性。最后,我们进行了广泛的分析和消融实验来描述所提出模型的行为。
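摘要中"使用实体标签"的分类器输入,可以用给目标实体的提及加标记符号来示意。下面是一个极简草图(标记符号[ENT]/[/ENT]为假设写法,并非论文规定的特殊token;真实流程还需处理共指与同名歧义):

```python
def mark_entity(text, mention, open_tag="[ENT]", close_tag="[/ENT]"):
    """把目标实体的每次出现包在标记之间,构造显著性分类器的输入
    (示意用:让 Transformer 能注意到待判定的实体提及)。"""
    return text.replace(mention, f"{open_tag} {mention} {close_tag}")

print(mark_entity("Apple hired Smith. Apple shares rose.", "Apple"))
```

随后,分类头读取带标记的序列,为该实体输出"显著 / 不显著"的判定。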

[NLP-69] Enhancing Antibiotic Stewardship using a Natural Language Approach for Better Feature Representation
[NLP-69] 使用自然语言方法加强抗生素管理以更好地表示特征

链接: https://arxiv.org/abs/2405.20419
作者: Simon A. Lee,Trevor Brokowski,Jeffrey N. Chiang
关键词: global healthcare crisis, undermining the efficacy, rapid emergence, emergence of antibiotic-resistant, antibiotic-resistant bacteria
中文关键词: 全球医疗危机,削弱功效,迅速出现,抗生素耐药细菌的出现
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid emergence of antibiotic-resistant bacteria is recognized as a global healthcare crisis, undermining the efficacy of life-saving antibiotics. This crisis is driven by the improper and overuse of antibiotics, which escalates bacterial resistance. In response, this study explores the use of clinical decision support systems, enhanced through the integration of electronic health records (EHRs), to improve antibiotic stewardship. However, EHR systems present numerous data-level challenges, complicating the effective synthesis and utilization of data. In this work, we transform EHR data into a serialized textual representation and employ pretrained foundation models to demonstrate how this enhanced feature representation can aid in antibiotic susceptibility predictions. Our results suggest that this text representation, combined with foundation models, provides a valuable tool to increase interpretability and support antibiotic stewardship efforts.
摘要:抗生素耐药细菌的迅速出现被认为是一场全球医疗危机,削弱了救生抗生素的功效。这场危机是由抗生素的不当和过度使用造成的,从而加剧了细菌耐药性。作为回应,这项研究探索了通过集成电子健康记录(EHR)来增强临床决策支持系统的使用,以改善抗生素管理。然而,EHR系统面临许多数据级挑战,使数据的有效合成和利用变得复杂。在这项工作中,我们将EHR数据转换为序列化的文本表示,并采用预先训练的基础模型来演示这种增强的特征表示如何帮助抗生素敏感性预测。我们的结果表明,这种文本表示与基础模型相结合,提供了一个有价值的工具来增加可解释性并支持抗生素管理工作。
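摘要中"将EHR数据转换为序列化的文本表示"这一步,可以用把结构化记录展平为键值文本来示意。下面是一个极简草图(字段名与分隔符均为假设,并非论文的实际模式):

```python
def serialize_record(record):
    """把一行结构化 EHR 数据展平为文本序列,供预训练语言模型编码
    (示意用;字段名与分隔符为假设)。"""
    return " ; ".join(f"{k}: {v}" for k, v in record.items())

row = {"age": 67, "sex": "F", "specimen": "urine", "organism": "E. coli"}
print(serialize_record(row))
```

序列化后的文本再交给预训练基础模型编码,其输出向量即可用作下游药敏预测模型的特征。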

[NLP-70] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
[NLP-70] 通过密码字符破解大型语言模型对抗调节护栏

链接: https://arxiv.org/abs/2405.20413
作者: Haibo Jin,Andy Zhou,Joe D. Menke,Haohan Wang
关键词: Large Language Models, Large Language, bypass protective measures, carefully crafted prompts, Language Models
中文关键词: 大型语言模型、大型语言、绕过保护措施、精心制作的提示、语言模型
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as ``jailbreaks’', which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success ( \sim 19.88\times ) and lower filtered-out rates ( \sim 1/6\times ) than baselines.
摘要:大型语言模型(LLM)通常是无害的,但仍然容易受到被称为"越狱"的精心设计提示的影响,这些提示可以绕过保护措施并诱发有害行为。LLM的最新进展加入了可以过滤输出的审核护栏,使某些恶意问题触发处理错误。现有的红队基准往往没有包含会触发审核护栏的问题,这使得评估越狱的有效性变得困难。为了解决这个问题,我们引入了JAMBench,一个旨在触发和评估审核护栏的有害行为基准。JAMBench包含160条手工编写的指令,涵盖多个严重级别的四个主要风险类别。此外,我们提出了一种越狱方法JAM(Jailbreak Against Moderation),它使用越狱前缀绕过输入级过滤器,并使用一个功能上等价于护栏模型的微调影子模型生成密码字符以绕过输出级过滤器,从而攻击审核护栏。我们在四个LLM上的大量实验表明,JAM获得了比基线更高的越狱成功率(约19.88倍)和更低的滤出率(约1/6)。

[NLP-71] SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
[NLP-71] 无限期ExpressiveLM:具有思想链的表达性言语翻译的语音语言模型

链接: https://arxiv.org/abs/2405.20410
作者: Hongyu Gong,Bandhav Veluri
关键词: key research topic, seamless communication, key research, research topic, topic in seamless
中文关键词: 重点研究课题,无缝通信,重点研究,研究课题,无缝课题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker style aligned speech in order to directly learn the mapping from speech to target speech spectrogram. Without reliance on style aligned data, recent studies leverage the advances of language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translations, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer, meanwhile achieving better parameter efficiency.
摘要:表现性语音到语音翻译(S2ST)是无缝通信领域的一个重要研究课题,其重点是在翻译后的语音中保留语义和说话人的发声风格。早期的工作合成说话人风格对齐的语音,以便直接学习从语音到目标语谱图的映射。最近的研究不依赖风格对齐数据,而是利用语言建模(LM)的进展,基于语义和声学标记构建级联LM。本文提出了SeamlessExpressiveLM,一种面向表现性S2ST的单一语音语言模型。我们通过思维链提示将复杂的源到目标语音映射分解为多个中间生成步骤。模型首先被引导翻译目标语义内容,然后将说话人风格迁移到多流声学单元。在西班牙语到英语和匈牙利语到英语的翻译上进行评估,SeamlessExpressiveLM在语义质量和风格迁移方面都优于级联LM,同时获得了更好的参数效率。

[NLP-72] XPrompt:Explaining Large Language Models Generation via Joint Prompt Attribution
[NLP-72] XPrompt:通过联合提示归因解释大型语言模型的生成

链接: https://arxiv.org/abs/2405.20404
作者: Yurui Chang,Bochuan Cao,Yujia Wang,Jinghui Chen,Lu Lin
关键词: Large Language Models, demonstrated impressive performances, Large Language, demonstrated impressive, impressive performances
中文关键词: 大型语言模型,展示了令人印象深刻的性能,大型语言,展示了令人印象深刻的性能
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performances in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of elucidating and explaining the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to be classification or next-word prediction. Few initial attempts aiming to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on joint prompt attribution, XPrompt, which aims to explain how a few prompt texts collaboratively influences the LLM’s complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the casual input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both faithfulness and efficiency of our framework.
摘要:大型语言模型在复杂的文本生成任务中表现出了令人印象深刻的性能。然而,输入提示对生成的内容的贡献对人类来说仍然是模糊的,这突显了阐明和解释输入和输出对之间的因果关系的必要性。现有的提供特定于提示的解释的工作往往将模型输出限制为分类或下一词预测。很少有旨在解释整个语言生成的初始尝试经常独立地对待输入提示文本,而忽略了它们对后续生成的组合影响。在本研究中,我们引入了一个基于联合提示归因的反事实解释框架XPrompt,该框架旨在解释几个提示文本如何协同影响LLM的完整生成。具体地说,我们将生成解释的即时归因问题描述为一个组合优化问题,并引入了一种概率算法来搜索离散空间中的随机输入组合。我们定义并利用多个度量来评估产生的解释,证明了我们框架的忠实性和效率。
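"提示归因即组合优化"这一表述,可以在提示组件很少时用穷举搜索来示意。下面是一个极简草图(组件名、目标函数和权重均为虚构;XPrompt实际采用概率搜索算法来应对组合爆炸,而非穷举):

```python
from itertools import combinations

def best_prompt_subset(components, score, k):
    """在小规模提示上穷举所有 k 元组件组合,返回目标函数得分最高的一组
    (示意用;真实方法在离散空间中做概率搜索)。"""
    return max(combinations(components, k), key=score)

# 假设的提示组件及其对生成结果的"贡献"权重
weights = {"instruction": 3.0, "example": 1.0, "persona": 2.0}

def toy_score(subset):
    """玩具目标函数:组合的贡献之和(真实目标衡量对 LLM 生成的影响)。"""
    return sum(weights[c] for c in subset)

print(best_prompt_subset(list(weights), toy_score, 2))
```

归因结果(得分最高的组件组合)即可作为对"哪些提示文本共同驱动了这段生成"的反事实解释。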

[NLP-73] Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
[NLP-73] 没有幻觉?评估领先人工智能法律研究工具的可靠性

链接: https://arxiv.org/abs/2405.20362
作者: Varun Magesh,Faiz Surani,Matthew Dahl,Mirac Suzgun,Christopher D. Manning,Daniel E. Ho
关键词: incorporating artificial intelligence, products incorporating artificial, artificial intelligence, practice has witnessed, witnessed a sharp
中文关键词: 融入人工智能的产品,融入人工、人工智能的产品,实践见证了,见证了尖锐的
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Our dataset, tool outputs, and labels will be made available upon publication. This version of the manuscript (May 30, 2024) is updated to reflect an evaluation of Westlaw’s AI-Assisted Research

点击查看摘要

Abstract:Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to “hallucinate,” or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as “eliminating” (Casetext, 2023) or “avoid[ing]” hallucinations (Thomson Reuters, 2023), or guaranteeing “hallucination-free” legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers’ claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.
摘要:法律实践中融入人工智能(AI)的产品急剧增加。这类工具旨在协助完成广泛的核心法律任务,从判例法的检索和摘要到文件起草。但这些工具中使用的大型语言模型容易产生"幻觉",即编造虚假信息,使其在高风险领域的使用存在风险。最近,某些法律研究提供商宣称检索增强生成(RAG)等方法能够"消除"幻觉(Casetext,2023)或"避免"幻觉(Thomson Reuters,2023),或保证"无幻觉"的法律引用(LexisNexis,2023)。由于这些系统的封闭性,系统地评估这些宣称具有挑战性。在这篇文章中,我们设计并报告了对AI驱动的法律研究工具的首次预注册实证评估。我们证明,供应商的宣称被夸大了。虽然与通用聊天机器人(GPT-4)相比幻觉有所减少,但我们发现,LexisNexis(Lexis+ AI)和汤森路透(Westlaw AI-Assisted Research和Ask Practical Law AI)的人工智能研究工具各自产生幻觉的比例在17%到33%之间。我们还记录了不同系统在响应性和准确性方面的显著差异。我们的文章有四个主要贡献。第一,它首次评估并报告了基于RAG的专有法律AI工具的性能。第二,它引入了一个全面的、预注册的数据集,用于识别和理解这些系统中的漏洞。第三,它提出了区分幻觉与准确法律回应的清晰类型学。最后,它为法律专业人员在监督和核实AI输出方面的责任提供了证据依据,这仍然是将AI负责任地融入法律的一个核心未决问题。

[NLP-74] Literature Filtering for Systematic Reviews with Transformers
[NLP-74] 《变形金刚》系统性评论的文献过滤

链接: https://arxiv.org/abs/2405.20354
作者: John Hawkins,David Tivey
关键词: Identifying critical research, Identifying critical, growing body, body of academic, essential element
中文关键词: 识别批判性研究,识别批判性的、成长的身体、学术的身体、基本要素
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Identifying critical research within the growing body of academic work is an essential element of quality research. Systematic review processes, used in evidence-based medicine, formalise this as a procedure that must be followed in a research program. However, it comes with an increasing burden in terms of the time required to identify the important articles of research for a given topic. In this work, we develop a method for building a general-purpose filtering system that matches a research question, posed as a natural language description of the required content, against a candidate set of articles obtained via the application of broad search terms. Our results demonstrate that transformer models, pre-trained on biomedical literature then fine tuned for the specific task, offer a promising solution to this problem. The model can remove large volumes of irrelevant articles for most research questions.
摘要:在不断增长的学术工作中识别批判性研究是优质研究的一个重要要素。循证医学中使用的系统审查流程将其正式化为研究计划中必须遵循的程序。然而,就识别特定主题的重要研究文章所需的时间而言,它带来了越来越大的负担。在这项工作中,我们开发了一种构建通用过滤系统的方法,该系统将研究问题(作为所需内容的自然语言描述)与通过应用广泛搜索词获得的候选文章集进行匹配。我们的结果表明,Transformer模型在生物医学文献上预先训练,然后针对特定任务进行微调,为这个问题提供了一个有希望的解决方案。该模型可以删除大多数研究问题的大量无关文章。

[NLP-75] Small Language Models for Application Interactions: A Case Study
[NLP-75] 应用程序交互的小型语言模型:案例研究

链接: https://arxiv.org/abs/2405.20347
作者: Beibin Li,Yi Zhang,Sébastien Bubeck,Jeevan Pathuri,Ishai Menache
关键词: natural language interactions, language interactions, facilitating application usage, Small Language Models, natural language
中文关键词: 自然语言交互、语言交互、促进应用程序使用、小语言模型、自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the efficacy of Small Language Models (SLMs) in facilitating application usage through natural language interactions. Our focus here is on a particular internal application used in Microsoft for cloud supply chain fulfilment. Our experiments show that small models can outperform much larger ones in terms of both accuracy and running time, even when fine-tuned on small datasets. Alongside these results, we also highlight SLM-based system design considerations.
摘要:我们研究了小语言模型(SLM)通过自然语言交互促进应用程序使用的效果。我们这里的重点是微软内部用于云供应链履约的一个特定应用。我们的实验表明,即使是在小型数据集上微调,小模型在准确性和运行时间方面也可以优于大得多的模型。除了这些结果之外,我们还强调了基于SLM的系统设计考虑因素。

[NLP-76] Towards a Fluid computer
[NLP-76] 迈向流体计算机

链接: https://arxiv.org/abs/2405.20999
作者: Robert Cardona,Eva Miranda,Daniel Peralta-Salas
关键词: raised a question, performing computations, hydrodynamics is capable, capable of performing, universal Turing machine
中文关键词: 提出一个问题,进行计算,流体动力学有能力,有能力执行通用图灵机
类目: Dynamical Systems (math.DS); Computation and Language (cs.CL); Analysis of PDEs (math.AP); Symplectic Geometry (math.SG)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:In 1991, Moore [20] raised a question about whether hydrodynamics is capable of performing computations. Similarly, in 2016, Tao [25] asked whether a mechanical system, including a fluid flow, can simulate a universal Turing machine. In this expository article, we review the construction in [8] of a “Fluid computer” in dimension 3 that combines techniques in symbolic dynamics with the connection between steady Euler flows and contact geometry unveiled by Etnyre and Ghrist. In addition, we argue that the metric that renders the vector field Beltrami cannot be critical in the Chern-Hamilton sense [9]. We also sketch the completely different construction for the Euclidean metric in \mathbb R^3 as given in [7]. These results reveal the existence of undecidable fluid particle paths. We conclude the article with a list of open problems.
摘要:1991年,Moore [20]提出了流体动力学是否能够执行计算的问题。类似地,2016年,Tao [25]询问机械系统(包括流体流)能否模拟通用图灵机。在这篇综述性文章中,我们回顾了[8]中三维“流体计算机”的构造,它将符号动力学技术与Etnyre和Ghrist揭示的定常欧拉流与接触几何之间的联系相结合。此外,我们论证了使该向量场成为Beltrami场的度量在Chern-Hamilton意义下[9]不可能是临界的。我们还概述了[7]中针对 \mathbb{R}^3 上欧几里得度量给出的完全不同的构造。这些结果揭示了不可判定的流体粒子路径的存在。最后,我们列出了一系列未解决的问题。

计算机视觉

[CV-0] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

链接: https://arxiv.org/abs/2405.21075
作者: Chaoyou Fu,Yuhan Dai,Yondong Luo,Lei Li,Shuhuai Ren,Renrui Zhang,Zihan Wang,Chenyu Zhou,Yunhang Shen,Mengdan Zhang,Peixian Chen,Yanwei Li,Shaohui Lin,Sirui Zhao,Ke Li,Tong Xu,Xiawu Zheng,Enhong Chen,Rongrong Ji,Xing Sun
关键词: Multi-modal Large Language, Large Language Models, Large Language, artificial general intelligence, Multi-modal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: this https URL

[CV-1] Latent Intrinsics Emerge from Training to Relight

链接: https://arxiv.org/abs/2405.21074
作者: Xiao Zhang,William Gao,Seemandhar Jain,Michael Maire,David.A.Forsyth,Anand Bhattad
关键词: source image, illuminated differently, task of showing, Image, Inverse graphics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image relighting is the task of showing what a scene from a source image would look like if illuminated differently. Inverse graphics schemes recover an explicit representation of geometry and a set of chosen intrinsics, then relight with some form of renderer. However error control for inverse graphics is difficult, and inverse graphics methods can represent only the effects of the chosen intrinsics. This paper describes a relighting method that is entirely data-driven, where intrinsics and lighting are each represented as latent variables. Our approach produces SOTA relightings of real scenes, as measured by standard metrics. We show that albedo can be recovered from our latent intrinsics without using any example albedos, and that the albedos recovered are competitive with SOTA methods.

[CV-2] Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

链接: https://arxiv.org/abs/2405.21070
作者: Xin Wen,Bingchen Zhao,Yilun Chen,Jiangmiao Pang,Xiaojuan Qi
关键词: web-scale vision-language datasets, Severe data imbalance, imbalance naturally exists, Severe data, vision-language datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP’s pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP’s generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: this https URL.

[CV-3] Mixed Diffusion for 3D Indoor Scene Synthesis

链接: https://arxiv.org/abs/2405.21066
作者: Siyi Hu,Diego Martin Arroyo,Stephanie Debats,Fabian Manhardt,Luca Carlone,Federico Tombari
关键词: Realistic conditional, synthesis significantly enhances, provide extensive training, extensive training data, virtual environments
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 14 figures. Under review. Code to be released at: this https URL

点击查看摘要

Abstract:Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully explored in floor-conditioned scene synthesis problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture, designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach uniquely implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task specific training. We show MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments.

[CV-4] Unified Directly Denoising for Both Variance Preserving and Variance Exploding Diffusion Models

链接: https://arxiv.org/abs/2405.21059
作者: Jingjing Wang,Dan Zhang,Feng Luo
关键词: Directly Denoising Diffusion, nascent Directly Denoising, Denoising Diffusion Models, Directly Denoising, Denoising Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Previous work has demonstrated that, in the Variance Preserving (VP) scenario, the nascent Directly Denoising Diffusion Models (DDDM) can generate high-quality images in one step while achieving even better performance in multistep sampling. However, the Pseudo-LPIPS loss used in DDDM leads to concerns about the bias in assessment. Here, we propose a unified DDDM (uDDDM) framework that generates images in one-step/multiple steps for both Variance Preserving (VP) and Variance Exploding (VE) cases. We provide theoretical proofs of the existence and uniqueness of the model’s solution paths, as well as the non-intersecting property of the sampling paths. Additionally, we propose an adaptive Pseudo-Huber loss function to balance the convergence to the true solution and the stability of the convergence process. Through a comprehensive evaluation, we demonstrate that uDDDMs achieve FID scores comparable to the best-performing methods available for CIFAR-10 in both VP and VE. Specifically, uDDDM achieves one-step generation on CIFAR10 with FID of 2.63 and 2.53 for VE and VP respectively. By extending the sampling to 1000 steps, we further reduce FID score to 1.71 and 1.65 for VE and VP respectively, setting state-of-the-art performance in both cases.
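For context on the loss mentioned above: the standard (non-adaptive) Pseudo-Huber loss interpolates between squared error near zero and absolute error for large residuals. The paper's contribution is an adaptive variant whose scheduling is not reproduced here; this is only a minimal numpy sketch of the base loss.

```python
import numpy as np

def pseudo_huber(residual, delta=1.0):
    # delta**2 * (sqrt(1 + (r/delta)**2) - 1): quadratic near zero,
    # asymptotically linear (~ delta * |r|) for large residuals.
    return delta**2 * (np.sqrt(1.0 + (residual / delta) ** 2) - 1.0)

small = pseudo_huber(np.array([0.01]))   # close to the 0.5 * r**2 regime
large = pseudo_huber(np.array([100.0]))  # close to the delta * |r| regime
```

The single parameter `delta` controls where the transition happens, which is presumably what the adaptive scheme adjusts during training.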

[CV-5] An Organic Weed Control Prototype using Directed Energy and Deep Learning

链接: https://arxiv.org/abs/2405.21056
作者: Deng Cao,Hongbo Zhang,Rajveer Dhillon
关键词: improve crop yield, sustainable approach, vital to improve, Organic weed control, weed control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Organic weed control is vital to improving crop yield with a sustainable approach. In this work, a directed energy weed control robot prototype specifically designed for organic farms is proposed. The robot uses a novel distributed array robot (DAR) unit for weed treatment. Soybean and corn databases are built to train deep learning neural nets to perform weed recognition. The initial deep learning neural nets show a high performance in classifying crops. The robot uses a patented directed energy plant eradication recipe that is completely organic and UV-C free, with no chemical damage or physical disturbance to the soil. The deep learning model can classify 8 common weed species in a soybean field under natural environment with up to 98% accuracy.

[CV-6] Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

链接: https://arxiv.org/abs/2405.21050
作者: Xinxi Zhang,Song Wen,Ligong Han,Felix Juefei-Xu,Akash Srivastava,Junzhou Huang,Hao Wang,Molei Tao,Dimitris N. Metaxas
关键词: Adapting large-scale pre-trained, Adapting large-scale, large-scale pre-trained generative, gaining traction, large-scale pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adapting large-scale pre-trained generative models in a parameter-efficient manner is gaining traction. Traditional methods like low rank adaptation achieve parameter efficiency by imposing constraints but may not be optimal for tasks requiring high representation capacity. We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. Using the Kronecker product and efficient Stiefel optimizers, we achieve parameter-efficient adaptation of orthogonal matrices. We introduce Spectral Orthogonal Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity. Extensive evaluations on text-to-image diffusion models demonstrate SODA’s effectiveness, offering a spectrum-aware alternative to existing fine-tuning methods.
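The spectral-adaptation idea can be illustrated with a deliberately simplified numpy sketch: decompose a pretrained weight once with SVD, freeze the bases, and treat only the singular values as trainable. This is not SODA itself (the paper additionally rotates the basis vectors via Kronecker-structured Stiefel optimization), but it shows why tuning the spectrum is parameter-efficient: here 8 parameters instead of 64.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))        # stand-in for a pretrained weight

# Decompose once; freeze the bases U, Vt and expose only the spectrum
# s as trainable parameters.
U, s, Vt = np.linalg.svd(W)
s_adapted = s * 1.1                     # e.g. the result of a few updates
W_adapted = U @ np.diag(s_adapted) @ Vt

num_full_params = W.size                # 64
num_spectral_params = s.size            # 8
```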

[CV-7] Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

链接: https://arxiv.org/abs/2405.21048
作者: Jiatao Gu,Ying Shen,Shuangfei Zhai,Yizhe Zhang,Navdeep Jaitly,Joshua M. Susskind
关键词: generating high-quality images, powerful tool, tool for generating, generating high-quality, Diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 14 figures

点击查看摘要

Abstract:Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion models, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the generated image samples from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.

[CV-8] You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

链接: https://arxiv.org/abs/2405.21022
作者: Zhen Qin,Yuxin Mao,Xuyang Shen,Dong Li,Jing Zhang,Yuchao Dai,Yiran Zhong
关键词: linear computational complexity, enhanced speed, gained prominence, prominence in causal, computational complexity
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report. Yiran Zhong is the corresponding author. The code is available at this https URL

点击查看摘要

Abstract:Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed. However, the inherent decay mechanism in linear attention presents challenges when applied to multi-dimensional sequence modeling tasks, such as image processing and multi-modal learning. In these scenarios, the utilization of sequential scanning to establish a global receptive field necessitates multiple scans for multi-dimensional data, thereby leading to inefficiencies. This paper identifies the inefficiency caused by a multiplicative linear recurrence and proposes an efficient alternative additive linear recurrence to avoid the issue, as it can handle multi-dimensional data within a single scan. We further develop an efficient multi-dimensional sequential modeling framework called LightNet based on the new recurrence. Moreover, we present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE to enhance the model’s ability to discern positional information in multi-dimensional scenarios. Our empirical evaluations across various tasks, including image classification, image generation, bidirectional language modeling, and autoregressive language modeling, demonstrate the efficacy of LightNet, showcasing its potential as a versatile and efficient solution for multi-dimensional sequential modeling.
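The core distinction the abstract draws can be demonstrated in a few lines of numpy (an illustrative toy, not LightNet's actual recurrence): a multiplicative decay recurrence depends on the order in which a 2-D grid is flattened, while a purely additive aggregate is commutative, so a single scan in any order yields the same global state.

```python
import numpy as np

def multiplicative_scan(values, decay=0.9):
    # h_t = decay * h_{t-1} + v_t: the final state depends on scan order.
    h = 0.0
    for v in values:
        h = decay * h + v
    return h

def additive_aggregate(values):
    # h = sum_t v_t: commutative, so any flattening order of a 2-D grid
    # produces the same global state in one pass.
    return float(np.sum(values))

grid = np.arange(6, dtype=float).reshape(2, 3)
row_major_add = additive_aggregate(grid.ravel(order="C"))
col_major_add = additive_aggregate(grid.ravel(order="F"))
row_major_mul = multiplicative_scan(grid.ravel(order="C"))
col_major_mul = multiplicative_scan(grid.ravel(order="F"))
```

This order-independence is what lets an additive recurrence cover multi-dimensional data in a single scan instead of one scan per axis ordering.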

[CV-9] MpoxSLDNet: A Novel CNN Model for Detecting Monkeypox Lesions and Performance Comparison with Pre-trained Models

链接: https://arxiv.org/abs/2405.21016
作者: Fatema Jannat Dihan,Saydul Akbar Murad,Abu Jafar Md Muzahid,K. M. Aslam Uddin,Mohammed J.F. Alenazi,Anupam Kumar Bairagi,Sujit Biswas
关键词: West Africa, Central and West, parts of Central, Monkeypox Skin Lesion, monkeypox lesions
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monkeypox virus (MPXV) is a zoonotic virus that poses a significant threat to public health, particularly in remote parts of Central and West Africa. Early detection of monkeypox lesions is crucial for effective treatment. However, due to its similarity with other skin diseases, monkeypox lesion detection is a challenging task. To detect monkeypox, many researchers used various deep-learning models such as MobileNetv2, VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB3, MobileNetV2, and Xception. However, these models often require high storage space due to their large size. This study aims to improve the existing challenges by introducing a CNN model named MpoxSLDNet (Monkeypox Skin Lesion Detector Network) to facilitate early detection and categorization of Monkeypox lesions and Non-Monkeypox lesions in digital images. Our model represents a significant advancement in the field of monkeypox lesion detection by offering superior performance metrics, including precision, recall, F1-score, accuracy, and AUC, compared to traditional pre-trained models such as VGG16, ResNet50, and DenseNet121. The key novelty of our approach lies in MpoxSLDNet’s ability to achieve high detection accuracy while requiring significantly less storage space than existing models. By addressing the challenge of high storage requirements, MpoxSLDNet presents a practical solution for early detection and categorization of monkeypox lesions in resource-constrained healthcare settings. In this study, we have used “Monkeypox Skin Lesion Dataset” comprising 1428 skin images of monkeypox lesions and 1764 skin images of Non-Monkeypox lesions. Dataset’s limitations could potentially impact the model’s ability to generalize to unseen cases. However, the MpoxSLDNet model achieved a validation accuracy of 94.56%, compared to 86.25%, 84.38%, and 67.19% for VGG16, DenseNet121, and ResNet50, respectively.

[CV-10] StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception Comprehension and Beyond

链接: https://arxiv.org/abs/2405.21013
作者: Pengyuan Lyu,Yulin Li,Hao Zhou,Weihong Ma,Xingyu Wan,Qunyi Xie,Liang Wu,Chengquan Zhang,Kun Yao,Errui Ding,Jingdong Wang
关键词: Text-rich images, Text-rich, deeply integrated, human life, images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Language Models. We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images. The significant design of StrucTexTv3 is presented in the following aspects: Firstly, we adopt a combination of an effective multi-scale reduced visual transformer and a multi-granularity token sampler (MG-Sampler) as a visual token generator, successfully solving the challenges of high-resolution input and complex representation learning for text-rich images. Secondly, we enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning, seamlessly integrating various text-oriented tasks into a unified framework. Thirdly, we have curated a comprehensive collection of high-quality text-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like incidental scenes, office documents, web pages, and screenshots, thereby improving the robustness of our model. Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks. Among multimodal models with LLM decoder of approximately 1.8B parameters, it stands out as a leader, which also makes the deployment of edge devices feasible. In summary, the StrucTexTv3 model, featuring efficient structural design, outstanding performance, and broad adaptability, offers robust support for diverse intelligent application tasks involving text-rich images, thus exhibiting immense potential for widespread application.

[CV-11] Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

链接: https://arxiv.org/abs/2405.20991
作者: Yi Yang,Qingwen Zhang,Kei Ikemura,Nazre Batool,John Folkesson
关键词: extreme weather conditions, presents significant challenges, anomalous road users, Addressing hard cases, complex traffic interactions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: IEEE Intelligent Vehicles Symposium (IV) 2024

点击查看摘要

Abstract:Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, it is crucial to detect and manage these scenarios effectively for autonomous driving systems. However, the rarity and high-risk nature of these cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs) have shown remarkable zero-shot capabilities as being trained on extensive datasets. This work explores the potential of VLMs in detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard cases in traffic participant motion prediction on both agent and scenario levels. We introduce a feasible pipeline where VLMs, fed with sequential image frames with designed prompts, effectively identify challenging agents or scenarios, which are verified by existing prediction models. Moreover, by taking advantage of this detection of hard cases by VLMs, we further improve the training efficiency of the existing motion prediction pipeline by performing data selection for the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline incorporating VLMs with state-of-the-art methods on NuScenes datasets. The code is accessible at this https URL.

[CV-12] Early Stopping Criteria for Training Generative Adversarial Networks in Biomedical Imaging

链接: https://arxiv.org/abs/2405.20987
作者: Muhammad Muneeb Saad,Mubashir Husain Rehmani,Ruairi O’Reilly
关键词: Generative Adversarial Networks, Generative Adversarial, Adversarial Networks, computational cost, training
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: This paper is accepted at the 35th IEEE Irish Signals and Systems Conference (ISSC 2024)

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have high computational costs to train their complex architectures. Throughout the training process, GANs’ output is analyzed qualitatively based on the loss and synthetic images’ diversity and quality. Based on this qualitative analysis, training is manually halted once the desired synthetic images are generated. By utilizing an early stopping criterion, the computational cost and dependence on manual oversight can be reduced, yet such a criterion is impacted by training problems such as mode collapse, non-convergence, and instability. This is particularly prevalent in biomedical imagery, where training problems degrade the diversity and quality of synthetic images, and the high computational cost associated with training makes complex architectures increasingly inaccessible. This work proposes a novel early stopping criterion to quantitatively detect training problems, halt training, and reduce the computational costs associated with synthesizing biomedical images. Firstly, the range of generator and discriminator loss values is investigated to assess whether mode collapse, non-convergence, and instability occur sequentially, concurrently, or interchangeably throughout the training of GANs. Secondly, utilizing these occurrences in conjunction with the Mean Structural Similarity Index (MS-SSIM) and Fréchet Inception Distance (FID) scores of synthetic images forms the basis of the proposed early stopping criteria. This work helps identify the occurrence of training problems in GANs using low-resource computational cost and reduces training time to generate diversified and high-quality synthetic images.
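As a hedged illustration of how a quantitative criterion can halt GAN training, the sketch below implements a generic patience rule over a history of FID scores (lower is better). The function name and thresholds are hypothetical, and the paper's actual criteria additionally incorporate loss-range behavior and MS-SSIM.

```python
def should_stop(fid_history, patience=5, min_delta=0.0):
    """Stop when FID (lower is better) has not improved on the best
    earlier value by at least min_delta for `patience` epochs."""
    if len(fid_history) <= patience:
        return False
    best_before = min(fid_history[:-patience])
    recent_best = min(fid_history[-patience:])
    return recent_best >= best_before - min_delta
```

Called once per epoch, this replaces the manual "watch the samples and stop when they look good" loop the abstract describes.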

[CV-13] Uncertainty Quantification for Birds Eye View Semantic Segmentation: Methods and Benchmarks

链接: https://arxiv.org/abs/2405.20986
作者: Linlin Yu,Bowen Yang,Tianhao Wang,Kangshuo Li,Feng Chen
关键词: Bird Eye View, Eye View, Bird Eye, create a Bird, representation is crucial
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The fusion of raw features from multiple sensors on an autonomous vehicle to create a Bird’s Eye View (BEV) representation is crucial for planning and control systems. There is growing interest in using deep learning models for BEV semantic segmentation. Anticipating segmentation errors and improving the explainability of DNNs is essential for autonomous driving, yet it is under-studied. This paper introduces a benchmark for predictive uncertainty quantification in BEV segmentation. The benchmark assesses various approaches across three popular datasets using two representative backbones and focuses on the effectiveness of predicted uncertainty in identifying misclassified and out-of-distribution (OOD) pixels, as well as calibration. Empirical findings highlight the challenges in uncertainty quantification. Our results find that evidential deep learning based approaches show the most promise by efficiently quantifying aleatoric and epistemic uncertainty. We propose the Uncertainty-Focal-Cross-Entropy (UFCE) loss, designed for highly imbalanced data, which consistently improves the segmentation quality and calibration. Additionally, we introduce a vacuity-scaled regularization term that enhances the model’s focus on high uncertainty pixels, improving epistemic uncertainty quantification.
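The proposed UFCE loss builds on focal weighting for highly imbalanced pixels. As background only (the paper's UFCE and its vacuity-scaled regularizer are not reproduced here), this is the standard binary focal loss in numpy: the modulating factor (1 - p_t)**gamma shrinks the contribution of easy, well-classified pixels.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    # Binary focal loss: (1 - p_t)**gamma down-weights easy pixels so
    # the gradient concentrates on hard, misclassified ones.
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(np.array([0.95]), np.array([1]))  # confident, correct
hard = focal_loss(np.array([0.05]), np.array([1]))  # confident, wrong
```

With gamma = 0 this reduces to ordinary cross-entropy, which is why focal variants are a drop-in change for imbalanced BEV segmentation.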

[CV-14] DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

链接: https://arxiv.org/abs/2405.20985
作者: Linli Yao,Lei Li,Shuhuai Ren,Lean Wang,Yuanxin Liu,Xu Sun,Lu Hou
关键词: facilitates cross-modal alignment, visual semantic abstraction, modalities and facilitates, facilitates cross-modal, crucial component
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, serves as a crucial component in MLLMs. However, measuring the effectiveness of projectors in vision-language alignment remains under-explored, which currently can only be inferred from the performance of MLLMs on downstream tasks. Motivated by the problem, this study examines the projector module by interpreting the vision-language semantic flow within MLLMs. Specifically, we trace back the semantic relevance flow from generated language tokens to raw visual encoder patches and the intermediate outputs produced by projectors. Our findings reveal that compressive projectors (e.g., QFormer), abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a ‘double abstraction’ phenomenon. This involves a first visual semantic abstraction by the projector referring to pre-defined query tokens, and a second extraction by the LLM based on text instructions. The double abstraction is inefficient in training and will result in cumulative vision semantics deficiency. To mitigate this issue, we propose the key insight of ‘Decouple Compression from Abstraction’ (DeCo), that is, compressing the visual token number at the patch level by projectors and allowing the LLM to handle visual semantic abstraction entirely. Consequently, we adopt a simple compressor, i.e., 2D Adaptive Pooling, to downsample visual patches in a parameter-free manner. Empirical evaluation demonstrates that DeCo surpasses traditional compressive projectors regarding both performance and efficiency. It achieves performance gains of 0.9%, 7.1%, and 2.9% across the MLLM Benchmarks, Visual Localization, and Open-ended VQA tasks with fewer trainable parameters and faster convergence speed.
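The compressor DeCo adopts is standard 2D adaptive average pooling. A parameter-free numpy version for divisible grid sizes (an illustrative re-implementation, not the authors' code) looks like this:

```python
import numpy as np

def adaptive_avg_pool2d(x, out_hw):
    # Downsample an (H, W, C) patch grid to (h, w, C) by block averaging,
    # in the spirit of torch.nn.AdaptiveAvgPool2d; this simplified sketch
    # assumes H and W are divisible by the target sizes.
    H, W, C = x.shape
    h, w = out_hw
    assert H % h == 0 and W % w == 0
    return x.reshape(h, H // h, w, W // w, C).mean(axis=(1, 3))

tokens = np.arange(16 * 16 * 4, dtype=float).reshape(16, 16, 4)
pooled = adaptive_avg_pool2d(tokens, (4, 4))  # 256 patches -> 16 tokens
```

Because the pooling has no learned weights, all semantic abstraction is deferred to the LLM, which is the paper's central design point.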

[CV-15] Generative Adversarial Networks in Ultrasound Imaging: Extending Field of View Beyond Conventional Limits

链接: https://arxiv.org/abs/2405.20981
作者: Matej Gazda,Samuel Kadoury,Jakub Gazda,Peter Drotar
关键词: Transthoracic Echocardiography, TTE ultrasound imaging, enabling detailed visualization, Generative Adversarial Networks, cardiovascular medicine
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transthoracic Echocardiography (TTE) is a fundamental, non-invasive diagnostic tool in cardiovascular medicine, enabling detailed visualization of cardiac structures crucial for diagnosing various heart conditions. Despite its widespread use, TTE ultrasound imaging faces inherent limitations, notably the trade-off between field of view (FoV) and resolution. This paper introduces a novel application of conditional Generative Adversarial Networks (cGANs), specifically designed to extend the FoV in TTE ultrasound imaging while maintaining high resolution. Our proposed cGAN architecture, termed echoGAN, demonstrates the capability to generate realistic anatomical structures through outpainting, effectively broadening the viewable area in medical imaging. This advancement has the potential to enhance both automatic and manual ultrasound navigation, offering a more comprehensive view that could significantly reduce the learning curve associated with ultrasound imaging and aid in more accurate diagnoses. The results confirm that echoGAN reliably reproduces detailed cardiac features, thereby promising a significant step forward in the field of non-invasive cardiac navigation and diagnostics.

[CV-16] Neural Gaussian Scale-Space Fields

链接: https://arxiv.org/abs/2405.20980
作者: Felix Mujkanovic,Ntumba Elie Nsampi,Christian Theobalt,Hans-Peter Seidel,Thomas Leimkühler
关键词: Gaussian scale spaces, Gaussian scale, anisotropic Gaussian scale, scale space, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: 15 pages; SIGGRAPH 2024; project page at this https URL

点击查看摘要

Abstract:Gaussian scale spaces are a cornerstone of signal representation and processing, with applications in filtering, multiscale analysis, anti-aliasing, and many more. However, obtaining such a scale space is costly and cumbersome, in particular for continuous representations such as neural fields. We present an efficient and lightweight method to learn the fully continuous, anisotropic Gaussian scale space of an arbitrary signal. Based on Fourier feature modulation and Lipschitz bounding, our approach is trained self-supervised, i.e., training does not require any manual filtering. Our neural Gaussian scale-space fields faithfully capture multiscale representations across a broad range of modalities, and support a diverse set of applications. These include images, geometry, light-stage data, texture anti-aliasing, and multiscale optimization.
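A defining property that any Gaussian scale space, neural or classical, must satisfy is the semigroup law: smoothing at sigma_1 followed by sigma_2 equals a single smoothing at sqrt(sigma_1**2 + sigma_2**2). The numpy sketch below checks this for sampled 1-D kernels (the kernel radii are illustrative choices, unrelated to the paper's method).

```python
import numpy as np

def gaussian_kernel(sigma, radius=20):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

# Semigroup law: blur at sigma1 then sigma2 == one blur at
# sqrt(sigma1**2 + sigma2**2).
k1, k2 = gaussian_kernel(1.5), gaussian_kernel(2.0)
composed = np.convolve(k1, k2)                        # radius 20 + 20 = 40
direct = gaussian_kernel(np.sqrt(1.5**2 + 2.0**2), radius=40)
```

Learning the whole continuum of such blurs in one field, rather than recomputing each one, is the costly step the paper's neural representation amortizes.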

[CV-17] Amortizing intractable inference in diffusion models for vision language and control

链接: https://arxiv.org/abs/2405.20971
作者: Siddarth Venkatraman,Moksh Jain,Luca Scimeca,Minsu Kim,Marcin Sendera,Mohsin Hasan,Luke Rowe,Sarthak Mittal,Pablo Lemos,Emmanuel Bengio,Alexandre Adam,Jarrid Rector-Brooks,Yoshua Bengio,Glen Berseth,Nikolay Malkin
关键词: effective distribution estimators, downstream tasks poses, mathbf, relative trajectory balance, emerged as effective
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data, x ∼ p^post(x) ∝ p(x) r(x), in a model that consists of a diffusion generative model prior p(x) and a black-box constraint or likelihood function r(x). We state and prove the asymptotic correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning.
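For intuition, the posterior p^post(x) ∝ p(x) r(x) can be approximated naively by self-normalised importance sampling from the prior — the kind of approximate baseline the paper's amortised sampler is designed to replace. A toy 1D sketch (not from the paper) with a Gaussian prior and a Gaussian constraint, chosen so that the posterior is N(1, 1/2) and its mean is analytically 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior p(x): standard normal.  Constraint r(x) ∝ exp(-(x-2)^2 / 2).
# The product posterior p(x) r(x) is then N(1, 1/2), with mean exactly 1.0.
samples = rng.standard_normal(200_000)
log_r = -0.5 * (samples - 2.0) ** 2

# Self-normalised importance weights: sample from the prior, reweight by r(x).
w = np.exp(log_r - log_r.max())
w /= w.sum()
posterior_mean = float(np.sum(w * samples))
```

Importance sampling degrades badly when r(x) concentrates far from the prior's mass — which is why amortising the posterior sampler, as the paper does, matters for real diffusion priors.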

[CV-18] Fast yet Safe: Early-Exiting with Risk Control

链接: https://arxiv.org/abs/2405.20915
作者: Metod Jazbec,Alexander Timans,Tin Hadži Veljković,Kaspar Sakmann,Dan Zhang,Christian A. Naesseth,Eric Nalisnick
关键词: Scaling machine learning, machine learning models, learning models significantly, models significantly improves, Scaling machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 25 pages, 11 figures, 4 tables (incl. appendix)

点击查看摘要

Abstract:Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it ‘safe’ for an EENN to go ‘fast’? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN’s exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.
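The core idea — tune the exit rule on held-out calibration data so that early exits only occur when quality is sufficient — can be sketched with a simple threshold search. All names and the threshold rule below are illustrative; the paper uses formal distribution-free risk-control procedures, not this exact heuristic.

```python
import numpy as np

def tune_exit_threshold(conf, early_correct, full_correct, max_risk=0.05):
    # Return the smallest confidence threshold whose calibration-set "risk"
    # (sample exits early AND the early head is wrong AND the full model
    # would have been right) is at most max_risk.
    for t in np.sort(np.unique(conf)):
        exited = conf >= t
        risk = np.mean(full_correct[exited] & ~early_correct[exited])
        if risk <= max_risk:
            return float(t)
    return np.inf  # never safe to exit early

rng = np.random.default_rng(1)
conf = rng.uniform(size=5000)                   # early-exit confidences
early_correct = rng.uniform(size=5000) < conf   # confident exits tend to be right
full_correct = rng.uniform(size=5000) < 0.9     # full model is ~90% accurate
t = tune_exit_threshold(conf, early_correct, full_correct)
exited = conf >= t
risk = float(np.mean(full_correct[exited] & ~early_correct[exited]))
```

The appeal of the risk-control framing is that the guarantee is post-hoc and distribution-free: no retraining of the EENN is required, only a calibration pass like the loop above.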

[CV-19] Enhancing Vision Models for Text-Heavy Content Understanding and Interaction

链接: https://arxiv.org/abs/2405.20906
作者: Adithya TG,Adithya SK,Abhinav R Bharadwaj,Abhiram HA,Dr. Surabhi Narayan
关键词: heavy visual content, traditional vision models, text heavy visual, major challenge, challenge for traditional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 5 pages, 4 figures (including 1 graph)

点击查看摘要

Abstract:Interacting with and understanding text-heavy visual content that spans multiple images is a major challenge for traditional vision models. This paper focuses on enhancing vision models’ capability to comprehend and learn from images containing large amounts of textual information, such as pages of textbooks and research papers that mix multiple figures (e.g., graphs) and tables with different types of axes and scales. The approach involves dataset preprocessing, fine-tuning on instruction-oriented data, and evaluation. We also built a visual chat application integrating CLIP for image encoding with a model from the Massive Text Embedding Benchmark, developed to consider both textual and visual inputs. An accuracy of 96.71% was obtained. The aim of the project is to advance vision models’ capabilities in understanding complex, interconnected visual and textual data, contributing to multimodal AI.

[CV-20] MALT: Multi-scale Action Learning Transformer for Online Action Detection

链接: https://arxiv.org/abs/2405.20892
作者: Zhipeng Yang,Ruoyu Wang,Yang Tan,Liping Xie
关键词: Online action detection, identify ongoing actions, Online action, aims to identify, video in real-time
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:Online action detection (OAD) aims to identify ongoing actions from streaming video in real-time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a lack of local information, necessitating the acquisition of action features across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder (used for feature fusion) with fewer parameters that can be trained more efficiently. A hierarchical encoder with multiple encoding branches is further proposed to capture multi-scale action features. The output from the preceding branch is then incrementally input to the subsequent branch as part of a cross-attention calculation. In this way, output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently, without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets (THUMOS’14 and TVSeries), outperforming all existing models used for comparison by 0.2% mAP on THUMOS’14 and 0.1% mcAP on TVSeries.

[CV-21] S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion

链接: https://arxiv.org/abs/2405.20881
作者: Haolong Ma,Hui Li,Chunyang Cheng,Gaoang Wang,Xiaoning Song,Xiaojun Wu
关键词: Infrared and Visible, Visible Image Fusion, Selective State Space, global spatial information, State Space Model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM in capturing the global spatial information of both modalities. This limitation prevents the simultaneous consideration of the global spatial information from both modalities during interaction, leading to a lack of comprehensive perception of salient targets. Consequently, the fusion results tend to bias towards one modality instead of adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously focus on global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.

[CV-22] Investigating Calibration and Corruption Robustness of Post-hoc Pruned Perception CNNs: An Image Classification Benchmark Study

链接: https://arxiv.org/abs/2405.20876
作者: Pallavi Mitra,Gesina Schwalbe,Nadja Klein
关键词: Convolutional Neural Networks, Convolutional Neural, Neural Networks, computer vision tasks, natural corruption robustness
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks. However, high computational and storage demands hinder their deployment into resource-constrained environments, such as embedded devices. Model pruning helps to meet these restrictions by reducing the model size, while maintaining superior performance. Meanwhile, safety-critical applications pose more than just resource and performance constraints. In particular, predictions must not be overly confident, i.e., provide properly calibrated uncertainty estimations (proper uncertainty calibration), and CNNs must be robust against corruptions like naturally occurring input perturbations (natural corruption robustness). This work investigates the important trade-off between uncertainty calibration, natural corruption robustness, and performance for current state-of-research post-hoc CNN pruning techniques in the context of image classification tasks. Our study reveals that post-hoc pruning substantially improves the model’s uncertainty calibration, performance, and natural corruption robustness, sparking hope for safe and robust embedded CNNs. Furthermore, uncertainty calibration and natural corruption robustness are not mutually exclusive targets under pruning, as evidenced by the improved safety aspects obtained by post-hoc unstructured pruning with increasing compression.
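Uncertainty calibration in studies like this one is commonly measured with the binned Expected Calibration Error (ECE) — the population-weighted gap between predicted confidence and empirical accuracy. The paper does not prescribe this exact implementation, but a standard version looks like:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    # Binned ECE: per-bin |confidence - accuracy| gap, weighted by bin population.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# A synthetic predictor whose accuracy matches its confidence is well calibrated,
# so its ECE should be close to zero.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=100_000)
correct = (rng.uniform(size=100_000) < conf).astype(float)
ece = expected_calibration_error(conf, correct)
```

An overconfident model (accuracy systematically below confidence) would drive this value up, which is the failure mode the paper checks pruned models against.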

[CV-23] Responsible AI for Earth Observation

链接: https://arxiv.org/abs/2405.20868
作者: Pedram Ghamisi,Weikang Yu,Andrea Marinoni,Caroline M. Gevaert,Claudio Persello,Sivasakthy Selvakumaran,Manuela Girotto,Benjamin P. Horton,Philippe Rufin,Patrick Hostert,Fabio Pacifici,Peter M. Atkinson
关键词: Earth observation, artificial intelligence, technologies has brought, unparalleled capabilities, convergence of artificial
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The convergence of artificial intelligence (AI) and Earth observation (EO) technologies has brought geoscience and remote sensing into an era of unparalleled capabilities. AI’s transformative impact on data analysis, particularly derived from EO platforms, holds great promise in addressing global challenges such as environmental monitoring, disaster response and climate change analysis. However, the rapid integration of AI necessitates a careful examination of the responsible dimensions inherent in its application within these domains. In this paper, we present a pioneering effort to systematically define the intersection of AI and EO, with a central focus on responsible AI practices. Specifically, we identify several critical components guiding this exploration from both academia and industry perspectives within the EO field: AI and EO for social good, mitigating unfair biases, AI security in EO, geo-privacy and privacy-preserving measures, as well as maintaining scientific excellence, open data, and guiding AI usage based on ethical principles. Furthermore, the paper explores potential opportunities and emerging trends, providing valuable insights for future research endeavors.

[CV-24] Automatic Channel Pruning for Multi-Head Attention

链接: https://arxiv.org/abs/2405.20867
作者: Eunho Lee,Youngbae Hwang
关键词: performance of Transformers, complexity presents challenges, quadratic computation complexity, computation complexity presents, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:

点击查看摘要

Abstract:Despite the strong performance of Transformers, their quadratic computation complexity presents challenges in applying them to vision tasks. Automatic pruning is one of the effective methods for reducing computation complexity without heuristic approaches. However, directly applying it to multi-head attention is not straightforward due to channel misalignment. In this paper, we propose an automatic channel pruning method that takes into account the multi-head attention mechanism. First, we incorporate channel similarity-based weights into the pruning indicator to preserve more informative channels in each head. Then, we adjust the pruning indicator to enforce removal of channels in equal proportions across all heads, preventing channel misalignment. We also add a reweight module to compensate for information loss resulting from channel removal, and an effective initialization step for the pruning indicator based on the difference of attention between the original structure and each channel. Our proposed method can be applied not only to original attention, but also to linear attention, which is more efficient, with linear complexity with respect to the number of tokens. On ImageNet-1K, applying our pruning method to the FLattenTransformer, which includes both attention mechanisms, achieves superior accuracy at several MAC budgets compared with previous state-of-the-art efficient models and pruning methods. Code will be available soon.
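The equal-proportion constraint across heads can be illustrated with a simple score-based mask. The paper learns its pruning indicator end-to-end; this sketch only shows why a per-head top-k selection keeps all heads aligned, which naive global pruning would not guarantee.

```python
import numpy as np

def prune_equal_per_head(importance, keep_ratio=0.5):
    # Keep the top `keep_ratio` channels *within each head*, so every head
    # loses the same number of channels and the heads stay aligned.
    n_heads, n_ch = importance.shape
    k = max(1, int(round(n_ch * keep_ratio)))
    keep = np.zeros_like(importance, dtype=bool)
    for h in range(n_heads):
        top = np.argsort(importance[h])[-k:]  # indices of the k largest scores
        keep[h, top] = True
    return keep

rng = np.random.default_rng(0)
scores = rng.uniform(size=(8, 64))  # 8 heads, 64 channels each (illustrative sizes)
mask = prune_equal_per_head(scores, keep_ratio=0.5)
```

A global top-k over all 512 channels could remove, say, 40 channels from one head and 24 from another, producing the channel misalignment the paper's adjusted indicator is designed to prevent.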

[CV-25] MeshXL: Neural Coordinate Field for Generative 3D Foundation Models

链接: https://arxiv.org/abs/2405.20853
作者: Sijin Chen,Xin Chen,Anqi Pang,Xianfang Zeng,Wei Cheng,Yijun Fu,Fukun Yin,Yanru Wang,Zhibin Wang,Chi Zhang,Jingyi Yu,Gang Yu,Bin Fu,Tao Chen
关键词: fast rendering speed, data exhibits great, exhibits great flexibility, data exhibits, great flexibility
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, and is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate that the Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple-yet-effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models, which addresses the process of 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various downstream applications.
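The "mesh as sequence" idea can be sketched by quantising vertex coordinates and flattening faces into one token stream, which an auto-regressive model can then predict token by token. Bin count and ordering below are illustrative choices, not the paper's; NeurCF additionally pairs such explicit coordinates with implicit neural embeddings, which this sketch omits.

```python
import numpy as np

def mesh_to_tokens(vertices, faces, n_bins=128):
    # Quantise coordinates into n_bins levels, then flatten face-by-face into
    # a single integer token stream suitable for auto-regressive modeling.
    lo, hi = vertices.min(), vertices.max()
    quant = np.round((vertices - lo) / (hi - lo) * (n_bins - 1)).astype(int)
    return quant[faces].reshape(-1)

# A single triangle: 3 vertices, 1 face.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
tokens = mesh_to_tokens(vertices, faces)
```

Each face contributes 3 vertices × 3 coordinates = 9 tokens, so sequence length grows linearly with face count — one reason pre-defined ordering and compact coordinate representations matter at scale.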

[CV-26] MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

链接: https://arxiv.org/abs/2405.20851
作者: Shurong Yang,Huadong Li,Juhao Wu,Minhao Jing,Linze Li,Renhe Ji,Jiajun Liang,Haoqiang Fan
关键词: portrait animation, raw driving videos, portrait animation driven, subject of research, intermediate representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks, they are seldom the subject of research in the field of portrait animation. This is due to two challenges inherent in portrait animation driven by raw videos: 1) significant identity leakage; 2) irrelevant background and facial details, such as wrinkles, that degrade performance. To harness the power of raw videos for vivid portrait animation, we propose a pioneering conditional diffusion model named MegActor. First, we introduce a synthetic data generation framework for creating videos with consistent motion and expressions but inconsistent IDs, to mitigate the issue of ID leakage. Second, we segment the foreground and background of the reference image and employ CLIP to encode the background details. This encoded information is then integrated into the network via a text embedding module, thereby ensuring the stability of the background. Finally, we further transfer the appearance style of the reference image to the driving video to eliminate the influence of facial details in the driving videos. Our final model was trained solely on public datasets, achieving results comparable to commercial models. We hope this will help the open-source community. The code is available at this https URL.

[CV-27] einspace: Searching for Neural Architectures from Fundamental Operations

链接: https://arxiv.org/abs/2405.20838
作者: Linus Ericsson,Miguel Espinosa,Chenhongyi Yang,Antreas Antoniou,Amos Storkey,Shay B. Cohen,Steven McDonagh,Elliot J. Crowley
关键词: Neural architecture search, high performing networks, Neural architecture, NAS, finds high performing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Project page at this https URL

点击查看摘要

Abstract:Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren’t diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.

[CV-28] Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning

链接: https://arxiv.org/abs/2405.20834
作者: Cheng Tan,Jingxuan Wei,Linzhuang Sun,Zhangyang Gao,Siyuan Li,Bihui Yu,Ruifeng Guo,Stan Z. Li
关键词: Large language models, external knowledge bases, burgeoning field aimed, leveraging external knowledge, Large language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review

点击查看摘要

Abstract:Large language models equipped with retrieval-augmented generation (RAG) represent a burgeoning field aimed at enhancing answering capabilities by leveraging external knowledge bases. Although the application of RAG with language-only models has been extensively explored, its adaptation into multimodal vision-language models remains nascent. Going beyond mere answer generation, the primary goal of multimodal RAG is to cultivate the models’ ability to reason in response to relevant queries. To this end, we introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning). The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs, which then serve as scaffolds for the multimodal reasoning process. This training-free approach not only encourages the model to engage deeply with the reasoning processes inherent in the retrieved content but also facilitates the generation of answers that are precise and richly interpretable. Surprisingly, utilizing solely the ScienceQA dataset, collected from elementary and high school science curricula, RMR significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets, including A-OKVQA, MMBench, and SEED. These outcomes highlight the substantial potential of our multimodal retrieval and reasoning mechanism to improve the reasoning capabilities of vision-language models.

[CV-29] Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch and Inductive Inference

链接: https://arxiv.org/abs/2405.20829
作者: Seongheon Park,Hyuk Kwon,Kwanghoon Sohn,Kibok Lee
关键词: extends conventional semi-supervised, Open-world semi-supervised learning, conventional semi-supervised learning, Open-world semi-supervised, open-world scenarios
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CVPR Workshop on Computer Vision in the Wild (CVinW), 2024

点击查看摘要

Abstract:Open-world semi-supervised learning (OWSSL) extends conventional semi-supervised learning to open-world scenarios by taking account of novel categories in unlabeled datasets. Despite the recent advancements in OWSSL, the success often relies on the assumptions that 1) labeled and unlabeled datasets share the same balanced class prior distribution, which does not generally hold in real-world applications, and 2) unlabeled training datasets are utilized for evaluation, where such transductive inference might not adequately address challenges in the wild. In this paper, we aim to generalize OWSSL by addressing them. Our work suggests that practical OWSSL may require different training settings, evaluation methods, and learning strategies compared to those prevalent in the existing literature.

[CV-30] Context-aware Difference Distilling for Multi-change Captioning

链接: https://arxiv.org/abs/2405.20810
作者: Yunbin Tu,Liang Li,Li Su,Zheng-Jun Zha,Chenggang Yan,Qingming Huang
关键词: Multi-change captioning aims, Multi-change captioning, natural language, context features, aims to describe
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACL 2024 main conference (long paper)

点击查看摘要

Abstract:Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language. Compared with single-change captioning, this task requires the model to have higher-level cognition ability to reason an arbitrary number of changes. In this paper, we propose a novel context-aware difference distilling (CARD) network to capture all genuine changes for yielding sentences. Given an image pair, CARD first decouples context features that aggregate all similar/dissimilar semantics, termed common/difference context features. Then, the consistency and independence constraints are designed to guarantee the alignment/discrepancy of common/difference context features. Further, the common context features guide the model to mine locally unchanged features, which are subtracted from the pair to distill locally difference features. Next, the difference context features augment the locally difference features to ensure that all changes are distilled. In this way, we obtain an omni-representation of all changes, which is translated into linguistic sentences by a transformer decoder. Extensive experiments on three public datasets show CARD performs favourably against state-of-the-art methods. The code is available at this https URL.

[CV-31] Ovis: Structural Embedding Alignment for Multimodal Large Language Model

链接: https://arxiv.org/abs/2405.20797
作者: Shiyin Lu,Yang Li,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Han-Jia Ye
关键词: Large Language Models, Current Multimodal Large, Multimodal Large Language, Large Language, pre-trained LLM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs – the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder – poses challenges to a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks demonstrate that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis’ structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Both the source code and the training dataset of Ovis will be made publicly available.
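The probabilistic table lookup described above can be sketched as a softmax over a visual vocabulary followed by a mixture of table rows, mirroring how a textual embedding is a (one-hot) lookup into a table. All shapes and names below are hypothetical, chosen only for illustration, not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: 196 image patches, a 1024-entry visual embedding
# table, 768-dim embeddings.
rng = np.random.default_rng(0)
patch_logits = rng.standard_normal((196, 1024))  # vision-encoder score per table entry
embed_table = rng.standard_normal((1024, 768))   # learnable visual embedding table

probs = softmax(patch_logits)        # each patch "indexes" the whole table softly
visual_embeds = probs @ embed_table  # probabilistic combination of table rows
```

With a one-hot `probs` row this reduces exactly to a textual-style table lookup, which is the structural symmetry the architecture aims for.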

[CV-32] InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

链接: https://arxiv.org/abs/2405.20795
作者: Huaxiang Zhang,Yaojia Mu,Guo-Niu Zhu,Zhongxue Gan
关键词: advancing autonomous systems, Accurate visual understanding, Accurate visual, intelligent robots, imperative for advancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely recognizing obscured or ambiguously presented visual elements remains challenging. To tackle such issues, this paper proposes InsightSee, a multi-agent framework to enhance VLMs’ interpretative capabilities in handling complex visual understanding scenarios. The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation. The design of these agents and the mechanisms by which they can be enhanced in visual information processing are presented. Experimental results demonstrate that the InsightSee framework not only boosts performance on specific visual tasks but also retains the original models’ strength. The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.

[CV-33] GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis

链接: https://arxiv.org/abs/2405.20791
作者: Yumeng He,Yunbo Wang,Xiaokang Yang
关键词: Decoupling the illumination, Decoupling, Gaussian points, Abstract, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoupling the illumination in 3D scenes is crucial for novel view synthesis and relighting. In this paper, we propose a novel method for representing a scene illuminated by a point light using a set of relightable 3D Gaussian points. Inspired by the Blinn-Phong model, our approach decomposes the scene into ambient, diffuse, and specular components, enabling the synthesis of realistic lighting effects. To facilitate the decomposition of geometric information independent of lighting conditions, we introduce a novel bilevel optimization-based meta-learning framework. The fundamental idea is to view the rendering tasks under various lighting positions as a multi-task learning problem, which our meta-learning approach effectively addresses by generalizing the learned Gaussian geometries not only across different viewpoints but also across diverse light positions. Experimental results demonstrate the effectiveness of our approach in terms of training efficiency and rendering quality compared to existing methods for free-viewpoint relighting.
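The Blinn-Phong decomposition the method builds on computes reflected intensity as ambient + diffuse + specular terms from the surface normal, light direction, and view direction. A minimal scalar version (coefficients are illustrative defaults; the paper attaches such components to relightable 3D Gaussian points rather than a single surface):

```python
import numpy as np

def blinn_phong(normal, light_dir, view_dir, ka=0.1, kd=0.7, ks=0.2, shininess=32):
    # Classical Blinn-Phong: intensity = ambient + diffuse + specular.
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    h = (l + v) / np.linalg.norm(l + v)            # half vector
    diffuse = kd * max(np.dot(n, l), 0.0)          # Lambertian term
    specular = ks * max(np.dot(n, h), 0.0) ** shininess
    return ka + diffuse + specular

# Head-on light and view: diffuse and specular are both maximal,
# so intensity is ka + kd + ks.
i = blinn_phong(np.array([0.0, 0.0, 1.0]),
                np.array([0.0, 0.0, 1.0]),
                np.array([0.0, 0.0, 1.0]))
```

Because each term depends differently on light position, rendering under many light positions constrains the decomposition — the multi-task structure the paper's meta-learning framework exploits.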

[CV-34] Stratified Avatar Generation from Sparse Observations

链接: https://arxiv.org/abs/2405.20786
作者: Han Feng,Wenchao Ma,Quankai Gao,Xianwei Zheng,Nan Xue,Huijuan Xu
关键词: creating immersive experiences, Head Mounted Devices, essential for creating, creating immersive, immersive experiences
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted by CVPR 2024 (Oral)

点击查看摘要

Abstract:Estimating 3D full-body avatars from AR/VR devices is essential for creating immersive experiences in AR/VR applications. This task is challenging due to the limited input from Head Mounted Devices, which capture only sparse observations from the head and hands. Predicting the full-body avatars, particularly the lower body, from these sparse observations presents significant difficulties. In this paper, we are inspired by the inherent property of the kinematic tree defined in the Skinned Multi-Person Linear (SMPL) model, where the upper body and lower body share only one common ancestor node, bringing the potential of decoupled reconstruction. We propose a stratified approach to decouple the conventional full-body avatar reconstruction pipeline into two stages, with the reconstruction of the upper body first and a subsequent reconstruction of the lower body conditioned on the previous stage. To implement this straightforward idea, we leverage the latent diffusion model as a powerful probabilistic generator, and train it to follow the latent distribution of decoupled motions explored by a VQ-VAE encoder-decoder model. Extensive experiments on AMASS mocap dataset demonstrate our state-of-the-art performance in the reconstruction of full-body motions.

[CV-35] Towards Black-Box Membership Inference Attack for Diffusion Models

链接: https://arxiv.org/abs/2405.20771
作者: Jingwei Li,Jing Dong,Tianxing He,Jingzhao Zhang
关键词: important research topic, research topic, train a diffusion, important research, rising popularity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying whether an artwork was used to train a diffusion model is an important research topic, given the rising popularity of AI-generated art and the associated copyright concerns. This work approaches the problem from the membership inference attack (MIA) perspective. We first identify the limitations of applying existing MIA methods to copyright protection: the required access to internal U-nets and the choice of non-member datasets for evaluation. To address these problems, we introduce a novel black-box membership inference attack method that operates without needing access to the model’s internal U-net. We then construct a DALL-E generated dataset for a more comprehensive evaluation. We validate our method across various setups, and our experimental results outperform previous works.

[CV-36] CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model

链接: https://arxiv.org/abs/2405.20764
作者: Zhiming Meng,Hui Li,Zeyang Zhang,Zhongwei Shen,Yunlong Yu,Xiaoning Song,Xiaojun Wu
关键词: generative models based, Generative models, current generative models, widely utilized, consistency model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generative models are widely utilized to model the distribution of fused images in the field of infrared and visible image fusion. However, current generative models based fusion methods often suffer from unstable training and slow inference speed. To tackle this problem, a novel fusion method based on consistency model is proposed, termed as CoMoFusion, which can generate the high-quality images and achieve fast image inference speed. In specific, the consistency model is used to construct multi-modal joint features in the latent space with the forward and reverse process. Then, the infrared and visible features extracted by the trained consistency model are fed into fusion module to generate the final fused image. In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed. Extensive experiments on public datasets illustrate that our method obtains the SOTA fusion performance compared with the existing fusion methods.

[CV-37] Information Theoretic Text-to-Image Alignment

链接: https://arxiv.org/abs/2405.20759
作者: Chao Wang,Giulio Franzese,Alessandro Finamore,Massimo Gallo,Pietro Michiardi
关键词: tremendous success recently, Diffusion models, success recently, tremendous success, Diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models for Text-to-Image (T2I) conditional generation have seen tremendous success recently. Despite their success, accurately capturing user intentions with these models still requires a laborious trial and error process. This challenge is commonly identified as a model alignment problem, an issue that has attracted considerable attention by the research community. Instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models to steer image generation, in this work we present a novel method that relies on an information-theoretic alignment measure. In a nutshell, our method uses self-supervised fine-tuning and relies on point-wise mutual information between prompts and images to define a synthetic training set to induce model alignment. Our comparative analysis shows that our method is on-par or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI and a lightweight fine-tuning strategy.
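The point-wise mutual information (PMI) the abstract builds on can be illustrated on a small discrete co-occurrence table (the paper estimates MI between prompts and images neurally; this is only a toy sketch of the underlying quantity, with made-up counts):

```python
import numpy as np

def pointwise_mutual_information(joint_counts):
    """PMI matrix from a table of joint occurrence counts:
    pmi[i, j] = log( p(x_i, y_j) / (p(x_i) * p(y_j)) )."""
    joint = joint_counts / joint_counts.sum()      # joint distribution p(x, y)
    px = joint.sum(axis=1, keepdims=True)          # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)          # marginal p(y)
    with np.errstate(divide="ignore"):             # log(0) -> -inf for never-seen pairs
        return np.log(joint / (px * py))

# Perfectly aligned pairs: PMI on the diagonal is log(2) > 0,
# never-co-occurring pairs get -inf.
counts = np.array([[5.0, 0.0], [0.0, 5.0]])
pmi = pointwise_mutual_information(counts)
```

High PMI between a prompt and an image indicates they co-occur more than chance, which is the signal the paper's alignment measure rewards.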

[CV-38] Diffusion Models Are Innate One-Step Generators

链接: https://arxiv.org/abs/2405.20750
作者: Bowen Zheng,Tianming Yang
关键词: Diffusion Models, achieved great success, FID, model, student model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 4 figures and 4 tables on the main contents

点击查看摘要

Abstract:Diffusion Models (DMs) have achieved great success in image generation and other fields. By fine sampling through the trajectory defined by the SDE/ODE solver based on a well-trained score model, DMs can generate remarkable high-quality results. However, this precise sampling often requires multiple steps and is computationally demanding. To address this problem, instance-based distillation methods have been proposed to distill a one-step generator from a DM by having a simpler student model mimic a more complex teacher model. Yet, our research reveals an inherent limitations in these methods: the teacher model, with more steps and more parameters, occupies different local minima compared to the student model, leading to suboptimal performance when the student model attempts to replicate the teacher. To avoid this problem, we introduce a novel distributional distillation method, which uses an exclusive distributional loss. This method exceeds state-of-the-art (SOTA) results while requiring significantly fewer training images. Additionally, we show that DMs’ layers are activated differently at different time steps, leading to an inherent capability to generate images in a single step. Freezing most of the convolutional layers in a DM during distributional distillation leads to further performance improvements. Our method achieves the SOTA results on CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency. Most of those results are obtained with only 5 million training images within 6 hours on 8 A100 GPUs. This breakthrough not only enhances the understanding of efficient image generation models but also offers a scalable framework for advancing the state of the art in various applications.

[CV-39] Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes

链接: https://arxiv.org/abs/2405.20743
作者: Riccardo Benaglia,Angelo Porrello,Pietro Buzzega,Simone Calderara,Rita Cucchiara
关键词: video surveillance analytics, basketball players engaged, Trajectory forecasting, Quantized Variational Autoencoders, surveillance analytics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.

[CV-40] Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

链接: https://arxiv.org/abs/2405.20735
作者: Mansi Kakkar,Dattesh Shanbhag,Chandan Aladahalli,Gurunath Reddy M
关键词: previously challenging multi-modal, challenging multi-modal classification, multi-modal classification problem, Vision-language models, medical domain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ©© 2024 IEEE. Accepted in 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2024

点击查看摘要

Abstract:Vision-language models have emerged as a powerful tool for previously challenging multi-modal classification problem in the medical domain. This development has led to the exploration of automated image description generation for multi-modal clinical scans, particularly for radiology report generation. Existing research has focused on clinical descriptions for specific modalities or body regions, leaving a gap for a model providing entire-body multi-modal descriptions. In this paper, we address this gap by automating the generation of standardized body station(s) and list of organ(s) across the whole body in multi-modal MR and CT radiological images. Leveraging the versatility of the Contrastive Language-Image Pre-training (CLIP), we refine and augment the existing approach through multiple experiments, including baseline model fine-tuning, adding station(s) as a superset for better correlation between organs, along with image and language augmentations. Our proposed approach demonstrates 47.6% performance improvement over baseline PubMedCLIP.

[CV-41] Extreme Point Supervised Instance Segmentation

链接: https://arxiv.org/abs/2405.20729
作者: Hyeonjun Lee,Sehyun Hwang,Suha Kwak
关键词: paper introduces, points, extreme points, rightmost points, learning instance segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024 Accepted

点击查看摘要

Abstract:This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allows to improve performance at the same annotation cost with box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.
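Extracting the four extreme points the method supervises with is straightforward given a binary mask; a minimal sketch (my own helper, not the authors' code):

```python
import numpy as np

def extreme_points(mask):
    """Topmost, leftmost, bottommost, rightmost (row, col) points of a binary mask."""
    rows, cols = np.nonzero(mask)
    top = (rows.min(), cols[rows.argmin()])
    bottom = (rows.max(), cols[rows.argmax()])
    left = (rows[cols.argmin()], cols.min())
    right = (rows[cols.argmax()], cols.max())
    return top, left, bottom, right

mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 2] = True   # vertical bar
mask[2, 1:5] = True   # horizontal bar crossing it
top, left, bottom, right = extreme_points(mask)
```

Note these four points also determine the tight bounding box, which is why the paper can reuse the standard box-annotation workflow at no extra cost.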

[CV-42] GI-NAS: Boosting Gradient Inversion Attacks through Adaptive Neural Architecture Search

链接: https://arxiv.org/abs/2405.20725
作者: Wenbo Yu,Hao Fang,Bin Chen,Xiaohang Sui,Chuan Chen,Hao Wu,Shu-Tao Xia,Ke Xu
关键词: Federated Learning, considerable privacy concerns, raised considerable privacy, Gradient Inversion, Inversion Attacks invert
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gradient Inversion Attacks invert the transmitted gradients in Federated Learning (FL) systems to reconstruct the sensitive data of local clients and have raised considerable privacy concerns. A majority of gradient inversion methods rely heavily on explicit prior knowledge (e.g., a well pre-trained generative model), which is often unavailable in realistic scenarios. To alleviate this issue, researchers have proposed to leverage the implicit prior knowledge of an over-parameterized network. However, they only utilize a fixed neural architecture for all the attack settings. This would hinder the adaptive use of implicit architectural priors and consequently limit the generalizability. In this paper, we further exploit such implicit prior knowledge by proposing Gradient Inversion via Neural Architecture Search (GI-NAS), which adaptively searches the network and captures the implicit priors behind neural architectures. Extensive experiments verify that our proposed GI-NAS can achieve superior attack performance compared to state-of-the-art gradient inversion methods, even under more practical settings with high-resolution images, large-sized batches, and advanced defense strategies.

[CV-43] ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model

链接: https://arxiv.org/abs/2405.20721
作者: Yufei Wang,Zhihao Li,Lanqing Guo,Wenhan Yang,Alex C. Kot,Bihan Wen
关键词: Gaussian Splatting, offering fast rendering, fast rendering speeds, neural Gaussians, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for novel view synthesis, offering fast rendering speeds and high fidelity. However, the large number of Gaussians and their associated attributes require effective compression techniques. Existing methods primarily compress neural Gaussians individually and independently, i.e., coding all the neural Gaussians at the same time, with little design for their interactions and spatial dependence. Inspired by the effectiveness of the context model in image compression, we propose the first autoregressive model at the anchor level for 3DGS compression in this work. We divide anchors into different levels and the anchors that are not coded yet can be predicted based on the already coded ones in all the coarser levels, leading to more accurate modeling and higher coding efficiency. To further improve the efficiency of entropy coding, e.g., to code the coarsest level with no already coded anchors, we propose to introduce a low-dimensional quantized feature as the hyperprior for each anchor, which can be effectively compressed. Our work pioneers the context model in the anchor level for 3DGS representation, yielding an impressive size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work Scaffold-GS, while achieving comparable or even higher rendering quality.

[CV-44] Power of Cooperative Supervision: Multiple Teachers Framework for Enhanced 3D Semi-Supervised Object Detection

链接: https://arxiv.org/abs/2405.20720
作者: Jin-Hee Lee,Jae-Keun Lee,Je-Seok Kim,Soon Kwon
关键词: safe urban driving, ensure safe urban, develop high-performance object, diverse urban environments, high-performance object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review

点击查看摘要

Abstract:To ensure safe urban driving for autonomous platforms, it is crucial not only to develop high-performance object detection techniques but also to establish a diverse and representative dataset that captures various urban environments and object characteristics. To address these two issues, we have constructed a multi-class 3D LiDAR dataset reflecting diverse urban environments and object characteristics, and developed a robust 3D semi-supervised object detection (SSOD) based on a multiple teachers framework. This SSOD framework categorizes similar classes and assigns specialized teachers to each category. Through collaborative supervision among these category-specialized teachers, the student network becomes increasingly proficient, leading to a highly effective object detector. We propose a simple yet effective augmentation technique, Pie-based Point Compensating Augmentation (PieAug), to enable the teacher network to generate high-quality pseudo-labels. Extensive experiments on the WOD, KITTI, and our datasets validate the effectiveness of our proposed method and the quality of our dataset. Experimental results demonstrate that our approach consistently outperforms existing state-of-the-art 3D semi-supervised object detection methods across all datasets. We plan to release our multi-class LiDAR dataset and the source code available on our Github repository in the near future.

[CV-45] Climate Variable Downscaling with Conditional Normalizing Flows

链接: https://arxiv.org/abs/2405.20719
作者: Christina Winkler,Paula Harder,David Rolnick
关键词: models typically operate, coarse spatial scales, spatial scales due, large computational costs, Predictions of global
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Predictions of global climate models typically operate on coarse spatial scales due to the large computational costs of climate simulations. This has led to a considerable interest in methods for statistical downscaling, a similar process to super-resolution in the computer vision context, to provide more local and regional climate information. In this work, we apply conditional normalizing flows to the task of climate variable downscaling. We showcase its successful performance on an ERA5 water content dataset for different upsampling factors. Additionally, we show that the method allows us to assess the predictive uncertainty in terms of standard deviation from the fitted conditional distribution mean.
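The key property of the normalizing flows used here is exact invertibility under conditioning (on the coarse-scale input). A toy affine coupling step makes this concrete; the 2-dim shapes and the single linear conditioner are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def coupling_forward(x, cond, w):
    """One affine coupling step of a conditional flow (toy, 2-dim input).
    Half of x passes through unchanged; the other half is scaled and
    shifted by a conditioner (here a linear map on [x1, cond])."""
    x1, x2 = x[..., :1], x[..., 1:]
    h = np.concatenate([x1, cond], axis=-1) @ w
    s, t = np.tanh(h[..., :1]), h[..., 1:]        # log-scale and shift
    return np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)

def coupling_inverse(y, cond, w):
    """Exact inverse: recompute s, t from the untouched half and undo them."""
    y1, y2 = y[..., :1], y[..., 1:]
    h = np.concatenate([y1, cond], axis=-1) @ w
    s, t = np.tanh(h[..., :1]), h[..., 1:]
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 2))      # toy conditioner weights
x = rng.normal(size=(4, 2))      # fine-scale variables
cond = rng.normal(size=(4, 1))   # coarse-scale conditioning, e.g. low-res value
y = coupling_forward(x, cond, w)
x_rec = coupling_inverse(y, cond, w)
```

Because the mapping is invertible with a tractable Jacobian, the flow yields a full conditional distribution over the fine scale, which is what lets the paper report standard deviations as predictive uncertainty.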

[CV-46] Cyclic image generation using chaotic dynamics

链接: https://arxiv.org/abs/2405.20717
作者: Takaya Tanaka,Yutaka Yamaguti
关键词: cyclic transformations, transformations is demonstrated, demonstrated by extending, extending the CycleGAN, generated image sequences
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Successive image generation using cyclic transformations is demonstrated by extending the CycleGAN model to transform images among three different categories. Repeated application of the trained generators produces sequences of images that transition among the different categories. The generated image sequences occupy a more limited region of the image space compared with the original training dataset. Quantitative evaluation using precision and recall metrics indicates that the generated images have high quality but reduced diversity relative to the training dataset. Such successive generation processes are characterized as chaotic dynamics in terms of dynamical system theory. Positive Lyapunov exponents estimated from the generated trajectories confirm the presence of chaotic dynamics, with the Lyapunov dimension of the attractor found to be comparable to the intrinsic dimension of the training data manifold. The results suggest that chaotic dynamics in the image space defined by the deep generative model contribute to the diversity of the generated images, constituting a novel approach for multi-class image generation. This model can be interpreted as an extension of classical associative memory to perform hetero-association among image categories.
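The Lyapunov-exponent estimation the abstract mentions can be illustrated on a classical 1-D system rather than the high-dimensional image space (a toy stand-in, not the paper's procedure): the exponent is the time average of the log of the local stretching factor along an orbit.

```python
import math

def lyapunov_logistic(r, x0, n_steps=20000, n_burn=500):
    """Largest Lyapunov exponent of the logistic map x -> r x (1 - x),
    estimated as the orbit average of log |f'(x_t)| with f'(x) = r (1 - 2x)."""
    x = x0
    for _ in range(n_burn):                 # discard the transient
        x = r * x * (1.0 - x)
    acc = 0.0
    for _ in range(n_steps):
        # guard against a near-zero derivative when x passes very close to 0.5
        acc += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-12))
        x = r * x * (1.0 - x)
    return acc / n_steps

lam = lyapunov_logistic(4.0, 0.2)   # analytic value at r = 4 is ln 2 ≈ 0.693
```

A positive average, as found here, is the same diagnostic the paper uses on generated image trajectories: nearby starting points diverge exponentially, the signature of chaos.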

[CV-47] Revisiting Mutual Information Maximization for Generalized Category Discovery

链接: https://arxiv.org/abs/2405.20711
作者: Zhaorui Tan,Chengrui Zhang,Xi Yang,Jie Sun,Kaizhu Huang
关键词: Generalized category discovery, category discovery presents, model generalization ability, Generalized category, category discovery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint version

点击查看摘要

Abstract:Generalized category discovery presents a challenge in a realistic scenario, which requires the model’s generalization ability to recognize unlabeled samples from known and unknown categories. This paper revisits the challenge of generalized category discovery through the lens of information maximization (InfoMax) with a probabilistic parametric classifier. Our findings reveal that ensuring independence between known and unknown classes while concurrently assuming a uniform probability distribution across all classes, yields an enlarged margin among known and unknown classes that promotes the model’s performance. To achieve the aforementioned independence, we propose a novel InfoMax-based method, Regularized Parametric InfoMax (RPIM), which adopts pseudo labels to supervise unlabeled samples during InfoMax, while proposing a regularization to ensure the quality of the pseudo labels. Additionally, we introduce novel semantic-bias transformation to refine the features from the pre-trained model instead of direct fine-tuning to rescue the computational costs. Extensive experiments on six benchmark datasets validate the effectiveness of our method. RPIM significantly improves the performance regarding unknown classes, surpassing the state-of-the-art method by an average margin of 3.5%.
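The InfoMax objective underlying RPIM decomposes into two entropy terms on the classifier's output probabilities; a small sketch of that objective on toy predictions (illustrative only, not the paper's full loss with pseudo-label regularization):

```python
import numpy as np

def entropy(p, axis=-1):
    return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=axis)

def infomax(probs):
    """Mutual information I(x; y) estimate used by InfoMax-style objectives:
    marginal entropy (rewards a uniform class distribution over the batch)
    minus mean conditional entropy (rewards confident per-sample predictions)."""
    marginal = probs.mean(axis=0)
    return entropy(marginal) - entropy(probs, axis=-1).mean()

confident = np.array([[1.0, 0.0], [0.0, 1.0]])   # confident and class-balanced
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])   # confident but all one class
```

The balanced, confident batch scores log(2) while the collapsed batch scores 0, showing how the uniform-marginal assumption enlarges the margin between classes instead of letting unknown classes collapse into known ones.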

[CV-48] Conditioning GAN Without Training Dataset

链接: https://arxiv.org/abs/2405.20687
作者: Kidist Amde Mekonnen
关键词: Toggle, Deep learning algorithms, training dataset, Deep learning, Training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 5 pages, 2 figures, Part of my MSc project course, School Project Course 2022

点击查看摘要

Abstract:Deep learning algorithms have a large number of trainable parameters often with sizes of hundreds of thousands or more. Training this algorithm requires a large amount of training data and generating a sufficiently large dataset for these algorithms is costly \cite{noguchi2019image}. GANs are generative neural networks that use two deep learning networks that are competing with each other. The networks are generator and discriminator networks. The generator tries to generate realistic images which resemble the actual training dataset by approximating the training data distribution and the discriminator is trained to classify images as real or fake (generated) \cite{goodfellow2016nips}. Training these GAN algorithms also requires a large amount of training dataset \cite{noguchi2019image}. In this study, the aim is to address the question, “Given an unconditioned pretrained generator network and a pretrained classifier, is it feasible to develop a conditioned generator without relying on any training dataset?” The paper begins with a general introduction to the problem. The subsequent sections are structured as follows: Section 2 provides background information on the problem. Section 3 reviews relevant literature on the topic. Section 4 outlines the methodology employed in this study. Section 5 presents the experimental results. Section 6 discusses the findings and proposes potential future research directions. Finally, Section 7 offers concluding remarks. The implementation can be accessed at this https URL.

[CV-49] Enhancing Counterfactual Image Generation Using Mahalanobis Distance with Distribution Preferences in Feature Space

链接: https://arxiv.org/abs/2405.20685
作者: Yukai Zhang,Ao Xu,Zihao Li,Tieru Wu
关键词: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, Intelligence, Artificial
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the realm of Artificial Intelligence (AI), the importance of Explainable Artificial Intelligence (XAI) is increasingly recognized, particularly as AI models become more integral to our lives. One notable single-instance XAI approach is counterfactual explanation, which aids users in comprehending a model’s decisions and offers guidance on altering these decisions. Specifically in the context of image classification models, effective image counterfactual explanations can significantly enhance user understanding. This paper introduces a novel method for computing feature importance within the feature space of a black-box model. By employing information fusion techniques, our method maximizes the use of data to address feature counterfactual explanations in the feature space. Subsequently, we utilize an image generation model to transform these feature counterfactual explanations into image counterfactual explanations. Our experiments demonstrate that the counterfactual explanations generated by our method closely resemble the original images in both pixel and feature spaces. Additionally, our method outperforms established baselines, achieving impressive experimental results.
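The Mahalanobis distance the title refers to measures closeness to the feature distribution rather than raw Euclidean distance; a minimal sketch (toy 2-dim features, not the paper's black-box feature space):

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a feature vector x to a distribution
    N(mean, cov): sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Features elongated along the first axis: a Euclidean step of 2 along it
# costs the same Mahalanobis distance as a step of 1 along the second.
mean = np.zeros(2)
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
d_along = mahalanobis(np.array([2.0, 0.0]), mean, cov)   # 2 / sqrt(4) = 1
d_across = mahalanobis(np.array([0.0, 1.0]), mean, cov)  # 1 / sqrt(1) = 1
```

Preferring directions the data actually varies in is why Mahalanobis-guided counterfactuals tend to stay on-distribution and hence look realistic after decoding back to an image.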

[CV-50] Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling

链接: https://arxiv.org/abs/2405.20675
作者: Kidist Amde Mekonnen,Nicola Dall’Asen,Paolo Rota
关键词: image synthesis tasks, Diffusion Probabilistic Models, achieving remarkable performance, Diffusion Probabilistic, Probabilistic Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki, Finland

点击查看摘要

Abstract:Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models, achieving remarkable performance in image synthesis tasks. However, these models face challenges in terms of widespread adoption due to their reliance on sequential denoising steps during sample generation. This dependence leads to substantial computational requirements, making them unsuitable for resource-constrained or real-time processing systems. To address these challenges, we propose a novel method that integrates denoising phases directly into the model’s architecture, thereby reducing the need for resource-intensive computations. Our approach combines diffusion models with generative adversarial networks (GANs) through knowledge distillation, enabling more efficient training and evaluation. By utilizing a pre-trained diffusion model as a teacher model, we train a student model through adversarial learning, employing layerwise transformations for denoising and submodules for predicting the teacher model’s output at various points in time. This integration significantly reduces the number of parameters and denoising steps required, leading to improved sampling speed at test time. We validate our method with extensive experiments, demonstrating comparable performance with reduced computational requirements compared to existing approaches. By enabling the deployment of diffusion models on resource-constrained devices, our research mitigates their computational burden and paves the way for wider accessibility and practical use across the research community and end-users. Our code is publicly available at this https URL.

[CV-51] 4Diffusion: Multi-view Video Diffusion Model for 4D Generation

链接: https://arxiv.org/abs/2405.20674
作者: Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Xihui Liu,Yunhong Wang,Yu Qiao
关键词: achieved noteworthy efficacy, advanced diffusion generative, diffusion model, diffusion, achieved noteworthy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL

点击查看摘要

Abstract:Current 4D generation methods have achieved noteworthy efficacy with the aid of advanced diffusion generative models. However, these methods lack multi-view spatial-temporal modeling and encounter challenges in integrating diverse prior knowledge from multiple diffusion models, resulting in inconsistent temporal appearance and flickers. In this paper, we propose a novel 4D generation pipeline, namely 4Diffusion aimed at generating spatial-temporally consistent 4D content from a monocular video. We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. After training on a curated dataset, our diffusion model acquires reasonable temporal consistency and inherently preserves the generalizability and spatial consistency of the 3D-aware diffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling loss, which is based on our multi-view video diffusion model, to optimize 4D representation parameterized by dynamic NeRF. This aims to eliminate discrepancies arising from multiple diffusion models, allowing for generating spatial-temporally consistent 4D content. Moreover, we devise an anchor loss to enhance the appearance details and facilitate the learning of dynamic NeRF. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance compared to previous methods.

[CV-52] Investigating and unmasking feature-level vulnerabilities of CNNs to adversarial perturbations

链接: https://arxiv.org/abs/2405.20672
作者: Davide Coppola,Hwee Kuan Lee
关键词: Convolutional Neural Networks, Neural Networks, Convolutional Neural, explores the impact, aim of enhancing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 15 figures (including appendix)

点击查看摘要

Abstract:This study explores the impact of adversarial perturbations on Convolutional Neural Networks (CNNs) with the aim of enhancing the understanding of their underlying mechanisms. Despite numerous defense methods proposed in the literature, there is still an incomplete understanding of this phenomenon. Instead of treating the entire model as vulnerable, we propose that specific feature maps learned during training contribute to the overall vulnerability. To investigate how the hidden representations learned by a CNN affect its vulnerability, we introduce the Adversarial Intervention framework. Experiments were conducted on models trained on three well-known computer vision datasets, subjecting them to attacks of different nature. Our focus centers on the effects that adversarial perturbations to a model’s initial layer have on the overall behavior of the model. Empirical results revealed compelling insights: a) perturbing selected channel combinations in shallow layers causes significant disruptions; b) the channel combinations most responsible for the disruptions are common among different types of attacks; c) despite shared vulnerable combinations of channels, different attacks affect hidden representations with varying magnitudes; d) there exists a positive correlation between a kernel’s magnitude and its vulnerability. In conclusion, this work introduces a novel framework to study the vulnerability of a CNN model to adversarial perturbations, revealing insights that contribute to a deeper understanding of the phenomenon. The identified properties pave the way for the development of efficient ad-hoc defense mechanisms in future applications.
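The "intervention" probe the framework is built around, i.e. perturbing only selected channels of an early feature map, can be sketched as follows (a generic numpy stand-in with toy shapes, not the paper's Adversarial Intervention code):

```python
import numpy as np

def perturb_channels(feature_map, channels, noise_scale, rng):
    """Add noise only to the selected channels of a (C, H, W) feature
    map, leaving all other channels untouched."""
    out = feature_map.copy()
    noise = rng.normal(scale=noise_scale, size=out[channels].shape)
    out[channels] += noise
    return out

rng = np.random.default_rng(0)
fmap = np.zeros((8, 4, 4))                        # toy feature map, 8 channels
probed = perturb_channels(fmap, [1, 3], 0.5, rng)
```

Measuring how the model's output changes as a function of which channel combination is perturbed is what lets the study attribute vulnerability to specific feature maps rather than to the model as a whole.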

[CV-53] Fourier123: One Image to High-Quality 3D Object Generation with Hybrid Fourier Score Distillation

链接: https://arxiv.org/abs/2405.20669
作者: Shuzhou Yang,Yu Wang,Haijie Li,Jiarui Meng,Xiandong Meng,Jian Zhang
关键词: crafting controllable, pivotal for crafting, Single, Fourier Score Distillation, generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Single image-to-3D generation is pivotal for crafting controllable 3D assets. Given its underconstrained nature, we leverage geometric priors from a 3D novel view generation diffusion model and appearance priors from a 2D image generation method to guide the optimization process. We note that a disparity exists between the training datasets of 2D and 3D diffusion models, leading to their outputs showing marked differences in appearance. Specifically, 2D models tend to deliver more detailed visuals, whereas 3D models produce consistent yet over-smooth results across different views. Hence, we optimize a set of 3D Gaussians using 3D priors in spatial domain to ensure geometric consistency, while exploiting 2D priors in the frequency domain through Fourier transform for higher visual quality. This 2D-3D hybrid Fourier Score Distillation objective function (dubbed hy-FSD), can be integrated into existing 3D generation methods, yielding significant performance improvements. With this technique, we further develop an image-to-3D generation pipeline to create high-quality 3D objects within one minute, named Fourier123. Extensive experiments demonstrate that Fourier123 excels in efficient generation with rapid convergence speed and visual-friendly generation results.
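
A minimal sketch of the spatial-plus-frequency idea behind hy-FSD, with the 3D prior supervising pixels and the 2D prior supervising amplitude spectra. The targets, weights, and exact spectrum comparison are illustrative assumptions, not the paper's objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative targets: a geometry-consistent 3D prior (spatial domain) and a
# detail-rich 2D prior (frequency domain). Real rendered views replace these.
render = rng.normal(size=(32, 32))
target_3d = rng.normal(size=(32, 32))
target_2d = rng.normal(size=(32, 32))

def spatial_loss(x, y):
    # Pixel-wise MSE: enforces geometric consistency with the 3D prior.
    return float(np.mean((x - y) ** 2))

def frequency_loss(x, y):
    # Compare amplitude spectra so high-frequency detail is encouraged
    # without requiring pixel-wise alignment with the 2D prior.
    return float(np.mean((np.abs(np.fft.fft2(x)) - np.abs(np.fft.fft2(y))) ** 2))

def hybrid_loss(x, w_spatial=1.0, w_freq=0.1):
    return w_spatial * spatial_loss(x, target_3d) + w_freq * frequency_loss(x, target_2d)

loss = hybrid_loss(render)
```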

[CV-54] MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition

链接: https://arxiv.org/abs/2405.20666
作者: Weichao Zhao,Hezhen Hu,Wengang Zhou,Yunyao Mao,Min Wang,Houqiang Li
关键词: model representation capabilities, insufficient model representation, long been plagued, plagued by insufficient, insufficient model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by TCSVT 2024

点击查看摘要

Abstract:Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: 1) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. 2) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masked strategy to reconstruct motion residuals of masked frames, thereby explicitly exploring dynamic motion cues among sign pose sequences. Moreover, in SA, we embed our framework with global semantic awareness by aligning the embeddings of different augmented samples from the input sequence in the shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks.

[CV-55] GenMix: Combining Generative and Mixture Data Augmentation for Medical Image Classification

链接: https://arxiv.org/abs/2405.20650
作者: Hansang Lee,Haeil Lee,Helen Hong
关键词: augmentation technique called, data augmentation technique, technique called GenMix, augmentation technique, technique called
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel data augmentation technique called GenMix, which combines generative and mixture approaches to leverage the strengths of both methods. While generative models excel at creating new data patterns, they face challenges such as mode collapse in GANs and difficulties in training diffusion models, especially with limited medical imaging data. On the other hand, mixture models enhance class boundary regions but tend to favor the major class in scenarios with class imbalance. To address these limitations, GenMix integrates both approaches to complement each other. GenMix operates in two stages: (1) training a generative model to produce synthetic images, and (2) performing mixup between synthetic and real data. This process improves the quality and diversity of synthetic data while simultaneously benefiting from the new pattern learning of generative models and the boundary enhancement of mixture models. We validate the effectiveness of our method on the task of classifying focal liver lesions (FLLs) in CT images. Our results demonstrate that GenMix enhances the performance of various generative models, including DCGAN, StyleGAN, Textual Inversion, and Diffusion Models. Notably, the proposed method with Textual Inversion outperforms other methods without fine-tuning the diffusion model on the FLL dataset.
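
Stage (2) of GenMix is ordinary mixup applied across the real/synthetic boundary. A sketch assuming stage (1) has already produced synthetic images (the arrays and Beta parameter below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def genmix_batch(real_x, real_y, synth_x, synth_y, alpha=0.4):
    """Mixup between real and synthetic samples: convex combination of
    inputs and of (one-hot) labels, with a Beta-distributed mixing weight."""
    lam = rng.beta(alpha, alpha)
    x = lam * real_x + (1 - lam) * synth_x
    y = lam * real_y + (1 - lam) * synth_y   # soft labels
    return x, y, lam

real_x = rng.random((4, 8, 8))
synth_x = rng.random((4, 8, 8))              # stand-in for generator output
real_y = np.eye(2)[[0, 1, 0, 1]]
synth_y = np.eye(2)[[1, 1, 0, 0]]
mixed_x, mixed_y, lam = genmix_batch(real_x, real_y, synth_x, synth_y)
```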

[CV-56] Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization

链接: https://arxiv.org/abs/2405.20648
作者: Richard Luo,Austin Peng,Adithya Vasudev,Rishabh Jain
关键词: poses substantial challenges, information-dense medium, increasingly prominent, prominent and information-dense, poses substantial
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehension of the entire video requires not only understanding the visual-audio information of each shot but also requires that the model links the ideas between each shot to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos’ more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck Holmes. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from being able to understand a picture to being able to understand a sequence of frames. Specifically, we show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task with significantly smaller and more computationally efficient models.

[CV-57] Learning Gaze-aware Compositional GAN

链接: https://arxiv.org/abs/2405.20643
作者: Nerea Aranjuelo,Siyu Huang,Ignacio Arganda-Carreras,Luis Unzueta,Oihana Otaegui,Hanspeter Pfister,Donglai Wei
关键词: deep neural networks, Gaze-annotated facial data, training deep neural, Gaze-annotated facial, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by ETRA 2024 as Full paper, and as journal paper in Proceedings of the ACM on Computer Graphics and Interactive Techniques

点击查看摘要

Abstract:Gaze-annotated facial data is crucial for training deep neural networks (DNNs) for gaze estimation. However, obtaining these data is labor-intensive and requires specialized equipment due to the challenge of accurately annotating the gaze direction of a subject. In this work, we present a generative framework to create annotated gaze data by leveraging the benefits of labeled and unlabeled data sources. We propose a Gaze-aware Compositional GAN that learns to generate annotated facial images from a limited labeled dataset. Then we transfer this model to an unlabeled data domain to take advantage of the diversity it provides. Experiments demonstrate our approach’s effectiveness in generating within-domain image augmentations in the ETH-XGaze dataset and cross-domain augmentations in the CelebAMask-HQ dataset domain for gaze estimation DNN training. We also show additional applications of our work, which include facial image editing and gaze redirection.

[CV-58] Action-OOD: An End-to-End Skeleton-Based Model for Robust Out-of-Distribution Human Action Detection

链接: https://arxiv.org/abs/2405.20633
作者: Jing Xu,Anqi Zhu,Jingyu Lin,Qiuhong Ke,Cunjian Chen
关键词: computer vision systems, OOD human action, OOD, OOD detection, vision systems
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under consideration at Computer Vision and Image Understanding

点击查看摘要

Abstract:Human action recognition is a crucial task in computer vision systems. However, in real-world scenarios, human actions often fall outside the distribution of training data, requiring a model to both recognize in-distribution (ID) actions and reject out-of-distribution (OOD) ones. Despite its importance, there has been limited research on OOD detection in human actions. Existing works on OOD detection mainly focus on image data with RGB structure, and many methods are post-hoc in nature. While these methods are convenient and computationally efficient, they often lack sufficient accuracy and fail to consider the presence of OOD samples. To address these challenges, we propose a novel end-to-end skeleton-based model called Action-OOD, specifically designed for OOD human action detection. Unlike some existing approaches that may require prior knowledge of existing OOD data distribution, our model solely utilizes in-distribution (ID) data during the training stage, effectively mitigating the overconfidence issue prevalent in OOD detection. We introduce an attention-based feature fusion block, which enhances the model’s capability to recognize unknown classes while preserving classification accuracy for known classes. Further, we present a novel energy-based loss function and successfully integrate it with the traditional cross-entropy loss to maximize the separation of data distributions between ID and OOD. Through extensive experiments conducted on NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-400 datasets, we demonstrate the superior performance of our proposed approach compared to state-of-the-art methods. Our findings underscore the effectiveness of classic OOD detection techniques in the context of skeleton-based action recognition tasks, offering promising avenues for future research in this field. Code will be available at: this https URL.
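
The abstract does not spell out the paper's novel energy-based loss; as a reference point, the standard free-energy OOD score over classifier logits, which energy-based detectors typically threshold, looks like this (the logits and threshold are illustrative):

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free-energy OOD score: E(x) = -T * logsumexp(logits / T).
    Confident (peaked) logits give low energy; flat logits give high energy."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()                               # subtract max for stability
    return float(-T * (m + np.log(np.exp(z - m).sum())))

def is_ood(logits, threshold):
    return energy_score(logits) > threshold

id_logits = [8.0, 0.1, -1.0]   # confidently recognized known action
ood_logits = [0.2, 0.1, 0.0]   # near-uniform logits on an unknown action
```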

[CV-59] ToxVidLLM: A Multimodal LLM-based Framework for Toxicity Detection in Code-Mixed Videos

链接: https://arxiv.org/abs/2405.20628
作者: Krishanu Maity,A.S. Poornash,Sriparna Saha,Pushpak Bhattacharyya
关键词: evolving internet technology, rapidly evolving internet, toxic content detection, toxic content, internet technology
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: ACL Findings 2024

点击查看摘要

Abstract:In an era of rapidly evolving internet technology, the surge in multimodal content, including videos, has expanded the horizons of online communication. However, the detection of toxic content in this diverse landscape, particularly in low-resource code-mixed languages, remains a critical challenge. While substantial research has addressed toxic content detection in textual data, the realm of video content, especially in non-English languages, has been relatively underexplored. This paper addresses this research gap by introducing a benchmark dataset, the first of its kind, consisting of 931 videos with 4021 code-mixed Hindi-English utterances collected from YouTube. Each utterance within this dataset has been meticulously annotated for toxicity, severity, and sentiment labels. We have developed an advanced Multimodal Multitask framework built for Toxicity detection in Video Content by leveraging Large Language Models (LLMs), crafted for the primary objective along with the additional tasks of conducting sentiment and severity analysis. ToxVidLLM incorporates three key modules: the Encoder module, the Cross-Modal Synchronization module, and the Multitask module, crafting a generic multimodal LLM customized for intricate video classification tasks. Our experiments reveal that incorporating multiple modalities from the videos substantially enhances the performance of toxic content detection by achieving an Accuracy and Weighted F1 score of 94.29% and 94.35%, respectively.

[CV-60] EPIDetect: Video-based convulsive seizure detection in chronic epilepsy mouse model for anti-epilepsy drug screening

链接: https://arxiv.org/abs/2405.20614
作者: Junming Ren,Zhoujian Xiao,Yujia Zhang,Yujie Yang,Ling He,Ezra Yoon,Stephen Temitayo Bello,Xi Chen,Dapeng Wu,Micky Tortorella,Jufang He
关键词: spontaneous recurrent seizures, preclinical translational studies, remarkable anti-epileptic efficacy, anti-epileptic efficacy demonstrate, efficacy demonstrate long-term
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the preclinical translational studies, drug candidates with remarkable anti-epileptic efficacy demonstrate long-term suppression of spontaneous recurrent seizures (SRSs), particularly convulsive seizures (CSs), in mouse models of chronic epilepsy. However, the current methods for monitoring CSs have limitations in terms of invasiveness, specific laboratory settings, high cost, and complex operation, which hinder drug screening efforts. In this study, a camera-based system for automated detection of CSs in chronically epileptic mice is first established to screen potential anti-epilepsy drugs.

[CV-61] Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation

链接: https://arxiv.org/abs/2405.20610
作者: Wooseok Shin,Hyun Joon Park,Jin Sob Kim,Sung Won Han
关键词: mitigate confirmation bias, coupling problems, confirmation bias, bias and coupling, semi-supervised semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 5 figures, submitted to IEEE TPAMI. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:In semi-supervised semantic segmentation, the Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. However, despite their high performance, these approaches frequently involve complex training pipelines and a substantial computational burden, limiting the scalability and compatibility of these methods. In this paper, we propose a PrevMatch framework that effectively mitigates the aforementioned limitations by maximizing the utilization of the temporal knowledge obtained during the training process. The PrevMatch framework relies on two core strategies: (1) we reconsider the use of temporal knowledge and thus directly utilize previous models obtained during training to generate additional pseudo-label guidance, referred to as previous guidance. (2) we design a highly randomized ensemble strategy to maximize the effectiveness of the previous guidance. Experimental results on four benchmark semantic segmentation datasets confirm that the proposed method consistently outperforms existing methods across various evaluation protocols. In particular, with DeepLabV3+ and ResNet-101 network settings, PrevMatch outperforms the existing state-of-the-art method, Diverse Co-training, by +1.6 mIoU on Pascal VOC with only 92 annotated images, while achieving 2.4 times faster training. Furthermore, the results indicate that PrevMatch induces stable optimization, particularly in benefiting classes that exhibit poor performance. Code is available at this https URL
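
The "previous guidance" idea can be sketched as follows: keep predictions from model snapshots saved earlier in training and form pseudo-labels from a randomized ensemble of them plus the current prediction. The shapes, the Dirichlet stand-ins for softmax outputs, and the subset-size rule are all illustrative, not PrevMatch's exact ensemble strategy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel class probabilities from the current model and from
# snapshots saved earlier in training (the "previous" models).
num_classes, H, W = 3, 4, 4
current = rng.dirichlet(np.ones(num_classes), size=(H, W))            # (H, W, C)
previous = [rng.dirichlet(np.ones(num_classes), size=(H, W)) for _ in range(5)]

def prevmatch_pseudo_labels(current, previous, max_models=3):
    """Randomized ensemble: average the current prediction with a random
    subset of previous-model predictions, then take the per-pixel argmax."""
    k = int(rng.integers(1, max_models + 1))
    idx = rng.choice(len(previous), size=k, replace=False)
    stacked = np.stack([current] + [previous[i] for i in idx])
    return stacked.mean(axis=0).argmax(axis=-1)                       # (H, W)

pseudo = prevmatch_pseudo_labels(current, previous)
```

Because the snapshots already exist from normal training, this guidance adds little compute compared with maintaining a separate teacher or co-trained network.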

[CV-62] Textual Inversion and Self-supervised Refinement for Radiology Report Generation

链接: https://arxiv.org/abs/2405.20607
作者: Yuanjiang Luo,Hongxiang Li,Xuan Wu,Meng Cao,Xiaoshuang Huang,Zhihong Zhu,Peixi Liao,Hu Chen,Yi Zhang
关键词: mainstream approaches follow, generating radiology reports, Existing mainstream approaches, mainstream approaches, approaches follow
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring report content constraints. In this paper, we proposed Textual Inversion and Self-supervised Refinement (TISR) to address the above two issues. Specifically, textual inversion can project text and image into the same space by representing images as pseudo words to eliminate the cross-modeling gap. Subsequently, self-supervised refinement refines these pseudo words through contrastive loss computation between images and texts, enhancing the fidelity of generated reports to images. Notably, TISR is orthogonal to most existing methods, plug-and-play. We conduct experiments on two widely-used public datasets and achieve significant improvements on various baselines, which demonstrates the effectiveness and generalization of TISR. The code will be available soon.
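
The self-supervised refinement step rests on a contrastive loss between image embeddings and pseudo-word embeddings. A minimal symmetric InfoNCE sketch, where matched pairs share an index; the embedding sizes and temperature are illustrative and the paper's exact formulation may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: pull each image toward its own pseudo-word
    embedding and away from the other pairs in the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(img))

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)    # stable log-softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return float((ce(logits) + ce(logits.T)) / 2)

N, D = 4, 16
img_emb = rng.normal(size=(N, D))
paired = contrastive_loss(img_emb, img_emb)            # perfectly aligned pairs
mismatched = contrastive_loss(img_emb, rng.normal(size=(N, D)))
```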

[CV-63] Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

链接: https://arxiv.org/abs/2405.20606
作者: Yang Chen,Tian He,Junfeng Fu,Ling Wang,Jingcai Guo,Hong Cheng
关键词: Supervised and self-supervised, main training paradigms, Vision-Language knowledge prompts, human skeleton action, Vision-Language knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Supervised and self-supervised learning are two main training paradigms for skeleton-based human action recognition. However, the former one-hot classification requires labor-intensive predefined action categories annotations, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C^2VL) based on Cross-modal Contrastive learning that uses the progressive distillation to learn task-agnostic human skeleton action representation from the Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose the intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process to progressively control and guide the degree of pulling vision-language knowledge prompts and corresponding skeletons closer. These soft instance discrimination and self-knowledge distillation strategies contribute to the learning of better skeleton-based action representations from the noisy skeleton-vision-language pairs. During the inference phase, our method requires only the skeleton data as input for action recognition, with no need for the vision-language prompts. Extensive experiments show that our method achieves state-of-the-art results on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available in the future.

[CV-64] Searching for internal symbols underlying deep learning

链接: https://arxiv.org/abs/2405.20605
作者: Jung H. Lee,Sujith Vijayan
关键词: enables deep neural, deep neural networks, automatically learn complex, learn complex tasks, enables deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 7 figures, 3 tables and Appendix

点击查看摘要

Abstract:Deep learning (DL) enables deep neural networks (DNNs) to automatically learn complex tasks or rules from given examples without instructions or guiding principles. As we do not engineer DNNs’ functions, it is extremely difficult to diagnose their decisions, and multiple lines of studies proposed to explain principles of DNNs/DL operations. Notably, one line of studies suggests that DNNs may learn concepts, the high level features recognizable to humans. Thus, we hypothesized that DNNs develop abstract codes, not necessarily recognizable to humans, which can be used to augment DNNs’ decision-making. To address this hypothesis, we combined foundation segmentation models and unsupervised learning to extract internal codes and identify potential use of abstract codes to make DL’s decision-making more reliable and safer.

[CV-65] Generalized Semi-Supervised Learning via Self-Supervised Feature Adaptation

链接: https://arxiv.org/abs/2405.20596
作者: Jiachen Liang,Ruibing Hou,Hong Chang,Bingpeng Ma,Shiguang Shan,Xilin Chen
关键词: Traditional semi-supervised learning, Traditional semi-supervised, SSL, unlabeled data, semi-supervised learning
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages; Accepted by NeurIPS 2023

点击查看摘要

Abstract:Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under this setting, previous SSL methods tend to predict wrong pseudo-labels with the model fitted on labeled data, resulting in noise accumulation. To tackle this issue, we propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions. SSFA decouples the prediction of pseudo-labels from the current model to improve the quality of pseudo-labels. Particularly, SSFA incorporates a self-supervised task into the SSL framework and uses it to adapt the feature extractor of the model to the unlabeled data. In this way, the extracted features better fit the distribution of unlabeled data, thereby generating high-quality pseudo-labels. Extensive experiments show that our proposed SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.

[CV-66] Disrupting Diffusion: Token-Level Attention Erasure Attack against Diffusion-based Customization

链接: https://arxiv.org/abs/2405.20584
作者: Yisu Liu,Jinyang An,Wanqian Zhang,Dayan Wu,Jingzi Gu,Zheng Lin,Weiping Wang
关键词: diffusion-based customization methods, development of diffusion-based, access to train, generate their personalized, personalized images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:With the development of diffusion-based customization methods like DreamBooth, individuals now have access to train the models that can generate their personalized images. Despite the convenience, malicious users have misused these techniques to create fake images, thereby triggering a privacy security crisis. In light of this, proactive adversarial attacks are proposed to protect users against customization. The adversarial examples are trained to distort the customization model’s outputs and thus block the misuse. In this paper, we propose DisDiff (Disrupting Diffusion), a novel adversarial attack method to disrupt the diffusion model outputs. We first delve into the intrinsic image-text relationships, well-known as cross-attention, and empirically find that the subject-identifier token plays an important role in guiding image generation. Thus, we propose the Cross-Attention Erasure module to explicitly “erase” the indicated attention maps and disrupt the text guidance. Besides,we analyze the influence of the sampling process of the diffusion model on Projected Gradient Descent (PGD) attack and introduce a novel Merit Sampling Scheduler to adaptively modulate the perturbation updating amplitude in a step-aware manner. Our DisDiff outperforms the state-of-the-art methods by 12.75% of FDFR scores and 7.25% of ISM scores across two facial benchmarks and two commonly used prompts on average.
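
The attack loop pairs standard PGD with a step-aware perturbation amplitude. The sketch below is generic L-infinity PGD on a toy objective; the `schedule` argument stands in for the Merit Sampling Scheduler, whose actual weighting is not given in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgd_attack(x, grad_fn, eps=0.05, steps=10, base_alpha=0.01, schedule=None):
    """L_inf PGD: signed-gradient ascent steps, projected back into the
    eps-ball around x. `schedule(t)` rescales the step size per iteration."""
    x_adv = x.copy()
    for t in range(steps):
        alpha = base_alpha * (schedule(t) if schedule else 1.0)
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the budget
    return x_adv

# Toy objective: push the image away from a reference; gradient is (x - ref).
ref = rng.random((8, 8))
x = rng.random((8, 8))
grad = lambda z: z - ref
x_adv = pgd_attack(x, grad, schedule=lambda t: 1.0 / (1 + t))  # decaying steps
```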

[CV-67] Comparing Quantum Annealing and Spiking Neuromorphic Computing for Sampling Binary Sparse Coding QUBO Problems

链接: https://arxiv.org/abs/2405.20525
作者: Kyle Henke,Elijah Pelofske,Garrett Kenyon,Georg Hahn
关键词: sparse representation QUBO, quantum annealing, quantum, annealing, Loihi
类目: Emerging Technologies (cs.ET); Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:We consider the problem of computing a sparse binary representation of an image. To be precise, given an image and an overcomplete, non-orthonormal basis, we aim to find a sparse binary vector indicating the minimal set of basis vectors that when added together best reconstruct the given input. We formulate this problem with an L_2 loss on the reconstruction error, and an L_0 (or, equivalently, an L_1 ) loss on the binary vector enforcing sparsity. This yields a quadratic binary optimization problem (QUBO), whose optimal solution(s) in general is NP-hard to find. The method of unsupervised and unnormalized dictionary feature learning for a desired sparsity level to best match the data is presented. Next, we solve the sparse representation QUBO by implementing it both on a D-Wave quantum annealer with Pegasus chip connectivity via minor embedding, as well as on the Intel Loihi 2 spiking neuromorphic processor. On the quantum annealer, we sample from the sparse representation QUBO using parallel quantum annealing combined with quantum evolution Monte Carlo, also known as iterated reverse annealing. On Loihi 2, we use a stochastic winner take all network of neurons. The solutions are benchmarked against simulated annealing, a classical heuristic, and the optimal solutions are computed using CPLEX. Iterated reverse quantum annealing performs similarly to simulated annealing, although simulated annealing is always able to sample the optimal solution whereas quantum annealing was not always able to. The Loihi 2 solutions that are sampled are on average more sparse than the solutions from any of the other methods. Loihi 2 outperforms a D-Wave quantum annealer standard linear-schedule anneal, while iterated reverse quantum annealing performs much better than both unmodified linear-schedule quantum annealing and iterated warm starting on Loihi 2.
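
The QUBO construction follows directly from expanding the stated objective: with binary a, a_i^2 = a_i, so both the linear term -2 x^T D a and the sparsity penalty fold into the diagonal of Q = D^T D. A small instance, solved by brute force in place of an annealer (sizes and lambda are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def sparse_coding_qubo(D, x, lam=0.1):
    """QUBO matrix for min_a ||x - D a||^2 + lam * ||a||_0 over binary a.
    The constant x^T x is dropped; it does not change the argmin."""
    G = D.T @ D
    Q = G.copy()
    np.fill_diagonal(Q, np.diag(G) - 2 * D.T @ x + lam)
    return Q

def qubo_energy(Q, a):
    return float(a @ Q @ a)

# Tiny instance with a planted sparse code, solved exhaustively.
n = 6
D = rng.normal(size=(4, n))        # overcomplete, non-orthonormal basis
true_a = np.array([1, 0, 1, 0, 0, 0])
x = D @ true_a
Q = sparse_coding_qubo(D, x, lam=0.1)
best = min((np.array(a) for a in product([0, 1], repeat=n)),
           key=lambda a: qubo_energy(Q, a))
```

An annealer or neuromorphic sampler replaces the exhaustive `min` once n grows beyond a few dozen variables.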

[CV-68] Deep Modeling of Non-Gaussian Aleatoric Uncertainty

链接: https://arxiv.org/abs/2405.20513
作者: Aastha Acharya,Caleb Lee,Marissa D’Alonzo,Jared Shamwell,Nisar R. Ahmed,Rebecca Russell
关键词: fixed and Gaussian, learning offers promising, Deep learning offers, model aleatoric uncertainty, robotic estimation systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Deep learning offers promising new ways to accurately model aleatoric uncertainty in robotic estimation systems, particularly when the uncertainty distributions do not conform to traditional assumptions of being fixed and Gaussian. In this study, we formulate and evaluate three fundamental deep learning approaches for conditional probability density modeling to quantify non-Gaussian aleatoric uncertainty: parametric, discretized, and generative modeling. We systematically compare the respective strengths and weaknesses of these three methods on simulated non-Gaussian densities as well as on real-world terrain-relative navigation data. Our results show that these deep learning methods can accurately capture complex uncertainty patterns, highlighting their potential for improving the reliability and robustness of estimation systems.
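
Of the three approaches, the parametric one is the simplest to sketch: a network head predicts a mean and log-variance per input and is trained with the heteroscedastic Gaussian negative log-likelihood (for non-Gaussian densities the same idea extends to mixture or generative parameterizations). The values below are illustrative:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian NLL: the network outputs (mu, log_var) per
    input; predicting log-variance keeps the variance strictly positive."""
    var = np.exp(log_var)
    return float(np.mean(0.5 * (log_var + (y - mu) ** 2 / var + np.log(2 * np.pi))))

y = np.array([0.0, 1.0, 2.0])
good = gaussian_nll(y, mu=y, log_var=np.full(3, -2.0))            # tight, accurate
overconfident = gaussian_nll(y, mu=y + 1.0, log_var=np.full(3, -2.0))
hedged = gaussian_nll(y, mu=y + 1.0, log_var=np.full(3, 1.0))
```

The ordering of the three losses shows why this objective calibrates uncertainty: a wrong mean with a small claimed variance is penalized far more than the same wrong mean with an honest, larger variance.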

[CV-69] Physically Compatible 3D Object Modeling from a Single Image

链接: https://arxiv.org/abs/2405.20510
作者: Minghao Guo,Bohan Wang,Pingchuan Ma,Tianyuan Zhang,Crystal Elaine Owens,Chuang Gan,Joshua B. Tenenbaum,Kaiming He,Wojciech Matusik
关键词: transforms single images, present a computational, transforms single, physical, external forces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a computational framework that transforms single images into 3D physical objects. The visual geometry of a physical object in an image is determined by three orthogonal attributes: mechanical properties, external forces, and rest-shape geometry. Existing single-view 3D reconstruction methods often overlook this underlying composition, presuming rigidity or neglecting external forces. Consequently, the reconstructed objects fail to withstand real-world physical forces, resulting in instability or undesirable deformation – diverging from their intended designs as depicted in the image. Our optimization framework addresses this by embedding physical compatibility into the reconstruction process. We explicitly decompose the three physical attributes and link them through static equilibrium, which serves as a hard constraint, ensuring that the optimized physical shapes exhibit desired physical behaviors. Evaluations on a dataset collected from Objaverse demonstrate that our framework consistently enhances the physical realism of 3D models over existing methods. The utility of our framework extends to practical applications in dynamic simulations and 3D printing, where adherence to physical compatibility is paramount.

[CV-70] ShelfHelp: Empowering Humans to Perform Vision-Independent Manipulation Tasks with a Socially Assistive Robotic Cane

链接: https://arxiv.org/abs/2405.20501
作者: Shivendra Agrawal,Suresh Nayak,Ashutosh Naik,Bradley Hayes
关键词: shop independently, quality of life, ability to shop, important for maintaining, maintaining a high
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 14 figures and charts

点击查看摘要

Abstract:The ability to shop independently, especially in grocery stores, is important for maintaining a high quality of life. This can be particularly challenging for people with visual impairments (PVI). Stores carry thousands of products, with approximately 30,000 new products introduced each year in the US market alone, presenting a challenge even for modern computer vision solutions. Through this work, we present a proof-of-concept socially assistive robotic system we call ShelfHelp, and propose novel technical solutions for enhancing instrumented canes traditionally meant for navigation tasks with additional capability within the domain of shopping. ShelfHelp includes a novel visual product locator algorithm designed for use in grocery stores and a novel planner that autonomously issues verbal manipulation guidance commands to guide the user during product retrieval. Through a human subjects study, we show the system’s success in locating and providing effective manipulation guidance to retrieve desired products with novice users. We compare two autonomous verbal guidance modes achieving comparable performance to a human assistance baseline, and present encouraging findings that validate our system’s efficiency and effectiveness through positive subjective metrics including competence, intelligence, and ease of use.

[CV-71] Slight Corruption in Pre-training Data Makes Better Diffusion Models

链接: https://arxiv.org/abs/2405.20494
作者: Hao Chen,Yujin Han,Diganta Misra,Xiang Li,Kai Hu,Difan Zou,Masashi Sugiyama,Jindong Wang,Bhiksha Raj
关键词: shown remarkable capabilities, generating realistic high-quality, realistic high-quality images, Diffusion models, shown remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 50 pages, 33 figures, 4 tables

点击查看摘要

Abstract:Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
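The CEP idea above lends itself to a compact illustration. Below is a minimal NumPy sketch, assuming a simple isotropic Gaussian perturbation of the condition embeddings; the function name, default scale, and noise choice are illustrative, not the paper's exact implementation:

```python
import numpy as np

def perturb_condition_embedding(cond_emb, gamma=0.1, rng=None):
    """Add isotropic Gaussian noise to a batch of condition embeddings.

    cond_emb: (batch, dim) array of class/text condition embeddings.
    gamma:    perturbation scale (a hyperparameter; default is illustrative).
    """
    rng = rng or np.random.default_rng()
    return cond_emb + gamma * rng.standard_normal(cond_emb.shape)

# During conditional DM training, the perturbed embedding would replace
# the clean one before it conditions the denoiser:
emb = np.zeros((4, 8))
noisy_emb = perturb_condition_embedding(emb, gamma=0.1,
                                        rng=np.random.default_rng(0))
```

The perturbation mimics the "slight corruption" the paper finds beneficial, injected at the embedding level rather than by corrupting the dataset itself.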

[CV-72] STHN: Deep Homography Estimation for UAV Thermal Geo-localization with Satellite Imagery

链接: https://arxiv.org/abs/2405.20470
作者: Jiuhong Xiao,Ning Zhang,Daniel Tortei,Giuseppe Loianno
关键词: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, power line inspections, outdoor applications including
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 7 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Accurate geo-localization of Unmanned Aerial Vehicles (UAVs) is crucial for a variety of outdoor applications including search and rescue operations, power line inspections, and environmental monitoring. The vulnerability of Global Navigation Satellite Systems (GNSS) signals to interference and spoofing necessitates the development of additional robust localization methods for autonomous navigation. Visual Geo-localization (VG), leveraging onboard cameras and reference satellite maps, offers a promising solution for absolute localization. Specifically, Thermal Geo-localization (TG), which relies on image-based matching between thermal imagery with satellite databases, stands out by utilizing infrared cameras for effective night-time localization. However, the efficiency and effectiveness of current TG approaches are hindered by dense sampling on satellite maps and geometric noise in thermal query images. To overcome these challenges, in this paper, we introduce STHN, a novel UAV thermal geo-localization approach that employs a coarse-to-fine deep homography estimation method. This method attains reliable thermal geo-localization within a 512-meter radius of the UAV’s last known location even with a challenging 11% overlap between satellite and thermal images, despite the presence of indistinct textures in thermal imagery and self-similar patterns in both spectra. Our research significantly enhances UAV thermal geo-localization performance and robustness against the impacts of geometric noise under low-visibility conditions in the wild. The code will be made publicly available.

[CV-73] Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

链接: https://arxiv.org/abs/2405.20469
作者: Krishnakant Singh,Thanush Navaratnam,Jannik Holmer,Simone Schaub-Meyer,Stefan Roth
关键词: developing machine learning, machine learning approaches, high-quality labeled data, synthetic clone models, long-standing challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at CVPR 2024 Workshop: SyntaGen-Harnessing Generative Models for Synthetic Visual Datasets. Project page at this https URL

点击查看摘要

Abstract:A long-standing challenge in developing machine learning approaches has been the lack of high-quality labeled data. Recently, models trained with purely synthetic data, here termed synthetic clones, generated using large-scale pre-trained diffusion models have shown promising results in overcoming this annotation bottleneck. As these synthetic clone models progress, they are likely to be deployed in challenging real-world settings, yet their suitability remains understudied. Our work addresses this gap by providing the first benchmark for three classes of synthetic clone models, namely supervised, self-supervised, and multi-modal ones, across a range of robustness measures. We show that existing synthetic self-supervised and multi-modal clones are comparable to or outperform state-of-the-art real-image baselines for a range of robustness metrics - shape bias, background bias, calibration, etc. However, we also find that synthetic clones are much more susceptible to adversarial and real-world noise than models trained with real data. To address this, we find that combining both real and synthetic data further increases the robustness, and that the choice of prompt used for generating synthetic images plays an important part in the robustness of synthetic clones.

[CV-74] ENTIRe-ID: An Extensive and Diverse Dataset for Person Re-Identification

链接: https://arxiv.org/abs/2405.20465
作者: Serdar Yildiz,Ahmet Nezih Kasim
关键词: growing importance, reidentification in computer, computer vision, vision has highlighted, ENTIRe-ID dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 2024 18th International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:The growing importance of person re-identification in computer vision has highlighted the need for more extensive and diverse datasets. In response, we introduce the ENTIRe-ID dataset, an extensive collection comprising over 4.45 million images from 37 different cameras in varied environments. This dataset is uniquely designed to tackle the challenges of domain variability and model generalization, areas where existing datasets for person re-identification have fallen short. The ENTIRe-ID dataset stands out for its coverage of a wide array of real-world scenarios, encompassing various lighting conditions, angles of view, and diverse human activities. This design ensures a realistic and robust training platform for ReID models. The ENTIRe-ID dataset is publicly available at this https URL

[CV-75] Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining

链接: https://arxiv.org/abs/2405.20462
作者: Yi Wang,Conrad M Albrecht,Xiao Xiang Zhu
关键词: building Earth observation, raised great interest, Earth observation, large-scale satellite data, Self-supervised pretraining
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:Self-supervised pretraining on large-scale satellite data has raised great interest in building Earth observation (EO) foundation models. However, many important resources beyond pure satellite imagery, such as land-cover-land-use products that provide free global semantic information, as well as vision foundation models that hold strong knowledge of the natural world, tend to be overlooked. In this work, we show these free additional resources not only help resolve common contrastive learning bottlenecks, but also significantly boost the efficiency and effectiveness of EO pretraining. Specifically, we first propose soft contrastive learning that optimizes cross-scene soft similarity based on land-cover-generated multi-label supervision, naturally solving the issue of multiple positive samples and overly strict positive matching in complex scenes. Second, we explore cross-domain continual pretraining for both multispectral and SAR imagery, building efficient EO foundation models from the strongest vision models, such as DINOv2. Integrating simple weight-initialization and Siamese masking strategies into our soft contrastive learning framework, we demonstrate impressive continual pretraining performance even when the input channels and modalities are not aligned. Without prohibitive training, we produce multispectral and SAR foundation models that achieve significantly better results in 9 out of 10 downstream tasks than most existing SOTA models. For example, our ResNet50/ViT-S achieve 84.8/85.0 linear probing mAP scores on BigEarthNet-10% which are better than most existing ViT-L models; under the same setting, our ViT-B sets a new record of 86.8 in multispectral, and 82.5 in SAR, the latter even better than many multispectral models. Dataset and models are available at this https URL.
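The soft contrastive step described above replaces hard one-hot positives with targets derived from land-cover label overlap. A minimal NumPy sketch of one plausible formulation (the label normalization and temperature choices here are our own assumptions, not the paper's exact loss):

```python
import numpy as np

def soft_contrastive_loss(z1, z2, labels, temperature=0.1):
    """Cross-scene soft contrastive loss (illustrative sketch).

    z1, z2: (n, d) L2-normalized embeddings of two views of n scenes.
    labels: (n, k) multi-hot land-cover labels per scene.
    The target distribution for scene i is proportional to the cosine
    similarity of its label vector with every other scene's, so scenes
    sharing land-cover classes act as soft positives.
    """
    # soft targets from multi-label overlap, normalized per row
    lab = labels / (np.linalg.norm(labels, axis=1, keepdims=True) + 1e-8)
    target = lab @ lab.T                       # (n, n), entries in [0, 1]
    target = target / target.sum(axis=1, keepdims=True)

    # cross-entropy between soft targets and embedding similarities
    logits = (z1 @ z2.T) / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(target * log_prob).sum(axis=1).mean()
```

With one-hot label vectors this reduces to the usual contrastive objective; overlapping multi-hot labels smooth the targets instead of forcing an exact-match positive.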

[CV-76] On Calibration of Object Detectors: Pitfalls Evaluation and Baselines

链接: https://arxiv.org/abs/2405.20459
作者: Selim Kuzucu,Kemal Oksuz,Jonathan Sadeghi,Puneet K. Dokania
关键词: requires careful attention, Reliable usage, object detectors require, requires careful, careful attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 31 pages, 8 figures

点击查看摘要

Abstract:Reliable usage of object detectors requires them to be calibrated – a crucial problem that requires careful attention. Recent approaches towards this involve (1) designing new loss functions to obtain calibrated detectors by training them from scratch, and (2) post-hoc Temperature Scaling (TS) that learns to scale the likelihood of a trained detector to output calibrated predictions. These approaches are then evaluated based on a combination of Detection Expected Calibration Error (D-ECE) and Average Precision. In this work, via extensive analysis and insights, we highlight that these recent evaluation frameworks, evaluation metrics, and the use of TS have notable drawbacks leading to incorrect conclusions. As a step towards fixing these issues, we propose a principled evaluation framework to jointly measure calibration and accuracy of object detectors. We also tailor efficient and easy-to-use post-hoc calibration approaches, such as Platt Scaling and Isotonic Regression, specifically for the object detection task. Contrary to the common notion, our experiments show that once designed and evaluated properly, post-hoc calibrators, which are extremely cheap to build and use, are much more powerful and effective than the recent train-time calibration methods. To illustrate, D-DETR with our post-hoc Isotonic Regression calibrator outperforms the recent train-time state-of-the-art calibration method Cal-DETR by more than 7 D-ECE on the COCO dataset. Additionally, we propose improved versions of the recently proposed Localization-aware ECE and show the efficacy of our method on these metrics as well. Code is available at: this https URL.
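Isotonic regression, one of the post-hoc calibrators mentioned above, can be fit with the classic pool-adjacent-violators (PAV) algorithm. The sketch below is a generic 1-D PAV, not the authors' detection-specific variant:

```python
import numpy as np

def pav_isotonic_fit(scores, correct):
    """Fit a monotone non-decreasing map from detector confidence to
    empirical correctness using pool-adjacent-violators.

    scores:  detector confidences; correct: 1 for true positives, else 0.
    Returns (x, fitted) breakpoints; calibrate new scores with
    np.interp(new_scores, x, fitted).
    """
    order = np.argsort(scores)
    x = np.asarray(scores, float)[order]
    y = np.asarray(correct, float)[order]
    vals, wts = [], []          # block means and block sizes
    for yi in y:
        vals.append(yi); wts.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2:] = [(vals[-2] * wts[-2] + vals[-1] * wts[-1]) / w]
            wts[-2:] = [w]
    fitted = np.repeat(vals, wts)  # expand block means back to samples
    return x, fitted
```

Because the fit is a step function learned from held-out detections, it is exactly the kind of "extremely cheap to build and use" calibrator the abstract argues for.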

[CV-77] P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation

链接: https://arxiv.org/abs/2405.20443
作者: Qi Zhang,Guohua Geng,Longquan Yan,Pengbo Zhou,Zhaodi Li,Kang Li,Qinglin Liu
关键词: remote-sensing images, essential components, deal with remote-sensing, semantic segmentation tasks, segmentation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models and multi-scale features are essential components in semantic segmentation tasks that deal with remote-sensing images. They contribute to improved segmentation boundaries and offer significant contextual information. U-net-like architectures are frequently employed in diffusion models for segmentation tasks. These architectural designs include dense skip connections that may pose challenges for interpreting intermediate features. Consequently, they might not efficiently convey semantic information throughout various layers of the encoder-decoder architecture. To address these challenges, we propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches. This model consists of Parallel Multiscale Diffusion modules (P-MSDiff) and a Cross-Bridge Linear Attention mechanism (CBLA). P-MSDiff enhances the understanding of semantic information across multiple levels of granularity and detects repetitive distribution data through the integration of recursive denoising branches. It further facilitates the amalgamation of data by connecting relevant branches to the primary framework to enable concurrent denoising. Furthermore, within the interconnected transformer architecture, the LA module has been replaced with the CBLA module. This module integrates a semidefinite matrix linked to the query into the dot product computation of keys and values. This integration enables the adaptation of queries within the LA framework. This adjustment enhances the structure for multi-head attention computation, leading to enhanced network performance; moreover, CBLA is a plug-and-play module. Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets, showing improvements of 1.60% and 1.40% over strong baseline models, respectively.

[CV-78] Exploring the Practicality of Federated Learning: A Survey Towards the Communication Perspective

链接: https://arxiv.org/abs/2405.20431
作者: Khiem Le,Nhan Luong-Ha,Manh Nguyen-Duc,Danh Le-Phuoc,Cuong Do,Kok-Seng Wong
关键词: enabling collaborative training, Federated Learning, offers significant advancements, centralizing data, paradigm that offers
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a promising paradigm that offers significant advancements in privacy-preserving, decentralized machine learning by enabling collaborative training of models across distributed devices without centralizing data. However, the practical deployment of FL systems faces a significant bottleneck: the communication overhead caused by frequently exchanging large model updates between numerous devices and a central server. This communication inefficiency can hinder training speed, model performance, and the overall feasibility of real-world FL applications. In this survey, we investigate various strategies and advancements made in communication-efficient FL, highlighting their impact and potential to overcome the communication challenges inherent in FL systems. Specifically, we define measures for communication efficiency, analyze sources of communication inefficiency in FL systems, and provide a taxonomy and comprehensive review of state-of-the-art communication-efficient FL methods. Additionally, we discuss promising future research directions for enhancing the communication efficiency of FL systems. By addressing the communication bottleneck, FL can be effectively applied and enable scalable and practical deployment across diverse applications that require privacy-preserving, decentralized machine learning, such as IoT, healthcare, or finance.
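As a concrete example of the communication-efficient strategies such surveys cover, top-k gradient sparsification lets each client upload only the largest-magnitude gradient entries. A minimal NumPy sketch (selection only; practical systems also add error feedback and index compression):

```python
import numpy as np

def topk_sparsify(grad, ratio=0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient
    entries. Returns (indices, values) -- the pair a client would
    upload instead of the dense gradient.
    """
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def desparsify(idx, vals, shape):
    """Server side: scatter the sparse update back into a dense array."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = vals
    return out.reshape(shape)
```

At ratio=0.01 this shrinks each upload by roughly 100x (ignoring index overhead), which is the kind of trade-off between communication cost and model quality such surveys taxonomize.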

[CV-79] Back to the Basics on Predicting Transfer Performance

链接: https://arxiv.org/abs/2405.20420
作者: Levy Chaves,Eduardo Valle,Alceu Bissoto,Sandra Avila
关键词: deep learning, evolving landscape, landscape of deep, growing number, number of choices
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 3 figures, 2 tables

点击查看摘要

Abstract:In the evolving landscape of deep learning, selecting the best pre-trained models from a growing number of choices is a challenge. Transferability scorers propose alleviating this scenario, but their recent proliferation, ironically, poses the challenge of their own assessment. In this work, we propose both robust benchmark guidelines for transferability scorers, and a well-founded technique to combine multiple scorers, which we show consistently improves their results. We extensively evaluate 13 scorers from literature across 11 datasets, comprising generalist, fine-grained, and medical imaging datasets. We show that few scorers match the predictive performance of the simple raw metric of models on ImageNet, and that all predictors suffer on medical datasets. Our results highlight the potential of combining different information sources for reliably predicting transferability across varied domains.
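The abstract does not spell out the combination technique, but a simple, scale-free way to combine heterogeneous scorers (whose raw outputs live on incomparable scales) is rank aggregation. The sketch below is an illustrative baseline, not necessarily the paper's method:

```python
import numpy as np

def combine_scorers(score_matrix):
    """Aggregate transferability scorers by averaging per-model ranks.

    score_matrix: (n_scorers, n_models), higher score = better model.
    Ranking each scorer's outputs before averaging makes scorers with
    different output scales commensurable.
    """
    # convert each scorer's outputs to within-row ranks (0 = worst)
    ranks = np.argsort(np.argsort(score_matrix, axis=1), axis=1)
    return ranks.mean(axis=0)

# two hypothetical scorers agree that model 0 transfers best:
scores = np.array([[0.9, 0.1, 0.5],
                   [10.0, 3.0, 7.0]])
combined = combine_scorers(scores)
```

Averaging ranks rather than raw scores is one standard way to combine information sources; the paper's "well-founded technique" may weight or calibrate scorers differently.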

[CV-80] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

链接: https://arxiv.org/abs/2405.20413
作者: Haibo Jin,Andy Zhou,Joe D. Menke,Haohan Wang
关键词: Large Language Models, Large Language, bypass protective measures, carefully crafted prompts, Language Models
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success ( \sim 19.88\times ) and lower filtered-out rates ( \sim 1/6\times ) than baselines.

[CV-81] Gradient Inversion of Federated Diffusion Models

链接: https://arxiv.org/abs/2405.20380
作者: Jiyue Huang,Chi Hong,Lydia Y. Chen,Stefanie Roos
关键词: Diffusion models, generate exceptionally high-resolution, defector generative models, generate exceptionally, effective diffusion models
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models are becoming the de facto generative models, generating exceptionally high-resolution image data. Training effective diffusion models requires massive real data, which is privately owned by distributed parties. Each data party can collaboratively train diffusion models in a federated learning manner by sharing gradients instead of the raw data. In this paper, we study the privacy leakage risk of gradient inversion attacks. First, we design a two-phase fusion optimization, GIDM, to leverage the well-trained generative model itself as prior knowledge to constrain the inversion search (latent) space, followed by pixel-wise fine-tuning. GIDM is shown to be able to reconstruct images almost identical to the original ones. Considering a more privacy-preserving training scenario, we then argue that locally initialized private training noise \epsilon and sampling step t may raise additional challenges for the inversion attack. To solve this, we propose a triple-optimization GIDM+ that coordinates the optimization of the unknown data, \epsilon and t . Our extensive evaluation results demonstrate the vulnerability of sharing gradients for data protection of diffusion models: even high-resolution images can be reconstructed with high quality.

[CV-82] Learning 3D Robotics Perception using Inductive Priors

链接: https://arxiv.org/abs/2405.20364
作者: Muhammad Zubair Irshad
关键词: Recent advances, performing digital tasks, unlocking the potential, machine-human conversation, image recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Georgia Tech Ph.D. Thesis, December 2023. For more details: this https URL

点击查看摘要

Abstract:Recent advances in deep learning have led to data-centric intelligence, i.e., artificially intelligent models that can ingest large amounts of data and perform remarkably well at digital tasks such as text-to-image generation, machine-human conversation, and image recognition. This thesis covers the topic of learning with structured inductive bias and priors to design approaches and algorithms unlocking the potential of principle-centric intelligence. Prior knowledge (priors for short), often available in terms of past experience as well as assumptions of how the world works, helps the autonomous agent generalize better and adapt its behavior based on past experience. In this thesis, I demonstrate the use of prior knowledge in three different robotics perception problems: 1. object-centric 3D reconstruction, 2. vision and language for decision-making, and 3. 3D scene understanding. To solve these challenging problems, I propose various sources of prior knowledge, including 1. geometry and appearance priors from synthetic data, 2. modularity and semantic map priors, and 3. semantic, structural, and contextual priors. I study these priors for solving robotics 3D perception tasks and propose ways to efficiently encode them in deep learning models. Some priors are used to warm-start the network for transfer learning, others are used as hard constraints to restrict the action space of robotics agents. While classical techniques are brittle and fail to generalize to unseen scenarios, and data-centric approaches require a large amount of labeled data, this thesis aims to build intelligent agents that require very little real-world data, or data acquired only from simulation, to generalize to highly dynamic and cluttered environments in novel simulations (i.e. sim2sim) or real-world unseen environments (i.e. sim2real) for a holistic scene understanding of the 3D world.

[CV-83] LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

链接: https://arxiv.org/abs/2405.20363
作者: Zhiqiang Wang,Dejia Xu,Rana Muhammad Shahroz Khan,Yanbin Lin,Zhiwen Fan,Xingquan Zhu
关键词: image-understanding applications, Google Street View, critical task, models, Street View
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 3 figures, 5 tables, CVPR 2024 Workshop on Computer Vision in the Wild

点击查看摘要

Abstract:Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images from various countries via Google Street View. Then, we conduct both training-free and training-based evaluations on closed-source and open-source multimodal language models. Our findings indicate that closed-source models demonstrate superior geolocation abilities, while open-source models can achieve comparable performance through fine-tuning.

[CV-84] Enhancing Adversarial Robustness in SNNs with Sparse Gradients

链接: https://arxiv.org/abs/2405.20355
作者: Yujia Liu,Tong Bu,Jianhao Ding,Zecheng Hao,Tiejun Huang,Zhaofei Yu
关键词: Spiking Neural Networks, Artificial Neural Networks, Neural Networks, Spiking Neural, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ICML 2024

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have attracted great attention for their energy-efficient operations and biologically inspired structures, offering potential advantages over Artificial Neural Networks (ANNs) in terms of energy efficiency and interpretability. Nonetheless, similar to ANNs, the robustness of SNNs remains a challenge, especially when facing adversarial attacks. Existing techniques, whether adapted from ANNs or specifically designed for SNNs, exhibit limitations in training SNNs or defending against strong attacks. In this paper, we propose a novel approach to enhance the robustness of SNNs through gradient sparsity regularization. We observe that SNNs exhibit greater resilience to random perturbations compared to adversarial perturbations, even at larger scales. Motivated by this, we aim to narrow the gap between SNNs under adversarial and random perturbations, thereby improving their overall robustness. To achieve this, we theoretically prove that this performance gap is upper bounded by the gradient sparsity of the probability associated with the true label concerning the input image, laying the groundwork for a practical strategy to train robust SNNs by regularizing the gradient sparsity. We validate the effectiveness of our approach through extensive experiments on both image-based and event-based datasets. The results demonstrate notable improvements in the robustness of SNNs. Our work highlights the importance of gradient sparsity in SNNs and its role in enhancing robustness.
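The regularizer can be made concrete on a toy differentiable model: penalize the L1 norm of the gradient of the predicted probability with respect to the input. For a logistic model this gradient has a closed form (the SNN setting uses surrogate gradients, which this toy sketch omits):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient_l1(w, x):
    """L1 norm of d p / d x for a logistic model p = sigmoid(w . x).

    For this model the input gradient is analytic: dp/dx = p * (1 - p) * w.
    The paper regularizes the analogous gradient-sparsity quantity for
    the true-class probability of an SNN; this toy only illustrates the
    form of the penalty term.
    """
    p = sigmoid(w @ x)
    grad = p * (1.0 - p) * w
    return np.abs(grad).sum()

# total training loss would then be:
#   task_loss + lam * input_gradient_l1(w, x)   (lam a hyperparameter)
```

Driving this term down shrinks how much any single input perturbation can move the prediction, which is the intuition behind the robustness bound the abstract describes.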

[CV-85] Predicting ptychography probe positions using single-shot phase retrieval neural network

链接: https://arxiv.org/abs/2405.20910
作者: Ming Du,Tao Zhou,Junjing Deng,Daniel J. Ching,Steven Henke,Mathew J. Cherukara
关键词: including materials science, powerful imaging technique, variety of fields, including materials, materials science
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Ptychography is a powerful imaging technique that is used in a variety of fields, including materials science, biology, and nanotechnology. However, the accuracy of the reconstructed ptychography image is highly dependent on the accuracy of the recorded probe positions, which often contain errors. These errors are typically corrected jointly with phase retrieval through numerical optimization approaches. When the error accumulates along the scan path or when the error magnitude is large, these approaches may not converge to a satisfactory result. We propose a fundamentally new approach to ptychography probe position prediction for data with large position errors, where a neural network is used to perform single-shot phase retrieval on individual diffraction patterns, yielding the object image at each scan point. The pairwise offsets among these images are then found using a robust image registration method, and the results are combined to yield the complete scan path by constructing and solving a linear equation. We show that our method can achieve good position prediction accuracy for data with large and accumulating errors on the order of 10^2 pixels, a magnitude that often makes optimization-based algorithms fail to converge. For ptychography instruments without sophisticated position control equipment such as interferometers, our method is of significant practical potential.
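The final step described above, recovering the full scan path from pairwise image offsets by solving a linear system, can be sketched directly (1-D for brevity; in 2-D the x and y coordinates are solved independently):

```python
import numpy as np

def positions_from_offsets(pairs, offsets, n, anchor=0):
    """Recover n scan positions from pairwise offsets d_ij ~ p_j - p_i
    by least squares.

    pairs:   list of (i, j) index pairs with a measured offset
    offsets: measured p_j - p_i for each pair (from image registration)
    anchor:  index pinned to 0 to remove the global translation ambiguity
    """
    A = np.zeros((len(pairs) + 1, n))
    b = np.zeros(len(pairs) + 1)
    for row, ((i, j), d) in enumerate(zip(pairs, offsets)):
        A[row, i], A[row, j] = -1.0, 1.0   # one equation: p_j - p_i = d
        b[row] = d
    A[-1, anchor] = 1.0                    # gauge constraint: p_anchor = 0
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos
```

With redundant pairs (more measured offsets than unknowns) the least-squares solve averages out registration noise, which is why combining many pairwise offsets is more robust than chaining them sequentially.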

[CV-86] R2-Gaussian: Rectifying Radiative Gaussian Splatting for Tomographic Reconstruction

链接: https://arxiv.org/abs/2405.20693
作者: Ruyi Zha,Tao Jun Lin,Yuanhao Cai,Jiwen Cao,Yanhao Zhang,Hongdong Li
关键词: shown promising results, shown promising, image rendering, rendering and surface, Gaussian splatting
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D Gaussian splatting (3DGS) has shown promising results in image rendering and surface reconstruction. However, its potential in volumetric reconstruction tasks, such as X-ray computed tomography, remains under-explored. This paper introduces R2-Gaussian, the first 3DGS-based framework for sparse-view tomographic reconstruction. By carefully deriving X-ray rasterization functions, we discover a previously unknown integration bias in the standard 3DGS formulation, which hampers accurate volume retrieval. To address this issue, we propose a novel rectification technique via refactoring the projection from 3D to 2D Gaussians. Our new method presents three key innovations: (1) introducing tailored Gaussian kernels, (2) extending rasterization to X-ray imaging, and (3) developing a CUDA-based differentiable voxelizer. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 0.93 dB in PSNR and 0.014 in SSIM. Crucially, it delivers high-quality results in 3 minutes, which is 12x faster than NeRF-based methods and on par with traditional algorithms. The superior performance and rapid convergence of our method highlight its practical value.

[CV-87] Universal evaluation and design of imaging systems using information estimation

链接: https://arxiv.org/abs/2405.20559
作者: Henry Pinkard,Leyla Kabuli,Eric Markley,Tiffany Chien,Jiantao Jiao,Laura Waller
关键词: reliable communication systems, Imaging systems, presence of noise, modern world, describes the transmission
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Information theory, which describes the transmission of signals in the presence of noise, has enabled the development of reliable communication systems that underlie the modern world. Imaging systems can also be viewed as a form of communication, in which information about the object is “transmitted” through images. However, the application of information theory to imaging systems has been limited by the challenges of accounting for their physical constraints. Here, we introduce a framework that addresses these limitations by modeling the probabilistic relationship between objects and their measurements. Using this framework, we develop a method to estimate information using only a dataset of noisy measurements, without making any assumptions about the image formation process. We demonstrate that these estimates comprehensively quantify measurement quality across a diverse range of imaging systems and applications. Furthermore, we introduce Information-Driven Encoder Analysis Learning (IDEAL), a technique to optimize the design of imaging hardware for maximum information capture. This work provides new insights into the fundamental performance limits of imaging systems and offers powerful new tools for their analysis and design.

[CV-88] Can No-Reference Quality-Assessment Methods Serve as Perceptual Losses for Super-Resolution?

链接: https://arxiv.org/abs/2405.20392
作者: Egor Kashkarov,Egor Chistov,Ivan Molodetskikh,Dmitriy Vatolin
关键词: Perceptual losses play, Perceptual losses, role in constructing, images and videos, play an important
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 4 pages, 3 figures. The first two authors contributed equally to this work

点击查看摘要

Abstract:Perceptual losses play an important role in constructing deep-neural-network-based methods by increasing the naturalness and realism of processed images and videos. Use of perceptual losses is often limited to LPIPS, a full-reference method. Even though deep no-reference image-quality-assessment methods are excellent at predicting human judgment, little research has examined their incorporation in loss functions. This paper investigates direct optimization of several video-super-resolution models using no-reference image-quality-assessment methods as perceptual losses. Our experimental results show that straightforward optimization of these methods produces artifacts, but a special training procedure can mitigate them.

机器学习

[LG-0] Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

链接: https://arxiv.org/abs/2405.21070
作者: Xin Wen,Bingchen Zhao,Yilun Chen,Jiangmiao Pang,Xiaojuan Qi
关键词: web-scale vision-language datasets, Severe data imbalance, imbalance naturally exists, Severe data, vision-language datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP’s pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP’s generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: this https URL.

[LG-1] Recurrent neural networks: vanishing and exploding gradients are not the end of the story

链接: https://arxiv.org/abs/2405.21064
作者: Nicolas Zucchet,Antonio Orvieto
关键词: Recurrent neural networks, learn long-term memories, Recurrent neural, notoriously struggle, long-term memories
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories, primarily due to vanishing and exploding gradients. The recent success of state-space models (SSMs), a subclass of RNNs, to overcome such difficulties challenges our theoretical understanding. In this paper, we delve into the optimization challenges of RNNs and discover that, as the memory of a network increases, changes in its parameters result in increasingly large output variations, making gradient-based learning highly sensitive, even without exploding gradients. Our analysis further reveals the importance of the element-wise recurrence design pattern combined with careful parametrizations in mitigating this effect. This feature is present in SSMs, as well as in other architectures, such as LSTMs. Overall, our insights provide a new explanation for some of the difficulties in gradient-based learning of RNNs and why some architectures perform better than others.
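The paper's core observation — longer memory makes outputs increasingly sensitive to parameter changes even when gradients through time do not explode — can be reproduced on a one-neuron linear recurrence h_t = a·h_{t-1} + x_t (a toy illustration, not the paper's setup):

```python
def hidden_state(a: float, xs) -> float:
    """Run the scalar linear recurrence h_t = a * h_{t-1} + x_t."""
    h = 0.0
    for x in xs:
        h = a * h + x
    return h

def sensitivity(a: float, xs, eps: float = 1e-6) -> float:
    """Finite-difference derivative of the final state w.r.t. the recurrent weight a."""
    return (hidden_state(a + eps, xs) - hidden_state(a - eps, xs)) / (2 * eps)

xs = [1.0] * 100  # constant input, 100 steps
# As a -> 1 (longer memory), |dh_T/da| grows sharply even though a < 1,
# so backpropagated gradients through time never explode.
for a in (0.5, 0.9, 0.99):
    print(a, sensitivity(a, xs))
```

Analytically the derivative is Σ_k k·a^(k-1), which blows up as a approaches 1 — exactly the sensitivity the paper identifies as a distinct obstacle from vanishing/exploding gradients.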

[LG-2] Neural Network Verification with Branch-and-Bound for General Nonlinearities

链接: https://arxiv.org/abs/2405.21063
作者: Zhouxing Shi,Qirui Jin,Zico Kolter,Suman Jana,Cho-Jui Hsieh,Huan Zhang
关键词: effective methods, Optimal Power Flow, verification, networks, general
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Branch-and-bound (BaB) is among the most effective methods for neural network (NN) verification. However, existing works on BaB have mostly focused on NNs with piecewise linear activations, especially ReLU networks. In this paper, we develop a general framework, named GenBaB, to conduct BaB for general nonlinearities in general computational graphs based on linear bound propagation. To decide which neuron to branch, we design a new branching heuristic which leverages linear bounds as shortcuts to efficiently estimate the potential improvement after branching. To decide nontrivial branching points for general nonlinear functions, we propose to optimize branching points offline, which can be efficiently leveraged during verification with a lookup table. We demonstrate the effectiveness of our GenBaB on verifying a wide range of NNs, including networks with activation functions such as Sigmoid, Tanh, Sine and GeLU, as well as networks involving multi-dimensional nonlinear operations such as multiplications in LSTMs and Vision Transformers. Our framework also allows the verification of general nonlinear computation graphs and enables verification applications beyond simple neural networks, particularly for AC Optimal Power Flow (ACOPF). GenBaB is part of the latest α,β-CROWN, the winner of the 4th International Verification of Neural Networks Competition (VNN-COMP 2023).
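GenBaB itself relies on linear bound propagation with optimized branching points; a much simpler primitive from the same verification family — interval bound propagation through an affine layer followed by a monotone nonlinearity such as tanh — can be sketched as follows (an illustrative sketch, not the paper's algorithm):

```python
import math

def affine_interval(lo, hi, w, b):
    """Bound w*x + b over x in [lo, hi] for a scalar weight (w may be negative)."""
    a, c = w * lo + b, w * hi + b
    return (min(a, c), max(a, c))

def tanh_interval(lo, hi):
    """tanh is monotone increasing, so interval endpoints map through directly."""
    return (math.tanh(lo), math.tanh(hi))

# Soundness check on a tiny scalar network: x -> tanh(w1*x + b1) -> w2*h + b2
w1, b1, w2, b2 = -2.0, 0.5, 3.0, -1.0
lo, hi = -1.0, 1.0
l1 = affine_interval(lo, hi, w1, b1)
l2 = tanh_interval(*l1)
out_lo, out_hi = affine_interval(l2[0], l2[1], w2, b2)

for i in range(101):  # every sampled output must lie inside the certified bounds
    x = -1.0 + 0.02 * i
    y = w2 * math.tanh(w1 * x + b1) + b2
    assert out_lo - 1e-9 <= y <= out_hi + 1e-9
print(out_lo, out_hi)
```

Branch-and-bound tightens such bounds by splitting the input interval and bounding each piece separately; GenBaB generalizes the choice of where to split for non-piecewise-linear activations.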

[LG-3] Graph External Attention Enhanced Transformer

链接: https://arxiv.org/abs/2405.21061
作者: Jianqing Liang,Min Chen,Jiye Liang
关键词: Graph Neural Networks, Neural Networks, recently gained considerable, gained considerable attention, Graph Neural
类目: Machine Learning (cs.LG)
*备注: In Proceedings of ICML 2024

点击查看摘要

Abstract:The Transformer architecture has recently gained considerable attention in the field of graph representation learning, as it naturally overcomes several limitations of Graph Neural Networks (GNNs) with customized attention mechanisms or positional and structural encodings. Despite making some progress, existing works tend to overlook external information of graphs, specifically the correlation between graphs. Intuitively, graphs with similar structures should have similar representations. Therefore, we propose Graph External Attention (GEA) – a novel attention mechanism that leverages multiple external node/edge key-value units to capture inter-graph correlations implicitly. On this basis, we design an effective architecture called Graph External Attention Enhanced Transformer (GEAET), which integrates local structure and global interaction information for more comprehensive graph representations. Extensive experiments on benchmark datasets demonstrate that GEAET achieves state-of-the-art empirical performance. The source code is available for reproducibility at: this https URL.
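Setting GEA's graph-specific design aside, the underlying external-attention pattern — attending to a small set of learned external key/value units shared across inputs, rather than to the input itself — can be sketched in a few lines (a generic illustration; names and shapes are assumptions, not the paper's code):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def external_attention(x, keys, values):
    """x: n x d tokens; keys/values: m x d external units shared across all inputs."""
    out = []
    for token in x:
        scores = softmax([sum(t * k for t, k in zip(token, key)) for key in keys])
        out.append([sum(s * v[j] for s, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]               # two tokens
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 external key units
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = external_attention(x, keys, values)
print(y)  # each row is a convex combination of the external value units
```

Because the key/value units are shared across graphs rather than computed per input, they can capture inter-sample (here, inter-graph) correlations — the intuition GEA builds on.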

[LG-4] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

链接: https://arxiv.org/abs/2405.21060
作者: Tri Dao,Albert Gu
关键词: deep learning success, medium scale, deep learning, learning success, recently been shown
类目: Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:While Transformers have been the main architecture behind deep learning’s success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
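The paper's central duality — an SSM recurrence equals multiplication by a structured (semiseparable) lower-triangular matrix, i.e. a masked attention-like operator — can be checked numerically in the scalar case (a toy check, not the Mamba-2 implementation): y_t = Σ_{s≤t} a^(t−s)·x_s computed either as a recurrence or as a matrix-vector product.

```python
def ssm_recurrence(a, xs):
    """Sequential (SSM) form: h_t = a * h_{t-1} + x_t."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + x
        out.append(h)
    return out

def ssm_matrix(a, xs):
    """'Attention' form: y = M x with lower-triangular M[t][s] = a**(t-s)."""
    n = len(xs)
    return [sum((a ** (t - s)) * xs[s] for s in range(t + 1)) for t in range(n)]

a, xs = 0.5, [1.0, 2.0, 3.0]
print(ssm_recurrence(a, xs))  # [1.0, 2.5, 4.25]
print(ssm_matrix(a, xs))      # identical: the two forms compute the same map
```

The recurrence costs O(n) while the naive matrix form costs O(n²); exploiting the matrix's semiseparable structure is what lets SSD derive algorithms that interpolate between the two.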

[LG-5] Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

链接: https://arxiv.org/abs/2405.21050
作者: Xinxi Zhang,Song Wen,Ligong Han,Felix Juefei-Xu,Akash Srivastava,Junzhou Huang,Hao Wang,Molei Tao,Dimitris N. Metaxas
关键词: Adapting large-scale pre-trained, Adapting large-scale, large-scale pre-trained generative, gaining traction, large-scale pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adapting large-scale pre-trained generative models in a parameter-efficient manner is gaining traction. Traditional methods like low rank adaptation achieve parameter efficiency by imposing constraints but may not be optimal for tasks requiring high representation capacity. We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. Using the Kronecker product and efficient Stiefel optimizers, we achieve parameter-efficient adaptation of orthogonal matrices. We introduce Spectral Orthogonal Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity. Extensive evaluations on text-to-image diffusion models demonstrate SODA’s effectiveness, offering a spectrum-aware alternative to existing fine-tuning methods.

[LG-6] Grammar-Aligned Decoding

链接: https://arxiv.org/abs/2405.21047
作者: Kanghee Park,Jiayu Wang,Taylor Berg-Kirkpatrick,Nadia Polikarpova,Loris D’Antoni
关键词: Large Language Models, Large Language, Language Models, reliably generating highly, LLM distribution
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM’s output must follow a given grammar. In this paper we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM’s distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM’s distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
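The distortion the paper describes is visible on a two-token toy model: greedy per-step masking (GCD) renormalizes locally, which need not match the LLM's distribution conditioned on grammaticality — the target GAD aims for. A toy illustration (the numbers are made up, not from the paper):

```python
# Toy "LLM": length-2 strings over {a, b}, defined by per-step conditionals.
P_FIRST = {"a": 0.5, "b": 0.5}
P_SECOND = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
GRAMMATICAL = {"ab", "ba", "bb"}  # the grammar forbids "aa"

def llm_prob(s):
    return P_FIRST[s[0]] * P_SECOND[s[0]][s[1]]

# GAD target: the LLM distribution conditioned on grammaticality.
z = sum(llm_prob(s) for s in GRAMMATICAL)
true_cond = {s: llm_prob(s) / z for s in GRAMMATICAL}

def gcd_prob(s):
    """GCD: at each step, mask tokens with no grammatical completion, renormalize."""
    # Step 1: both 'a' and 'b' can still reach a grammatical string -> no masking.
    p = P_FIRST[s[0]]
    # Step 2: renormalize over continuations allowed after the prefix.
    allowed = [t for t in "ab" if s[0] + t in GRAMMATICAL]
    z2 = sum(P_SECOND[s[0]][t] for t in allowed)
    return p * P_SECOND[s[0]][s[1]] / z2

gcd = {s: gcd_prob(s) for s in GRAMMATICAL}
print(true_cond["ab"], gcd["ab"])  # ~0.091 vs 0.5: GCD badly over-samples "ab"
```

After choosing 'a', GCD is forced onto the low-probability continuation 'b' and renormalizes it to probability 1, so "ab" is sampled five times more often than the LLM's conditional distribution warrants — precisely the misalignment ASAp is designed to correct.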

[LG-7] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

链接: https://arxiv.org/abs/2405.21046
作者: Tengyang Xie,Dylan J. Foster,Akshay Krishnamurthy,Corby Rosset,Ahmed Awadallah,Alexander Rakhlin
关键词: Exploratory Preference Optimization, Direct Preference Optimization, language model alignment, Preference Optimization, Reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical – a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) – yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of Q^\star -approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.
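XPO is described as a one-line change to DPO; the standard DPO objective it augments can be written down directly. The sketch below shows only the well-known DPO loss — the exact form of XPO's exploration bonus is specified in the paper and is not reproduced here:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference does,
# the loss drops below log(2), its value at zero margin.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # margin = 2, loss < log(2)
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))  # margin = 0, loss = log(2) ≈ 0.693
# XPO adds an exploration-bonus term on sampled responses' log-probabilities
# to this objective (see the paper for the exact expression and weighting).
```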

[LG-8] An Attention-Based Multi-Context Convolutional Encoder-Decoder Neural Network for Work Zone Traffic Impact Prediction

链接: https://arxiv.org/abs/2405.21045
作者: Qinhua Jiang,Xishun Liao,Yaofa Gong,Jiaqi Ma
关键词: Work zone, work zone events, Work, work zone traffic, traffic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Work zone is one of the major causes of non-recurrent traffic congestion and road incidents. Despite the significance of its impact, studies on predicting the traffic impact of work zones remain scarce. In this paper, we propose a data integration pipeline that enhances the utilization of work zone and traffic data from diversified platforms, and introduce a novel deep learning model to predict the traffic speed and incident likelihood during planned work zone events. The proposed model transforms traffic patterns into 2D space-time images for both model input and output and employs an attention-based multi-context convolutional encoder-decoder architecture to capture the spatial-temporal dependencies between work zone events and traffic variations. Trained and validated on four years of archived work zone traffic data from Maryland, USA, the model demonstrates superior performance over baseline models in predicting traffic speed, incident likelihood, and inferred traffic attributes such as queue length and congestion timings (i.e., start time and duration). Specifically, the proposed model outperforms the baseline models by reducing the prediction error of traffic speed by 5% to 34%, queue length by 11% to 29%, congestion timing by 6% to 17%, and increasing the accuracy of incident predictions by 5% to 7%. Consequently, this model offers substantial promise for enhancing the planning and traffic management of work zones.

[LG-9] Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

链接: https://arxiv.org/abs/2405.21043
作者: Fengdi Che,Chenjun Xiao,Jincheng Mei,Bo Dai,Ramki Gummadi,Oscar A Ramirez,Christopher K Harris,A. Rupam Mahmood,Dale Schuurmans
关键词: linear function approximation, function approximation establishes, over-parameterized linear function, off-policy data, linear function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird’s counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
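The mechanism the paper analyzes — bootstrapped value estimation against a periodically synced target network — can be seen in a minimal tabular TD(0) loop on a two-state episodic chain (an illustration of the mechanism only, not the paper's over-parameterized linear setting):

```python
GAMMA = 0.9
# Episodic chain: s0 --r=0--> s1 --r=1--> terminal. True values: V(s1)=1, V(s0)=0.9.
TRANSITIONS = [(0, 0.0, 1), (1, 1.0, None)]  # (state, reward, next_state)

V = [0.0, 0.0]
target_V = list(V)  # frozen copy used to form bootstrap targets

for sweep in range(200):
    if sweep % 10 == 0:
        target_V = list(V)  # periodic target-network sync
    for s, r, s_next in TRANSITIONS:
        bootstrap = 0.0 if s_next is None else GAMMA * target_V[s_next]
        V[s] += 0.5 * (r + bootstrap - V[s])

print(V)  # approximately [0.9, 1.0]
```

Freezing the bootstrap target between syncs turns each phase into a stable regression problem; the paper's result is that, combined with over-parameterization, this yields convergence guarantees even with off-policy data.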

[LG-10] Comparing information content of representation spaces for disentanglement with VAE ensembles

链接: https://arxiv.org/abs/2405.21042
作者: Kieran A. Murphy,Sam Dillavou,Dani S. Bassett
关键词: representation subspaces, divide information, information, representation, fragments
类目: Machine Learning (cs.LG)
*备注: Code: this https URL

点击查看摘要

Abstract:Disentanglement is the endeavour to use machine learning to divide information about a dataset into meaningful fragments. In practice these fragments are representation (sub)spaces, often the set of channels in the latent space of a variational autoencoder (VAE). Assessments of disentanglement predominantly employ metrics that are coarse-grained at the model level, but this approach can obscure much about the process of information fragmentation. Here we propose to study the learned channels in aggregate, as the fragments of information learned by an ensemble of repeat training runs. Additionally, we depart from prior work where measures of similarity between individual subspaces neglected the nature of data embeddings as probability distributions. Instead, we view representation subspaces as communication channels that perform a soft clustering of the data; consequently, we generalize two classic information-theoretic measures of similarity between clustering assignments to compare representation spaces. We develop a lightweight method of estimation based on fingerprinting representation subspaces by their ability to distinguish dataset samples, allowing us to identify, analyze, and leverage meaningful structure in ensembles of VAEs trained on synthetic and natural datasets. Using this fully unsupervised pipeline we identify “hotspots” in the space of information fragments: groups of nearly identical representation subspaces that appear repeatedly in an ensemble of VAEs, particularly as regularization is increased. Finally, we leverage the proposed methodology to achieve ensemble learning with VAEs, boosting the information content of a set of weak learners – a capability not possible with previous methods of assessing channel similarity.
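The classic information-theoretic similarity between two clustering assignments that the paper generalizes to soft clusterings is, in its standard hard-assignment form, normalized mutual information, computable directly from co-occurrence counts (textbook version, not the paper's fingerprinting estimator):

```python
import math
from collections import Counter

def nmi(labels_u, labels_v):
    """Normalized mutual information between two hard clusterings of the same items."""
    n = len(labels_u)
    pu = Counter(labels_u)
    pv = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))
    mi = sum((c / n) * math.log((c / n) / ((pu[u] / n) * (pv[v] / n)))
             for (u, v), c in joint.items())
    hu = -sum((c / n) * math.log(c / n) for c in pu.values())
    hv = -sum((c / n) * math.log(c / n) for c in pv.values())
    return mi / math.sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # ≈ 1: identical up to relabeling
print(nmi([0, 1, 0, 1], [0, 0, 1, 1]))  # ≈ 0: independent partitions
```

Viewing VAE latent subspaces as soft clusterings of the data, as the paper does, amounts to replacing these hard co-occurrence counts with the channels' assignment probabilities.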

[LG-11] A-PETE: Adaptive Prototype Explanations of Tree Ensembles

链接: https://arxiv.org/abs/2405.21036
作者: Jacek Karolczak,Jerzy Stefanowski
关键词: machine learning models, tree ensembles, Adaptive Prototype Explanations, interpreting machine learning, context of tree
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The need for interpreting machine learning models is addressed through prototype explanations within the context of tree ensembles. An algorithm named Adaptive Prototype Explanations of Tree Ensembles (A-PETE) is proposed to automatise the selection of prototypes for these classifiers. Its unique characteristics are a specialised distance measure and a modified k-medoid approach. Experiments demonstrated its competitive predictive accuracy with respect to earlier explanation algorithms. It also provides a sufficient number of prototypes for interpreting the random forest classifier.
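A-PETE couples a specialised distance with a modified k-medoid search; the plain k-medoid idea it builds on — choose the k training examples minimizing total distance to their nearest prototype — can be sketched exhaustively for tiny data (illustrative only; the paper's distance measure and search procedure differ):

```python
from itertools import combinations

def k_medoids_exhaustive(points, k):
    """Brute-force k-medoids on 1-D points: feasible only for tiny inputs."""
    def cost(medoids):
        return sum(min(abs(p - m) for m in medoids) for p in points)
    return min(combinations(points, k), key=cost)

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
print(sorted(k_medoids_exhaustive(data, 2)))  # [1.0, 11.0]: one prototype per cluster
```

Unlike k-means centroids, the selected medoids are actual training examples, which is what makes them usable as human-inspectable prototype explanations.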

[LG-12] Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles

链接: https://arxiv.org/abs/2405.21027
作者: Jiesong Lian,Yucong Huang,Mingzhi Wang,Chengdong Ma,Yixue Hao,Ying Wen,Yaodong Yang
关键词: Nash Equilibrium, Space Response Oracle, Policy Space Response, solving zero-sum games, policies
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents’ mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liar’s Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
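The policy-fusion step — model averaging over base policies selected from the meta-NE — reduces to a weighted average of parameter dictionaries (a schematic; the parameter names and weights below are illustrative, not from the paper):

```python
def fuse_policies(param_dicts, weights):
    """Weighted model averaging: one fused parameter dict from several base policies."""
    total = sum(weights)
    fused = {}
    for name in param_dicts[0]:
        fused[name] = sum(w * d[name] for w, d in zip(weights, param_dicts)) / total
    return fused

# Two base policies, weighted e.g. by their probabilities under the meta-NE.
policy_a = {"w1": 1.0, "b1": 0.0}
policy_b = {"w1": 3.0, "b1": 2.0}
print(fuse_policies([policy_a, policy_b], weights=[0.75, 0.25]))
# {'w1': 1.5, 'b1': 0.5}
```

The fused dict serves only as the *initialization* of the next BR training run, which is why the trick plugs into existing PSRO variants without extra training overhead.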

[LG-13] Beyond Conventional Parametric Modeling: Data-Driven Framework for Estimation and Prediction of Time Activity Curves in Dynamic PET Imaging

链接: https://arxiv.org/abs/2405.21021
作者: Niloufar Zakariaei,Arman Rahmim,Eldad Haber
关键词: Positron Emission Tomography, Dynamic Positron Emission, Emission Tomography, Positron Emission, Time-Activity Curve
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Dynamic Positron Emission Tomography (dPET) imaging and Time-Activity Curve (TAC) analyses are essential for understanding and quantifying the biodistribution of radiopharmaceuticals over time and space. Traditional compartmental modeling, while foundational, commonly struggles to fully capture the complexities of biological systems, including non-linear dynamics and variability. This study introduces an innovative data-driven neural network-based framework, inspired by Reaction Diffusion systems, designed to address these limitations. Our approach, which adaptively fits TACs from dPET, enables the direct calibration of diffusion coefficients and reaction terms from observed data, offering significant improvements in predictive accuracy and robustness over traditional methods, especially in complex biological scenarios. By more accurately modeling the spatio-temporal dynamics of radiopharmaceuticals, our method advances modeling of pharmacokinetic and pharmacodynamic processes, enabling new possibilities in quantitative nuclear medicine.
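A minimal explicit-Euler step of the reaction–diffusion dynamics that inspire the framework (1-D, periodic boundary, generic reaction term — a numerical sketch, not the paper's learned model):

```python
def reaction_diffusion_step(u, D, dt, dx, reaction):
    """One explicit Euler step of u_t = D * u_xx + R(u) on a periodic 1-D grid."""
    n = len(u)
    return [u[i] + dt * (D * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n]) / dx**2
                         + reaction(u[i]))
            for i in range(n)]

u = [0.0, 0.0, 1.0, 0.0, 0.0]  # initial bolus of tracer activity
for _ in range(50):
    u = reaction_diffusion_step(u, D=0.1, dt=0.1, dx=1.0, reaction=lambda c: 0.0)
print(u)  # pure diffusion spreads the bolus; total activity is conserved
```

In the paper's framework, the diffusion coefficient and the reaction term are not fixed as here but calibrated directly from the observed time-activity curves.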

[LG-14] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

链接: https://arxiv.org/abs/2405.21018
作者: Xiaojun Jia,Tianyu Pang,Chao Du,Yihao Huang,Jindong Gu,Yang Liu,Xiaochun Cao,Min Lin
关键词: Large language models, Greedy Coordinate Gradient, Large language, language models, rapidly developed
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack’s success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of “Sure” largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialisation. Then, we combine these improved technologies to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at this https URL.

[LG-15] G-Transformer for Conditional Average Potential Outcome Estimation over Time

链接: https://arxiv.org/abs/2405.21012
作者: Konstantin Hess,Dennis Frauen,Valentyn Melnychuk,Stefan Feuerriegel
关键词: Estimating potential outcomes, Estimating potential, based on observational, observational data, data is important
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task suffer from either (a) bias or (b) large variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model designed for unbiased, low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.

[LG-16] Explaining Predictions by Characteristic Rules

链接: https://arxiv.org/abs/2405.21003
作者: Amr Alkhatib,Henrik Boström,Michalis Vazirgiannis
关键词: Characteristic Explanatory General, CEGA, rules, Explanatory General Association, Anchors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022

点击查看摘要

Abstract:Characteristic rules have been advocated for their ability to improve interpretability over discriminative rules within the area of rule learning. However, the former type of rule has not yet been used by techniques for explaining predictions. A novel explanation technique, called CEGA (Characteristic Explanatory General Association rules), is proposed, which employs association rule mining to aggregate multiple explanations generated by any standard local explanation technique into a set of characteristic rules. An empirical investigation is presented, in which CEGA is compared to two state-of-the-art methods, Anchors and GLocalX, for producing local and aggregated explanations in the form of discriminative rules. The results suggest that the proposed approach provides a better trade-off between fidelity and complexity compared to the two state-of-the-art approaches; CEGA and Anchors significantly outperform GLocalX with respect to fidelity, while CEGA and GLocalX significantly outperform Anchors with respect to the number of generated rules. The effect of changing the format of the explanations of CEGA to discriminative rules and using LIME and SHAP as local explanation techniques instead of Anchors are also investigated. The results show that the characteristic explanatory rules still compete favorably with rules in the standard discriminative format. The results also indicate that using CEGA in combination with either SHAP or Anchors consistently leads to a higher fidelity compared to using LIME as the local explanation technique.
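CEGA's aggregation step rests on association-rule mining over many local explanations; the core support-counting idea can be sketched as follows (heavily simplified: one-condition rules and a plain support threshold, whereas CEGA mines full characteristic rules — all names and data below are illustrative):

```python
from collections import Counter

def frequent_rules(explanations, labels, min_support=0.5):
    """Keep (condition -> class) pairs that co-occur often across local explanations.

    explanations: list of sets of feature conditions, one per explained instance.
    labels: the class predicted for each instance.
    """
    n = len(explanations)
    counts = Counter((cond, lab) for conds, lab in zip(explanations, labels)
                     for cond in conds)
    return {rule for rule, c in counts.items() if c / n >= min_support}

explanations = [{"age>30", "income=high"}, {"age>30"}, {"income=low"}, {"age>30"}]
labels = ["approve", "approve", "reject", "approve"]
print(frequent_rules(explanations, labels, min_support=0.5))
# {('age>30', 'approve')}: the condition characteristic of the 'approve' class
```

Reading the surviving pairs as "class → condition" rather than "condition → class" is what makes the mined rules characteristic instead of discriminative, which is the paper's central design choice.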

[LG-17] Information limits and Thouless-Anderson-Palmer equations for spiked matrix models with structured noise

链接: https://arxiv.org/abs/2405.20993
作者: Jean Barbier,Francesco Camilli,Marco Mondelli,Yizhou Xu
关键词: problem of Bayesian, Bayesian inference, prototypical problem, low-rank signal, signal is corrupted
类目: Information Theory (cs.IT); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider a prototypical problem of Bayesian inference for a structured spiked model: a low-rank signal is corrupted by additive noise. While both information-theoretic and algorithmic limits are well understood when the noise is i.i.d. Gaussian, the more realistic case of structured noise still proves to be challenging. To capture the structure while maintaining mathematical tractability, a line of work has focused on rotationally invariant noise. However, existing studies either provide sub-optimal algorithms or they are limited to a special class of noise ensembles. In this paper, we establish the first characterization of the information-theoretic limits for a noise matrix drawn from a general trace ensemble. These limits are then achieved by an efficient algorithm inspired by the theory of adaptive Thouless-Anderson-Palmer (TAP) equations. Our approach leverages tools from statistical physics (replica method) and random matrix theory (generalized spherical integrals), and it unveils the equivalence between the rotationally invariant model and a surrogate Gaussian model.
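For concreteness, a rank-one spiked matrix model is commonly written as (standard notation for this model class, not copied from the paper):

```latex
Y \;=\; \sqrt{\frac{\lambda}{N}}\,\mathbf{x}\mathbf{x}^{\top} \;+\; Z,
\qquad \mathbf{x} \in \mathbb{R}^{N},
```

with λ the signal-to-noise ratio and, in the structured setting studied here, Z drawn from a rotationally invariant (trace) ensemble rather than having i.i.d. Gaussian entries.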

[LG-18] Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models

链接: https://arxiv.org/abs/2405.20991
作者: Yi Yang,Qingwen Zhang,Kei Ikemura,Nazre Batool,John Folkesson
关键词: extreme weather conditions, presents significant challenges, anomalous road users, Addressing hard cases, complex traffic interactions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: IEEE Intelligent Vehicles Symposium (IV) 2024

点击查看摘要

Abstract:Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, it is crucial to detect and manage these scenarios effectively for autonomous driving systems. However, the rarity and high-risk nature of these cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs) have shown remarkable zero-shot capabilities, as they are trained on extensive datasets. This work explores the potential of VLMs in detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4v in detecting hard cases in traffic participant motion prediction on both agent and scenario levels. We introduce a feasible pipeline where VLMs, fed with sequential image frames with designed prompts, effectively identify challenging agents or scenarios, which are verified by existing prediction models. Moreover, by taking advantage of this detection of hard cases by VLMs, we further improve the training efficiency of the existing motion prediction pipeline by performing data selection for the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline incorporating VLMs with state-of-the-art methods on NuScenes datasets. The code is accessible at this https URL.

[LG-19] Locking Machine Learning Models into Hardware

链接: https://arxiv.org/abs/2405.20990
作者: Eleanor Clifford,Adhithya Saravanan,Harry Langford,Cheng Zhang,Yiren Zhao,Robert Mullins,Ilia Shumailov,Jamie Hayes
关键词: Modern Machine Learning, Modern Machine, Machine Learning models, business competitiveness, competitiveness often depends
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 2 figures of main text; 14 pages, 16 figures of appendices

点击查看摘要

Abstract:Modern Machine Learning models are expensive IP and business competitiveness often depends on keeping this IP confidential. This in turn restricts how these models are deployed; for example, it is unclear how to deploy a model on-device without inevitably leaking the underlying model. At the same time, confidential computing technologies such as Multi-Party Computation or Homomorphic Encryption remain impractical for wide adoption. In this paper we take a different approach and investigate the feasibility of ML-specific mechanisms that deter unauthorized model use by restricting the model to only be usable on specific hardware, making adoption on unauthorized hardware inconvenient. That way, even if IP is compromised, it cannot be trivially used without specialised hardware or major model adjustment. In a sense, we seek to enable cheap locking of machine learning models into specific hardware. We demonstrate that locking mechanisms are feasible by either targeting the efficiency of model representations, such as making models incompatible with quantisation, or by tying the model’s operation to specific characteristics of hardware, such as the number of cycles for arithmetic operations. We demonstrate that locking comes with negligible work and latency overheads, while significantly restricting usability of the resultant model on unauthorized hardware.

[LG-20] Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging

链接: https://arxiv.org/abs/2405.20988
作者: Michail Theologitis,Georgios Frangias,Georgios Anestis,Vasilis Samoladas,Antonios Deligiannakis
关键词: distributed deep learning, distributed deep, paradigm for training, ever-growing volume, volume and decentralized
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Driven by the ever-growing volume and decentralized nature of data, coupled with the escalating size of modern models, distributed deep learning (DDL) has become entrenched as the preferred paradigm for training. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, by relying on overly simplistic, periodic, and rigid synchronization schedules, DDL algorithms typically waste valuable bandwidth and become less practical in bandwidth-constrained federated settings. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Remarkably, FDA achieves this without sacrificing convergence speed, in stark contrast to the trade-offs encountered in the field. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
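FDA's core idea of synchronizing only when the model variance across clients grows too large can be sketched as follows. This is an illustrative reconstruction from the abstract, not the authors' code: the variance estimate (mean squared deviation of client models from their average) and the threshold are assumptions.

```python
import numpy as np

def should_sync(client_models, threshold):
    """Trigger synchronization when the variance of client models
    (mean squared deviation from the average model) exceeds a threshold.
    A simplified stand-in for FDA's dynamic trigger."""
    stacked = np.stack(client_models)          # (num_clients, num_params)
    avg = stacked.mean(axis=0)
    variance = np.mean(np.sum((stacked - avg) ** 2, axis=1))
    return variance > threshold, avg

# Two nearly identical client models: no synchronization needed.
close = [np.zeros(4), np.full(4, 1e-3)]
sync, _ = should_sync(close, threshold=0.1)
print(bool(sync))  # False

# Two divergent client models: sync triggered, clients adopt the average.
far = [np.zeros(4), np.ones(4) * 2.0]
sync, avg = should_sync(far, threshold=0.1)
print(bool(sync))  # True
```

Between triggers, clients train locally and only exchange a cheap variance estimate, which is where the claimed bandwidth savings come from.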

[LG-21] Early Stopping Criteria for Training Generative Adversarial Networks in Biomedical Imaging

链接: https://arxiv.org/abs/2405.20987
作者: Muhammad Muneeb Saad,Mubashir Husain Rehmani,Ruairi O’Reilly
关键词: Generative Adversarial Networks, Generative Adversarial, Adversarial Networks, computational cost, training
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: This paper is accepted at the 35th IEEE Irish Signals and Systems Conference (ISSC 2024)

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have high computational costs to train their complex architectures. Throughout the training process, GANs’ output is analyzed qualitatively based on the loss and the diversity and quality of synthetic images. Based on this qualitative analysis, training is manually halted once the desired synthetic images are generated. By utilizing an early stopping criterion, the computational cost and dependence on manual oversight can be reduced, yet such criteria are affected by training problems such as mode collapse, non-convergence, and instability. This is particularly prevalent in biomedical imagery, where training problems degrade the diversity and quality of synthetic images, and the high computational cost associated with training makes complex architectures increasingly inaccessible. This work proposes novel early stopping criteria to quantitatively detect training problems, halt training, and reduce the computational costs associated with synthesizing biomedical images. Firstly, the range of generator and discriminator loss values is investigated to assess whether mode collapse, non-convergence, and instability occur sequentially, concurrently, or interchangeably throughout the training of GANs. Secondly, utilizing these occurrences in conjunction with the Mean Structural Similarity Index (MS-SSIM) and Fréchet Inception Distance (FID) scores of synthetic images forms the basis of the proposed early stopping criteria. This work helps identify the occurrence of training problems in GANs at low computational cost and reduces the training time needed to generate diversified and high-quality synthetic images.
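A stopping rule of the kind described above, combining a quality-score plateau with a diversity check, can be sketched like this. The thresholds, the window size, and the use of the latest MS-SSIM value are illustrative assumptions; the paper's exact criteria also incorporate loss-range analysis.

```python
def early_stop(fid_history, msssim_history, window=3, fid_tol=1.0, msssim_max=0.6):
    """Toy early-stopping rule in the spirit of the paper: halt when FID has
    plateaued over the last `window` epochs OR synthetic-image diversity has
    degraded (high MS-SSIM between samples suggests mode collapse).
    Thresholds here are illustrative, not the authors' values."""
    if len(fid_history) < window:
        return False
    recent = fid_history[-window:]
    fid_plateau = max(recent) - min(recent) < fid_tol
    low_diversity = msssim_history[-1] > msssim_max
    return fid_plateau or low_diversity

print(early_stop([120, 80, 50], [0.3, 0.3, 0.3]))       # False: FID still improving
print(early_stop([50, 49.5, 49.8], [0.3, 0.3, 0.35]))   # True: FID plateau
print(early_stop([120, 80, 50], [0.3, 0.4, 0.7]))       # True: diversity collapsed
```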

[LG-22] Uncertainty Quantification for Bird’s Eye View Semantic Segmentation: Methods and Benchmarks

链接: https://arxiv.org/abs/2405.20986
作者: Linlin Yu,Bowen Yang,Tianhao Wang,Kangshuo Li,Feng Chen
关键词: Bird Eye View, Eye View, Bird Eye, create a Bird, representation is crucial
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The fusion of raw features from multiple sensors on an autonomous vehicle to create a Bird’s Eye View (BEV) representation is crucial for planning and control systems. There is growing interest in using deep learning models for BEV semantic segmentation. Anticipating segmentation errors and improving the explainability of DNNs is essential for autonomous driving, yet it is under-studied. This paper introduces a benchmark for predictive uncertainty quantification in BEV segmentation. The benchmark assesses various approaches across three popular datasets using two representative backbones and focuses on the effectiveness of predicted uncertainty in identifying misclassified and out-of-distribution (OOD) pixels, as well as calibration. Empirical findings highlight the challenges in uncertainty quantification. We find that evidential deep learning based approaches show the most promise by efficiently quantifying aleatoric and epistemic uncertainty. We propose the Uncertainty-Focal-Cross-Entropy (UFCE) loss, designed for highly imbalanced data, which consistently improves the segmentation quality and calibration. Additionally, we introduce a vacuity-scaled regularization term that enhances the model’s focus on high uncertainty pixels, improving epistemic uncertainty quantification.
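The abstract does not give the exact form of the UFCE loss, but it belongs to the focal-loss family for imbalanced data, whose core mechanism can be sketched as follows. The `(1 - p_t)^gamma` down-weighting factor and the value of `gamma` are generic focal-loss conventions, not the authors' specific formulation.

```python
import numpy as np

def focal_cross_entropy(probs, targets, gamma=2.0):
    """Focal-style cross-entropy for imbalanced data: the (1 - p_t)^gamma
    factor down-weights easy, confident pixels so training focuses on hard
    ones. Illustrative of the loss family UFCE belongs to, not exact UFCE."""
    p_t = probs[np.arange(len(targets)), targets]   # prob of the true class
    return np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t))

probs = np.array([[0.9, 0.1],    # easy, confidently correct pixel
                  [0.4, 0.6]])   # harder pixel
targets = np.array([0, 1])
print(focal_cross_entropy(probs, targets))
```

Note how the confidently correct first example contributes almost nothing to the loss, while the harder second example dominates.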

[LG-23] Bayesian Design Principles for Offline-to-Online Reinforcement Learning

链接: https://arxiv.org/abs/2405.20984
作者: Hao Hu,Yiqin Yang,Jianing Ye,Chengjie Wu,Ziqing Mai,Yujing Hu,Tangjie Lv,Changjie Fan,Qianchuan Zhao,Chongjie Zhang
关键词: Offline reinforcement learning, costly or unsafe, real-world applications, applications where exploration, Offline reinforcement
类目: Machine Learning (cs.LG)
*备注: Forty-first International Conference on Machine Learning (ICML), 2024

点击查看摘要

Abstract:Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.
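The "probability matching" principle the abstract advocates is most familiar in its bandit form, Thompson sampling: sample a plausible value for each action from the posterior and act greedily on the sample. The sketch below is only that bandit analogy, not the paper's offline-to-online RL algorithm.

```python
import random

def thompson_step(successes, failures):
    """Probability matching in its classic bandit form (Thompson sampling):
    sample a plausible mean reward for each arm from its Beta posterior and
    act greedily on the sample. The paper applies this principle to
    offline-to-online RL; this bandit analogy is only illustrative."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

random.seed(0)
# Arm 1 has the better empirical record, so it is chosen most of the time,
# while arm 0 is still explored in proportion to its posterior plausibility.
picks = [thompson_step([2, 20], [8, 2]) for _ in range(1000)]
print(picks.count(1) > picks.count(0))  # True
```

The agent is neither uniformly pessimistic nor uniformly optimistic: its action frequencies track its belief about which action is optimal, which is exactly the behavior the paper argues avoids the sudden performance drop.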

[LG-24] Neural Gaussian Scale-Space Fields

链接: https://arxiv.org/abs/2405.20980
作者: Felix Mujkanovic,Ntumba Elie Nsampi,Christian Theobalt,Hans-Peter Seidel,Thomas Leimkühler
关键词: Gaussian scale spaces, Gaussian scale, anisotropic Gaussian scale, scale space, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 15 pages; SIGGRAPH 2024; project page at this https URL

点击查看摘要

Abstract:Gaussian scale spaces are a cornerstone of signal representation and processing, with applications in filtering, multiscale analysis, anti-aliasing, and many more. However, obtaining such a scale space is costly and cumbersome, in particular for continuous representations such as neural fields. We present an efficient and lightweight method to learn the fully continuous, anisotropic Gaussian scale space of an arbitrary signal. Based on Fourier feature modulation and Lipschitz bounding, our approach is trained self-supervised, i.e., training does not require any manual filtering. Our neural Gaussian scale-space fields faithfully capture multiscale representations across a broad range of modalities, and support a diverse set of applications. These include images, geometry, light-stage data, texture anti-aliasing, and multiscale optimization.

[LG-25] ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

链接: https://arxiv.org/abs/2405.20975
作者: Zhangchen Xu,Fengqing Jiang,Luyao Niu,Jinyuan Jia,Bo Li,Radha Poovendran
关键词: Federated Learning, local training data, machine learning model, contribution evaluation methods, machine learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear in the 33rd USENIX Security Symposium, 2024

点击查看摘要

Abstract:In Federated Learning (FL), a set of clients collaboratively train a machine learning model (called global model) without sharing their local training data. The local training data of clients is typically non-i.i.d. and heterogeneous, resulting in varying contributions from individual clients to the final performance of the global model. In response, many contribution evaluation methods were proposed, where the server could evaluate the contribution made by each client and incentivize the high-contributing clients to sustain their long-term participation in FL. Existing studies mainly focus on developing new metrics or algorithms to better measure the contribution of each client. However, the security of contribution evaluation methods of FL operating in adversarial environments is largely unexplored. In this paper, we propose the first model poisoning attack on contribution evaluation methods in FL, termed ACE. Specifically, we show that any malicious client utilizing ACE could manipulate the parameters of its local model such that it is evaluated to have a high contribution by the server, even when its local training data is indeed of low quality. We perform both theoretical analysis and empirical evaluations of ACE. Theoretically, we show our design of ACE can effectively boost the malicious client’s perceived contribution when the server employs the widely-used cosine distance metric to measure contribution. Empirically, our results show ACE effectively and efficiently deceives five state-of-the-art contribution evaluation methods. In addition, ACE preserves the accuracy of the final global models on testing inputs. We also explore six countermeasures to defend against ACE. Our results show they are inadequate to thwart ACE, highlighting the urgent need for new defenses to safeguard the contribution evaluation methods in FL.
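The cosine-distance contribution metric the paper analyzes can be sketched as below. The specific vectors and the "copy the aggregate direction" manipulation are illustrative; they show why a direction-only score is manipulable, which is the general weakness ACE exploits, not the attack's actual construction.

```python
import numpy as np

def cosine_contribution(client_update, global_update):
    """Simplified cosine-based contribution score of the kind the paper
    attacks: clients whose update direction aligns with the aggregated
    update are deemed high-contributing."""
    num = np.dot(client_update, global_update)
    den = np.linalg.norm(client_update) * np.linalg.norm(global_update)
    return num / den

global_upd = np.array([1.0, 2.0, -1.0])
honest = np.array([0.9, 2.1, -1.2])    # genuine, noisy local update
malicious = 0.01 * global_upd          # tiny update that merely copies direction

# Cosine similarity ignores magnitude, so a near-zero update mimicking the
# aggregate direction scores perfectly despite contributing no learning.
print(round(cosine_contribution(malicious, global_upd), 3))  # 1.0
print(round(cosine_contribution(honest, global_upd), 3))     # ≈ 0.997
```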

[LG-26] SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

链接: https://arxiv.org/abs/2405.20974
作者: Tianyang Xu,Shujin Wu,Shizhe Diao,Xiaoze Liu,Xingyao Wang,Yangyi Chen,Jing Gao
关键词: Large language models, Large language, confidence estimates, broader applications, fabricated information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The code is available at \url{ this https URL }

点击查看摘要

Abstract:Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications. Previous work elicits confidence from LLMs by direct or self-consistency prompting, or constructing specific datasets for supervised finetuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited to binary or inaccurate group-level confidence estimates. In this work, we present the advanced SaySelf, a training framework that teaches LLMs to express more accurate fine-grained confidence estimates. In addition, beyond the confidence scores, SaySelf initiates the process of directing LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by using an LLM to automatically summarize the uncertainties in specific knowledge via natural language. The summarization is based on the analysis of the inconsistency in multiple sampled reasoning chains, and the resulting data is utilized for supervised fine-tuning. Moreover, we utilize reinforcement learning with a meticulously crafted reward function to calibrate the confidence estimates, motivating LLMs to deliver accurate, high-confidence predictions and to penalize overconfidence in erroneous outputs. Experimental results in both in-distribution and out-of-distribution datasets demonstrate the effectiveness of SaySelf in reducing the confidence calibration error and maintaining the task performance. We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration. The code is made public at this https URL.

[LG-27] LCQ: Low-Rank Codebook based Quantization for Large Language Models

链接: https://arxiv.org/abs/2405.20973
作者: Wen-Pu Cai,Wu-Jun Li
关键词: Large language models, Large language, recently demonstrated promising, demonstrated promising performance, recently demonstrated
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying LLMs. Weight quantization has been widely used for model compression, which can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method, called low-rank codebook based quantization (LCQ), for LLMs. LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with negligible extra storage cost.
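The rank-one baseline that LCQ generalizes is nearest-codeword quantization against a shared codebook, which can be sketched as below. The codebook values and weights are illustrative; how LCQ builds its low-rank (rank greater than one) codebook is not specified in the abstract and is not shown here.

```python
import numpy as np

def codebook_quantize(weights, codebook):
    """Nearest-codeword quantization with a shared scalar codebook, the
    rank-one baseline that LCQ generalizes to low-rank codebooks. Each
    weight is replaced by the closest entry of `codebook`."""
    idx = np.argmin(np.abs(weights[..., None] - codebook[None, :]), axis=-1)
    return codebook[idx], idx

weights = np.array([0.12, -0.48, 0.95, -0.05])
codebook = np.array([-0.5, 0.0, 0.5, 1.0])
quant, idx = codebook_quantize(weights, codebook)
print(quant)   # 0.12 -> 0.0, -0.48 -> -0.5, 0.95 -> 1.0, -0.05 -> 0.0
print(idx)     # codeword indices; only these (plus the codebook) are stored
```

Storage drops because each weight is replaced by a small integer index; a richer (higher-rank) codebook reduces the reconstruction error that hurts accuracy at high compression ratios.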

[LG-28] Amortizing intractable inference in diffusion models for vision language and control

链接: https://arxiv.org/abs/2405.20971
作者: Siddarth Venkatraman,Moksh Jain,Luca Scimeca,Minsu Kim,Marcin Sendera,Mohsin Hasan,Luke Rowe,Sarthak Mittal,Pablo Lemos,Emmanuel Bengio,Alexandre Adam,Jarrid Rector-Brooks,Yoshua Bengio,Glen Berseth,Nikolay Malkin
关键词: effective distribution estimators, downstream tasks poses, mathbf, relative trajectory balance, emerged as effective
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data, $\mathbf{x} \sim p^{\text{post}}(\mathbf{x}) \propto p(\mathbf{x})\,r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or likelihood function $r(\mathbf{x})$. We state and prove the asymptotic correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning.

[LG-29] Aligning Multiclass Neural Network Classifier Criterion with Task Performance via $F_\beta$-Score

链接: https://arxiv.org/abs/2405.20954
作者: Nathan Tsoi,Deyuan Li,Taesoo Daniel Lee,Marynel Vázquez
关键词: Multiclass neural network, neural network classifiers, neural network, beta, Multiclass neural
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Multiclass neural network classifiers are typically trained using cross-entropy loss. Following training, the performance of this same neural network is evaluated using an application-specific metric based on the multiclass confusion matrix, such as the Macro $F_\beta$-Score. It is questionable whether the use of cross-entropy will yield a classifier that aligns with the intended application-specific performance criteria, particularly in scenarios where there is a need to emphasize one aspect of classifier performance. For example, if greater precision is preferred over recall, the $\beta$ value in the $F_\beta$ evaluation metric can be adjusted accordingly, but the cross-entropy objective remains unaware of this preference during training. We propose a method that addresses this training-evaluation gap for multiclass neural network classifiers such that users can train these models informed by the desired final $F_\beta$-Score. Following prior work in binary classification, we utilize the concepts of the soft-set confusion matrices and a piecewise-linear approximation of the Heaviside step function. Our method extends the $2 \times 2$ binary soft-set confusion matrix to a multiclass $d \times d$ confusion matrix and proposes dynamic adaptation of the threshold value $\tau$, which parameterizes the piecewise-linear Heaviside approximation during run-time. We present a theoretical analysis that shows that our method can be used to optimize for a soft-set based approximation of Macro-$F_\beta$ that is a consistent estimator of Macro-$F_\beta$, and our extensive experiments show the practical effectiveness of our approach.
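The binary building blocks mentioned above (a piecewise-linear Heaviside approximation and soft-set confusion counts) can be sketched as follows. The ramp width `delta` and the example data are illustrative assumptions; the paper extends this binary construction to a $d \times d$ multiclass matrix with a dynamically adapted $\tau$.

```python
import numpy as np

def soft_heaviside(p, tau, delta=0.1):
    """Piecewise-linear approximation of the Heaviside step H(p - tau):
    0 below tau - delta, 1 above tau + delta, and a linear ramp in between,
    which provides a usable gradient for training. delta is illustrative."""
    return np.clip((p - (tau - delta)) / (2 * delta), 0.0, 1.0)

def soft_fbeta(probs, labels, tau=0.5, beta=1.0):
    """Soft-set F_beta from soft true/false positive/negative counts
    (binary case; the paper generalizes this to d x d multiclass)."""
    s = soft_heaviside(probs, tau)
    tp = np.sum(s * labels)
    fp = np.sum(s * (1 - labels))
    fn = np.sum((1 - s) * labels)
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

probs = np.array([0.9, 0.8, 0.3, 0.45])
labels = np.array([1, 1, 0, 1])
print(round(soft_fbeta(probs, labels), 3))  # 0.857
```

Because every count is a differentiable function of the predicted probabilities, the soft $F_\beta$ can be maximized directly by gradient descent, closing the training-evaluation gap the abstract describes.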

[LG-30] Effective Interplay between Sparsity and Quantization: From Theory to Practice

链接: https://arxiv.org/abs/2405.20935
作者: Simla Burcu Harma,Ayan Chakraborty,Elizaveta Kostenok,Danila Mishin,Dongho Ha,Babak Falsafi,Martin Jaggi,Ming Liu,Yunho Oh,Suvinay Subramanian,Amir Yazdanbakhsh
关键词: deep neural networks, neural networks necessitates, improve computational efficiency, networks necessitates effective, increasing size
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate the interaction between these two methods and assess whether their combination impacts final model accuracy. We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. Our empirical studies across a wide range of models, including OPT and Llama model families (125M-8B) and ViT corroborate these theoretical findings. In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation. Our findings extend to the efficient deployment of large models in resource-limited compute platforms and reduce serving cost, offering insights into best practices for applying these compression methods to maximize efficacy without compromising accuracy.
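The order the paper proves optimal, sparsity first and then quantization, can be illustrated with a minimal magnitude-pruning plus uniform-quantization pipeline. The keep ratio, quantization step, and example weights are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def prune_then_quantize(w, keep_ratio=0.5, step=0.25):
    """Compress in the order the paper proves optimal: magnitude pruning
    first, then uniform quantization of the surviving weights.
    keep_ratio and step are illustrative hyperparameters."""
    k = int(np.ceil(keep_ratio * w.size))
    thresh = np.sort(np.abs(w))[-k]           # k-th largest magnitude
    mask = np.abs(w) >= thresh
    sparse = np.where(mask, w, 0.0)           # pruning error on dropped weights
    quantized = np.round(sparse / step) * step  # quantization error on survivors
    return quantized, mask

w = np.array([0.1, -0.9, 0.33, 0.05, -0.61, 0.2])
q, mask = prune_then_quantize(w)
print(q)                      # zeros where pruned, snapped to 0.25 grid elsewhere
print(np.mean((w - q) ** 2))  # total compression error (pruning + quantization)
```

Reversing the order would make pruning decisions on already-quantized values; the paper shows this interaction is not harmless and that quantization error dominates the resulting accuracy degradation.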

[LG-31] Concentration Bounds for Optimized Certainty Equivalent Risk Estimation

链接: https://arxiv.org/abs/2405.20933
作者: Ayon Ghosh,L.A. Prashanth,Krishna Jagannathan
关键词: Optimized Certainty Equivalent, Certainty Equivalent, Optimized Certainty, estimating the Optimized, identically distributed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of estimating the Optimized Certainty Equivalent (OCE) risk from independent and identically distributed (i.i.d.) samples. For the classic sample average approximation (SAA) of OCE, we derive mean-squared error as well as concentration bounds (assuming sub-Gaussianity). Further, we analyze an efficient stochastic approximation-based OCE estimator, and derive finite sample bounds for the same. To show the applicability of our bounds, we consider a risk-aware bandit problem, with OCE as the risk. For this problem, we derive a bound on the probability of mis-identification. Finally, we conduct numerical experiments to validate the theoretical findings.
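A concrete instance of the SAA estimator studied above is the empirical CVaR, a canonical member of the OCE family (with a particular choice of utility function). The sketch below uses the standard quantile characterization of CVaR; the alpha level and data are illustrative.

```python
import numpy as np

def cvar_saa(samples, alpha=0.9):
    """Sample average approximation (SAA) of CVaR, a canonical instance of
    the OCE risk family: CVaR_a(X) = inf_eta { eta + E[(X - eta)^+] / (1-a) }.
    For the empirical measure the infimum is attained at the alpha-quantile."""
    eta = np.quantile(samples, alpha)
    return eta + np.mean(np.maximum(samples - eta, 0.0)) / (1.0 - alpha)

# Deterministic check: losses 1..100 with alpha = 0.9 give eta = 90.1 and
# CVaR = 90.1 + 0.54 / 0.1 = 95.5 (roughly the mean of the top decile).
print(cvar_saa(np.arange(1.0, 101.0)))

# For standard normal losses, the true CVaR_0.9 is about 1.755; the SAA
# estimate concentrates around it, which is what the paper's bounds quantify.
rng = np.random.default_rng(0)
print(cvar_saa(rng.normal(size=100_000)))
```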

[LG-32] Learning to Estimate System Specifications in Linear Temporal Logic using Transformers and Mamba

链接: https://arxiv.org/abs/2405.20917
作者: İlker Işık,Ebru Aydin Gol,Ramazan Gokberk Cinbis
关键词: Temporal logic, evolve over time, temporal logic formulae, framework for representing, representing and reasoning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 20 pages, 15 figures

点击查看摘要

Abstract:Temporal logic is a framework for representing and reasoning about propositions that evolve over time. It is commonly used for specifying requirements in various domains, including hardware and software systems, as well as robotics. Specification mining or formula generation involves extracting temporal logic formulae from system traces and has numerous applications, such as detecting bugs and improving interpretability. Although there has been a surge of deep learning-based methods for temporal logic satisfiability checking in recent years, the specification mining literature has been lagging behind in adopting deep learning methods despite their many advantages, such as scalability. In this paper, we introduce autoregressive models that can generate linear temporal logic formulae from traces, towards addressing the specification mining problem. We propose multiple architectures for this task: transformer encoder-decoder, decoder-only transformer, and Mamba, which is an emerging alternative to transformer models. Additionally, we devise a metric for quantifying the distinctiveness of the generated formulae and a straightforward algorithm to enforce the syntax constraints. Our experiments show that the proposed architectures yield promising results, generating correct and distinct formulae at a fraction of the compute cost needed for the combinatorial baseline.
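The "straightforward algorithm to enforce the syntax constraints" mentioned above can be realized, for prefix (Polish) notation, by arity counting: one slot is open initially, and each token consumes a slot and opens as many new ones as its arity. This is a hypothetical stand-in consistent with the abstract, not necessarily the authors' exact mechanism; the operator set is the usual LTL one.

```python
# Arity of LTL operators in prefix notation; propositions have arity 0.
ARITY = {"!": 1, "X": 1, "F": 1, "G": 1, "&": 2, "|": 2, "U": 2, "->": 2}

def is_wellformed(tokens):
    """Check that a prefix-notation LTL token sequence forms exactly one
    complete formula. `need` counts open operand slots: it starts at 1 and
    each token closes one slot while opening `arity` new ones."""
    need = 1
    for tok in tokens:
        if need == 0:          # formula already complete, trailing tokens
            return False
        need += ARITY.get(tok, 0) - 1
    return need == 0

print(is_wellformed(["G", "->", "a", "F", "b"]))   # True:  G(a -> F b)
print(is_wellformed(["&", "a"]))                   # False: missing operand
print(is_wellformed(["a", "b"]))                   # False: trailing token
```

During autoregressive decoding, the same counter can mask out tokens that would make the sequence unrecoverable (e.g., forbidding end-of-formula while `need > 0`).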

[LG-33] Fast yet Safe: Early-Exiting with Risk Control

链接: https://arxiv.org/abs/2405.20915
作者: Metod Jazbec,Alexander Timans,Tin Hadži Veljković,Kaspar Sakmann,Dan Zhang,Christian A. Naesseth,Eric Nalisnick
关键词: Scaling machine learning, machine learning models, learning models significantly, models significantly improves, Scaling machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 25 pages, 11 figures, 4 tables (incl. appendix)

点击查看摘要

Abstract:Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it ‘safe’ for an EENN to go ‘fast’? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN’s exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.
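The post-hoc calibration idea above, tuning the exit mechanism so exits happen only when output quality is sufficient, can be sketched as a threshold search on a held-out calibration set. This is a deliberately simplified, assumption-laden version: proper distribution-free risk control adds a finite-sample correction (e.g., a concentration bound) to the empirical risk, which is omitted here.

```python
import numpy as np

def calibrate_exit_threshold(confidences, gaps, epsilon=0.05):
    """Pick the lowest confidence threshold at which early-exited samples
    keep the average quality gap (early-exit vs. full model) below epsilon.
    Simplified sketch: real risk control would bound, not just average,
    the risk using a finite-sample concentration argument."""
    for lam in np.unique(confidences):     # unique() returns sorted values
        exited = confidences >= lam
        if exited.any() and gaps[exited].mean() <= epsilon:
            return lam
    return np.inf                          # never safe to exit early

conf = np.array([0.2, 0.5, 0.8, 0.9, 0.95])   # exit-head confidences
gap = np.array([0.30, 0.20, 0.04, 0.02, 0.00])  # quality gap vs. full model
print(calibrate_exit_threshold(conf, gap))     # 0.8
```

Lower thresholds mean more samples exit early (more speedup); the calibration picks the most permissive threshold that still meets the user-specified performance goal.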

[LG-34] VENI VINDy VICI: a variational reduced-order modeling framework with uncertainty quantification

链接: https://arxiv.org/abs/2405.20905
作者: Paolo Conti,Jonas Kneifl,Andrea Manzoni,Attilio Frangi,Jörg Fehr,Steven L. Brunton,J. Nathan Kutz
关键词: requires solving expensive, partial differential equations, science requires solving, solving expensive, complex phenomena
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:The simulation of many complex phenomena in engineering and science requires solving expensive, high-dimensional systems of partial differential equations (PDEs). To circumvent this, reduced-order models (ROMs) have been developed to speed up computations. However, when governing equations are unknown or partially known, typically ROMs lack interpretability and reliability of the predicted solutions. In this work we present a data-driven, non-intrusive framework for building ROMs where the latent variables and dynamics are identified in an interpretable manner and uncertainty is quantified. Starting from a limited amount of high-dimensional, noisy data the proposed framework constructs an efficient ROM by leveraging variational autoencoders for dimensionality reduction along with a newly introduced, variational version of sparse identification of nonlinear dynamics (SINDy), which we refer to as Variational Identification of Nonlinear Dynamics (VINDy). In detail, the method consists of Variational Encoding of Noisy Inputs (VENI) to identify the distribution of reduced coordinates. Simultaneously, we learn the distribution of the coefficients of a pre-determined set of candidate functions by VINDy. Once trained offline, the identified model can be queried for new parameter instances and new initial conditions to compute the corresponding full-time solutions. The probabilistic setup enables uncertainty quantification as the online testing consists of Variational Inference naturally providing Certainty Intervals (VICI). In this work we showcase the effectiveness of the newly proposed VINDy method in identifying an interpretable and accurate dynamical system for the Rössler system with different noise intensities and sources. Then the performance of the overall method, named VENI, VINDy, VICI, is tested on PDE benchmarks including structural mechanics and fluid dynamics.

[LG-35] On the Condition Monitoring of Bolted Joints through Acoustic Emission and Deep Transfer Learning: Generalization Ordinal Loss and Super-Convergence

链接: https://arxiv.org/abs/2405.20887
作者: Emmanuel Ramasso,Rafael de O. Teloli,Romain Marcel
关键词: convolutional neural networks, paper investigates, based on convolutional, convolutional neural, acoustic emission
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper investigates the use of deep transfer learning based on convolutional neural networks (CNNs) to monitor the condition of bolted joints using acoustic emissions. Bolted structures are critical components in many mechanical systems, and the ability to monitor their condition status is crucial for effective structural health monitoring. We evaluated the performance of our methodology using the ORION-AE benchmark, a structure composed of two thin beams connected by three bolts, where highly noisy acoustic emission measurements were taken to detect changes in the applied tightening torque of the bolts. The data used from this structure is derived from the transformation of acoustic emission data streams into images using continuous wavelet transform, and leveraging pretrained CNNs for feature extraction and denoising. Our experiments compared single-sensor versus multiple-sensor fusion for estimating the tightening level (loosening) of bolts and evaluated the use of raw versus prefiltered data on the performance. We particularly focused on the generalization capabilities of CNN-based transfer learning across different measurement campaigns and we studied ordinal loss functions to penalize incorrect predictions less severely when close to the ground truth, thereby encouraging misclassification errors to be in adjacent classes. Network configurations as well as learning rate schedulers are also investigated, and super-convergence is obtained, i.e., high classification accuracy is achieved in a small number of iterations with different networks. Furthermore, results demonstrate the generalization capabilities of CNN-based transfer learning for monitoring bolted structures by acoustic emission with varying amounts of prior information required during training.

[LG-36] Sheaf HyperNetworks for Personalized Federated Learning

链接: https://arxiv.org/abs/2405.20882
作者: Bao Nguyen,Lorenzo Sani,Xinchi Qiu,Pietro Liò,Nicholas D. Lane
关键词: neural architecture search, leverage relational data, molecular property prediction, graph neural networks, combining graph neural
类目: Machine Learning (cs.LG)
*备注: 25 pages, 12 figures, 7 tables, pre-print under review

点击查看摘要

Abstract:Graph hypernetworks (GHNs), constructed by combining graph neural networks (GNNs) with hypernetworks (HNs), leverage relational data across various domains such as neural architecture search, molecular property prediction and federated learning. Despite GNNs and HNs being individually successful, we show that GHNs present problems compromising their performance, such as over-smoothing and heterophily. Moreover, we cannot apply GHNs directly to personalized federated learning (PFL) scenarios, where an a priori client relation graph may be absent, private, or inaccessible. To mitigate these limitations in the context of PFL, we propose a novel class of HNs, sheaf hypernetworks (SHNs), which combine cellular sheaf theory with HNs to improve parameter sharing for PFL. We thoroughly evaluate SHNs across diverse PFL tasks, including multi-class classification, traffic and weather forecasting. Additionally, we provide a methodology for constructing client relation graphs in scenarios where such graphs are unavailable. We show that SHNs consistently outperform existing PFL solutions in complex non-IID scenarios. While the baselines’ performance fluctuates depending on the task, SHNs show improvements of up to 2.7% in accuracy and 5.3% in lower mean squared error over the best-performing baseline.

[LG-37] Flow matching achieves minimax optimal convergence

链接: https://arxiv.org/abs/2405.20879
作者: Kenji Fukumizu,Taiji Suzuki,Noboru Isobe,Kazusato Oko,Masanori Koyama
关键词: gained significant attention, simulation-free generative model, Flow matching, gained significant, significant attention
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equation with an initial condition from a normal distribution, thus streamlining the sample generation process. This paper discusses the convergence properties of FM in terms of the p-Wasserstein distance, a measure of distributional discrepancy. We establish that FM can achieve the minimax optimal convergence rate for 1 \leq p \leq 2, presenting the first theoretical evidence that FM can reach convergence rates comparable to those of diffusion models. Our analysis extends existing frameworks by examining a broader class of mean and variance functions for the vector fields and identifies specific conditions necessary to attain these optimal rates.
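The ODE-based sampling that distinguishes FM from diffusion models can be sketched in a few lines. For illustration we use the straight-line (conditional OT) path toward a fixed target point `mu`, where the vector field has a closed form; in actual flow matching a trained network replaces `v` and the target is a full distribution.

```python
import numpy as np

def euler_sample(v_field, x0, n_steps=100):
    """Integrate dx/dt = v_field(x, t) from t = 0 to t = 1 with Euler steps,
    starting from samples of the source distribution."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v_field(x, k * dt)  # t never reaches 1, so 1 - t > 0
    return x

# For the straight-line path x_t = (1 - t) x_0 + t * mu toward a fixed
# target point mu, the vector field is known in closed form.
mu = np.array([2.0, -1.0])
v = lambda x, t: (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))   # initial condition drawn from N(0, I)
x1 = euler_sample(v, x0)           # every sample is transported onto mu
```

All samples start as Gaussian noise and are carried by the ODE to the target; the simulation-free aspect of FM lies in how the vector field is trained, not in this sampling step.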

[LG-38] Waveform Design for Over-the-Air Computing

链接: https://arxiv.org/abs/2405.20877
作者: Nikos G. Evgenidis,Nikos A. Mitsiou,Sotiris A. Tegos,Panagiotis D. Diamantoulakis,Panagiotis Sarigiannidis,Ioannis T. Rekanos,George K. Karagiannidis
关键词: time sampling error, OTA transmission, OTA computing enables, OTA computing, OTA
类目: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)
*备注: 14 pages

点击查看摘要

Abstract:In response to the increasing number of devices anticipated in next-generation networks, a shift toward over-the-air (OTA) computing has been proposed. Leveraging the superposition of multiple access channels, OTA computing enables efficient resource management by supporting simultaneous uncoded transmission in the time and the frequency domain. Thus, to advance the integration of OTA computing, our study presents a theoretical analysis addressing practical issues encountered in current digital communication transceivers, such as time sampling error and intersymbol interference (ISI). To this end, we examine the theoretical mean squared error (MSE) for OTA transmission under time sampling error and ISI, while also exploring methods for minimizing the MSE in the OTA transmission. Utilizing alternating optimization, we also derive optimal power policies for both the devices and the base station. Additionally, we propose a novel deep neural network (DNN)-based approach to design waveforms enhancing OTA transmission performance under time sampling error and ISI. To ensure fair comparison with existing waveforms like the raised cosine (RC) and the better-than-raised-cosine (BTRC), we incorporate a custom loss function integrating energy and bandwidth constraints, along with practical design considerations such as waveform symmetry. Simulation results validate our theoretical analysis and demonstrate performance gains of the designed pulse over RC and BTRC waveforms. To facilitate testing of our results without necessitating the DNN structure recreation, we provide curve fitting parameters for select DNN-based waveforms as well.
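As a concrete reference point for the baselines mentioned above, here is a sketch of the raised-cosine pulse (the roll-off `beta` is an arbitrary choice); the zero-ISI property shows up as the pulse vanishing at nonzero multiples of the symbol period.

```python
import numpy as np

def raised_cosine(t, T=1.0, beta=0.35):
    """Raised-cosine pulse with roll-off beta, normalized so h(0) = 1.
    The removable singularity at |t| = T/(2*beta) is filled with its limit."""
    t = np.asarray(t, dtype=float)
    h = np.empty_like(t)
    sing = np.isclose(np.abs(t), T / (2 * beta))
    h[sing] = (np.pi / 4) * np.sinc(1 / (2 * beta))   # limit value
    x = t[~sing] / T
    h[~sing] = np.sinc(x) * np.cos(np.pi * beta * x) / (1 - (2 * beta * x) ** 2)
    return h
```

The DNN-designed pulses in the paper are compared against exactly this kind of closed-form baseline, trading off its spectral roll-off against MSE under sampling error and ISI.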

[LG-39] Enhancing Efficiency of Safe Reinforcement Learning via Sample Manipulation

链接: https://arxiv.org/abs/2405.20860
作者: Shangding Gu,Laixi Shi,Yuhao Ding,Alois Knoll,Costas Spanos,Adam Wierman,Ming Jin
关键词: Safe reinforcement learning, Efficient Safe Policy, maximize long-term rewards, reinforcement learning, real-world applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Safe reinforcement learning (RL) is crucial for deploying RL agents in real-world applications, as it aims to maximize long-term rewards while satisfying safety constraints. However, safe RL often suffers from sample inefficiency, requiring extensive interactions with the environment to learn a safe policy. We propose Efficient Safe Policy Optimization (ESPO), a novel approach that enhances the efficiency of safe RL through sample manipulation. ESPO employs an optimization framework with three modes: maximizing rewards, minimizing costs, and balancing the trade-off between the two. By dynamically adjusting the sampling process based on the observed conflict between reward and safety gradients, ESPO theoretically guarantees convergence, optimization stability, and improved sample complexity bounds. Experiments on the Safety-MuJoCo and Omnisafe benchmarks demonstrate that ESPO significantly outperforms existing primal-based and primal-dual-based baselines in terms of reward maximization and constraint satisfaction. Moreover, ESPO achieves substantial gains in sample efficiency, requiring 25–29% fewer samples than baselines, and reduces training time by 21–38%.
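The gradient-conflict-driven sample manipulation can be illustrated with a toy decision rule. The function name, thresholds, and doubling factor below are hypothetical, not ESPO's actual schedule.

```python
import numpy as np

def choose_mode_and_batch(reward_grad, cost_grad, base_batch=64,
                          cost_threshold=0.0, current_cost=0.0):
    """Pick an update mode from the geometry of the two gradients and
    draw more samples when they conflict (hypothetical rule, not ESPO's
    exact schedule)."""
    cos = float(np.dot(reward_grad, cost_grad)
                / (np.linalg.norm(reward_grad) * np.linalg.norm(cost_grad) + 1e-12))
    if current_cost > cost_threshold:
        mode = "minimize_cost"        # safety constraint violated
    elif cos < 0.0:
        mode = "balance"              # reward and safety gradients conflict
    else:
        mode = "maximize_reward"
    batch = base_batch * (2 if cos < 0.0 else 1)   # more samples = less noise
    return mode, batch
```

The key idea it mirrors: spend extra samples only when the two objectives pull in opposing directions, which is where noisy gradient estimates are most harmful.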

[LG-40] SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice

链接: https://arxiv.org/abs/2405.20848
作者: Rui Ren,Jingbang Yang,Linxiao Yang,Xinyue Gu,Liang Sun
关键词: change service, newly deployed service, type of minority, service, fault
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A newly deployed service (one kind of change service) can lead to a new type of minority fault. Existing state-of-the-art methods for fault localization rarely consider the imbalanced fault classification in change services. This paper proposes a novel method that utilizes decision rule sets to deal with highly imbalanced data by optimizing the F1 score subject to cardinality constraints. The proposed method greedily generates the rule with maximal marginal gain and uses an efficient minorize-maximization (MM) approach to select rules iteratively, maximizing a non-monotone submodular lower bound. Compared with existing fault localization algorithms, our algorithm can adapt to the imbalanced fault scenario of change services, and provides interpretable fault causes which are easy to understand and verify. Our method can also be deployed in the online training setting, with only about 15% training overhead compared to the current SOTA methods. Empirical studies showcase that our algorithm outperforms existing fault localization algorithms in both accuracy and model interpretability.
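A plain greedy stand-in for the rule-selection step: each candidate rule is a boolean coverage mask, and we add the rule with maximal marginal F1 gain. The paper's actual method replaces naive greedy selection with a minorize-maximization procedure on a submodular lower bound.

```python
import numpy as np

def f1(pred, y):
    """F1 score of a boolean prediction mask against boolean labels."""
    tp = np.sum(pred & y)
    fp = np.sum(pred & ~y)
    fn = np.sum(~pred & y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def greedy_rule_set(rules, y, max_rules=3):
    """Greedily add the rule with maximal marginal F1 gain; the rule set
    flags a fault on the union of the selected rules' coverage."""
    chosen, covered = [], np.zeros_like(y, dtype=bool)
    for _ in range(max_rules):
        gains = [f1(covered | r, y) - f1(covered, y) for r in rules]
        best = int(np.argmax(gains))
        if gains[best] <= 0:          # no rule improves F1: stop early
            break
        chosen.append(best)
        covered = covered | rules[best]
    return chosen, f1(covered, y)
```

Optimizing F1 rather than accuracy is what makes this robust to the heavy class imbalance: a rule set that predicts "no fault" everywhere scores 0 here, not 95%.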

[LG-41] einspace: Searching for Neural Architectures from Fundamental Operations

链接: https://arxiv.org/abs/2405.20838
作者: Linus Ericsson,Miguel Espinosa,Chenhongyi Yang,Antreas Antoniou,Amos Storkey,Shay B. Cohen,Steven McDonagh,Elliot J. Crowley
关键词: Neural architecture search, high performing networks, Neural architecture, NAS, finds high performing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Project page at this https URL

点击查看摘要

Abstract:Neural architecture search (NAS) finds high-performing networks for a given task. Yet the results of NAS have been fairly prosaic; they have not, for example, created a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren’t diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.
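A toy probabilistic context-free grammar makes the search-space idea tangible. The grammar, symbols, and probabilities below are illustrative; einspace's real grammar is far richer, covering branching, routing, and aggregation primitives.

```python
import random

# A toy PCFG over network operations: a NET is a chain of OPs.
GRAMMAR = {
    "NET": [(0.5, ["OP"]), (0.5, ["OP", "->", "NET"])],
    "OP":  [(0.4, ["conv3x3"]), (0.3, ["attention"]), (0.3, ["mlp"])],
}

def sample(symbol="NET", rng=None, depth=0, max_depth=8):
    """Sample a derivation (a linear chain of operations) from the PCFG."""
    rng = rng or random.Random(0)
    if symbol not in GRAMMAR:
        return [symbol]                   # terminal symbol
    rules = GRAMMAR[symbol]
    if depth >= max_depth:                # force termination at the depth limit
        rules = rules[:1]
    r, acc = rng.random(), 0.0
    for p, rhs in rules:                  # weighted rule choice
        acc += p
        if r <= acc:
            break
    out = []
    for s in rhs:
        out += sample(s, rng, depth + 1, max_depth)
    return out
```

Searching the space then amounts to mutating or resampling subtrees of such derivations, which is how grammar-based spaces can contain both convolutional and attention-based designs.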

[LG-42] Solving partial differential equations with sampled neural networks

链接: https://arxiv.org/abs/2405.20836
作者: Chinmay Datar,Taniya Kapoor,Abhishek Chandra,Qing Sun,Iryna Burak,Erik Lien Bolager,Anna Veselovska,Massimo Fornasier,Felix Dietrich
关键词: science and engineering, computational science, partial differential equations, PDE, Approximation
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 16 pages, 15 figures

点击查看摘要

Abstract:Approximation of solutions to partial differential equations (PDE) is an important problem in computational science and engineering. Using neural networks as an ansatz for the solution has proven a challenge in terms of training time and approximation accuracy. In this contribution, we discuss how sampling the hidden weights and biases of the ansatz network from data-agnostic and data-dependent probability distributions allows us to progress on both challenges. In most examples, the random sampling schemes outperform iterative, gradient-based optimization of physics-informed neural networks regarding training time and accuracy by several orders of magnitude. For time-dependent PDE, we construct neural basis functions only in the spatial domain and then solve the associated ordinary differential equation with classical methods from scientific computing over a long time horizon. This alleviates one of the greatest challenges for neural PDE solvers because it does not require us to parameterize the solution in time. For second-order elliptic PDE in Barron spaces, we prove the existence of sampled networks with L^2 convergence to the solution. We demonstrate our approach on several time-dependent and static PDEs. We also illustrate how sampled networks can effectively solve inverse problems in this setting. Benefits compared to common numerical schemes include spectral convergence and mesh-free construction of basis functions.
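The core recipe, sampling hidden weights and fitting only the linear output layer, can be sketched on a 1D Poisson problem via collocation least squares. The sampling ranges and boundary weighting below are arbitrary choices for illustration, not the paper's; it also studies data-dependent sampling schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_col = 100, 200

# Random-feature ansatz u(x) = sum_i c_i tanh(w_i x + b_i): hidden weights
# are *sampled*, never trained; only c is fit by linear least squares.
w = rng.uniform(-8, 8, n_feat)
b = rng.uniform(-8, 8, n_feat)

x = np.linspace(0, 1, n_col)[:, None]        # collocation points
phi = np.tanh(w * x + b)                     # features, shape (n_col, n_feat)
phi_xx = -2 * w**2 * phi * (1 - phi**2)      # d^2/dx^2 tanh(w x + b)

# Poisson problem -u'' = f on (0, 1), u(0) = u(1) = 0, exact u = sin(pi x).
f = np.pi**2 * np.sin(np.pi * x).ravel()
A = np.vstack([-phi_xx, 100 * phi[:1], 100 * phi[-1:]])  # PDE + weighted BC rows
rhs = np.concatenate([f, [0.0, 0.0]])
c, *_ = np.linalg.lstsq(A, rhs, rcond=None)

err = np.max(np.abs(phi @ c - np.sin(np.pi * x).ravel()))
```

The whole "training" is one `lstsq` call, which is why such schemes can beat iterative gradient-based physics-informed training by orders of magnitude in wall-clock time.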

[LG-43] Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs

链接: https://arxiv.org/abs/2405.20835
作者: Davide Paglieri,Saurabh Dash,Tim Rocktäschel,Jack Parker-Holder
关键词: Large Language Models, Large Language, reduced memory usage, enabling faster operation, efficiency of Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs) by enabling faster operation and compatibility with more accessible hardware through reduced memory usage, at the cost of small performance drops. We explore the role of calibration sets in PTQ, specifically their effect on hidden activations in various notable open-source LLMs. Calibration sets are crucial for evaluating activation magnitudes and identifying outliers, which can distort the quantization range and negatively impact performance. Our analysis reveals a marked contrast in quantization effectiveness across models. The older OPT model, which much of the quantization literature is based on, shows significant performance deterioration and high susceptibility to outliers with varying calibration sets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and Mistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity to outliers and stable activations. These findings suggest a shift in PTQ strategies might be needed. As advancements in pre-training methods reduce the relevance of outliers, there is an emerging need to reassess the fundamentals of current quantization literature. The emphasis should pivot towards optimizing inference speed, rather than primarily focusing on outlier preservation, to align with the evolving characteristics of state-of-the-art LLMs.
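Why calibration sets and outliers matter for PTQ can be seen in a tiny symmetric-quantization experiment: a single outlier activation inflates a max-based scale and wastes most quantization levels, while a percentile-based scale does not. (Illustrative only; real PTQ methods are per-channel and considerably more elaborate.)

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Symmetric uniform quantization followed by dequantization."""
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000)   # well-behaved activations...
acts[0] = 80.0                       # ...plus a single large outlier

s_max = np.abs(acts).max() / 127                   # outlier-dominated scale
s_pct = np.percentile(np.abs(acts), 99.9) / 127    # outlier-robust scale

# Reconstruction error on the non-outlier activations.
err_max = np.mean((acts[1:] - quantize(acts[1:], s_max)) ** 2)
err_pct = np.mean((acts[1:] - quantize(acts[1:], s_pct)) ** 2)
```

Models whose activations lack such outliers, as the paper finds for Mistral 7B, make the choice of calibration set and clipping strategy far less consequential.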

[LG-44] Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

链接: https://arxiv.org/abs/2405.20830
作者: Yueqin Yin,Zhendong Wang,Yujia Xie,Weizhu Chen,Mingyuan Zhou
关键词: Direct Preference Optimization, Traditional language model, Preference Optimization, Ratio Preference Optimization, pre-collected paired preference
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization, and outperforms offline self-play methods like SPIN. Our code is available at this https URL
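A schematic of the EMA-plus-replay machinery; the class name, decay, and buffer size are illustrative, and SAPO applies this to LLM response segments and full parameter tensors rather than scalars.

```python
import random
from collections import deque

class EMAWithReplay:
    """Exponential moving average of model parameters plus a replay buffer
    of response segments (a schematic stand-in for SAPO's off-policy
    pipeline)."""
    def __init__(self, params, decay=0.99, buffer_size=1000):
        self.ema = list(params)
        self.decay = decay
        self.buffer = deque(maxlen=buffer_size)   # old segments fall off

    def update(self, params):
        """Slow-moving copy of the trained model: ema <- d*ema + (1-d)*p."""
        self.ema = [self.decay * e + (1 - self.decay) * p
                    for e, p in zip(self.ema, params)]

    def store(self, segment):
        self.buffer.append(segment)

    def sample(self, k, rng=None):
        """Draw k stored segments for off-policy updates."""
        rng = rng or random
        return rng.sample(list(self.buffer), min(k, len(self.buffer)))
```

The EMA model supplies stable generations while the fast model trains, and the replay buffer mixes historical segments into each batch, which is what makes the scheme off-policy.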

[LG-45] Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch and Inductive Inference

链接: https://arxiv.org/abs/2405.20829
作者: Seongheon Park,Hyuk Kwon,Kwanghoon Sohn,Kibok Lee
关键词: extends conventional semi-supervised, Open-world semi-supervised learning, conventional semi-supervised learning, Open-world semi-supervised, open-world scenarios
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CVPR Workshop on Computer Vision in the Wild (CVinW), 2024

点击查看摘要

Abstract:Open-world semi-supervised learning (OWSSL) extends conventional semi-supervised learning to open-world scenarios by taking account of novel categories in unlabeled datasets. Despite the recent advancements in OWSSL, the success often relies on the assumptions that 1) labeled and unlabeled datasets share the same balanced class prior distribution, which does not generally hold in real-world applications, and 2) unlabeled training datasets are utilized for evaluation, where such transductive inference might not adequately address challenges in the wild. In this paper, we aim to generalize OWSSL by addressing them. Our work suggests that practical OWSSL may require different training settings, evaluation methods, and learning strategies compared to those prevalent in the existing literature.

[LG-46] Online Convex Optimisation: The Optimal Switching Regret for all Segmentations Simultaneously

链接: https://arxiv.org/abs/2405.20824
作者: Stephen Pasteris,Chris Hicks,Vasilios Mavroudis,Mark Herbster
关键词: online convex optimisation, convex optimisation, online convex, switching regret, classic problem
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the classic problem of online convex optimisation. Whereas the notion of static regret is relevant for stationary problems, the notion of switching regret is more appropriate for non-stationary problems. A switching regret is defined relative to any segmentation of the trial sequence, and is equal to the sum of the static regrets of each segment. In this paper we show that, perhaps surprisingly, we can achieve the asymptotically optimal switching regret on every possible segmentation simultaneously. Our algorithm for doing so is very efficient: having a space and per-trial time complexity that is logarithmic in the time-horizon. Our algorithm also obtains novel bounds on its dynamic regret: being adaptive to variations in the rate of change of the comparator sequence.

[LG-47] Pursuing Overall Welfare in Federated Learning through Sequential Decision Making

链接: https://arxiv.org/abs/2405.20821
作者: Seok-Ju Hahn,Gi-Soo Kim,Junghye Lee
关键词: traditional federated learning, single global model, global model, perform equally, single global
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
*备注: Accepted at ICML 2024

点击查看摘要

Abstract:In traditional federated learning, a single global model cannot perform equally well for all clients. Therefore, the need to achieve client-level fairness in federated systems has been emphasized, which can be realized by modifying the static aggregation scheme for updating the global model to an adaptive one, in response to the local signals of the participating clients. Our work reveals that existing fairness-aware aggregation strategies can be unified into an online convex optimization framework, in other words, a central server’s sequential decision making process. To enhance the decision making capability, we propose simple and intuitive improvements for suboptimal designs within existing methods, presenting AAggFF. Considering practical requirements, we further subdivide our method tailored for the cross-device and the cross-silo settings, respectively. Theoretical analyses guarantee sublinear regret upper bounds for both settings: \mathcal{O}(\sqrt{T \log K}) for the cross-device setting, and \mathcal{O}(K \log T) for the cross-silo setting, with K clients and T federation rounds. Extensive experiments demonstrate that the federated system equipped with AAggFF achieves a better degree of client-level fairness than existing methods in both practical settings. Code is available at this https URL

[LG-48] Optimally Improving Cooperative Learning in a Social Setting

链接: https://arxiv.org/abs/2405.20808
作者: Shahrzad Haddadan,Cheng Xin,Jie Gao
关键词: cooperative learning scenario, individually owned classifiers, owned classifiers dynamically, classifiers dynamically update, classification task
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We consider a cooperative learning scenario where a collection of networked agents with individually owned classifiers dynamically update their predictions, for the same classification task, through communication or observations of each other’s predictions. Clearly if highly influential vertices use erroneous classifiers, there will be a negative effect on the accuracy of all the agents in the network. We ask the following question: how can we optimally fix the predictions of a few classifiers so as to maximize the overall accuracy in the entire network? To this end we consider an aggregate and an egalitarian objective function. We show a polynomial time algorithm for optimizing the aggregate objective function, and show that optimizing the egalitarian objective function is NP-hard. Furthermore, we develop approximation algorithms for the egalitarian improvement. The performance of all of our algorithms is guaranteed by mathematical analysis and backed by experiments on synthetic and real data.

[LG-49] Shape Constraints in Symbolic Regression using Penalized Least Squares

链接: https://arxiv.org/abs/2405.20800
作者: Viktor Martinek,Julia Reuter,Ophelia Frotscher,Sanaz Mostaghim,Markus Richter,Roland Herzog
关键词: shape constraints, parameter estimation step, study the addition, shape, parameter estimation
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:We study the addition of shape constraints and their consideration during the parameter estimation step of symbolic regression (SR). Shape constraints serve as a means to introduce prior knowledge about the shape of the otherwise unknown model function into SR. Unlike previous works that have explored shape constraints in SR, we propose minimizing shape constraint violations during parameter estimation using gradient-based numerical optimization. We test three algorithm variants to evaluate their performance in identifying three symbolic expressions from a synthetically generated data set. This paper examines two benchmark scenarios: one with varying noise levels and another with reduced amounts of training data. The results indicate that incorporating shape constraints into the expression search is particularly beneficial when data is scarce. Compared to using shape constraints only in the selection process, our approach of minimizing violations during parameter estimation shows a statistically significant benefit in some of our test cases, without being significantly worse in any instance.
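A minimal sketch of penalized parameter estimation with a shape constraint: polynomial least squares plus a penalty on negative derivative values, minimized by gradient descent. In SR the model is a candidate symbolic expression rather than a polynomial, and the paper's optimizer differs, but the penalty construction is the same idea.

```python
import numpy as np

def fit_monotone_poly(x, y, degree=3, lam=100.0, lr=1e-3, steps=5000):
    """Polynomial least squares with a penalty on negative derivative values
    at the data points, enforcing an increasing-shape constraint during
    parameter estimation by plain gradient descent."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^d
    k = np.arange(1, degree + 1)
    D = X[:, :degree] * k                           # derivative basis k*x^(k-1)
    c = np.zeros(degree + 1)
    n = len(x)
    for _ in range(steps):
        grad = X.T @ (X @ c - y) / n                # gradient of 0.5 * MSE
        neg = np.minimum(D @ c[1:], 0.0)            # only negative slopes count
        grad[1:] += lam * (D.T @ neg) / n           # gradient of the penalty
        c -= lr * grad
    return c
```

With the penalty active, even data that trends downward yields a near-monotone fit, which is exactly the role prior shape knowledge plays when data is noisy or scarce.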

[LG-50] Ovis: Structural Embedding Alignment for Multimodal Large Language Model

链接: https://arxiv.org/abs/2405.20797
作者: Shiyin Lu,Yang Li,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Han-Jia Ye
关键词: Large Language Models, Current Multimodal Large, Multimodal Large Language, Large Language, pre-trained LLM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs – the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder – poses challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks demonstrate that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis’ structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Both the source code and the training dataset of Ovis will be made publicly available.
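The probabilistic table lookup can be sketched as a softmax-weighted sum over table rows; the shapes and softmax parameterization here are assumptions for illustration, not Ovis' exact design.

```python
import numpy as np

def visual_embed(patch_logits, embed_table):
    """Soft lookup: each patch produces a distribution over the visual
    embedding table, and its embedding is the probability-weighted sum of
    table rows, mirroring how textual tokens index a look-up table."""
    z = patch_logits - patch_logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ embed_table, probs

rng = np.random.default_rng(0)
table = rng.standard_normal((5, 4))    # 5 table entries, embedding dim 4
logits = rng.standard_normal((3, 5))   # 3 image patches
emb, probs = visual_embed(logits, table)
```

As the distribution becomes one-hot, the soft lookup collapses to the discrete table indexing used for text tokens, which is the structural alignment the paper is after.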

[LG-51] Model Interpretation and Explainability: Towards Creating Transparency in Prediction Models

链接: https://arxiv.org/abs/2405.20794
作者: Donald Kridel,Jacob Dineen,Daniel Dolk,David Castillo
关键词: model explainability, counterpart in analytical, analytical modeling, XAI, Explainable
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainable AI (XAI) has a counterpart in analytical modeling which we refer to as model explainability. We tackle the issue of model explainability in the context of prediction models. We analyze a dataset of loans from a credit card company and apply three stages: execute and compare four different prediction methods, apply the best-known explainability techniques in the current literature to the model training sets to identify feature importance (FI) (static case), and finally cross-check whether the FI set holds up under what-if prediction scenarios for continuous and categorical variables (dynamic case). We found inconsistency in FI identification between the static and dynamic cases. We summarize the state of the art in model explainability and suggest further research to advance the field.

[LG-52] GS-Phong: Meta-Learned 3D Gaussians for Relightable Novel View Synthesis

链接: https://arxiv.org/abs/2405.20791
作者: Yumeng He,Yunbo Wang,Xiaokang Yang
关键词: Decoupling the illumination, Decoupling, Gaussian points, Abstract, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoupling the illumination in 3D scenes is crucial for novel view synthesis and relighting. In this paper, we propose a novel method for representing a scene illuminated by a point light using a set of relightable 3D Gaussian points. Inspired by the Blinn-Phong model, our approach decomposes the scene into ambient, diffuse, and specular components, enabling the synthesis of realistic lighting effects. To facilitate the decomposition of geometric information independent of lighting conditions, we introduce a novel bilevel optimization-based meta-learning framework. The fundamental idea is to view the rendering tasks under various lighting positions as a multi-task learning problem, which our meta-learning approach effectively addresses by generalizing the learned Gaussian geometries not only across different viewpoints but also across diverse light positions. Experimental results demonstrate the effectiveness of our approach in terms of training efficiency and rendering quality compared to existing methods for free-viewpoint relighting.
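For reference, the Blinn-Phong decomposition the method builds on, as a scalar intensity for a single point light; the coefficient values are illustrative, and GS-Phong learns these quantities per 3D Gaussian.

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def blinn_phong(normal, light_dir, view_dir,
                ambient=0.1, kd=0.6, ks=0.3, shininess=32):
    """Scalar Blinn-Phong intensity: ambient + diffuse + specular, the same
    three-way light decomposition the paper attaches to each Gaussian."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    h = normalize(l + v)                           # halfway vector
    diffuse = kd * max(float(np.dot(n, l)), 0.0)   # Lambertian term
    specular = ks * max(float(np.dot(n, h)), 0.0) ** shininess
    return ambient + diffuse + specular
```

Because the three terms are additive, the ambient and diffuse parts can be optimized independently of the light position, which is what makes the decomposition convenient for relighting.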

[LG-53] Intersectional Unfairness Discovery

链接: https://arxiv.org/abs/2405.20790
作者: Gezheng Xu,Qi Chen,Charles Ling,Boyu Wang,Changjian Shui
关键词: sensitive attributes, intersectional sensitive attributes, attributes, single sensitive attribute, multiple sensitive attributes
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: ICML-2024 Camera-ready

点击查看摘要

Abstract:AI systems have been shown to produce unfair results for certain subgroups of the population, highlighting the need to understand bias on certain sensitive attributes. Current research often falls short, primarily focusing on subgroups characterized by a single sensitive attribute, while neglecting the nature of intersectional fairness of multiple sensitive attributes. This paper focuses on one fundamental aspect by discovering diverse high-bias subgroups under intersectional sensitive attributes. Specifically, we propose a Bias-Guided Generative Network (BGGN). By treating each bias value as a reward, BGGN efficiently generates high-bias intersectional sensitive attributes. Experiments on real-world text and image datasets demonstrate that BGGN discovers diverse high-bias subgroups efficiently. To further evaluate the generated unseen but possibly unfair intersectional sensitive attributes, we formulate them as prompts and use modern generative AI to produce new texts and images. The results of frequently generating biased data provide new insights into discovering potential unfairness in popular modern generative AI systems. Warning: This paper contains generative examples that are offensive in nature.

[LG-54] Improved Generation of Adversarial Examples Against Safety-aligned LLMs

链接: https://arxiv.org/abs/2405.20778
作者: Qizhang Li,Yiwen Guo,Wangmeng Zuo,Hao Chen
关键词: produce harmless content, ensure large language, large language models, adhere to safety, harmless content
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite numerous efforts to ensure large language models (LLMs) adhere to safety standards and produce harmless content, some successes have been achieved in bypassing these restrictions, known as jailbreak attacks against LLMs. Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing jailbreak attacks automatically. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we appropriate the ideologies of effective methods among these transfer-based attacks, i.e., the Skip Gradient Method and Intermediate Level Attack, to improve the effectiveness of automatically generated adversarial examples against white-box LLMs. With appropriate adaptations, we inject these ideologies into gradient-based adversarial prompt generation processes and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that the developed combination achieves a 30% absolute increase in attack success rates compared with GCG for attacking the Llama-2-7B-Chat model on AdvBench.

[LG-55] Black-Box Detection of Language Model Watermarks

链接: https://arxiv.org/abs/2405.20777
作者: Thibaud Gloaguen,Nikola Jovanović,Robin Staab,Martin Vechev
关键词: detect LLM-generated text, LLM-generated text, LLM, watermarking schemes, watermark
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Watermarking has emerged as a promising way to detect LLM-generated text. To apply a watermark, an LLM provider, given a secret key, augments generations with a signal that is later detectable by any party with the same key. Recent work has proposed three main families of watermarking schemes, two of which focus on the property of preserving the LLM distribution. This is motivated by it being a tractable proxy for maintaining LLM capabilities, but also by the idea that concealing a watermark deployment makes it harder for malicious actors to hide misuse by avoiding a certain LLM or attacking its watermark. Yet, despite much discourse around detectability, no prior work has investigated if any of these scheme families are detectable in a realistic black-box setting. We tackle this for the first time, developing rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries. We experimentally confirm the effectiveness of our methods on a range of schemes and a diverse set of open-source models. Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries. We further apply our methods to test for watermark presence behind the most popular public APIs: GPT4, Claude 3, Gemini 1.0 Pro, finding no strong evidence of a watermark at this point in time.

[LG-56] Federated Learning with Blockchain-Enhanced Machine Unlearning: A Trustworthy Approach

链接: https://arxiv.org/abs/2405.20776
作者: Xuhan Zuo,Minghao Wang,Tianqing Zhu,Lefeng Zhang,Shui Yu,Wanlei Zhou
关键词: integrating machine unlearning, user data deletion, integrating machine, regulations and respond, respond to user
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 25 figures

点击查看摘要

Abstract:With the growing need to comply with privacy regulations and respond to user data deletion requests, integrating machine unlearning into IoT-based federated learning has become imperative. Traditional unlearning methods, however, often lack verifiable mechanisms, leading to challenges in establishing trust. This paper delves into the innovative integration of blockchain technology with federated learning to surmount these obstacles. Blockchain fortifies the unlearning process through its inherent qualities of immutability, transparency, and robust security. It facilitates verifiable certification, harmonizes security with privacy, and sustains system efficiency. We introduce a framework that melds blockchain with federated learning, thereby ensuring an immutable record of unlearning requests and actions. This strategy not only bolsters the trustworthiness and integrity of the federated learning model but also adeptly addresses efficiency and security challenges typical in IoT environments. Our key contributions encompass a certification mechanism for the unlearning process, the enhancement of data security and privacy, and the optimization of data management to ensure system responsiveness in IoT scenarios.

[LG-57] Reinforcement Learning for Sociohydrology

链接: https://arxiv.org/abs/2405.20772
作者: Tirthankar Roy,Shivendra Srivastava,Beichen Zhang
关键词: solving sociohydrology problems, reinforcement learning, effective and efficient, efficient framework, framework for solving
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:In this study, we discuss how reinforcement learning (RL) provides an effective and efficient framework for solving sociohydrology problems. The efficacy of RL for these types of problems is evident because of its ability to update policies in an iterative manner - something that is also foundational to sociohydrology, where we are interested in representing the co-evolution of human-water interactions. We present a simple case study to demonstrate the implementation of RL in a problem of runoff reduction through management decisions related to changes in land-use land-cover (LULC). We then discuss the benefits of RL for these types of problems and share our perspectives on the future research directions in this area.
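The iterative policy-update loop the abstract refers to can be sketched with tabular Q-learning on a toy land-management problem. Everything below (states, actions, dynamics, rewards) is an invented illustration, not the paper's case study.

```python
import numpy as np

# Toy tabular Q-learning sketch of a runoff-management loop.
# State: runoff level (0 = low, 1 = high).
# Action: 0 = keep current land use, 1 = convert land to green cover (LULC change).
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))               # Q[state, action]
alpha, gamma, eps = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    """Hypothetical dynamics: conversion reliably lowers runoff but has a
    cost; doing nothing lowers runoff only occasionally."""
    next_state = 0 if (action == 1 or rng.random() < 0.3) else 1
    reward = -1.0 * next_state - 0.2 * action   # penalize runoff and cost
    return next_state, reward

s = 1
for _ in range(20000):
    # epsilon-greedy action selection, then a standard Q-learning update
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

In this toy setting conversion has a better expected immediate reward (-0.2 vs. -0.7), so the learned policy comes to prefer it, illustrating how RL iteratively refines management decisions from interaction.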

[LG-58] Towards Black-Box Membership Inference Attack for Diffusion Models

链接: https://arxiv.org/abs/2405.20771
作者: Jingwei Li,Jing Dong,Tianxing He,Jingzhao Zhang
关键词: important research topic, research topic, train a diffusion, important research, rising popularity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying whether an artwork was used to train a diffusion model is an important research topic, given the rising popularity of AI-generated art and the associated copyright concerns. The work approaches this problem from the membership inference attack (MIA) perspective. We first identify the limitations of applying existing MIA methods for copyright protection: the required access of internal U-nets and the choice of non-member datasets for evaluation. To address the above problems, we introduce a novel black-box membership inference attack method that operates without needing access to the model’s internal U-net. We then construct a DALL-E generated dataset for a more comprehensive evaluation. We validate our method across various setups, and our experimental results outperform previous works.

[LG-59] Avoiding Pitfalls for Privacy Accounting of Subsampled Mechanisms under Composition

链接: https://arxiv.org/abs/2405.20769
作者: Christian Janos Lebeda,Matthew Regehr,Gautam Kamath,Thomas Steinke
关键词: computing tight privacy, differentially private mechanisms, subsampled differentially private, tight privacy guarantees, privacy guarantees
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of computing tight privacy guarantees for the composition of subsampled differentially private mechanisms. Recent algorithms can numerically compute the privacy parameters to arbitrary precision but must be carefully applied. Our main contribution is to address two common points of confusion. First, some privacy accountants assume that the privacy guarantees for the composition of a subsampled mechanism are determined by self-composing the worst-case datasets for the uncomposed mechanism. We show that this is not true in general. Second, Poisson subsampling is sometimes assumed to have similar privacy guarantees compared to sampling without replacement. We show that the privacy guarantees may in fact differ significantly between the two sampling schemes. In particular, we give an example of hyperparameters that result in \varepsilon \approx 1 for Poisson subsampling and \varepsilon > 10 for sampling without replacement. This occurs for some parameters that could realistically be chosen for DP-SGD.

[LG-60] Expanded Gating Ranges Improve Activation Functions

链接: https://arxiv.org/abs/2405.20768
作者: Allen Hao Huang
关键词: Activation functions, deep learning architectures, Activation, self-gated activation functions, core components
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation functions are core components of all deep learning architectures. Currently, the most popular activation functions are smooth ReLU variants like GELU and SiLU. These are self-gated activation functions where the range of the gating function is between zero and one. In this paper, we explore the viability of using arctan as a gating mechanism. A self-gated activation function that uses arctan as its gating function has a monotonically increasing first derivative. To make this activation function competitive, it is necessary to introduce a trainable parameter for every MLP block to expand the range of the gating function beyond zero and one. We find that this technique also improves existing self-gated activation functions. We conduct an empirical evaluation of Expanded ArcTan Linear Unit (xATLU), Expanded GELU (xGELU), and Expanded SiLU (xSiLU) and show that they outperform existing activation functions within a transformer architecture. Additionally, expanded gating ranges show promising results in improving first-order Gated Linear Units (GLU).
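The gating idea can be made concrete with a small numpy sketch. The exact parameterization of the expanded range below is an assumption for illustration, not necessarily the paper's; in practice alpha would be a trainable parameter per MLP block.

```python
import numpy as np

def xatlu(x, alpha=0.1):
    """Sketch of an Expanded ArcTan Linear Unit (xATLU).

    The arctan gate arctan(x)/pi + 0.5 lies in (0, 1), like the sigmoid
    gate of SiLU. A scalar alpha stretches its range to (-alpha, 1 + alpha),
    taking the gating function beyond zero and one.
    """
    gate = np.arctan(x) / np.pi + 0.5          # gate in (0, 1)
    expanded = gate * (1 + 2 * alpha) - alpha  # gate in (-alpha, 1 + alpha)
    return x * expanded
```

With alpha = 0 this reduces to the plain self-gated arctan unit; the expanded range lets the unit slightly overshoot the identity for large inputs and go mildly negative for large negative inputs.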

[LG-61] Improving Generalization and Convergence by Enhancing Implicit Regularization

链接: https://arxiv.org/abs/2405.20763
作者: Mingze Wang,Haotian He,Jinbo Wang,Zilin Wang,Guanhua Huang,Feiyu Xiong,Zhiyu Li,Weinan E,Lei Wu
关键词: Implicit Regularization Enhancement, Regularization Enhancement, Implicit Regularization, propose an Implicit, deep learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 35 pages

点击查看摘要

Abstract:In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that IRE can be practically incorporated with generic base optimizers without introducing significant computational overload. Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a 2× speed-up compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical guarantees, showing that IRE can substantially accelerate the convergence towards flat minima in Sharpness-aware Minimization (SAM).

[LG-62] Share Your Secrets for Privacy! Confidential Forecasting with Vertical Federated Learning

链接: https://arxiv.org/abs/2405.20761
作者: Aditya Shankar,Lydia Y. Chen,Jérémie Decouchant,Dimitra Gkorou,Rihan Hai
关键词: Vertical federated learning, time series forecasting, Vertical federated, Secret-shared Time Series, federated learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Submitted to the 27TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)

点击查看摘要

Abstract:Vertical federated learning (VFL) is a promising area for time series forecasting in industrial applications, such as predictive maintenance and machine control. Critical challenges to address in manufacturing include data privacy and over-fitting on small and noisy datasets during both training and inference. Additionally, to increase industry adaptability, such forecasting models must scale well with the number of parties while ensuring strong convergence and low-tuning complexity. We address those challenges and propose ‘Secret-shared Time Series Forecasting with VFL’ (STV), a novel framework that exhibits the following key features: i) a privacy-preserving algorithm for forecasting with SARIMAX and autoregressive trees on vertically partitioned data; ii) serverless forecasting using secret sharing and multi-party computation; iii) novel N-party algorithms for matrix multiplication and inverse operations for direct parameter optimization, giving strong convergence with minimal hyperparameter tuning complexity. We conduct evaluations on six representative datasets from public and industry-specific contexts. Our results demonstrate that STV’s forecasting accuracy is comparable to those of centralized approaches. They also show that our direct optimization can outperform centralized methods, which include state-of-the-art diffusion models and long-short-term memory, by 23.81% on forecasting accuracy. We also conduct a scalability analysis by examining the communication costs of direct and iterative optimization to navigate the choice between the two. Code and appendix are available: this https URL
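The secret-sharing ingredient can be illustrated in a few lines. This shows plain additive secret sharing over the reals, not STV's full serverless MPC protocol; names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def additive_share(secret, n_parties=3):
    """Split a vector (e.g. a private time series) into n additive shares.

    Each share alone is statistically indistinguishable from noise, but the
    shares sum back exactly to the secret, so parties can jointly compute on
    sums without revealing their inputs.
    """
    shares = [rng.normal(scale=100.0, size=secret.shape)
              for _ in range(n_parties - 1)]
    shares.append(secret - sum(shares))  # last share makes the sum exact
    return shares

series = np.array([10.0, 12.5, 11.0, 13.2])  # one party's private series
shares = additive_share(series)
```

Real MPC frameworks work over finite rings with fixed-point encodings rather than floats, but the reconstruction identity is the same.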

[LG-63] Information Theoretic Text-to-Image Alignment

链接: https://arxiv.org/abs/2405.20759
作者: Chao Wang,Giulio Franzese,Alessandro Finamore,Massimo Gallo,Pietro Michiardi
关键词: tremendous success recently, Diffusion models, success recently, tremendous success, Diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models for Text-to-Image (T2I) conditional generation have seen tremendous success recently. Despite their success, accurately capturing user intentions with these models still requires a laborious trial and error process. This challenge is commonly identified as a model alignment problem, an issue that has attracted considerable attention by the research community. Instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models to steer image generation, in this work we present a novel method that relies on an information-theoretic alignment measure. In a nutshell, our method uses self-supervised fine-tuning and relies on point-wise mutual information between prompts and images to define a synthetic training set to induce model alignment. Our comparative analysis shows that our method is on-par or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI and a lightweight fine-tuning strategy.
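As a reminder of the quantity behind the alignment measure (the paper estimates it with a pre-trained denoising network; this is just the definition, not their estimator):

```python
import numpy as np

def pmi(p_xy, p_x, p_y):
    """Point-wise mutual information log p(x, y) / (p(x) p(y)).

    Positive when a prompt-image pair co-occurs more often than independence
    would predict, zero under independence, negative when the pair is
    anti-correlated.
    """
    return float(np.log(p_xy / (p_x * p_y)))
```

Ranking generated images by PMI with their prompt gives a synthetic preference signal for self-supervised fine-tuning.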

[LG-64] OpenTensor: Reproducing Faster Matrix Multiplication Discovering Algorithms

链接: https://arxiv.org/abs/2405.20748
作者: Yiwen Sun,Wenye Li
关键词: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, DRL, multiplication by Deep
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:OpenTensor is a reproduction of AlphaTensor, which discovered a new algorithm that outperforms the state-of-the-art methods for matrix multiplication by Deep Reinforcement Learning (DRL). While AlphaTensor provides a promising framework for solving scientific problems, it is really hard to reproduce due to the massive tricks and lack of source codes. In this paper, we clean up the algorithm pipeline, clarify the technical details, and make some improvements to the training process. Computational results show that OpenTensor can successfully find efficient matrix multiplication algorithms.
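The objects AlphaTensor (and hence OpenTensor) searches for are low-rank decompositions of the matrix-multiplication tensor. Strassen's classical rank-7 decomposition for 2×2 matrices is the simplest example and is easy to verify:

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications instead of 8
    (Strassen, 1969). Each m_i is one scalar product; AlphaTensor-style
    search looks for analogous decompositions with fewer products for
    larger shapes."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4, m1 - m2 + m3 + m6]])
```

Applied recursively to block matrices, saving one multiplication per level is what lowers the asymptotic exponent below 3.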

[LG-65] Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes

链接: https://arxiv.org/abs/2405.20743
作者: Riccardo Benaglia,Angelo Porrello,Pietro Buzzega,Simone Calderara,Rita Cucchiara
关键词: video surveillance analytics, basketball players engaged, Trajectory forecasting, Quantized Variational Autoencoders, surveillance analytics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
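The two ingredients, nearest-codebook quantization and an instance-level low-rank codebook update, can be sketched as follows; all shapes and names are hypothetical, not the paper's implementation.

```python
import numpy as np

def quantize(z, codebook):
    """VQ-VAE quantization step: snap each latent vector to its nearest
    codebook row under Euclidean distance."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
K, D, r = 8, 4, 2                       # codebook size, latent dim, low rank
codebook = rng.normal(size=(K, D))

# Instance-level low-rank update: a rank-r adjustment customizes the
# codebook for one example (U, V would be predicted from the observed
# past trajectory in the actual model).
U, V = rng.normal(size=(K, r)), rng.normal(size=(D, r))
codebook_inst = codebook + U @ V.T      # rank-r customization

z = rng.normal(size=(5, D))             # latents for one trajectory
z_q, idx = quantize(z, codebook_inst)
```

Restricting the customization to rank r keeps the per-instance adaptation in a low-dimensional space while still letting the discretization reflect context.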

[LG-66] Federated Random Forest for Partially Overlapping Clinical Data

链接: https://arxiv.org/abs/2405.20738
作者: Youngjun Park,Cord Eric Schmidt,Benedikt Marcel Batton,Anne-Christin Hauschild
关键词: pose huge challenges, federated random forest, federated random, data protection regulations, large-scale data analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the healthcare sector, a consciousness surrounding data privacy and corresponding data protection regulations, as well as heterogeneous and non-harmonized data, pose huge challenges to large-scale data analysis. Moreover, clinical data often involves partially overlapping features, as some observations may be missing due to various reasons, such as differences in procedures, diagnostic tests, or other recorded patient history information across hospitals or institutes. To address the challenges posed by partially overlapping features and incomplete data in clinical datasets, a comprehensive approach is required. Particularly in the domain of medical data, promising outcomes are achieved by federated random forests whenever features align. However, for most standard algorithms, like random forest, it is essential that all data sets have identical parameters. Therefore, in this work the concept of federated random forest is adapted to a setting with partially overlapping features. Moreover, our research assesses the effectiveness of the newly developed federated random forest models for partially overlapping clinical data. For aggregating the federated, globally optimized model, only features available locally at each site can be used. We tackled two issues in federation: (i) the quantity of involved parties, (ii) the varying overlap of features. This evaluation was conducted across three clinical datasets. The federated random forest model even in cases where only a subset of features overlaps consistently demonstrates superior performance compared to its local counterpart. This holds true across various scenarios, including datasets with imbalanced classes. Consequently, federated random forests for partially overlapped data offer a promising solution to transcend barriers in collaborative research and corporate cooperation.

[LG-67] Maximum Temperature Prediction Using Remote Sensing Data Via Convolutional Neural Network

链接: https://arxiv.org/abs/2405.20731
作者: Lorenzo Innocenti,Giacomo Blanco,Luca Barco,Claudio Rossi
关键词: pose significant threats, specific zones exhibiting, zones exhibiting substantially, exhibiting substantially higher, Urban heat islands
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages, submitted to IEEE MetroLivEnv 2024 conference

点击查看摘要

Abstract:Urban heat islands, defined as specific zones exhibiting substantially higher temperatures than their immediate environs, pose significant threats to environmental sustainability and public health. This study introduces a novel machine-learning model that amalgamates data from the Sentinel-3 satellite, meteorological predictions, and additional remote sensing inputs. The primary aim is to generate detailed spatiotemporal maps that forecast the peak temperatures within a 24-hour period in Turin. Experimental results validate the model’s proficiency in predicting temperature patterns, achieving a Mean Absolute Error (MAE) of 2.09 degrees Celsius for the year 2023 at a resolution of 20 meters per pixel, thereby enriching our knowledge of urban climatic behavior. This investigation enhances the understanding of urban microclimates, emphasizing the importance of cross-disciplinary data integration, and laying the groundwork for informed policy-making aimed at alleviating the negative impacts of extreme urban temperatures.

[LG-68] Learning on Large Graphs using Intersecting Communities

链接: https://arxiv.org/abs/2405.20724
作者: Ben Finkelshtein,İsmail İlkan Ceylan,Michael Bronstein,Ron Levie
关键词: Passing Neural Networks, Message Passing Neural, Neural Networks, Passing Neural, Message Passing
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Message Passing Neural Networks (MPNNs) are a staple of graph machine learning. MPNNs iteratively update each node’s representation in an input graph by aggregating messages from the node’s neighbors, which necessitates a memory complexity of the order of the number of graph edges. This complexity might quickly become prohibitive for large graphs provided they are not very sparse. In this paper, we propose a novel approach to alleviate this problem by approximating the input graph as an intersecting community graph (ICG) – a combination of intersecting cliques. The key insight is that the number of communities required to approximate a graph does not depend on the graph size. We develop a new constructive version of the Weak Graph Regularity Lemma to efficiently construct an approximating ICG for any input graph. We then devise an efficient graph learning algorithm operating directly on ICG in linear memory and time with respect to the number of nodes (rather than edges). This offers a new and fundamentally different pipeline for learning on very large non-sparse graphs, whose applicability is demonstrated empirically on node classification tasks and spatio-temporal data processing.

[LG-69] Cyclic image generation using chaotic dynamics

链接: https://arxiv.org/abs/2405.20717
作者: Takaya Tanaka,Yutaka Yamaguti
关键词: cyclic transformations, transformations is demonstrated, demonstrated by extending, extending the CycleGAN, generated image sequences
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注:

点击查看摘要

Abstract:Successive image generation using cyclic transformations is demonstrated by extending the CycleGAN model to transform images among three different categories. Repeated application of the trained generators produces sequences of images that transition among the different categories. The generated image sequences occupy a more limited region of the image space compared with the original training dataset. Quantitative evaluation using precision and recall metrics indicates that the generated images have high quality but reduced diversity relative to the training dataset. Such successive generation processes are characterized as chaotic dynamics in terms of dynamical system theory. Positive Lyapunov exponents estimated from the generated trajectories confirm the presence of chaotic dynamics, with the Lyapunov dimension of the attractor found to be comparable to the intrinsic dimension of the training data manifold. The results suggest that chaotic dynamics in the image space defined by the deep generative model contribute to the diversity of the generated images, constituting a novel approach for multi-class image generation. This model can be interpreted as an extension of classical associative memory to perform hetero-association among image categories.
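A positive largest Lyapunov exponent can be estimated from a trajectory by averaging the log of the local stretching rate. A minimal one-dimensional example on the logistic map (not the paper's image-space setting), where the exact value at r = 4 is ln 2:

```python
import numpy as np

def lyapunov_logistic(r=4.0, n=100_000, x0=0.3):
    """Largest Lyapunov exponent of the logistic map x -> r*x*(1-x),
    estimated as the mean log absolute derivative |r*(1 - 2x)| along
    the orbit."""
    x, total = x0, 0.0
    for _ in range(n):
        total += np.log(abs(r * (1.0 - 2.0 * x)))
        x = r * x * (1.0 - x)
    return total / n
```

In the paper's setting the map is the composition of learned generators acting on image space, so the exponents must be estimated from pairs of nearby generated trajectories rather than from an analytic derivative.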

[LG-70] In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought

链接: https://arxiv.org/abs/2405.20692
作者: Sili Huang,Jifeng Hu,Hechang Chen,Lichao Sun,Bo Yang
关键词: offline reinforcement learning, providing task prompts, reinforcement learning, promising approach, approach for offline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In-context learning is a promising approach for offline reinforcement learning (RL) to handle online tasks, which can be achieved by providing task prompts. Recent works demonstrated that in-context RL could emerge with self-improvement in a trial-and-error manner when treating RL tasks as an across-episodic sequential prediction problem. Despite the self-improvement not requiring gradient updates, current works still suffer from high computational costs when the across-episodic sequence increases with task horizons. To this end, we propose an In-context Decision Transformer (IDT) to achieve self-improvement in a high-level trial-and-error manner. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and thus reconstructs the sequence to consist of high-level decisions instead of low-level actions that interact with environments. As one high-level decision can guide multi-step low-level actions, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art in long-horizon tasks over current in-context RL methods. In particular, the online evaluation time of our IDT is 36× faster than baselines in the D4RL benchmark and 27× faster in the Grid World benchmark.

[LG-71] Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

链接: https://arxiv.org/abs/2405.20690
作者: Hengrui Zhang,Liancheng Fang,Philip S. Yu
关键词: missing data, missing data imputation, paper introduces DiffPuter, diffusion model, data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces DiffPuter, an iterative method for missing data imputation that leverages the Expectation-Maximization (EM) algorithm and Diffusion Models. By treating missing data as hidden variables that can be updated during model training, we frame the missing data imputation task as an EM problem. During the M-step, DiffPuter employs a diffusion model to learn the joint distribution of both the observed and currently estimated missing data. In the E-step, DiffPuter re-estimates the missing data based on the conditional probability given the observed data, utilizing the diffusion model learned in the M-step. Starting with an initial imputation, DiffPuter alternates between the M-step and E-step until convergence. Through this iterative process, DiffPuter progressively refines the complete data distribution, yielding increasingly accurate estimations of the missing data. Our theoretical analysis demonstrates that the unconditional training and conditional sampling processes of the diffusion model align precisely with the objectives of the M-step and E-step, respectively. Empirical evaluations across 10 diverse datasets and comparisons with 16 different imputation methods highlight DiffPuter’s superior performance. Notably, DiffPuter achieves an average improvement of 8.10% in MAE and 5.64% in RMSE compared to the most competitive existing method.
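The alternating M-step/E-step loop can be sketched with a multivariate Gaussian standing in for DiffPuter's diffusion model: the M-step fits the model to the currently completed data, and the E-step replaces missing entries by their conditional expectation given the observed ones. This is an illustrative stand-in, not the paper's method.

```python
import numpy as np

def em_impute(X, n_iters=20):
    """EM-style imputation with a Gaussian model.

    M-step: fit (mu, Sigma) to the completed data.
    E-step: set each missing block to E[x_m | x_o] under that Gaussian.
    """
    X = X.copy()
    mask = np.isnan(X)
    # Initial imputation: per-column means of the observed entries.
    X[mask] = np.take(np.nanmean(X, axis=0), np.where(mask)[1])
    for _ in range(n_iters):
        mu = X.mean(axis=0)                                   # M-step
        Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in np.where(mask.any(axis=1))[0]:               # E-step
            m, o = mask[i], ~mask[i]
            # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            sol = np.linalg.solve(Sigma[np.ix_(o, o)], X[i, o] - mu[o])
            X[i, m] = mu[m] + Sigma[np.ix_(m, o)] @ sol
    return X
```

DiffPuter replaces the Gaussian with a diffusion model, so the M-step becomes unconditional diffusion training and the E-step becomes conditional sampling, but the alternation is the same.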

[LG-72] Conditioning GAN Without Training Dataset

链接: https://arxiv.org/abs/2405.20687
作者: Kidist Amde Mekonnen
关键词: Deep learning algorithms, training dataset, Deep learning, Training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 5 pages, 2 figures, Part of my MSc project course, School Project Course 2022

点击查看摘要

Abstract:Deep learning algorithms have a large number of trainable parameters, often hundreds of thousands or more. Training such algorithms requires a large amount of training data, and generating a sufficiently large dataset for them is costly [noguchi2019image]. GANs are generative neural networks that use two competing deep learning networks: a generator and a discriminator. The generator tries to generate realistic images that resemble the actual training dataset by approximating the training data distribution, while the discriminator is trained to classify images as real or fake (generated) [goodfellow2016nips]. Training these GAN algorithms also requires a large training dataset [noguchi2019image]. In this study, the aim is to address the question: “Given an unconditioned pretrained generator network and a pretrained classifier, is it feasible to develop a conditioned generator without relying on any training dataset?” The paper begins with a general introduction to the problem. The subsequent sections are structured as follows: Section 2 provides background information on the problem. Section 3 reviews relevant literature on the topic. Section 4 outlines the methodology employed in this study. Section 5 presents the experimental results. Section 6 discusses the findings and proposes potential future research directions. Finally, Section 7 offers concluding remarks. The implementation can be accessed at: this https URL

[LG-73] Enhancing Counterfactual Image Generation Using Mahalanobis Distance with Distribution Preferences in Feature Space

链接: https://arxiv.org/abs/2405.20685
作者: Yukai Zhang,Ao Xu,Zihao Li,Tieru Wu
关键词: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, Intelligence, Artificial
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the realm of Artificial Intelligence (AI), the importance of Explainable Artificial Intelligence (XAI) is increasingly recognized, particularly as AI models become more integral to our lives. One notable single-instance XAI approach is counterfactual explanation, which aids users in comprehending a model’s decisions and offers guidance on altering these decisions. Specifically in the context of image classification models, effective image counterfactual explanations can significantly enhance user understanding. This paper introduces a novel method for computing feature importance within the feature space of a black-box model. By employing information fusion techniques, our method maximizes the use of data to address feature counterfactual explanations in the feature space. Subsequently, we utilize an image generation model to transform these feature counterfactual explanations into image counterfactual explanations. Our experiments demonstrate that the counterfactual explanations generated by our method closely resemble the original images in both pixel and feature spaces. Additionally, our method outperforms established baselines, achieving impressive experimental results.
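The distance underlying the method is the Mahalanobis distance, which the title pairs with distribution preferences in feature space. As a generic reminder of the quantity (not the paper's full procedure):

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a distribution N(mean, cov).

    Deviations along low-variance feature directions cost more than under
    the Euclidean metric, which is what keeps perturbed feature vectors
    (e.g. counterfactual candidates) close to the data distribution.
    """
    x, mean = np.asarray(x, dtype=float), np.asarray(mean, dtype=float)
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)))
```

With an identity covariance this reduces to the Euclidean distance; with anisotropic covariance it penalizes moves off the data manifold.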

[LG-74] No-Regret Learning for Fair Multi-Agent Social Welfare Optimization

链接: https://arxiv.org/abs/2405.20678
作者: Mengxiao Zhang,Ramiro Deo-Campo Vuong,Haipeng Luo
关键词: online multi-agent Nash, multi-agent Nash social, Nash social welfare, Nash social, multi-agent Nash
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of online multi-agent Nash social welfare (NSW) maximization. While previous works of Hossain et al. [2021], Jones et al. [2023] study similar problems in stochastic multi-agent multi-armed bandits and show that \sqrt{T}-regret is possible after T rounds, their fairness measure is the product of all agents’ rewards, instead of their NSW (that is, their geometric mean). Given the fundamental role of NSW in the fairness literature, it is more than natural to ask whether no-regret fair learning with NSW as the objective is possible. In this work, we provide a complete answer to this question in various settings. Specifically, in stochastic N-agent K-armed bandits, we develop an algorithm with \widetilde{\mathcal{O}}\left(K^{\frac{2}{N}} T^{\frac{N-1}{N}}\right) regret and prove that the dependence on T is tight, making it a sharp contrast to the \sqrt{T}-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We then consider a more challenging version of the problem with adversarial rewards. Somewhat surprisingly, despite NSW being a concave function, we prove that no algorithm can achieve sublinear regret. To circumvent such negative results, we further consider a setting with full-information feedback and design two algorithms with \sqrt{T}-regret: the first one has no dependence on N at all and is applicable to not just NSW but a broad class of welfare functions, while the second one has better dependence on K and is preferable when N is small. Finally, we also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
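The objective itself, the geometric mean of the agents' rewards, is a one-liner, and computing it makes the contrast with the plain product concrete:

```python
import numpy as np

def nash_social_welfare(rewards):
    """NSW of a reward vector: the geometric mean of the agents' rewards.

    Unlike the plain product, it is on the same scale regardless of the
    number of agents; like the product, a single near-zero reward drags
    the whole objective down, which is what enforces fairness.
    """
    r = np.asarray(rewards, dtype=float)
    return float(np.exp(np.log(r).mean()))
```

For example, spreading reward evenly over agents beats concentrating it: NSW([3, 3, 3]) = 3, while NSW([0.01, 10, 10]) = 1.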

[LG-75] Provably Efficient Interactive-Grounded Learning with Personalized Reward

Link: https://arxiv.org/abs/2405.20677
Authors: Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
Keywords: maximizing unobservable rewards, observing reward-dependent feedback, powerful framework, learner aims, aims at maximizing
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims at maximizing unobservable rewards through interacting with an environment and observing reward-dependent feedback on the taken actions. To deal with personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator which underestimates the true reward and enjoys favorable generalization performances. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.

[LG-76] Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling

Link: https://arxiv.org/abs/2405.20675
Authors: Kidist Amde Mekonnen, Nicola Dall’Asen, Paolo Rota
Keywords: image synthesis tasks, Diffusion Probabilistic Models, achieving remarkable performance, Diffusion Probabilistic, Probabilistic Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki, Finland

Click to view abstract

Abstract:Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models, achieving remarkable performance in image synthesis tasks. However, these models face challenges in terms of widespread adoption due to their reliance on sequential denoising steps during sample generation. This dependence leads to substantial computational requirements, making them unsuitable for resource-constrained or real-time processing systems. To address these challenges, we propose a novel method that integrates denoising phases directly into the model’s architecture, thereby reducing the need for resource-intensive computations. Our approach combines diffusion models with generative adversarial networks (GANs) through knowledge distillation, enabling more efficient training and evaluation. By utilizing a pre-trained diffusion model as a teacher model, we train a student model through adversarial learning, employing layerwise transformations for denoising and submodules for predicting the teacher model’s output at various points in time. This integration significantly reduces the number of parameters and denoising steps required, leading to improved sampling speed at test time. We validate our method with extensive experiments, demonstrating comparable performance with reduced computational requirements compared to existing approaches. By enabling the deployment of diffusion models on resource-constrained devices, our research mitigates their computational burden and paves the way for wider accessibility and practical use across the research community and end-users. Our code is publicly available at this https URL

[LG-77] Position Coupling: Leveraging Task Structure for Improved Length Generalization of Transformers

Link: https://arxiv.org/abs/2405.20671
Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun
Keywords: simple arithmetic tasks, encountered during training, longer sequences, Transformer, position
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 73 pages, 20 figures, 90 tables

Click to view abstract

Abstract:Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more “relevant” tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as addition with multiple summands, N×2 multiplication, copy/reverse, and a two-dimensional task.
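A minimal sketch of the position-ID assignment described above, under the assumption that IDs simply count digit significance from the least-significant end; the paper's exact scheme (offsets, special tokens for operators) may differ:

```python
def coupled_position_ids(num: str) -> list[int]:
    """Position IDs for one operand of an addition: digits of the same
    significance receive the same ID regardless of operand length,
    so the ones digit always gets ID 1, the tens digit ID 2, etc."""
    n = len(num)
    return [n - i for i in range(n)]  # most-significant digit first

# Two operands of the addition 1234 + 56:
print(coupled_position_ids("1234"))  # [4, 3, 2, 1]
print(coupled_position_ids("56"))    # [2, 1]
# The ones digits '4' and '6' share ID 1, and the tens digits '3' and '5'
# share ID 2, coupling "relevant" tokens across operands of different lengths.
```

Because the IDs encode significance rather than absolute sequence position, the same scheme extends to operands longer than any seen in training, which is the intuition behind the reported length generalization.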

[LG-78] Weak Robust Compatibility Between Learning Algorithms and Counterfactual Explanation Generation Algorithms

Link: https://arxiv.org/abs/2405.20664
Authors: Ao Xu, Tieru Wu
Keywords: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, method for Explainable, Counterfactual explanation generation
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Counterfactual explanation generation is a powerful method for Explainable Artificial Intelligence. It can help users understand why machine learning models make specific decisions, and how to change those decisions. Evaluating the robustness of counterfactual explanation algorithms is therefore crucial. Previous literature has widely studied the robustness based on the perturbation of input instances. However, the robustness defined from the perspective of perturbed instances is sometimes biased, because this definition ignores the impact of learning algorithms on robustness. In this paper, we propose a more reasonable definition, Weak Robust Compatibility, based on the perspective of explanation strength. In practice, we propose WRC-Test to help us generate more robust counterfactuals. Meanwhile, we designed experiments to verify the effectiveness of WRC-Test. Theoretically, we introduce the concepts of PAC learning theory and define the concept of PAC WRC-Approximability. Based on reasonable assumptions, we establish oracle inequalities about weak robustness, which gives a sufficient condition for PAC WRC-Approximability.
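As background for what a counterfactual explanation is, here is a toy sketch (not the paper's WRC-Test procedure) that computes a minimal-perturbation counterfactual for a linear classifier, where a closed form exists:

```python
import numpy as np

def linear_counterfactual(x, w, b):
    """Smallest L2 perturbation moving x across the boundary w·x + b = 0.

    For a linear classifier, the minimal change is the projection of x
    onto the decision hyperplane, plus a small overshoot to flip the label.
    """
    margin = np.dot(w, x) + b
    delta = -(margin / np.dot(w, w)) * w  # step onto the hyperplane
    return x + 1.01 * delta               # 1% overshoot past the boundary

x = np.array([2.0, 1.0])
w = np.array([1.0, -1.0])
b = 0.0
x_cf = linear_counterfactual(x, w, b)
# The original point is classified positive, the counterfactual negative:
print(np.sign(np.dot(w, x) + b), np.sign(np.dot(w, x_cf) + b))  # 1.0 -1.0
```

The robustness question the paper raises is, informally, how stable such counterfactuals remain when the underlying model is retrained, which this closed-form toy of course does not capture.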

[LG-79] Sign is Not a Remedy: Multiset-to-Multiset Message Passing for Learning on Heterophilic Graphs

Link: https://arxiv.org/abs/2405.20652
Authors: Langzhang Liang, Sunwoo Kim, Kijung Shin, Zenglin Xu, Shirui Pan, Yuan Qi
Keywords: Graph Neural Networks, Neural Networks, homophilic graph-structured data, gained significant attention, Graph Neural
Subjects: Machine Learning (cs.LG)
Comments: Published as a conference paper at ICML 2024

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have gained significant attention as a powerful modeling and inference method, especially for homophilic graph-structured data. To empower GNNs in heterophilic graphs, where adjacent nodes exhibit dissimilar labels or features, Signed Message Passing (SMP) has been widely adopted. However, there is a lack of theoretical and empirical analysis regarding the limitations of SMP. In this work, we unveil some potential pitfalls of SMP and their remedies. We first identify two limitations of SMP: undesirable representation update for multi-hop neighbors and vulnerability against oversmoothing issues. To overcome these challenges, we propose a novel message passing function called Multiset-to-Multiset GNN (M2M-GNN). Our theoretical analyses and extensive experiments demonstrate that M2M-GNN effectively alleviates the aforementioned limitations of SMP, yielding superior performance.

[LG-80] Reward-based Input Construction for Cross-document Relation Extraction

Link: https://arxiv.org/abs/2405.20649
Authors: Byeonghu Na, Suhyeon Jo, Yeongmin Kim, Il-Chul Moon
Keywords: natural language processing, aiming to identify, entities in text, fundamental task, task in natural
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ACL 2024 main conference

Click to view abstract

Abstract:Relation extraction (RE) is a fundamental task in natural language processing, aiming to identify relations between target entities in text. While many RE methods are designed for a single sentence or document, cross-document RE has emerged to address relations across multiple long documents. Given the nature of long documents in cross-document RE, extracting document embeddings is challenging due to the length constraints of pre-trained language models. Therefore, we propose REward-based Input Construction (REIC), the first learning-based sentence selector for cross-document RE. REIC extracts sentences based on relational evidence, enabling the RE module to effectively infer relations. Since supervision of evidence sentences is generally unavailable, we train REIC using reinforcement learning with RE prediction scores as rewards. Experimental results demonstrate the superiority of our method over heuristic methods for different RE structures and backbones in cross-document RE. Our code is publicly available at this https URL.

[LG-81] Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization

Link: https://arxiv.org/abs/2405.20648
Authors: Richard Luo, Austin Peng, Adithya Vasudev, Rishabh Jain
Keywords: poses substantial challenges, information-dense medium, increasingly prominent, prominent and information-dense, poses substantial
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence where multiple data streams of information (such as visual and auditory data) must be processed simultaneously. Comprehension of the entire video requires not only understanding the visual-audio information of each shot but also requires that the model links the ideas between each shot to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos’ more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck Holmes. By leveraging better pretraining and data collection strategies, we extend the abilities of existing small LLVMs from being able to understand a picture to being able to understand a sequence of frames. Specifically, we show that Shotluck Holmes achieves better performance than state-of-the-art results on the Shot2Story video captioning and summary task with significantly smaller and more computationally efficient models.

[LG-82] Principal-Agent Multitasking: the Uniformity of Optimal Contracts and its Efficient Learning via Instrumental Regression

Link: https://arxiv.org/abs/2405.20642
Authors: Shiliang Zuo
Keywords: multitasking principal-agent problem, optimal contract, studies the multitasking, multitasking principal-agent, contract
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:This work studies the multitasking principal-agent problem. I first show a “uniformity” result. Specifically, when the tasks are perfect substitutes, and the agent’s cost function is homogeneous to a certain degree, then the optimal contract only depends on the marginal utility of each task and the degree of homogeneity. I then study a setting where the marginal utility of each task is unknown so that the optimal contract must be learned or estimated with observational data. I identify this problem as a regression problem with measurement errors and observe that this problem can be cast as an instrumental regression problem. The current work observes that both the contract and the repeated observations (when available) can act as valid instrumental variables, and proposes using the generalized method of moments estimator to compute an approximately optimal contract from offline data. I also study an online setting and show how the optimal contract can be efficiently learned in an online fashion using the two estimators. Here the principal faces an exploration-exploitation tradeoff: she must experiment with new contracts and observe their outcome whilst at the same time ensuring her experimentations are not deviating too much from the optimal contract. This work shows when repeated observations are available and agents are sufficiently “diverse”, the principal can achieve a very low $\widetilde{O}(d)$ cumulative utility loss, even with a “pure exploitation” algorithm.
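The instrumental-regression reduction can be illustrated with a generic simulated example. This is textbook instrumental-variable estimation, not the paper's GMM estimator, and all variable names are illustrative: the instrument `z` plays the role of the posted contract, and `u` is the measurement error that biases ordinary regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                  # instrument (e.g. the posted contract)
u = rng.normal(size=n)                  # unobserved error in the regressor
x = z + u + 0.1 * rng.normal(size=n)    # observed regressor, corrupted by u
beta_true = 2.0
y = beta_true * x - u + 0.1 * rng.normal(size=n)  # u also enters the outcome

# OLS is biased: x is correlated with the residual (-u).
beta_ols = (x @ y) / (x @ x)
# The IV estimator replaces x'x with z'x, exploiting that z is
# correlated with x but independent of u.
beta_iv = (z @ y) / (z @ x)
print(beta_ols, beta_iv)  # OLS lands well below 2.0, IV close to 2.0
```

With an exogenous instrument, `beta_iv` is a consistent estimate of the true coefficient even though the regressor is measured with error, which is the property the paper leverages to estimate marginal utilities from offline contract data.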

[LG-83] Heterophilous Distribution Propagation for Graph Neural Networks

Link: https://arxiv.org/abs/2405.20640
Authors: Zhuonan Zheng, Sheng Zhou, Hongjia Xu, Ming Gu, Yilun Xu, Ao Li, Yuhong Li, Jingjun Gu, Jiajun Bu
Keywords: Graph Neural Networks, achieved remarkable success, graph mining tasks, Neural Networks, Graph Neural
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have achieved remarkable success in various graph mining tasks by aggregating information from neighborhoods for representation learning. The success relies on the homophily assumption that nearby nodes exhibit similar behaviors, while it may be violated in many real-world graphs. Recently, heterophilous graph neural networks (HeterGNNs) have attracted increasing attention by modifying the neural message passing schema for heterophilous neighborhoods. However, they suffer from insufficient neighborhood partition and heterophily modeling, both of which are critical but challenging to break through. To tackle these challenges, in this paper, we propose heterophilous distribution propagation (HDP) for graph neural networks. Instead of aggregating information from all neighborhoods, HDP adaptively separates the neighbors into homophilous and heterophilous parts based on the pseudo assignments during training. The heterophilous neighborhood distribution is learned with orthogonality-oriented constraint via a trusted prototype contrastive learning paradigm. Both the homophilous and heterophilous patterns are propagated with a novel semantic-aware message passing mechanism. We conduct extensive experiments on 9 benchmark datasets with different levels of homophily. Experimental results show that our method outperforms representative baselines on heterophilous datasets.

[LG-84] Stochastic Optimal Control for Diffusion Bridges in Function Spaces

Link: https://arxiv.org/abs/2405.20630
Authors: Byoungwoo Park, Jungwon Choi, Sungbin Lim, Juho Lee
Keywords: Recent advancements, bridges primarily focus, real-world problems necessitate, problems necessitate operations, interpretable formulations
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advancements in diffusion models and diffusion bridges primarily focus on finite-dimensional spaces, yet many real-world problems necessitate operations in infinite-dimensional function spaces for more natural and interpretable formulations. In this paper, we present a theory of stochastic optimal control (SOC) tailored to infinite-dimensional spaces, aiming to extend diffusion-based algorithms to function spaces. Specifically, we demonstrate how Doob’s $h$-transform, the fundamental tool for constructing diffusion bridges, can be derived from the SOC perspective and expanded to infinite dimensions. This expansion presents a challenge, as infinite-dimensional spaces typically lack closed-form densities. Leveraging our theory, we establish that solving the optimal control problem with a specific objective function choice is equivalent to learning diffusion-based generative models. We propose two applications: (1) learning bridges between two infinite-dimensional distributions and (2) generative models for sampling from an infinite-dimensional distribution. Our approach proves effective for diverse problems involving continuous function space representations, such as resolution-free images, time-series data, and probability density functions.
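For reference, the standard finite-dimensional form of Doob's $h$-transform that the paper generalizes to function spaces is the following (a textbook statement, not the paper's infinite-dimensional version):

```latex
% Conditioning the diffusion
%   dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t
% to hit x_T at time T adds a guiding term to the drift:
\mathrm{d}X_t
  = \Bigl[\, b(X_t)
      + \sigma(X_t)\sigma(X_t)^{\top}\,
        \nabla_x \log h(t, X_t) \Bigr]\,\mathrm{d}t
    + \sigma(X_t)\,\mathrm{d}W_t,
\qquad
h(t, x) = p(T, x_T \mid t, x),
```

where $p$ denotes the transition density of the original process. The challenge the abstract highlights is that in infinite dimensions such densities generally do not exist in closed form, which is what the SOC formulation is designed to circumvent.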

[LG-85] Prune at the Clients Not the Server: Accelerated Sparse Training in Federated Learning

Link: https://arxiv.org/abs/2405.20623
Authors: Georg Meinhardt, Kai Yi, Laurent Condat, Peter Richtárik
Keywords: Federated Learning, paradigm of Federated, local data private, multiple clients train, train a shared
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:In the recent paradigm of Federated Learning (FL), multiple clients train a shared model while keeping their local data private. Resource constraints of clients and communication costs pose major problems for training large models in FL. On the one hand, addressing the resource limitations of the clients, sparse training has proven to be a powerful tool in the centralized setting. On the other hand, communication costs in FL can be addressed by local training, where each client takes multiple gradient steps on its local data. Recent work has shown that local training can provably achieve the optimal accelerated communication complexity [Mishchenko et al., 2022]. Hence, one would like an accelerated sparse training algorithm. In this work we show that naive integration of sparse training and acceleration at the server fails, and how to fix it by letting the clients perform these tasks appropriately. We introduce Sparse-ProxSkip, our method developed for the nonconvex setting, inspired by RandProx [Condat and Richtárik, 2022], which provably combines sparse training and acceleration in the convex setting. We demonstrate the good performance of Sparse-ProxSkip in extensive experiments.
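A heavily simplified sketch of the "prune at the clients" idea: each client interleaves local gradient steps with top-k magnitude sparsification, so dense models never leave the client. This illustrates the setting only; it is not the Sparse-ProxSkip algorithm, and the toy quadratic objective is an assumption for demonstration.

```python
import numpy as np

def topk_sparsify(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]
    out[idx] = w[idx]
    return out

def local_sparse_steps(w, grad_fn, lr=0.1, steps=5, k=3):
    """Client-side loop: several local gradient steps, pruning after each,
    so only a sparse model would ever be communicated to the server."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        w = topk_sparsify(w, k)
    return w

# Toy local objective 0.5 * ||w - target||^2 with gradient w - target:
target = np.array([3.0, 0.1, -2.0, 0.05, 1.0])
w = local_sparse_steps(np.zeros(5), lambda w: w - target)
print(w)  # mass concentrates on the 3 largest-magnitude coordinates of target
```

The abstract's point is that composing local steps and sparsification naively with server-side acceleration fails; the paper's contribution is an arrangement of exactly these client-side operations that provably preserves acceleration.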

[LG-86] Superfast Selection for Decision Tree Algorithms

Link: https://arxiv.org/abs/2405.20622
Authors: Huaduo Wang, Gopal Gupta
Keywords: called Superfast Selection, Superfast Selection, integrating Superfast Selection, standard selection methods, Ultrafast Decision Tree
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a novel and systematic method, called Superfast Selection, for selecting the “optimal split” for decision tree and feature selection algorithms over tabular data. The method speeds up split selection on a single feature by lowering the time complexity, from O(MN) (using the standard selection methods) to O(M), where M represents the number of input examples and N the number of unique values. Additionally, the need for pre-encoding, such as one-hot or integer encoding, for feature value heterogeneity is eliminated. To demonstrate the efficiency of Superfast Selection, we empower the CART algorithm by integrating Superfast Selection into it, creating what we call Ultrafast Decision Tree (UDT). This enhancement enables UDT to complete the training process with a time complexity O(KM log M) (K is the number of features). Additionally, the Training Only Once Tuning enables UDT to avoid the repetitive training process required to find the optimal hyper-parameter. Experiments show that the UDT can finish a single training on KDD99-10% dataset (494K examples with 41 features) within 1 second and tuning with 214.8 sets of hyper-parameters within 0.25 second on a laptop.
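One plausible way to achieve split selection in a single O(M) pass, sketched here as a guess at the mechanism rather than the paper's exact algorithm: accumulate per-value label counts in one sweep, then score each unique value as a candidate split. The Gini-style scoring below is an illustrative choice.

```python
from collections import defaultdict

def best_split_single_pass(values, labels):
    """One O(M) pass aggregates binary-label counts per unique feature value;
    each value is then scored as an equality split in O(N) total."""
    per_value = defaultdict(lambda: [0, 0])
    total = [0, 0]
    for v, y in zip(values, labels):
        per_value[v][y] += 1
        total[y] += 1

    def weighted_gini(c):
        m = c[0] + c[1]
        if m == 0:
            return 0.0
        p = c[0] / m
        return 2 * p * (1 - p) * m  # impurity weighted by partition size

    return min(
        per_value,
        key=lambda v: weighted_gini(per_value[v])
        + weighted_gini([total[0] - per_value[v][0],
                         total[1] - per_value[v][1]]),
    )

vals = ["a", "a", "b", "b", "c"]
ys = [0, 0, 1, 1, 0]
print(best_split_single_pass(vals, ys))  # 'b' separates the two classes
```

The aggregation pass is O(M) and the scoring pass O(N), avoiding the O(MN) cost of evaluating every candidate split against every example; hashing raw values directly also suggests how pre-encoding could be skipped, as the abstract claims.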

[LG-87] “Forgetting” in Machine Learning and Beyond: A Survey

Link: https://arxiv.org/abs/2405.20620
Authors: Alyssa Shuang Sha, Bernardo Pereira Nunes, Armin Haller
Keywords: drawing insights, preventing overfitting, investigates the multifaceted, multifaceted nature, insights from neuroscientific
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This survey investigates the multifaceted nature of forgetting in machine learning, drawing insights from neuroscientific research that posits forgetting as an adaptive function rather than a defect, enhancing the learning process and preventing overfitting. This survey focuses on the benefits of forgetting and its applications across various machine learning sub-fields that can help improve model performance and enhance data privacy. Moreover, the paper discusses current challenges, future directions, and ethical considerations regarding the integration of forgetting mechanisms into machine learning models.

[LG-88] Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code

Link: https://arxiv.org/abs/2405.20611
Authors: Gary A. McCully, John D. Hastings, Shengjie Xu, Adam Fortier
Keywords: high-level code structures, lost high-level code, architectural dependencies, optimization options, challenging due
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 8 pages, 0 figures, IEEE 4th Cyber Awareness and Research Symposium 2024 (CARS’24)

Click to view abstract

Abstract:Detecting vulnerabilities within compiled binaries is challenging due to lost high-level code structures and other factors such as architectural dependencies, compilers, and optimization options. To address these obstacles, this research explores vulnerability detection by using natural language processing (NLP) embedding techniques with word2vec, BERT, and RoBERTa to learn semantics from intermediate representation (LLVM) code. Long short-term memory (LSTM) neural networks were trained on embeddings from encoders created using approximately 118k LLVM functions from the Juliet dataset. This study is pioneering in its comparison of word2vec models with multiple bidirectional transformer (BERT, RoBERTa) embeddings built using LLVM code to train neural networks to detect vulnerabilities in compiled binaries. word2vec Continuous Bag of Words (CBOW) models achieved 92.3% validation accuracy in detecting vulnerabilities, outperforming word2vec Skip-Gram, BERT, and RoBERTa. This suggests that complex contextual NLP embeddings may not provide advantages over simpler word2vec models for this task when a limited number (e.g. 118K) of data samples are used to train the bidirectional transformer-based models. The comparative results provide novel insights into selecting optimal embeddings for learning compiler-independent semantic code representations to advance machine learning detection of vulnerabilities in compiled binaries.

[LG-89] Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Link: https://arxiv.org/abs/2405.20606
Authors: Yang Chen, Tian He, Junfeng Fu, Ling Wang, Jingcai Guo, Hong Cheng
Keywords: Supervised and self-supervised, main training paradigms, Vision-Language knowledge prompts, human skeleton action, Vision-Language knowledge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Supervised and self-supervised learning are two main training paradigms for skeleton-based human action recognition. However, the former one-hot classification requires labor-intensive predefined action categories annotations, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C^2VL) based on Cross-modal Contrastive learning that uses the progressive distillation to learn task-agnostic human skeleton action representation from the Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose the intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process to progressively control and guide the degree of pulling vision-language knowledge prompts and corresponding skeletons closer. These soft instance discrimination and self-knowledge distillation strategies contribute to the learning of better skeleton-based action representations from the noisy skeleton-vision-language pairs. During the inference phase, our method requires only the skeleton data as input for action recognition, and no longer needs the vision-language prompts. Extensive experiments show that our method achieves state-of-the-art results on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available in the future.

[LG-90] Searching for internal symbols underlying deep learning

Link: https://arxiv.org/abs/2405.20605
Authors: Jung H. Lee, Sujith Vijayan
Keywords: enables deep neural, deep neural networks, automatically learn complex, learn complex tasks, enables deep
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, 3 tables and Appendix

Click to view abstract

Abstract:Deep learning (DL) enables deep neural networks (DNNs) to automatically learn complex tasks or rules from given examples without instructions or guiding principles. As we do not engineer DNNs’ functions, it is extremely difficult to diagnose their decisions, and multiple lines of studies proposed to explain principles of DNNs/DL operations. Notably, one line of studies suggests that DNNs may learn concepts, the high level features recognizable to humans. Thus, we hypothesized that DNNs develop abstract codes, not necessarily recognizable to humans, which can be used to augment DNNs’ decision-making. To address this hypothesis, we combined foundation segmentation models and unsupervised learning to extract internal codes and identify potential use of abstract codes to make DL’s decision-making more reliable and safer.

[LG-91] Advancing Financial Risk Prediction Through Optimized LSTM Model Performance and Comparative Analysis

Link: https://arxiv.org/abs/2405.20603
Authors: Ke Xu, Yu Cheng, Shiqing Long, Junjie Guo, Jue Xiao, Mengfang Sun
Keywords: financial risk prediction, LSTM model shows, LSTM model, optimized LSTM model, paper focuses
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper focuses on the application and optimization of the LSTM model in financial risk prediction. The study starts with an overview of the architecture and algorithmic foundation of LSTM, then details the model training process and hyperparameter tuning strategy, adjusting network parameters through experiments to improve performance. Comparative experiments show that the optimized LSTM model has significant advantages in the AUC index over random forest, BP neural network and XGBoost, verifying its efficiency and practicability in the field of financial risk prediction, especially its ability to handle complex time series data. This lays a solid foundation for applying the model in an actual production environment.

[LG-92] Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis

Link: https://arxiv.org/abs/2405.20602
Authors: Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon
Keywords: synthetic data generation, machine learning utility, high machine learning, generate synthetic data, synthetic data
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Given that the MLu performance relies on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We propose a novel synthetic data generation method, MaCoDE, by redefining the multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our proposed method enables estimating conditional densities across arbitrary combinations of target and conditional variables. Furthermore, we demonstrate that our proposed method bridges the theoretical gap between distributional learning and MLM. To validate the effectiveness of our proposed model, we conduct synthetic data generation experiments on 10 real-world datasets. Given the analogy between predicting masked input tokens in MLM and missing data imputation, we also evaluate the performance of multiple imputations on incomplete datasets with various missing data mechanisms. Moreover, our proposed model offers the advantage of enabling adjustments to data privacy levels without requiring re-training.
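The core reduction, treating masked-cell prediction as histogram-based conditional density estimation, can be imitated in a few lines. In this illustrative stand-in (not the MaCoDE model), an empirical conditional histogram plays the role of the trained masked model: continuous columns are discretized into bins (the "vocabulary"), one cell is masked, and a distribution over its bins is estimated from the other column.

```python
import numpy as np

rng = np.random.default_rng(1)
n, bins = 10_000, 10
x = rng.uniform(0, 1, size=n)
y = x + rng.normal(scale=0.05, size=n)  # y depends on x

# Discretize both columns into histogram bins (the MLM "token vocabulary"):
edges = np.linspace(0, 1, bins + 1)
xb = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
yb = np.clip(np.digitize(y, edges) - 1, 0, bins - 1)

# "Mask" the y column and estimate p(y-bin | x-bin) empirically; this
# row-normalized table is the non-parametric conditional density a
# masked model would be trained to output for the masked cell:
joint = np.zeros((bins, bins))
np.add.at(joint, (xb, yb), 1)
cond = joint / joint.sum(axis=1, keepdims=True)
print(cond[7].round(2))  # mass concentrates near y-bin 7 when x is in bin 7
```

Sampling a bin from such a conditional and decoding it back to a value is one way to see why the same machinery covers both synthetic data generation and missing-data imputation, as the abstract notes.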

[LG-93] Generalized Semi-Supervised Learning via Self-Supervised Feature Adaptation

Link: https://arxiv.org/abs/2405.20596
Authors: Jiachen Liang, Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, Xilin Chen
Keywords: Traditional semi-supervised learning, Traditional semi-supervised, SSL, unlabeled data, semi-supervised learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages; Accepted by NeurIPS 2023

Click to view abstract

Abstract:Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under this setting, previous SSL methods tend to predict wrong pseudo-labels with the model fitted on labeled data, resulting in noise accumulation. To tackle this issue, we propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions. SSFA decouples the prediction of pseudo-labels from the current model to improve the quality of pseudo-labels. Particularly, SSFA incorporates a self-supervised task into the SSL framework and uses it to adapt the feature extractor of the model to the unlabeled data. In this way, the extracted features better fit the distribution of unlabeled data, thereby generating high-quality pseudo-labels. Extensive experiments show that our proposed SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.

[LG-94] Deep Learning without Weight Symmetry

Link: https://arxiv.org/abs/2405.20594
Authors: Li Ji-An, Marcus K. Benna
Keywords: training artificial neural, artificial neural networks, predominates in contemporary, training artificial, artificial neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Click to view abstract

Abstract:Backpropagation (BP), a foundational algorithm for training artificial neural networks, predominates in contemporary deep learning. Although highly successful, it is often considered biologically implausible. A significant limitation arises from the need for precise symmetry between connections in the backward and forward pathways to backpropagate gradient signals accurately, which is not observed in biological brains. Researchers have proposed several algorithms to alleviate this symmetry constraint, such as feedback alignment and direct feedback alignment. However, their divergence from backpropagation dynamics presents challenges, particularly in deeper networks and convolutional layers. Here we introduce the Product Feedback Alignment (PFA) algorithm. Our findings demonstrate that PFA closely approximates BP and achieves comparable performance in deep convolutional networks while avoiding explicit weight symmetry. Our results offer a novel solution to the longstanding weight symmetry problem, leading to more biologically plausible learning in deep convolutional networks compared to earlier methods.
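For context on the weight-symmetry constraint, the sketch below contrasts backpropagation's use of the transposed forward weights with feedback alignment, the prior method the abstract cites: the backward error is sent through a fixed random matrix B instead of W2.T. This illustrates the symmetry problem only; the paper's Product Feedback Alignment is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer linear network y = W2 @ (W1 @ x), trained to match a target map T.
W1 = 0.1 * rng.normal(size=(16, 8))
W2 = 0.1 * rng.normal(size=(4, 16))
B = 0.1 * rng.normal(size=(16, 4))   # fixed random feedback matrix
T = rng.normal(size=(4, 8))          # target linear map

for _ in range(2000):
    x = rng.normal(size=(8, 1))
    h = W1 @ x
    e = W2 @ h - T @ x               # output error
    # Backprop would propagate the error through W2.T (exact symmetry);
    # feedback alignment uses the fixed random B instead:
    W1 -= 0.05 * (B @ e) @ x.T
    W2 -= 0.05 * e @ h.T

x = rng.normal(size=(8, 1))
print(np.linalg.norm(W2 @ (W1 @ x) - T @ x))  # training error shrinks
```

Even without symmetric weights, the forward weights gradually align with the random feedback pathway, which is why such schemes learn at all; the abstract's point is that this alignment degrades in deeper and convolutional networks, motivating PFA.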

[LG-95] LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

链接: https://arxiv.org/abs/2405.20592
作者: Amin Heyrani Nobari,Akash Srivastava,Dan Gutfreund,Kai Xu,Faez Ahmed
关键词: integrates contrastive learning, continuous variables, techniques for solving, discrete and continuous, solving complex inverse
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we introduce LInK, a novel framework that integrates contrastive learning of performance and design space with optimization techniques for solving complex inverse problems in engineering design with discrete and continuous variables. We focus on the path synthesis problem for planar linkage mechanisms. By leveraging a multi-modal and transformation-invariant contrastive learning framework, LInK learns a joint representation that captures complex physics and design representations of mechanisms, enabling rapid retrieval from a vast dataset of over 10 million mechanisms. This approach improves precision through the warm start of a hierarchical unconstrained nonlinear optimization algorithm, combining the robustness of traditional optimization with the speed and adaptability of modern deep learning methods. On an existing benchmark, LInK outperforms the state-of-the-art approach with 28 times less error while taking 20 times less time. Moreover, we introduce a significantly more challenging benchmark, named LINK-ABC, which involves synthesizing linkages that trace the trajectories of the capital letters of the English alphabet - an inverse design benchmark task that existing methods struggle with due to large non-linearities and tiny feasible space. Our results demonstrate that LInK not only advances the field of mechanism design but also broadens the applicability of contrastive learning and optimization to other areas of engineering.

[LG-96] Class-Based Time Series Data Augmentation to Mitigate Extreme Class Imbalance for Solar Flare Prediction

链接: https://arxiv.org/abs/2405.20590
作者: Junzhi Wen,Rafal A. Angryk
关键词: Time series data, Time series, multivariate time series, series data plays, making it valuable
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series data plays a crucial role across various domains, making it valuable for decision-making and predictive modeling. Machine learning (ML) and deep learning (DL) have shown promise in this regard, yet their performance hinges on data quality and quantity, often constrained by data scarcity and class imbalance, particularly for rare events like solar flares. Data augmentation techniques offer a potential solution to address these challenges, yet their effectiveness on multivariate time series datasets remains underexplored. In this study, we propose a novel data augmentation method for time series data named Mean Gaussian Noise (MGN). We investigate the performance of MGN compared to eight existing basic data augmentation methods on a multivariate time series dataset for solar flare prediction, SWAN-SF, using a ML algorithm for time series data, TimeSeriesSVC. The results demonstrate the efficacy of MGN and highlight its potential for improving classification performance in scenarios with extremely imbalanced data. Our time complexity analysis shows that MGN also has a competitive computational cost compared to the investigated alternative methods.
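The exact Mean Gaussian Noise (MGN) formulation is not spelled out in the abstract; the sketch below is an illustrative guess at a mean-scaled noise augmentation for multivariate time series, used to oversample a minority class such as flaring events. The `scale` parameter, the per-channel scaling rule, and the data shapes are all assumptions for demonstration.

```python
import numpy as np

def mean_gaussian_noise(ts, scale=0.05, rng=None):
    """Add zero-mean Gaussian noise whose per-channel standard deviation is
    proportional to that channel's mean magnitude.
    ts has shape (timesteps, channels)."""
    rng = np.random.default_rng(rng)
    sigma = scale * np.abs(ts.mean(axis=0, keepdims=True))   # (1, channels)
    return ts + rng.normal(size=ts.shape) * sigma

# Oversample a minority class with five noisy copies of each sample
rng = np.random.default_rng(1)
minority = [rng.normal(5.0, 1.0, size=(60, 4)) for _ in range(3)]
augmented = [mean_gaussian_noise(x) for x in minority for _ in range(5)]
```

Each augmented copy preserves the original series' shape and rough scale, which is what makes this kind of augmentation cheap compared to generative approaches.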

[LG-97] Selective Knowledge Sharing for Personalized Federated Learning Under Capacity Heterogeneity

链接: https://arxiv.org/abs/2405.20589
作者: Zheng Wang,Zheng Wang,Zhaopeng Peng,Zihui Wang,Cheng Wang
关键词: Federated Learning, gain significant advantages, collaboratively training capacity-heterogeneous, training capacity-heterogeneous models, stands to gain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) stands to gain significant advantages from collaboratively training capacity-heterogeneous models, enabling the utilization of private data and computing power from low-capacity devices. However, the focus on personalizing capacity-heterogeneous models based on client-specific data has been limited, resulting in suboptimal local model utility, particularly for low-capacity clients. The heterogeneity in both data and device capacity poses two key challenges for model personalization: 1) accurately retaining necessary knowledge embedded within reduced submodels for each client, and 2) effectively sharing knowledge through aggregating size-varying parameters. To this end, we introduce Pa3dFL, a novel framework designed to enhance local model performance by decoupling and selectively sharing knowledge among capacity-heterogeneous models. First, we decompose each layer of the model into general and personal parameters. Then, we maintain uniform sizes for the general parameters across clients and aggregate them through direct averaging. Subsequently, we employ a hyper-network to generate size-varying personal parameters for clients using learnable embeddings. Finally, we facilitate the implicit aggregation of personal parameters by aggregating client embeddings through a self-attention module. We conducted extensive experiments on three datasets to evaluate the effectiveness of Pa3dFL. Our findings indicate that Pa3dFL consistently outperforms baseline methods across various heterogeneity settings. Moreover, Pa3dFL demonstrates competitive communication and computation efficiency compared to baseline approaches, highlighting its practicality and adaptability in adverse system conditions.

[LG-98] The Point of View of a Sentiment: Towards Clinician Bias Detection in Psychiatric Notes

链接: https://arxiv.org/abs/2405.20582
作者: Alissa A. Valentine,Lauren A. Lepow,Alexander W. Charney,Isotta Landi
关键词: negative patient descriptions, point of view, large language models, negative patient, Mount Sinai Health
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Oral presentation at NAACL 2024 Queer in AI Workshop

点击查看摘要

Abstract:In psychiatry, negative patient descriptions and stigmatizing language can contribute to healthcare disparities in two ways: (1) read by patients they can harm their trust and engagement with the medical center; (2) read by future providers they may negatively influence the future perspective of a patient. By leveraging large language models, this work aims to identify the sentiment expressed in psychiatric clinical notes based on the reader’s point of view. Extracting sentences from the Mount Sinai Health System’s large and diverse clinical notes, we used prompts and in-context learning to adapt three large language models (GPT-3.5, Llama 2, Mistral) to classify the sentiment conveyed by the sentences according to the provider or non-provider point of view. Results showed that GPT-3.5 aligns best to provider point of view, whereas Mistral aligns best to non-provider point of view.
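The study's prompting setup can be sketched as a few-shot prompt builder conditioned on the reader's point of view. The wording below is an assumption for illustration; the exact prompts used with GPT-3.5, Llama 2, and Mistral are not given in the abstract.

```python
def build_sentiment_prompt(sentence, point_of_view, examples):
    """Construct a few-shot, point-of-view-conditioned sentiment prompt.
    examples is a list of (sentence, label) in-context demonstrations."""
    pov = ("a healthcare provider" if point_of_view == "provider"
           else "a patient or caregiver")
    header = (f"Classify the sentiment of each clinical sentence as it would "
              f"be perceived by {pov}. "
              f"Answer with positive, neutral, or negative.\n\n")
    shots = "".join(f"Sentence: {s}\nSentiment: {label}\n\n"
                    for s, label in examples)
    return header + shots + f"Sentence: {sentence}\nSentiment:"

prompt = build_sentiment_prompt(
    "Patient was pleasant and cooperative during the interview.",
    "provider",
    [("Patient refused medication again.", "negative")],
)
```

The same sentence can then be scored twice, once per point of view, to compare how the model's judgment shifts with the reader.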

[LG-99] HOPE: A Reinforcement Learning-based Hybrid Policy Path Planner for Diverse Parking Scenarios

链接: https://arxiv.org/abs/2405.20579
作者: Mingyang Jiang,Yueyuan Li,Songan Zhang,Chunxiang Wang,Ming Yang
关键词: current methods struggle, reinforcement learning methods, Path planning plays, reinforcement learning, plays a pivotal
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 10 pages, 6 tables, 5 figures, 1 page appendix

点击查看摘要

Abstract:Path planning plays a pivotal role in automated parking, yet current methods struggle to efficiently handle the intricate and diverse parking scenarios. One potential solution is the reinforcement learning-based method, leveraging its exploration in unrecorded situations. However, a key challenge in training reinforcement learning methods is the inherent randomness in converging to a feasible policy. This paper introduces a novel solution, the Hybrid POlicy Path plannEr (HOPE), which integrates a reinforcement learning agent with Reeds-Shepp curves, enabling effective planning across diverse scenarios. The paper presents a method to calculate and implement an action mask mechanism in path planning, significantly boosting the efficiency and effectiveness of reinforcement learning training. A transformer is employed as the network structure to fuse environmental information and generate planned paths. To facilitate the training and evaluation of the proposed planner, we propose a criterion for categorizing the difficulty level of parking scenarios based on space and obstacle distribution. Experimental results demonstrate that our approach outperforms typical rule-based algorithms and traditional reinforcement learning methods, showcasing higher planning success rates and generalization across various scenarios. The code for our solution will be openly available on GitHub (this https URL) after the paper's acceptance.
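The general action-mask idea the paper builds on can be shown in a few lines: infeasible actions get their logits set to negative infinity before the softmax, so the agent assigns them exactly zero probability and never explores them. How HOPE derives the mask from Reeds-Shepp geometry is not reproduced here; the mask below is a stand-in.

```python
import numpy as np

def masked_policy(logits, mask):
    """Return a proper action distribution with zero probability on
    infeasible actions (mask[i] = True means action i is feasible)."""
    z = np.where(mask, logits, -np.inf)   # kill infeasible actions
    z = z - z.max()                       # numerical stability
    p = np.exp(z)                         # exp(-inf) == 0
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
mask = np.array([True, False, True, False])   # e.g. actions 1 and 3 collide
p = masked_policy(logits, mask)
```

Because masked actions never appear in rollouts, the policy gradient is only estimated over feasible maneuvers, which is the source of the training-efficiency gain the abstract describes.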

[LG-100] Enhancing Generative Molecular Design via Uncertainty-guided Fine-tuning of Variational Autoencoders

链接: https://arxiv.org/abs/2405.20573
作者: A N M Nafiz Abeer,Sanket Jantre,Nathan M Urban,Byung-Jun Yoon
关键词: generative molecular design, molecular design tasks, deep generative models, molecular design, design tasks
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, deep generative models have been successfully adopted for various molecular design tasks, particularly in the life and material sciences. A critical challenge for pre-trained generative molecular design (GMD) models is to fine-tune them to be better suited for downstream design tasks aimed at optimizing specific molecular properties. However, redesigning and training an existing effective generative model from scratch for each new design task is impractical. Furthermore, the black-box nature of typical downstream tasks, such as property prediction, makes it nontrivial to optimize the generative model in a task-specific manner. In this work, we propose a novel approach for a model uncertainty-guided fine-tuning of a pre-trained variational autoencoder (VAE)-based GMD model through performance feedback in an active learning setting. The main idea is to quantify model uncertainty in the generative model, which is made efficient by working within a low-dimensional active subspace of the high-dimensional VAE parameters explaining most of the variability in the model's output. The inclusion of model uncertainty expands the space of viable molecules through decoder diversity. We then explore the resulting model uncertainty class via black-box optimization made tractable by low-dimensionality of the active subspace. This enables us to identify and leverage a diverse set of high-performing models to generate enhanced molecules. Empirical results across six target molecular properties, using multiple VAE-based generative models, demonstrate that our uncertainty-guided fine-tuning approach consistently outperforms the original pre-trained models.

[LG-101] Generative AI for Deep Reinforcement Learning: Framework Analysis and Use Cases

链接: https://arxiv.org/abs/2405.20568
作者: Geng Sun,Wenwen Xie,Dusit Niyato,Fang Mei,Jiawen Kang,Hongyang Du,Shiwen Mao
关键词: deep reinforcement learning, achieved remarkable accomplishments, DRL algorithms, DRL, interactive learning
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:As a form of artificial intelligence (AI) technology based on interactive learning, deep reinforcement learning (DRL) has been widely applied across various fields and has achieved remarkable accomplishments. However, DRL faces certain limitations, including low sample efficiency and poor generalization. Therefore, we present how to leverage generative AI (GAI) to address these issues above and enhance the performance of DRL algorithms in this paper. We first introduce several classic GAI and DRL algorithms and demonstrate the applications of GAI-enhanced DRL algorithms. Then, we discuss how to use GAI to improve DRL algorithms from the data and policy perspectives. Subsequently, we introduce a framework that demonstrates an actual and novel integration of GAI with DRL, i.e., GAI-enhanced DRL. Additionally, we provide a case study of the framework on UAV-assisted integrated near-field/far-field communication to validate the performance of the proposed framework. Moreover, we present several future directions. Finally, the related code is available at: this https URL.

[LG-102] Can Machine Learning Assist in Diagnosis of Primary Immune Thrombocytopenia? A feasibility study

链接: https://arxiv.org/abs/2405.20562
作者: Haroon Miah,Dimitrios Kollias,Giacinto Luca Pedone,Drew Provan,Frederick Chen
关键词: Primary Immune thrombocytopenia, Primary Immune, rare autoimmune disease, autoimmune disease characterised, Immune thrombocytopenia
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Primary Immune thrombocytopenia (ITP) is a rare autoimmune disease characterised by immune-mediated destruction of peripheral blood platelets in patients leading to low platelet counts and bleeding. The diagnosis and effective management of ITP is challenging because there is no established test to confirm the disease and no biomarker with which one can predict the response to treatment and outcome. In this work we conduct a feasibility study to check if machine learning can be applied effectively for diagnosis of ITP using routine blood tests and demographic data in a non-acute outpatient setting. Various ML models, including Logistic Regression, Support Vector Machine, k-Nearest Neighbor, Decision Tree and Random Forest, were applied to data from the UK Adult ITP Registry and a general hematology clinic. Two different approaches were investigated: a demographic-unaware and a demographic-aware one. We conduct extensive experiments to evaluate the predictive performance of these models and approaches, as well as their bias. The results revealed that Decision Tree and Random Forest models were both superior and fair, achieving nearly perfect predictive and fairness scores, with platelet count identified as the most significant variable. Models not provided with demographic information performed better in terms of predictive accuracy but showed lower fairness score, illustrating a trade-off between predictive performance and fairness.

[LG-103] Certifying Global Robustness for Deep Neural Networks

链接: https://arxiv.org/abs/2405.20556
作者: You Li,Guannan Zhao,Shuyu Kong,Yunqi He,Hai Zhou
关键词: globally robust deep, network resists perturbations, robust deep neural, neural network resists, globally robust
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A globally robust deep neural network resists perturbations on all meaningful inputs. Current robustness certification methods emphasize local robustness, struggling to scale and generalize. This paper presents a systematic and efficient method to evaluate and verify global robustness for deep neural networks, leveraging the PAC verification framework for solid guarantees on verification results. We utilize probabilistic programs to characterize meaningful input regions, setting a realistic standard for global robustness. Additionally, we introduce the cumulative robustness curve as a criterion in evaluating global robustness. We design a statistical method that combines multi-level splitting and regression analysis for the estimation, significantly reducing the execution time. Experimental results demonstrate the efficiency and effectiveness of our verification method and its capability to find rare and diversified counterexamples for adversarial training.

[LG-104] Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2405.20555
作者: Linjiajie Fang,Ruoxue Liu,Jing Zhang,Wenjia Wang,Bing-Yi Jing
关键词: offline reinforcement learning, target policy, offline reinforcement, policy, behavior policy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In offline reinforcement learning (RL), it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. Policy-regularized methods address this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm that we alternatively train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance grounds on the theoretical solution of the KL constraint policy iteration, which prevents the learned policy from taking out-of-distribution actions. For critic training, we train a Q-ensemble to stabilize the estimation of Q-gradient. Additionally, DAC employs lower confidence bound (LCB) to address the overestimation and underestimation of value targets due to function approximation error. Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in almost all environments. Code is available at: this https URL.

[LG-105] EM-Assist: Safe Automated ExtractMethod Refactoring with LLMs

链接: https://arxiv.org/abs/2405.20551
作者: Dorin Pomian,Abhiram Bellur,Malinda Dilhara,Zarina Kurbatova,Egor Bogomolov,Andrey Sokolov,Timofey Bryksin,Danny Dig
关键词: Excessively long methods, Excessively long, loaded with multiple, multiple responsibilities, challenging to understand
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: This paper is accepted to the tool demonstration track of the 32nd ACM Symposium on the Foundations of Software Engineering (FSE 2024). This is an author copy

点击查看摘要

Abstract:Excessively long methods, loaded with multiple responsibilities, are challenging to understand, debug, reuse, and maintain. The solution lies in the widely recognized Extract Method refactoring. While the application of this refactoring is supported in modern IDEs, recommending which code fragments to extract has been the topic of many research tools. However, they often struggle to replicate real-world developer practices, resulting in recommendations that do not align with what a human developer would do in real life. To address this issue, we introduce EM-Assist, an IntelliJ IDEA plugin that uses LLMs to generate refactoring suggestions and subsequently validates, enhances, and ranks them. Finally, EM-Assist uses the IntelliJ IDE to apply the user-selected recommendation. In our extensive evaluation of 1,752 real-world refactorings that actually took place in open-source projects, EM-Assist’s recall rate was 53.4% among its top-5 recommendations, compared to 39.4% for the previous best-in-class tool that relies solely on static analysis. Moreover, we conducted a usability survey with 18 industrial developers and 94.4% gave a positive rating.

[LG-106] Uncertainty Quantification for Deep Learning

链接: https://arxiv.org/abs/2405.20550
作者: Peter Jan van Leeuwen,J. Christine Chiu,C. Kevin Yang
关键词: statistically consistent uncertainty, consistent uncertainty quantification, neural network, perfect predictor, complete and statistically
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages 4 figures, submitted to Environmental data Science

点击查看摘要

Abstract:A complete and statistically consistent uncertainty quantification for deep learning is provided, including the sources of uncertainty arising from (1) the new input data, (2) the training and testing data, (3) the weight vectors of the neural network, and (4) the neural network because it is not a perfect predictor. Using Bayes Theorem and conditional probability densities, we demonstrate how each uncertainty source can be systematically quantified. We also introduce a fast and practical way to incorporate and combine all sources of errors for the first time. For illustration, the new method is applied to quantify errors in cloud autoconversion rates, predicted from an artificial neural network that was trained by aircraft cloud probe measurements in the Azores and the stochastic collection equation formulated as a two-moment bin model. For this specific example, the output uncertainty arising from uncertainty in the training and testing data is dominant, followed by uncertainty in the input data, in the trained neural network, and uncertainty in the weights. We discuss the usefulness of the methodology for machine learning practice, and how, through inclusion of uncertainty in the training data, the new methodology is less sensitive to input data that falls outside of the training data set.

[LG-107] Towards a General GNN Framework for Combinatorial Optimization

链接: https://arxiv.org/abs/2405.20543
作者: Frederik Wenkel,Semih Cantürk,Michael Perlmutter,Guy Wolf
关键词: achieved great success, Graph neural networks, node classification, neural networks, link prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
*备注: 15 pages, 1 figure

点击查看摘要

Abstract:Graph neural networks (GNNs) have achieved great success for a variety of tasks such as node classification, graph classification, and link prediction. However, the use of GNNs (and machine learning more generally) to solve combinatorial optimization (CO) problems is much less explored. Here, we introduce a novel GNN architecture which leverages a complex filter bank and localized attention mechanisms designed to solve CO problems on graphs. We show how our method differentiates itself from prior GNN-based CO solvers and how it can be effectively applied to the maximum clique, minimum dominating set, and maximum cut problems in a self-supervised learning setting. In addition to demonstrating competitive overall performance across all tasks, we establish state-of-the-art results for the max cut problem.

[LG-108] On the Connection Between Non-negative Matrix Factorization and Latent Dirichlet Allocation

链接: https://arxiv.org/abs/2405.20542
作者: Benedikt Geiger,Peter J. Park
关键词: Non-negative matrix factorization, generalized Kullback-Leibler divergence, latent Dirichlet allocation, non-negative data, Dirichlet allocation
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages

点击查看摘要

Abstract:Non-negative matrix factorization with the generalized Kullback-Leibler divergence (NMF) and latent Dirichlet allocation (LDA) are two popular approaches for dimensionality reduction of non-negative data. Here, we show that NMF with \ell_1 normalization constraints on the columns of both matrices of the decomposition and a Dirichlet prior on the columns of one matrix is equivalent to LDA. To show this, we demonstrate that explicitly accounting for the scaling ambiguity of NMF by adding \ell_1 normalization constraints to the optimization problem allows a joint update of both matrices in the widely used multiplicative updates (MU) algorithm. When both of the matrices are normalized, the joint MU algorithm leads to probabilistic latent semantic analysis (PLSA), which is LDA without a Dirichlet prior. Our approach of deriving joint updates for NMF also reveals that a Lasso penalty on one matrix together with an \ell_1 normalization constraint on the other matrix is insufficient to induce any sparsity.
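The joint multiplicative updates with ℓ1 normalization can be sketched directly: after each standard Lee-Seung KL update, the columns of the factor are renormalized to sum to one (the renormalization absorbs the usual MU denominators). With column-normalized data this is exactly the PLSA EM iteration the paper relates to LDA. Dimensions and iteration counts below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 30))
V /= V.sum(axis=0, keepdims=True)     # column-normalized data, as in PLSA
k = 4
W = rng.random((20, k)); W /= W.sum(axis=0, keepdims=True)  # ~ p(word | topic)
H = rng.random((k, 30)); H /= H.sum(axis=0, keepdims=True)  # ~ p(topic | doc)

def kl(V, WH):
    # Generalized KL divergence; the -V + WH terms cancel for
    # column-normalized V and WH, leaving the cross-entropy part.
    return float((V * np.log(V / WH)).sum())

kl_before = kl(V, W @ H)
for _ in range(200):
    WH = W @ H
    W *= (V / WH) @ H.T
    W /= W.sum(axis=0, keepdims=True)   # L1-normalize columns of W
    WH = W @ H
    H *= W.T @ (V / WH)
    H /= H.sum(axis=0, keepdims=True)   # L1-normalize columns of H
kl_after = kl(V, W @ H)
```

Since each normalized update is an EM step for the PLSA likelihood, the KL divergence is non-increasing across iterations.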

[LG-109] Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

链接: https://arxiv.org/abs/2405.20541
作者: Zachary Ankner,Cody Blakeney,Kartik Sreenivasan,Max Marion,Matthew L. Leavitt,Mansheej Paul
关键词: small language models, determine high-quality subsets, large-scale text datasets, larger language models, language models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can \emphsignificantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45\times reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
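The pruning recipe is simple to sketch: score every document with a small reference model's perplexity, then keep only a selected fraction. Here a smoothed unigram model stands in for the 125M-parameter model in the paper, and the keep-lowest selection rule is one of several criteria such work explores; corpus and thresholds are illustrative.

```python
import math
from collections import Counter

# "Small reference model": a smoothed unigram LM fit on a held-out corpus
reference = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(reference)
total = sum(counts.values())

def perplexity(doc, alpha=0.1):
    """Per-token perplexity under the add-alpha-smoothed unigram model."""
    toks = doc.split()
    logp = sum(math.log((counts[t] + alpha) / (total + alpha * len(counts)))
               for t in toks)
    return math.exp(-logp / len(toks))

corpus = [
    "the cat sat on the mat",
    "colorless green ideas sleep furiously",   # out-of-domain, high perplexity
    "the dog sat",
]
scored = sorted(corpus, key=perplexity)
kept = scored[: int(len(corpus) * 0.67)]   # keep the lowest-perplexity ~2/3
```

In practice the reference model is a small pretrained LM and the scored corpus has billions of tokens, but the selection logic is the same.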

[LG-110] Fully Unconstrained Online Learning

链接: https://arxiv.org/abs/2405.20540
作者: Ashok Cutkosky,Zakaria Mhammedi
关键词: Lipschitz convex losses, online learning algorithm, star, Lipschitz convex, sqrt
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably covers most "interesting" scenarios.

[LG-111] SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents

链接: https://arxiv.org/abs/2405.20539
作者: Ethan Rathbun,Christopher Amato,Alina Oprea
关键词: actively growing field, Reinforcement learning, safety-critical applications, usage in real-world, making it paramount
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 23 pages, 14 figures, NeurIPS

点击查看摘要

Abstract:Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications – making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL – backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving their inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy – guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop "SleeperNets" as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.

[LG-112] Q-learning as a monotone scheme

链接: https://arxiv.org/abs/2405.20538
作者: Lingyi Yang
关键词: learning methods persist, reinforcement learning methods, reinforcement learning, learning methods, deep reinforcement learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stability issues with reinforcement learning methods persist. To better understand some of these stability and convergence issues involving deep reinforcement learning methods, we examine a simple linear quadratic example. We interpret the convergence criterion of exact Q-learning in the sense of a monotone scheme and discuss consequences of function approximation on monotonicity properties.
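The monotone-scheme view of exact Q-learning can be illustrated on a tiny tabular MDP: the Bellman optimality operator is a monotone γ-contraction, so synchronous iteration converges to its fixed point. The paper's analysis concerns a linear-quadratic example and function approximation, which this random finite MDP does not reproduce.

```python
import numpy as np

n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] = next-state dist
R = rng.normal(size=(n_s, n_a))                    # R[s, a] = expected reward

def bellman(Q):
    """Bellman optimality operator T: (TQ)(s,a) = R(s,a) + g * E[max_a' Q(s',a')]."""
    return R + gamma * P @ Q.max(axis=1)

Q = np.zeros((n_s, n_a))
for _ in range(500):                               # value iteration on Q
    Q = bellman(Q)
```

Monotonicity means `Q1 >= Q2` elementwise implies `bellman(Q1) >= bellman(Q2)`; together with the contraction property this is what guarantees convergence of the exact scheme, and what function approximation can break.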

[LG-113] Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2405.20534
作者: Davide Corsi,Davide Camponogara,Alessandro Farinelli
关键词: Deep Reinforcement Learning, Deep Reinforcement, real-world robotic systems, frontier for Deep, Reinforcement Learning
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:An exciting and promising frontier for Deep Reinforcement Learning (DRL) is its application to real-world robotic systems. While modern DRL approaches achieved remarkable successes in many robotic scenarios (including mobile robotics, surgical assistance, and autonomous driving), unpredictable and non-stationary environments can pose critical challenges to such methods. These features can significantly undermine fundamental requirements for a successful training process, such as the Markovian properties of the transition model. To address this challenge, we propose a new benchmarking environment for aquatic navigation using recent advances in the integration between game engines and DRL. In more detail, we show that our benchmarking environment is problematic even for state-of-the-art DRL approaches that may struggle to generate reliable policies in terms of generalization power and safety. Specifically, we focus on PPO, one of the most widely accepted algorithms, and we propose advanced training techniques (such as curriculum learning and learnable hyperparameters). Our extensive empirical evaluation shows that a well-designed combination of these ingredients can achieve promising results. Our simulation environment and training baselines are freely available to facilitate further research on this open problem and encourage collaboration in the field.

[LG-114] Mitigating the Impact of Labeling Errors on Training via Rockafellian Relaxation

链接: https://arxiv.org/abs/2405.20531
作者: Louis L. Chen,Bobbie Chern,Eric Eckstrand,Amogh Mahapatra,Johannes O. Royset
关键词: Labeling, neural network, noisy labeling, Abstract, contexts-human labeling
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Labeling errors in datasets are common, if not systematic, in practice. They naturally arise in a variety of contexts-human labeling, noisy labeling, and weak labeling (i.e., image classification), for example. This presents a persistent and pervasive stress on machine learning practice. In particular, neural network (NN) architectures can withstand minor amounts of dataset imperfection with traditional countermeasures such as regularization, data augmentation, and batch normalization. However, major dataset imperfections often prove insurmountable. We propose and study the implementation of Rockafellian Relaxation (RR), a new loss reweighting, architecture-independent methodology, for neural network training. Experiments indicate RR can enhance standard neural network methods to achieve robust performance across classification tasks in computer vision and natural language processing (sentiment analysis). We find that RR can mitigate the effects of dataset corruption due to both (heavy) labeling error and/or adversarial perturbation, demonstrating effectiveness across a variety of data domains and machine learning tasks.
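Rockafellian Relaxation's exact objective is not given in the abstract; the sketch below shows only the general loss-reweighting idea it belongs to, with a hypothetical softmin weighting in which samples with extreme losses (likely mislabeled) contribute less to the training objective. The weighting rule and temperature are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def reweight(losses, temperature=1.0):
    """Softmin-style per-sample weights: higher loss -> smaller weight,
    normalized to sum to n so the overall loss scale is preserved."""
    w = np.exp(-losses / temperature)
    return w * len(losses) / w.sum()

losses = np.array([0.2, 0.3, 5.0])     # third sample looks mislabeled
w = reweight(losses)
weighted_loss = float((w * losses).mean())
```

Downweighting suspect samples keeps a handful of corrupted labels from dominating the gradient, which is the robustness effect the abstract reports across vision and NLP tasks.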

[LG-115] WaveCastNet: An AI-enabled Wavefield Forecasting Framework for Earthquake Early Warning

链接: https://arxiv.org/abs/2405.20516
作者: Dongwei Lyu,Rie Nakata,Pu Ren,Michael W. Mahoney,Arben Pitarka,Nori Nakata,N. Benjamin Erichson
关键词: quickly wreak havoc, quickly wreak, wreak havoc, Large earthquakes, Long Expressive Memory
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Large earthquakes can be destructive and quickly wreak havoc on a landscape. To mitigate immediate threats, early warning systems have been developed to alert residents, emergency responders, and critical infrastructure operators seconds to a minute before seismic waves arrive. These warnings provide time to take precautions and prevent damage. The success of these systems relies on fast, accurate predictions of ground motion intensities, which is challenging due to the complex physics of earthquakes, wave propagation, and their intricate spatial and temporal interactions. To improve early warning, we propose a novel AI-enabled framework, WaveCastNet, for forecasting ground motions from large earthquakes. WaveCastNet integrates a novel convolutional Long Expressive Memory (ConvLEM) model into a sequence to sequence (seq2seq) forecasting framework to model long-term dependencies and multi-scale patterns in both space and time. WaveCastNet, which shares weights across spatial and temporal dimensions, requires fewer parameters compared to more resource-intensive models like transformers and thus, in turn, reduces inference times. Importantly, WaveCastNet also generalizes better than transformer-based models to different seismic scenarios, including to more rare and critical situations with higher magnitude earthquakes. Our results using simulated data from the San Francisco Bay Area demonstrate the capability to rapidly predict the intensity and timing of destructive ground motions. Importantly, our proposed approach does not require estimating earthquake magnitudes and epicenters, which are prone to errors using conventional approaches; nor does it require empirical ground motion models, which fail to capture strongly heterogeneous wave propagation effects.

[LG-116] Deep Modeling of Non-Gaussian Aleatoric Uncertainty

链接: https://arxiv.org/abs/2405.20513
作者: Aastha Acharya,Caleb Lee,Marissa D’Alonzo,Jared Shamwell,Nisar R. Ahmed,Rebecca Russell
关键词: fixed and Gaussian, learning offers promising, Deep learning offers, model aleatoric uncertainty, robotic estimation systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Deep learning offers promising new ways to accurately model aleatoric uncertainty in robotic estimation systems, particularly when the uncertainty distributions do not conform to traditional assumptions of being fixed and Gaussian. In this study, we formulate and evaluate three fundamental deep learning approaches for conditional probability density modeling to quantify non-Gaussian aleatoric uncertainty: parametric, discretized, and generative modeling. We systematically compare the respective strengths and weaknesses of these three methods on simulated non-Gaussian densities as well as on real-world terrain-relative navigation data. Our results show that these deep learning methods can accurately capture complex uncertainty patterns, highlighting their potential for improving the reliability and robustness of estimation systems.
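Of the three approaches the abstract names, the discretized one is the simplest to sketch: model a 1-D conditional density as probabilities over fixed bins and score it with negative log-likelihood. A minimal stand-alone sketch (the bin layout and probabilities below are illustrative assumptions, not the paper's setup, where bin probabilities would come from a network head):

```python
import math

def discretized_nll(samples, bin_probs, lo, hi):
    """Average negative log-likelihood of samples under a binned density.

    Minimal sketch of discretized density modeling: bin_probs would
    normally come from a network head with a softmax; here they are
    given directly. Density inside a bin = bin probability / bin width.
    """
    k = len(bin_probs)
    width = (hi - lo) / k
    nll = 0.0
    for s in samples:
        b = min(k - 1, max(0, int((s - lo) / width)))
        nll -= math.log(bin_probs[b] / width)
    return nll / len(samples)

# Four uniform bins over [0, 1); mass concentrated in the lower half
probs = [0.4, 0.4, 0.1, 0.1]
print(round(discretized_nll([0.1, 0.3, 0.6], probs, 0.0, 1.0), 3))  # → -0.008
```

Because bins can take any shape, such a head can capture skewed or multi-modal (non-Gaussian) uncertainty that a single fixed Gaussian cannot.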

[LG-117] How Multilingual Are Large Language Models Fine-Tuned for Translation?

链接: https://arxiv.org/abs/2405.20512
作者: Aquia Richburg,Marine Carpuat
关键词: translation systems trained, large language models, fine-tuning large language, outperform dedicated translation, dedicated translation systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A new paradigm for machine translation has recently emerged: fine-tuning large language models (LLM) on parallel text has been shown to outperform dedicated translation systems trained in a supervised fashion on much larger amounts of parallel data (Xu et al., 2024a; Alves et al., 2024). However, it remains unclear whether this paradigm can enable massively multilingual machine translation or whether it requires fine-tuning dedicated models for a small number of language pairs. How does translation fine-tuning impact the MT capabilities of LLMs for zero-shot languages, zero-shot language pairs, and translation tasks that do not involve English? To address these questions, we conduct an extensive empirical evaluation of the translation quality of the TOWER family of language models (Alves et al., 2024) on 132 translation tasks from the multi-parallel FLORES-200 data. We find that translation fine-tuning improves translation quality even for zero-shot languages on average, but that the impact is uneven depending on the language pairs involved. These results call for further research to effectively enable massively multilingual translation with LLMs.

[LG-118] SPOT: Text Source Prediction from Originality Score Thresholding

链接: https://arxiv.org/abs/2405.20505
作者: Edouard Yvinec,Gabriel Kasser
关键词: large language models, social risks, wide acceptance, acceptance of large, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The wide acceptance of large language models (LLMs) has unlocked new applications and social risks. Popular countermeasures aim at detecting misinformation and usually involve domain-specific models trained to recognize the relevance of any information. Instead of evaluating the validity of the information, we propose to investigate LLM-generated text from the perspective of trust. In this study, we define trust as the ability to know if an input text was generated by an LLM or a human. To do so, we design SPOT, an efficient method that classifies the source of any standalone text input based on an originality score. This score is derived from the prediction of a given LLM to detect other LLMs. We empirically demonstrate the robustness of the method to the architecture, training data, evaluation data, task, and compression of modern LLMs.

[LG-119] FCOM: A Federated Collaborative Online Monitoring Framework via Representation Learning

链接: https://arxiv.org/abs/2405.20504
作者: Tanapol Kosolwattana,Huazheng Wang,Raed Al Kontar,Ying Lin
关键词: yielding high rewards, demonstrated notable potential, dynamically allocate limited, allocate limited resources, processes yielding high
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online learning has demonstrated notable potential to dynamically allocate limited resources to monitor a large population of processes, effectively balancing the exploitation of processes yielding high rewards, and the exploration of uncertain processes. However, most online learning algorithms were designed under 1) a centralized setting that requires data sharing across processes to obtain an accurate prediction or 2) a homogeneity assumption that estimates a single global model from the decentralized data. To facilitate the online learning of heterogeneous processes from the decentralized data, we propose a federated collaborative online monitoring method, which captures the latent representative models inherent in the population through representation learning and designs a novel federated collaborative UCB algorithm to estimate the representative models from sequentially observed decentralized data. The efficiency of our method is illustrated through theoretical analysis, simulation studies, and decentralized cognitive degradation monitoring in Alzheimer’s disease.
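The bandit core of such monitoring schemes can be illustrated with a plain single-agent UCB index; the federated collaboration and representation-learning parts of FCOM are deliberately omitted in this hypothetical sketch:

```python
import math

def ucb_select(counts, means, t, c=2.0):
    """Pick which process to monitor next via a UCB index.

    Minimal single-agent sketch of the bandit core inside online
    monitoring: choose the process with the highest mean-reward estimate
    plus an exploration bonus that shrinks as the process is observed.
    """
    best, best_idx = -float("inf"), 0
    for i, (n, mu) in enumerate(zip(counts, means)):
        if n == 0:
            return i  # monitor any unobserved process first
        idx = mu + math.sqrt(c * math.log(t) / n)
        if idx > best:
            best, best_idx = idx, i
    return best_idx

# Three processes: two well-observed, one barely observed and uncertain
counts = [10, 10, 1]
means = [0.8, 0.2, 0.5]
print(ucb_select(counts, means, t=21))  # → 2
```

The uncertain process wins here despite a lower mean estimate, which is exactly the exploration/exploitation balance the abstract describes.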

[LG-120] Optimizing CNN-BiGRU performance: Mish activation and comparative analysis with ReLU

链接: https://arxiv.org/abs/2405.20503
作者: Asmaa Benchama,Khalid Zebbara
关键词: deep learning techniques, Deep learning, research domains, extensively employed, range of research
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep learning is currently extensively employed across a range of research domains. The continuous advancements in deep learning techniques contribute to solving intricate challenges. Activation functions (AF) are fundamental components within neural networks, enabling them to capture complex patterns and relationships in the data. By introducing non-linearities, AFs empower neural networks to model and adapt to the diverse and nuanced nature of real-world data, enhancing their ability to make accurate predictions across various tasks. In the context of intrusion detection, Mish, a recent AF, was implemented in the CNN-BiGRU model using three datasets: ASNM-TUN, ASNM-CDX, and HOGZILLA. The comparison with the Rectified Linear Unit (ReLU), a widely used AF, revealed that Mish outperforms ReLU, showcasing superior performance across the evaluated datasets. This study illuminates the effectiveness of AFs in elevating the performance of intrusion detection systems.
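Mish itself has a simple closed form, Mish(x) = x · tanh(softplus(x)); a minimal pure-Python sketch for comparison with ReLU (the CNN-BiGRU model and the intrusion-detection datasets are of course not reproduced here):

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

def relu(x: float) -> float:
    return max(0.0, x)

# Unlike ReLU, Mish is smooth and lets small negative values through,
# which is often credited for better gradient flow.
print(round(mish(0.0), 4))   # → 0.0
print(round(relu(-1.0), 4))  # → 0.0
print(round(mish(-1.0), 4))  # → -0.3034
```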

[LG-121] ShelfHelp: Empowering Humans to Perform Vision-Independent Manipulation Tasks with a Socially Assistive Robotic Cane

链接: https://arxiv.org/abs/2405.20501
作者: Shivendra Agrawal,Suresh Nayak,Ashutosh Naik,Bradley Hayes
关键词: shop independently, quality of life, ability to shop, important for maintaining, maintaining a high
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 14 figures and charts

点击查看摘要

Abstract:The ability to shop independently, especially in grocery stores, is important for maintaining a high quality of life. This can be particularly challenging for people with visual impairments (PVI). Stores carry thousands of products, with approximately 30,000 new products introduced each year in the US market alone, presenting a challenge even for modern computer vision solutions. Through this work, we present a proof-of-concept socially assistive robotic system we call ShelfHelp, and propose novel technical solutions for enhancing instrumented canes traditionally meant for navigation tasks with additional capability within the domain of shopping. ShelfHelp includes a novel visual product locator algorithm designed for use in grocery stores and a novel planner that autonomously issues verbal manipulation guidance commands to guide the user during product retrieval. Through a human subjects study, we show the system's success in locating desired products and providing effective manipulation guidance to novice users. We compare two autonomous verbal guidance modes, which achieve performance comparable to a human-assistance baseline, and present encouraging findings that validate our system's efficiency and effectiveness through positive subjective metrics including competence, intelligence, and ease of use.

[LG-122] Transfer Q Star: Principled Decoding for LLM Alignment

链接: https://arxiv.org/abs/2405.20495
作者: Souradip Chakraborty,Soumya Suvra Ghosal,Ming Yin,Dinesh Manocha,Mengdi Wang,Amrit Singh Bedi,Furong Huang
关键词: Aligning foundation models, Aligning foundation, trustworthy deployment, texttt, safe and trustworthy
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward r , thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ( Q^* ), which is often unavailable in practice. Hence, prior SoTA methods either approximate this Q^* using Q^{\pi}_{\texttt{sft}} (derived from the reference \texttt{SFT} model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer Q^* , which implicitly estimates the optimal value function for a target reward r through a baseline model \rho_{\texttt{BL}} aligned with a baseline reward r_{\texttt{BL}} (which can be different from the target reward r ). Theoretical analyses of Transfer Q^* provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference \texttt{SFT} model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
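The decoding-time reweighting idea can be sketched on a toy next-token distribution. The form p'(y) ∝ p_ref(y) · exp(Q(y)/β) is a common way to write reward-guided decoding; the Q-values below are stand-ins, not the paper's transferred estimate from a baseline-aligned model:

```python
import math

def reward_guided_decode(p_ref, q_values, beta=1.0):
    """Reweight a reference next-token distribution by a value estimate.

    Minimal sketch of decoding-time alignment:
    p'(y) ∝ p_ref(y) * exp(Q(y) / beta). Transfer Q* replaces the
    unavailable Q* with a value implied by a baseline-aligned model;
    here q_values is just a hand-written stand-in dict.
    """
    scores = {y: p * math.exp(q_values[y] / beta) for y, p in p_ref.items()}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

p_ref = {"safe": 0.4, "unsafe": 0.6}
q = {"safe": 2.0, "unsafe": -2.0}
aligned = reward_guided_decode(p_ref, q)
print(aligned["safe"] > aligned["unsafe"])  # the reward flips the preference
```

No model weights change: only the sampling distribution is tilted toward high-value continuations, which is what makes the approach lightweight.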

[LG-123] Slight Corruption in Pre-training Data Makes Better Diffusion Models

链接: https://arxiv.org/abs/2405.20494
作者: Hao Chen,Yujin Han,Diganta Misra,Xiang Li,Kai Hu,Difan Zou,Masashi Sugiyama,Jindong Wang,Bhiksha Raj
关键词: shown remarkable capabilities, generating realistic high-quality, realistic high-quality images, Diffusion models, shown remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 50 pages, 33 figures, 4 tables

点击查看摘要

Abstract:Diffusion models (DMs) have shown remarkable capabilities in generating realistic high-quality images, audios, and videos. They benefit significantly from extensive pre-training on large-scale datasets, including web-crawled data with paired data and conditions, such as image-text and image-class pairs. Despite rigorous filtering, these pre-training datasets often inevitably contain corrupted pairs where conditions do not accurately describe the data. This paper presents the first comprehensive study on the impact of such corruption in pre-training data of DMs. We synthetically corrupt ImageNet-1K and CC3M to pre-train and evaluate over 50 conditional DMs. Our empirical findings reveal that various types of slight corruption in pre-training can significantly enhance the quality, diversity, and fidelity of the generated images across different DMs, both during pre-training and downstream adaptation stages. Theoretically, we consider a Gaussian mixture model and prove that slight corruption in the condition leads to higher entropy and a reduced 2-Wasserstein distance to the ground truth of the data distribution generated by the corruptly trained DMs. Inspired by our analysis, we propose a simple method to improve the training of DMs on practical datasets by adding condition embedding perturbations (CEP). CEP significantly improves the performance of various DMs in both pre-training and downstream tasks. We hope that our study provides new insights into understanding the data and pre-training processes of DMs.
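The proposed CEP fix is conceptually tiny: perturb the condition embedding before it conditions the model. A minimal sketch (the Gaussian noise form and the scale sigma are assumptions for illustration; the paper defines the exact perturbation):

```python
import random

def perturb_condition(embedding, sigma=0.1, rng=None):
    """Condition Embedding Perturbation (CEP), minimal sketch.

    Adds small Gaussian noise to a condition embedding before it is fed
    to the diffusion model, mimicking 'slight corruption' of the
    condition. sigma is an assumed noise scale, not the paper's value.
    """
    rng = rng or random.Random()
    return [e + rng.gauss(0.0, sigma) for e in embedding]

rng = random.Random(42)
emb = [0.2, -0.5, 1.0]
print(perturb_condition(emb, sigma=0.1, rng=rng))
```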

[LG-124] Policy Trees for Prediction: Interpretable and Adaptive Model Selection for Machine Learning

链接: https://arxiv.org/abs/2405.20486
作者: Dimitris Bertsimas,Matthew Peroni
关键词: central questions remain, capable machine learning, public APIs, high-stakes decision-making, multitude of capable
类目: Machine Learning (cs.LG)
*备注: Submitted to JMLR on 5/30/2024

点击查看摘要

Abstract:As a multitude of capable machine learning (ML) models become widely available in forms such as open-source software and public APIs, central questions remain regarding their use in real-world applications, especially in high-stakes decision-making. Is there always one best model that should be used? When are the models likely to be error-prone? Should a black-box or interpretable model be used? In this work, we develop a prescriptive methodology to address these key questions, introducing a tree-based approach, Optimal Predictive-Policy Trees (OP2T), that yields interpretable policies for adaptively selecting a predictive model or ensemble, along with a parameterized option to reject making a prediction. We base our methods on learning globally optimized prescriptive trees. Our approach enables interpretable and adaptive model selection and rejection while only assuming access to model outputs. By learning policies over different feature spaces, including the model outputs, our approach works with both structured and unstructured datasets. We evaluate our approach on real-world datasets, including regression and classification tasks with both structured and unstructured data. We demonstrate that our approach provides both strong performance against baseline methods while yielding insights that help answer critical questions about which models to use, and when.

[LG-125] Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

链接: https://arxiv.org/abs/2405.20485
作者: Harsh Chaudhari,Giorgio Severi,John Abascal,Matthew Jagielski,Christopher A. Choquette-Choo,Milad Nasr,Cristina Nita-Rotaru,Alina Oprea
关键词: Retrieval Augmented Generation, modern large language, large language models, RAG augmented LLMs, Retrieval Augmented
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim's RAG system, by injecting a single malicious document in its knowledge database. We design Phantom, a general two-step attack framework against RAG-augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as a backdoor, is present in the victim's queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.

[LG-126] Leveraging Structure Between Environments: Phylogenetic Regularization Incentivizes Disentangled Representations

链接: https://arxiv.org/abs/2405.20482
作者: Elliot Layne,Jason Hartford,Sébastien Lachapelle,Mathieu Blanchette,Dhanya Sridhar
关键词: latent causal variables, indirectly via measurements, biological processes, observed indirectly, gene expression
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many causal systems such as biological processes in cells can only be observed indirectly via measurements, such as gene expression. Causal representation learning – the task of correctly mapping low-level observations to latent causal variables – could advance scientific understanding by enabling inference of latent variables such as pathway activation. In this paper, we develop methods for inferring latent variables from multiple related datasets (environments) and tasks. As a running example, we consider the task of predicting a phenotype from gene expression, where we often collect data from multiple cell types or organisms that are related in known ways. The key insight is that the mapping from latent variables driven by gene expression to the phenotype of interest changes sparsely across closely related environments. To model sparse changes, we introduce Tree-Based Regularization (TBR), an objective that minimizes both prediction error and regularizes closely related environments to learn similar predictors. We prove that under assumptions about the degree of sparse changes, TBR identifies the true latent variables up to some simple transformations. We evaluate the theory empirically with both simulations and ground-truth gene expression data. We find that TBR recovers the latent causal variables better than related methods across these settings, even under settings that violate some assumptions of the theory.
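The regularizer in TBR can be sketched directly: penalize differences between the predictors of environments that are adjacent in the tree, so predictors change sparsely along the phylogeny. The quadratic penalty below is an illustrative choice, not necessarily the paper's exact formulation:

```python
def tbr_penalty(weights, tree_edges, lam=1.0):
    """Tree-Based Regularization penalty, minimal sketch.

    weights: dict env -> list of predictor weights (one per environment)
    tree_edges: pairs of environments adjacent in the phylogeny
    Penalizes squared weight differences between closely related
    environments; this would be added to the prediction loss.
    """
    total = 0.0
    for a, b in tree_edges:
        total += sum((wa - wb) ** 2 for wa, wb in zip(weights[a], weights[b]))
    return lam * total

# Hypothetical environments: mouse and rat are close, human is farther
w = {"mouse": [1.0, 0.0], "rat": [1.1, 0.0], "human": [1.0, 0.8]}
edges = [("mouse", "rat"), ("rat", "human")]
print(round(tbr_penalty(w, edges), 2))  # → 0.66
```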

[LG-127] Extending the Massive Text Embedding Benchmark to French

链接: https://arxiv.org/abs/2405.20468
作者: Mathieu Ciancone,Imene Kerboua,Marion Schaeffer,Wissam Siblini
关键词: Massive Text Embedding, Text Embedding Benchmark, recent years, NLP tasks, numerous embedding models
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, numerous embedding models have been made available and widely used for various NLP tasks. Choosing a model that performs well for several tasks in English has been largely simplified by the Massive Text Embedding Benchmark (MTEB), but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. Not only do we gather 22 existing datasets in an easy-to-use interface, but we also create three new French datasets for a global evaluation over 8 different tasks. We perform a large-scale comparison with 46 carefully selected embedding models, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find that, although no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform particularly well. Our work comes with open-source code, new datasets, and a public leaderboard.

[LG-128] Performance of NPG in Countable State-Space Average-Cost RL

链接: https://arxiv.org/abs/2405.20467
作者: Yashaswini Murthy,Isaac Grosof,Siva Theja Maguluri,R. Srikant
关键词: policy optimization methods, reinforcement learning settings, arbitrarily large, countably infinite, Natural Policy Gradient
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 23 pages

点击查看摘要

Abstract:We consider policy optimization methods in reinforcement learning settings where the state space is arbitrarily large, or even countably infinite. The motivation arises from control problems in communication networks, matching markets, and other queueing systems. We consider Natural Policy Gradient (NPG), which is a popular algorithm for finite state spaces. Under reasonable assumptions, we derive a performance bound for NPG that is independent of the size of the state space, provided the error in policy evaluation is within a factor of the true value function. We obtain this result by establishing new policy-independent bounds on the solution to Poisson’s equation, i.e., the relative value function, and by combining these bounds with previously known connections between MDPs and learning from experts.

[LG-129] ENTIRe-ID: An Extensive and Diverse Dataset for Person Re-Identification

链接: https://arxiv.org/abs/2405.20465
作者: Serdar Yildiz,Ahmet Nezih Kasim
关键词: growing importance, reidentification in computer, computer vision, vision has highlighted, ENTIRe-ID dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 2024 18th International Conference on Automatic Face and Gesture Recognition (FG)

点击查看摘要

Abstract:The growing importance of person re-identification in computer vision has highlighted the need for more extensive and diverse datasets. In response, we introduce the ENTIRe-ID dataset, an extensive collection comprising over 4.45 million images from 37 different cameras in varied environments. This dataset is uniquely designed to tackle the challenges of domain variability and model generalization, areas where existing datasets for person re-identification have fallen short. The ENTIRe-ID dataset stands out for its coverage of a wide array of real-world scenarios, encompassing various lighting conditions, angles of view, and diverse human activities. This design ensures a realistic and robust training platform for ReID models. The ENTIRe-ID dataset is publicly available at this https URL

[LG-130] Scaling Laws for the Value of Individual Data Points in Machine Learning

链接: https://arxiv.org/abs/2405.20456
作者: Ian Covert,Wenlong Ji,Tatsunori Hashimoto,James Zou
关键词: data, Recent works, scaling, data points, shown that machine
类目: Machine Learning (cs.LG)
*备注: ICML 2024 camera-ready

点击查看摘要

Abstract:Recent works have shown that machine learning models improve at a predictable rate with the total amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help design a model's training dataset, but they typically take an aggregate view of the data by only considering the dataset's size. We introduce a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point's contribution to a model's performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets while others are relatively more useful as a part of large datasets. We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally, we demonstrate applications of the individualized scaling laws to data valuation and data subset selection. Overall, our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
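The log-linear claim means a data point's marginal contribution behaves like value(n) ≈ c · n^(−α) in the dataset size n. Such an individualized scaling law can be fit by least squares in log-log space; the sketch below uses synthetic contributions, not the paper's maximum-likelihood or amortized estimators:

```python
import math

def fit_power_law(ns, values):
    """Fit value(n) = c * n**(-alpha) by least squares in log-log space.

    Minimal sketch of recovering a data point's individualized scaling
    law from (dataset size, marginal contribution) observations.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha = -slope
    c = math.exp(my - slope * mx)
    return c, alpha

# Synthetic contributions that decay exactly as 0.5 * n**(-1)
ns = [10, 100, 1000, 10000]
vals = [0.5 / n for n in ns]
c, alpha = fit_power_law(ns, vals)
print(round(c, 3), round(alpha, 3))  # → 0.5 1.0
```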

[LG-131] Understanding Encoder-Decoder Structures in Machine Learning Using Information Measures

链接: https://arxiv.org/abs/2405.20452
作者: Jorge F. Silva,Victor Faraggi,Camilo Ramirez,Alvaro Egana,Eduardo Pavez
关键词: information-theoretic angle, understand the role, machine learning, encoder-decoder, learning
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present new results to model and understand the role of encoder-decoder design in machine learning (ML) from an information-theoretic angle. We use two main information concepts, information sufficiency (IS) and mutual information loss (MIL), to represent predictive structures in machine learning. Our first main result provides a functional expression that characterizes the class of probabilistic models consistent with an IS encoder-decoder latent predictive structure. This result formally justifies the encoder-decoder forward stages many modern ML architectures adopt to learn latent (compressed) representations for classification. To illustrate IS as a realistic and relevant model assumption, we revisit some known ML concepts and present some interesting new examples: invariant, robust, sparse, and digital models. Furthermore, our IS characterization allows us to tackle the fundamental question of how much performance (predictive expressiveness) could be lost, using the cross entropy risk, when a given encoder-decoder architecture is adopted in a learning setting. Here, our second main result shows that a mutual information loss quantifies the lack of expressiveness attributed to the choice of a (biased) encoder-decoder ML design. Finally, we address the problem of universal cross-entropy learning with an encoder-decoder design where necessary and sufficiency conditions are established to meet this requirement. In all these results, Shannon’s information measures offer new interpretations and explanations for representation learning.

[LG-132] Knockout: A simple way to handle missing inputs

链接: https://arxiv.org/abs/2405.20448
作者: Minh Nguyen,Batuhan K. Karaman,Heejong Kim,Alan Q. Wang,Fengbei Liu,Mert R. Sabuncu
关键词: Deep learning models, Deep learning, extract predictive, predictive and actionable, actionable information
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models can extract predictive and actionable information from complex inputs. The richer the inputs, the better these models usually perform. However, models that leverage rich inputs (e.g., multi-modality) can be difficult to deploy widely, because some inputs may be missing at inference. Current popular solutions to this problem include marginalization, imputation, and training multiple models. Marginalization can obtain calibrated predictions but it is computationally costly and therefore only feasible for low dimensional inputs. Imputation may result in inaccurate predictions because it employs point estimates for missing variables and does not work well for high dimensional inputs (e.g., images). Training multiple models whereby each model takes different subsets of inputs can work well but requires knowing missing input patterns in advance. Furthermore, training and retaining multiple models can be costly. We propose an efficient way to learn both the conditional distribution using full inputs and the marginal distributions. Our method, Knockout, randomly replaces input features with appropriate placeholder values during training. We provide a theoretical justification of Knockout and show that it can be viewed as an implicit marginalization strategy. We evaluate Knockout in a wide range of simulations and real-world datasets and show that it can offer strong empirical performance.
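The core of Knockout is a one-line training-time transform: independently replace input features with a placeholder. A minimal sketch (the placeholder value and drop rate here are assumptions; the paper discusses choosing appropriate placeholders per feature type):

```python
import random

def knockout(features, p_drop=0.3, placeholder=0.0, rng=None):
    """Randomly replace input features with a placeholder value.

    Minimal sketch of the Knockout idea: during training, each feature
    is independently knocked out with probability p_drop, so one model
    learns both the full conditional distribution and the marginals,
    and can handle missing inputs at inference.
    """
    rng = rng or random.Random()
    return [placeholder if rng.random() < p_drop else x for x in features]

rng = random.Random(0)
x = [1.5, -2.0, 3.2, 0.7]
print(knockout(x, p_drop=0.5, rng=rng))  # → [1.5, -2.0, 0.0, 0.0]
```

At inference, a missing feature is simply fed as the same placeholder, which the paper argues acts as an implicit marginalization.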

[LG-133] Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation

链接: https://arxiv.org/abs/2405.20446
作者: Maya Anderson,Guy Amit,Abigail Goldsteen
关键词: Retrieval Augmented Generation, Augmented Generation, natural language processing, shown great promise, Retrieval Augmented
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) systems have shown great promise in natural language processing. However, their reliance on data stored in a retrieval database, which may contain proprietary or sensitive information, introduces new privacy concerns. Specifically, an attacker may be able to infer whether a certain text passage appears in the retrieval database by observing the outputs of the RAG system, an attack known as a Membership Inference Attack (MIA). Despite the significance of this threat, MIAs against RAG systems have so far remained under-explored. This study addresses this gap by introducing an efficient and easy-to-use method for conducting MIA against RAG systems. We demonstrate the effectiveness of our attack using two benchmark datasets and multiple generative models, showing that the membership of a document in the retrieval database can be efficiently determined through the creation of an appropriate prompt in both black-box and gray-box settings. Our findings highlight the importance of implementing security countermeasures in deployed RAG systems to protect the privacy and security of retrieval databases.

[LG-134] GraphAny: A Foundation Model for Node Classification on Any Graph

链接: https://arxiv.org/abs/2405.20445
作者: Jianan Zhao,Hesham Mostafa,Mikhail Galkin,Michael Bronstein,Zhaocheng Zhu,Jian Tang
关键词: revolutionized machine learning, Foundation models, task without requiring, revolutionized machine, machine learning
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Preprint. Work in progress

点击查看摘要

Abstract:Foundation models that can perform inference on any new task without requiring specific training have revolutionized machine learning in vision and language applications. However, applications involving graph-structured data remain a tough nut for foundation models, due to challenges in the unique feature- and label spaces associated with each graph. Traditional graph ML models such as graph neural networks (GNNs) trained on graphs cannot perform inference on a new graph with feature and label spaces different from the training ones. Furthermore, existing models learn functions specific to the training graph and cannot generalize to new graphs. In this work, we tackle these two challenges with a new foundational architecture for inductive node classification named GraphAny. GraphAny models inference on a new graph as an analytical solution to a LinearGNN, thereby solving the first challenge. To solve the second challenge, we learn attention scores for each node to fuse the predictions of multiple LinearGNNs. Specifically, the attention module is carefully parameterized as a function of the entropy-normalized distance-features between multiple LinearGNNs predictions to ensure generalization to new graphs. Empirically, GraphAny trained on the Wisconsin dataset with only 120 labeled nodes can effectively generalize to 30 new graphs with an average accuracy of 67.26% in an inductive manner, surpassing GCN and GAT trained in the supervised regime, as well as other inductive baselines.
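
The "analytical solution to a LinearGNN" can be sketched in a few lines: propagate node features with a normalized adjacency, then solve a ridge least-squares problem on the labeled nodes in closed form. Everything here (the function name, 2-hop propagation, the toy two-triangle graph) is an illustrative assumption, not GraphAny's actual implementation, which additionally fuses several LinearGNNs with a learned attention module.

```python
import numpy as np

def linear_gnn_predict(A, X, Y_train, train_idx, hops=2, ridge=1e-2):
    """One LinearGNN: propagate features with a row-normalized adjacency,
    then fit the label map W in closed form on the labeled nodes."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    H = X.copy()
    for _ in range(hops):                                    # H = P^hops X
        H = P @ H
    Ht = H[train_idx]                                        # labeled rows
    W = np.linalg.solve(Ht.T @ Ht + ridge * np.eye(Ht.shape[1]),
                        Ht.T @ Y_train)                      # ridge solution
    return H @ W                                             # logits for all nodes

# toy graph: two disconnected triangles, features = one-hot community id
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
X = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
Y = np.array([[1, 0], [0, 1]], dtype=float)   # one labeled node per class
logits = linear_gnn_predict(A, X, Y, train_idx=[0, 3])
pred = logits.argmax(axis=1)
```

Because the label map is a closed-form function of the propagated features, no graph-specific training is needed at inference time, which is the property GraphAny exploits.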

[LG-135] Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning

链接: https://arxiv.org/abs/2405.20439
作者: Jacob Mitchell Springer,Vaishnavh Nagarajan,Aditi Raghunathan
关键词: stochastic gradient descent, promising alternative optimizer, Sharpness-Aware Minimization, gradient descent, SAM
类目: Machine Learning (cs.LG)
*备注: 25 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally-proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does not fully explain SAM’s success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features, which gives the remaining features an opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
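
For context, SAM's two-step update (the optimizer being analyzed, not this paper's contribution) can be sketched on a toy quadratic; `sam_step` and the hyperparameter values are illustrative.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights toward
    the (approximate) worst case within a rho-ball, then descend using
    the gradient evaluated at the perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_sharp = grad_fn(w + eps)                   # gradient at worst case
    return w - lr * g_sharp

# toy loss f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([2.0, -1.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
```

The inner ascent step is what distinguishes SAM from SGD; the paper's argument is that this step implicitly down-weights features the model has already fit well.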

[LG-136] Deep Learning for Computing Convergence Rates of Markov Chains

链接: https://arxiv.org/abs/2405.20435
作者: Yanlin Qu,Jose Blanchet,Peter Glynn
关键词: chain Monte Carlo, Markov chain Monte, Monte Carlo, general state-space Markov, state-space Markov chains
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Convergence rate analysis for general state-space Markov chains is fundamentally important in areas such as Markov chain Monte Carlo and algorithmic analysis (for computing explicit convergence bounds). This problem, however, is notoriously difficult because traditional analytical methods often do not generate practically useful convergence bounds for realistic Markov chains. We propose the Deep Contractive Drift Calculator (DCDC), the first general-purpose sample-based algorithm for bounding the convergence of Markov chains to stationarity in Wasserstein distance. The DCDC has two components. First, inspired by the new convergence analysis framework in (Qu this http URL, 2023), we introduce the Contractive Drift Equation (CDE), the solution of which leads to an explicit convergence bound. Second, we develop an efficient neural-network-based CDE solver. Equipped with these two components, DCDC solves the CDE and converts the solution into a convergence bound. We analyze the sample complexity of the algorithm and further demonstrate the effectiveness of the DCDC by generating convergence bounds for realistic Markov chains arising from stochastic processing networks as well as constant step-size stochastic optimization.

[LG-137] Exploring the Practicality of Federated Learning: A Survey Towards the Communication Perspective

链接: https://arxiv.org/abs/2405.20431
作者: Khiem Le,Nhan Luong-Ha,Manh Nguyen-Duc,Danh Le-Phuoc,Cuong Do,Kok-Seng Wong
关键词: enabling collaborative training, Federated Learning, offers significant advancements, centralizing data, paradigm that offers
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a promising paradigm that offers significant advancements in privacy-preserving, decentralized machine learning by enabling collaborative training of models across distributed devices without centralizing data. However, the practical deployment of FL systems faces a significant bottleneck: the communication overhead caused by frequently exchanging large model updates between numerous devices and a central server. This communication inefficiency can hinder training speed, model performance, and the overall feasibility of real-world FL applications. In this survey, we investigate various strategies and advancements made in communication-efficient FL, highlighting their impact and potential to overcome the communication challenges inherent in FL systems. Specifically, we define measures for communication efficiency, analyze sources of communication inefficiency in FL systems, and provide a taxonomy and comprehensive review of state-of-the-art communication-efficient FL methods. Additionally, we discuss promising future research directions for enhancing the communication efficiency of FL systems. By addressing the communication bottleneck, FL can be effectively applied and enable scalable and practical deployment across diverse applications that require privacy-preserving, decentralized machine learning, such as IoT, healthcare, or finance.
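
One concrete member of the communication-efficient FL family such surveys cover is top-k gradient sparsification with error feedback. The sketch below is generic and illustrative (the function name and interface are assumed), not tied to any specific method in the survey.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a client update;
    return the sparse update plus the residual kept locally for
    error feedback on the next round."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.reshape(grad.shape)
    return sparse, grad - sparse                   # residual = what was dropped

g = np.array([0.1, -3.0, 0.02, 2.0, -0.5])
sparse, residual = topk_sparsify(g, k=2)
```

Only the k nonzero entries (values plus indices) need to be transmitted, which is where the communication saving comes from; the residual is added back into the next round's gradient so the dropped mass is not lost.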

[LG-138] Enhancing Performance for Highly Imbalanced Medical Data via Data Regularization in a Federated Learning Setting

链接: https://arxiv.org/abs/2405.20430
作者: Georgios Tsoumplekas,Ilias Siniosoglou,Vasileios Argyriou,Ioannis D. Moscholios,Panagiotis Sarigiannidis
关键词: significantly impacted healthcare, deep learning approaches, increased availability, significantly impacted, impacted healthcare
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increased availability of medical data has significantly impacted healthcare by enabling the application of machine / deep learning approaches in various instances. However, medical datasets are usually small and scattered across multiple providers, suffer from high class-imbalance, and are subject to stringent data privacy constraints. In this paper, the application of a data regularization algorithm, suitable for learning under high class-imbalance, in a federated learning setting is proposed. Specifically, the goal of the proposed method is to enhance model performance for cardiovascular disease prediction by tackling the class-imbalance that typically characterizes datasets used for this purpose, as well as by leveraging patient data available in different nodes of a federated ecosystem without compromising their privacy and enabling more resource-sensitive allocation. The method is evaluated across four datasets for cardiovascular disease prediction, which are scattered across different clients, achieving improved performance. Meanwhile, its robustness under various hyperparameter settings, as well as its ability to adapt to different resource allocation scenarios, is verified.
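
The abstract does not spell out the regularization algorithm itself, so as a hedge, here is the standard baseline such methods compete with: inverse-frequency class weights for a highly imbalanced label vector (function name and normalization are illustrative).

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency, normalized
    so the average weight over the dataset is 1 (a common baseline for
    class-imbalanced losses)."""
    classes, counts = np.unique(labels, return_counts=True)
    w = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance
weights = inverse_frequency_weights(y)
```

Multiplying each sample's loss by its class weight makes the rare (here, positive-disease) class contribute as much to the gradient as the majority class.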

[LG-139] Back to the Basics on Predicting Transfer Performance

链接: https://arxiv.org/abs/2405.20420
作者: Levy Chaves,Eduardo Valle,Alceu Bissoto,Sandra Avila
关键词: deep learning, evolving landscape, landscape of deep, growing number, number of choices
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 3 figures, 2 tables

点击查看摘要

Abstract:In the evolving landscape of deep learning, selecting the best pre-trained models from a growing number of choices is a challenge. Transferability scorers promise to alleviate this scenario, but their recent proliferation, ironically, poses the challenge of their own assessment. In this work, we propose both robust benchmark guidelines for transferability scorers, and a well-founded technique to combine multiple scorers, which we show consistently improves their results. We extensively evaluate 13 scorers from literature across 11 datasets, comprising generalist, fine-grained, and medical imaging datasets. We show that few scorers match the predictive performance of the simple raw metric of models on ImageNet, and that all predictors suffer on medical datasets. Our results highlight the potential of combining different information sources for reliably predicting transferability across varied domains.

[LG-140] Enhancing Antibiotic Stewardship using a Natural Language Approach for Better Feature Representation

链接: https://arxiv.org/abs/2405.20419
作者: Simon A. Lee,Trevor Brokowski,Jeffrey N. Chiang
关键词: global healthcare crisis, undermining the efficacy, rapid emergence, emergence of antibiotic-resistant, antibiotic-resistant bacteria
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The rapid emergence of antibiotic-resistant bacteria is recognized as a global healthcare crisis, undermining the efficacy of life-saving antibiotics. This crisis is driven by the improper and overuse of antibiotics, which escalates bacterial resistance. In response, this study explores the use of clinical decision support systems, enhanced through the integration of electronic health records (EHRs), to improve antibiotic stewardship. However, EHR systems present numerous data-level challenges, complicating the effective synthesis and utilization of data. In this work, we transform EHR data into a serialized textual representation and employ pretrained foundation models to demonstrate how this enhanced feature representation can aid in antibiotic susceptibility predictions. Our results suggest that this text representation, combined with foundation models, provides a valuable tool to increase interpretability and support antibiotic stewardship efforts.
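
The "serialized textual representation" step can be sketched minimally; the field names and phrasing below are hypothetical, not the paper's actual schema.

```python
def serialize_ehr(record: dict) -> str:
    """Flatten one EHR row into a plain-text string that a pretrained
    text foundation model can consume directly."""
    parts = [f"{k.replace('_', ' ')}: {v}" for k, v in record.items()]
    return ". ".join(parts) + "."

text = serialize_ehr({"age": 67, "sex": "F",
                      "specimen_source": "urine",
                      "prior_antibiotics": "ciprofloxacin"})
```

The serialized string can then be embedded by a pretrained language model and fed to a downstream susceptibility classifier, sidestepping hand-engineered tabular features.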

[LG-141] The Impact of Ontology on the Prediction of Cardiovascular Disease Compared to Machine Learning Algorithms

链接: https://arxiv.org/abs/2405.20414
作者: Hakim El Massari,Noreddine Gherabi,Sajida Mhammedi,Hamza Ghandi,Mohamed Bahaj,Muhammad Raza Naqvi
关键词: machine learning, ontology-based Machine Learning, machine learning algorithms, Cardiovascular disease, Machine Learning classification
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardiovascular disease is one of the chronic diseases that is on the rise. The complications occur when cardiovascular disease is not discovered early and correctly diagnosed at the right time. Various machine learning approaches, including ontology-based Machine Learning techniques, have lately played an essential role in medical science by building an automated system that can identify heart illness. This paper compares and reviews the most prominent machine learning algorithms, as well as ontology-based Machine Learning classification. Random Forest, Logistic regression, Decision Tree, Naive Bayes, k-Nearest Neighbours, Artificial Neural Network, and Support Vector Machine were among the classification methods explored. The dataset used consists of 70000 instances and can be downloaded from the Kaggle website. The findings are assessed using performance measures generated from the confusion matrix, such as F-Measure, Accuracy, Recall, and Precision. The results showed that the ontology outperformed all the machine learning algorithms.
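
The performance measures the paper reports all derive from the binary confusion matrix; a minimal implementation (function and key names are illustrative):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F-Measure from the four cells
    of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f1}

m = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
```

On imbalanced medical data such as this 70000-instance dataset, reporting all four measures matters because accuracy alone can mask poor recall on the minority class.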

[LG-142] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

链接: https://arxiv.org/abs/2405.20413
作者: Haibo Jin,Andy Zhou,Joe D. Menke,Haohan Wang
关键词: Large Language Models, Large Language, bypass protective measures, carefully crafted prompts, Language Models
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (about 19.88x) and lower filtered-out rates (about 1/6) than baselines.

[LG-143] Audio2Rig: Artist-oriented deep learning tool for facial animation

链接: https://arxiv.org/abs/2405.20412
作者: Bastien Arcelin,Nicolas Chaverou
关键词: Creating realistic, lip sync animation, tedious task, lip sync rig, lip sync
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Video examples and description: this https URL

点击查看摘要

Abstract:Creating realistic or stylized facial and lip sync animation is a tedious task. It requires a lot of time and skill to sync the lips with audio and convey the right emotion to the character’s face. To allow animators to spend more time on the artistic and creative part of the animation, we present Audio2Rig: a new deep learning based tool leveraging previously animated sequences of a show, to generate facial and lip sync rig animation from an audio file. Based in Maya, it learns from any production rig without any adjustment and generates high quality and stylized animations which mimic the style of the show. Audio2Rig fits in the animator workflow: since it generates keys on the rig controllers, the animation can be easily retaken. The method is based on 3 neural network modules which can learn an arbitrary number of controllers. Hence, different configurations can be created for specific parts of the face (such as the tongue, lips or eyes). With Audio2Rig, animators can also pick different emotions and adjust their intensities to experiment or customize the output, and have high-level controls on the keyframe settings. Our method shows excellent results, generating fine animation details while respecting the show style. Finally, as the training relies on the studio data and is done internally, it ensures data privacy and prevents from copyright infringement.

[LG-144] Private Mean Estimation with Person-Level Differential Privacy

链接: https://arxiv.org/abs/2405.20405
作者: Sushant Agarwal,Gautam Kamath,Mahbod Majid,Argyris Mouzakis,Rose Silver,Jonathan Ullman
关键词: study differentially private, person holds multiple, differentially private, study differentially, holds multiple samples
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 67 pages, 3 figures

点击查看摘要

Abstract:We study differentially private (DP) mean estimation in the case where each person holds multiple samples. Commonly referred to as the “user-level” setting, DP here requires the usual notion of distributional stability when all of a person’s datapoints can be modified. Informally, if n people each have m samples from an unknown d-dimensional distribution with bounded k-th moments, we show that \[ n = \tilde{\Theta}\left( \frac{d}{\alpha^2 m} + \frac{d}{\alpha m^{1/2} \varepsilon} + \frac{d}{\alpha^{k/(k-1)} m \varepsilon} + \frac{d}{\varepsilon} \right) \] people are necessary and sufficient to estimate the mean up to distance \alpha in \ell_2-norm under \varepsilon-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate DP (with slightly degraded sample complexity) and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the well known noisy-clipped-mean approach, but the analysis for our setting requires new bounds on the tails of sums of independent, vector-valued, bounded-moments random variables, and a new argument for bounding the bias introduced by clipping.
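
The "noisy-clipped-mean approach" mentioned above can be sketched for the (ε, δ) relaxation. The sensitivity bookkeeping below is the textbook Gaussian-mechanism version — an assumption for illustration, not the paper's exact estimator, whose clipping analysis is considerably more refined.

```python
import numpy as np

def user_level_dp_mean(user_samples, clip_norm, eps, delta, rng=None):
    """Noisy-clipped-mean sketch for user-level (eps, delta)-DP:
    average each user's samples, clip user means to an l2 ball of radius
    clip_norm, average across users, and add Gaussian noise calibrated
    to the l2 sensitivity 2 * clip_norm / n of that average (swapping
    one user's entire data changes the average by at most that much)."""
    rng = rng or np.random.default_rng(0)
    means = np.stack([s.mean(axis=0) for s in user_samples])
    norms = np.linalg.norm(means, axis=1, keepdims=True)
    clipped = means * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    n, d = clipped.shape
    sensitivity = 2.0 * clip_norm / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)

# 200 users, 50 samples each, true mean (1, 1)
users = [np.random.default_rng(i).normal(1.0, 0.1, size=(50, 2))
         for i in range(200)]
est = user_level_dp_mean(users, clip_norm=5.0, eps=1.0, delta=1e-5)
```

Note how the noise scale shrinks with the number of users n but not with the per-user sample count m alone, which is why the user-level sample-complexity trade-off above involves both quantities.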

[LG-145] XPrompt:Explaining Large Language Models Generation via Joint Prompt Attribution

链接: https://arxiv.org/abs/2405.20404
作者: Yurui Chang,Bochuan Cao,Yujia Wang,Jinghui Chen,Lu Lin
关键词: Large Language Models, demonstrated impressive performances, Large Language, demonstrated impressive, impressive performances
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of elucidating and explaining the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to be classification or next-word prediction. Few initial attempts aiming to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on joint prompt attribution, XPrompt, which aims to explain how a few prompt texts collaboratively influence the LLM’s complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the causal input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both faithfulness and efficiency of our framework.

[LG-146] Explainable Data-driven Modeling of Adsorption Energy in Heterogeneous Catalysis

链接: https://arxiv.org/abs/2405.20397
作者: Tirtha Vinchurkar,Janghoon Ock,Amir Barati Farimani
关键词: adsorption energy, techniques, XAI, increasing popularity, catalysis has spurred
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:The increasing popularity of machine learning (ML) in catalysis has spurred interest in leveraging these techniques to enhance catalyst design. Our study aims to bridge the gap between physics-based studies and data-driven methodologies by integrating ML techniques with eXplainable AI (XAI). Specifically, we employ two XAI techniques: Post-hoc XAI analysis and Symbolic Regression. These techniques help us unravel the correlation between adsorption energy and the properties of the adsorbate-catalyst system. Leveraging a large dataset such as the Open Catalyst Dataset (OC20), we employ a combination of shallow ML techniques and XAI methodologies. Our investigation involves utilizing multiple shallow machine learning techniques to predict adsorption energy, followed by post-hoc analysis for feature importance, inter-feature correlations, and the influence of various feature values on the prediction of adsorption energy. The post-hoc analysis reveals that adsorbate properties exert a greater influence than catalyst properties in our dataset. The top five features based on higher Shapley values are adsorbate electronegativity, the number of adsorbate atoms, catalyst electronegativity, effective coordination number, and the sum of atomic numbers of the adsorbate molecule. There is a positive correlation between catalyst and adsorbate electronegativity with the prediction of adsorption energy. Additionally, symbolic regression yields results consistent with SHAP analysis. It deduces a mathematical relationship indicating that the square of the catalyst electronegativity is directly proportional to the adsorption energy. These consistent correlations resemble those derived from physics-based equations in previous research. Our work establishes a robust framework that integrates ML techniques with XAI, leveraging large datasets like OC20 to enhance catalyst design through model explainability.
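
The post-hoc feature-importance idea can be illustrated without the SHAP library via permutation importance, a closely related model-agnostic technique — a stand-in for, not a reproduction of, the paper's SHAP analysis. All names and the toy model below are hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, rng=None):
    """Model-agnostic importance: the increase in MSE when one feature
    column is shuffled, breaking its relationship with the target."""
    rng = rng or np.random.default_rng(0)
    base = np.mean((predict(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 2]   # feature 0 dominates, feature 1 is noise
imp = permutation_importance(lambda A: 3.0 * A[:, 0] + 0.1 * A[:, 2], X, y)
```

Like Shapley values, permutation scores rank features by their contribution to the prediction — here the dominant feature stands out clearly — but they do not decompose individual predictions the way SHAP does.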

[LG-147] Quantitative Convergences of Lie Group Momentum Optimizers

链接: https://arxiv.org/abs/2405.20390
作者: Lingkai Kong,Molei Tao
关键词: optimize functions defined, momentum trivialization, Lie, optimize functions, functions defined
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Explicit, momentum-based dynamics that optimize functions defined on Lie groups can be constructed via variational optimization and momentum trivialization. Structure preserving time discretizations can then turn this dynamics into optimization algorithms. This article investigates two types of discretization, Lie Heavy-Ball, which is a known splitting scheme, and Lie NAG-SC, which is newly proposed. Their convergence rates are explicitly quantified under L -smoothness and local strong convexity assumptions. Lie NAG-SC provides acceleration over the momentumless case, i.e. Riemannian gradient descent, but Lie Heavy-Ball does not. When compared to existing accelerated optimizers for general manifolds, both Lie Heavy-Ball and Lie NAG-SC are computationally cheaper and easier to implement, thanks to their utilization of group structure. Only gradient oracle and exponential map are required, but not logarithm map or parallel transport which are computational costly.

[LG-148] Medication Recommendation via Dual Molecular Modalities and Multi-Substructure Distillation

链接: https://arxiv.org/abs/2405.20358
作者: Shi Mu,Shunpan Liang,Xiang Li
关键词: combines patient medical, Medication recommendation combines, determining medication combinations, recommendation combines patient, accurately and safely
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Medication recommendation combines patient medical history with biomedical knowledge to assist doctors in determining medication combinations more accurately and safely. Existing approaches based on molecular knowledge overlook the atomic geometric structure of molecules, failing to capture the high-dimensional characteristics and intrinsic physical properties of medications, leading to structural confusion and the inability to extract useful substructures from individual patient visits. To address these limitations, we propose BiMoRec, which overcomes the inherent lack of essential molecular information in 2D molecular structures by incorporating 3D molecular structures and atomic properties. To retain the fast response required of recommendation systems, BiMoRec maximizes the mutual information between the two molecular modalities through bimodal graph contrastive learning, achieving the integration of 2D and 3D molecular graphs, and finally distills substructures through interaction with single patient visits. Specifically, we use deep learning networks to construct a pre-training method to obtain representations of 2D and 3D molecular structures and substructures, and we use contrastive learning to derive mutual information. Subsequently, we generate fused molecular representations through a trained GNN module, re-determining the relevance of substructure representations in conjunction with the patient’s clinical history information. Finally, we generate the final medication combination based on the extracted substructure sequences. Our implementation on the MIMIC-III and MIMIC-IV datasets demonstrates that our method achieves state-of-the-art performance. Compared to the next best baseline, our model improves accuracy by 1.8% while maintaining the same level of DDI as the baseline.

[LG-149] Enhancing Adversarial Robustness in SNNs with Sparse Gradients

链接: https://arxiv.org/abs/2405.20355
作者: Yujia Liu,Tong Bu,Jianhao Ding,Zecheng Hao,Tiejun Huang,Zhaofei Yu
关键词: Spiking Neural Networks, Artificial Neural Networks, Neural Networks, Spiking Neural, Artificial Neural
类目: Neural and Evolutionary Computing (cs.NE); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by ICML 2024

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) have attracted great attention for their energy-efficient operations and biologically inspired structures, offering potential advantages over Artificial Neural Networks (ANNs) in terms of energy efficiency and interpretability. Nonetheless, similar to ANNs, the robustness of SNNs remains a challenge, especially when facing adversarial attacks. Existing techniques, whether adapted from ANNs or specifically designed for SNNs, exhibit limitations in training SNNs or defending against strong attacks. In this paper, we propose a novel approach to enhance the robustness of SNNs through gradient sparsity regularization. We observe that SNNs exhibit greater resilience to random perturbations compared to adversarial perturbations, even at larger scales. Motivated by this, we aim to narrow the gap between SNNs under adversarial and random perturbations, thereby improving their overall robustness. To achieve this, we theoretically prove that this performance gap is upper bounded by the gradient sparsity of the probability associated with the true label concerning the input image, laying the groundwork for a practical strategy to train robust SNNs by regularizing the gradient sparsity. We validate the effectiveness of our approach through extensive experiments on both image-based and event-based datasets. The results demonstrate notable improvements in the robustness of SNNs. Our work highlights the importance of gradient sparsity in SNNs and its role in enhancing robustness.

[LG-150] Literature Filtering for Systematic Reviews with Transformers

链接: https://arxiv.org/abs/2405.20354
作者: John Hawkins,David Tivey
关键词: Identifying critical research, Identifying critical, growing body, body of academic, essential element
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying critical research within the growing body of academic work is an essential element of quality research. Systematic review processes, used in evidence-based medicine, formalise this as a procedure that must be followed in a research program. However, it comes with an increasing burden in terms of the time required to identify the important articles of research for a given topic. In this work, we develop a method for building a general-purpose filtering system that matches a research question, posed as a natural language description of the required content, against a candidate set of articles obtained via the application of broad search terms. Our results demonstrate that transformer models, pre-trained on biomedical literature then fine-tuned for the specific task, offer a promising solution to this problem. The model can remove large volumes of irrelevant articles for most research questions.
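
The filtering pipeline — scoring candidate abstracts against a natural-language research question and keeping those above a threshold — can be sketched with a trivially cheap relevance score. Token-set Jaccard overlap here is a stand-in for the fine-tuned transformer scorer; all names and the threshold are illustrative.

```python
def jaccard_relevance(question: str, abstract: str) -> float:
    """Token-set Jaccard overlap as a cheap stand-in for a fine-tuned
    transformer's relevance score."""
    q, a = set(question.lower().split()), set(abstract.lower().split())
    return len(q & a) / len(q | a) if q | a else 0.0

def filter_candidates(question, abstracts, threshold=0.1):
    """Keep abstracts scoring above the threshold, best first."""
    scored = sorted(((jaccard_relevance(question, a), a) for a in abstracts),
                    reverse=True)
    return [a for s, a in scored if s >= threshold]

question = "antibiotic resistance prediction from electronic health records"
abstracts = [
    "we predict antibiotic resistance using electronic health records",
    "a survey of quantum error correction codes",
]
kept = filter_candidates(question, abstracts)
```

A transformer scorer replaces the lexical overlap with semantic similarity, which is what lets it match questions and abstracts that share meaning but not vocabulary.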

[LG-151] ADR-BC: Adversarial Density Weighted Regression Behavior Cloning

链接: https://arxiv.org/abs/2405.20351
作者: Ziqi Zhang,Zifeng Zhuang,Donglin Wang,Jingzehua Xu,Miao Liu,Shuai Zhang
关键词: traditional Imitation Learning, Imitation Learning, traditional Imitation, impacting policy learning, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Typically, traditional Imitation Learning (IL) methods first shape a reward or Q function and then use this shaped function within a reinforcement learning (RL) framework to optimize the empirical policy. However, if the shaped reward/Q function does not adequately represent the ground truth reward/Q function, updating the policy within a multi-step RL framework may result in cumulative bias, further impacting policy learning. Although utilizing behavior cloning (BC) to learn a policy by directly mimicking a few demonstrations in a single-step updating manner can avoid cumulative bias, BC tends to greedily imitate demonstrated actions, limiting its capacity to generalize to unseen state action pairs. To address these challenges, we propose ADR-BC, which aims to enhance behavior cloning through augmented density-based action support, optimizing the policy with this augmented support. Specifically, the objective of ADR-BC shares a similar physical meaning: matching the expert distribution while diverging from the sub-optimal distribution. Therefore, ADR-BC can achieve more robust expert distribution matching. Meanwhile, as a one-step behavior cloning framework, ADR-BC avoids the cumulative bias associated with multi-step RL frameworks. To validate the performance of ADR-BC, we conduct extensive experiments. Specifically, ADR-BC showcases a 10.5% improvement over the previous state-of-the-art (SOTA) generalized IL baseline, CEIL, across all tasks in the Gym-Mujoco domain. Additionally, it achieves an 89.5% improvement over Implicit Q Learning (IQL) using real rewards across all tasks in the Adroit and Kitchen domains. On the other hand, we conduct extensive ablations to further demonstrate the effectiveness of ADR-BC.

[LG-152] Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges

链接: https://arxiv.org/abs/2405.20350
作者: Hari Srikanth
关键词: Regional Policy Optimization, Proximal Policy Optimization, Trust Regional Policy, Policy Optimization, Trust Regional
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Network based approximations of the Value function make up the core of leading Policy Based methods such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO). While this adds significant value when dealing with very complex environments, we note that in sufficiently low-dimensional state and action space environments, a computationally expensive Neural Network architecture offers marginal improvement over simpler Value approximation methods. We present an implementation of Natural Actor Critic algorithms with actor updates through Natural Policy Gradient methods. This paper proposes that Natural Policy Gradient (NPG) methods with Linear Function Approximation as a paradigm for value approximation may surpass the performance and speed of Neural Network based models such as TRPO and PPO within these environments. On the Reinforcement Learning benchmarks Cart Pole and Acrobot, we observe that our algorithm trains much faster than complex neural network architectures, and obtains an equivalent or greater result. This allows us to recommend the use of NPG methods with Linear Function Approximation over TRPO and PPO for both traditional and sparse reward low dimensional problems.
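
The core natural-policy-gradient idea with simple function approximation can be sketched on a softmax bandit, where the Fisher information matrix is available in closed form. This is an illustrative toy, not the paper's Cart Pole / Acrobot implementation; the damping term and step size are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def npg_step(theta, rewards, lr=0.5, damping=1e-4):
    """One natural policy gradient step for a softmax bandit policy:
    precondition the vanilla gradient of J = sum_a pi(a) r(a) with the
    (damped) inverse Fisher information matrix."""
    p = softmax(theta)
    grad = p * (rewards - p @ rewards)         # dJ/dtheta_a = p_a (r_a - J)
    fisher = np.diag(p) - np.outer(p, p)       # E[score score^T] for softmax
    return theta + lr * np.linalg.solve(fisher + damping * np.eye(len(p)), grad)

theta = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.5])
for _ in range(50):
    theta = npg_step(theta, rewards)
probs = softmax(theta)
```

With linear features the Fisher matrix stays small and cheap to invert, which is the computational advantage the paper argues for over neural-network policies.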

[LG-153] Small Language Models for Application Interactions: A Case Study

链接: https://arxiv.org/abs/2405.20347
作者: Beibin Li,Yi Zhang,Sébastien Bubeck,Jeevan Pathuri,Ishai Menache
关键词: natural language interactions, language interactions, facilitating application usage, Small Language Models, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the efficacy of Small Language Models (SLMs) in facilitating application usage through natural language interactions. Our focus here is on a particular internal application used in Microsoft for cloud supply chain fulfilment. Our experiments show that small models can outperform much larger ones in terms of both accuracy and running time, even when fine-tuned on small datasets. Alongside these results, we also highlight SLM-based system design considerations.

[LG-154] PUAL: A Classifier on Trifurcate Positive-Unlabeled Data

链接: https://arxiv.org/abs/2405.20970
作者: Xiaoke Wang,Xiaochen Yang,Rui Zhu,Jing-Hao Xue
关键词: aims to train, Positive-unlabeled, instances, trifurcate data, learning aims
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 6 figures

点击查看摘要

Abstract:Positive-unlabeled (PU) learning aims to train a classifier using data containing only labeled-positive instances and unlabeled instances. However, existing PU learning methods generally struggle to achieve satisfactory performance on trifurcate data, where the positive instances are distributed on both sides of the negative instances. To address this issue, we first propose a PU classifier with asymmetric loss (PUAL), by introducing a structure of asymmetric loss on positive instances into the objective function of the global and local learning classifier. We then develop a kernel-based algorithm to enable PUAL to obtain a non-linear decision boundary. Through experiments on both simulated and real-world datasets, we show that PUAL achieves satisfactory classification on trifurcate data.
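编者按:该文的核心想法之一是对正例施加非对称损失。下面给出一个示意性的非对称 hinge 损失(折减系数 alpha 与具体形式均为假设,并非 PUAL 的精确目标函数):

```python
def asymmetric_hinge(score, y, alpha=0.3):
    """Toy asymmetric loss for PU learning: a labeled positive on the
    'right' side of the boundary pays only a down-weighted margin penalty
    (alpha), while one on the wrong side pays the full penalty; unlabeled
    points are treated as negatives with the usual symmetric hinge.
    Illustrative only -- not the exact PUAL objective."""
    if y == 1:                       # labeled positive
        slack = max(1.0 - score, 0.0)
        return alpha * slack if score >= 0 else slack
    return max(1.0 + score, 0.0)     # unlabeled, treated as negative
```

这种不对称性让位于负例两侧的正例都能以较低代价落在决策边界的正侧,从而适配三分叉(trifurcate)分布。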

[LG-155] Analysis of clinical dosimetric and radiomic features for predicting local failure after stereotactic radiotherapy of brain metastases in malignant melanoma

链接: https://arxiv.org/abs/2405.20825
作者: Nanna E. Hartong,Ilias Sachpazidis,Oliver Blanck,Lucas Etzel,Jan C. Peeken,Stephanie E. Combs,Horst Urbach,Maxim Zaitsev,Dimos Baltas,Ilinca Popp,Anca-Ligia Grosu,Tobias Fechter
关键词: pretherapeutic magnetic resonance, magnetic resonance imaging, malignant melanoma, investigate the role, magnetic resonance
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: The aim of this study was to investigate the role of clinical, dosimetric and pretherapeutic magnetic resonance imaging (MRI) features for lesion-specific outcome prediction of stereotactic radiotherapy (SRT) in patients with brain metastases from malignant melanoma (MBM). Methods: In this multicenter, retrospective analysis, we reviewed 517 MBM from 130 patients treated with SRT (single fraction or hypofractionated). For each gross tumor volume (GTV) 1576 radiomic features (RF) were calculated (788 each for the GTV and for a 3 mm margin around the GTV). Clinical parameters, radiation dose and RF from pretherapeutic contrast-enhanced T1-weighted MRI from different institutions were evaluated with a feature processing and elimination pipeline in a nested cross-validation scheme. Results: Seventy-two (72) of 517 lesions (13.9%) showed a local failure (LF) after SRT. The processing pipeline showed clinical, dosimetric and radiomic features providing information for LF prediction. The most prominent ones were the correlation of the gray level co-occurrence matrix of the margin (hazard ratio (HR): 0.37, confidence interval (CI): 0.23-0.58) and systemic therapy before SRT (HR: 0.55, CI: 0.42-0.70). The majority of RF associated with LF was calculated in the margin around the GTV. Conclusions: Pretherapeutic MRI based RF connected with lesion-specific outcome after SRT could be identified, despite multicentric data and minor differences in imaging protocols. Image data analysis of the surrounding metastatic environment may provide therapy-relevant information with the potential to further individualize radiotherapy strategies.

[LG-156] Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures

链接: https://arxiv.org/abs/2405.20799
作者: Fernando Moreno-Pino,Álvaro Arroyo,Harrison Waldon,Xiaowen Dong,Álvaro Cartea
关键词: real-world settings typically, settings typically exhibit, typically exhibit long-range, Time-series data, exhibit long-range dependencies
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. Under review. arXiv admin note: text overlap with arXiv:2403.10288

点击查看摘要

Abstract:Time-series data in real-world settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In these settings, traditional sequence-based recurrent models struggle. To overcome this, researchers often replace recurrent architectures with Neural ODE-based models to account for irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of even moderate length. To address this challenge, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly lower computational costs. In particular, we propose \textitmulti-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global (multi-scale) dependencies in the input data, while remaining robust to changes in the sequence length and sampling frequency and yielding improved spatial processing. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the representational benefits of Neural ODE-based models, all at a fraction of the computational time and memory resources.
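编者按:路径签名(path signature)是该文多视角签名注意力的核心构件。下面用 Chen 关系给出分段线性路径二阶签名的极简实现(仅作示意,实际工作中通常使用专门的签名库):

```python
import numpy as np

def signature_lvl2(path):
    """Depth-2 path signature of a piecewise-linear path (T x d array):
    level 1 is the total increment, level 2 the iterated integrals,
    accumulated segment by segment via Chen's relation."""
    d = path.shape[1]
    s1 = np.zeros(d)
    s2 = np.zeros((d, d))
    for dx in np.diff(path, axis=0):
        # Chen: S2 <- S2 + s1 (x) dx + 0.5 * dx (x) dx
        s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
        s1 += dx
    return s1, s2
```

签名对路径的重参数化不变:在同一条折线段上加入中间采样点不会改变签名,这正是该模型对序列长度与采样频率变化保持鲁棒的数学原因。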

[LG-157] Improving Paratope and Epitope Prediction by Multi-Modal Contrastive Learning and Interaction Informativeness Estimation

链接: https://arxiv.org/abs/2405.20668
作者: Zhiwei Wang,Yongkang Wang,Wen Zhang
关键词: Accurately predicting antibody-antigen, Accurately predicting, Multi-modal contrastive learning, predicting antibody-antigen binding, paratopes and epitopes
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: This paper is accepted by IJCAI 2024

点击查看摘要

Abstract:Accurately predicting antibody-antigen binding residues, i.e., paratopes and epitopes, is crucial in antibody design. However, existing methods solely focus on uni-modal data (either sequence or structure), disregarding the complementary information present in multi-modal data, and most methods predict paratopes and epitopes separately, overlooking their specific spatial interactions. In this paper, we propose a novel Multi-modal contrastive learning and Interaction informativeness estimation-based method for Paratope and Epitope prediction, named MIPE, by using both sequence and structure data of antibodies and antigens. MIPE implements a multi-modal contrastive learning strategy, which maximizes representations of binding and non-binding residues within each modality and meanwhile aligns uni-modal representations towards effective modal representations. To exploit the spatial interaction information, MIPE also incorporates an interaction informativeness estimation that computes the estimated interaction matrices between antibodies and antigens, thereby approximating them to the actual ones. Extensive experiments demonstrate the superiority of our method compared to baselines. Additionally, the ablation studies and visualizations demonstrate the superiority of MIPE owing to the better representations acquired through multi-modal contrastive learning and the interaction patterns comprehended by the interaction informativeness estimation.

[LG-158] Weak-Form Inference for Hybrid Dynamical Systems in Ecology

链接: https://arxiv.org/abs/2405.20591
作者: Daniel Messenger,Greg Dwyer,Vanja Dukic
关键词: environmental threats commonly, threats commonly exhibit, Species subject, commonly exhibit variable, exhibit variable periods
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Species subject to predation and environmental threats commonly exhibit variable periods of population boom and bust over long timescales. Understanding and predicting such behavior, especially given the inherent heterogeneity and stochasticity of exogenous driving factors over short timescales, is an ongoing challenge. A modeling paradigm gaining popularity in the ecological sciences for such multi-scale effects is to couple short-term continuous dynamics to long-term discrete updates. We develop a data-driven method utilizing weak-form equation learning to extract such hybrid governing equations for population dynamics and to estimate the requisite parameters using sparse intermittent measurements of the discrete and continuous variables. The method produces a set of short-term continuous dynamical system equations parametrized by long-term variables, and long-term discrete equations parametrized by short-term variables, allowing direct assessment of interdependencies between the two time scales. We demonstrate the utility of the method on a variety of ecological scenarios and provide extensive tests using models previously derived for epizootics experienced by the North American spongy moth (Lymantria dispar dispar).

[LG-159] Hybrid Reinforcement Learning Framework for Mixed-Variable Problems

链接: https://arxiv.org/abs/2405.20500
作者: Haoyan Zhai,Qianli Hu,Jiangning Chen
关键词: complex solution landscapes, presenting unique challenges, unique challenges due, Optimization problems characterized, Bayesian Optimization
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimization problems characterized by both discrete and continuous variables are common across various disciplines, presenting unique challenges due to their complex solution landscapes and the difficulty of navigating mixed-variable spaces effectively. To address these challenges, we introduce a hybrid Reinforcement Learning (RL) framework that synergizes RL for discrete variable selection with Bayesian Optimization for continuous variable adjustment. This framework stands out by its strategic integration of RL and continuous optimization techniques, enabling it to dynamically adapt to the problem’s mixed-variable nature. By employing RL for exploring discrete decision spaces and Bayesian Optimization to refine continuous parameters, our approach not only demonstrates flexibility but also enhances optimization performance. Our experiments on synthetic functions and real-world machine learning hyperparameter tuning tasks reveal that our method consistently outperforms traditional RL, random search, and standalone Bayesian optimization in terms of effectiveness and efficiency.
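编者按:下面用一个玩具示例勾勒"离散部分用 RL、连续部分用贝叶斯优化"的交替框架:离散侧以 ε-greedy 价值估计代替完整的 RL 组件,连续侧以局部高斯扰动代替真正的贝叶斯优化,目标函数亦为假设示例:

```python
import numpy as np

rng = np.random.default_rng(1)

def objective(k, x):
    """Toy mixed-variable objective (hypothetical): maximized at k=2, x=0.5."""
    return -(k - 2) ** 2 - (x - 0.5) ** 2

n_arms = 4
q = np.zeros(n_arms)                   # value estimate per discrete choice
counts = np.zeros(n_arms)
best_x = {k: 0.0 for k in range(n_arms)}

for t in range(300):
    # Discrete part: epsilon-greedy selection (stand-in for the RL component).
    k = int(rng.integers(n_arms)) if rng.random() < 0.2 else int(np.argmax(q))
    # Continuous part: local Gaussian proposal around the incumbent
    # (a crude stand-in for the Bayesian-optimization component).
    x = best_x[k] + rng.normal(scale=0.3)
    r = objective(k, x)
    if r > objective(k, best_x[k]):
        best_x[k] = x
    counts[k] += 1
    q[k] += (r - q[k]) / counts[k]     # incremental mean update

best_k = int(np.argmax(q))
```

两个组件通过共享的奖励信号耦合:离散选择影响连续搜索的起点,连续搜索的结果又更新离散价值估计。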

[LG-160] Statistical Properties of Robust Satisficing

链接: https://arxiv.org/abs/2405.20451
作者: Zhiyi Li,Yunbei Xu,Ruohan Zhan
关键词: offering streamlined procedures, Robust Satisficing, Distributionally Robust Optimization, offering streamlined, emerging approach
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The Robust Satisficing (RS) model is an emerging approach to robust optimization, offering streamlined procedures and robust generalization across various applications. However, the statistical theory of RS remains unexplored in the literature. This paper fills in the gap by comprehensively analyzing the theoretical properties of the RS model. Notably, the RS structure offers a more straightforward path to deriving statistical guarantees compared to the seminal Distributionally Robust Optimization (DRO), resulting in a richer set of results. In particular, we establish two-sided confidence intervals for the optimal loss without the need to solve a minimax optimization problem explicitly. We further provide finite-sample generalization error bounds for the RS optimizer. Importantly, our results extend to scenarios involving distribution shifts, where discrepancies exist between the sampling and target distributions. Our numerical experiments show that the RS model consistently outperforms the baseline empirical risk minimization in small-sample regimes and under distribution shifts. Furthermore, compared to the DRO model, the RS model exhibits lower sensitivity to hyperparameter tuning, highlighting its practicability for robustness considerations.
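编者按:摘要未给出模型定义,这里按 Long、Sim 与 Zhou 提出的鲁棒满足化(robust satisficing)模型补充其大致形式(符号为通用记法:τ 为目标损失水平,\hat{P} 为经验分布,Δ 为分布间距离,未必与本文记号一致):

```latex
\begin{aligned}
\min_{\theta,\; k \ge 0} \quad & k \\
\text{s.t.} \quad & \mathbb{E}_{P}\!\left[\ell(\theta,\xi)\right] \;\le\; \tau + k\,\Delta\!\left(P,\hat{P}\right) \quad \text{for all } P .
\end{aligned}
```

与 DRO 先固定模糊集半径再做极小化极大不同,RS 固定可接受的目标损失 τ,最小化"脆弱度" k,这也是摘要所称统计保证推导更直接的结构来源。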

[LG-161] Algorithmic Fairness in Performative Policy Learning: Escaping the Impossibility of Group Fairness

链接: https://arxiv.org/abs/2405.20447
作者: Seamus Somerstep,Ya’acov Ritov,Yuekai Sun
关键词: predictive model affects, prediction target, predictive model, prediction problems, model affects
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many prediction problems, the predictive model affects the distribution of the prediction target. This phenomenon is known as performativity and is often caused by the behavior of individuals with vested interests in the outcome of the predictive model. Although performativity is generally problematic because it manifests as distribution shifts, we develop algorithmic fairness practices that leverage performativity to achieve stronger group fairness guarantees in social classification problems (compared to what is achievable in non-performative settings). In particular, we leverage the policymaker’s ability to steer the population to remedy inequities in the long term. A crucial benefit of this approach is that it is possible to resolve the incompatibilities between conflicting group fairness definitions.

[LG-162] Convolutional L2LFlows: Generating Accurate Showers in Highly Granular Calorimeters Using Convolutional Normalizing Flows

链接: https://arxiv.org/abs/2405.20407
作者: Thorsten Buss,Frank Gaede,Gregor Kasieczka,Claudius Krause,David Shih
关键词: build generative surrogate, computationally efficient alternatives, generated samples remains, generative surrogate models, rule-based simulations
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:In the quest to build generative surrogate models as computationally efficient alternatives to rule-based simulations, the quality of the generated samples remains a crucial frontier. So far, normalizing flows have been among the models with the best fidelity. However, as the latent space in such models is required to have the same dimensionality as the data space, scaling up normalizing flows to high dimensional datasets is not straightforward. The prior L2LFlows approach successfully used a series of separate normalizing flows and sequence of conditioning steps to circumvent this problem. In this work, we extend L2LFlows to simulate showers with a 9-times larger profile in the lateral direction. To achieve this, we introduce convolutional layers and U-Net-type connections, move from masked autoregressive flows to coupling layers, and demonstrate the successful modelling of showers in the ILD Electromagnetic Calorimeter as well as Dataset 3 from the public CaloChallenge dataset.
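编者按:文中提到从掩码自回归流改为耦合层(coupling layers)。下面是 RealNVP 风格仿射耦合层的极简 numpy 示意(线性条件网络为假设简化,实际模型使用卷积与 U-Net 型连接):

```python
import numpy as np

def coupling_forward(x, w, b):
    """Affine coupling layer sketch: the first half of the dimensions passes
    through unchanged and conditions a scale/shift of the second half.
    w, b parametrize a toy linear conditioner (hypothetical)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s = np.tanh(x1 @ w)              # log-scale, squashed for stability
    t = x1 @ b                       # shift
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=-1)         # Jacobian log-determinant is just sum(s)
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, w, b):
    """Exact inverse, available in closed form -- the key draw of coupling layers."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s = np.tanh(y1 @ w)
    t = y1 @ b
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)
```

前一半维度原样通过并决定后一半的缩放/平移,因此逆变换与雅可比行列式都有闭式,这也是其采样比掩码自回归流更快的原因。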

[LG-163] Fast leave-one-cluster-out cross-validation by clustered Network Information Criteria (NICc)

链接: https://arxiv.org/abs/2405.20400
作者: Jiaxing Qiu,Douglas E. Lake,Teague R. Henry
关键词: Network Information Criterion, Akaike Information Criterion, Information Criterion, Bayesian Information Criterion, Network Information
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper introduces a clustered estimator of the Network Information Criterion (NICc) to approximate leave-one-cluster-out cross-validated deviance, which can be used as an alternative to cluster-based cross-validation when modeling clustered data. Stone proved that the Akaike Information Criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation if the parametric model is true. Ripley pointed out that the Network Information Criterion (NIC) derived in Stone’s proof is a better approximation to leave-one-observation-out cross-validation when the model is not true. For clustered data, we derived a clustered estimator of NIC, referred to as NICc, by substituting the Fisher information matrix in NIC with its estimator that adjusts for clustering. This adjustment imposes a larger penalty in NICc than the unclustered estimator of NIC when modeling clustered data, thereby preventing overfitting more effectively. In a simulation study and an empirical example, we used linear and logistic regression to model clustered data with Gaussian or binomial response, respectively. We showed that NICc is a better approximation to leave-one-cluster-out deviance and prevents overfitting more effectively than AIC and Bayesian Information Criterion (BIC). NICc leads to more accurate model selection, as determined by cluster-based cross-validation, compared to AIC and BIC.
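编者按:NIC 的惩罚项为 tr(J⁻¹K)(J 为观测信息矩阵,K 为得分向量外积的"三明治"中间项);按摘要的描述,聚类版相当于先把得分按簇求和再构造 K。下面是该思路的示意实现(非作者代码):

```python
import numpy as np

def nic_penalty(scores, hessian, clusters=None):
    """Sandwich-style trace penalty tr(J^{-1} K) behind NIC. For the
    clustered variant (NICc-style), K is built from score vectors summed
    within each cluster. A sketch of the idea, not the authors' code."""
    if clusters is not None:
        scores = np.stack([scores[clusters == c].sum(axis=0)
                           for c in np.unique(clusters)])
    K = scores.T @ scores              # outer-product "meat" of the sandwich
    return float(np.trace(np.linalg.solve(hessian, K)))
```

当同簇观测的得分正相关时,簇内求和使 K 变大,聚类版惩罚随之大于普通版,这与摘要所述"更大的惩罚、更有效地防止过拟合"一致。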

[LG-164] Recurrent neural network wave functions for Rydberg atom arrays on kagome lattice

链接: https://arxiv.org/abs/2405.20384
作者: Mohamed Hibat-Allah,Ejaaz Merali,Giacomo Torlai,Roger G Melko,Juan Carrasquilla
关键词: Rydberg atom array, preparing strongly-correlated phases, Rydberg atom, conventional computer simulations, powerful quantum simulators
类目: Quantum Gases (cond-mat.quant-gas); Disordered Systems and Neural Networks (cond-mat.dis-nn); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 13 pages, 5 figures, 3 tables. Link to GitHub repository: this https URL

点击查看摘要

Abstract:Rydberg atom array experiments have demonstrated the ability to act as powerful quantum simulators, preparing strongly-correlated phases of matter which are challenging to study for conventional computer simulations. A key direction has been the implementation of interactions on frustrated geometries, in an effort to prepare exotic many-body states such as spin liquids and glasses. In this paper, we apply two-dimensional recurrent neural network (RNN) wave functions to study the ground states of Rydberg atom arrays on the kagome lattice. We implement an annealing scheme to find the RNN variational parameters in regions of the phase diagram where exotic phases may occur, corresponding to rough optimization landscapes. For Rydberg atom array Hamiltonians studied previously on the kagome lattice, our RNN ground states show no evidence of exotic spin liquid or emergent glassy behavior. In the latter case, we argue that the presence of a non-zero Edwards-Anderson order parameter is an artifact of the long autocorrelations times experienced with quantum Monte Carlo simulations. This result emphasizes the utility of autoregressive models, such as RNNs, to explore Rydberg atom array physics on frustrated lattices and beyond.

[LG-165] Personalized Adapter for Large Meteorology Model on Devices: Towards Weather Foundation Models

链接: https://arxiv.org/abs/2405.20348
作者: Shengchao Chen,Guodong Long,Jing Jiang,Chengqi Zhang
关键词: meteorological variables modeling, strong foundation models, on-device meteorological variables, pre-trained language models, variables modeling
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 42 pages, under review

点击查看摘要

Abstract:This paper demonstrates that pre-trained language models (PLMs) are strong foundation models for on-device meteorological variables modeling. We present LM-Weather, a generic approach to taming PLMs, which have learned massive sequential knowledge from the universe of natural language databases, to quickly obtain highly customized models for heterogeneous meteorological data on devices while keeping high efficiency. Concretely, we introduce a lightweight personalized adapter into PLMs and endow it with weather pattern awareness. During communication between clients and the server, low-rank-based transmission is performed to effectively fuse the global knowledge among devices while maintaining high communication efficiency and ensuring privacy. Experiments on real-world datasets show that LM-Weather outperforms the state-of-the-art results by a large margin across various tasks (e.g., forecasting and imputation at different scales). We provide extensive and in-depth analyses, which verify that LM-Weather can (1) indeed leverage sequential knowledge from natural language to accurately handle meteorological sequences, (2) allow each device to obtain highly customized models under significant heterogeneity, and (3) generalize under data-limited and out-of-distribution (OOD) scenarios.
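编者按:摘要中的"低秩传输"思路与 LoRA 式适配器一致:冻结预训练权重 W,只训练并传输低秩因子 A、B。下面是假设维度下的极简示意(仅演示参数量与前向计算,非 LM-Weather 实现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 32, 4                 # hypothetical sizes; r is the adapter rank

W = rng.normal(size=(d_out, d_in))         # frozen pre-trained weight
A = rng.normal(size=(d_out, r)) * 0.01     # trainable low-rank factor
B = np.zeros((r, d_in))                    # zero-init so the adapter starts as a no-op

def adapted(x):
    """Forward pass with the low-rank update W + A @ B; only A and B
    (r*(d_in+d_out) numbers instead of d_in*d_out) would be trained and
    transmitted between client and server."""
    return (W + A @ B) @ x
```

此例中客户端与服务器间只需传输 r·(d_in+d_out)=256 个参数,而完整权重为 1024 个,通信量约为四分之一。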

[LG-166] SamBaTen: Sampling-based Batch Incremental Tensor Decomposition

链接: https://arxiv.org/abs/1709.00668
作者: Ekta Gujral,Ravdeep Pasricha,Evangelos E. Papalexakis
关键词: Incremental Tensor Decomposition, tensor decomposition, invaluable tools, tools in analyzing, analyzing multimodal datasets
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensor decompositions are invaluable tools in analyzing multimodal datasets. In many real-world scenarios, such datasets are far from being static; on the contrary, they tend to grow over time. For instance, in an online social network setting, as we observe new interactions over time, our dataset gets updated in its “time” mode. How can we maintain a valid and accurate tensor decomposition of such a dynamically evolving multimodal dataset, without having to re-compute the entire decomposition after every single update? In this paper we introduce SaMbaTen, a Sampling-based Batch Incremental Tensor Decomposition algorithm, which incrementally maintains the decomposition given new updates to the tensor dataset. SaMbaTen is able to scale to datasets that the state-of-the-art in incremental tensor decomposition is unable to operate on, due to its ability to effectively summarize the existing tensor and the incoming updates, and perform all computations in the reduced summary space. We extensively evaluate SaMbaTen using synthetic and real datasets. Indicatively, SaMbaTen achieves comparable accuracy to state-of-the-art incremental and non-incremental techniques, while being 25-30 times faster. Furthermore, SaMbaTen scales to very large sparse and dense dynamically evolving tensors of dimensions up to 100K x 100K x 100K where state-of-the-art incremental approaches were not able to operate.

信息检索

[IR-0] CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking

链接: https://arxiv.org/abs/2405.20994
作者: Josef Vonášek,Milan Straka,Rostislav Krč,Lenka Lasoňová,Ekaterina Egorova,Jana Straková,Jakub Náplava
关键词: Click Web Ranking, Web Ranking dataset, Czech click dataset, Click Web, search engine logs
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注: Accepted to SIGIR 2024

点击查看摘要

Abstract:We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from search engine logs of this http URL. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated Czech test set for the relevance task, containing nearly 50k query-document pairs, each annotated by at least 2 annotators. Finally, we analyze how the user behavior data improve relevance ranking and show that models trained on data automatically harnessed at sufficient scale can surpass the performance of models trained on human annotated data. CWRCzech is published under an academic non-commercial license and is available to the research community at this https URL.

[IR-1] SelfGNN: Self-Supervised Graph Neural Networks for Sequential Recommendation

链接: https://arxiv.org/abs/2405.20878
作者: Yuxi Liu,Lianghao Xia,Chao Huang
关键词: effectively addresses information, addresses information overload, recommendation effectively addresses, effectively addresses, addresses information
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by SIGIR’24

点击查看摘要

Abstract:Sequential recommendation effectively addresses information overload by modeling users’ temporal and sequential interaction patterns. To overcome the limitations of supervision signals, recent approaches have adopted self-supervised learning techniques in recommender systems. However, there are still two critical challenges that remain unsolved. Firstly, existing sequential models primarily focus on long-term modeling of individual interaction sequences, overlooking the valuable short-term collaborative relationships among the behaviors of different users. Secondly, real-world data often contain noise, particularly in users’ short-term behaviors, which can arise from temporary intents or misclicks. Such noise negatively impacts the accuracy of both graph and sequence models, further complicating the modeling process. To address these challenges, we propose a novel framework called Self-Supervised Graph Neural Network (SelfGNN) for sequential recommendation. The SelfGNN framework encodes short-term graphs based on time intervals and utilizes Graph Neural Networks (GNNs) to learn short-term collaborative relationships. It captures long-term user and item representations at multiple granularity levels through interval fusion and dynamic behavior modeling. Importantly, our personalized self-augmented learning structure enhances model robustness by mitigating noise in short-term graphs based on long-term user interests and personal stability. Extensive experiments conducted on four real-world datasets demonstrate that SelfGNN outperforms various state-of-the-art baselines. Our model implementation codes are available at this https URL.

[IR-2] Popularity-Aware Alignment and Contrast for Mitigating Popularity Bias

链接: https://arxiv.org/abs/2405.20718
作者: Miaomiao Cai,Lei Chen,Yifan Wang,Haoyue Bai,Peijie Sun,Le Wu,Min Zhang,Meng Wang
关键词: Collaborative Filtering, popularity bias, item representations, unpopular item representations, typically suffers
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by KDD 2024

点击查看摘要

Abstract:Collaborative Filtering (CF) typically suffers from the significant challenge of popularity bias due to the uneven distribution of items in real-world datasets. This bias leads to a significant accuracy gap between popular and unpopular items. It not only hinders accurate user preference understanding but also exacerbates the Matthew effect in recommendation systems. To alleviate popularity bias, existing efforts focus on emphasizing unpopular items or separating the correlation between item representations and their popularity. Despite the effectiveness, existing works still face two persistent challenges: (1) how to extract common supervision signals from popular items to improve the unpopular item representations, and (2) how to alleviate the representation separation caused by popularity bias. In this work, we conduct an empirical analysis of popularity bias and propose Popularity-Aware Alignment and Contrast (PAAC) to address two challenges. Specifically, we use the common supervisory signals modeled in popular item representations and propose a novel popularity-aware supervised alignment module to learn unpopular item representations. Additionally, we suggest re-weighting the contrastive learning loss to mitigate the representation separation from a popularity-centric perspective. Finally, we validate the effectiveness and rationale of PAAC in mitigating popularity bias through extensive experiments on three real-world datasets. Our code is available at this https URL.
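编者按:摘要提出"从以流行度为中心的视角对对比学习损失重新加权"。下面是按流行度加权的 InfoNCE 损失示意(其中 1−pop 的加权方式为假设,论文 PAAC 的具体方案以原文为准):

```python
import numpy as np

def weighted_infonce(anchor, positive, negatives, pop, tau=0.2):
    """InfoNCE contrastive loss re-weighted by item popularity: `pop` in [0,1];
    down-weighting popular items shifts the optimization pressure toward
    unpopular item representations. A sketch of the general idea only."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                          # numerical stability
    loss = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
    return (1.0 - pop) * loss
```

将该加权项求和即可得到整批样本的重加权对比损失;权重越小的(热门)物品对梯度的贡献越小。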

[IR-3] Information Maximization via Variational Autoencoders for Cross-Domain Recommendation

链接: https://arxiv.org/abs/2405.20710
作者: Xuying Ning,Wujiang Xu,Xiaolei Liu,Mingming Ha,Qiongxu Ma,Youru Li,Linxun Chen,Yongfeng Zhang
关键词: Single-Domain Sequential Recommendation, Cross-Domain Sequential Recommendation, Sequential Recommendation, Single-Domain Sequential, Cross-Domain Sequential
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Cross-Domain Sequential Recommendation (CDSR) methods aim to address the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR methods typically rely on overlapping users, designing complex cross-domain modules to capture users’ latent interests that can propagate across different domains. However, their propagated informative information is limited to the overlapping users and the users who have rich historical behavior records. As a result, these methods often underperform in real-world scenarios, where most users are non-overlapping (cold-start) and long-tailed. In this research, we introduce a new CDSR framework named Information Maximization Variational Autoencoder (IM-VAE). Here, we suggest using a Pseudo-Sequence Generator to enhance the user’s interaction history input for downstream fine-grained CDSR models to alleviate the cold-start issues. We also propose a Generative Recommendation Framework combined with three regularizers inspired by the mutual information maximization (MIM) theory (McGill, 1954) to capture the semantic differences between a user’s interests shared across domains and those specific to certain domains, as well as address the informational gap between a user’s actual interaction sequences and the pseudo-sequences generated. To the best of our knowledge, this paper is the first CDSR work that considers the information disentanglement and denoising of pseudo-sequences in the open-world recommendation scenario. Empirical experiments illustrate that IM-VAE outperforms the state-of-the-art approaches on two real-world cross-domain datasets on all sorts of users, including cold-start and tailed users, demonstrating the effectiveness of IM-VAE in open-world recommendation.

[IR-4] Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models

链接: https://arxiv.org/abs/2405.20654
作者: Xuyang Wu,Zhiyuan Peng,Sravanthi Rajanala,Hsin-Tai Wu,Yi Fang
关键词: Effective passage retrieval, identify suitable candidates, open-domain question answering, question answering tasks, Effective passage
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted at Gen-IR@SIGIR24

点击查看摘要

Abstract:Effective passage retrieval and reranking methods have been widely utilized to identify suitable candidates in open-domain question answering tasks; recent studies have resorted to LLMs for reranking the retrieved passages by the log-likelihood of the question conditioned on each passage. Although these methods have demonstrated promising results, the performance is notably sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs can be computationally intensive and time-consuming. Furthermore, this approach limits the leverage of question-passage relevance pairs and passage-specific knowledge to enhance the ranking capabilities of LLMs. In this paper, we propose passage-specific prompt tuning for reranking in open-domain question answering (PSPT): a parameter-efficient method that fine-tunes learnable passage-specific soft prompts, incorporating passage-specific knowledge from a limited set of question-passage relevance pairs. The method involves ranking retrieved passages based on the log-likelihood of the model generating the question conditioned on each passage and the learned soft prompt. We conducted extensive experiments utilizing the Llama-2-chat-7B model across three publicly available open-domain question answering datasets and the results demonstrate the effectiveness of the proposed approach.
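编者按:该方法按 log p(question | passage, soft prompt) 对检索到的段落重排。下面用平滑一元语言模型的对数似然作为玩具替身,演示这种"按问题似然打分再重排"的思路(并非 LLM 实现,词表大小与平滑参数均为假设):

```python
import math
from collections import Counter

def loglik_score(question_tokens, passage_tokens, vocab_size=50_000, alpha=0.5):
    """Smoothed unigram log-likelihood of the question given a passage --
    a toy stand-in for the LLM scoring used by the paper. Higher is better;
    sort passages by this score (descending) to rerank them."""
    counts = Counter(passage_tokens)
    n = len(passage_tokens)
    return sum(math.log((counts[t] + alpha) / (n + alpha * vocab_size))
               for t in question_tokens)
```

实际方法中,此处的一元模型被替换为带可学习段落软提示(soft prompt)的 LLM,排序逻辑不变。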

[IR-5] Large Language Models Enhanced Sequential Recommendation for Long-tail User and Item

链接: https://arxiv.org/abs/2405.20646
作者: Qidong Liu,Xian Wu,Xiangyu Zhao,Yejing Wang,Zijian Zhang,Feng Tian,Yefeng Zheng
关键词: social networking platforms, predicting users’ subsequent, Sequential recommendation systems, users’ subsequent preferences, subsequent preferences based
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Sequential recommendation systems (SRS) serve the purpose of predicting users’ subsequent preferences based on their past interactions and have been applied across various domains such as e-commerce and social networking platforms. However, practical SRS encounters challenges due to the fact that most users engage with only a limited number of items, while the majority of items are seldom consumed. These challenges, termed the long-tail user and long-tail item dilemmas, often create obstacles for traditional SRS methods. Mitigating these challenges is crucial as they can significantly impact user satisfaction and business profitability. While some research endeavors have alleviated these issues, they still grapple with issues such as the seesaw effect or noise stemming from the scarcity of interactions. The emergence of large language models (LLMs) presents a promising avenue to address these challenges from a semantic standpoint. In this study, we introduce the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR), which leverages semantic embeddings from LLMs to enhance SRS performance without increasing computational overhead. To combat the long-tail item challenge, we propose a dual-view modeling approach that fuses semantic information from LLMs with collaborative signals from traditional SRS. To address the long-tail user challenge, we introduce a retrieval augmented self-distillation technique to refine user preference representations by incorporating richer interaction data from similar users. Through comprehensive experiments conducted on three authentic datasets using three widely used SRS models, our proposed enhancement framework demonstrates superior performance compared to existing methodologies.
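
The dual-view idea in this abstract can be sketched as concatenating a frozen LLM semantic embedding with a learned collaborative embedding, so that long-tail items with almost no interaction signal still carry usable semantic information. All shapes and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, d_sem, d_cf = 5, 8, 4
# Frozen semantic embeddings from an LLM (precomputed once; no extra
# inference-time overhead, matching the abstract's claim).
sem = rng.normal(size=(n_items, d_sem))
# Collaborative embeddings learned from interactions; long-tail items
# receive almost no gradient signal, modeled here as zeros.
cf = rng.normal(size=(n_items, d_cf))
cf[3:] = 0.0  # items 3 and 4 are long-tail

def dual_view(item):
    """Fuse the semantic and collaborative views into one representation."""
    return np.concatenate([sem[item], cf[item]])

v = dual_view(4)
# Even with a zero collaborative half, the semantic half is informative.
print(v.shape, float(np.linalg.norm(v[:d_sem])) > 0)
```

The actual LLM-ESR fusion is learned jointly with the SRS backbone; simple concatenation is only the smallest illustration of the two-view structure.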

[IR-6] Causal Distillation for Alleviating Performance Heterogeneity in Recommender Systems

链接: https://arxiv.org/abs/2405.20626
作者: Shengyu Zhang,Ziqi Jiang,Jiangchao Yao,Fuli Feng,Kun Kuang,Zhou Zhao,Shuo Li,Hongxia Yang,Tat-Seng Chua,Fei Wu
关键词: accurate recommendation services, head users enjoy, exhibits a long-tail, small portion, portion of head
类目: Information Retrieval (cs.IR); Information Theory (cs.IT)
*备注: TKDE 2023

点击查看摘要

Abstract:Recommendation performance usually exhibits a long-tail distribution over users – a small portion of head users enjoy much more accurate recommendation services than the others. We reveal two sources of this performance heterogeneity problem: the uneven distribution of historical interactions (a natural source); and the biased training of recommender models (a model source). As addressing this problem cannot sacrifice the overall performance, a wise choice is to eliminate the model bias while maintaining the natural heterogeneity. The key to debiased training lies in eliminating the effect of confounders that influence both the user’s historical behaviors and the next behavior. The emerging causal recommendation methods achieve this by modeling the causal effect between user behaviors; however, they potentially neglect unobserved confounders (e.g., friend suggestions) that are hard to measure in practice. To address unobserved confounders, we resort to the front-door adjustment (FDA) in causal theory and propose a causal multi-teacher distillation framework (CausalD). FDA requires proper mediators in order to estimate the causal effects of historical behaviors on the next behavior. To achieve this, we equip CausalD with multiple heterogeneous recommendation models to model the mediator distribution. Then, the causal effect estimated by FDA is the expectation of recommendation prediction over the mediator distribution and the prior distribution of historical behaviors, which is technically achieved by multi-teacher ensemble. To pursue efficient inference, CausalD further distills multiple teachers into one student model to directly infer the causal effect for making recommendations.
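
The front-door estimate described above is an expectation of predictions over a mediator distribution, realized as a multi-teacher ensemble and then distilled into a single student for efficient inference. A toy numeric sketch, with hypothetical linear "teachers" standing in for real recommender models:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))  # toy user/context features

# Heterogeneous "teachers" modeling the mediator distribution
# (assumed linear scorers purely for illustration).
teachers = [rng.normal(size=3) for _ in range(4)]
teacher_preds = np.stack([x @ w for w in teachers])

# Front-door estimate: expectation over mediators, i.e. the ensemble average.
ensemble = teacher_preds.mean(axis=0)

# Distill the ensemble into one student via least squares so that a single
# forward pass approximates the causal-effect estimate at inference time.
student_w, *_ = np.linalg.lstsq(x, ensemble, rcond=None)
student_preds = x @ student_w

print(float(np.abs(student_preds - ensemble).max()))
```

Because the ensemble of linear teachers is itself linear in `x`, the student recovers it exactly here; with real neural teachers the distillation is approximate.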

[IR-7] Knowledge Enhanced Multi-intent Transformer Network for Recommendation

链接: https://arxiv.org/abs/2405.20565
作者: Ding Zou,Wei Wei,Feida Zhu,Chuanyu Xu,Tao Zhang,Chengfu Huo
关键词: attracted growing attention, providing abundant supplementary, abundant supplementary information, Knowledge Contrastive Denoising, Graph Transformer
类目: Information Retrieval (cs.IR)
*备注: Accept By The Web Conf 2024 (WWW 2024) Industry Track. arXiv admin note: text overlap with arXiv:2204.08807

点击查看摘要

Abstract:Incorporating Knowledge Graphs into Recommendation has attracted growing attention in industry, due to the great potential of KG in providing abundant supplementary information and interpretability for the underlying models. However, simply integrating KG into recommendation usually brings in negative feedback in industry, due to the ignorance of the following two factors: i) users’ multiple intents, which involve diverse nodes in KG. For example, in e-commerce scenarios, users may exhibit preferences for specific styles, brands, or colors. ii) knowledge noise, which is a prevalent issue in Knowledge Enhanced Recommendation (KGR) and even more severe in industry scenarios. The irrelevant knowledge properties of items may result in inferior model performance compared to approaches that do not incorporate knowledge. To tackle these challenges, we propose a novel approach named Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), comprising two primary modules: Global Intents Modeling with Graph Transformer, and Knowledge Contrastive Denoising under Intents. Specifically, Global Intents Modeling with Graph Transformer focuses on capturing learnable user intents, by incorporating global signals from user-item-relation-entity interactions with a graph transformer, meanwhile learning intent-aware user/item representations. Knowledge Contrastive Denoising under Intents is dedicated to learning precise and robust representations. It leverages intent-aware representations to sample relevant knowledge, and proposes a local-global contrastive mechanism to enhance noise-irrelevant representation learning. Extensive experiments conducted on benchmark datasets show the superior performance of our proposed method over state-of-the-art methods. Online A/B testing results on Alibaba’s large-scale industrial recommendation platform also indicate the real-scenario effectiveness of KGTN.

[IR-8] Extending the Massive Text Embedding Benchmark to French

链接: https://arxiv.org/abs/2405.20468
作者: Mathieu Ciancone,Imene Kerboua,Marion Schaeffer,Wissam Siblini
关键词: Massive Text Embedding, Text Embedding Benchmark, recent years, NLP tasks, numerous embedding models
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, numerous embedding models have been made available and widely used for various NLP tasks. Choosing a model that performs well for several tasks in English has been largely simplified by the Massive Text Embedding Benchmark (MTEB), but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. Not only do we gather 22 existing datasets in an easy-to-use interface, but we also create three new French datasets for a global evaluation over 8 different tasks. We perform a large scale comparison with 46 carefully selected embedding models, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find that, although no model is best on all tasks, large multilingual models pre-trained on sentence similarity perform particularly well. Our work comes with open-source code, new datasets and a public leaderboard.

[IR-9] Designing an Evaluation Framework for Large Language Models in Astronomy Research

链接: https://arxiv.org/abs/2405.20389
作者: John F. Wu,Alina Hyk,Kiera McCormick,Christine Ye,Simone Astarita,Elina Baral,Jo Ciuca,Jesse Cranney,Anjalie Field,Kartheik Iyer,Philipp Koehn,Jenn Kotler,Sandor Kruk,Michelle Ntampaka,Charles O’Neill,Joshua E.G. Peek,Sanjib Sharma,Mikaeel Yunus
关键词: Large Language Models, Large Language, Language Models, Large, shifting how scientific
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: 7 pages, 3 figures. Code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy researchers interact with LLMs. We deploy a Slack chatbot that can answer queries from users via Retrieval-Augmented Generation (RAG); these responses are grounded in astronomy papers from arXiv. We record and anonymize user questions and chatbot answers, user upvotes and downvotes to LLM responses, user feedback to the LLM, and retrieved documents and similarity scores with the query. Our data collection method will enable future dynamic evaluations of LLM tools for astronomy.

[IR-10] Analysis of Hopfield Model as Associative Memory

链接: https://arxiv.org/abs/2402.04264
作者: Matteo Silvestri
关键词: biological neural systems, Hopfield neural network, neural systems, drawing inspiration, neural network model
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Retrieval (cs.IR)
*备注: 35 pages, 23 figures, 3 codes

点击查看摘要

Abstract:This article delves into the Hopfield neural network model, drawing inspiration from biological neural systems. The exploration begins with an overview of the model’s foundations, incorporating insights from mechanical statistics to deepen our understanding. Focusing on audio retrieval, the study demonstrates the Hopfield model’s associative memory capabilities. Through practical implementation, the network is trained to retrieve different patterns.
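
The associative-memory mechanism the article studies can be reproduced in a few lines: store a pattern with a Hebbian weight matrix, then recover it from a corrupted cue by iterated sign updates:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: sum of outer products, zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def retrieve(W, state, steps=10):
    """Synchronous sign updates; converges to a stored pattern for small loads."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1  # break ties deterministically
    return state

rng = np.random.default_rng(0)
p = np.where(rng.random(50) < 0.5, -1.0, 1.0)  # one random ±1 pattern
W = train_hopfield(p[None, :])

noisy = p.copy()
noisy[:5] *= -1  # corrupt 5 of 50 bits
print(np.array_equal(retrieve(W, noisy), p))  # → True
```

The article applies the same dynamics to audio patterns; this sketch uses a random binary pattern only to keep the demonstration self-contained.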

人工智能

[AI-0] Code Pretraining Improves Entity Tracking Abilities of Language Models

链接: https://arxiv.org/abs/2405.21068
作者: Najoung Kim,Sebastian Schuster,Shubham Toshniwal
关键词: discourse entities expressed, Recent work, provided indirect evidence, pretraining language models, provided indirect
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

[AI-1] Recurrent neural networks: vanishing and exploding gradients are not the end of the story

链接: https://arxiv.org/abs/2405.21064
作者: Nicolas Zucchet,Antonio Orvieto
关键词: Recurrent neural networks, learn long-term memories, Recurrent neural, notoriously struggle, long-term memories
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) notoriously struggle to learn long-term memories, primarily due to vanishing and exploding gradients. The recent success of state-space models (SSMs), a subclass of RNNs, to overcome such difficulties challenges our theoretical understanding. In this paper, we delve into the optimization challenges of RNNs and discover that, as the memory of a network increases, changes in its parameters result in increasingly large output variations, making gradient-based learning highly sensitive, even without exploding gradients. Our analysis further reveals the importance of the element-wise recurrence design pattern combined with careful parametrizations in mitigating this effect. This feature is present in SSMs, as well as in other architectures, such as LSTMs. Overall, our insights provide a new explanation for some of the difficulties in gradient-based learning of RNNs and why some architectures perform better than others.
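
A quick illustrative check of the central claim, on the simplest possible recurrence h_t = λ·h_{t-1} + x_t: as λ approaches 1 (longer memory), the final state becomes far more sensitive to small parameter changes, even though nothing here explodes. This is an illustration of the phenomenon, not the authors' code:

```python
import numpy as np

def final_state(lam, xs):
    """Run a scalar linear recurrence h_t = lam * h_{t-1} + x_t."""
    h = 0.0
    for x in xs:
        h = lam * h + x
    return h

xs = np.ones(200)

def sensitivity(lam, eps=1e-6):
    """Finite-difference estimate of |d h_T / d lambda|."""
    return abs(final_state(lam + eps, xs) - final_state(lam - eps, xs)) / (2 * eps)

short_memory = sensitivity(0.5)    # forgets quickly
long_memory = sensitivity(0.999)   # remembers almost everything
print(long_memory > 100 * short_memory)  # → True
```

Gradient-based learning must take steps through exactly this kind of sharp parameter landscape, which is the sensitivity effect the paper analyzes; element-wise recurrences with careful parametrization (as in SSMs and LSTMs) are argued to mitigate it.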

[AI-2] Neural Network Verification with Branch-and-Bound for General Nonlinearities

链接: https://arxiv.org/abs/2405.21063
作者: Zhouxing Shi,Qirui Jin,Zico Kolter,Suman Jana,Cho-Jui Hsieh,Huan Zhang
关键词: effective methods, Optimal Power Flow, verification, networks, general
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Branch-and-bound (BaB) is among the most effective methods for neural network (NN) verification. However, existing works on BaB have mostly focused on NNs with piecewise linear activations, especially ReLU networks. In this paper, we develop a general framework, named GenBaB, to conduct BaB for general nonlinearities in general computational graphs based on linear bound propagation. To decide which neuron to branch, we design a new branching heuristic which leverages linear bounds as shortcuts to efficiently estimate the potential improvement after branching. To decide nontrivial branching points for general nonlinear functions, we propose to optimize branching points offline, which can be efficiently leveraged during verification with a lookup table. We demonstrate the effectiveness of our GenBaB on verifying a wide range of NNs, including networks with activation functions such as Sigmoid, Tanh, Sine and GeLU, as well as networks involving multi-dimensional nonlinear operations such as multiplications in LSTMs and Vision Transformers. Our framework also allows the verification of general nonlinear computation graphs and enables verification applications beyond simple neural networks, particularly for AC Optimal Power Flow (ACOPF). GenBaB is part of the latest α,β-CROWN, the winner of the 4th International Verification of Neural Networks Competition (VNN-COMP 2023).

[AI-3] An Organic Weed Control Prototype using Directed Energy and Deep Learning

链接: https://arxiv.org/abs/2405.21056
作者: Deng Cao,Hongbo Zhang,Rajveer Dhillon
关键词: improve crop yield, sustainable approach, vital to improve, Organic weed control, weed control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Organic weed control is vital to improving crop yield with a sustainable approach. In this work, a directed energy weed control robot prototype specifically designed for organic farms is proposed. The robot uses a novel distributed array robot (DAR) unit for weed treatment. Soybean and corn databases are built to train deep learning neural nets to perform weed recognition. The initial deep learning neural nets show high performance in classifying crops. The robot uses a patented directed energy plant eradication recipe that is completely organic and UV-C free, with no chemical damage or physical disturbance to the soil. The deep learning model can classify 8 common weed species in a soybean field in a natural environment with up to 98% accuracy.

[AI-4] Grammar-Aligned Decoding

链接: https://arxiv.org/abs/2405.21047
作者: Kanghee Park,Jiayu Wang,Taylor Berg-Kirkpatrick,Nadia Polikarpova,Loris D’Antoni
关键词: Large Language Models, Large Language, Language Models, reliably generating highly, LLM distribution
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM’s output must follow a given grammar. In this paper we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM’s distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM’s distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM’s distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
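
The distortion that GCD introduces can be seen on a two-token toy model: masking disallowed tokens and renormalizing at each step yields sequence probabilities that are not proportional to the LLM's distribution conditioned on grammaticality. All probabilities below are made up for illustration:

```python
from itertools import product

# Toy "LLM": per-step next-token probabilities (hypothetical numbers).
def next_probs(prefix):
    if prefix == "":
        return {"a": 0.9, "b": 0.1}
    if prefix == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.9, "b": 0.1}  # prefix == "b"

grammatical = {"aa", "ba"}  # grammar: the second token must be "a"

# The LLM's true distribution conditioned on grammaticality (the GAD target).
joint = {s: next_probs("")[s[0]] * next_probs(s[0])[s[1]]
         for s in ("".join(p) for p in product("ab", repeat=2))}
z = sum(p for s, p in joint.items() if s in grammatical)
conditional = {s: joint[s] / z for s in grammatical}

# GCD: mask tokens that cannot extend to a grammatical string, renormalize.
def gcd_prob(s):
    p, prefix = 1.0, ""
    for tok in s:
        probs = next_probs(prefix)
        allowed = {t: q for t, q in probs.items()
                   if any(g.startswith(prefix + t) for g in grammatical)}
        p *= allowed[tok] / sum(allowed.values())
        prefix += tok
    return p

print(conditional["aa"], gcd_prob("aa"))  # 0.5 under the LLM vs 0.9 under GCD
```

Both "aa" and "ba" are equally likely under the LLM given grammaticality, yet greedy per-step masking assigns "aa" probability 0.9; ASAp is designed to correct exactly this mismatch.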

[AI-5] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

链接: https://arxiv.org/abs/2405.21046
作者: Tengyang Xie,Dylan J. Foster,Akshay Krishnamurthy,Corby Rosset,Ahmed Awadallah,Alexander Rakhlin
关键词: Exploratory Preference Optimization, Direct Preference Optimization, language model alignment, Preference Optimization, Reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical – a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) – yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of Q*-approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.
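
The abstract describes XPO as a one-line change to DPO: the DPO objective plus an exploration bonus. The sketch below shows what such a loss could look like; the exact form and sign of the bonus, and the coefficient `alpha`, are assumptions for illustration, not the paper's definition:

```python
import math

def logsigmoid(x):
    return -math.log(1.0 + math.exp(-x))

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO: prefer the chosen response's policy-vs-reference
    log-ratio over the rejected response's."""
    return -logsigmoid(beta * ((lp_w - ref_w) - (lp_l - ref_l)))

def xpo_loss(lp_w, lp_l, ref_w, ref_l, lp_explore, beta=0.1, alpha=0.01):
    """Hypothetical XPO-style objective: DPO plus an optimism/exploration
    term on a sampled response's log-probability (form assumed, not exact)."""
    return dpo_loss(lp_w, lp_l, ref_w, ref_l, beta) - alpha * lp_explore

# Hypothetical log-probs for chosen/rejected under policy and reference.
base = dpo_loss(-2.0, -3.0, -2.5, -2.5)
bonus = xpo_loss(-2.0, -3.0, -2.5, -2.5, lp_explore=-4.0)
print(bonus > base)
```

The point of the sketch is structural: the only change relative to DPO is the extra bonus term, which is what lets the policy move outside the support of the initial model.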

[AI-6] Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

链接: https://arxiv.org/abs/2405.21043
作者: Fengdi Che,Chenjun Xiao,Jincheng Mei,Bo Dai,Ramki Gummadi,Oscar A Ramirez,Christopher K Harris,A. Rupam Mahmood,Dale Schuurmans
关键词: linear function approximation, function approximation establishes, over-parameterized linear function, off-policy data, linear function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird’s counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
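
The mechanism under study, bootstrapped value estimation against a periodically synced target network, can be demonstrated on a two-state Markov reward process. This is a toy illustration of the setup, not the paper's construction or proof:

```python
import numpy as np

# Two-state MRP: each state deterministically transitions to the other;
# reward 1 is received in state 0.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
r = np.array([1.0, 0.0])
gamma = 0.9
true_v = np.linalg.solve(np.eye(2) - gamma * P, r)  # exact values

v = np.zeros(2)
target = v.copy()
rng = np.random.default_rng(0)
s = 0
for step in range(20000):
    s_next = int(rng.choice(2, p=P[s]))
    # Bootstrap against the frozen target values, not the online ones.
    td_error = r[s] + gamma * target[s_next] - v[s]
    v[s] += 0.1 * td_error
    if step % 100 == 0:
        target = v.copy()  # periodic target-network sync
    s = s_next

print(np.max(np.abs(v - true_v)) < 0.05)  # → True
```

Each sync effectively applies one Bellman backup, so the estimates contract toward the true values; the paper's contribution is characterizing when this remains stable with off-policy data and over-parameterized function approximation.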

[AI-7] Direct Alignment of Language Models via Quality-Aware Self-Refinement

链接: https://arxiv.org/abs/2405.21040
作者: Runsheng Yu,Yong Wang,Xiaoqi Jiao,Youzhi Zhang,James T. Kwok
关键词: Large Language Models, Reinforcement Learning, Large Language, Human Feedback, behaviors of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Policy Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function to estimate the quality of both the positive and negative responses. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. The refinement function is integrated into DPO and its variant Identity Policy Optimization (IPO). Experiments across various evaluators indicate that they can improve the performance of the fine-tuned models over DPO and IPO.

[AI-8] Standards for Belief Representations in LLMs

链接: https://arxiv.org/abs/2405.21030
作者: Daniel A. Herrmann,Benjamin A. Levinstein
关键词: large language models, demonstrate remarkable abilities, LLMs internally represent, language models, continue to demonstrate
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) continue to demonstrate remarkable abilities across various domains, computer scientists are developing methods to understand their cognitive processes, particularly concerning how (and if) LLMs internally represent their beliefs about the world. However, this field currently lacks a unified theoretical foundation to underpin the study of belief in LLMs. This article begins filling this gap by proposing adequacy conditions for a representation in an LLM to count as belief-like. We argue that, while the project of belief measurement in LLMs shares striking features with belief measurement as carried out in decision theory and formal epistemology, it also differs in ways that should change how we measure belief. Thus, drawing from insights in philosophy and contemporary practices of machine learning, we establish four criteria that balance theoretical considerations with practical constraints. Our proposed criteria include accuracy, coherence, uniformity, and use, which together help lay the groundwork for a comprehensive understanding of belief representation in LLMs. We draw on empirical work showing the limitations of using various criteria in isolation to identify belief representations.

[AI-9] LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

链接: https://arxiv.org/abs/2405.21028
作者: Elias Stengel-Eskin,Peter Hase,Mohit Bansal
关键词: explicit confidence markers, answering questions, LACIE, confidence markers, confidence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 17 pages. Code: this https URL

点击查看摘要

Abstract:When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise; however, most current models tend towards overconfidence. To calibrate both implicit and explicit confidence markers, we introduce a pragmatic, listener-aware finetuning method (LACIE) that models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener. We cast calibration as preference optimization, creating data via a two-agent game, where a speaker model’s outputs are judged by a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B, Llama3-70B) with LACIE, and show that the resulting models are better calibrated w.r.t. a simulated listener. Crucially, these trends transfer to human listeners, helping them correctly predict model correctness: we conduct a human evaluation where annotators accept or reject an LLM’s answers, finding that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers. Furthermore, LACIE generalizes to another dataset, resulting in a large increase in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis indicates that LACIE leads to a better confidence separation between correct and incorrect examples. Qualitatively, we find that a LACIE-trained model hedges more and implicitly signals certainty when it is correct by using an authoritative tone or including details. Finally, LACIE finetuning leads to an emergent increase in model abstention (e.g. saying “I don’t know”) for answers that are likely wrong.

[AI-10] Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles

链接: https://arxiv.org/abs/2405.21027
作者: Jiesong Lian,Yucong Huang,Mingzhi Wang,Chengdong Ma,Yixue Hao,Ying Wen,Yaodong Yang
关键词: Nash Equilibrium, Space Response Oracle, Policy Space Response, solving zero-sum games, policies
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate the Best Response (BR) to opponents’ mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to the BR. By selecting high-quality base policies from the meta-NE, policy fusion merges the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liar’s Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
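
The policy-fusion initialization can be sketched as a weighted average of the parameter vectors of high-quality base policies selected from the meta-Nash distribution. Population size, parameter shapes, and the top-k selection rule below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "policies": parameter vectors of identically-shaped networks.
population = [rng.normal(size=6) for _ in range(5)]
# Meta-Nash distribution over the population (assumed given by PSRO's meta-solver).
meta_nash = np.array([0.4, 0.3, 0.2, 0.1, 0.0])

def fuse(population, meta_nash, top_k=3):
    """Initialize a new BR policy by averaging the top-k meta-Nash policies."""
    idx = np.argsort(meta_nash)[::-1][:top_k]
    weights = meta_nash[idx] / meta_nash[idx].sum()
    return sum(w * population[i] for w, i in zip(weights, idx))

init = fuse(population, meta_nash)
print(init.shape)  # fused parameters seed the next best-response training run
```

Because only the initialization changes, the fused policy is then trained exactly as in any PSRO variant, which is why the method composes with them at no extra training cost.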

[AI-11] Explaining Predictions by Characteristic Rules

链接: https://arxiv.org/abs/2405.21003
作者: Amr Alkhatib,Henrik Boström,Michalis Vazirgiannis
关键词: Characteristic Explanatory General, CEGA, rules, Explanatory General Association, Anchors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022

点击查看摘要

Abstract:Characteristic rules have been advocated for their ability to improve interpretability over discriminative rules within the area of rule learning. However, the former type of rule has not yet been used by techniques for explaining predictions. A novel explanation technique, called CEGA (Characteristic Explanatory General Association rules), is proposed, which employs association rule mining to aggregate multiple explanations generated by any standard local explanation technique into a set of characteristic rules. An empirical investigation is presented, in which CEGA is compared to two state-of-the-art methods, Anchors and GLocalX, for producing local and aggregated explanations in the form of discriminative rules. The results suggest that the proposed approach provides a better trade-off between fidelity and complexity compared to the two state-of-the-art approaches; CEGA and Anchors significantly outperform GLocalX with respect to fidelity, while CEGA and GLocalX significantly outperform Anchors with respect to the number of generated rules. The effect of changing the format of the explanations of CEGA to discriminative rules and using LIME and SHAP as local explanation techniques instead of Anchors are also investigated. The results show that the characteristic explanatory rules still compete favorably with rules in the standard discriminative format. The results also indicate that using CEGA in combination with either SHAP or Anchors consistently leads to a higher fidelity compared to using LIME as the local explanation technique.
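
The aggregation step of CEGA, mining characteristic rules of the form class -> features from many local explanations, can be approximated with plain support counting over feature itemsets. This is a simplified stand-in for full association rule mining, with toy data; real inputs would come from Anchors, LIME, or SHAP:

```python
from collections import Counter
from itertools import combinations

# Local explanations: (predicted class, attributed feature set) per instance.
explanations = [
    ("spam", {"has_link", "all_caps"}),
    ("spam", {"has_link", "many_exclaims"}),
    ("spam", {"has_link", "all_caps"}),
    ("ham", {"known_sender"}),
]

def characteristic_rules(explanations, cls, min_support=0.5):
    """Mine rules class -> itemset (the characteristic direction: the class
    implies the features, rather than features implying the class)."""
    rows = [feats for c, feats in explanations if c == cls]
    counts = Counter()
    for feats in rows:
        for k in range(1, len(feats) + 1):
            for itemset in combinations(sorted(feats), k):
                counts[itemset] += 1
    n = len(rows)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

rules = characteristic_rules(explanations, "spam")
print(rules)  # ('has_link',) has support 1.0 among spam predictions
```

Keeping only high-support itemsets is what compresses many instance-level explanations into a small, class-level characteristic rule set.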

[AI-12] Locking Machine Learning Models into Hardware

链接: https://arxiv.org/abs/2405.20990
作者: Eleanor Clifford,Adhithya Saravanan,Harry Langford,Cheng Zhang,Yiren Zhao,Robert Mullins,Ilia Shumailov,Jamie Hayes
关键词: Modern Machine Learning, Modern Machine, Machine Learning models, business competitiveness, competitiveness often depends
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 2 figures of main text; 14 pages, 16 figures of appendices

点击查看摘要

Abstract:Modern Machine Learning models are expensive IP and business competitiveness often depends on keeping this IP confidential. This in turn restricts how these models are deployed – for example it is unclear how to deploy a model on-device without inevitably leaking the underlying model. At the same time, confidential computing technologies such as Multi-Party Computation or Homomorphic encryption remain impractical for wide adoption. In this paper we take a different approach and investigate the feasibility of ML-specific mechanisms that deter unauthorized model use by restricting the model to only be usable on specific hardware, making adoption on unauthorized hardware inconvenient. That way, even if IP is compromised, it cannot be trivially used without specialised hardware or major model adjustment. In a sense, we seek to enable cheap locking of machine learning models into specific hardware. We demonstrate that locking mechanisms are feasible by either targeting the efficiency of model representations, such as making models incompatible with quantisation, or tying the model’s operation to specific characteristics of hardware, such as the number of cycles for arithmetic operations. We demonstrate that locking comes with negligible work and latency overheads, while significantly restricting usability of the resultant model on unauthorized hardware.

[AI-13] Generative Adversarial Networks in Ultrasound Imaging: Extending Field of View Beyond Conventional Limits

Link: https://arxiv.org/abs/2405.20981
Authors: Matej Gazda,Samuel Kadoury,Jakub Gazda,Peter Drotar
Keywords: Transthoracic Echocardiography, TTE ultrasound imaging, enabling detailed visualization, Generative Adversarial Networks, cardiovascular medicine
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Transthoracic Echocardiography (TTE) is a fundamental, non-invasive diagnostic tool in cardiovascular medicine, enabling detailed visualization of cardiac structures crucial for diagnosing various heart conditions. Despite its widespread use, TTE ultrasound imaging faces inherent limitations, notably the trade-off between field of view (FoV) and resolution. This paper introduces a novel application of conditional Generative Adversarial Networks (cGANs), specifically designed to extend the FoV in TTE ultrasound imaging while maintaining high resolution. Our proposed cGAN architecture, termed echoGAN, demonstrates the capability to generate realistic anatomical structures through outpainting, effectively broadening the viewable area in medical imaging. This advancement has the potential to enhance both automatic and manual ultrasound navigation, offering a more comprehensive view that could significantly reduce the learning curve associated with ultrasound imaging and aid in more accurate diagnoses. The results confirm that echoGAN reliably reproduces detailed cardiac features, thereby promising a significant step forward in the field of non-invasive cardiac navigation and diagnostics.

[AI-14] Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training

Link: https://arxiv.org/abs/2405.20978
Authors: Feiteng Fang,Yuelin Bai,Shiwen Ni,Min Yang,Xiaojun Chen,Ruifeng Xu
Keywords: Large Language Models, Large Language, untraceable reasoning processes, including hallucination, exhibit substantial capabilities
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges, including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can potentially hinder the LLMs’ capacity to generate comprehensive and high-quality responses. Prior RAG studies on robustness to retrieval noise often confine themselves to a limited set of noise types, deviating from real-world retrieval environments and limiting practical applicability. In this study, we initially investigate retrieval noises and categorize them into three distinct types, reflecting real-world environments. We analyze the impact of these various retrieval noises on the robustness of LLMs. Subsequently, we propose a novel RAG approach known as Retrieval-augmented Adaptive Adversarial Training (RAAT). RAAT leverages adaptive adversarial training to dynamically adjust the model’s training process in response to retrieval noises. Concurrently, it employs multi-task learning to ensure the model’s capacity to internally recognize noisy contexts. Extensive experiments demonstrate that the LLaMA-2 7B model trained using RAAT exhibits significant improvements in F1 and EM scores under diverse noise conditions. For reproducibility, we release our code and data at: this https URL.

[AI-15] ACE: A Model Poisoning Attack on Contribution Evaluation Methods in Federated Learning

Link: https://arxiv.org/abs/2405.20975
Authors: Zhangchen Xu,Fengqing Jiang,Luyao Niu,Jinyuan Jia,Bo Li,Radha Poovendran
Keywords: Federated Learning, local training data, machine learning model, contribution evaluation methods, machine learning
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the 33rd USENIX Security Symposium, 2024

Click to view abstract

Abstract:In Federated Learning (FL), a set of clients collaboratively train a machine learning model (called global model) without sharing their local training data. The local training data of clients is typically non-i.i.d. and heterogeneous, resulting in varying contributions from individual clients to the final performance of the global model. In response, many contribution evaluation methods have been proposed, where the server could evaluate the contribution made by each client and incentivize the high-contributing clients to sustain their long-term participation in FL. Existing studies mainly focus on developing new metrics or algorithms to better measure the contribution of each client. However, the security of contribution evaluation methods of FL operating in adversarial environments is largely unexplored. In this paper, we propose the first model poisoning attack on contribution evaluation methods in FL, termed ACE. Specifically, we show that any malicious client utilizing ACE could manipulate the parameters of its local model such that it is evaluated to have a high contribution by the server, even when its local training data is indeed of low quality. We perform both theoretical analysis and empirical evaluations of ACE. Theoretically, we show our design of ACE can effectively boost the malicious client’s perceived contribution when the server employs the widely-used cosine distance metric to measure contribution. Empirically, our results show ACE effectively and efficiently deceives five state-of-the-art contribution evaluation methods. In addition, ACE preserves the accuracy of the final global models on testing inputs. We also explore six countermeasures to defend against ACE. Our results show they are inadequate to thwart ACE, highlighting the urgent need for new defenses to safeguard the contribution evaluation methods in FL.
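One reason a cosine-distance-based contribution score is attackable is simple to show: cosine similarity is invariant to positive scaling, so a client can rescale its update arbitrarily without changing its measured alignment with the global update. The sketch below only illustrates this scale-invariance, not the actual ACE attack, which is more sophisticated.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two update vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
global_update = rng.normal(size=100)
honest_update = global_update + 0.1 * rng.normal(size=100)

# Scaling an update by any positive factor leaves its cosine similarity
# to the global update unchanged, so a purely direction-based
# contribution score cannot distinguish the two submissions.
scaled_update = 50.0 * honest_update

s1 = cosine_similarity(honest_update, global_update)
s2 = cosine_similarity(scaled_update, global_update)
assert abs(s1 - s2) < 1e-9
```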

[AI-16] SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

Link: https://arxiv.org/abs/2405.20974
Authors: Tianyang Xu,Shujin Wu,Shizhe Diao,Xiaoze Liu,Xingyao Wang,Yangyi Chen,Jing Gao
Keywords: Large language models, Large language, confidence estimates, broader applications, fabricated information
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The code is available at this https URL

Click to view abstract

Abstract:Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications. Previous work elicits confidence from LLMs by direct or self-consistency prompting, or constructing specific datasets for supervised finetuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited to binary or inaccurate group-level confidence estimates. In this work, we present the advanced SaySelf, a training framework that teaches LLMs to express more accurate fine-grained confidence estimates. In addition, beyond the confidence scores, SaySelf initiates the process of directing LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by using an LLM to automatically summarize the uncertainties in specific knowledge via natural language. The summarization is based on the analysis of the inconsistency in multiple sampled reasoning chains, and the resulting data is utilized for supervised fine-tuning. Moreover, we utilize reinforcement learning with a meticulously crafted reward function to calibrate the confidence estimates, motivating LLMs to deliver accurate, high-confidence predictions and to penalize overconfidence in erroneous outputs. Experimental results in both in-distribution and out-of-distribution datasets demonstrate the effectiveness of SaySelf in reducing the confidence calibration error and maintaining the task performance. We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration. The code is made public at this https URL.

[AI-17] Large Language Models are Zero-Shot Next Location Predictors

Link: https://arxiv.org/abs/2405.20962
Authors: Ciro Beneduce,Bruno Lepri,Massimiliano Luca
Keywords: Predicting the locations, locations an individual, individual will visit, future is crucial, crucial for solving
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Predicting the locations an individual will visit in the future is crucial for solving many societal issues like disease diffusion and reduction of pollution among many others. The models designed to tackle next-location prediction, however, require a significant amount of individual-level information to be trained effectively. Such data may be scarce or even unavailable in some geographic regions or peculiar scenarios (e.g., cold-start in recommendation systems). Moreover, the design of a next-location predictor able to generalize or geographically transfer knowledge is still an open research challenge. Recent advances in natural language processing have led to a rapid diffusion of Large Language Models (LLMs) which have shown good generalization and reasoning capabilities. These insights, coupled with the recent findings that LLMs are rich in geographical knowledge, led us to believe that these models can act as zero-shot next-location predictors. This paper evaluates the capabilities of many popular LLMs in this role, specifically Llama, GPT-3.5 and Mistral 7B. After designing a proper prompt, we tested the models on three real-world mobility datasets. The results show that LLMs can obtain accuracies up to 32.4%, a significant relative improvement of over 600% when compared to sophisticated DL models specifically designed for human mobility. Moreover, we show that other LLMs are unable to perform the task properly. To prevent positively biased results, we also propose a framework inspired by other studies to test data contamination. Finally, we explored the possibility of using LLMs as text-based explainers for next-location prediction, showing that they can effectively provide an explanation for their decisions. Notably, 7B models provide more generic, but still reliable, explanations compared to larger counterparts. Code: this http URL

[AI-18] Navigating Tabular Data Synthesis Research: Understanding User Needs and Tool Capabilities

Link: https://arxiv.org/abs/2405.20959
Authors: Maria F. Davila R.,Sven Groen,Fabian Panse,Wolfram Wingerath
Keywords: rapidly advancing data-driven, advancing data-driven applications, era of rapidly, rapidly advancing, advancing data-driven
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 14 pages, 3 figures

Click to view abstract

Abstract:In an era of rapidly advancing data-driven applications, there is a growing demand for data in both research and practice. Synthetic data have emerged as an alternative when no real data is available (e.g., due to privacy regulations). Synthesizing tabular data presents unique and complex challenges, especially handling (i) missing values, (ii) dataset imbalance, (iii) diverse column types, and (iv) complex data distributions, as well as preserving (i) column correlations, (ii) temporal dependencies, and (iii) integrity constraints (e.g., functional dependencies) present in the original dataset. While substantial progress has been made recently in the context of generative models, there is no one-size-fits-all solution for tabular data today, and choosing the right tool for a given task is therefore no trivial task. In this paper, we survey the state of the art in Tabular Data Synthesis (TDS), examine the needs of users by defining a set of functional and non-functional requirements, and compile the challenges associated with meeting those needs. In addition, we evaluate the reported performance of 36 popular research TDS tools against these requirements and develop a decision guide to help users find suitable TDS tools for their applications. The resulting decision guide also identifies significant research gaps.

[AI-19] A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs’ Humour Alignment with Comedians

Link: https://arxiv.org/abs/2405.20956
Authors: Piotr Wojciech Mirowski,Juliette Love,Kory W. Mathewson,Shakir Mohamed
Keywords: Edinburgh Festival Fringe, interviewed twenty professional, twenty professional comedians, perform live shows, Fringe in August
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 1 figure, published at ACM FAccT 2024

Click to view abstract

Abstract:We interviewed twenty professional comedians who perform live shows in front of audiences and who use artificial intelligence in their artistic process as part of 3-hour workshops on “AI x Comedy” conducted at the Edinburgh Festival Fringe in August 2023 and online. The workshop consisted of a comedy writing session with large language models (LLMs), a human-computer interaction questionnaire to assess the Creativity Support Index of AI as a writing tool, and a focus group interrogating the comedians’ motivations for and processes of using AI, as well as their ethical concerns about bias, censorship and copyright. Participants noted that existing moderation strategies used in safety filtering and instruction-tuned LLMs reinforced hegemonic viewpoints by erasing minority groups and their perspectives, and qualified this as a form of censorship. At the same time, most participants felt the LLMs did not succeed as a creativity support tool, by producing bland and biased comedy tropes, akin to “cruise ship comedy material from the 1950s, but a bit less racist”. Our work extends scholarship about the subtle difference between, on the one hand, harmful speech, and on the other hand, “offensive” language as a practice of resistance, satire and “punching up”. We also interrogate the global value alignment behind such language models, and discuss the importance of community-based value alignment and data ownership to build AI tools that better suit artists’ needs.

[AI-20] Monte Carlo Tree Search Satellite Scheduling Under Cloud Cover Uncertainty

Link: https://arxiv.org/abs/2405.20951
Authors: Justin Norman,Francois Rivest
Keywords: Efficient utilization, Monte Carlo Tree, remains a challenging, Carlo Tree Search, Efficient
Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 11 pages, 4 figures

Click to view abstract

Abstract:Efficient utilization of satellite resources in dynamic environments remains a challenging problem in satellite scheduling. This paper addresses the multi-satellite collection scheduling problem (m-SatCSP), aiming to optimize task scheduling over a constellation of satellites under uncertain conditions such as cloud cover. Leveraging Monte Carlo Tree Search (MCTS), a stochastic search algorithm, two versions of MCTS are explored to schedule satellites effectively. Hyperparameter tuning is conducted to optimize the algorithm’s performance. Experimental results demonstrate the effectiveness of the MCTS approach, outperforming existing methods in both solution quality and efficiency. Comparative analysis against other scheduling algorithms showcases competitive performance, positioning MCTS as a promising solution for satellite task scheduling in dynamic environments.

[AI-21] OR-Bench: An Over-Refusal Benchmark for Large Language Models

Link: https://arxiv.org/abs/2405.20947
Authors: Justin Cui,Wei-Lin Chiang,Ion Stoica,Cho-Jui Hsieh
Keywords: Large Language Models, Large Language, require careful safety, careful safety alignment, prevent malicious outputs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: version 1

Click to view abstract

Abstract:Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where the LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of “seemingly toxic prompts” (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at this https URL and the corresponding demo can be found at this https URL. We hope this benchmark can help the community develop better safety aligned models.

[AI-22] Effective Interplay between Sparsity and Quantization: From Theory to Practice

Link: https://arxiv.org/abs/2405.20935
Authors: Simla Burcu Harma,Ayan Chakraborty,Elizaveta Kostenok,Danila Mishin,Dongho Ha,Babak Falsafi,Martin Jaggi,Ming Liu,Yunho Oh,Suvinay Subramanian,Amir Yazdanbakhsh
Keywords: deep neural networks, neural networks necessitates, improve computational efficiency, networks necessitates effective, increasing size
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate the interaction between these two methods and assess whether their combination impacts final model accuracy. We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. Our empirical studies across a wide range of models, including OPT and Llama model families (125M-8B), and ViT, corroborate these theoretical findings. In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation. Our findings extend to the efficient deployment of large models in resource-limited compute platforms and reduce serving cost, offering insights into best practices for applying these compression methods to maximize efficacy without compromising accuracy.
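The prune-then-quantize ordering the paper argues for can be sketched with two toy operators. This is only a minimal numpy illustration of the pipeline, not the paper's error analysis: magnitude pruning followed by uniform symmetric quantization, where zeros are exactly representable, so the sparsity pattern survives quantization intact.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w).ravel())[k]
    return np.where(np.abs(w) < thresh, 0.0, w)

def quantize(w, bits=4):
    """Uniform symmetric quantization."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

# Prune first, then quantize: quantize(0) == 0, so pruned positions
# stay zero and the intended sparsity level is preserved.
compressed = quantize(magnitude_prune(w, sparsity=0.5))
sparsity_after = float(np.mean(compressed == 0.0))
print(sparsity_after >= 0.5)
```

Reversing the order is what introduces the interaction effects the paper studies, since quantization reshuffles which weights are "small" before pruning sees them.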

[AI-23] Fast yet Safe: Early-Exiting with Risk Control

Link: https://arxiv.org/abs/2405.20915
Authors: Metod Jazbec,Alexander Timans,Tin Hadži Veljković,Kaspar Sakmann,Dan Zhang,Christian A. Naesseth,Eric Nalisnick
Keywords: Scaling machine learning, machine learning models, learning models significantly, models significantly improves, Scaling machine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 25 pages, 11 figures, 4 tables (incl. appendix)

Click to view abstract

Abstract:Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it ‘safe’ for an EENN to go ‘fast’? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN’s exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.
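The post-hoc flavour of this tuning can be sketched on simulated data: pick the smallest exit-confidence threshold whose empirical "risk" (disagreement between the early and final prediction) on a calibration set stays below a user target. This is only an illustrative sketch; actual risk-control procedures add distribution-free statistical guarantees rather than a plain empirical scan.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Simulated calibration set: an early exit's confidence, and whether
# the early prediction agrees with the full network's final prediction
# (here, agreement is more likely when confidence is high).
confidence = rng.uniform(0.0, 1.0, size=n)
agrees = rng.uniform(0.0, 1.0, size=n) < confidence

target_risk = 0.1   # tolerated disagreement rate among early exits
threshold = 1.0     # threshold 1.0 means "never exit early" (always safe)
for t in np.linspace(0.0, 1.0, 101):
    exited = confidence >= t
    if exited.any():
        risk = 1.0 - float(agrees[exited].mean())
        if risk <= target_risk:
            threshold = t  # smallest threshold meeting the risk target
            break

print(f"calibrated exit threshold: {threshold:.2f}")
```

At deployment, the network exits early only when its intermediate confidence clears this calibrated threshold, trading computation for a bounded quality loss.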

[AI-24] Enhancing Vision Models for Text-Heavy Content Understanding and Interaction

Link: https://arxiv.org/abs/2405.20906
Authors: Adithya TG,Adithya SK,Abhinav R Bharadwaj,Abhiram HA,Dr. Surabhi Narayan
Keywords: heavy visual content, traditional vision models, text heavy visual, major challenge, challenge for traditional
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 4 figures (including 1 graph)

Click to view abstract

Abstract:Interacting with and understanding text-heavy visual content with multiple images is a major challenge for traditional vision models. This paper focuses on enhancing vision models’ capability to comprehend and learn from images containing a large amount of textual information, such as textbooks and research papers, which include multiple figures like graphs and tables with different types of axes and scales. The approach involves dataset preprocessing, fine-tuning using instruction-oriented data, and evaluation. We also built a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark, developed to consider both textual and visual inputs. An accuracy of 96.71% was obtained. The aim of the project is to advance vision models’ capabilities in understanding complex, interconnected visual and textual data, contributing to multimodal AI.

[AI-25] Preemptive Answer “Attacks” on Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2405.20902
Authors: Rongwu Xu,Zehan Qi,Wei Xu
Keywords: Large language models, Large language, showcase impressive reasoning, showcase impressive, impressive reasoning capabilities
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to ACL’24 (Findings). Camera-ready version

Click to view abstract

Abstract:Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought (CoT) prompting. However, the robustness of this approach warrants further investigation. In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning. This situation can arise inadvertently or be induced by malicious users via prompt injection attacks. Experiments reveal that preemptive answers significantly impair the model’s reasoning capability across various CoT methods and a broad spectrum of datasets. To bolster the robustness of reasoning, we propose two measures aimed at mitigating this issue to some extent.
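The "preemptive answer" scenario is easy to picture as a prompt layout. The sketch below is purely illustrative (the function name and prompt wording are hypothetical, not from the paper): an answer lands in the context before the step-by-step reasoning cue, which is the condition the experiments show degrades CoT performance.

```python
def build_cot_prompt(question, preemptive_answer=None):
    """Illustrative prompt builder: in the 'preemptive answer' scenario,
    an answer appears in the context before the model is asked to
    reason step by step."""
    parts = [f"Question: {question}"]
    if preemptive_answer is not None:
        # Injected answer (inadvertent, or planted via prompt injection)
        parts.append(f"Answer: {preemptive_answer}")
    parts.append("Let's think step by step.")
    return "\n".join(parts)

print(build_cot_prompt("What is 17 * 24?", preemptive_answer="300"))
```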

[AI-26] MALT: Multi-scale Action Learning Transformer for Online Action Detection

Link: https://arxiv.org/abs/2405.20892
Authors: Zhipeng Yang,Ruoyu Wang,Yang Tan,Liping Xie
Keywords: Online action detection, identify ongoing actions, Online action, aims to identify, video in real-time
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures

Click to view abstract

Abstract:Online action detection (OAD) aims to identify ongoing actions from streaming video in real-time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a lack of local information, necessitating the acquisition of action features across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder (used for feature fusion) that includes fewer parameters and can be trained more efficiently. A hierarchical encoder with multiple encoding branches is further proposed to capture multi-scale action features. The output from the preceding branch is then incrementally input to the subsequent branch as part of a cross-attention calculation. In this way, output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently, without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets (THUMOS’14 and TVSeries), outperforming all existing models used for comparison, with an mAP of 0.2% for THUMOS’14 and an mcAP of 0.1% for TVseries.

[AI-27] Effects of Dataset Sampling Rate for Noise Cancellation through Deep Learning

Link: https://arxiv.org/abs/2405.20884
Authors: Brandon Colelough,Andrew Zheng
Keywords: Fast Fourier Transform, Active noise cancellation, Active noise, Fourier Transform, Fast Fourier
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 8 pictures, 3 tables

Click to view abstract

Abstract:Background: Active noise cancellation has been a subject of research for decades. Traditional techniques, like the Fast Fourier Transform, have limitations in certain scenarios. This research explores the use of deep neural networks (DNNs) as a superior alternative. Objective: The study aims to determine the effect that the sampling rate of training data has on lightweight, efficient DNNs that operate within the processing constraints of mobile devices. Methods: We chose the ConvTasNET network for its proven efficiency in speech separation and enhancement. ConvTasNET was trained on datasets such as WHAM!, LibriMix, and the MS-2023 DNS Challenge. The datasets were sampled at rates of 8kHz, 16kHz, and 48kHz to analyze the effect of sampling rate on noise cancellation efficiency and effectiveness. The model was tested on a core-i7 Intel processor from 2023, assessing the network’s ability to produce clear audio while filtering out background noise. Results: Models trained at higher sampling rates (48kHz) provided much better evaluation metrics against Total Harmonic Distortion (THD) and Quality Prediction For Generative Neural Speech Codecs (WARP-Q) values, indicating improved audio quality. However, a trade-off was noted with the processing time being longer for higher sampling rates. Conclusions: The ConvTasNET network, trained on datasets sampled at higher rates like 48kHz, offers a robust solution for mobile devices in achieving noise cancellation through speech separation and enhancement. Future work involves optimizing the model’s efficiency further and testing on mobile devices.

[AI-28] Paying to Do Better: Games with Payments between Learning Agents

Link: https://arxiv.org/abs/2405.20880
Authors: Yoav Kolumbus,Joe Halpern,Éva Tardos
Keywords: choose their actions, players, learning agents, learning, players typically
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments:

Click to view abstract

Abstract:In repeated games, such as auctions, players typically use learning algorithms to choose their actions. The use of such autonomous learning agents has become widespread on online platforms. In this paper, we explore the impact of players incorporating monetary transfers into their agents’ algorithms, aiming to incentivize behavior in their favor. Our focus is on understanding when players have incentives to make use of monetary transfers, how these payments affect learning dynamics, and what the implications are for welfare and its distribution among the players. We propose a simple game-theoretic model to capture such scenarios. Our results on general games show that in a broad class of games, players benefit from letting their learning agents make payments to other learners during the game dynamics, and that in many cases, this kind of behavior improves welfare for all players. Our results on first- and second-price auctions show that in equilibria of the “payment policy game,” the agents’ dynamics can reach strong collusive outcomes with low revenue for the auctioneer. These results highlight a challenge for mechanism design in systems where automated learning agents can benefit from interacting with their peers outside the boundaries of the mechanism.

[AI-29] SelfGNN: Self-Supervised Graph Neural Networks for Sequential Recommendation

Link: https://arxiv.org/abs/2405.20878
Authors: Yuxi Liu,Lianghao Xia,Chao Huang
Keywords: effectively addresses information, addresses information overload, recommendation effectively addresses, effectively addresses, addresses information
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGIR’24

Click to view abstract

Abstract:Sequential recommendation effectively addresses information overload by modeling users’ temporal and sequential interaction patterns. To overcome the limitations of supervision signals, recent approaches have adopted self-supervised learning techniques in recommender systems. However, there are still two critical challenges that remain unsolved. Firstly, existing sequential models primarily focus on long-term modeling of individual interaction sequences, overlooking the valuable short-term collaborative relationships among the behaviors of different users. Secondly, real-world data often contain noise, particularly in users’ short-term behaviors, which can arise from temporary intents or misclicks. Such noise negatively impacts the accuracy of both graph and sequence models, further complicating the modeling process. To address these challenges, we propose a novel framework called Self-Supervised Graph Neural Network (SelfGNN) for sequential recommendation. The SelfGNN framework encodes short-term graphs based on time intervals and utilizes Graph Neural Networks (GNNs) to learn short-term collaborative relationships. It captures long-term user and item representations at multiple granularity levels through interval fusion and dynamic behavior modeling. Importantly, our personalized self-augmented learning structure enhances model robustness by mitigating noise in short-term graphs based on long-term user interests and personal stability. Extensive experiments conducted on four real-world datasets demonstrate that SelfGNN outperforms various state-of-the-art baselines. Our model implementation codes are available at this https URL.
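The "short-term graphs based on time intervals" step can be pictured with a tiny preprocessing sketch. This is only an illustration of the interval-splitting idea (the function and data are hypothetical, not SelfGNN's implementation): interactions are bucketed by time interval, yielding one short-term user-item edge list per interval for a GNN to encode.

```python
def split_into_interval_graphs(interactions, interval):
    """interactions: iterable of (user, item, timestamp) tuples.
    Returns a dict mapping interval index -> list of (user, item)
    edges, i.e. one short-term interaction graph per time interval."""
    graphs = {}
    for user, item, t in interactions:
        graphs.setdefault(int(t // interval), []).append((user, item))
    return graphs

logs = [(0, 'a', 3), (1, 'b', 5), (0, 'c', 12), (2, 'a', 25)]
print(split_into_interval_graphs(logs, interval=10))
# {0: [(0, 'a'), (1, 'b')], 1: [(0, 'c')], 2: [(2, 'a')]}
```

Long-term representations would then be formed by fusing these per-interval encodings across intervals, which is where the framework's interval fusion and denoising come in.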

[AI-30] Investigating Calibration and Corruption Robustness of Post-hoc Pruned Perception CNNs: An Image Classification Benchmark Study

Link: https://arxiv.org/abs/2405.20876
Authors: Pallavi Mitra,Gesina Schwalbe,Nadja Klein
Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, computer vision tasks, natural corruption robustness
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures

Click to view abstract

Abstract:Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in many computer vision tasks. However, high computational and storage demands hinder their deployment into resource-constrained environments, such as embedded devices. Model pruning helps to meet these restrictions by reducing the model size, while maintaining superior performance. Meanwhile, safety-critical applications pose more than just resource and performance constraints. In particular, predictions must not be overly confident, i.e., provide properly calibrated uncertainty estimations (proper uncertainty calibration), and CNNs must be robust against corruptions like naturally occurring input perturbations (natural corruption robustness). This work investigates the important trade-off between uncertainty calibration, natural corruption robustness, and performance for current state-of-research post-hoc CNN pruning techniques in the context of image classification tasks. Our study reveals that post-hoc pruning substantially improves the model’s uncertainty calibration, performance, and natural corruption robustness, sparking hope for safe and robust embedded CNNs. Furthermore, uncertainty calibration and natural corruption robustness are not mutually exclusive targets under pruning, as evidenced by the improved safety aspects obtained by post-hoc unstructured pruning with increasing compression.

[AI-31] Automatic Channel Pruning for Multi-Head Attention

链接: https://arxiv.org/abs/2405.20867
作者: Eunho Lee,Youngbae Hwang
关键词: performance of Transformers, complexity presents challenges, quadratic computation complexity, computation complexity presents, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:

点击查看摘要

Abstract:Despite the strong performance of Transformers, their quadratic computation complexity presents challenges in applying them to vision tasks. Automatic pruning is one of the effective methods for reducing computation complexity without heuristic approaches. However, directly applying it to multi-head attention is not straightforward due to channel misalignment. In this paper, we propose an automatic channel pruning method that takes the multi-head attention mechanism into account. First, we incorporate channel similarity-based weights into the pruning indicator to preserve more informative channels in each head. Then, we adjust the pruning indicator to enforce removal of channels in equal proportions across all heads, preventing channel misalignment. We also add a reweight module to compensate for the information loss resulting from channel removal, and an effective initialization step for the pruning indicator based on the difference of attention between the original structure and each channel. Our proposed method can be applied not only to the original attention, but also to linear attention, which is more efficient, with linear complexity with respect to the number of tokens. On ImageNet-1K, applying our pruning method to the FLattenTransformer, which includes both attention mechanisms, shows superior accuracy at several MAC budgets compared with previous state-of-the-art efficient models and pruning methods. Code will be available soon.
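The key constraint described above, removing channels in equal proportions across all heads so the pruned heads stay dimensionally aligned, can be pictured as a per-head top-k selection. The sketch below is a hypothetical stand-in (the function name and the use of raw importance scores are assumptions, not the authors' indicator):

```python
def prune_heads_aligned(scores, keep_per_head):
    """Given per-head channel importance scores, keep the SAME number of
    channels in every head, so head dimensions stay aligned after pruning."""
    kept = []
    for head_scores in scores:
        # rank channels in this head by importance, highest first
        order = sorted(range(len(head_scores)),
                       key=lambda i: head_scores[i], reverse=True)
        kept.append(sorted(order[:keep_per_head]))
    return kept
```

Naive global pruning could remove, say, 3 channels from one head and 1 from another, which breaks the reshape into a (heads, head_dim) layout; forcing an equal count per head avoids that misalignment.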

[AI-32] clembench-2024: A Challenging Dynamic Complementary Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

链接: https://arxiv.org/abs/2405.20859
作者: Anne Beyer,Kranti Chalamalasetti,Sherzod Hakimov,Brielen Madureira,Philipp Sadler,David Schlangen
关键词: strategic goal orientation, Large Language Models, language understanding abilities, interactive game play, resulting interactive game
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: under review

点击查看摘要

Abstract:It has been established in recent work that Large Language Models (LLMs) can be prompted to “self-play” conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that the approach forms a good basis for making decisions on model choice for building applied interactive systems, and perhaps ultimately setting up a closed-loop development environment of system and simulated evaluator.

[AI-33] SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice

链接: https://arxiv.org/abs/2405.20848
作者: Rui Ren,Jingbang Yang,Linxiao Yang,Xinyue Gu,Liang Sun
关键词: change service, newly deployed service, type of minority, service, fault
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The newly deployed service – one kind of change service, could lead to a new type of minority fault. Existing state-of-the-art methods for fault localization rarely consider the imbalanced fault classification in change service. This paper proposes a novel method that utilizes decision rule sets to deal with highly imbalanced data by optimizing the F1 score subject to cardinality constraints. The proposed method greedily generates the rule with maximal marginal gain and uses an efficient minorize-maximization (MM) approach to select rules iteratively, maximizing a non-monotone submodular lower bound. Compared with existing fault localization algorithms, our algorithm can adapt to the imbalanced fault scenario of change service, and provide interpretable fault causes which are easy to understand and verify. Our method can also be deployed in the online training setting, with only about 15% training overhead compared to the current SOTA methods. Empirical studies showcase that our algorithm outperforms existing fault localization algorithms in both accuracy and model interpretability.
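As a rough illustration of the greedy step described above, adding the rule with maximal marginal F1 gain under a cardinality constraint, here is a simplified pure-Python sketch. It represents each rule as the set of instances it covers, and it omits the paper's minorize-maximization machinery and submodular lower bound:

```python
def greedy_rule_selection(rules, positives, max_rules):
    """Greedily pick rules (sets of covered instances) with maximal marginal F1 gain."""
    def f1(covered):
        tp = len(covered & positives)
        if tp == 0:
            return 0.0
        precision = tp / len(covered)
        recall = tp / len(positives)
        return 2 * precision * recall / (precision + recall)

    selected, covered = [], set()
    for _ in range(max_rules):
        best_i, best_gain = None, 0.0
        for i, rule in enumerate(rules):
            if i in selected:
                continue
            gain = f1(covered | rule) - f1(covered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no rule yields a positive gain
            break
        selected.append(best_i)
        covered |= rules[best_i]
    return selected
```

Optimizing F1 directly, rather than accuracy, is what makes this style of selection suitable for the highly imbalanced fault data the paper targets.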

[AI-34] Don't Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models

链接: https://arxiv.org/abs/2405.20846
作者: A. Bavaresco,A. Testoni,R. Fernández
关键词: unusual visual elements, Image-based advertisements, figurative language, complex multimodal stimuli, advertisements are complex
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to the main conference ACL 2024

点击查看摘要

Abstract:Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they “fool” four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs’ multimodal reasoning abilities. We make our code and TRADE available at this https URL .

[AI-35] einspace: Searching for Neural Architectures from Fundamental Operations

链接: https://arxiv.org/abs/2405.20838
作者: Linus Ericsson,Miguel Espinosa,Chenhongyi Yang,Antreas Antoniou,Amos Storkey,Shay B. Cohen,Steven McDonagh,Elliot J. Crowley
关键词: Neural architecture search, high performing networks, Neural architecture, NAS, finds high performing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Project page at this https URL

点击查看摘要

Abstract:Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren’t diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.
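A search space defined by a probabilistic context-free grammar can be sampled by recursively expanding nonterminals according to weighted productions. The toy grammar and sampler below are illustrative only and far simpler than einspace's actual grammar:

```python
import random

def sample_pcfg(grammar, symbol, rng, max_depth=10):
    """Recursively expand `symbol` by sampling productions according to their weights."""
    if symbol not in grammar or max_depth == 0:
        return [symbol]  # treat as terminal (or truncate at the depth limit)
    productions, weights = zip(*grammar[symbol])
    chosen = rng.choices(productions, weights=weights)[0]
    tokens = []
    for s in chosen:
        tokens += sample_pcfg(grammar, s, rng, max_depth - 1)
    return tokens

# Toy grammar: a "network" is a conv followed by either another network or nothing.
toy_grammar = {"NET": [(["conv", "NET"], 0.5), (["conv"], 0.5)]}
```

Because the grammar is generative, the space contains architectures of unbounded depth and shape, which is how such a design can, in principle, express both convolutional stacks and attention-style compositions.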

[AI-36] Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs

链接: https://arxiv.org/abs/2405.20835
作者: Davide Paglieri,Saurabh Dash,Tim Rocktäschel,Jack Parker-Holder
关键词: Large Language Models, Large Language, reduced memory usage, enabling faster operation, efficiency of Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs) by enabling faster operation and compatibility with more accessible hardware through reduced memory usage, at the cost of small performance drops. We explore the role of calibration sets in PTQ, specifically their effect on hidden activations in various notable open-source LLMs. Calibration sets are crucial for evaluating activation magnitudes and identifying outliers, which can distort the quantization range and negatively impact performance. Our analysis reveals a marked contrast in quantization effectiveness across models. The older OPT model, which much of the quantization literature is based on, shows significant performance deterioration and high susceptibility to outliers with varying calibration sets. In contrast, newer models like Llama-2 7B, Llama-3 8B, Command-R 35B, and Mistral 7B demonstrate strong robustness, with Mistral 7B showing near-immunity to outliers and stable activations. These findings suggest a shift in PTQ strategies might be needed. As advancements in pre-training methods reduce the relevance of outliers, there is an emerging need to reassess the fundamentals of current quantization literature. The emphasis should pivot towards optimizing inference speed, rather than primarily focusing on outlier preservation, to align with the evolving characteristics of state-of-the-art LLMs.
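The outlier sensitivity discussed above is easy to demonstrate with a toy symmetric int8 quantizer whose scale is derived from a calibration set: a single large calibration value inflates the scale and degrades the reconstruction of ordinary activations. This is an illustrative sketch, not the paper's setup:

```python
def quantize_int8(values, calibration):
    """Symmetric int8 quantize-dequantize; the scale is set by the calibration set's max magnitude."""
    scale = max(abs(v) for v in calibration) / 127.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized]  # dequantized values
```

With a clean calibration set of unit-scale activations, a value of 0.5 round-trips almost exactly; add a single outlier of 100 to the calibration set and the same value comes back with roughly 0.29 of absolute error, because most of the int8 range is wasted on the outlier. The paper's finding is that newer LLMs exhibit far fewer such outliers, so this failure mode matters less than the older OPT-era literature assumes.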

[AI-37] There and Back Again: The AI Alignment Paradox

链接: https://arxiv.org/abs/2405.20806
作者: Robert West,Roland Aydin
关键词: human goals, ethical principles, aims to steer, steer AI systems, systems toward human
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The field of AI alignment aims to steer AI systems toward human goals, preferences, and ethical principles. Its contributions have been instrumental for improving the output quality, safety, and trustworthiness of today’s AI models. This perspective article draws attention to a fundamental challenge inherent in all AI alignment endeavors, which we term the “AI alignment paradox”: The better we align AI models with our values, the easier we make it for adversaries to misalign the models. We illustrate the paradox by sketching three concrete example incarnations for the case of language models, each corresponding to a distinct way in which adversaries can exploit the paradox. With AI’s increasing real-world impact, it is imperative that a broad community of researchers be aware of the AI alignment paradox and work to find ways to break out of it, in order to ensure the beneficial use of AI for the good of humanity.

[AI-38] Ovis: Structural Embedding Alignment for Multimodal Large Language Model

链接: https://arxiv.org/abs/2405.20797
作者: Shiyin Lu,Yang Li,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Han-Jia Ye
关键词: Large Language Models, Current Multimodal Large, Multimodal Large Language, Large Language, pre-trained LLM
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs – the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder – makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks demonstrate that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis’ structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Both the source code and the training dataset of Ovis will be made publicly available.
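The "probabilistic combination of the indexed embeddings" can be pictured as a softmax over table indices followed by a probability-weighted sum of table rows. The sketch below assumes per-patch logits over the table, which is a simplification of Ovis's actual indexing scheme:

```python
import math

def visual_embedding(logits, table):
    """Softmax over codebook indices, then a probability-weighted sum of table rows."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]
```

When the logits are peaked, this degenerates into a hard table lookup, which is exactly the mechanism used for textual embeddings; the soft version preserves richer visual semantics while keeping the two modalities structurally alike.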

[AI-39] InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

链接: https://arxiv.org/abs/2405.20795
作者: Huaxiang Zhang,Yaojia Mu,Guo-Niu Zhu,Zhongxue Gan
关键词: advancing autonomous systems, Accurate visual understanding, Accurate visual, intelligent robots, imperative for advancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely recognizing obscured or ambiguously presented visual elements remains challenging. To tackle such issues, this paper proposes InsightSee, a multi-agent framework to enhance VLMs’ interpretative capabilities in handling complex visual understanding scenarios. The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation. The design of these agents and the mechanisms by which they can be enhanced in visual information processing are presented. Experimental results demonstrate that the InsightSee framework not only boosts performance on specific visual tasks but also retains the original models’ strength. The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.

[AI-40] Federated Learning with Blockchain-Enhanced Machine Unlearning: A Trustworthy Approach

链接: https://arxiv.org/abs/2405.20776
作者: Xuhan Zuo,Minghao Wang,Tianqing Zhu,Lefeng Zhang,Shui Yu,Wanlei Zhou
关键词: integrating machine unlearning, user data deletion, integrating machine, regulations and respond, respond to user
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 25 figures

点击查看摘要

Abstract:With the growing need to comply with privacy regulations and respond to user data deletion requests, integrating machine unlearning into IoT-based federated learning has become imperative. Traditional unlearning methods, however, often lack verifiable mechanisms, leading to challenges in establishing trust. This paper delves into the innovative integration of blockchain technology with federated learning to surmount these obstacles. Blockchain fortifies the unlearning process through its inherent qualities of immutability, transparency, and robust security. It facilitates verifiable certification, harmonizes security with privacy, and sustains system efficiency. We introduce a framework that melds blockchain with federated learning, thereby ensuring an immutable record of unlearning requests and actions. This strategy not only bolsters the trustworthiness and integrity of the federated learning model but also adeptly addresses efficiency and security challenges typical in IoT environments. Our key contributions encompass a certification mechanism for the unlearning process, the enhancement of data security and privacy, and the optimization of data management to ensure system responsiveness in IoT scenarios.

[AI-41] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models

链接: https://arxiv.org/abs/2405.20775
作者: Xijie Huang,Xinyuan Wang,Hantao Zhang,Jiawen Xi,Jingkun An,Hao Wang,Chengwei Pan
关键词: Multimodal Large Language, Large Language Models, remain insufficiently studied, Large Language, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Security concerns related to Large Language Models (LLMs) have been extensively explored, yet the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain insufficiently studied. This paper delves into the underexplored security vulnerabilities of MedMLLMs, especially when deployed in clinical environments where the accuracy and relevance of question-and-answer interactions are critically tested against complex medical challenges. By combining existing clinical medical data with atypical natural phenomena, we redefine two types of attacks: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack). Using our own constructed voluminous 3MAD dataset, which covers a wide range of medical image modalities and harmful medical scenarios, we conduct a comprehensive analysis and propose the MCM optimization method, which significantly enhances the attack success rate on MedMLLMs. Evaluations with this dataset and novel attack methods, including white-box attacks on LLaVA-Med and transfer attacks on four other state-of-the-art models, indicate that even MedMLLMs designed with enhanced security features are vulnerable to security breaches. Our work underscores the urgent need for a concerted effort to implement robust security measures and enhance the safety and efficacy of open-source MedMLLMs, particularly given the potential severity of jailbreak attacks and other malicious or clinically significant exploits in medical settings. For further research and replication, anonymous access to our code is available at this https URL. Warning: Medical large model jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.

[AI-42] Exploring Backdoor Attacks against Large Language Model-based Decision Making

链接: https://arxiv.org/abs/2405.20774
作者: Ruochen Jiao,Shaoyuan Xie,Justin Yue,Takami Sato,Lixu Wang,Yixuan Wang,Qi Alfred Chen,Qi Zhu
关键词: Large Language Models, Large Language, Language Models, shown significant promise, reasoning abilities learned
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 27 pages, including main paper, references, and appendix

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant promise in decision-making tasks when fine-tuned on specific applications, leveraging their inherent common sense and reasoning abilities learned from vast amounts of data. However, these systems are exposed to substantial safety and security risks during the fine-tuning phase. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-enabled Decision-making systems (BALD), systematically exploring how such attacks can be introduced during the fine-tuning phase across various channels. Specifically, we propose three attack mechanisms and corresponding backdoor optimization methods to attack different components in the LLM-based decision-making pipeline: word injection, scenario manipulation, and knowledge injection. Word injection embeds trigger words directly into the query prompt. Scenario manipulation occurs in the physical environment, where a high-level backdoor semantic scenario triggers the attack. Knowledge injection conducts backdoor attacks on retrieval augmented generation (RAG)-based LLM systems, strategically injecting word triggers into poisoned knowledge while ensuring the information remains factually accurate for stealthiness. We conduct extensive experiments with three popular LLMs (GPT-3.5, LLaMA2, PaLM2), using two datasets (HighwayEnv, nuScenes), and demonstrate the effectiveness and stealthiness of our backdoor triggers and mechanisms. Finally, we critically assess the strengths and weaknesses of our proposed approaches, highlight the inherent vulnerabilities of LLMs in decision-making tasks, and evaluate potential defenses to safeguard LLM-based decision making systems.

[AI-43] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

链接: https://arxiv.org/abs/2405.20773
作者: Siyuan Ma,Weidi Luo,Yu Wang,Xiaogeng Liu,Muhao Chen,Bo Li,Chaowei Xiao
关键词: Multimodal Large Language, Large Language Models, deployment of Multimodal, Multimodal Large, Large Language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly critical. To achieve this objective, it requires us to proactively discover the vulnerability of MLLMs by exploring the attack methods. Thus, structure-based jailbreak attacks, where harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, which lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of “Role-play” into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages Large Language Models to generate detailed descriptions of high-risk characters and create corresponding images based on the descriptions. When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses by enacting characters with negative attributes. We further extend our VRP method into a universal setup to demonstrate its generalizability. Extensive experiments on popular benchmarks show that VRP outperforms the strongest baseline, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models.

[AI-44] Towards Black-Box Membership Inference Attack for Diffusion Models

链接: https://arxiv.org/abs/2405.20771
作者: Jingwei Li,Jing Dong,Tianxing He,Jingzhao Zhang
关键词: important research topic, research topic, train a diffusion, important research, rising popularity
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying whether an artwork was used to train a diffusion model is an important research topic, given the rising popularity of AI-generated art and the associated copyright concerns. The work approaches this problem from the membership inference attack (MIA) perspective. We first identify the limitations of applying existing MIA methods for copyright protection: the required access of internal U-nets and the choice of non-member datasets for evaluation. To address the above problems, we introduce a novel black-box membership inference attack method that operates without needing access to the model’s internal U-net. We then construct a DALL-E generated dataset for a more comprehensive evaluation. We validate our method across various setups, and our experimental results outperform previous works.

[AI-45] Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

链接: https://arxiv.org/abs/2405.20770
作者: Guang Lin,Qibin Zhao
关键词: large language models, LAnguage MOdel Sentinel, large language, past two years, advanced rapidly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Over the past two years, the use of large language models (LLMs) has advanced rapidly. While these LLMs offer considerable convenience, they also raise security concerns, as LLMs are vulnerable to adversarial attacks by some well-designed textual perturbations. In this paper, we introduce a novel defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is designed to enhance the adversarial robustness of LLMs by purifying the adversarial textual examples before feeding them into the target LLM. Our method comprises two main components: a) Agent instruction, which can simulate a new agent for adversarial defense, altering minimal characters to maintain the original meaning of the sentence while defending against attacks; b) Defense guidance, which provides strategies for modifying clean or adversarial examples to ensure effective defense and accurate outputs from the target LLMs. Remarkably, the defense agent demonstrates robust defensive capabilities even without learning from adversarial examples. Additionally, we conduct an intriguing adversarial experiment where we develop two agents, one for defense and one for attack, and engage them in mutual confrontation. During the adversarial interactions, neither agent completely beat the other. Extensive experiments on both open-source and closed-source LLMs demonstrate that our method effectively defends against adversarial attacks, thereby enhancing adversarial robustness.

[AI-46] OpenTensor: Reproducing Faster Matrix Multiplication Discovering Algorithms

链接: https://arxiv.org/abs/2405.20748
作者: Yiwen Sun,Wenye Li
关键词: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, DRL, multiplication by Deep
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:OpenTensor is a reproduction of AlphaTensor, which discovered a new algorithm that outperforms the state-of-the-art methods for matrix multiplication by Deep Reinforcement Learning (DRL). While AlphaTensor provides a promising framework for solving scientific problems, it is really hard to reproduce due to the massive tricks and lack of source codes. In this paper, we clean up the algorithm pipeline, clarify the technical details, and make some improvements to the training process. Computational results show that OpenTensor can successfully find efficient matrix multiplication algorithms.
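The objects AlphaTensor (and hence OpenTensor) searches for are low-rank decompositions of the matrix-multiplication tensor, i.e., algorithms like Strassen's classic scheme that multiplies 2x2 matrices with 7 multiplications instead of 8. The sketch below uses the standard textbook formulas as a sanity check; it is not OpenTensor code:

```python
def strassen_2x2(A, B):
    """2x2 matrix product using 7 multiplications (Strassen's decomposition),
    the kind of algorithm AlphaTensor/OpenTensor discovers via DRL."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    # recombine the 7 products into the 4 output entries
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applied recursively to block matrices, the 7-multiplication count is what lowers the asymptotic exponent below 3; the DRL agent's job is to find decompositions with even fewer rank-1 terms.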

[AI-47] Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes

链接: https://arxiv.org/abs/2405.20743
作者: Riccardo Benaglia,Angelo Porrello,Pietro Buzzega,Simone Calderara,Rita Cucchiara
关键词: video surveillance analytics, basketball players engaged, Trajectory forecasting, Quantized Variational Autoencoders, surveillance analytics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 15 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
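The two mechanisms described above, nearest-neighbour codebook lookup and instance-level low-rank adjustment of the codebook rows, can be sketched as follows. This is a rank-1 toy version; the paper's conditioning on past motion patterns and its distance computations are more involved:

```python
def rank1_update(codebook, u, v):
    """Adjust every codebook row by a rank-1 term u_i * v_j
    (toy analogue of the paper's low-rank, instance-level codebook update)."""
    return [[codebook[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(codebook))]

def quantize(x, codebook):
    """VQ-VAE-style lookup: index of the codebook row closest to x."""
    def d2(row):
        return sum((a - b) ** 2 for a, b in zip(x, row))
    return min(range(len(codebook)), key=lambda i: d2(codebook[i]))
```

Because the update is low-rank, each instance can bend the codebook toward its own context while the customization stays confined to a low-dimensional subspace, which is the flexibility/regularization trade-off the abstract points to.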

[AI-48] Maximum Temperature Prediction Using Remote Sensing Data Via Convolutional Neural Network

链接: https://arxiv.org/abs/2405.20731
作者: Lorenzo Innocenti,Giacomo Blanco,Luca Barco,Claudio Rossi
关键词: pose significant threats, specific zones exhibiting, zones exhibiting substantially, exhibiting substantially higher, Urban heat islands
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages, submitted to IEEE MetroLivEnv 2024 conference

点击查看摘要

Abstract:Urban heat islands, defined as specific zones exhibiting substantially higher temperatures than their immediate environs, pose significant threats to environmental sustainability and public health. This study introduces a novel machine-learning model that amalgamates data from the Sentinel-3 satellite, meteorological predictions, and additional remote sensing inputs. The primary aim is to generate detailed spatiotemporal maps that forecast the peak temperatures within a 24-hour period in Turin. Experimental results validate the model’s proficiency in predicting temperature patterns, achieving a Mean Absolute Error (MAE) of 2.09 degrees Celsius for the year 2023 at a resolution of 20 meters per pixel, thereby enriching our knowledge of urban climatic behavior. This investigation enhances the understanding of urban microclimates, emphasizing the importance of cross-disciplinary data integration, and laying the groundwork for informed policy-making aimed at alleviating the negative impacts of extreme urban temperatures.

[AI-49] GANcrop: A Contrastive Defense Against Backdoor Attacks in Federated Learning

链接: https://arxiv.org/abs/2405.20727
作者: Xiaoyun Gan,Shanyu Gan,Taizhi Su,Peng Liu
关键词: data privacy protection, attracted widespread attention, machine learning method, privacy-preserving distributed machine, privacy protection
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:With heightened awareness of data privacy protection, Federated Learning (FL) has attracted widespread attention as a privacy-preserving distributed machine learning method. However, the distributed nature of federated learning also provides opportunities for backdoor attacks, where attackers can guide the model to produce incorrect predictions without affecting the global model training process. This paper introduces a novel defense mechanism against backdoor attacks in federated learning, named GANcrop. This approach leverages contrastive learning to deeply explore the disparities between malicious and benign models for attack identification, followed by the utilization of Generative Adversarial Networks (GAN) to recover backdoor triggers and implement targeted mitigation strategies. Experimental findings demonstrate that GANcrop effectively safeguards against backdoor attacks, particularly in non-IID scenarios, while maintaining satisfactory model accuracy, showcasing its remarkable defensive efficacy and practical utility.

[AI-50] GI-NAS: Boosting Gradient Inversion Attacks through Adaptive Neural Architecture Search

链接: https://arxiv.org/abs/2405.20725
作者: Wenbo Yu,Hao Fang,Bin Chen,Xiaohang Sui,Chuan Chen,Hao Wu,Shu-Tao Xia,Ke Xu
关键词: Federated Learning, considerable privacy concerns, raised considerable privacy, Gradient Inversion, Inversion Attacks invert
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gradient Inversion Attacks invert the transmitted gradients in Federated Learning (FL) systems to reconstruct the sensitive data of local clients and have raised considerable privacy concerns. A majority of gradient inversion methods rely heavily on explicit prior knowledge (e.g., a well pre-trained generative model), which is often unavailable in realistic scenarios. To alleviate this issue, researchers have proposed to leverage the implicit prior knowledge of an over-parameterized network. However, they only utilize a fixed neural architecture for all the attack settings. This would hinder the adaptive use of implicit architectural priors and consequently limit the generalizability. In this paper, we further exploit such implicit prior knowledge by proposing Gradient Inversion via Neural Architecture Search (GI-NAS), which adaptively searches the network and captures the implicit priors behind neural architectures. Extensive experiments verify that our proposed GI-NAS can achieve superior attack performance compared to state-of-the-art gradient inversion methods, even under more practical settings with high-resolution images, large-sized batches, and advanced defense strategies.
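For background on what such attacks optimize: a basic gradient inversion attack fits a synthetic input (and label) whose gradient matches the leaked one. Below is a minimal sketch for a one-neuron linear model with a toy finite-difference optimizer; all names and the optimizer choice are illustrative assumptions, not the paper's method:

```python
import random

def leaked_gradient(w, x, t):
    """Gradient of the squared loss 0.5 * (w.x - t)^2 with respect to w."""
    r = sum(wi * xi for wi, xi in zip(w, x)) - t
    return [r * xi for xi in x]

def matching_loss(w, g, x_hat, t_hat):
    """Squared distance between the leaked gradient g and the gradient
    produced by the candidate reconstruction (x_hat, t_hat)."""
    g_hat = leaked_gradient(w, x_hat, t_hat)
    return sum((a - b) ** 2 for a, b in zip(g_hat, g))

def invert(w, g, steps=10000, lr=0.02, eps=1e-5):
    """Recover a (data, label) pair whose gradient matches the leaked one,
    via finite-difference gradient descent on the matching loss."""
    rng = random.Random(0)
    x_hat = [rng.uniform(-1, 1) for _ in w]
    t_hat = 0.0
    for _ in range(steps):
        base = matching_loss(w, g, x_hat, t_hat)
        grad_x = []
        for j in range(len(x_hat)):
            x_pert = list(x_hat)
            x_pert[j] += eps
            grad_x.append((matching_loss(w, g, x_pert, t_hat) - base) / eps)
        grad_t = (matching_loss(w, g, x_hat, t_hat + eps) - base) / eps
        x_hat = [xj - lr * gj for xj, gj in zip(x_hat, grad_x)]
        t_hat -= lr * grad_t
    return x_hat, t_hat
```

Because different (data, label) pairs can yield the same gradient, real attacks add priors to disambiguate; GI-NAS's contribution is to search over network architectures so that this implicit prior is chosen adaptively per attack setting.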

[AI-51] ContextGS: Compact 3D Gaussian Splatting with Anchor Level Context Model

链接: https://arxiv.org/abs/2405.20721
作者: Yufei Wang,Zhihao Li,Lanqing Guo,Wenhan Yang,Alex C. Kot,Bihan Wen
关键词: Gaussian Splatting, offering fast rendering, fast rendering speeds, neural Gaussians, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has become a promising framework for novel view synthesis, offering fast rendering speeds and high fidelity. However, the large number of Gaussians and their associated attributes require effective compression techniques. Existing methods primarily compress neural Gaussians individually and independently, i.e., coding all the neural Gaussians at the same time, with little design for their interactions and spatial dependence. Inspired by the effectiveness of the context model in image compression, we propose the first autoregressive model at the anchor level for 3DGS compression in this work. We divide anchors into different levels and the anchors that are not coded yet can be predicted based on the already coded ones in all the coarser levels, leading to more accurate modeling and higher coding efficiency. To further improve the efficiency of entropy coding, e.g., to code the coarsest level with no already coded anchors, we propose to introduce a low-dimensional quantized feature as the hyperprior for each anchor, which can be effectively compressed. Our work pioneers the context model in the anchor level for 3DGS representation, yielding an impressive size reduction of over 100 times compared to vanilla 3DGS and 15 times compared to the most recent state-of-the-art work Scaffold-GS, while achieving comparable or even higher rendering quality.

[AI-52] Climate Variable Downscaling with Conditional Normalizing Flows

链接: https://arxiv.org/abs/2405.20719
作者: Christina Winkler,Paula Harder,David Rolnick
关键词: models typically operate, coarse spatial scales, spatial scales due, large computational costs, Predictions of global
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Predictions of global climate models typically operate on coarse spatial scales due to the large computational costs of climate simulations. This has led to a considerable interest in methods for statistical downscaling, a similar process to super-resolution in the computer vision context, to provide more local and regional climate information. In this work, we apply conditional normalizing flows to the task of climate variable downscaling. We showcase its successful performance on an ERA5 water content dataset for different upsampling factors. Additionally, we show that the method allows us to assess the predictive uncertainty in terms of standard deviation from the fitted conditional distribution mean.

[AI-53] Popularity-Aware Alignment and Contrast for Mitigating Popularity Bias

链接: https://arxiv.org/abs/2405.20718
作者: Miaomiao Cai,Lei Chen,Yifan Wang,Haoyue Bai,Peijie Sun,Le Wu,Min Zhang,Meng Wang
关键词: Collaborative Filtering, popularity bias, item representations, unpopular item representations, typically suffers
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by KDD 2024

点击查看摘要

Abstract:Collaborative Filtering (CF) typically suffers from the significant challenge of popularity bias due to the uneven distribution of items in real-world datasets. This bias leads to a significant accuracy gap between popular and unpopular items. It not only hinders accurate user preference understanding but also exacerbates the Matthew effect in recommendation systems. To alleviate popularity bias, existing efforts focus on emphasizing unpopular items or separating the correlation between item representations and their popularity. Despite the effectiveness, existing works still face two persistent challenges: (1) how to extract common supervision signals from popular items to improve the unpopular item representations, and (2) how to alleviate the representation separation caused by popularity bias. In this work, we conduct an empirical analysis of popularity bias and propose Popularity-Aware Alignment and Contrast (PAAC) to address two challenges. Specifically, we use the common supervisory signals modeled in popular item representations and propose a novel popularity-aware supervised alignment module to learn unpopular item representations. Additionally, we suggest re-weighting the contrastive learning loss to mitigate the representation separation from a popularity-centric perspective. Finally, we validate the effectiveness and rationale of PAAC in mitigating popularity bias through extensive experiments on three real-world datasets. Our code is available at this https URL.
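One illustrative way to make a contrastive loss popularity-aware, as the abstract describes, is to re-weight each negative's contribution inside an InfoNCE-style loss by its normalized popularity. This is a sketch under assumed semantics; PAAC's exact weighting scheme differs and all names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reweighted_info_nce(anchor, positive, negatives, neg_popularity, tau=0.2):
    """InfoNCE in which each negative is scaled by its normalized popularity,
    so the repulsion applied to items is popularity-aware rather than uniform."""
    pos = math.exp(cosine(anchor, positive) / tau)
    total = sum(neg_popularity)
    neg = sum(len(negatives) * (p / total) * math.exp(cosine(anchor, n) / tau)
              for n, p in zip(negatives, neg_popularity))
    return -math.log(pos / (pos + neg))
```

With this weighting, a highly popular negative that sits close to the anchor contributes more to the loss, pushing its representation away more strongly; the direction of the re-weighting is a design choice, not dictated by the formula.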

[AI-54] FinGen: A Dataset for Argument Generation in Finance

链接: https://arxiv.org/abs/2405.20708
作者: Chung-Chi Chen,Hiroya Takamura,Ichiro Kobayashi,Yusuke Miyao
关键词: daily life, important activities, activities that people, Thinking, Abstract
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Thinking about the future is one of the important activities that people do in daily life. Futurists also put a lot of effort into figuring out possible scenarios for the future. We argue that exploration of this direction is still at an early stage in NLP research. To this end, we propose three argument generation tasks in the financial application scenario. Our experimental results show these tasks remain a significant challenge for representative generation models. Based on our empirical results, we further point out several unresolved issues and challenges in this research direction.

[AI-55] ADESSE: Advice Explanations in Complex Repeated Decision-Making Environments

链接: https://arxiv.org/abs/2405.20705
作者: Sören Schleibaum,Lu Feng,Sarit Kraus,Jörg P. Müller
关键词: decision-making processes stands, fostering a synergistic, paramount challenge, evolving landscape, synergistic relationship
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the evolving landscape of human-centered AI, fostering a synergistic relationship between humans and AI agents in decision-making processes stands as a paramount challenge. This work considers a problem setup where an intelligent agent comprising a neural network-based prediction component and a deep reinforcement learning component provides advice to a human decision-maker in complex repeated decision-making environments. Whether the human decision-maker would follow the agent’s advice depends on their beliefs and trust in the agent and on their understanding of the advice itself. To this end, we developed an approach named ADESSE to generate explanations about the adviser agent to improve human trust and decision-making. Computational experiments on a range of environments with varying model sizes demonstrate the applicability and scalability of ADESSE. Furthermore, an interactive game-based user study shows that participants were significantly more satisfied, achieved a higher reward in the game, and took less time to select an action when presented with explanations generated by ADESSE. These findings illuminate the critical role of tailored, human-centered explanations in AI-assisted decision-making.

[AI-56] Unveiling the Lexical Sensitivity of LLMs: Combinatorial Optimization for Prompt Enhancement

链接: https://arxiv.org/abs/2405.20701
作者: Pengwei Zhan,Zhen Xu,Qian Tan,Jie Song,Ru Xie
关键词: Large language models, Large language, demonstrate exceptional instruct-following, demonstrate exceptional, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate exceptional instruct-following ability to complete various downstream tasks. Although this impressive ability makes LLMs flexible task solvers, their performance in solving tasks also heavily relies on instructions. In this paper, we reveal that LLMs are over-sensitive to lexical variations in task instructions, even when the variations are imperceptible to humans. By providing models with neighborhood instructions, which are closely situated in the latent representation space and differ by only one semantically similar word, the performance on downstream tasks can be vastly different. Following this property, we propose a black-box Combinatorial Optimization framework for Prompt Lexical Enhancement (COPLE). COPLE performs iterative lexical optimization according to the feedback from a batch of proxy tasks, using a search strategy related to word influence. Experiments show that even widely-used human-crafted prompts for current benchmarks suffer from the lexical sensitivity of models, and COPLE recovers the declined model ability in both instruct-following and solving downstream tasks.
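The iterative lexical optimization COPLE describes can be sketched as greedy hill-climbing over single-word substitutions, scored by feedback from proxy tasks. The scoring callback and substitution map below are illustrative assumptions, not the paper's actual search strategy:

```python
def lexical_search(prompt_words, substitutions, score, iterations=3):
    """Greedy hill-climbing over single-word substitutions.

    `score(words)` is a proxy-task performance callback; `substitutions`
    maps a word to candidate near-synonyms from the latent neighborhood.
    """
    best = list(prompt_words)
    best_score = score(best)
    for _ in range(iterations):
        improved = False
        for i, word in enumerate(list(best)):
            for cand in substitutions.get(word, []):
                trial = best[:i] + [cand] + best[i + 1:]
                s = score(trial)
                if s > best_score:
                    best, best_score, improved = trial, s, True
        if not improved:
            break
    return best, best_score
```

Even this crude search illustrates the paper's core observation: swapping one word for a semantically similar neighbor can change downstream performance, so the prompt surface form is itself worth optimizing.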

[AI-57] Self-degraded contrastive domain adaptation for industrial fault diagnosis with bi-imbalanced data

链接: https://arxiv.org/abs/2405.20700
作者: Gecheng Chen,Zeyu Yang,Chengwen Luo,Jianqiang Li
关键词: Modern industrial fault, Modern industrial, industrial fault diagnosis, fault diagnosis tasks, domain adaptation
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modern industrial fault diagnosis tasks often face the combined challenge of distribution discrepancy and bi-imbalance. Existing domain adaptation approaches pay little attention to the prevailing bi-imbalance, leading to poor domain adaptation performance or even negative transfer. In this work, we propose a self-degraded contrastive domain adaptation (Sd-CDA) diagnosis framework to handle the domain discrepancy under the bi-imbalanced data. It first pre-trains the feature extractor via imbalance-aware contrastive learning based on model pruning to learn the feature representation efficiently in a self-supervised manner. Then it forces the samples away from the domain boundary based on supervised contrastive domain adversarial learning (SupCon-DA) and ensures the features generated by the feature extractor are discriminative enough. Furthermore, we propose the pruned contrastive domain adversarial learning (PSupCon-DA) to pay automatically re-weighted attention to the minorities to enhance the performance towards bi-imbalanced data. We show the superiority of the proposed method via two experiments.

[AI-58] In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought

链接: https://arxiv.org/abs/2405.20692
作者: Sili Huang,Jifeng Hu,Hechang Chen,Lichao Sun,Bo Yang
关键词: offline reinforcement learning, providing task prompts, reinforcement learning, promising approach, approach for offline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In-context learning is a promising approach for offline reinforcement learning (RL) to handle online tasks, which can be achieved by providing task prompts. Recent works demonstrated that in-context RL could emerge with self-improvement in a trial-and-error manner when treating RL tasks as an across-episodic sequential prediction problem. Despite the self-improvement not requiring gradient updates, current works still suffer from high computational costs when the across-episodic sequence increases with task horizons. To this end, we propose an In-context Decision Transformer (IDT) to achieve self-improvement in a high-level trial-and-error manner. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and thus reconstructs the sequence to consist of high-level decisions instead of low-level actions that interact with environments. As one high-level decision can guide multi-step low-level actions, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art in long-horizon tasks over current in-context RL methods. In particular, the online evaluation time of our IDT is 36 times faster than baselines in the D4RL benchmark and 27 times faster in the Grid World benchmark.

[AI-59] Conditioning GAN Without Training Dataset

链接: https://arxiv.org/abs/2405.20687
作者: Kidist Amde Mekonnen
关键词: Deep learning algorithms, training dataset, Deep learning, Training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 5 pages, 2 figures, Part of my MSc project course, School Project Course 2022

点击查看摘要

Abstract:Deep learning algorithms have a large number of trainable parameters, often hundreds of thousands or more. Training these algorithms requires a large amount of training data, and generating a sufficiently large dataset for them is costly. GANs are generative neural networks that use two competing deep learning networks: a generator and a discriminator. The generator tries to generate realistic images resembling the actual training dataset by approximating the training data distribution, while the discriminator is trained to classify images as real or fake (generated). Training these GAN algorithms also requires a large training dataset. In this study, the aim is to address the question, “Given an unconditioned pretrained generator network and a pretrained classifier, is it feasible to develop a conditioned generator without relying on any training dataset?” The paper begins with a general introduction to the problem. The subsequent sections are structured as follows: Section 2 provides background information on the problem. Section 3 reviews relevant literature on the topic. Section 4 outlines the methodology employed in this study. Section 5 presents the experimental results. Section 6 discusses the findings and proposes potential future research directions. Finally, Section 7 offers concluding remarks. The implementation can be accessed at this https URL.
Comments: 5 pages, 2 figures, Part of my MSc project course, School Project Course 2022 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM) Cite as: arXiv:2405.20687 [cs.CV] (or arXiv:2405.20687v1 [cs.CV] for this version) Submission history: From: Kidist Amde Mekonnen [v1] Fri, 31 May 2024 08:31:26 UTC (883 KB)

[AI-60] No Free Lunch Theorem for Privacy-Preserving LLM Inference

链接: https://arxiv.org/abs/2405.20681
作者: Xiaojin Zhang,Yulin Fei,Yan Kang,Wei Chen,Lixin Fan,Hai Jin,Qiang Yang
关键词: Gemini and ChatGPT, Large Language Models, including PaLM, significantly benefited, Language Models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Individuals and businesses have benefited significantly from Large Language Models (LLMs) such as PaLM, Gemini, and ChatGPT in various ways. For example, LLMs enhance productivity, reduce costs, and enable us to focus on more valuable tasks. Furthermore, LLMs possess the capacity to sift through extensive datasets, uncover underlying patterns, and furnish critical insights that propel the frontiers of technology and science. However, LLMs also pose privacy concerns. Users’ interactions with LLMs may expose their sensitive personal or company information. A lack of robust privacy safeguards and legal frameworks could permit the unwarranted intrusion or improper handling of individual data, thereby risking infringements of privacy and the theft of personal identities. To ensure privacy, it is essential to minimize the dependency between shared prompts and private information. Various randomization approaches have been proposed to protect prompts’ privacy, but they may incur utility loss compared to unprotected LLM prompting. Therefore, it is essential to evaluate the balance between the risk of privacy leakage and loss of utility when conducting effective protection mechanisms. The current study develops a framework for privacy-protected inference with Large Language Models (LLMs) and lays down a solid theoretical basis for examining the interplay between privacy preservation and utility. The core insight is encapsulated within a theorem called the NFL (No-Free-Lunch) Theorem.
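A concrete instance of the randomization approaches mentioned in the abstract is k-ary randomized response applied per token, which satisfies epsilon-local differential privacy over the vocabulary. This is a generic sketch, not the paper's mechanism, and all names are illustrative:

```python
import math
import random

def randomize_prompt(tokens, vocabulary, epsilon=2.0, seed=0):
    """Randomized-response mechanism over tokens: keep each token with
    probability exp(eps) / (exp(eps) + |V| - 1); otherwise replace it with
    a uniformly random *different* token from the vocabulary.

    Higher epsilon keeps more tokens (more utility, weaker privacy)."""
    rng = random.Random(seed)
    keep_p = math.exp(epsilon) / (math.exp(epsilon) + len(vocabulary) - 1)
    out = []
    for tok in tokens:
        if rng.random() < keep_p:
            out.append(tok)
        else:
            out.append(rng.choice([v for v in vocabulary if v != tok]))
    return out
```

The keep probability is exactly the one that makes per-token randomized response epsilon-LDP; the utility loss that the NFL theorem formalizes shows up here directly, since every replaced token degrades the prompt.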

[AI-61] Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

链接: https://arxiv.org/abs/2405.20680
作者: Mingda Li,Xinyu Li,Yifan Chen,Wenfeng Xuan,Weinan Zhang
关键词: Large Language Models, Retrieval-Augmented Large Language, original retrieval-free Language, Large Language, retrieval-free Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: ACL 2024 (findings)

点击查看摘要

Abstract:Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their superiority in terms of factuality, they do not consistently outperform the original retrieval-free Language Models (LMs). Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LM but also among different retrievers. To understand this phenomenon, we investigate the degeneration behavior of RALMs and theoretically decompose it into four categories. Further analysis based on our decomposition reveals that the innate difference in knowledge sources and the unpredictable degeneration of the reader model contribute most to the inconsistency. Drawing from our analysis, we introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors. Our experiments on Open Domain Question Answering show that EoR substantially improves performance over the RALM with a single retriever by considerably reducing inconsistent behaviors.
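The fallback behavior of an ensemble like EoR can be sketched as majority voting over the answers produced with different retrievers, backing off to the retrieval-free LM answer when the retrievers disagree. This is a simplification (EoR is trainable and adapts its retrieval), with illustrative names:

```python
from collections import Counter

def ensemble_answer(retriever_answers, retrieval_free_answer, min_agreement=2):
    """Majority vote over answers obtained with different retrievers;
    fall back to the retrieval-free answer when no candidate reaches
    the agreement threshold."""
    counts = Counter(a for a in retriever_answers if a is not None)
    if counts:
        answer, votes = counts.most_common(1)[0]
        if votes >= min_agreement:
            return answer
    return retrieval_free_answer
```

The design intuition matches the paper's decomposition: when knowledge sources disagree, the retrieval-augmented path is the likely source of degeneration, so deferring to the retrieval-free model reduces inconsistent behavior.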

[AI-62] Adv-KD: Adversarial Knowledge Distillation for Faster Diffusion Sampling

链接: https://arxiv.org/abs/2405.20675
作者: Kidist Amde Mekonnen,Nicola Dall’Asen,Paolo Rota
关键词: image synthesis tasks, Diffusion Probabilistic Models, achieving remarkable performance, Diffusion Probabilistic, Probabilistic Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: 7 pages, 11 figures, ELLIS Doctoral Symposium 2023 in Helsinki, Finland

点击查看摘要

Abstract:Diffusion Probabilistic Models (DPMs) have emerged as a powerful class of deep generative models, achieving remarkable performance in image synthesis tasks. However, these models face challenges in terms of widespread adoption due to their reliance on sequential denoising steps during sample generation. This dependence leads to substantial computational requirements, making them unsuitable for resource-constrained or real-time processing systems. To address these challenges, we propose a novel method that integrates denoising phases directly into the model’s architecture, thereby reducing the need for resource-intensive computations. Our approach combines diffusion models with generative adversarial networks (GANs) through knowledge distillation, enabling more efficient training and evaluation. By utilizing a pre-trained diffusion model as a teacher model, we train a student model through adversarial learning, employing layerwise transformations for denoising and submodules for predicting the teacher model’s output at various points in time. This integration significantly reduces the number of parameters and denoising steps requ