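This post lists the latest papers retrieved from Arxiv.org on 2024-10-17. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically at around 11:00 each morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.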

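Overview (2024-10-17)

549 papers were updated today, including:

  • Natural Language Processing: 111 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 157 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 97 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 182 papers (Machine Learning (cs.LG))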

Natural Language Processing

[NLP-0] Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media

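[Quick read]: The paper asks whether the People's Republic of China (PRC) interferes with European elections through ethnic Chinese diaspora media, and examines how PRC news media shape these narratives. The key to the approach is KeyNMF, a new method for static and dynamic topic modelling built on transformer-based contextual embedding models; benchmark evaluations show it is competitive on several Chinese datasets and metrics. The paper also combines KeyNMF with existing methods for describing information dynamics in complex systems and applies the pipeline to data from five news sites, focusing on the period before the 2024 European parliamentary elections, to demonstrate its effectiveness for studying information dynamics in Chinese media.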

Link: https://arxiv.org/abs/2410.12791
Authors: Ross Deans Kristensen-McLachlan,Rebecca M. M. Hicke,Márton Kardos,Mette Thunø
Keywords: Republic of China, People Republic, ethnic Chinese diaspora, Chinese diaspora media, European elections
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the 2024 Computational Humanities Research Conference (CHR)

Abstract:Does the People’s Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.

[NLP-1] Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

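[Quick read]: The paper addresses how the text chunking strategy in Retrieval-Augmented Generation (RAG) systems affects the quality of knowledge-intensive tasks. The key is Meta-Chunking, a granularity between sentences and paragraphs that consists of sets of sentences with deep linguistic and logical connections, implemented through two strategies based on large language models (LLMs): Margin Sampling Chunking and Perplexity Chunking. The former uses binary classification to decide whether consecutive sentences should be split; the latter identifies chunk boundaries by analyzing the perplexity distribution. Combined with a dynamic merging strategy that balances fine-grained and coarse-grained chunking, this improves RAG performance on single-hop and multi-hop question answering.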

Link: https://arxiv.org/abs/2410.12788
Authors: Jihao Zhao,Zhiyuan Ji,Pengnian Qi,Simin Niu,Bo Tang,Feiyu Xiong,Zhiyu Li
Keywords: large language models, Retrieval-Augmented Generation, language models, knowledge-intensive tasks, viable complement
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline, which impacts the quality of knowledge-intensive tasks. This paper introduces the concept of Meta-Chunking, which refers to a granularity between sentences and paragraphs, consisting of a collection of sentences within a paragraph that have deep linguistic logical connections. To implement Meta-Chunking, we designed two strategies based on LLMs: Margin Sampling Chunking and Perplexity Chunking. The former employs LLMs to perform binary classification on whether consecutive sentences need to be segmented, making decisions based on the probability difference obtained from margin sampling. The latter precisely identifies text chunk boundaries by analyzing the characteristics of perplexity distribution. Additionally, considering the inherent complexity of different texts, we propose a strategy that combines Meta-Chunking with dynamic merging to achieve a balance between fine-grained and coarse-grained text chunking. Experiments conducted on eleven datasets demonstrate that Meta-Chunking can more efficiently improve the performance of single-hop and multi-hop question answering based on RAG. For instance, on the 2WikiMultihopQA dataset, it outperforms similarity chunking by 1.32 while only consuming 45.8% of the time. Our code is available at this https URL.
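The abstract says Perplexity Chunking finds chunk boundaries from the characteristics of the perplexity distribution but does not give the rule. The following is a minimal sketch under a stated assumption: a boundary is placed after any interior sentence whose perplexity is a local minimum by more than a threshold. The function name `perplexity_chunks`, the `threshold` parameter, and the precomputed per-sentence perplexities are all hypothetical; in practice the perplexities would come from an LLM.

```python
from typing import List


def perplexity_chunks(
    sentences: List[str], ppl: List[float], threshold: float = 0.0
) -> List[List[str]]:
    """Split sentences into chunks at local minima of per-sentence perplexity.

    Assumption (not from the abstract): sentence i closes a chunk when its
    perplexity is lower than both neighbours by more than `threshold`.
    """
    if len(sentences) != len(ppl):
        raise ValueError("sentences and ppl must have the same length")

    chunks: List[List[str]] = []
    current: List[str] = []
    last = len(sentences) - 1
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        is_interior = 0 < i < last
        if (
            is_interior
            and ppl[i - 1] - ppl[i] > threshold
            and ppl[i + 1] - ppl[i] > threshold
        ):
            # Local minimum: close the current chunk after this sentence.
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

Raising `threshold` yields fewer, coarser chunks, which is one simple way to trade off fine-grained against coarse-grained segmentation; the dynamic merging strategy mentioned in the abstract is not modelled here.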

[NLP-2] JudgeBench: A Benchmark for Evaluating LLM-based Judges

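[Quick read]: The paper addresses the fact that existing benchmarks for judges based on large language models (LLMs) do not adequately account for factual and logical correctness on difficult tasks. The key is JudgeBench, a new benchmark that converts existing difficult datasets into challenging response pairs with objective correctness labels, and uses them to evaluate LLM-based judges across knowledge, reasoning, math, and coding. JudgeBench is markedly harder than earlier benchmarks: even strong models (e.g., GPT-4o) perform only slightly better than random guessing, which makes it a reliable platform for assessing increasingly advanced LLM-based judges.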

Link: https://arxiv.org/abs/2410.12784
Authors: Sijun Tan,Siyuan Zhuang,Kyle Montgomery,William Y. Tang,Alejandro Cuadron,Chenguang Wang,Raluca Ada Popa,Ion Stoica
Keywords: LLM-based judges, scalable alternative, judges, LLM-based, human
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint

Abstract:LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge’s alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at this https URL .
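To make the "slightly better than random guessing" claim concrete, here is a minimal sketch of scoring a judge on response pairs with objective labels. The tuple layout, the `judge` callable, and the function name `judge_accuracy` are assumptions for illustration; they are not JudgeBench's actual data format or API.

```python
from typing import Callable, List, Tuple

# Hypothetical pair layout: (question, response_a, response_b, correct_label),
# where correct_label is "A" or "B" and reflects objective correctness.
Pair = Tuple[str, str, str, str]


def judge_accuracy(pairs: List[Pair], judge: Callable[[str, str, str], str]) -> float:
    """Fraction of pairs on which the judge picks the objectively correct response."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1
        for question, resp_a, resp_b, label in pairs
        if judge(question, resp_a, resp_b) == label
    )
    return correct / len(pairs)
```

With balanced labels, a judge that always answers "A" scores 0.5, which is the random-guessing baseline that a useful judge has to beat.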

[NLP-3] In-Context Learning Enables Robot Action Prediction in LLMs

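[Quick read]: The paper addresses how to use the in-context learning (ICL) capabilities of large language models (LLMs) to predict robot actions directly. The key is RoboPrompt, a framework that heuristically identifies keyframes, converts them into textual descriptions, and combines them with a task instruction in a structured ICL demonstration template, so that off-the-shelf text-only LLMs can predict robot actions without any training.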

Link: https://arxiv.org/abs/2410.12782
Authors: Yida Yin,Zekai Wang,Yuvan Sharma,Dantong Niu,Trevor Darrell,Roei Herzig
Keywords: Large Language Models, Large Language, Language Models, achieved remarkable success, directly predict robot
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)
Comments:

Abstract:Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RoboPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes as well as the estimated initial object poses, and both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RoboPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings.
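The pipeline described in the abstract (textual object poses plus keyframe actions, assembled with a task instruction into ICL demonstrations) can be sketched as a prompt builder. The template wording, number formatting, and all function names below are assumptions for illustration, not RoboPrompt's actual template.

```python
from typing import Dict, List, Sequence, Tuple

Pose = Tuple[float, float, float]                      # assumed (x, y, z) object position
Demo = Tuple[Dict[str, Pose], List[Sequence[float]]]   # (initial objects, keyframe actions)


def format_objects(objects: Dict[str, Pose]) -> str:
    """Render estimated initial object poses as text."""
    return "; ".join(
        f"{name}: [{x:.2f}, {y:.2f}, {z:.2f}]" for name, (x, y, z) in objects.items()
    )


def format_actions(actions: Sequence[Sequence[float]]) -> str:
    """Render end-effector actions at keyframes as text."""
    return "; ".join("[" + ", ".join(f"{v:.2f}" for v in a) + "]" for a in actions)


def build_icl_prompt(instruction: str, demos: List[Demo], test_objects: Dict[str, Pose]) -> str:
    """Assemble the task instruction, demonstrations, and the test query into one prompt."""
    lines = [f"Task: {instruction}"]
    for objects, actions in demos:
        lines.append(f"Objects: {format_objects(objects)}")
        lines.append(f"Actions: {format_actions(actions)}")
    # The test episode gives only object poses; the LLM is asked to complete the actions.
    lines.append(f"Objects: {format_objects(test_objects)}")
    lines.append("Actions:")
    return "\n".join(lines)
```

The prompt ends with an open `Actions:` line so that a text-only LLM continues with the predicted action sequence, which would then be parsed back into numbers.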

[NLP-4] Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

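[Quick read]: The paper addresses the problem that pretrained diffusion models (DMs), even after harmful or copyrighted concepts have been unlearned, can relearn those concepts through malicious finetuning. The key is a meta-unlearning framework: the model behaves like an unlearned DM when used as is, but if it undergoes malicious finetuning, the related benign concepts it retains are triggered to self-destruct, which hinders relearning of the unlearned concepts. The method only requires adding an easy-to-implement meta objective, is compatible with most existing unlearning methods, and is validated experimentally on Stable Diffusion models.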

Link: https://arxiv.org/abs/2410.12777
Authors: Hongcheng Gao,Tianyu Pang,Chao Du,Taihang Hu,Zhijie Deng,Min Lin
Keywords: diffusion-based content generation, potential model misuse, prevent potential model, content generation, significant efforts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies. Our code is available at this https URL.

[NLP-5] Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

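[Quick read]: The paper addresses task grouping in multi-task learning: how to group tasks so as to avoid negative transfer and keep multi-task models from performing worse than single-task models. The key is a task-relatedness metric based on task difficulty, measured with pointwise V-usable information (PVI). The paper hypothesizes that tasks whose PVI estimates are not statistically different can benefit from joint learning, and its experiments support this: grouping tasks with similar PVI yields joint learners that are competitive with fewer total parameters across the general, biomedical, and clinical domains.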

Link: https://arxiv.org/abs/2410.12774
Authors: Yingya Li,Timothy Miller,Steven Bethard,Guergana Savova
Keywords: depend heavily, PVI, tasks, task, PVI estimates
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: main paper 12 pages, Appendix 7 pages, 1 figure, 18 tables

Abstract:The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
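For readers unfamiliar with PVI, the sketch below computes it from two probabilities, following the commonly cited definition PVI(x → y) = −log2 g'[∅](y) + log2 g[x](y), where g is a model finetuned with the input and g' is one finetuned with a null input. The function names and the idea of passing in precomputed probabilities are assumptions for illustration; the abstract does not give the formula or the paper's statistical test for comparing tasks.

```python
import math
from typing import Iterable, Tuple


def pvi(p_with_input: float, p_null: float) -> float:
    """Pointwise V-usable information in bits for one example.

    p_with_input: probability of the gold label from the model that sees the input.
    p_null: probability of the gold label from the model trained on a null input.
    """
    if not (0.0 < p_with_input <= 1.0 and 0.0 < p_null <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return -math.log2(p_null) + math.log2(p_with_input)


def mean_pvi(prob_pairs: Iterable[Tuple[float, float]]) -> float:
    """Average PVI over a dataset, a rough per-task difficulty estimate."""
    values = [pvi(p_x, p_null) for p_x, p_null in prob_pairs]
    if not values:
        raise ValueError("prob_pairs must not be empty")
    return sum(values) / len(values)
```

A higher PVI means the input made the gold label easier to predict; a negative value means the model did better without the input. Comparing per-example PVI distributions between tasks would be the starting point for the grouping test the abstract describes.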

[NLP-6] Unitary Multi-Margin BERT for Robust Natural Language Processing

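[Quick read]: The paper addresses adversarial attacks on deep-learning natural language processing (NLP) systems, in particular the lack of computationally efficient defenses. The key is a novel, universal technique that combines unitary weights with the multi-margin loss to substantially improve the robustness of BERT. The combination raises post-attack classification accuracy by 5.3% to 73.8% while keeping pre-attack accuracy competitive, and the tradeoff between pre-attack and post-attack accuracy can be tuned through a single scalar parameter to suit the target application.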

Link: https://arxiv.org/abs/2410.12759
Authors: Hao-Yuan Chang,Kang L. Wang
Keywords: natural language processing, deep learning leave, mission-critical natural language, Bidirectional Encoder Representations, Recent developments
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent developments in adversarial attacks on deep learning leave many mission-critical natural language processing (NLP) systems at risk of exploitation. To address the lack of computationally efficient adversarial defense methods, this paper reports a novel, universal technique that drastically improves the robustness of Bidirectional Encoder Representations from Transformers (BERT) by combining the unitary weights with the multi-margin loss. We discover that the marriage of these two simple ideas amplifies the protection against malicious interference. Our model, the unitary multi-margin BERT (UniBERT), boosts post-attack classification accuracies significantly by 5.3% to 73.8% while maintaining competitive pre-attack accuracies. Furthermore, the pre-attack and post-attack accuracy tradeoff can be adjusted via a single scalar parameter to best fit the design requirements for the target applications.
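The abstract names the multi-margin loss without defining it. Below is a minimal pure-Python sketch of the common definition (the same form as PyTorch's `MultiMarginLoss` with p=1): the average over classes of max(0, margin − score_target + score_other). Whether UniBERT uses exactly this form, and whether `margin` is the single tradeoff parameter the abstract mentions, are assumptions; the unitary-weight constraint is not modelled here.

```python
from typing import Sequence


def multi_margin_loss(logits: Sequence[float], target: int, margin: float = 1.0) -> float:
    """Multi-class hinge loss for one example.

    Penalises every non-target class whose score comes within `margin`
    of the target class score; the sum is divided by the number of classes.
    """
    if not 0 <= target < len(logits):
        raise ValueError("target index out of range")
    total = sum(
        max(0.0, margin - logits[target] + score)
        for i, score in enumerate(logits)
        if i != target
    )
    return total / len(logits)
```

A larger margin forces the target score further above the others before the loss reaches zero, which is the intuition for why such a loss can help against small adversarial perturbations.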

[NLP-7] StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

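[Quick read]: The paper addresses potential content leakage in existing style representations: the contrastive triplets used in training may differ in both style and content, so content information gets mixed into the style embeddings. The key is StyleDistance, which uses a large language model to generate a synthetic dataset of near-exact paraphrases with controlled style variations, producing precise positive and negative examples for contrastive learning across 40 distinct style features. This makes the style embeddings more content-independent, and they outperform leading style representations on real-world benchmarks and in downstream applications.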

Link: https://arxiv.org/abs/2410.12757
Authors: Ajay Patel,Jiacheng Zhu,Justin Qiu,Zachary Horvitz,Marianna Apidianaki,Kathleen McKeown,Chris Callison-Burch
Keywords: similar writing styles, writing styles closely, embed texts, texts with similar, Style representations aim
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Our model can be found at this https URL .
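The contrastive-triplet idea can be illustrated with a standard triplet margin loss: the anchor and positive share a style, and the negative is a paraphrase of the same content in a different style, so only style can separate them. This is a generic sketch; the cosine distance, the `margin` value, and the function names are assumptions, and the abstract does not specify StyleDistance's actual training objective.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / norm


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,
) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(
        0.0,
        cosine_distance(anchor, positive) - cosine_distance(anchor, negative) + margin,
    )
```

Because the negative has the same content as the anchor, an encoder can only drive this loss down by encoding style, which is how controlled paraphrases discourage content leakage.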

[NLP-8] Comparative Analysis of Extrinsic Factors for NER in French

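[Quick read]: The paper addresses the poor performance of named entity recognition (NER) for French caused by limited data. The key is to consider several extrinsic factors together, including model structure, corpus annotation scheme, and data augmentation techniques, and to validate them experimentally. By tuning these factors, the paper raises the NER model's F1 score from 62.41 to 79.39, which suggests that combining multiple techniques is an effective way to improve NER performance when data is limited.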

Link: https://arxiv.org/abs/2410.12750
Authors: Grace Yang,Zhiyi Li,Yandong Liu,Jungyeul Park
Keywords: Named entity recognition, identify structured information, Named entity, entity recognition, replete with complex
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Named entity recognition (NER) is a crucial task that aims to identify structured information, which is often replete with complex, technical terms and a high degree of variability. Accurate and reliable NER can facilitate the extraction and analysis of important information. However, NER for other than English is challenging due to limited data availability, as the high expertise, time, and expenses are required to annotate its data. In this paper, by using the limited data, we explore various factors including model structure, corpus annotation scheme and data augmentation techniques to improve the performance of a NER model for French. Our experiments demonstrate that these approaches can significantly improve the model’s F1 score from original CRF score of 62.41 to 79.39. Our findings suggest that considering different extrinsic factors and combining these techniques is a promising approach for improving NER performance where the size of data is limited.

[NLP-9] CREAM: Consistency Regularized Self-Rewarding Language Models

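[Quick read]: The paper addresses a weakness of self-rewarding large language models (LLMs) that iteratively improve their own alignment: the accuracy of rewarding and ranking is not guaranteed, so bias can accumulate in the reward system and produce unreliable preference data for training. The key is to add regularization through the Consistency Regularized sElf-rewarding lAnguage Model (CREAM), which mitigates overconfident preference labeling in the self-rewarding process. CREAM uses the consistency of rewards across iterations to regularize self-rewarding training, so the model learns from more reliable preference data and improves both reward consistency and alignment performance.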

Link: https://arxiv.org/abs/2410.12735
Authors: Zhaoyang Wang,Weilei He,Zhiyuan Liang,Xuchao Zhang,Chetan Bansal,Ying Wei,Weitong Zhang,Huaxiu Yao
Keywords: Recent self-rewarding large, Recent self-rewarding, large language models, preference data, successfully applied
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Recent self-rewarding large language models (LLM) have successfully applied LLM-as-a-Judge to iteratively improve the alignment performance without the need of human annotations for preference data. These methods commonly utilize the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment technologies (e.g. DPO). However, it is noteworthy that throughout this process, there is no guarantee of accuracy in the rewarding and ranking, which is critical for ensuring accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze the generalized iterative preference fine-tuning framework for self-rewarding language model. We then introduce the regularization to this generalized framework to mitigate the overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the rewarding consistency across different iterations to regularize the self-rewarding training, helping the model to learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at this https URL.
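One way to quantify "rewarding consistency across different iterations" is a rank correlation between the scores that two successive model iterations assign to the same responses. The sketch below uses Kendall's tau (the tau-a variant) as an assumed proxy; the abstract does not state which consistency measure CREAM uses or how it enters the loss, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(scores_prev: Sequence[float], scores_curr: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists over the same responses.

    Returns 1.0 when both iterations rank every pair the same way,
    -1.0 when every pair is ranked in the opposite order.
    """
    n = len(scores_prev)
    if n != len(scores_curr) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_prev[i] - scores_prev[j]) * (scores_curr[i] - scores_curr[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A low tau signals that the two iterations disagree on the ranking, so preference pairs built from it are less trustworthy; a consistency-regularized scheme could, for example, down-weight such pairs during training.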

[NLP-10] WorldMedQA-V: a multilingual multimodal medical examination dataset for multimodal language models evaluation

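[Quick read]: The paper addresses the limitations of existing evaluation datasets for multimodal/vision language models (VLMs) in healthcare, which are mostly text-only and cover only a limited set of languages and countries. The key is WorldMedQA-V, a multilingual, multimodal benchmark dataset of 568 multiple-choice questions paired with medical images, covering the original languages of four countries (Brazil, Israel, Japan, and Spain) together with validated English translations. By evaluating models in both the local language and English, and with and without images, WorldMedQA-V aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.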

Link: https://arxiv.org/abs/2410.12722
Authors: João Matos,Shan Chen,Siena Placino,Yingya Li,Juan Carlos Climent Pardo,Daphna Idan,Takeshi Tohyama,David Restrepo,Luis F. Nakayama,Jose M. M. Pascual-Leone,Guergana Savova,Hugo Aerts,Leo A. Celi,A. Ian Wong,Danielle S. Bitterman,Jack Gallifant
Keywords: healthcare settings worldwide, necessitating robust benchmarks, vision language models, settings worldwide, necessitating robust
Subjects: Computation and Language (cs.CL)
Comments: submitted for review, total of 14 pages

Abstract:Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), covering original languages and validated English translations by native clinicians, respectively. Baseline performance for common open- and closed-source models are provided in the local language and English translations, and with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.

[NLP-11] WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

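[Quick read]: The paper addresses the limitations of vision language models (VLMs) in handling culture-specific knowledge, especially in languages other than English and in underrepresented cultural contexts. The key is WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. It includes a visual question answering (VQA) dataset spanning 30 languages and dialects across 9 language families, with over 1 million data points. Through tasks that identify dish names and their origins, the benchmark shows that models perform better when given the correct location context, but struggle with adversarial contexts and with predicting specific regional cuisines and languages. The paper also releases a knowledge base of annotated food entries and images to support future research.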

Link: https://arxiv.org/abs/2410.12705
Authors: Genta Indra Winata,Frederikus Hudi,Patrick Amadeus Irawan,David Anugraha,Rifki Afina Putri,Yutong Wang,Adam Nohejl,Ubaidillah Ariq Prathama,Nedjma Ousidhoum,Afifa Amriani,Anar Rzayev,Anirban Das,Ashmari Pramodya,Aulia Adila,Bryan Wilie,Candy Olivia Mawalim,Ching Lam Cheng,Daud Abolade,Emmanuele Chersoni,Enrico Santus,Fariz Ikhwantri,Garry Kuwanto,Hanyang Zhao,Haryo Akbarianto Wibowo,Holy Lovenia,Jan Christian Blaise Cruz,Jan Wira Gotama Putra,Junho Myung,Lucky Susanto,Maria Angelica Riera Machin,Marina Zhukova,Michael Anugraha,Muhammad Farid Adilazuarda,Natasha Santosa,Peerat Limkonchotiwat,Raj Dabre,Rio Alexander Audino,Samuel Cahyawijaya,Shi-Xiong Zhang,Stephanie Yulia Salim,Yi Zhou,Yinxuan Gui,David Ifeoluwa Adelani,En-Shiun Annie Lee,Shogo Okada,Ayu Purwarianti,Alham Fikri Aji,Taro Watanabe,Derry Tanti Wijaya,Alice Oh,Chong-Wah Ngo
Keywords: Vision Language Models, underrepresented cultural contexts, Vision Language, Language Models, underrepresented cultural
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

[NLP-12] Sarcasm Detection in a Less-Resourced Language

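[Quick read]: The paper addresses sarcasm detection in a less-resourced language, Slovenian. The key is to use machine translation and a large generative language model to build translated datasets, and to train pretrained transformer models of different sizes to explore how feasible sarcasm detection is. The results show that larger models generally perform better and that ensembling gives a slight improvement; the best ensemble reaches an F1-score of 0.765, close to annotator agreement in the source language.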

Link: https://arxiv.org/abs/2410.12704
Authors: Lazar Đoković,Marko Robnik-Šikonja
Keywords: natural language processing, sarcasm, sarcasm detection, Abstract, detection
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 4 pages, published in the Slovenian Conference on Artificial Intelligence

Abstract:The sarcasm detection task in natural language processing tries to classify whether an utterance is sarcastic or not. It is related to sentiment analysis since it often inverts surface sentiment. Because sarcastic sentences are highly dependent on context, and they are often accompanied by various non-verbal cues, the task is challenging. Most of related work focuses on high-resourced languages like English. To build a sarcasm detection dataset for a less-resourced language, such as Slovenian, we leverage two modern techniques: a machine translation specific medium-size transformer model, and a very large generative language model. We explore the viability of translated datasets and how the size of a pretrained transformer affects its ability to detect sarcasm. We train ensembles of detection models and evaluate models’ performance. The results show that larger models generally outperform smaller ones and that ensembling can slightly improve sarcasm detection performance. Our best ensemble approach achieves an F1-score of 0.765 which is close to annotators’ agreement in the source language.

[NLP-13] VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

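[Quick read]: The paper addresses the particular challenges of applying vision language models (VLMs) in medicine: reliance on a single method of visual grounding, limited handling of anything beyond 2D images, and the scarcity of medical data. The key is VividMed, a model with versatile visual grounding that can generate both semantic segmentation masks and instance-level bounding boxes and handles multiple imaging modalities, including 2D and 3D data. With a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models, VividMed performs well on visual grounding tasks and also on downstream tasks such as visual question answering (VQA) and report generation.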

Link: https://arxiv.org/abs/2410.12694
Authors: Lingxiao Luo,Bingda Tang,Xuanzhong Chen,Rong Han,Ting Chen
Keywords: visually grounded responses, demonstrated remarkable promise, Recent advancements, generating visually grounded, grounded responses
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at this https URL.

[NLP-14] Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce

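[Quick Read]: This paper addresses data scarcity and the ethics of annotation practices for mid- to low-resource languages in NLP. The key to the solution is collecting and analyzing feedback from people directly involved in, and affected by, NLP artefacts. This surfaces the main issues in data quality and annotation practice, in particular the linguistic and cultural suitability of data and the misuse of online community services. Based on these findings, the paper makes recommendations for creating high-quality language artefacts that reflect the cultural background of a language's speakers while respecting the dignity and labor of data workers.

Link: https://arxiv.org/abs/2410.12691
Authors: Nedjma Ousidhoum, Meriem Beloucif, Saif M. Mohammad
Keywords: affects people lives, symbolic capital, capital that affects, affects people, people lives
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: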

Abstract:Language is a symbolic capital that affects people’s lives in many ways (Bourdieu, 1977, 1991). It is a powerful tool that accounts for identities, cultures, traditions, and societies in general. Hence, data in a given language should be viewed as more than a collection of tokens. Good data collection and labeling practices are key to building more human-centered and socially aware technologies. While there has been a rising interest in mid- to low-resource languages within the NLP community, work in this space has to overcome unique challenges such as data scarcity and access to suitable annotators. In this paper, we collect feedback from those directly involved in and impacted by NLP artefacts for mid- to low-resource languages. We conduct a quantitative and qualitative analysis of the responses and highlight the main issues related to (1) data quality such as linguistic and cultural data suitability; and (2) the ethics of common annotation practices such as the misuse of online community services. Based on these findings, we make several recommendations for the creation of high-quality language artefacts that reflect the cultural milieu of its speakers, while simultaneously respecting the dignity and labor of data workers.

[NLP-15] Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

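[Quick Read]: This paper addresses the lack of a safety mechanism for visual input in large vision-language models (LVLMs). Existing methods fail to transfer the text safety mechanism to visual input, which leaves models vulnerable to toxic images. The key to the solution is a new Text-Guided vision-language Alignment method (TGA), which retrieves text related to the visual input and uses it to guide the projection of visual information into the hidden-state space. Visual input then has a semantic representation consistent with text at the hidden-state level and therefore activates the safety mechanism. Experiments show that TGA transfers the text safety mechanism to visual input while preserving general performance on a range of vision tasks.

Link: https://arxiv.org/abs/2410.12662
Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng
Keywords: Large Vision-Language Models, Vision-language alignment, safety mechanism, Vision-Language Models, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: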

Abstract:Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision, which leads to vulnerabilities in toxic image. To explore the cause of this problem, we give the insightful explanation of where and how the safety mechanism of LVLMs operates and conduct comparative analysis between text and vision. We find that the hidden states at the specific transformer layers play a crucial role in the successful activation of safety mechanism, while the vision-language alignment at hidden states level in current methods is insufficient. This results in a semantic shift for input images compared to text in hidden states, therefore misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves the texts related to input vision and uses them to guide the projection of vision into the hidden states space in LLMs. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality but also maintains the general performance on various vision tasks (Safe and Good).

[NLP-16] Evaluating Morphological Compositional Generalization in Large Language Models

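[Quick Read]: This paper examines the morphological compositional generalization of large language models (LLMs), that is, whether these models show the compositionality and creativity in language use that humans do. The key to the approach is defining morphemes as compositional primitives and designing a new suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, the study evaluates several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. The results show that LLMs generalize poorly when handling novel word roots and that performance drops sharply as morphological complexity increases. Models identify individual morphological combinations better than chance, but their behavior lacks systematicity, leaving a significant accuracy gap relative to humans.

Link: https://arxiv.org/abs/2410.12656
Authors: Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwan Dhingra, Antoine Bosselut, Lonneke van der Plas, Duygu Ataman
Keywords: Large language models, natural language generation, Large language, demonstrated significant progress, generation and understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 33 pages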

Abstract:Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.

[NLP-17] From Measurement Instruments to Training Data: Leveraging Theory-Driven Synthetic Training Data for Measuring Social Constructs

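[Quick Read]: This paper addresses the difficulty of classifying multi-dimensional social constructs in computational text classification, and in particular how synthetic training data can improve classification. The key to the solution is using established knowledge from social science measurement instruments, such as survey scales or annotation codebooks, to generate theory-driven synthetic data. Through two studies (measuring sexism and political topics), the paper assesses the added value of synthetic training data for fine-tuning text classification models. Although results in the sexism study were less promising, in political topic classification synthetic data substantially reduced the need for labeled data with only a small drop in performance. Theory-driven synthetic data also clearly outperformed data generated without conceptual information.

Link: https://arxiv.org/abs/2410.12622
Authors: Lukas Birkenmaier, Matthias Roth, Indira Sen
Keywords: Computational text classification, Computational text, synthetic training data, synthetic training, synthetic data
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: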

Abstract:Computational text classification is a challenging task, especially for multi-dimensional social constructs. Recently, there has been increasing discussion that synthetic training data could enhance classification by offering examples of how these constructs are represented in texts. In this paper, we systematically examine the potential of theory-driven synthetic training data for improving the measurement of social constructs. In particular, we explore how researchers can transfer established knowledge from measurement instruments in the social sciences, such as survey scales or annotation codebooks, into theory-driven generation of synthetic data. Using two studies on measuring sexism and political topics, we assess the added value of synthetic training data for fine-tuning text classification models. Although the results of the sexism study were less promising, our findings demonstrate that synthetic data can be highly effective in reducing the need for labeled data in political topic classification. With only a minimal drop in performance, synthetic data allows for substituting large amounts of labeled data. Furthermore, theory-driven synthetic data performed markedly better than data generated without conceptual information in mind.

[NLP-18] Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety Toxicity and Legal Reasoning

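[Quick Read]: This paper addresses how to evaluate and align superhuman large language models (LLMs) on practical alignment tasks such as safety, toxicity, and legal reasoning. The key to the solution is extending the weak-to-strong generalization phenomenon to practical alignment tasks. Experiments demonstrate that it holds broadly across complex alignment tasks, and the paper explores effective strategies for improving alignment performance and thereby the quality of model outputs.

Link: https://arxiv.org/abs/2410.12621
Authors: Ruimeng Ye, Yang Xiao, Bo Hui
Keywords: large language models, continue to advance, increasingly critical, large language, alignment
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: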

Abstract:As large language models (LLMs) continue to advance, ensuring their alignment with human values becomes increasingly critical. Traditional alignment methods heavily rely on human feedback to fine-tune models. With the emergence of superhuman models whose outputs may surpass human understanding, evaluating and aligning these models using human judgments poses significant challenges. To address the challenges, recent works use weak supervisors to elicit knowledge from much stronger models. However, there are important disanalogies between the empirical setup in the existing works and the genuine goal of alignment. We remark that existing works investigate the phenomenon of weak-to-strong generation in analogous setup (i.e., binary classification), rather than practical alignment-relevant tasks (e.g., safety). In this paper, we bridge this gap by extending weak-to-strong generation to the context of practical alignment. We empirically demonstrate the widespread phenomenon of weak-to-strong generation in three complicated alignment tasks: safety, toxicity, and legal reasoning. Furthermore, we explore efficient strategies for improving alignment performance to enhance the quality of model outcomes. Lastly, we summarize and analyze the challenges and potential solutions in regard to specific alignment tasks, which we hope to catalyze the research progress on the topic of weak-to-strong generalization. Our code is released at this https URL.

[NLP-19] Parsing Akkadian Verbs with Prolog ACL-02

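[Quick Read]: This paper addresses the parsing and generation of finite verbal forms in Akkadian, in particular handling the D, N, and G stems together with accusative, dative, and ventive suffixes. The key to the solution is a parsing/generation system implemented in Prolog that can interpret and generate these complex verb forms and their suffix combinations.

Link: https://arxiv.org/abs/2410.12617
Authors: Aaron Macks
Keywords: finite verbal forms, implemented in Prolog, forms in Akkadian, describes a parsing, generation system
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 9 figures, presented at ACL-02 the Association of Computational Linguistics, 2002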

Abstract:This paper describes a parsing/generation system for finite verbal forms in Akkadian, with the possible addition of suffixes, implemented in Prolog. The work described provides the framework and engine to interpret the D, N, and G stems along with accusative, dative and ventive endings.

[NLP-20] Exploring Model Kinship for Merging Large Language Models

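[Quick Read]: This paper addresses how to choose suitable candidate models when merging large language models (LLMs) so as to maximize the performance gain. The key to the solution is the notion of model kinship, the degree of similarity or relatedness between models, analogous to kinship in biological evolution. Through comprehensive empirical analysis, the paper finds a relationship between model kinship and the performance gain after merging, and on that basis proposes a new merging strategy, Top-k Greedy Merging with Model Kinship. Using model kinship as a selection criterion helps avoid local-optimum traps during model evolution and yields better performance on benchmark datasets.

Link: https://arxiv.org/abs/2410.12613
Authors: Yedi Hu, Yunzhi Yao, Ningyu Zhang, Shumin Deng, Huajun Chen
Keywords: Large Language Models, Large Language, efficiency of Large, Language Models, model kinship
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Ongoing work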

Abstract:Model merging has become one of the key technologies for enhancing the capabilities and efficiency of Large Language Models (LLMs). However, our understanding of the expected performance gains and principles when merging any two models remains limited. In this work, we introduce model kinship, the degree of similarity or relatedness between LLMs, analogous to biological evolution. With comprehensive empirical analysis, we find that there is a certain relationship between model kinship and the performance gains after model merging, which can help guide our selection of candidate models. Inspired by this, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets. Specifically, we discover that using model kinship as a criterion can assist us in continuously performing model merging, alleviating the degradation (local optima) in model evolution, whereas model kinship can serve as a guide to escape these traps. Code is available at this https URL.
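
To make the idea of model kinship concrete, here is a minimal Python sketch. It assumes, purely for illustration, that kinship is measured as the cosine similarity between two models' weight changes relative to a shared base model, and that a model is a flat list of floats. The abstract does not specify the metric, and `kinship`, `rank_by_kinship`, and `average_merge` are hypothetical names. Plain weight averaging stands in for whatever merge operator is actually used.

```python
import math

def kinship(model_a, model_b, base):
    """Cosine similarity between the two models' weight deltas from `base`.

    Assumption: this metric is illustrative, not the paper's definition.
    """
    da = [w - b for w, b in zip(model_a, base)]
    db = [w - b for w, b in zip(model_b, base)]
    dot = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return dot / norm if norm else 0.0

def rank_by_kinship(anchor, candidates, base):
    """Sort candidate models from least to most related to `anchor`."""
    return sorted(candidates, key=lambda m: kinship(anchor, m, base))

def average_merge(model_a, model_b):
    """Element-wise weight averaging, a stand-in merge operator."""
    return [(x + y) / 2 for x, y in zip(model_a, model_b)]

# Toy two-parameter "models" (hypothetical values).
base = [0.0, 0.0]
best = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0]]

ranked = rank_by_kinship(best, candidates, base)
# Picking ranked[0] here is arbitrary; it only shows the plumbing.
merged = average_merge(best, ranked[0])
```

The sketch only exposes the ranking. Whether a merging step should favor closely or distantly related candidates is the question the paper itself studies.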

[NLP-21] Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning

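[Quick Read]: This paper addresses the calculation and semantic understanding errors that large language models (LLMs) often make in intermediate reasoning steps when solving mathematical reasoning problems. The key to the solution is PROVE, a framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before the final answers are aggregated. Rather than relying on plain majority voting, PROVE rejects solutions whose corresponding program outputs are inconsistent with the generated solution and aggregates only those validated by Python programs. Experiments show that PROVE significantly improves the accuracy of open-source LLMs of different sizes and families across multiple math benchmarks.

Link: https://arxiv.org/abs/2410.12608
Authors: Vernon Y.H. Toh, Deepanway Ghosal, Soujanya Poria
Keywords: Large language models, shown increasing proficiency, Large language, mathematical reasoning problems, shown increasing
Categories: Computation and Language (cs.CL)
Comments: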

Abstract:Large language models (LLMs) have shown increasing proficiency in solving mathematical reasoning problems. However, many current open-source LLMs often still make calculation and semantic understanding errors in their intermediate reasoning steps. In this work, we propose PROVE, a simple yet effective framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before aggregating the final answers. Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution, aggregating only those validated by Python programs. We conducted extensive experiments on 13 open-source LLMs from various model families and sizes, ranging from 0.5B to 13B parameters, across seven math benchmarks. We demonstrate that PROVE consistently outperforms vanilla majority voting as a heuristic for solving mathematical reasoning tasks across all datasets and model sizes. Notably, PROVE increases accuracy on the GSM8K benchmark from 48.85% to 53.83% for Qwen2-0.5B-Instruct, from 65.66% to 73.01% for Llama-3.2-1B-Instruct, from 73.39% to 79.61% for Gemma-2-2b-it, and from 41.32% to 59.51% for Llama-2-7B-chat. Our codes are available at this https URL.
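
The difference between plain majority voting and program-verified voting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each sample is assumed to be a pair of (answer from the natural-language solution, answer printed by the corresponding program), with `None` when the program failed to run. The fallback to plain voting when no sample passes the check is also an assumption, and `verified_vote` is a hypothetical name.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None if the list is empty."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

def verified_vote(samples):
    """Vote only over samples whose program output matches the solution.

    `samples` is a list of (solution_answer, program_answer) pairs.
    Assumption: if no sample passes the check, fall back to plain voting.
    """
    kept = [sol for sol, prog in samples if prog is not None and sol == prog]
    if kept:
        return majority_vote(kept)
    return majority_vote([sol for sol, _ in samples])

# Five sampled reasoning paths for one problem (hypothetical values).
samples = [("12", "15"), ("12", None), ("15", "15"), ("15", "15"), ("12", "11")]

plain = majority_vote([sol for sol, _ in samples])  # three votes for "12"
verified = verified_vote(samples)                   # only the two "15" paths pass
```

In this toy case plain voting picks "12", while the filter keeps only the two paths whose program output agrees with the written solution, so the verified vote returns "15".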

[NLP-22] CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization

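[Quick Read]: This paper addresses multi-attribute control in scientific document summarization, that is, how to control several attributes such as length and empirical focus at the same time during summary generation. The key to the solution is CCSBench, a benchmark that allows fine-grained control over both explicit attributes (such as length) and implicit attributes (such as empirical focus). Extensive experiments on large language models including GPT-4 and LLaMA2 reveal significant limitations in balancing these control attributes, especially implicit ones that require deeper understanding and abstract reasoning.

Link: https://arxiv.org/abs/2410.12601
Authors: Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan
Keywords: diverse audiences, simultaneously control multiple, broaden the dissemination, knowledge to diverse, scientific document summarization
Categories: Computation and Language (cs.CL)
Comments: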

Abstract:To broaden the dissemination of scientific knowledge to diverse audiences, scientific document summarization must simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, a benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., empirical focus), which are more subjective and conceptual. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our findings reveal significant limitations in large language models’ ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.

[NLP-23] On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs

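[Quick Read]: This paper addresses the evidence pollution that large language models (LLMs) may introduce into malicious social text detection. The key to the solution is three defense strategies: machine-generated text detection, a mixture of experts, and parameter updating. These strategies aim to mitigate evidence pollution from both the data side and the model side, although in practice they face limitations such as the need for annotated data and high inference costs.

Link: https://arxiv.org/abs/2410.12600
Authors: Herun Wan, Minnan Luo, Zhixiong Su, Guang Dai, Xiang Zhao
Keywords: Evidence-enhanced detectors present, Evidence-enhanced detectors, present remarkable abilities, remarkable abilities, abilities in identifying
Categories: Computation and Language (cs.CL)
Comments: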

Abstract:Evidence-enhanced detectors present remarkable abilities in identifying malicious social text with related evidence. However, the rise of large language models (LLMs) brings potential risks of evidence pollution to confuse detectors. This paper explores how to manipulate evidence, simulating potential misuse scenarios including basic pollution, and rephrasing or generating evidence by LLMs. To mitigate its negative impact, we propose three defense strategies from both the data and model sides, including machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets present that evidence pollution, especially the generate strategy, significantly compromises existing detectors. On the other hand, the defense strategies could mitigate evidence pollution, but they faced limitations for practical employment, such as the need for annotated data and huge inference costs. Further analysis illustrates that polluted evidence is of high quality, would compromise the model calibration, and could ensemble to amplify the negative impact.

[NLP-24] Can We Reverse In-Context Knowledge Edits?

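[Quick Read]: This paper addresses how to detect and reverse in-context knowledge edits (IKE) to large language model (LLM) outputs, in order to guard against malicious manipulation and tampering with information. The solution has two key parts. First, IKE edits are detected with high accuracy (F1 80%) in a black-box setting (for example, proprietary LLMs) using only the top-10 output probabilities of the next token. Second, specially tuned reversal tokens are introduced to reverse IKE edits; both continuous and discrete reversal tokens are explored, recovering the original, unedited outputs with over 80% accuracy. Continuous reversal tokens are particularly effective and have minimal impact on unedited prompts. Analysis of output distributions, attention patterns, and token rankings sheds light on how IKE affects LLMs and how reversal tokens counteract it, strengthening LLM resilience against potential misuse and improving transparency and trustworthiness.

Link: https://arxiv.org/abs/2410.12586
Authors: Paul Youssef, Zhixue Zhao, Jörg Schlötterer, Christin Seifert
Keywords: enables efficient modification, large language model, enables efficient, language model, efficient modification
Categories: Computation and Language (cs.CL)
Comments: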

Abstract:In-context knowledge editing (IKE) enables efficient modification of large language model (LLM) outputs without parameter changes and at zero-cost. However, it can be misused to manipulate responses opaquely, e.g., insert misinformation or offensive content. Such malicious interventions could be incorporated into high-level wrapped APIs where the final input prompt is not shown to end-users. To address this issue, we investigate the detection and reversal of IKE-edits. First, we demonstrate that IKE-edits can be detected with high accuracy (F1 80%) using only the top-10 output probabilities of the next token, even in a black-box setting, e.g. proprietary LLMs with limited output information. Further, we introduce the novel task of reversing IKE-edits using specially tuned reversal tokens. We explore using both continuous and discrete reversal tokens, achieving over 80% accuracy in recovering original, unedited outputs across multiple LLMs. Our continuous reversal tokens prove particularly effective, with minimal impact on unedited prompts. Through analysis of output distributions, attention patterns, and token rankings, we provide insights into IKE’s effects on LLMs and how reversal tokens mitigate them. This work represents a significant step towards enhancing LLM resilience against potential misuse of in-context editing, improving their transparency and trustworthiness.

[NLP-25] STRUX: An LLM for Decision-Making with Structured Explanations NAACL2025

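[Quick Read]: This paper addresses how to make decision-making by large language models (LLMs) more transparent and explainable. The key to the solution is a new framework, STRUX, which strengthens LLM decision-making by providing structured explanations. STRUX first distills a large amount of information into a concise table of key facts. It then uses a series of self-reflection steps to determine how these facts bear on the decision and classifies them as favorable or adverse. Finally, an LLM is fine-tuned to identify and prioritize these key facts so as to optimize the decision. On the task of forecasting stock investment decisions, STRUX outperforms existing methods and markedly improves decision transparency.

Link: https://arxiv.org/abs/2410.12583
Authors: Yiming Lu, Yebowen Hu, Hassan Foroosh, Wei Jin, Fei Liu
Keywords: Countless decisions shape, daily lives, Countless decisions, shape our daily, Countless
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, submitted to NAACL 2025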

Abstract:Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.
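
The structured explanation described above, a table of key facts tagged as favorable or adverse with a strength, maps naturally onto a small data structure. The Python sketch below is only an illustration: the `Fact` class, the 1 to 3 strength scale, the example facts, and the signed-sum `net_score` are all assumptions. In the paper the final decision comes from a fine-tuned LLM, for which the signed sum is merely a stand-in.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    stance: str    # "favorable" or "adverse"
    strength: int  # assumed scale: 1 (weak) to 3 (strong)

def net_score(facts):
    """Signed sum of strengths; a toy stand-in for the fine-tuned LLM."""
    return sum(f.strength if f.stance == "favorable" else -f.strength
               for f in facts)

# Hypothetical fact table distilled from an earnings call transcript.
facts = [
    Fact("Revenue beat guidance", "favorable", 3),
    Fact("Margins compressed", "adverse", 2),
    Fact("Raised full-year outlook", "favorable", 1),
]

score = net_score(facts)  # 3 - 2 + 1
```

Keeping the facts in an explicit table is what makes the decision inspectable: a reader can see which facts pushed the outcome in which direction, and by how much.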

[NLP-26] A Claim Decomposition Benchmark for Long-form Answer Verification

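[Quick Read]: This paper addresses hallucination, the generation of non-factual content, by large language models (LLMs) in complex long-form question answering. The key to the solution is a new claim decomposition benchmark, the Chinese Atomic Claim Decomposition Dataset (CACDD), whose quality is ensured through expert annotation and which is intended to help systems identify atomic, checkworthy claims in LLM responses. The paper proposes a new pipeline for human annotation and reports results for zero-shot, few-shot, and fine-tuned LLMs, showing that claim decomposition is highly challenging and needs further exploration.

Link: https://arxiv.org/abs/2410.12558
Authors: Zhihao Zhang, Yixing Fan, Ruqing Zhang, Jiafeng Guo
Keywords: complex long-form question, long-form question answering, question answering tasks, significantly boosted, boosted the performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by CCIR 2024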

Abstract:The advancement of LLMs has significantly boosted the performance of complex long-form question answering tasks. However, one prominent issue of LLMs is the generated “hallucination” responses that are not factual. Consequently, attribution for each claim in responses becomes a common solution to improve the factuality and verifiability. Existing researches mainly focus on how to provide accurate citations for the response, which largely overlook the importance of identifying the claims or statements for each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building system that can identify atomic and checkworthy claims for LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. The CACDD encompasses a collection of 500 human-annotated question-answer pairs, including a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experiment results on zero-shot, few-shot and fine-tuned LLMs as baselines. The results show that the claim decomposition is highly challenging and requires further explorations. All code and data are publicly available at this https URL.

[NLP-27] LLM-based Translation Inference with Iterative Bilingual Understanding

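[Quick Read]: This paper addresses the drop in translation quality that occurs when large language models (LLMs) misunderstand the source sentence. The key to the solution is a new method called Iterative Bilingual Understanding Translation (IBUT), which draws on the bilingual capabilities of LLMs and the dual characteristics of the translation task. It generates contextual understanding in both the source and target languages, produces cross-lingual feedback, and iteratively refines the contextual understanding, reducing errors and improving translation performance. Experiments show that IBUT outperforms the comparison methods on benchmarks from multiple domains (such as news, commonsense, and cultural translation).

Link: https://arxiv.org/abs/2410.12543
Authors: Andong Chen, Kehai Chen, Yang Xiang, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min zhang
Keywords: greatly improved translation, large language models, improved translation performance, Iterative Bilingual Understanding, greatly improved
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in process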

Abstract:The remarkable understanding and generation capabilities of large language models (LLMs) have greatly improved translation performance. However, incorrect understanding of the sentence to be translated can degrade translation quality. To address this issue, we proposed a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of LLMs and the dual characteristics of translation tasks. The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. Furthermore, the dual characteristics allow IBUT to generate effective cross-lingual feedback, iteratively refining contextual understanding, thereby reducing errors and improving translation performance. Experimental results showed that the proposed IBUT outperforms several strong comparison methods, especially being generalized to multiple domains (e.g., news, commonsense, and cultural translation benchmarks).

[NLP-28] MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration

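[Quick Read]: This paper addresses the lack of personalized recommendations and diagnostic analysis when large language models (LLMs) are applied in medicine, which leads to hallucinations and performance bottlenecks. The key to the solution is the MedAide framework, which rewrites queries through retrieval-augmented generation to understand medical intent accurately, and uses a contextual encoder to obtain intent prototype embeddings so that fine-grained intents can be recognized by similarity matching. According to intent relevance, the activated agents collaborate to provide integrated decision analysis. Experiments show that MedAide handles complex medical intents better than current LLMs and improves their medical proficiency and strategic reasoning.

Link: https://arxiv.org/abs/2410.12532
Authors: Jinjie Wei, Dingkang Yang, Yanshu Li, Qingyao Xu, Zhaoyu Chen, Mingcheng Li, Yue Jiang, Xiaolu Hou, Lihua Zhang
Keywords: Large Language Model, Large Language, Language Model, driven interactive systems, show potential promise
Categories: Computation and Language (cs.CL)
Comments: Under review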

Abstract:Large Language Model (LLM)-driven interactive systems currently show potential promise in healthcare domains. Despite their remarkable capabilities, LLMs typically lack personalized recommendations and diagnosis analysis in sophisticated medical applications, causing hallucinations and performance bottlenecks. To address these challenges, this paper proposes MedAide, an LLM-based omni medical multi-agent collaboration framework for specialized healthcare services. Specifically, MedAide first performs query rewriting through retrieval-augmented generation to accomplish accurate medical intent understanding. Immediately, we devise a contextual encoder to obtain intent prototype embeddings, which are used to recognize fine-grained intents by similarity matching. According to the intent relevance, the activated agents collaborate effectively to provide integrated decision analysis. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.
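
Recognizing fine-grained intents by similarity matching against prototype embeddings can be sketched as follows. This is a minimal Python illustration, not MedAide's code: the three-dimensional vectors, the intent names, the 0.5 threshold, and the function `match_intents` are all hypothetical, and cosine similarity is assumed as the matching measure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_intents(query_emb, prototypes, threshold=0.5):
    """Return intents whose prototype is similar enough, best match first."""
    scored = [(cosine(query_emb, proto), name) for name, proto in prototypes.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

# Hypothetical intent prototypes and an encoded user query.
prototypes = {
    "diagnosis": [1.0, 0.0, 0.0],
    "medication": [0.0, 1.0, 0.0],
    "follow_up": [0.0, 0.0, 1.0],
}
query = [0.8, 0.6, 0.0]

active = match_intents(query, prototypes)
```

Here the query is close to both the diagnosis and medication prototypes, so both intents are returned and, in a multi-agent setup, both corresponding agents would be activated.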

[NLP-29] FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction ICLR2025

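Quick read: This paper addresses the high computational latency of auto-regressive large language model (LLM) inference in resource-constrained environments such as mobile and edge devices. The key to the solution is FIRST, an algorithm that uses layer-specific routers to select a subset of Transformer layers adaptively for each input sequence, which reduces inference latency. FIRST stays compatible with KV caching and adds LoRA adapters for fine-tuning, which improves task-specific accuracy while keeping the latency benefit. The core of the method is input adaptivity: the participation of middle layers is adjusted per task and per input sequence, lowering latency while retaining model performance.

Link: https://arxiv.org/abs/2410.12513
Authors: Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal
Keywords (EN): Auto-regressive Large Language, Large Language Models, Auto-regressive Large, Language Models, Large Language
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 6 figures, Submitted to ICLR 2025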

Abstract:Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource constrained environments like mobile and edge devices. Existing approaches in literature that aim to improve latency via skipping layers have two distinct flavors - 1) Early exit 2) Input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both the above strategies have limitations - the former cannot be applied to handle KV Caching necessary for speed-ups in modern frameworks and the latter does not capture the variation in layer importance across tasks or more generally, across input sequences. To address both limitations, we propose FIRST, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence - the prompt (during prefill stage) decides which layers will be skipped during decoding. FIRST preserves compatibility with KV caching enabling faster inference while being quality-aware. FIRST is model-agnostic and can be easily enabled on any pre-trained LLM. We further improve performance by incorporating LoRA adapters for fine-tuning on external datasets, enhancing task-specific accuracy while maintaining latency benefits. Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on task. Extensive experiments show that FIRST significantly reduces latency while retaining competitive performance (as compared to baselines), making our approach an efficient solution for LLM deployment in low-resource environments.
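
The idea that the prompt decides once, at prefill, which layers are skipped during decoding can be sketched as follows. This is a toy illustration only: the layers are plain functions on numbers, and the routers, the `prompt_features` dictionary, and the 0.5 threshold are hypothetical stand-ins, since the abstract does not say how the routers are parameterized or trained.

```python
def decide_layers(prompt_features, routers, threshold=0.5):
    """Run each layer's router once on the prompt and return a keep/skip mask."""
    return [router(prompt_features) >= threshold for router in routers]


def forward(x, layers, keep_mask):
    """Apply only the layers the mask keeps; a skipped layer acts as identity."""
    for layer, keep in zip(layers, keep_mask):
        if keep:
            x = layer(x)
    return x


# Toy "layers" and hypothetical routers (one router per layer).
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
routers = [
    lambda f: 0.9,                                 # always keep layer 0
    lambda f: 0.2 if f["length"] < 10 else 0.8,    # skip layer 1 for short prompts
    lambda f: 0.7,                                 # always keep layer 2
]

# The mask is computed once from the prompt and reused for every decoded token.
mask = decide_layers({"length": 4}, routers)
outputs = [forward(token, layers, mask) for token in (1, 2, 3)]
```

Because the mask is fixed after prefill in this sketch, every decoded token passes through the same set of layers, which is one plausible reading of why such a scheme can remain compatible with KV caching.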

[NLP-30] Advancing Fairness in Natural Language Processing: From Traditional Methods to Explainability

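Quick read: This paper addresses fairness and transparency in natural language processing (NLP) systems, in particular how to ensure that these systems treat different populations fairly and without bias. The key to the solution is developing and evaluating methods that reduce bias effectively while improving model explainability. The thesis proposes an innovative algorithm to mitigate bias in multi-class classifiers, analyzes the effect of dataset size on bias, and introduces COCKATIEL, a model-agnostic explainability method that identifies and ranks concepts in Transformer models, and TaCo, a method to neutralize bias in Transformer model embeddings, with the aim of making NLP systems fairer and more responsible.

Link: https://arxiv.org/abs/2410.12511
Author: Fanny Jourdan
Keywords (EN): Natural Language Processing, Language Processing, Natural Language, field of Natural, NLP
Categories: Computation and Language (cs.CL)
Comments: PhD Thesis, Toulouse University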

Abstract:The burgeoning field of Natural Language Processing (NLP) stands at a critical juncture where the integration of fairness within its frameworks has become an imperative. This PhD thesis addresses the need for equity and transparency in NLP systems, recognizing that fairness in NLP is not merely a technical challenge but a moral and ethical necessity, requiring a rigorous examination of how these technologies interact with and impact diverse human populations. Through this lens, this thesis undertakes a thorough investigation into the development of equitable NLP methodologies and the evaluation of biases that prevail in current systems. First, it introduces an innovative algorithm to mitigate biases in multi-class classifiers, tailored for high-risk NLP applications, surpassing traditional methods in both bias mitigation and prediction accuracy. Then, an analysis of the Bios dataset reveals the impact of dataset size on discriminatory biases and the limitations of standard fairness metrics. This awareness has led to explorations in the field of explainable AI, aiming for a more complete understanding of biases where traditional metrics are limited. Consequently, the thesis presents COCKATIEL, a model-agnostic explainability method that identifies and ranks concepts in Transformer models, outperforming previous approaches in sentiment analysis tasks. Finally, the thesis contributes to bridging the gap between fairness and explainability by introducing TaCo, a novel method to neutralize bias in Transformer model embeddings. In conclusion, this thesis constitutes a significant interdisciplinary endeavor that intertwines explicability and fairness to challenge and reshape current NLP paradigms. The methodologies and critiques presented contribute to the ongoing discourse on fairness in machine learning, offering actionable solutions for more equitable and responsible AI systems. 

[NLP-31] With a Grain of SALT: Are LLMs Fair Across Social Dimensions?

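Quick read: This paper analyzes gender, religious, and racial bias in open-source large language models (LLMs). The key to the solution is a method for generating a bias detection dataset from seven bias triggers (such as debate, career advice, and story generation), with GPT-4o producing diverse prompts. The generated text is anonymized with GPT-4o-mini and compared pairwise using GPT-4o-as-a-Judge to quantify bias. The study covers three languages (English, German, and Arabic) to explore how language affects the manifestation of bias. It finds that LLMs show strong polarization toward certain groups in each category, and that bias manifests differently across languages because of cultural cues and contextual differences.

Link: https://arxiv.org/abs/2410.12499
Authors: Samee Arif, Zohaib Khan, Agha Ali Raza, Awais Athar
Keywords (EN): open-source Large Language, open-source Large, Large Language Models, General Debate, Positioned Debate
Categories: Computation and Language (cs.CL)
Comments: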

Abstract:This paper presents an analysis of biases in open-source Large Language Models (LLMs) across various genders, religions, and races. We introduce a methodology for generating a bias detection dataset using seven bias triggers: General Debate, Positioned Debate, Career Advice, Story Generation, Problem-Solving, Cover-Letter Writing, and CV Generation. We use GPT-4o to generate a diverse set of prompts for each trigger across various genders, religious and racial groups. We evaluate models from Llama and Gemma family on the generated dataset. We anonymise the LLM-generated text associated with each group using GPT-4o-mini and do a pairwise comparison using GPT-4o-as-a-Judge. To quantify bias in the LLM-generated text we use the number of wins and losses in the pairwise comparison. Our analysis spans three languages, English, German, and Arabic to explore how language influences bias manifestation. Our findings reveal that LLMs exhibit strong polarization toward certain groups across each category, with a notable consistency observed across models. However, when switching languages, variations and anomalies emerge, often attributable to cultural cues and contextual differences.
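
The win/loss counting used to quantify bias can be sketched as a simple tally over all pairs of groups. The `judge` callable below is a hypothetical stand-in for the GPT-4o-as-a-Judge step (no model is called), and the group names and scores are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def tally(groups, judge):
    """Compare every pair of groups once and count wins and losses per group."""
    wins, losses = Counter(), Counter()
    for a, b in combinations(groups, 2):
        winner = judge(a, b)
        loser = b if winner == a else a
        wins[winner] += 1
        losses[loser] += 1
    return wins, losses


# Hypothetical judge: prefers the group whose anonymized text scored higher.
toy_scores = {"group_a": 3, "group_b": 1, "group_c": 2}


def toy_judge(a, b):
    return a if toy_scores[a] >= toy_scores[b] else b


wins, losses = tally(["group_a", "group_b", "group_c"], toy_judge)
```

In this toy run `group_a` wins both of its comparisons and `group_b` loses both; a strongly skewed win/loss profile of this kind is what the abstract describes as polarization toward certain groups.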

[NLP-32] End-to-end Planner Training for Language Modeling

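Quick read: This paper addresses the fact that existing planner-augmented language models (LMs) cannot be fine-tuned jointly, end to end, with the planning module. The key to the solution is a differentiable method that uses the predicted label probabilities as mixing weights and takes a weighted average of the label embeddings, so that the LM is conditioned on the label distribution predicted by the planner. This enables joint fine-tuning of the planner and the LM, lets the LM draw on the full predicted label distribution and so retain more information, and yields consistent perplexity improvements in experiments.

Link: https://arxiv.org/abs/2410.12492
Authors: Nathan Cornille, Florian Mai, Jingyuan Sun, Marie-Francine Moens
Keywords (EN): valuable tools, predict abstract labels, language modeling, planner, enhance language modeling
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages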

Abstract:Through end-to-end training to predict the next token, LLMs have become valuable tools for various tasks. Enhancing their core training in language modeling can improve numerous downstream applications. A successful approach to enhance language modeling uses a separate planning module to predict abstract labels of future sentences and conditions the LM on these predictions. However, this method is non-differentiable, preventing joint end-to-end tuning of the planner with the LM. We propose an effective method to improve this approach by enabling joint fine-tuning of the planner and the LM. We show that a naive way of approximating the gradient of selecting a label via the straight-through estimator is not effective. Instead, we propose to use the predicted label probabilities as mixing weights to condition the LM on a weighted average of label embeddings in a differentiable manner. This not only enables joint fine-tuning of the planner and the LM, but also allows the LM to draw on the full label distribution predicted by the planner, retaining more information. Our experimental results show consistent improvements in perplexity.
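
The contrast between selecting a single label and mixing label embeddings by predicted probability can be shown with a small numeric sketch. The two-label setup, the embeddings, and the function names are hypothetical; the sketch only illustrates the weighted average, not the paper's training procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_embedding(logits, label_embeddings):
    """Probability-weighted average of label embeddings (smooth in the logits)."""
    probs = softmax(logits)
    dim = len(label_embeddings[0])
    return [sum(p * emb[i] for p, emb in zip(probs, label_embeddings))
            for i in range(dim)]


def hard_label_embedding(logits, label_embeddings):
    """Embedding of the arg-max label; the selection step has no useful gradient."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return label_embeddings[best]


label_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-d embeddings
logits = [0.0, 0.0]                          # planner is undecided

mixed = soft_label_embedding(logits, label_embeddings)   # [0.5, 0.5]
picked = hard_label_embedding(logits, label_embeddings)  # [1.0, 0.0]
```

When the planner is undecided, the hard choice discards that uncertainty, while the mixture keeps it, which matches the abstract's point that the LM can draw on the full label distribution.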

[NLP-33] Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

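Quick read: This paper addresses the opacity of the underlying reward functions and decision-making processes of large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF). The key to the solution is applying inverse reinforcement learning (IRL) to recover the models' implicit reward functions. The authors extract reward models that reach up to 80.40% accuracy in predicting human preferences, and they analyze the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. IRL-derived reward models can also be used to fine-tune new LLMs, giving comparable or improved performance on toxicity benchmarks. The approach offers a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these systems.

Link: https://arxiv.org/abs/2410.12491
Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
Keywords (EN): demonstrated remarkable capabilities, processes remain opaque, Large language models, decision-making processes remain, Large language
Categories: Computation and Language (cs.CL)
Comments: Preprint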

Abstract:Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 80.40% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.

[NLP-34] KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

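Quick read: This paper addresses hallucinations and confusion about task instructions when large language models (LLMs) perform schema and entity matching. The key to the solution is the Knowledge-Compliant Matching Framework (KcMF), which uses a pseudo-code-based task decomposition strategy to bring task-specific natural language statements into LLM reasoning and reduce confusion. KcMF also introduces two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Finally, a result-ensembling strategy lets KcMF draw on multiple knowledge sources and suppress poorly formatted outputs. On schema and entity matching tasks, KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by 22.9% in average F1 score and competes effectively with SOTA fine-tuned LLMs.

Link: https://arxiv.org/abs/2410.12480
Authors: Yongqin Xu, Huan Li, Ke Chen, Lidan Shou
Keywords (EN): integration and management, crucial for data, data integration, entity matching tasks, Knowledge-Compliant Matching Framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments: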

Abstract:Schema and entity matching tasks are crucial for data integration and management. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. In this paper, we present the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a pseudo-code-based task decomposition strategy to adopt task-specific natural language statements that guide LLM reasoning and reduce confusion. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Additionally, we introduce a result-ensembling strategy to leverage multiple knowledge sources and suppress poorly formatted outputs. Comprehensive evaluations on schema and entity matching tasks demonstrate that KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by an average F1 score of 22.9% and competes effectively with SOTA fine-tuned LLMs. Moreover, KcMF generalizes well across different LLMs.

[NLP-35] MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

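Quick read: This paper addresses the hallucination tendency of large language models (LLMs), and in particular the under-explored question of how reliable their confidence estimates are in languages other than English. The key to the solution is a study of multilingual confidence estimation (MlingConf) that compares language-agnostic (LA) and language-specific (LS) tasks. It reveals the linguistic dominance of English in confidence estimation on LA tasks and proposes a native-tone prompting strategy, which uses language-specific prompts to improve the reliability and accuracy of LLMs on LS tasks.

Link: https://arxiv.org/abs/2410.12478
Authors: Boyang Xue, Hongru Wang, Rui Wang, Sheng Wang, Zezhong Wang, Yiming Du, Bin Liang, Kam-Fai Wong
Keywords (EN): Large Language Models, generate hallucinations raises, hallucinations raises concerns, tendency of Large, confidence estimations
Categories: Computation and Language (cs.CL)
Comments: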

Abstract:The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits more notable linguistic dominance in confidence estimations than other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. The phenomena inspire a simple yet effective native-tone prompting strategy by employing language-specific prompts for LS tasks, effectively improving LLMs’ reliability and accuracy on LS tasks.

[NLP-36] Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

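Quick read: This paper addresses the limits that data scarcity and ethical considerations place on machine learning in the clinical domain, where generating clinical trial data is hampered by stringent privacy regulations, high costs, and long study durations. The key to the solution is a new Retrieval-Reasoning few-shot framework that uses large language models (LLMs) to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments show that the synthetic data effectively augments real datasets and improves model training for downstream tasks such as trial outcome prediction, which could accelerate clinical research while upholding ethical standards for patient privacy.

Link: https://arxiv.org/abs/2410.12476
Authors: Zerui Xu, Fang Wu, Tianfan Fu, Yue Zhao
Keywords (EN): Machine learning, clinical, clinical trials, Machine, synthetic clinical trial
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: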

Abstract:Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the database at this http URL demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at this https URL.

[NLP-37] Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels

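Quick read: This paper addresses the difficulty of annotating large datasets, especially for complex natural language tasks such as predicting product usage options from customer reviews, where crowd-sourcing is expensive and its quality is hard to guarantee. The key to the solution is using large language models (LLMs) as few-shot learners to annotate data and then training a standalone model on those labels, which gives individual control over energy efficiency and privacy measures compared with using the LLM directly. The paper also proposes a new evaluation metric, HAMS4, for comparing a set of strings with multiple reference sets. Experiments show that the approach reduces cost considerably, that the resulting data quality exceeds that of third-party vendor services, and that GPT-4-generated labels even reach the level of domain experts.

Link: https://arxiv.org/abs/2410.12470
Authors: Leo Kohlenberg, Leonard Horns, Frederic Sadrieh, Nils Kiele, Matthis Clausen, Konstantin Ketterer, Avetis Navasardyan, Tamara Czinczoll, Gerard de Melo, Ralf Herbrich
Keywords (EN): Annotating large datasets, large datasets, Annotating large, Abstract, Annotating
Categories: Computation and Language (cs.CL)
Comments: 9 pages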

Abstract:Annotating large datasets can be challenging. However, crowd-sourcing is often expensive and can lack quality, especially for non-trivial tasks. We propose a method of using LLMs as few-shot learners for annotating data in a complex natural language task where we learn a standalone model to predict usage options for products from customer reviews. We also propose a new evaluation metric for this scenario, HAMS4, that can be used to compare a set of strings with multiple reference sets. Learning a custom model offers individual control over energy efficiency and privacy measures compared to using the LLM directly for the sequence-to-sequence task. We compare this data annotation approach with other traditional methods and demonstrate how LLMs can enable considerable cost savings. We find that the quality of the resulting data exceeds the level attained by third-party vendor services and that GPT-4-generated labels even reach the level of domain experts. We make the code and generated labels publicly available.

[NLP-38] Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention

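Quick read: This paper addresses the significant performance gaps that large language models (LLMs) show across languages. The key to the solution is Inference-Time Cross-Lingual Intervention (INCLINE), a framework that aligns the internal representations of a low-performing (source) language with those of a high-performing (target) language during inference to improve performance on the low-performing language. INCLINE learns alignment matrices through least-squares optimization and applies them at inference time to transform source-language representations toward the target-language space, which significantly improves performance on multilingual tasks at low cost.

Link: https://arxiv.org/abs/2410.12462
Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Keywords (EN): Large Language Models, shown remarkable capabilities, natural language processing, Large Language, Language Models
Categories: Computation and Language (cs.CL)
Comments: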

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language processing but exhibit significant performance gaps among different languages. Most existing approaches to address these disparities rely on pretraining or fine-tuning, which are resource-intensive. To overcome these limitations without incurring significant costs, we propose Inference-Time Cross-Lingual Intervention (INCLINE), a novel framework that enhances LLM performance on low-performing (source) languages by aligning their internal representations with those of high-performing (target) languages during inference. INCLINE initially learns alignment matrices using parallel sentences from source and target languages through a Least-Squares optimization, and then applies these matrices during inference to transform the low-performing language representations toward the high-performing language space. Extensive experiments on nine benchmarks with five LLMs demonstrate that INCLINE significantly improves performance across diverse tasks and languages, compared to recent strong baselines. Our analysis demonstrates that INCLINE is highly cost-effective and applicable to a wide range of applications. In addition, we release the code to foster research along this line: this https URL.
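
Learning an alignment matrix by least squares and applying it at inference can be sketched in two dimensions with the normal equations. Everything here is a toy assumption: the 2-d "representations", the helper names, and the closed-form 2x2 inverse are for illustration only; real hidden states are high-dimensional and would call for a numerical library, and the abstract does not say which layers are intervened on or how the transformed representation is combined with the original.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def fit_alignment(source, target):
    """Least-squares M with source @ M ~= target (rows are sentence vectors)."""
    xt = transpose(source)
    return matmul(inverse_2x2(matmul(xt, source)), matmul(xt, target))


def align(vector, m):
    """Map one source-language vector toward the target-language space."""
    return matmul([vector], m)[0]


# Hypothetical parallel sentence representations (source row i pairs with target row i).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[2.0, 0.0], [1.0, 1.0], [3.0, 1.0]]

M = fit_alignment(source, target)   # recovers [[2, 0], [1, 1]] for this toy data
aligned = align([2.0, 1.0], M)      # approximately [5.0, 1.0]
```

The fit is a one-off closed-form computation on parallel sentences, and the per-token cost at inference is a single matrix product, which is consistent with the abstract's claim that the approach avoids the cost of pretraining or fine-tuning.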

[NLP-39] The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

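Quick read: This paper addresses the underperformance of large language models (LLMs) on natural language processing (NLP) tasks when the supervised fine-tuning (SFT) data does not balance quality and diversity. The key to the solution is GraphFilter, which represents the dataset as a bipartite graph linking sentences to their constituent n-grams, capturing the relationships between sentences and linguistic patterns. A priority function combines the quality and diversity metrics; GraphFilter iteratively selects high-priority sentences and updates the bipartite graph, and it significantly improves model performance and computational efficiency on multiple benchmarks.

Link: https://arxiv.org/abs/2410.12458
Authors: Minghao Wu, Thuy-Trang Vu, Lizhen Qu, Gholamreza Haffari
Keywords (EN): natural language processing, large language models, language processing, large language, natural language
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 5 figures, 5 tables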

Abstract:The performance of large language models (LLMs) in natural language processing (NLP) tasks is significantly influenced by the quality and diversity of data used for supervised fine-tuning (SFT). Current data selection methods often focus solely on quality or diversity, leading to underperforming models due to suboptimal training data. In this paper, we introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams. This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity. To balance quality and diversity during selection, we propose a priority function that combines the quality metric with the diversity metric in a multiplicative manner. GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape. We conduct extensive experiments using three model backbones across six widely used benchmarks. The results demonstrate that GraphFilter outperforms all nine baseline approaches, achieving superior model performance and computational efficiency. Our analyses validate the effectiveness of our design choices, examine the subsets selected by GraphFilter and other methods, highlight the importance of instruction diversity, and explore the role of quality and diversity in relation to subset sizes. GraphFilter establishes a new foundation for effective data selection strategies, encouraging further research in data selection for LLMs.
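
The select, remove covered n-grams, and re-prioritize loop described in the abstract can be sketched as a greedy procedure. This is a simplified illustration under stated assumptions: diversity is taken to be the count of a sentence's not-yet-covered bigrams, quality scores are given as plain numbers, and the function names and `budget` parameter are hypothetical rather than the paper's actual metrics or interface.

```python
def ngrams(sentence, n=2):
    """Set of word n-grams in a whitespace-tokenized, lower-cased sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def graph_filter(sentences, quality, budget, n=2):
    """Greedily pick sentences by priority = quality * uncovered n-gram count."""
    # Bipartite graph: sentence index -> set of its not-yet-covered n-grams.
    edges = {i: ngrams(s, n) for i, s in enumerate(sentences)}
    selected = []
    while edges and len(selected) < budget:
        best = max(edges, key=lambda i: quality[i] * len(edges[i]))
        if not edges[best]:
            break  # no remaining sentence adds a new n-gram
        covered = edges.pop(best)
        selected.append(best)
        # Remove covered n-grams so priorities reflect what is still new.
        for i in edges:
            edges[i] -= covered
    return selected


sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "dogs bark loudly at night",
]
quality = [0.9, 0.8, 0.5]  # hypothetical quality scores

selected = graph_filter(sentences, quality, budget=2)
```

In this toy run the second sentence has the second-highest quality, but after the first sentence is chosen it contributes only one new bigram, so the lower-quality but novel third sentence is picked instead; the multiplicative priority is what produces that trade-off.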

[NLP-40] Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

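Quick read: This paper addresses two main problems with the existing Open Ko-LLM Leaderboard for evaluating Korean large language models (LLMs): the disconnect between quantitative improvements on the benchmarks and the models' real-world impact, and a benchmark suite that consists largely of translated English benchmarks and so does not fully reflect the intricacies of Korean. The key to the solution is Open Ko-LLM Leaderboard2, which entirely replaces the original benchmarks with new tasks more closely aligned with real-world capabilities and adds four new native Korean benchmarks, in order to evaluate and advance Korean LLMs more accurately.

Link: https://arxiv.org/abs/2410.12445
Authors: Hyeonwoo Kim, Dahyun Kim, Jihoo Kim, Sukyung Lee, Yungi Kim, Chanjun Park
Keywords (EN): benchmarking Korean Large, Large Language Models, Korean Large Language, Open Ko-LLM Leaderboard, Korean Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: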

Abstract:The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of their English counterparts, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.

[NLP-41] Expanding Chatbot Knowledge in Customer Service: Context-Aware Similar Question Generation Using Large Language Models

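Quick read: This paper addresses Similar Question Generation (SQG) for service chatbots: how to generate similar questions that are diverse yet stay semantically consistent with the source question. The key to the solution is fine-tuning large language models (LLMs) with specially designed prompts to leverage their natural language understanding capability, producing a large number of diverse, semantically consistent similar questions. Experiments show that the method clearly outperforms baseline methods in semantic diversity, and human evaluation confirms that integrating the answer, which reflects the customer's intention, increases the number of generated questions that meet business requirements.

Link: https://arxiv.org/abs/2410.12444
Authors: Mengze Hong, Yuanfeng Song, Di Jiang, Lu Wang, Zichang Guo, Chen Jason Zhang
Keywords (EN): knowledge base comprising, base comprising predefined, comprising predefined question-answer, Reliable responses, predefined question-answer pairs
Categories: Computation and Language (cs.CL)
Comments: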

Abstract:Reliable responses of service chatbots are often achieved by employing retrieval-based methods that restrict answers to a knowledge base comprising predefined question-answer pairs (QA pairs). To accommodate potential variations in how a customer’s query may be expressed, it emerges as the favored solution to augment these QA pairs with similar questions that are possibly diverse while remaining semantically consistent. This augmentation task is known as Similar Question Generation (SQG). Traditional methods that heavily rely on human efforts or rule-based techniques suffer from limited diversity or significant semantic deviation from the source question, only capable of producing a finite number of useful questions. To address these limitations, we propose an SQG approach based on Large Language Models (LLMs), capable of producing a substantial number of diverse questions while maintaining semantic consistency to the source QA pair. This is achieved by leveraging LLMs’ natural language understanding capability through fine-tuning with specially designed prompts. The experiments conducted on a real customer-service dataset demonstrate that our method surpasses baseline methods by a significant margin in terms of semantic diversity. Human evaluation further confirms that integrating the answer that reflects the customer’s intention is crucial for increasing the number of generated questions that meet business requirements.

[NLP-42] Conformity in Large Language Models

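[Quick read]: This paper addresses the conformity effect that large language models (LLMs) exhibit when faced with a majority opinion: in information-seeking and decision-support tasks, LLMs may adopt incorrect answers by conforming, which undermines their effectiveness. The key to the solution is identifying and mitigating this behavior, particularly when the model is uncertain about its own predictions. Using methods adapted from psychological experiments, the paper is the first to show that LLMs tend to conform when uncertain, and it examines how training paradigms and input characteristics affect conformity. Finally, it proposes two interventions, Devil's Advocate and Question Distillation, to reduce conformity and make language models more robust.

Link: https://arxiv.org/abs/2410.12428
Authors: Xiaochen Zhu, Caiqi Zhang, Tom Stafford, Nigel Collier, Andreas Vlachos
Keywords: conformity effect describes, effect describes, describes the tendency, tendency of individuals, individuals to align
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages (8 pages main body), 14 figures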

Abstract:The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions–Devil’s Advocate and Question Distillation–to mitigate conformity, providing insights into building more robust language models.

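[NLP-43] Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding

[Quick read]: This paper asks whether Transformers need explicit positional encoding to generate hierarchical languages. The key result is a proof that causal masking and a starting token suffice for a Transformer without explicit positional encoding to compute positional information and depth within hierarchical structures, and therefore to generate hierarchical languages. The results also suggest that explicit positional encoding may harm generalization with respect to sequence length.

Link: https://arxiv.org/abs/2410.12413
Authors: Daichi Hayakawa, Issei Sato
Keywords: provide constructive proof, model size, hierarchical language efficiently, provide constructive, constructive proof
Categories: Computation and Language (cs.CL)
Comments: 55 pages, 11 figures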

Abstract:In this study, we provide constructive proof that Transformers can recognize and generate hierarchical language efficiently with respect to model size, even without the need for a specific positional encoding. Specifically, we show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures. We demonstrate that Transformers without positional encoding can generate hierarchical languages. Furthermore, we suggest that explicit positional encoding might have a detrimental effect on generalization with respect to sequence length.
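
To make the claim about causal masking and a starting token concrete, here is a minimal sketch of one well-known way position and nesting depth can be recovered without positional encoding. It is an illustration under stated assumptions, not the construction from the paper: the attention head is assumed to give equal weight to every visible token, the `<s>` token and the bracket alphabet are hypothetical, and the final division is done in plain Python rather than by a feed-forward layer.

```python
from fractions import Fraction

BOS = "<s>"  # hypothetical starting token


def uniform_causal_attention(values):
    """Uniform attention under a causal mask: out[t] = mean(values[0..t])."""
    outputs, running = [], Fraction(0)
    for t, v in enumerate(values):
        running += v
        outputs.append(running / (t + 1))
    return outputs


def positions_and_depths(tokens):
    """Recover the index and bracket depth of every token from prefix averages."""
    assert tokens[0] == BOS, "the sketch relies on a starting token at index 0"
    # Value 1 on the starting token only: the prefix average is 1 / (t + 1).
    is_bos = [Fraction(int(tok == BOS)) for tok in tokens]
    # Value +1 for "(" and -1 for ")": the prefix average is depth / (t + 1).
    bracket = [Fraction({"(": 1, ")": -1}.get(tok, 0)) for tok in tokens]
    inv_pos = uniform_causal_attention(is_bos)
    avg_depth = uniform_causal_attention(bracket)
    positions = [int(1 / p) - 1 for p in inv_pos]
    depths = [int(d / p) for d, p in zip(avg_depth, inv_pos)]
    return positions, depths


positions, depths = positions_and_depths([BOS, "(", "(", ")", "(", ")", ")"])
print(positions)  # [0, 1, 2, 3, 4, 5, 6]
print(depths)     # [0, 1, 2, 1, 2, 1, 0]
```

The point of the sketch is that the causal mask makes the prefix length visible through the averaging denominator, and the starting token supplies a fixed reference value to read it off.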

[NLP-44] Revealing the Barriers of Language Agents in Planning

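[Quick read]: This paper addresses why language agents fall short of human-level autonomous planning. The key is identifying and analyzing two main factors that hinder their planning ability: the limited role of constraints and the diminishing influence of questions. Through a feature attribution study, the paper shows that current strategies help mitigate these challenges but do not fully resolve them, indicating that language agents still have a long way to go before reaching human-level intelligence.

Link: https://arxiv.org/abs/2410.12409
Authors: Jian Xie, Kexun Zhang, Jiangjie Chen, Siyu Yuan, Kai Zhang, Yikai Zhang, Lei Li, Yanghua Xiao
Keywords: ongoing pursuit, inception of artificial, planning, Autonomous planning, artificial intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress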

Abstract:Autonomous planning has been an ongoing pursuit since the inception of artificial intelligence. Based on curated problem solvers, early planning agents could deliver precise solutions for specific tasks but lacked generalization. The emergence of large language models (LLMs) and their powerful reasoning capabilities has reignited interest in autonomous planning by automatically generating reasonable solutions for given tasks. However, prior research and our experiments show that current language agents still lack human-level planning abilities. Even the state-of-the-art reasoning model, OpenAI o1, achieves only 15.6% on one of the complex real-world planning benchmarks. This highlights a critical question: What hinders language agents from achieving human-level planning? Although existing studies have highlighted weak performance in agent planning, the deeper underlying issues and the mechanisms and limitations of the strategies proposed to address them remain insufficiently understood. In this work, we apply the feature attribution study and identify two key factors that hinder agent planning: the limited role of constraints and the diminishing influence of questions. We also find that although current strategies help mitigate these challenges, they do not fully resolve them, indicating that agents still have a long way to go before reaching human-level intelligence.

[NLP-45] Beyond Coarse-Grained Matching in Video-Text Retrieval ACCV2024

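[Quick read]: This paper addresses the lack of fine-grained evaluation for existing video-text retrieval models, in particular their ability to discern subtle differences in captions. The key is a new fine-grained evaluation approach that automatically generates hard negative test captions with single-word changes to test whether models perceive such differences. The approach applies to existing datasets, and the experiments expose the limitations of current benchmarks in detecting fine-grained ability. The paper also proposes a new baseline that improves models' understanding of fine-grained differences.

Link: https://arxiv.org/abs/2410.12407
Authors: Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek
Keywords: Video-text retrieval, significant advancements, requires verification, Video-text, fine-grained
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted to ACCV 2024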

Abstract:Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model’s ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model’s ability to understand fine-grained differences.
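
The idea of hard negatives that differ from the original caption in exactly one word can be sketched as follows. The abstract does not say how the paper selects replacement words, so the substitution table here is a hypothetical hand-written lexicon and `single_word_negatives` is an illustrative helper, not the authors' generator.

```python
def single_word_negatives(caption, substitutions):
    """Return captions that differ from `caption` in exactly one word.

    `substitutions` maps a lowercase word to candidate replacements
    (a hypothetical lexicon supplied by the caller).
    """
    words = caption.split()
    negatives = []
    for i, word in enumerate(words):
        for alt in substitutions.get(word.lower(), []):
            if alt.lower() != word.lower():
                # Swap only position i; every other word stays untouched.
                negatives.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return negatives


lexicon = {
    "man": ["woman"],        # noun
    "opens": ["closes"],     # verb
    "slowly": ["quickly"],   # adverb
}
for negative in single_word_negatives("a man slowly opens the door", lexicon):
    print(negative)
```

A retrieval model can then be scored on whether it ranks the original caption above each of these near-identical negatives for the same video.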

[NLP-46] Nominal Class Assignment in Swahili: A Computational Account

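[Quick read]: This paper addresses the relation between semantics and nominal class assignment in Swahili. The key is to quantify the extent of this relation from a computational perspective and to explicate its nature through detailed analysis, taking care to exclude morphosyntactic confounds. The results provide the first quantitative evaluation of the semantic cohesion of each nominal class, together with a nuanced taxonomic description of its semantic content.

Link: https://arxiv.org/abs/2410.12406
Authors: Giada Palmieri, Konstantinos Kogkalidis
Keywords: assignment in Swahili, nominal class assignment, discuss the open, open question, nominal class
Categories: Computation and Language (cs.CL)
Comments: Tenth Italian Conference on Computational Linguistics (CliC-it-2024)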

Abstract:We discuss the open question of the relation between semantics and nominal class assignment in Swahili. We approach the problem from a computational perspective, aiming first to quantify the extent of this relation, and then to explicate its nature, taking extra care to suppress morphosyntactic confounds. Our results are the first of their kind, providing a quantitative evaluation of the semantic cohesion of each nominal class, as well as a nuanced taxonomic description of its semantic content.

[NLP-47] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs EMNLP2024

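[Quick read]: This paper addresses the performance variation of large language models (LLMs) across different prompts, in particular the effect of prompt sensitivity on model evaluation and user satisfaction. The key is the ProSA framework, which introduces a new sensitivity metric, PromptSensiScore, and uses decoding confidence to evaluate and understand prompt sensitivity. The results show that prompt sensitivity varies across datasets and models, and that larger models are more robust. Few-shot examples can alleviate this sensitivity, and subjective evaluations are likewise affected by it in complex reasoning tasks.

Link: https://arxiv.org/abs/2410.12405
Authors: Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, Kai Chen
Keywords: Large language models, demonstrated impressive capabilities, Large language, demonstrated impressive, impressive capabilities
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024, Findings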

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across various tasks, but their performance is highly sensitive to the prompts utilized. This variability poses challenges for accurate assessment and user satisfaction. Current research frequently overlooks instance-level prompt variations and their implications on subjective evaluations. To address these shortcomings, we introduce ProSA, a framework designed to evaluate and comprehend prompt sensitivity in LLMs. ProSA incorporates a novel sensitivity metric, PromptSensiScore, and leverages decoding confidence to elucidate underlying mechanisms. Our extensive study, spanning multiple tasks, uncovers that prompt sensitivity fluctuates across datasets and models, with larger models exhibiting enhanced robustness. We observe that few-shot examples can alleviate this sensitivity issue, and subjective evaluations are also susceptible to prompt sensitivities, particularly in complex, reasoning-oriented tasks. Furthermore, our findings indicate that higher model confidence correlates with increased prompt robustness. We believe this work will serve as a helpful tool in studying prompt sensitivity of LLMs. The project is released at: this https URL .

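[NLP-48] Tracking Universal Features Through Fine-Tuning and Model Merging

[Quick read]: This paper studies how features emerge, disappear, and persist when models are fine-tuned on different text domains. The approach starts from a base one-layer Transformer language model trained on the BabyLM corpus and a collection of Python code, adapts it to two new domains, TinyStories and the Lua programming language, and then merges the two resulting models using spherical linear interpolation. Small-scale models and sparse auto-encoders are used to study the stability and transformation of features in typical transfer-learning scenarios.

Link: https://arxiv.org/abs/2410.12391
Authors: Niels Horn, Desmond Elliott
Keywords: domains of text, base one-layer Transformer, one-layer Transformer language, Transformer language model, features emerge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: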

Abstract:We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus, and a collection of Python code from The Stack. This base model is adapted to two new domains of text: TinyStories, and the Lua programming language, respectively; and then these two models are merged using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.
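
Spherical linear interpolation (slerp), the merging step named in the abstract, can be sketched on flat weight vectors as below. This is a generic textbook formulation, not the authors' code: how the paper flattens or groups parameters is not stated in the abstract, and the fallback to linear interpolation for near-parallel, near-opposite, or zero vectors is an assumption added here for numerical safety.

```python
import math


def slerp(w_a, w_b, t):
    """Interpolate between weight vectors w_a and w_b along the arc between them.

    t = 0 returns w_a and t = 1 returns w_b.
    """
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm_a = math.sqrt(sum(a * a for a in w_a))
    norm_b = math.sqrt(sum(b * b for b in w_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero vector: use linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or opposite vectors: the slerp weights are unstable.
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]


merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(merged)  # both components are close to 0.7071
```

Unlike a plain average, which would give `[0.5, 0.5]` in this example, slerp keeps the norm of the merged vector on the arc between the two inputs.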

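[NLP-49] Prompt Compression for Large Language Models: A Survey

[Quick read]: This paper addresses the increased memory usage and inference cost caused by long-form prompts when using large language models (LLMs) for complex natural language tasks. The key is prompt compression, which the survey divides into hard prompt methods and soft prompt methods. The paper compares the technical approaches of these methods and explores ways of understanding their mechanisms from the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality fusion, and new synthetic language. It also analyzes the limitations of current methods and outlines future directions such as optimizing the compression encoder, combining hard and soft prompt methods, and leveraging insights from multimodality.

Link: https://arxiv.org/abs/2410.12388
Authors: Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
Keywords: tasks typically requires, typically requires long-form, convey detailed requirements, increased memory usage, requires long-form prompts
Categories: Computation and Language (cs.CL)
Comments: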

Abstract:Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality fusion, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompts methods, and leveraging insights from multimodality.

[NLP-50] Evaluation of Attribution Bias in Retrieval-Augmented Large Language Models

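[Quick read]: This paper addresses the biases that answer attribution may introduce in retrieval-augmented generation (RAG). The key is defining and evaluating two aspects, attribution sensitivity and attribution bias, with respect to authorship information. By explicitly informing a large language model (LLM) of the authors of source documents and instructing it to attribute its answers, the study analyzes how sensitive the LLM's output is to document authorship and whether it is biased toward human-written or AI-generated documents. The experiments show that adding authorship information can significantly change the attribution quality of LLMs, and that LLMs can exhibit an attribution bias toward explicit human authorship, which offers a competing hypothesis for prior findings that LLM-generated content may be preferred over human-written content.

Link: https://arxiv.org/abs/2410.12380
Authors: Amin Abolghasemi, Leif Azzopardi, Seyyed Hadi Hashemi, Maarten de Rijke, Suzan Verberne
Keywords: retrieval augmented generation, source documents, Attributing answers, RAG pipelines, RAG
Categories: Computation and Language (cs.CL)
Comments: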

Abstract:Attributing answers to source documents is an approach used to enhance the verifiability of a model’s output in retrieval augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM’s output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%. Moreover, we show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for findings of prior work that shows that LLM-generated content may be preferred over human-written contents. Our findings indicate that metadata of source documents can influence LLMs’ trust, and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of brittleness in LLMs.

[NLP-51] HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims EMNLP2024

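[Quick read]: This paper addresses the automated fact-checking task posed by the AVeriTeC shared task. The key is a system named HerO, which uses multiple publicly available large language models (LLMs) for each step of fact-checking. HerO uses a language model to enhance queries by generating hypothetical fact-checking documents, and prompts pretrained and fine-tuned LLMs for question generation and veracity prediction with prompts built from retrieved in-context samples. The approach achieved an AVeriTeC score of 0.57, ranking second on the leaderboard, which suggests the potential of open LLMs for verifying real-world claims.

Link: https://arxiv.org/abs/2410.12377
Authors: Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
Keywords: shared task hosted, dubbed the Herd, AVeriTeC shared task, Herd of Open, step of automated
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: A system description paper for the AVeriTeC shared task, hosted by the seventh FEVER workshop (co-located with EMNLP 2024)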

Abstract:To tackle the AVeriTeC shared task hosted by the FEVER-24, we introduce a system that only employs publicly available large language models (LLMs) for each step of automated fact-checking, dubbed the Herd of Open LLMs for verifying real-world claims (HerO). HerO employs multiple LLMs for each step of automated fact-checking. For evidence retrieval, a language model is used to enhance a query by generating hypothetical fact-checking documents. We prompt pretrained and fine-tuned LLMs for question generation and veracity prediction by crafting prompts with retrieved in-context samples. HerO achieved 2nd place on the leaderboard with the AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims. For future research, we make our code publicly available at this https URL.

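[NLP-52] PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

[Quick read]: This paper addresses how to improve the reasoning ability of language models through self-teaching and recursive reasoning optimization. The key is Preference-based Recursive Language Modeling (PRefLexOR), which combines preference optimization with concepts from reinforcement learning so that, in both training and inference, the model improves its reasoning step by step through multi-step reasoning and by revisiting and refining intermediate steps. The method builds a dynamic knowledge graph, applies preference optimization with rejection sampling, and uses recursive optimization within a thinking-token framework to introduce iterative feedback loops, yielding deeper coherence, consistency, and adaptability.

Link: https://arxiv.org/abs/2410.12375
Authors: Markus J. Buehler
Keywords: Modeling for Exploratory, Preference-based Recursive Language, Reinforcement Learning, Recursive Language Modeling, concepts from Reinforcement
Categories: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
Comments: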

Abstract:PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning) combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach through iterative reasoning improvements. We propose a recursive learning approach that engages the model in multi-step reasoning, revisiting, and refining intermediate steps before producing a final output in training and inference phases. Through multiple training stages, the model first learns to align its reasoning with accurate decision paths by optimizing the log odds between preferred and non-preferred responses. During this process, PRefLexOR builds a dynamic knowledge graph by generating questions from random text chunks and retrieval-augmentation to contextualize relevant details from the entire training corpus. In the second stage, preference optimization enhances model performance by using rejection sampling to fine-tune reasoning quality by continually producing in-situ training data while masking the reasoning steps. Recursive optimization within a thinking token framework introduces iterative feedback loops, where the model refines reasoning, achieving deeper coherence, consistency, and adaptability. Implemented in small language models with only 3 billion parameters, we show that even tiny models can iteratively teach themselves to reason with greater depth and reflectivity. Our implementation is straightforward and can be incorporated into any existing pretrained LLM. We focus our examples on applications in biological materials science and demonstrate the method in a variety of case studies that range from in-domain to cross-domain applications. Using reasoning strategies that include thinking and reflection modalities we build a multi-agent recursive self-improving inference approach to successively improve responses via repeated sampling in inference time.
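
The phrase "optimizing the log odds between preferred and non-preferred responses" can be illustrated with an odds-ratio style preference term of the kind used in ORPO-like objectives. The abstract does not give the exact loss, so treat this as a sketch of one common formulation. The inputs are assumed to be length-normalized log-probabilities of each response (strictly negative), and the function names are hypothetical.

```python
import math


def log_odds(logp):
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def odds_ratio_loss(logp_chosen, logp_rejected):
    """-log(sigmoid(log-odds margin)); smaller when the chosen response is favoured."""
    margin = log_odds(logp_chosen) - log_odds(logp_rejected)
    if margin < -30.0:
        # softplus(-margin) is approximately -margin here; avoids overflow in exp.
        return -margin
    return math.log1p(math.exp(-margin))


# Chosen response at p = 0.8, rejected at p = 0.2: the loss is small.
print(odds_ratio_loss(math.log(0.8), math.log(0.2)))
# Equal probabilities: the loss equals log(2).
print(odds_ratio_loss(math.log(0.5), math.log(0.5)))
```

In ORPO-style training a term like this is typically added to the usual language-modeling loss on the preferred response, so the model learns the preferred output while being pushed away from the rejected one.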

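[NLP-53] Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

[Quick read]: This paper addresses the weakness of existing large language model (LLM) agent systems in scenarios that require foresight and autonomous decision-making. The key is a data-driven approach: real-world human activity data is collected and annotated to train a reward model that simulates human judgment and evaluates the proactiveness of LLM agents. This is then used to build a diverse dataset, ProactiveBench, and fine-tuning on it significantly improves proactiveness. The fine-tuned model reaches an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models.

Link: https://arxiv.org/abs/2410.12361
Authors: Yaxi Lu, Shenzhi Yang, Cheng Qian, Guirong Chen, Qinyu Luo, Yesai Wu, Huadong Wang, Xin Cong, Zhong Zhang, Yankai Lin, Weiwen Liu, Yasheng Wang, Zhiyuan Liu, Fangming Liu, Maosong Sun
Keywords: shown remarkable abilities, solving complex tasks, large language models, powered by large, large language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures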

Abstract:Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.
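
For readers unfamiliar with the metric behind the reported 66.47%, the F1-Score is the harmonic mean of precision and recall. The sketch below computes it for binary accepted/rejected labels. How the paper maps agent proposals to labels is not described in the abstract, so the boolean encoding and the example data are assumptions made for illustration.

```python
def f1_score(gold, predicted):
    """F1 for binary labels, where True marks the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: True means "assistance should be offered / was offered".
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
print(f1_score(gold, predicted))  # precision = recall = 2/3, so F1 is about 0.667
```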

[NLP-54] GECTurk WEB: An Explainable Online Platform for Turkish Grammatical Error Detection and Correction

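[Quick read]: This paper addresses the fact that, for Turkish, a morphologically rich language with complex writing rules, existing grammatical error detection/correction tools focus mainly on spelling rather than grammatical errors and lack web interfaces, error explanations, and feedback mechanisms. The key is GECTurk WEB, a light, open-source, and flexible web-based system that detects and corrects the most common Turkish writing errors, including the misuse of diacritics, compound and foreign words, pronouns, and light verbs, along with spelling mistakes. Besides detection and correction, the system helps users learn and remember grammatical rules by showing explanations of the violated rules.

Link: https://arxiv.org/abs/2410.12350
Authors: Ali Gebeşçe, Gözde Gül Şahin
Keywords: English and Chinese, Sophisticated grammatical error, Sophisticated grammatical, grammatical error detection, small set
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: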

Abstract:Sophisticated grammatical error detection/correction tools are available for a small set of languages such as English and Chinese. However, it is not straightforward – if not impossible – to adapt them to morphologically rich languages with complex writing rules like Turkish which has more than 80 million speakers. Even though several tools exist for Turkish, they primarily focus on spelling errors rather than grammatical errors and lack features such as web interfaces, error explanations and feedback mechanisms. To fill this gap, we introduce GECTurk WEB, a light, open-source, and flexible web-based system that can detect and correct the most common forms of Turkish writing errors, such as the misuse of diacritics, compound and foreign words, pronouns, light verbs along with spelling mistakes. Our system provides native speakers and second language learners an easily accessible tool to detect/correct such mistakes and also to learn from their mistakes by showing the explanation for the violated rule(s). The proposed system achieves a system usability score of 88.3, and is shown to help learn/remember a grammatical rule (confirmed by 80% of the participants). The GECTurk WEB is available both as an offline tool at this https URL or online at this http URL.

[NLP-55] A linguistic analysis of undesirable outcomes in the era of generative AI

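[Quick read]: This paper addresses "model collapse" in generative AI: as a model is fine-tuned over multiple generations on its own generated content, the lexical richness and diversity of its output decline and its linguistic patterns become distorted. The paper finds that carefully choosing and curating the initial input text can alleviate model collapse, and it stresses the importance of preserving linguistic patterns in generated content. A qualitative analysis of the fine-tuned models further shows that models in a self-consuming loop may produce inaccurate and biased information.

Link: https://arxiv.org/abs/2410.12341
Authors: Daniele Gambetta, Gizem Gezici, Fosca Giannotti, Dino Pedreschi, Alistair Knott, Luca Pappalardo
Keywords: Recent research, generated content, posing scientific, research has focused, medium and long-term
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: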

Abstract:Recent research has focused on the medium and long-term impacts of generative AI, posing scientific and societal challenges mainly due to the detection and reliability of machine-generated information, which is projected to form the major content on the Web soon. Prior studies show that LLMs exhibit a lower performance in generation tasks (model collapse) as they undergo a fine-tuning process across multiple generations on their own generated content (self-consuming loop). In this paper, we present a comprehensive simulation framework built upon the chat version of LLama2, focusing particularly on the linguistic aspects of the generated content, which has not been fully examined in existing studies. Our results show that the model produces less lexically rich content across generations, reducing diversity. The lexical richness has been measured using the linguistic measures of entropy and TTR as well as calculating the POSTags frequency. The generated content has also been examined with an n-gram analysis, which takes into account the word order, and semantic networks, which consider the relation between different words. These findings suggest that the model collapse occurs not only by decreasing the content diversity but also by distorting the underlying linguistic patterns of the generated text, which both highlight the critical importance of carefully choosing and curating the initial input text, which can alleviate the model collapse problem. Furthermore, we conduct a qualitative analysis of the fine-tuned models of the pipeline to compare their performances on generic NLP tasks to the original model. We find that autophagy transforms the initial model into a more creative, doubtful and confused one, which might provide inaccurate answers and include conspiracy theories in the model responses, spreading false and biased information on the Web.
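
Two of the lexical-richness measures named in the abstract, the type-token ratio (TTR) and entropy, have standard definitions that can be sketched in a few lines. The whitespace tokenization and the two example texts below are simplifying assumptions; the paper's own preprocessing is not described in the abstract.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens):
    """Entropy in bits of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


varied = "the cat sat on the mat".split()
repetitive = "the the the the the the".split()
print(type_token_ratio(varied), shannon_entropy(varied))          # higher on both measures
print(type_token_ratio(repetitive), shannon_entropy(repetitive))  # lower on both measures
```

Tracking both numbers across successive fine-tuning generations would show the kind of decline in diversity the abstract describes. Note that TTR depends on text length, so comparisons are only meaningful between samples of similar size.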

[NLP-56] Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

【速读】: 该论文试图解决多模态大语言模型(MLLMs)评估中对多模态推理能力的真实评估问题,以及大语言模型(LLM)背景知识对性能的影响。解决方案的关键在于引入了一种改进的评估协议,用于分离LLM背景知识与多模态整合的贡献,并采用自动知识识别技术诊断LLM是否具备处理多模态问题所需的知识。通过这些方法,研究揭示了当前基准测试中对多模态推理能力的评估不足,以及LLM背景知识缺乏导致的性能问题。为解决知识缺乏问题,论文提出了一种知识增强流程,显著提升了模型性能,最高达到60%的改进,从而实现了约4倍的性能提升。

Link: https://arxiv.org/abs/2410.12329
Authors: Botian Jiang,Lei Li,Xiaonan Li,Zhaowei Li,Xiachong Feng,Lingpeng Kong,Qi Liu,Xipeng Qiu
Keywords: Large Language Models, Multimodal Large Language, Large Language, underlying Large Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs equip the necessary knowledge for corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60% on certain datasets, resulting in an approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.
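Summary: The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by a range of benchmarks for evaluating their capabilities. However, it remains unclear what these evaluations really measure, and how far they assess multimodal reasoning rather than merely drawing on the underlying Large Language Model (LLM) backbone. This paper presents a comprehensive study of the role of LLM backbones in MLLM evaluation, focusing on two aspects: how far current benchmarks truly assess multimodal reasoning, and how LLM prior knowledge affects performance. Specifically, it introduces a modified evaluation protocol that separates the contribution of the LLM backbone from that of multimodal integration, and an automatic knowledge identification technique that diagnoses whether an LLM has the knowledge required for a given multimodal question. The study covers four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings show that some benchmarks allow high performance even without visual inputs, and that up to 50% of errors can be attributed to insufficient world knowledge in the LLM backbone, indicating heavy reliance on language capabilities. To address these knowledge deficiencies, the authors propose a knowledge augmentation pipeline that yields improvements of up to 60% on certain datasets, an approximately 4x increase in performance. The work offers insight into the role of the LLM backbone in MLLMs and highlights the need for more nuanced benchmarking.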
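
The idea of separating the backbone's contribution from multimodal integration can be illustrated by scoring the same questions with and without the image. This is a rough sketch, not the paper's protocol: the per-question correctness flags are made up, and `accuracy` and `backbone_share` are hypothetical helper names.

```python
def accuracy(flags):
    """Fraction of questions answered correctly."""
    return sum(flags) / len(flags)


def backbone_share(with_image, text_only):
    """Among questions solved with the image, the fraction also solved without it."""
    also_solved = [t for w, t in zip(with_image, text_only) if w]
    return sum(also_solved) / len(also_solved) if also_solved else 0.0


# Made-up correctness flags for five benchmark questions.
with_image = [True, True, True, False, True]
text_only = [True, False, True, False, False]

print("accuracy with image:", accuracy(with_image))
print("accuracy text only:", accuracy(text_only))
print("share solvable without vision:", backbone_share(with_image, text_only))
```

A high share suggests the benchmark rewards language ability more than visual reasoning, which is the concern the abstract raises.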

[NLP-57] Neuron-based Personality Trait Induction in Large Language Models

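[Quick read]: This paper aims to improve the ability of large language models (LLMs) to simulate personality traits. The key to the solution is a neuron-based personality trait induction method with three main technical contributions: 1) PersonalityBench, a dataset for identifying and evaluating personality traits in LLMs; 2) an efficient method for identifying personality-related neurons by examining the opposite aspects of a trait; 3) a simple yet effective induction method that manipulates the values of the identified neurons to give fine-grained control over the traits an LLM exhibits, without training or modifying model parameters. Experiments validate the method and show performance comparable to fine-tuned models, offering a more efficient and flexible solution for personality trait induction.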

Link: https://arxiv.org/abs/2410.12327
Authors: Jia Deng,Tianyi Tang,Yanbin Yin,Wenhao Yang,Wayne Xin Zhao,Ji-Rong Wen
Keywords: Large language models, supporting related applications, Large language, personality traits, related applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training and modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves comparable performance as fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at this https URL.
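Summary: Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for related applications such as role-playing. To improve this capability further, the paper presents a neuron-based approach to personality trait induction in LLMs, with three major technical contributions. First, it constructs PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. The dataset is grounded in the Big Five personality traits from psychology and is designed to assess how well LLMs generate text that expresses a specific trait. Second, using PersonalityBench, it proposes an efficient method for identifying personality-related neurons in LLMs by examining the opposite aspects of a given trait. Third, it develops a simple yet effective induction method that manipulates the values of the identified neurons. This gives fine-grained control over the traits an LLM exhibits without training or modifying model parameters. Extensive experiments validate both the neuron identification and the trait induction methods. Notably, the approach performs comparably to fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. All the mentioned resources are available at this https URL.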

[NLP-58] Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch Multi-Lingual and Two-Stage Approaches

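[Quick read]: This paper addresses how to optimize the training setup of large language models (LLMs) for low-resource languages so that a limited corpus is used effectively. The key to the solution is combining three methods, multi-epoch training, multi-lingual training and two-stage training, and the experiments yield the following findings: (1) as the target language corpus shrinks, the best training approach shifts from monolingual single-stage training to multi-lingual two-stage training, at a threshold that depends on the compute budget; (2) the optimal model scale stays stable regardless of the size of the target language corpus, so the compute-optimal scale of monolingual training can be used; (3) the optimal number of epochs can be determined by extrapolating from small-scale to large-scale experiments. The paper also provides evidence that, in single-stage training, the target language validation loss follows a power law in the target language ratio.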

Link: https://arxiv.org/abs/2410.12325
Authors: Kosuke Akimoto,Masafumi Oyamada
Keywords: Large Language Models, target language corpus, target language, Large Language, setups for Large
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 10 figures

Abstract:In this paper, we address the challenge of optimizing training setups for Large Language Models (LLMs) of low-resource language with a limited amount of corpus. Existing works adopt multi-epoch, multi-lingual, and two-stage training to utilize the limited target language corpus efficiently. However, there is still a lack of understanding about the optimal hyperparameter setups for combining these three approaches to train LLMs. We exhaustively explore training setups for low-resource language LLM, combining these three approaches, and found the following insights for efficiently reducing the cost of hyperparameter search: (1) As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training at a compute budget dependent threshold. (2) The optimal model scale remains stable regardless of the amount of target language corpus, allowing the use of the compute-optimal scale of monolingual training. (3) The optimal number of epochs can be extrapolated from smaller-scale experiments to larger scale using our proposed model. Also, we provide evidence that, in single-stage training, the target language validation loss follows a power law with respect to the target language ratio, with an exponent independent of the amount of data, model scale, and language pair.
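Summary: This paper addresses the challenge of optimizing training setups for Large Language Models (LLMs) in low-resource languages with a limited corpus. Existing work uses multi-epoch, multi-lingual and two-stage training to make efficient use of the limited target language corpus. However, the optimal hyperparameter setup for combining these three approaches is still poorly understood. The authors exhaustively explore training setups for low-resource language LLMs that combine the three approaches, and report the following insights for reducing the cost of hyperparameter search: (1) As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training, at a threshold that depends on the compute budget. (2) The optimal model scale remains stable regardless of the amount of target language corpus, so the compute-optimal scale of monolingual training can be used. (3) The optimal number of epochs can be extrapolated from smaller-scale to larger-scale experiments using the proposed model. The authors also provide evidence that, in single-stage training, the target language validation loss follows a power law with respect to the target language ratio, with an exponent independent of the amount of data, model scale and language pair.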
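
The power-law claim can be made concrete with a small fitting sketch. It assumes the simplest form, loss = a * ratio^(-b) with no additive constant, which may differ from the functional form used in the paper; the data points are synthetic and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(ratios, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(ratio); returns (a, b)."""
    xs = [math.log(r) for r in ratios]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic validation losses generated from a = 3.0, b = 0.2 (illustrative only).
ratios = [0.05, 0.1, 0.2, 0.4, 0.8]
losses = [3.0 * r ** -0.2 for r in ratios]

a, b = fit_power_law(ratios, losses)
print(f"a = {a:.3f}, exponent b = {b:.3f}")
```

If the exponent is indeed independent of data amount, model scale and language pair, as the abstract reports, a value of b fitted on small runs could be reused when predicting larger ones.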

[NLP-59] Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up

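[Quick read]: This paper addresses the limitations of large language models (LLMs) in mathematical and complex logical reasoning tasks. The key to the solution is a new framework called Reversal of Thought (RoT), which uses a preference-guided reverse reasoning warm-up strategy that combines meta-cognitive mechanisms with pairwise preference self-evaluation to generate task-specific prompts and thereby strengthen the logical reasoning of LLMs. RoT relies on reverse reasoning: a Cognitive Preference Manager assesses knowledge boundaries and aggregates solution logic for known tasks and stylistic templates for unknown tasks, extending the reasoning capabilities of LLMs. Experiments show that RoT outperforms existing baselines in both reasoning accuracy and efficiency.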

Link: https://arxiv.org/abs/2410.12323
Authors: Jiahao Yuan,Dehui Du,Hao Zhang,Zixiang Di,Usman Naseem
Keywords: Large language models, shown remarkable performance, Large language, language models, shown remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs’ logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs’ cognitive preferences shaped by Reinforcement Learning with Human Feedback (RLHF). Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs’ reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.
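Summary: Large language models (LLMs) have shown remarkable performance in reasoning tasks but remain limited in mathematical and complex logical reasoning. Existing methods for improving the logical capabilities of LLMs either rely on traceable or verifiable logical sequences, which produce more reliable responses by constructing logical structures but raise computational costs, or introduce rigid logic template rules, which reduce flexibility. This paper proposes Reversal of Thought (RoT), a new framework for enhancing the logical reasoning abilities of LLMs. RoT uses a Preference-Guided Reverse Reasoning warm-up strategy that integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and, through pairwise preference self-evaluation, generates task-specific prompts from demonstrations alone, in line with the cognitive preferences that RLHF has shaped in LLMs. Through reverse reasoning, a Cognitive Preference Manager assesses knowledge boundaries and further expands the reasoning capabilities of LLMs by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks show that RoT surpasses existing baselines in both reasoning accuracy and efficiency.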

[NLP-60] Open Domain Question Answering with Conflicting Contexts

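[Quick read]: This paper addresses the inaccurate answers that open domain question answering systems may give when the text collections they draw on contain conflicting information. The key to the solution is a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), used to assess the limitations of large language models (LLMs) when handling conflicting information, and a proposal to finetune LLMs to explain their answers, which brings richer information into training and guides the model's reasoning in conflicting contexts.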

Link: https://arxiv.org/abs/2410.12311
Authors: Siyi Liu,Qiang Ning,Kishaloy Halder,Wei Xiao,Zheng Qi,Phu Mon Htut,Yi Zhang,Neha Anna John,Bonan Min,Yassine Benajiba,Dan Roth
Keywords: systems frequently rely, answering systems frequently, question answering systems, Open domain, Open domain question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we request our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guide them through the process of reasoning with conflicting contexts.
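Summary: Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections often contain conflicting information, and depending on it indiscriminately may produce untruthful and inaccurate answers. To gauge the severity of this problem, the authors collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as many as 25% of unambiguous, open domain questions lead to conflicting contexts when retrieved with Google Search. They evaluate and benchmark three powerful Large Language Models (LLMs) on QACC and demonstrate their limitations in answering questions with conflicting information. To explore how humans reason through conflicting contexts, they ask annotators to explain their choice of correct answer. They show that finetuning LLMs to explain their answers introduces richer information into training, which guides the models through reasoning with conflicting contexts.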

[NLP-61] Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

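[Quick read]: This paper addresses the challenge of aligning the behavior of large language models (LLMs), and in particular the fact that existing intervention methods do not adapt to diverse input semantics. The key to the solution is Semantics-Adaptive Dynamic Intervention (SADI), which constructs a dynamic steering vector to intervene on model activations at inference time. SADI uses activation differences in contrastive pairs to identify the critical elements of an LLM (attention heads, hidden states and neurons), and at inference time adjusts activations dynamically according to the direction of the input semantics, improving task performance substantially without additional training.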

Link: https://arxiv.org/abs/2410.12299
Authors: Weixuan Wang,Jingyuan Yang,Wei Peng
Keywords: Large language models, Large language, behaviors remains challenging, desired behaviors remains, achieved remarkable performance
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations, lacking adaptability to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene model activations at inference time. More specifically, SADI utilizes activation differences in contrastive pairs to precisely identify critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI’s cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique. In addition, we release the code to foster research along this line:this https URL.
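Summary: Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical way to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods use only a fixed steering vector to modify model activations and so cannot adapt to diverse input semantics. To address this limitation, the authors propose Semantics-Adaptive Dynamic Intervention (SADI), a new method that constructs a dynamic steering vector to intervene on model activations at inference time. More specifically, SADI uses activation differences in contrastive pairs to identify the critical elements of an LLM (i.e., attention heads, hidden states and neurons) for targeted intervention. During inference, SADI steers model behavior dynamically by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. Its cost-effectiveness and its generalizability across LLM backbones and tasks highlight its potential as a versatile alignment technique. The code is released to foster further research: this https URL.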
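
The abstract does not give SADI's formulas, so the following is only a rough sketch of the general idea it describes: use contrastive activation differences to pick which elements to intervene on, then scale those elements of the current input's own activations. The mask-and-scale rule, the names `mean_difference`, `top_k_mask` and `steer`, the strength `delta`, and the toy activations are all assumptions, not the released implementation.

```python
def mean_difference(pos_acts, neg_acts):
    """Element-wise mean of (positive - negative) activations over contrastive pairs."""
    n, dim = len(pos_acts), len(pos_acts[0])
    return [sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n for i in range(dim)]


def top_k_mask(diff, k):
    """Binary mask selecting the k elements with the largest absolute mean difference."""
    top = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)[:k]
    return [1.0 if i in top else 0.0 for i in range(len(diff))]


def steer(activation, mask, delta):
    """Apply a' = a + delta * (a * mask); the shift depends on the input's activation."""
    return [a + delta * a * m for a, m in zip(activation, mask)]


# Toy activations for two contrastive pairs (illustrative values only).
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.5], [0.0, 0.0, 1.5]]

mask = top_k_mask(mean_difference(pos, neg), k=1)
print(steer([0.5, -1.0, 2.0], mask, delta=2.0))
```

Because the added term is proportional to the current activation, two different inputs receive different shifts, in contrast to adding one fixed steering vector to every input.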

[NLP-62] Pyramid-Driven Alignment: Pyramid Principle Guided Integration of Large Language Models and Knowledge Graphs

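[Quick read]: This paper addresses the tendency of large language models (LLMs) to generate incorrect information (hallucinations). The key to the solution is a new framework called Pyramid-Driven Alignment (PDA), which uses Pyramid Principle analysis to build a hierarchical pyramid structure that reflects the input question and generates more validated deductive knowledge, improving the alignment between LLMs and knowledge graphs (KGs) and ensuring tighter integration. PDA also uses a recursive mechanism to exploit the reasoning capabilities inherent in KGs, giving more accurate knowledge retrieval in question-answering tasks.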

Link: https://arxiv.org/abs/2410.12298
Authors: Lei Sun,Xinchen Wang,Youdi Li
Keywords: Large Language Models, Large Language, Language Models, generating incorrect information, possess impressive reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) possess impressive reasoning abilities but are prone to generating incorrect information, often referred to as hallucinations. While incorporating external Knowledge Graphs (KGs) can partially mitigate this issue, existing methods primarily treat KGs as static knowledge repositories, overlooking the critical disparity between KG and LLM knowledge, and failing to fully exploit the reasoning capabilities inherent in KGs. To address these limitations, we propose Pyramid-Driven Alignment (PDA), a novel framework for seamlessly integrating LLMs with KGs. PDA utilizes Pyramid Principle analysis to construct a hierarchical pyramid structure. This structure is designed to reflect the input question and generate more validated deductive knowledge, thereby enhancing the alignment of LLMs and KGs and ensuring more cohesive integration. Furthermore, PDA employs a recursive mechanism to harness the underlying reasoning abilities of KGs, resulting in more accurate knowledge retrieval for question-answering tasks. Our experimental results reveal a substantial performance advantage of PDA over state-of-the-art baselines, with improvements reaching 26.70% and 26.78%.
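Summary: Large Language Models (LLMs) have impressive reasoning abilities but are prone to generating incorrect information, often referred to as hallucinations. Incorporating external Knowledge Graphs (KGs) can partly mitigate this issue, but existing methods mainly treat KGs as static knowledge repositories, overlook the critical disparity between KG and LLM knowledge, and fail to fully exploit the reasoning capabilities inherent in KGs. To address these limitations, the authors propose Pyramid-Driven Alignment (PDA), a new framework for seamlessly integrating LLMs with KGs. PDA uses Pyramid Principle analysis to construct a hierarchical pyramid structure. The structure is designed to reflect the input question and generate more validated deductive knowledge, which improves the alignment of LLMs and KGs and ensures more cohesive integration. PDA also uses a recursive mechanism to harness the underlying reasoning abilities of KGs, giving more accurate knowledge retrieval for question-answering tasks. Experimental results show a substantial performance advantage of PDA over state-of-the-art baselines, with improvements reaching 26.70% and 26.78%.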

[NLP-63] Towards LLM-based Cognitive Models of Students with Misconceptions

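[Quick read]: This paper addresses how to model student cognition accurately in order to develop effective AI-driven educational technologies. The key to the solution is creating student models that satisfy two essential properties at once: (1) accurately replicating specific misconceptions; (2) correctly solving problems where those misconceptions do not apply. The paper introduces the MalAlgoPy library, which uses a graph-based representation to generate datasets reflecting authentic student solution patterns, and defines and studies instruction-tuned large language models (LLMs) as Cognitive Student Models (CSMs). It finds that carefully calibrating the ratio of correct to misconception examples in the training data (sometimes as low as 0.25) makes it possible to develop CSMs that satisfy both properties, paving the way for effective adaptive learning systems.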

Link: https://arxiv.org/abs/2410.12294
Authors: Shashank Sonkar,Xinghe Chen,Naiming Liu,Richard G. Baraniuk,Mrinmaya Sachan
Keywords: AI-driven educational technologies, Accurately modeling student, modeling student cognition, Accurately modeling, developing effective AI-driven
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Accurately modeling student cognition is crucial for developing effective AI-driven educational technologies. A key challenge is creating realistic student models that satisfy two essential properties: (1) accurately replicating specific misconceptions, and (2) correctly solving problems where these misconceptions are not applicable. This dual requirement reflects the complex nature of student understanding, where misconceptions coexist with correct knowledge. This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement and effectively simulate student thinking in algebra. We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns through a graph-based representation of algebraic problem-solving. Utilizing MalAlgoPy, we define and examine Cognitive Student Models (CSMs) - LLMs instruction tuned to faithfully emulate realistic student behavior. Our findings reveal that LLMs trained on misconception examples can efficiently learn to replicate errors. However, the training diminishes the model’s ability to solve problems correctly, particularly for problem types where the misconceptions are not applicable, thus failing to satisfy second property of CSMs. We demonstrate that by carefully calibrating the ratio of correct to misconception examples in the training data - sometimes as low as 0.25 - it is possible to develop CSMs that satisfy both properties. Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
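Summary: Accurately modeling student cognition is crucial for developing effective AI-driven educational technologies. A key challenge is creating realistic student models that satisfy two essential properties: (1) accurately replicating specific misconceptions, and (2) correctly solving problems where these misconceptions do not apply. This dual requirement reflects the complex nature of student understanding, in which misconceptions coexist with correct knowledge. The paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement and simulate student thinking in algebra. It introduces MalAlgoPy, a Python library that generates datasets reflecting authentic student solution patterns through a graph-based representation of algebraic problem-solving. Using MalAlgoPy, the authors define and examine Cognitive Student Models (CSMs), LLMs instruction-tuned to emulate realistic student behavior faithfully. They find that LLMs trained on misconception examples learn to replicate errors efficiently. However, this training reduces the model's ability to solve problems correctly, particularly for problem types where the misconceptions do not apply, so the second property of CSMs is not satisfied. They show that by carefully calibrating the ratio of correct to misconception examples in the training data, sometimes as low as 0.25, it is possible to develop CSMs that satisfy both properties. These insights improve the understanding of AI-based student models and pave the way for effective adaptive learning systems.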
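
Calibrating the ratio of correct to misconception examples amounts to a data-mixing step before instruction tuning. The sketch below shows one simple way to build such a mixture; it does not use MalAlgoPy, whose API is not described here, and `mix_examples` and the placeholder example strings are hypothetical.

```python
import random


def mix_examples(correct, misconception, ratio, seed=0):
    """Keep every misconception example and add round(ratio * len(misconception)) correct ones."""
    rng = random.Random(seed)
    n_correct = min(len(correct), round(ratio * len(misconception)))
    mixed = rng.sample(correct, n_correct) + list(misconception)
    rng.shuffle(mixed)
    return mixed


# Placeholder training examples (stand-ins for real problem/solution pairs).
correct = [f"correct-{i}" for i in range(100)]
misconception = [f"misconception-{i}" for i in range(40)]

mixed = mix_examples(correct, misconception, ratio=0.25)
print(len(mixed), sum(x.startswith("correct") for x in mixed))
```

With a ratio of 0.25, the mixture holds one correct example for every four misconception examples, matching the lowest ratio the abstract mentions.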

[NLP-64] How much do contextualized representations encode long-range context?

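[Quick read]: This paper studies how neural autoregressive language models represent long-range context, in particular context spanning several thousand tokens. The key to the approach is a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity, which capture the degree of contextualization of long-range patterns from the perspective of representation geometry. By comparing standard decoder-only Transformers with models that use other recent architectures and training configurations, the study finds differences in how models handle high-complexity sequences, and in how efficiently fully recurrent and hybrid models encode the structure of the whole sequence, pointing to potential directions for improving existing language models.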

Link: https://arxiv.org/abs/2410.12292
Authors: Simeng Sun,Cheng-Ping Hsieh
Keywords: analyze contextual representations, neural autoregressive language, Anisotropy-Calibrated Cosine Similarity, emphasizing long-range contexts, thousand tokens
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 9 figures

Abstract: We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens. Our methodology employs a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. We begin the analysis with a case study on standard decoder-only Transformers, demonstrating that similar perplexity can exhibit markedly different downstream task performance, which can be explained by the difference in contextualization of long-range content. Next, we extend the analysis to other models, covering recent novel architectural designs and various training configurations. The representation-level results illustrate a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures, and that fully recurrent models rely heavily on local context, whereas hybrid models more effectively encode the entire sequence structure. Finally, preliminary analysis of model size and training configurations on the encoding of long-range context suggests potential directions for improving existing language models.
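Summary: The paper analyzes contextual representations in neural autoregressive language models, with an emphasis on long-range contexts that span several thousand tokens. The methodology uses a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. The analysis begins with a case study on standard decoder-only Transformers, showing that models with similar perplexity can differ markedly in downstream task performance, which is explained by differences in how they contextualize long-range content. It then extends to other models, covering recent architectural designs and various training configurations. The representation-level results show a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures, and show that fully recurrent models rely heavily on local context, whereas hybrid models encode the entire sequence structure more effectively. Finally, a preliminary analysis of how model size and training configuration affect the encoding of long-range context suggests potential directions for improving existing language models.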
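
The abstract names the metric but does not define it. One common way to calibrate cosine similarity for anisotropy, assumed here and not confirmed as the paper's definition, is to subtract the average cosine similarity between unrelated representations from the raw similarity of the pair of interest. The function names and toy vectors below are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def anisotropy_baseline(reference_vectors):
    """Mean cosine similarity over all pairs of unrelated reference representations."""
    pairs = list(combinations(reference_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def calibrated_similarity(original, perturbed, reference_vectors):
    """Raw cosine similarity minus the anisotropy baseline (assumed calibration)."""
    return cosine(original, perturbed) - anisotropy_baseline(reference_vectors)


# Toy representations: the same position before and after perturbing distant context.
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(calibrated_similarity([1.0, 0.0], [1.0, 0.0], reference))
```

Under this assumed definition, a calibrated score near zero means the perturbed representation is no closer to the original than two unrelated representations would be, that is, the distant context strongly shaped it.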

[NLP-65] A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning NEURIPS2024

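[Quick read]: This paper addresses the lack of generality and knowledge transfer in existing knowledge graph (KG) reasoning models across different KGs and reasoning settings. The key to the solution is a prompt-based KG foundation model that achieves universal reasoning ability through in-context learning. Specifically, the paper introduces a prompt graph centered on a query-related example fact to understand the query relation, maps the entities and relations in the prompt graph to predefined tokens with a unified tokenizer, and then uses message passing neural networks for prompt encoding and KG reasoning. The method is evaluated on 43 different KGs in both transductive and inductive settings, and the results show strong generalization and universal reasoning capabilities.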

Link: https://arxiv.org/abs/2410.12288
Authors: Yuanning Cui,Zequn Sun,Wei Hu
Keywords: facilitate knowledge-driven tasks, Extensive knowledge graphs, Extensive knowledge, constructed to facilitate, facilitate knowledge-driven
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Extensive knowledge graphs (KGs) have been constructed to facilitate knowledge-driven tasks across various scenarios. However, existing work usually develops separate reasoning models for different KGs, lacking the ability to generalize and transfer knowledge across diverse KGs and reasoning settings. In this paper, we propose a prompt-based KG foundation model via in-context learning, namely KG-ICL, to achieve a universal reasoning ability. Specifically, we introduce a prompt graph centered with a query-related example fact as context to understand the query relation. To encode prompt graphs with the generalization ability to unseen entities and relations in queries, we first propose a unified tokenizer that maps entities and relations in prompt graphs to predefined tokens. Then, we propose two message passing neural networks to perform prompt encoding and KG reasoning, respectively. We conduct evaluation on 43 different KGs in both transductive and inductive settings. Results indicate that the proposed KG-ICL outperforms baselines on most datasets, showcasing its outstanding generalization and universal reasoning capabilities. The source code is accessible on GitHub: this https URL.
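Summary: Extensive knowledge graphs (KGs) have been constructed to support knowledge-driven tasks across various scenarios. However, existing work usually develops a separate reasoning model for each KG and cannot generalize or transfer knowledge across diverse KGs and reasoning settings. This paper proposes a prompt-based KG foundation model built on in-context learning, named KG-ICL, to achieve a universal reasoning ability. Specifically, it introduces a prompt graph centered on a query-related example fact, which serves as context for understanding the query relation. To encode prompt graphs in a way that generalizes to unseen entities and relations in queries, the authors first propose a unified tokenizer that maps entities and relations in prompt graphs to predefined tokens. They then propose two message passing neural networks, one for prompt encoding and one for KG reasoning. The evaluation covers 43 different KGs in both transductive and inductive settings. The results indicate that KG-ICL outperforms baselines on most datasets, demonstrating strong generalization and universal reasoning capabilities. The source code is accessible on GitHub: this https URL.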

[NLP-66] Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting

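[Quick read]: This paper addresses the lack of evaluation of current explainable AI (XAI) methods with real users in practical use. The key to the approach is a large-scale user study that assesses how different types of explanation (visual explanations, natural language explanations, and their combination) affect user decisions in healthcare, specifically chest X-ray analysis. The study focuses on how the correctness of the explanations matches the correctness of the AI advice, and finds that text explanations tend to cause over-reliance, which combining them with visual explanations (saliency maps) alleviates.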

Link: https://arxiv.org/abs/2410.12284
Authors: Maxime Kayser,Bayar Menzat,Cornelius Emde,Bogdan Bercean,Alex Novak,Abdala Espinosa,Bartlomiej W. Papiez,Susanne Gaube,Thomas Lukasiewicz,Oana-Maria Camburu
Keywords: including in safety-critical, safety-critical domains, growing capabilities, explanations, models are leading
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, we conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. We evaluated three types of explanations: visual explanations (saliency maps), natural language explanations, and a combination of both modalities. We specifically examined how different explanation types influence users depending on whether the AI advice and explanations are factually correct. We find that text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. We also observe that the quality of explanations, that is, how much factually correct information they entail, and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.
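Summary: The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, the authors conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. They evaluated three types of explanation: visual explanations (saliency maps), natural language explanations, and a combination of both. They specifically examined how the different explanation types influence users depending on whether the AI advice and explanations are factually correct. Text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. The quality of explanations, that is, how much factually correct information they contain and how well this aligns with AI correctness, significantly affects the usefulness of the different explanation types.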

[NLP-67] Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

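[Quick read]: This paper addresses the automatic generation of task-specific synthetic datasets for hallucination detection. The key to the solution is a two-step generation-selection pipeline that combines hallucination pattern guidance with language style alignment. Hallucination pattern guidance draws on the most important task-specific hallucination patterns, while language style alignment keeps the style of the synthetic dataset consistent with benchmark text. A data mixture strategy further improves robustness and generalization. Experiments show that hallucination detectors trained on the synthetic datasets generalize significantly better than detectors based on in-context learning.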

Link: https://arxiv.org/abs/2410.12278
Authors: Yong Xie,Karan Aggarwal,Aitzaz Ahmad,Stephen Lau
Keywords: automatically generate non-trivial, generate non-trivial task-specific, automatically generate, generate non-trivial, non-trivial task-specific synthetic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and a language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text versus baselines, to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.
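Summary: The paper presents a new approach for automatically generating non-trivial, task-specific synthetic datasets for hallucination detection. The approach uses a two-step generation-selection pipeline, with hallucination pattern guidance and language style alignment applied during generation. Hallucination pattern guidance draws on the most important task-specific hallucination patterns, while language style alignment matches the style of the synthetic dataset to benchmark text. To obtain robust supervised detectors from synthetic datasets, the authors also adopt a data mixture strategy that improves robustness and generalization. Results on three datasets show that the generated hallucinated text is more closely aligned with non-hallucinated text than the text produced by baselines, which allows hallucination detectors with better generalization to be trained. Detectors trained on the synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Extensive experiments confirm the benefits of the approach in cross-task and cross-generator generalization. Data-mixture-based training further improves the generalization and robustness of hallucination detection.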

[NLP-68] Kallini et al. (2024) do not compare impossible languages with constituency-based ones

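[Quick read]: This paper examines whether large language models (LLMs) can distinguish "possible" from "impossible" human languages. The key to the argument is identifying and removing confounds in the experiments, so that the results accurately reflect whether the inductive biases of LLMs align with what is possible for human languages. Specifically, the paper points out that the experiments of Kallini et al. contain a confound that makes their conclusion unreliable, and suggests ways to improve the experimental design to test the question more accurately.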

Link: https://arxiv.org/abs/2410.12271
Authors: Tim Hunter
Keywords: developing human child, typically developing human, Impossible Language Models, human languages, linguistic theory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:A central goal of linguistic theory is to find a precise characterization of the notion “possible human language”, in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn “impossible” human languages. Kallini et al. (2024; “Mission: Impossible Language Models”, Proc. ACL) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs’ inductive biases align with what is regarded as “possible” for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. In this paper I explain the confound and suggest some ways forward towards constructing a comparison that appropriately tests the underlying issue.
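Summary: A central goal of linguistic theory is to find a precise characterization of the notion "possible human language", in the form of a computational device capable of describing all and only the languages that a typically developing human child can acquire. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs are computational devices that meet this goal. That would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn "impossible" human languages. Kallini et al. (2024; "Mission: Impossible Language Models", Proc. ACL) ran experiments to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that the inductive biases of LLMs align with what is regarded as "possible" for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. The paper explains the confound and suggests some ways forward towards constructing a comparison that appropriately tests the underlying issue.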

[NLP-69] An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

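[Quick read]: This paper addresses the high cost, limited test formats, reliance on human annotation and systematic evaluation biases that affect the evaluation of large language models (LLMs). The key to the solution is Auto-PRE, an automatic LLM evaluation framework based on peer review. Unlike earlier work that relies on human annotation, Auto-PRE automatically selects evaluator LLMs according to inherent traits such as consistency, self-confidence and pertinence, which lowers evaluation cost and improves efficiency. Experiments show that Auto-PRE reaches state-of-the-art performance on three tasks, summary generation, non-factoid question answering and dialogue generation, and highlight the effect of prompt strategies and evaluation formats on evaluation performance, offering guidance for future method optimization.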

Link: https://arxiv.org/abs/2410.12265
Authors: Junjie Chen,Weihang Su,Zhumin Chu,Haitao Li,Qinyao Ai,Yiqun Liu,Min Zhang,Shaoping Ma
Keywords: large language models, important research question, language models, research question, rapid development
Categories: Computation and Language (cs.CL)
Comments:

Abstract:With the rapid development of large language models (LLMs), how to efficiently evaluate them has become an important research question. Existing evaluation methods often suffer from high costs, limited test formats, the need of human references, and systematic evaluation biases. To address these limitations, our study introduces the Auto-PRE, an automatic LLM evaluation framework based on peer review. In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluator LLMs automatically based on their inherent traits including consistency, self-confidence, and pertinence. We conduct extensive experiments on three tasks: summary generation, non-factoid question-answering, and dialogue generation. Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost. Moreover, our study highlights the impact of prompt strategies and evaluation formats on evaluation performance, offering guidance for method optimization in the future.
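
The evaluator-selection idea can be illustrated with a small sketch. It covers only one of the three traits named in the abstract (consistency), and it assumes a position-swap check: each candidate evaluator judges the same answer pair twice with the order swapped, and evaluators that agree with themselves often enough are kept. The function names, the threshold of 0.8, and the toy data are hypothetical and are not taken from the Auto-PRE paper.

```python
def consistency_rate(judgments):
    """Share of answer pairs judged identically in both presentation orders.

    judgments: list of (verdict_original_order, verdict_swapped_order), where
    each verdict names the preferred answer by identity ("A" or "B").
    """
    if not judgments:
        return 0.0
    agree = sum(1 for first, second in judgments if first == second)
    return agree / len(judgments)


def select_evaluators(candidates, threshold=0.8):
    """Keep candidate evaluators whose consistency rate meets the threshold."""
    return sorted(
        name
        for name, judgments in candidates.items()
        if consistency_rate(judgments) >= threshold
    )


# Hypothetical judgments from two candidate evaluator models.
candidates = {
    "model_x": [("A", "A"), ("B", "B"), ("A", "A"), ("B", "B")],
    "model_y": [("A", "B"), ("B", "B"), ("A", "B"), ("B", "A")],
}
selected = select_evaluators(candidates)
```

Here `model_x` never changes its verdict when the order is swapped, while `model_y` does so in three of four cases, so only `model_x` would be retained as a reviewer under this assumed criterion.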

[NLP-70] CoFE-RAG: A Comprehensive Full-chain Evaluation Framework for Retrieval-Augmented Generation with Enhanced Data Diversity

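[Quick Read]: This paper addresses three main problems in evaluating retrieval-augmented generation (RAG) systems: limited data diversity, difficulty locating the stage where a problem occurs, and unstable retrieval evaluation. The key to the solution is a Comprehensive Full-chain Evaluation framework (CoFE-RAG), which introduces multi-granularity keywords (coarse-grained and fine-grained) to assess the retrieved context, so that evaluation no longer depends on annotated golden chunks. The paper also releases a holistic benchmark dataset covering a wide range of document formats and query types to support evaluation of every stage of a RAG system. This gives researchers a more nuanced understanding of the capabilities and limitations of RAG systems in diverse data scenarios.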

Link: https://arxiv.org/abs/2410.12248
Authors: Jintao Liu, Ruixue Ding, Linhao Zhang, Pengjun Xie, Fie Huang
Keywords: large language models, enhance large language, external knowledge sources, RAG systems, Unstable retrieval evaluation
Categories: Computation and Language (cs.CL)

Abstract:Retrieval-Augmented Generation (RAG) aims to enhance large language models (LLMs) to generate more accurate and reliable answers with the help of the retrieved context from external knowledge sources, thereby reducing the incidence of hallucinations. Despite the advancements, evaluating these systems remains a crucial research area due to the following issues: (1) Limited data diversity: The insufficient diversity of knowledge sources and query types constrains the applicability of RAG systems; (2) Obscure problems location: Existing evaluation methods have difficulty in locating the stage of the RAG pipeline where problems occur; (3) Unstable retrieval evaluation: These methods often fail to effectively assess retrieval performance, particularly when the chunking strategy changes. To tackle these challenges, we propose a Comprehensive Full-chain Evaluation (CoFE-RAG) framework to facilitate thorough evaluation across the entire RAG pipeline, including chunking, retrieval, reranking, and generation. To effectively evaluate the first three phases, we introduce multi-granularity keywords, including coarse-grained and fine-grained keywords, to assess the retrieved context instead of relying on the annotation of golden chunks. Moreover, we release a holistic benchmark dataset tailored for diverse data scenarios covering a wide range of document formats and query types. We demonstrate the utility of the CoFE-RAG framework by conducting experiments to evaluate each stage of RAG systems. Our evaluation method provides unique insights into the effectiveness of RAG systems in handling diverse data scenarios, offering a more nuanced understanding of their capabilities and limitations.
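
A minimal sketch of keyword-based retrieval scoring may help make the idea concrete. It assumes that each query is annotated with coarse-grained and fine-grained keywords and that a simple case-insensitive substring match decides whether a keyword is covered; the function names, the matching rule, and the example data are illustrative assumptions, not the metrics defined in the CoFE-RAG paper.

```python
def keyword_recall(keywords, retrieved_chunks):
    """Fraction of keywords found (case-insensitively) in the retrieved text."""
    if not keywords:
        return 0.0
    text = " ".join(retrieved_chunks).lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


def score_retrieval(coarse_keywords, fine_keywords, retrieved_chunks):
    """Score retrieved context at two keyword granularities."""
    return {
        "coarse_recall": keyword_recall(coarse_keywords, retrieved_chunks),
        "fine_recall": keyword_recall(fine_keywords, retrieved_chunks),
    }


# Hypothetical query annotation and retrieval result.
chunks = [
    "The warranty covers battery defects for 24 months.",
    "Returns are accepted within 30 days.",
]
scores = score_retrieval(
    coarse_keywords=["warranty", "battery"],
    fine_keywords=["24 months", "proof of purchase"],
    retrieved_chunks=chunks,
)
```

Because the score is computed on the retrieved text rather than on chunk identifiers, this kind of measure needs no golden-chunk annotation, which is the property the abstract emphasizes.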

[NLP-71] EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

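[Quick Read]: This paper addresses the inference throughput bottleneck that computation efficiency and communication overhead create for mixture-of-experts (MoE) architectures in large language models (LLMs). The key to the solution is EPS-MoE, a novel expert pipeline scheduler that dynamically selects the best GroupGemm and DenseGemm kernel implementations and adaptively overlaps computation with all2all communication, which substantially raises inference throughput. Experiments show an average 21% improvement in prefill throughput over existing parallel inference methods; on DeepSeekV2 in particular, prefill throughput rises from 100K tokens per second to at least 120K tokens per second.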

Link: https://arxiv.org/abs/2410.12247
Authors: Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai
Keywords: Large Language Model, capabilities expanding rapidly, expanding rapidly due, increased computational resources, Large Language
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 13 pages, 14 figures

Abstract: Large Language Models (LLMs) have revolutionized the field of artificial intelligence, with their capabilities expanding rapidly due to advances in deep learning and increased computational resources. The mixture-of-experts (MoE) model has emerged as a prominent architecture in the field of LLMs, better balancing model performance and computational efficiency. The MoE architecture allows for effective scaling and efficient parallel processing, but the GEMM (General Matrix Multiply) of MoE and the large parameters introduce challenges in terms of computation efficiency and communication overhead, which become the throughput bottleneck during inference. Applying a single parallelism strategy such as EP, DP, or PP to an MoE architecture usually achieves sub-optimal inference throughput, and straightforward combinations of existing parallelisms on MoE cannot yet obtain optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that goes beyond the existing inference parallelism schemes. Our approach focuses on optimizing the computation of MoE FFN (FeedForward Network) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with all2all communication, leading to a substantial increase in throughput. Our experimental results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods. Specifically, we validated our method on DeepSeekV2, a highly optimized model claimed to achieve a prefill throughput of 100K tokens per second. By applying EPS-MoE, we further accelerated it to at least 120K tokens per second.
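
Why overlapping computation with communication raises throughput can be seen in a toy timing model. The sketch below is not the EPS-MoE scheduler; it assumes the work is split into equal chunks, each needing one compute step and one communication step, and that the communication of one chunk can run while the next chunk computes. The function names and the example durations are hypothetical.

```python
def sequential_time(n_chunks, compute, comm):
    """Total time when every chunk computes, then communicates, one by one."""
    return n_chunks * (compute + comm)


def overlapped_time(n_chunks, compute, comm):
    """Total time for a two-stage pipeline.

    While chunk i communicates, chunk i + 1 computes, so after the first
    compute step the pipeline advances at the pace of the slower stage.
    """
    if n_chunks == 0:
        return 0.0
    return compute + (n_chunks - 1) * max(compute, comm) + comm


# Hypothetical per-chunk durations in milliseconds.
baseline = sequential_time(4, compute=3.0, comm=2.0)
pipelined = overlapped_time(4, compute=3.0, comm=2.0)
speedup = baseline / pipelined
```

With these made-up numbers the sequential schedule takes 20 ms and the overlapped one 14 ms. For reference, the figures quoted in the abstract (100K to at least 120K tokens per second on DeepSeekV2) correspond to a gain of at least 20% on that model.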

[NLP-72] Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

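[Quick Read]: This paper addresses the inability of single-data-source models in personalized recommendation systems to fully capture the multifaceted nature of item features and user behaviors. The key to the solution is a new framework called Triple Modality Fusion (TMF), which fuses three modalities (visual, textual, and graph data) and uses large language models (LLMs) to align and integrate them. Specifically, TMF uses an LLM to model user behaviors and item features in natural language, and a modality fusion module based on cross-attention and self-attention integrates the different modalities into the same embedding space. The result is a comprehensive representation of user behavior and improved recommendation accuracy.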

Link: https://arxiv.org/abs/2410.12228
Authors: Luyi Ma, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan
Keywords: Integrating diverse data, Integrating diverse, personalized recommendation systems, diverse data modalities, crucial for enhancing
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of triple-modality, which is visual, textual, and graph data through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides insights into user interests and item features in detail; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model called Triple Modality Fusion (TMF) utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user’s interactions including behaviors and item features in natural languages. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and benefits of the TMF.

[NLP-73] On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation

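[Quick Read]: This paper addresses hallucination in natural language generation (NLG), where generated content is unfaithful to its source and can degrade data quality or erode user trust. The key to the solution is automated faithfulness evaluation: the authors design a rubric template and use large language models (LLMs) to score generations on a quantifiable scale. The paper compares several LLMs and natural language inference (NLI) models on scoring quality and sensitivity, and proposes methods for generating synthetic unfaithful data together with a heuristic for quantifying the percentage of hallucination. The results show that GPT-4 gives accurate judgements on whether a source and a generation are factually consistent, and that tuning NLI models on synthetic data can improve performance. The paper also discusses the latency and cost of deploying such a system.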

Link: https://arxiv.org/abs/2410.12222
Authors: Xiaonan Jing, Srinivas Billa, Danny Godbout
Keywords: NLG, popular topic, Abstract, generation, natural language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures

Abstract: Hallucination has been a popular topic in natural language generation (NLG). In real-world applications, unfaithful content can result in bad data quality or loss of trust from end users. Thus, it is crucial to fact-check before adopting NLG for production usage, which can be expensive if done manually. In this paper, we investigate automated faithfulness evaluation in guided NLG. We developed a rubric template and used large language models (LLMs) to score the generation on quantifiable scales. We compared popular LLMs as well as the widely adopted natural language inference (NLI) models in scoring quality and sensitivity. In addition, we developed methods to generate synthetic unfaithful data, as well as a heuristic to quantify the percentage of hallucination. Our results on four travel-domain industry datasets show that GPT-4 can provide accurate judgement and explanation on whether a source and a generation are factually consistent. Furthermore, we found that tuning NLI models on synthetic data can improve performance. Lastly, we present insights on latency and cost for deploying such a system.
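
As a rough illustration of turning a hallucination percentage into a 1-to-5 score, the sketch below uses a deliberately crude stand-in for the judge: a generated sentence counts as unsupported if it contains any word absent from the source. The paper uses LLM and NLI judges and its own heuristic, neither of which is reproduced here; the function names, the lexical check, and the linear mapping to the scale (5 meaning fully faithful) are all assumptions.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hallucination_ratio(source, generation):
    """Share of generated sentences containing a word absent from the source."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", generation) if s.strip()]
    if not sentences:
        return 0.0
    source_tokens = _tokens(source)
    unsupported = sum(1 for s in sentences if not _tokens(s) <= source_tokens)
    return unsupported / len(sentences)


def to_five_point_scale(ratio):
    """Map a ratio in [0, 1] to a score from 5 (faithful) down to 1."""
    return 5 - round(ratio * 4)


# Hypothetical travel-domain example.
source = "The hotel has a pool and free breakfast."
generation = "The hotel has a pool. The hotel has a spa."
ratio = hallucination_ratio(source, generation)
score = to_five_point_scale(ratio)
```

In this toy example one of the two generated sentences mentions a spa that the source never states, so half of the output is flagged and the score lands in the middle of the scale.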

[NLP-74] OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

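[Quick Read]: This paper addresses the fact that existing benchmarks cannot comprehensively assess the cross-modal understanding and reasoning of omni-modality language models (OLMs). The key to the solution is OmnixR, an evaluation suite with two variants: a synthetic subset and a realistic subset. The synthetic subset is generated automatically by translating text into multiple modalities (audio, images, video, and hybrids), while the realistic subset is a real-world dataset manually curated and annotated by experts for evaluating cross-modal reasoning in natural settings. By posing questions that involve several modalities at once, OmnixR provides a rigorous testbed for the cross-modal reasoning of OLMs.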

Link: https://arxiv.org/abs/2410.12219
Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong
Keywords: SoTA Omni-modality Language, Omni-modality Language Models, Omni-modality Language, benchmark SoTA Omni-modality, SoTA Omni-modality
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 19 pages, 6 figures, 12 tables

Abstract: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models (OLMs), such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Particularly, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single-modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1) a synthetic subset: a synthetic dataset generated automatically by translating text into multiple modalities: audio, images, video, and hybrids (Omnify); (2) a realistic subset: a real-world dataset, manually curated and annotated by experts, for evaluating cross-modal reasoning in natural settings. OmnixR presents a unique evaluation towards assessing OLMs over a diverse mix of modalities, such as a question that involves video, audio, and text, providing a rigorous cross-modal reasoning testbed unlike any existing benchmarks. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior, underscoring the challenges of omni-modal AI alignment.

[NLP-75] Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree

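[Quick Read]: This paper addresses how to predict the ratings given by individual annotators more accurately when annotators disagree. The key to the solution is three approaches, neural collaborative filtering (NCF), in-context learning (ICL), and an intermediate embedding-based architecture, which incorporate annotator history, demographics, and survey information to capture nuances that traditional label aggregation can overlook. NCF showed limited utility, while the embedding-based architecture achieved the best prediction accuracy. Demographics imputed from survey information perform comparably to true demographic data as features, which suggests that what demographics contribute to modeling ratings may already be captured by survey responses.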

Link: https://arxiv.org/abs/2410.12217
Authors: Harbani Jaggi, Kashyap Murali, Eve Fleisig, Erdem Bıyık
Keywords: traditional label aggregation, capture nuances overlooked, label aggregation, traditional label, capture nuances
Categories: Computation and Language (cs.CL)

Abstract:When annotators disagree, predicting the labels given by individual annotators can capture nuances overlooked by traditional label aggregation. We introduce three approaches to predicting individual annotator ratings on the toxicity of text by incorporating individual annotator-specific information: a neural collaborative filtering (NCF) approach, an in-context learning (ICL) approach, and an intermediate embedding-based architecture. We also study the utility of demographic information for rating prediction. NCF showed limited utility; however, integrating annotator history, demographics, and survey information permits both the embedding-based architecture and ICL to substantially improve prediction accuracy, with the embedding-based architecture outperforming the other methods. We also find that, if demographics are predicted from survey information, using these imputed demographics as features performs comparably to using true demographic data. This suggests that demographics may not provide substantial information for modeling ratings beyond what is captured in survey responses. Our findings raise considerations about the relative utility of different types of annotator information and provide new approaches for modeling annotators in subjective NLP tasks.

[NLP-76] Negative-Prompt-driven Alignment for Generative Language Model

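[Quick Read]: This paper addresses the challenge of aligning the outputs of large language models with human values and preferences, and in particular the fact that existing methods rely mainly on positive examples while overlooking the role of negative examples in steering models away from undesirable behavior. The key to the solution is NEAT (NEgative-prompt-driven AlignmenT), which uses negative prompts to generate undesirable responses during optimization and explicitly penalizes the model for producing harmful outputs. With this dual feedback from positive and negative examples, NEAT guides the model toward desirable behavior and away from undesirable or biased responses, which markedly improves alignment with human values and preferences.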

Link: https://arxiv.org/abs/2410.12194
Authors: Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
Keywords: achieved remarkable capabilities, Large language models, Large language, remarkable capabilities, significant challenge
Categories: Computation and Language (cs.CL)

Abstract: Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, widely used alignment datasets reveal a scarcity of explicit negative examples that contradict human values, hindering their ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT's effectiveness in significantly enhancing language models' alignment with human values and preferences.
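
The abstract mentions a ranking loss over positive and negative examples without giving its form. As a hedged illustration, the sketch below uses a standard pairwise logistic ranking loss, which is small when the response to the normal prompt scores higher than the response elicited by the negative prompt. The choice of this particular loss, the function name, and the scalar scores are assumptions; NEAT's actual objective may differ.

```python
import math


def pairwise_ranking_loss(score_positive, score_negative):
    """Pairwise logistic ranking loss: -log(sigmoid(s_pos - s_neg)).

    The loss shrinks as the positive response outscores the negative one.
    """
    margin = score_positive - score_negative
    return math.log1p(math.exp(-margin))


# Hypothetical model scores (for example, sequence log-likelihoods).
well_ranked = pairwise_ranking_loss(score_positive=2.0, score_negative=-1.0)
badly_ranked = pairwise_ranking_loss(score_positive=-1.0, score_negative=2.0)
tied = pairwise_ranking_loss(0.0, 0.0)
```

Minimizing a loss of this shape pushes probability mass toward the desirable response and away from the undesirable one at the same time, which matches the dual feedback described in the abstract.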

[NLP-77] Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish

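[Quick Read]: This paper addresses hate speech detection in Rioplatense Spanish. The key to the solution is a set of classification experiments with large language models (ChatGPT 3.5, Mixtral, and Aya) that use chain-of-thought reasoning, with results compared against a state-of-the-art BERT classifier. Although the large language models show lower precision than the fine-tuned BERT classifier and in some cases have difficulty with hard-to-get slurs or colloquialisms, they remain sensitive to highly nuanced cases, particularly homophobic and transphobic hate speech.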

Link: https://arxiv.org/abs/2410.12174
Authors: Juan Manuel Pérez, Paula Miguel, Viviana Cotik
Keywords: Large Language Models, Natural Language Processing, Hate speech, Large Language, speech detection deals
Categories: Computation and Language (cs.CL)

Abstract:Hate speech detection deals with many language variants, slang, slurs, expression modalities, and cultural nuances. This outlines the importance of working with specific corpora, when addressing hate speech within the scope of Natural Language Processing, recently revolutionized by the irruption of Large Language Models. This work presents a brief analysis of the performance of large language models in the detection of Hate Speech for Rioplatense Spanish. We performed classification experiments leveraging chain-of-thought reasoning with ChatGPT 3.5, Mixtral, and Aya, comparing their results with those of a state-of-the-art BERT classifier. These experiments outline that, even if large language models show a lower precision compared to the fine-tuned BERT classifier and, in some cases, they find hard-to-get slurs or colloquialisms, they still are sensitive to highly nuanced cases (particularly, homophobic/transphobic hate speech). We make our code and models publicly available for future research.

[NLP-78] Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning

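[Quick Read]: This paper proposes Table-LLM-Specialist (Table-Specialist for short), a new self-trained fine-tuning paradigm that addresses the lack of manually labeled data for table tasks. The key to the solution is the duality of each task, which has a generative version and a classification version: a Generator-Validator paradigm iteratively generates and then validates training data, which is used to fine-tune models that specialize in a given task without manually labeled data. The approach outperforms vanilla language models such as GPT-3.5 on diverse table tasks and can often match or surpass GPT-4-level quality, while lowering deployment cost and improving generalizability.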

Link: https://arxiv.org/abs/2410.12164
Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang, Surajit Chaudhuri
Keywords: self-trained fine-tuning paradigm, fine-tuning paradigm specifically, paradigm specifically designed, self-trained fine-tuning, specifically designed
Categories: Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

Abstract: In this work, we propose Table-LLM-Specialist, or Table-Specialist for short, as a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm to iteratively generate-then-validate training data from language models, to fine-tune stronger Table-Specialist models that can specialize in a given task, without requiring manually labeled data. Our extensive evaluations suggest that Table-Specialist has (1) strong performance on diverse table tasks over vanilla language models; for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4-level quality; (2) lower cost to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4-level quality, it becomes possible to deploy smaller models with lower latency and inference cost, with comparable quality; and (3) better generalizability when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code and data will be available at this https URL.
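
The generate-then-validate loop can be sketched with plain functions standing in for the two language-model roles. In the sketch below the generator proposes an answer for each table and the validator, which plays the classification-style dual of the task, accepts or rejects it; only accepted pairs become training data. The toy task (guessing a column maximum), the function names, and the omission of the fine-tuning step are all assumptions made for illustration.

```python
def generate_then_validate(tables, generator, validator, rounds=1):
    """Collect (table, answer) pairs that the validator accepts."""
    training_data = []
    for _ in range(rounds):
        for table in tables:
            candidate = generator(table)
            if validator(table, candidate):
                training_data.append((table, candidate))
        # In the setting described by the abstract, the models would be
        # fine-tuned on training_data here before the next round. This
        # sketch omits that step, so later rounds would repeat the first.
    return training_data


def toy_generator(column):
    """Stand-in generative task: guess that the last cell is the maximum."""
    return column[-1]


def toy_validator(column, answer):
    """Stand-in classification task: is the proposed maximum correct?"""
    return answer == max(column)


# Hypothetical single-column tables.
tables = [(1, 2, 3), (5, 1, 2)]
accepted = generate_then_validate(tables, toy_generator, toy_validator)
```

Only the first table yields a validated example here, because the generator's guess for the second one fails the check. Filtering of this kind is what lets the loop build training data without manual labels.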

[NLP-79] Exploiting LLMs Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval KR

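[Quick Read]: This paper addresses the weak performance of statutory law retrieval systems that rely on semantic and lexical correlations when queries describe real-life scenarios or use vocabulary that is not specific to the legal domain. The key to the solution is to use the logical reasoning capabilities of large language models (LLMs) to identify the legal terms and facts relevant to the situation described in the query, and to integrate this additional information through term-based expansion and query reformulation in order to improve retrieval accuracy. Experiments show that the extra knowledge from LLMs improves both lexical and semantic ranking models, and the final ensemble retrieval system outperforms the highest results among all participating teams in the COLIEE 2022 and 2023 competitions.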

Link: https://arxiv.org/abs/2410.12154
Authors: Hai-Long Nguyen, Tan-Minh Nguyen, Duc-Minh Nguyen, Thi-Hai-Yen Vuong, Ha-Thanh Nguyen, Xuan-Hieu Phan
Keywords: Statutory law retrieval, Statutory law, law engineering, legal language processing, practical applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

Abstract: Statutory law retrieval is a typical problem in legal language processing, which has various practical applications in law engineering. Modern deep learning-based retrieval methods have achieved significant results for this problem. However, retrieval systems relying on semantic and lexical correlations often exhibit limitations, particularly when handling queries that involve real-life scenarios or use vocabulary that is not specific to the legal domain. In this work, we focus on overcoming these weaknesses by utilizing the logical reasoning capabilities of large language models (LLMs) to identify relevant legal terms and facts related to the situation mentioned in the query. The proposed retrieval system integrates additional information from term-based expansion and query reformulation to improve retrieval accuracy. The experiments on the COLIEE 2022 and COLIEE 2023 datasets show that extra knowledge from LLMs helps to improve the retrieval results of both lexical and semantic ranking models. The final ensemble retrieval system outperformed the highest results among all participating teams in the COLIEE 2022 and 2023 competitions.

[NLP-80] Layer-of-Thoughts Prompting (LoT): Leveraging LLM-Based Retrieval with Constraint Hierarchies KR

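[Quick Read]: This paper addresses the fact that existing prompting techniques lack a detailed treatment of the hierarchical relationships among prompts in multi-turn interactions. The key to the solution is a new approach called Layer-of-Thoughts Prompting (LoT), which uses constraint hierarchies to filter and refine candidate responses, giving a structured retrieval process that enhances explainability and automation. By focusing on the hierarchical relationships among prompts, the approach improves the accuracy and comprehensibility of information retrieval tasks.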

Link: https://arxiv.org/abs/2410.12153
Authors: Wachara Fungwacharakorn, Nguyen Ha Thanh, May Myo Zin, Ken Satoh
Keywords: refine candidate responses, utilizes constraint hierarchies, approach termed, hierarchies to filter, filter and refine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

Abstract:This paper presents a novel approach termed Layer-of-Thoughts Prompting (LoT), which utilizes constraint hierarchies to filter and refine candidate responses to a given query. By integrating these constraints, our method enables a structured retrieval process that enhances explainability and automation. Existing methods have explored various prompting techniques but often present overly generalized frameworks without delving into the nuances of prompts in multi-turn interactions. Our work addresses this gap by focusing on the hierarchical relationships among prompts. We demonstrate that the efficacy of thought hierarchy plays a critical role in developing efficient and interpretable retrieval algorithms. Leveraging Large Language Models (LLMs), LoT significantly improves the accuracy and comprehensibility of information retrieval tasks.
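
Layered filtering with constraints can be pictured with a short sketch. It assumes that each layer is a list of predicates and that a candidate must satisfy every predicate in a layer to reach the next one; in LoT the checks would be posed to an LLM as prompts, whereas here they are ordinary Python functions. The function name, the two example layers, and the document fields are hypothetical, and real constraint hierarchies may also rank candidates by softer preferences rather than only filter them.

```python
def apply_layers(candidates, layers):
    """Filter candidates layer by layer and record who survives each layer."""
    survivors = list(candidates)
    trace = []
    for layer in layers:
        survivors = [c for c in survivors if all(check(c) for check in layer)]
        trace.append(list(survivors))
    return survivors, trace


# Hypothetical candidate documents for a legal query.
documents = [
    {"id": 1, "topic": "contract", "year": 2021},
    {"id": 2, "topic": "contract", "year": 2015},
    {"id": 3, "topic": "tort", "year": 2022},
]
layers = [
    [lambda d: d["topic"] == "contract"],  # layer 1: topical relevance
    [lambda d: d["year"] >= 2020],         # layer 2: recency
]
survivors, trace = apply_layers(documents, layers)
surviving_ids = [d["id"] for d in survivors]
```

The recorded `trace` shows which candidates were dropped at which layer, which is one simple way a layered process can support the explainability the abstract refers to.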

[NLP-81] Preference Optimization with Multi-Sample Comparisons

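[Quick Read]: This paper addresses the difficulty that existing post-training methods for generative models (such as RLHF and DAP) have in capturing characteristics like generative diversity and bias when they rely on single-sample comparisons. The key to the solution is multi-sample comparison: the paper proposes Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO), which improve on traditional DAP methods so that collective characteristics of a generative model (such as diversity and bias) are optimized more effectively, with greater robustness to label noise.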

Link: https://arxiv.org/abs/2410.12138
Authors: Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang
Keywords: Recent advancements, large language models, large language, driven by extensive, extensive pretraining
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: preprint

Abstract: Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
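
One way to picture a multi-sample preference loss is to start from the usual DPO form and replace the single chosen and rejected responses with groups. The sketch below averages the policy-to-reference log-ratios within each group before applying the logistic loss. This averaging, the function names, and the value of `beta` are assumptions for illustration; the exact mDPO objective should be taken from the paper.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def multi_sample_dpo_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Logistic preference loss on group-averaged log-ratios.

    Each entry is log pi(y | x) - log pi_ref(y | x) for one sampled response.
    With one sample per group this reduces to the standard DPO loss.
    """
    margin = beta * (_mean(chosen_logratios) - _mean(rejected_logratios))
    return math.log1p(math.exp(-margin))


# Hypothetical log-ratios for a preferred group and a dispreferred group.
neutral = multi_sample_dpo_loss([0.0, 0.0], [0.0, 0.0])
aligned = multi_sample_dpo_loss([2.0, 1.0], [-1.0, -2.0])
misaligned = multi_sample_dpo_loss([-1.0, -2.0], [2.0, 1.0])
```

Comparing groups rather than individual responses lets the training signal depend on properties that only show up across several samples, such as diversity, which is the motivation given in the abstract.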

[NLP-82] Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

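[Quick Read]: This paper addresses hallucination during inference in large language models (LLMs), which can produce factual inaccuracies, inconsistent information, and fabricated content and thereby raise security concerns. The key to the solution is a new method called Iterative Model-level Contrastive Learning (Iter-AHMCL). It modifies the representation layers of pre-trained LLMs using contrastive "positive" and "negative" models trained on data with and without hallucinations, exploits the differences between the two models to remove hallucinations, and improves further through iterative contrastive learning. Experiments show an average improvement of 10.1 points on the TruthfulQA benchmark, reducing hallucination while maintaining the general capabilities of the LLMs.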

Link: https://arxiv.org/abs/2410.12130
Authors: Huiwen Wu, Xiaohan Li, Xiaogang Xu, Jiafei Wu, Deyi Zhang, Zhe Liu
Keywords: Large Language Models, scientific research fields, scientific literature summarization, Large Language, knowledge graph construction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive "positive" and "negative" models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) fine-tuned with a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.
摘要:大语言模型 (LLM) 的发展显著推动了商业和科学研究领域中各种 AI 应用的进步,如科学文献摘要、写作辅助和知识图谱构建。然而,LLM 推理过程中存在的高幻觉风险是一个重大挑战,可能导致事实不准确、信息不一致和内容捏造等安全问题。为解决这一问题,开发有效的方法以减少幻觉同时保持 LLM 的原始能力至关重要。本文介绍了一种名为迭代模型级对比学习 (Iterative Model-level Contrastive Learning, Iter-AHMCL) 的新方法来应对幻觉问题。该方法通过使用对比的“正”模型和“负”模型,对包含和不包含幻觉的数据进行训练,修改预训练 LLM 的表示层。通过利用这两种模型之间的差异,我们创建了一条更直接的消除幻觉的路径,而对比学习的迭代特性进一步提升了性能。在四个预训练基础 LLM (LLaMA2, Alpaca, LLaMA3, 和 Qwen) 上进行微调,并使用专门设计的数据集进行实验验证,结果显示我们的方法在 TruthfulQA 基准测试中平均提升了 10.1 分。综合实验表明,Iter-AHMCL 在减少幻觉的同时,保持了 LLM 的通用能力。

[NLP-83] Scaling laws for post-training quantized large language models

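Quick read: This paper addresses the problem of predicting the quality of large language models (LLMs) after post-training compression, specifically post-training weight quantization. Through a systematic empirical study across multiple LLM families, low-precision tensor data types, and popular weight quantization techniques, the authors identify key scaling factors tied to characteristics of the local loss landscape. A statistical model built on these factors predicts the performance of quantized LLMs reasonably well.

Link: https://arxiv.org/abs/2410.12119
Authors: Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
Keywords: well-trained large language, Generalization abilities, large language models, abilities of well-trained, well-trained large
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)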

Abstract:Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.

[NLP-84] Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

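Quick read: This paper addresses the trade-off between flexibility and complexity that large language models (LLMs) face on complex planning problems. The key to the solution is LLMFP, a general-purpose framework that uses the commonsense, reasoning, and programming capabilities of LLMs to capture the key information in a planning problem and to formulate and solve it from scratch as an optimization problem, with no task-specific in-context examples or pre-defined critics/verifiers. Across 9 planning tasks, LLMFP reaches average optimal rates of 83.7% with GPT-4o and 86.8% with Claude 3.5 Sonnet, improvements of 37.6% and 40.7% over the best baseline (direct planning with OpenAI o1-preview).

Link: https://arxiv.org/abs/2410.12112
Authors: Yilun Hao, Yang Zhang, Chuchu Fan
Keywords: large language models, recently demonstrated strong, demonstrated strong potential, planning problems, solving planning problems
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 50 pages, 25 figures, 7 tables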

Abstract:While large language models (LLMs) have recently demonstrated strong potential in solving planning problems, there is a trade-off between flexibility and complexity. LLMs, as zero-shot planners themselves, are still not capable of directly generating valid plans for complex planning problems such as multi-constraint or long-horizon tasks. On the other hand, many frameworks aiming to solve complex planning problems often rely on task-specific preparatory efforts, such as task-specific in-context examples and pre-defined critics/verifiers, which limits their cross-task generalization capability. In this paper, we tackle these challenges by observing that the core of many planning problems lies in optimization problems: searching for the optimal solution (best plan) with goals subject to constraints (preconditions and effects of decisions). With LLMs’ commonsense, reasoning, and programming capabilities, this opens up the possibilities of a universal LLM-based approach to planning problems. Inspired by this observation, we propose LLMFP, a general-purpose framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch, with no task-specific examples needed. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLMFP achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet, significantly outperforming the best baseline (direct planning with OpenAI o1-preview) with 37.6% and 40.7% improvements. We also validate components of LLMFP with ablation experiments and analyzed the underlying success and failure reasons.

[NLP-85] OMCAT: Omni Context Aware Transformer

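Quick read: This paper addresses the difficulty multimodal large language models have with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. It makes two contributions. The first is OCTAV (Omni Context and Temporal Audio Video), a new dataset designed to capture event transitions across audio and video. The second is OMCAT (Omni Context Aware Transformer), a model that uses RoTE (Rotary Time Embeddings), an extension of RoPE, to improve temporal grounding and computational efficiency in time-anchored tasks, and that is trained with a three-stage pipeline of feature alignment, instruction tuning, and OCTAV-specific training.

Link: https://arxiv.org/abs/2410.12109
Authors: Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
Keywords: Large Language Models, Large Language, recent advancements extending, Language Models, Temporal Audio Video
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Demo page: this https URL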

Abstract: Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is this https URL.

[NLP-86] De-jargonizing Science for Journalists with GPT-4: A Pilot Study

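Quick read: This pilot study addresses identifying and defining jargon terms in scientific abstracts in a way that is personalized to each reader's background. The approach is a human-in-the-loop system that combines GPT-4 with Retrieval-Augmented Generation (RAG) and uses readers' self-reported knowledge to decide which terms to flag and define. Definitions generated from the abstract alone turn out slightly more accurate and of higher quality than those generated with RAG-based context from the full text of the article, which points to the potential of generative AI for simplifying dense documents and assisting science reporters.

Link: https://arxiv.org/abs/2410.12069
Authors: Sachita Nishal, Eric Lee, Nicholas Diakopoulos
Keywords: large language model, readers’ self-reported knowledge, define jargon terms, Retrieval-Augmented Generation, self-reported knowledge
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Accepted to Computation+Journalism Symposium 2024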

Abstract:This study offers an initial evaluation of a human-in-the-loop system leveraging GPT-4 (a large language model or LLM), and Retrieval-Augmented Generation (RAG) to identify and define jargon terms in scientific abstracts, based on readers’ self-reported knowledge. The system achieves fairly high recall in identifying jargon and preserves relative differences in readers’ jargon identification, suggesting personalization as a feasible use-case for LLMs to support sense-making of complex information. Surprisingly, using only abstracts for context to generate definitions yields slightly more accurate and higher quality definitions than using RAG-based context from the fulltext of an article. The findings highlight the potential of generative AI for assisting science reporters, and can inform future work on developing tools to simplify dense documents.

[NLP-87] LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

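Quick read: This paper reports on a shared task for detecting legal violations in text in the wild, split into two sub-tasks: identifying legal violation entities (LegalLens-NER) and associating those violations with relevant legal contexts and affected individuals (LegalLens-NLI). The task uses an enhanced LegalLens dataset covering the labor, privacy, and consumer protection domains, and 38 teams took part. The top-performing teams in both sub-tasks relied on fine-tuning pre-trained language models, which outperformed legal-specific models and few-shot methods. The best team improved on the baseline by 7.11% in NER, while the NLI improvement was smaller at 5.7%.

Link: https://arxiv.org/abs/2410.12064
Authors: Ben Hagag, Liav Harpaz, Gil Semo, Dor Bernsohn, Rohit Saha, Pashootan Vaezipoor, Kyryl Truskovskyi, Gerasimos Spanakis
Keywords: LegalLens Shared Task, detecting legal violations, identifying legal violation, legal violation entities, relevant legal contexts
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)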

Abstract:This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.

[NLP-88] Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

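Quick read: This paper compares the next-token predictions of several language models with human productions in the cloze task. Larger models trained for longer are typically better estimators of human productions, but they are systematically off: they under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. The takeaway is that language model generations cannot be used as replacements for, or models of, human performance on the cloze task.

Link: https://arxiv.org/abs/2410.12057
Authors: Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang
Keywords: token prediction level, cloze task, compare the generative, generative behavior, token prediction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)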

Abstract: In this work we compare the generative behavior at the next token prediction level in several language models by comparing them to human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements of or models of the cloze task.
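The comparison described in the abstract can be illustrated with a minimal sketch that lines up human cloze responses against a model's next-token probabilities. The function name `compare_cloze` and the toy numbers are assumptions made for illustration; the paper's actual metrics and data are not reproduced here.

```python
def compare_cloze(human_counts, model_probs):
    """Compare human cloze responses with model next-token probabilities.

    human_counts: how many participants produced each word for one context
    model_probs:  model probability for each candidate next token
    (Both inputs are hypothetical toy data, not the paper's dataset.)
    """
    total = sum(human_counts.values())
    # Rank words from most to least frequent / probable.
    human_ranking = sorted(human_counts, key=human_counts.get, reverse=True)
    model_ranking = sorted(model_probs, key=model_probs.get, reverse=True)
    rows = []
    for word in human_ranking:
        rows.append({
            "word": word,
            "human_prob": human_counts[word] / total,
            "model_prob": model_probs.get(word, 0.0),
            "human_rank": human_ranking.index(word) + 1,
            # None when the model assigns no probability to the word at all.
            "model_rank": model_ranking.index(word) + 1 if word in model_probs else None,
        })
    return rows


human = {"dog": 6, "cat": 3, "ferret": 1}
model = {"the": 0.35, "ferret": 0.30, "dog": 0.25, "cat": 0.10}
for row in compare_cloze(human, model):
    print(row)
```

In this toy case the top human response ("dog") is ranked third by the model and receives less probability than humans give it, while the rare response ("ferret") is ranked above it, which is the kind of mismatch the abstract describes.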

[NLP-89] A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek

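Quick read: This paper compares six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek that annotates according to the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts is used to train the baseline model Dithrax with randomly initialized character embeddings, and to fine-tune Trankit and four models pretrained on Ancient Greek texts (GreBERTa and PhilBERTa for morphosyntactic annotation, GreTa and PhilTa for lemmatization). A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while Trankit is best for syntax and GreTa for lemmata. The results also suggest that token embeddings alone are not enough for high UAS and LAS scores unless they are paired with a modeling strategy specifically designed to capture syntactic relationships.

Link: https://arxiv.org/abs/2410.12055
Authors: Giuseppe G. A. Celano
Keywords: Greek Dependency Treebank, Ancient Greek Dependency, Dependency Treebank annotation, Ancient Greek capable, Treebank annotation scheme
Categories: Computation and Language (cs.CL)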

Abstract: This paper presents an experiment consisting in the comparison of six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek capable of annotating according to the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts was used to (i) train the baseline model Dithrax with randomly initialized character embeddings and (ii) fine-tune Trankit and four recent models pretrained on Ancient Greek texts, i.e., GreBERTa and PhilBERTa for morphosyntactic annotation and GreTa and PhilTa for lemmatization. A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while syntax is best annotated by Trankit and lemmata by GreTa. The results of the experiment suggest that token embeddings are not sufficient to achieve high UAS and LAS scores unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships. The dataset and best-performing models are made available online for reuse.

[NLP-90] Skill-LLM: Repurposing General-Purpose LLMs for Skill Extraction

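Quick read: This paper addresses accurate skill extraction from job descriptions. The key to the solution is fine-tuning large language models (LLMs) to build a specialized Skill-LLM and a lightweight model. Evaluated on a benchmark dataset against state-of-the-art (SOTA) methods, the approach is reported to outperform existing SOTA techniques in the precision and quality of skill extraction.

Link: https://arxiv.org/abs/2410.12052
Authors: Amirhossein Herandi, Yitao Li, Zhanlin Liu, Ximin Hu, Xiao Cai
Keywords: Accurate skill extraction, Named Entity Recognition, Accurate skill, remains challenging, job descriptions
Categories: Computation and Language (cs.CL)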

Abstract: Accurate skill extraction from job descriptions is crucial in the hiring process but remains challenging. Named Entity Recognition (NER) is a common approach used to address this issue. With the demonstrated success of large language models (LLMs) in various NLP tasks, including NER, we propose fine-tuning a specialized Skill-LLM and a lightweight model to improve the precision and quality of skill extraction. In our study, we evaluated the fine-tuned Skill-LLM and the lightweight model using a benchmark dataset and compared their performance against state-of-the-art (SOTA) methods. Our results show that this approach outperforms existing SOTA techniques.

[NLP-91] Sabia-3 Technical Report

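Quick read: This report presents Sabiá-3, a new flagship language model trained on a large Brazilian-centric corpus and aimed at strong performance on Portuguese and Brazil-related tasks. Compared with the authors' previous best model, Sabiá-2 Medium, it improves substantially, especially on reasoning-intensive tasks. Its average performance matches frontier LLMs while it is offered at a three to four times lower cost per token, which the authors take as evidence for the benefits of domain specialization.

Link: https://arxiv.org/abs/2410.12049
Authors: Hugo Abonizio, Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, Ramon Pires
Keywords: large brazilian-centric corpus, language model trained, flagship language model, report presents, brazilian-centric corpus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)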

Abstract: This report presents Sabiá-3, our new flagship language model trained on a large Brazilian-centric corpus. Evaluations across diverse professional and academic benchmarks show a strong performance on Portuguese and Brazil-related tasks. Sabiá-3 shows large improvements in comparison to our previous best model, Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3’s average performance matches frontier LLMs, while it is offered at a three to four times lower cost per token, reinforcing the benefits of domain specialization.

[NLP-92] Boosting Logical Fallacy Reasoning in LLMs via Logical Structure Tree EMNLP2024

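Quick read: This paper addresses the detection and classification of logical fallacies. The key to the solution is a logical structure tree that explicitly represents and tracks the hierarchical logic flow among relation connectives and their arguments in a statement. The tree is built in an unsupervised manner, guided by the constituency tree and a taxonomy of connectives for ten common logical relations, with relation connectives as non-terminal nodes and textual arguments as terminal nodes. Two strategies incorporate the tree into large language models (LLMs): converting it into a natural language description that becomes part of the hard text prompt, and deriving a relation-aware tree embedding that is inserted as a soft prompt. Experiments on benchmark datasets show significant gains in precision and recall for both fallacy detection and fallacy classification.

Link: https://arxiv.org/abs/2410.12048
Authors: Yuanyuan Lei, Ruihong Huang
Keywords: logical structure tree, Logical, logical structure, structure tree, tree
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024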

Abstract:Logical fallacy uses invalid or faulty reasoning in the construction of a statement. Despite the prevalence and harmfulness of logical fallacies, detecting and classifying logical fallacies still remains a challenging task. We observe that logical fallacies often use connective words to indicate an intended logical relation between two arguments, while the argument semantics does not actually support the logical relation. Inspired by this observation, we propose to build a logical structure tree to explicitly represent and track the hierarchical logic flow among relation connectives and their arguments in a statement. Specifically, this logical structure tree is constructed in an unsupervised manner guided by the constituency tree and a taxonomy of connectives for ten common logical relations, with relation connectives as non-terminal nodes and textual arguments as terminal nodes, and the latter are mostly elementary discourse units. We further develop two strategies to incorporate the logical structure tree into LLMs for fallacy reasoning. Firstly, we transform the tree into natural language descriptions and feed the textualized tree into LLMs as a part of the hard text prompt. Secondly, we derive a relation-aware tree embedding and insert the tree embedding into LLMs as a soft prompt. Experiments on benchmark datasets demonstrate that our approach based on logical structure tree significantly improves precision and recall for both fallacy detection and fallacy classification.
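The first strategy in the abstract (turning the tree into a natural language description for the hard text prompt) can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `Argument` and `Relation`, the function `describe`, the relation label "causal", and the wording of the description are all hypothetical, and the unsupervised tree construction itself is not shown.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Argument:
    """Terminal node: a textual argument."""
    text: str


@dataclass
class Relation:
    """Non-terminal node: a relation connective joining two sub-trees."""
    connective: str   # e.g. "because"
    relation: str     # hypothetical label for the logical relation
    left: "Node"
    right: "Node"


Node = Union[Argument, Relation]


def describe(node):
    """Recursively turn a logical structure tree into a plain-text description."""
    if isinstance(node, Argument):
        return f'"{node.text}"'
    return (
        f'the connective "{node.connective}" signals a {node.relation} relation '
        f"between {describe(node.left)} and {describe(node.right)}"
    )


tree = Relation(
    connective="because",
    relation="causal",
    left=Argument("the streets are wet"),
    right=Argument("it rained"),
)
print(describe(tree))
```

The resulting string could then be prepended to the statement in the prompt; the second strategy (a tree embedding used as a soft prompt) needs model internals and is outside this sketch.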

[NLP-93] Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction

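Quick read: This paper addresses the concern that large language models (LLMs) rely on superficial logical chains rather than robust reasoning. It proposes a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), which reverses concepts to ones more strongly associated with the wrong answer; LLM performance drops significantly even though the rationale of the reasoning stays the same. It also proposes Abstraction-of-Thought (AoT), a prompting method that uses conceptual abstraction to turn adversarial cases back into normal cases and so improves the robustness and consistency of LLM reasoning.

Link: https://arxiv.org/abs/2410.12040
Authors: Kaiqiao Han, Tianqing Fang, Zhaowei Wang, Yangqiu Song, Mark Steedman
Keywords: Large Language Models, Winograd Schema Challenge, superficial logical chains, Language Models, showcased remarkable proficiency
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)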

Abstract:While Large Language Models (LLMs) have showcased remarkable proficiency in reasoning, there is still a concern about hallucinations and unreliable reasoning issues due to semantic associations and superficial logical chains. To evaluate the extent to which LLMs perform robust reasoning instead of relying on superficial logical chains, we propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the famous Winograd Schema Challenge (WSC) dataset. By simply reversing the concepts to those that are more associated with the wrong answer, we find that the performance of LLMs drops significantly despite the rationale of reasoning remaining the same. Furthermore, we propose Abstraction-of-Thought (AoT), a novel prompt method for recovering adversarial cases to normal cases using conceptual abstraction to improve LLMs’ robustness and consistency in reasoning, as demonstrated by experiments on CR-WSC.

[NLP-94] On Classification with Large Language Models in Cultural Analytics

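Quick read: This paper surveys how classification is used as a sensemaking practice in cultural analytics and assesses where large language models (LLMs) fit in. On ten tasks backed by publicly available datasets, it empirically compares LLMs with traditional supervised methods. Prompt-based LLMs are competitive with traditional supervised models on established tasks but perform less well on de novo tasks. Beyond accuracy, LLMs can also assist sensemaking by serving as an intermediary input to formal theory testing.

Link: https://arxiv.org/abs/2410.12029
Authors: David Bamman, Kent K. Chang, Li Lucy, Naitian Zhou
Keywords: large language models, cultural analytics, practice in cultural, large language, sensemaking practice
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)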

Abstract:In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy. We find that prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks. In addition, LLMs can assist sensemaking by acting as an intermediary input to formal theory testing.

[NLP-95] LocoMotion: Learning Motion-Focused Video-Language Representations ACCV2024

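Quick read: This paper addresses the lack of attention to motion in video-language representation learning. Existing methods learn from spatially focused data, where identifying the objects and scene is often enough to pick out the right caption, so descriptions of movement and temporal progression are neglected. The proposed method, LocoMotion, adds synthetic motions to videos and uses the parameters of those motions to generate matching captions, so the model learns from motion-focused captions. Verb-variation paraphrasing increases caption variety and links primitive motions to high-level verbs. Experiments show the approach is effective on a variety of downstream tasks, particularly when little data is available for fine-tuning.

Link: https://arxiv.org/abs/2410.12018
Authors: Hazel Doughty, Fida Mohammad Thoker, Cees G. M. Snoek
Keywords: paper strives, video-language representations, motion-focused video-language representations, learn video-language representations, motion-focused video-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: ACCV 2024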

Abstract:This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions. Furthermore, we propose verb-variation paraphrasing to increase the caption variety and learn the link between primitive motions and high-level verbs. With this, we are able to learn a motion-focused video-language representation. Experiments demonstrate our approach is effective for a variety of downstream tasks, particularly when limited data is available for fine-tuning. Code is available: this https URL
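The idea of generating a caption from the parameters of a synthetic motion, then varying the verb, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function `caption_from_motion`, the `(dx, dy)` parameterization, and the `VERB_VARIANTS` table are hypothetical and are not the paper's actual motion model or paraphrasing procedure.

```python
import random

# Hypothetical verb paraphrases for each primitive motion direction.
VERB_VARIANTS = {
    "left": ["moves left", "slides to the left", "drifts leftward"],
    "right": ["moves right", "slides to the right", "drifts rightward"],
    "up": ["moves up", "rises", "climbs"],
    "down": ["moves down", "falls", "sinks"],
}


def caption_from_motion(obj, dx, dy, rng=None):
    """Describe a synthetic per-frame displacement (dx, dy) of `obj`.

    Uses image coordinates, so positive dy means downward motion.
    Passing a random.Random instance as `rng` samples a verb variant;
    without it the first (canonical) variant is used.
    """
    if dx == 0 and dy == 0:
        return f"the {obj} stays still"
    # The dominant axis decides which primitive motion is described.
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"
    variants = VERB_VARIANTS[direction]
    verb = rng.choice(variants) if rng else variants[0]
    return f"the {obj} {verb}"


print(caption_from_motion("ball", 4, 1))
print(caption_from_motion("ball", 0, -3, rng=random.Random(0)))
```

Because the caption is derived from the same parameters that drive the synthetic motion, the video and its text stay aligned by construction, which is what makes this kind of supervision cheap to produce.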

[NLP-96] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

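Quick read: This paper addresses the high memory consumption and expert redundancy of Mixture-of-Experts (MoE) architectures. The proposed method, MoE-Pruner, prunes, on each output neuron, the weights with the smallest magnitudes multiplied by the corresponding input activations and router weights. Pruning is one-shot and requires no retraining or weight updates. On Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks, the method significantly outperforms state-of-the-art LLM pruning methods. With expert-wise knowledge distillation from a pretrained teacher, the pruned models improve further: Mixtral-8x7B at 50% sparsity keeps 99% of the original model's performance.

Link: https://arxiv.org/abs/2410.12013
Authors: Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu
Keywords: architectures face challenges, high memory consumption, architectures face, redundancy in experts, face challenges
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)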

Abstract:Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. Pruning MoE can reduce network weights while maintaining model performance. Motivated by the recent observation of emergent large magnitude features in Large Language Models (LLM) and MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights, on each output neuron. Our pruning method is one-shot, requiring no retraining or weight updates. We evaluate our method on Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks. Experimental results show that our pruning method significantly outperforms state-of-the-art LLM pruning methods. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance post-pruning. Experimental results demonstrate that the Mixtral-8x7B model with 50% sparsity maintains 99% of the performance of the original model after the expert-wise knowledge distillation.
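The pruning criterion described in the abstract (weight magnitude multiplied by input activations and router weights, compared within each output neuron) can be sketched in plain Python. This is a minimal sketch under assumptions: the function name `prune_expert_weights` is hypothetical, and the way the router weight is folded into a per-feature activation norm is a guess at one reasonable formulation, not the paper's confirmed formula.

```python
import math


def prune_expert_weights(weights, activations, gate_weights, sparsity=0.5):
    """Return a copy of `weights` with the lowest-scoring entries in each row zeroed.

    weights:      one row per output neuron (out_features x in_features)
    activations:  calibration inputs routed to this expert (tokens x in_features)
    gate_weights: router weight for each calibration token (length = tokens)
    """
    in_features = len(weights[0])
    # Assumed combination: router-weighted L2 norm of each input feature.
    feature_norms = [
        math.sqrt(sum((g * x[j]) ** 2 for g, x in zip(gate_weights, activations)))
        for j in range(in_features)
    ]
    n_prune = int(in_features * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * activation norm, compared within one output neuron.
        scores = [abs(w) * n for w, n in zip(row, feature_norms)]
        drop = set(sorted(range(in_features), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


weights = [[0.9, -0.1, 0.5, 0.05], [0.2, 0.8, -0.3, 0.4]]
activations = [[1.0, 2.0, 0.1, 4.0], [0.5, 1.0, 0.2, 3.0]]
gate_weights = [0.7, 0.3]
print(prune_expert_weights(weights, activations, gate_weights))
# [[0.9, 0.0, 0.0, 0.05], [0.0, 0.8, 0.0, 0.4]]
```

Note that in the first row the weight 0.5 is pruned while the much smaller 0.05 survives, because the latter multiplies a feature with large router-weighted activations. Pure magnitude pruning would make the opposite choice. No retraining step appears in the sketch, matching the one-shot setting.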

[NLP-97] Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models EMNLP2025

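Quick read: This paper examines the gap between the linguistic and visual capabilities of pixel-based language models. It probes PIXEL with a range of linguistic and visual tasks and finds that the lower layers mainly capture superficial visual features, while the higher layers gradually learn more syntactic and semantic abstractions. Comparing PIXEL variants trained with different text rendering strategies, it also finds that certain orthographic constraints at the input level help surface-level features to be learned earlier. The findings are meant to inform the further development of pixel-based language models.

Link: https://arxiv.org/abs/2410.12011
Authors: Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux
Keywords: subword-based language modelling, virtually any script, compelling alternative, alternative to subword-based, represent virtually
Categories: Computation and Language (cs.CL)
Comments: 9 pages, Accepted to EMNLP 2025 Main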

Abstract:Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model’s visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.

[NLP-98] Bias Similarity Across Large Language Models

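Quick read: This paper compares how similar the biases of different large language models (LLMs) are, a question that matters because content generated by these models can influence decision-making in society. It assesses bias through output distributions for ten open- and closed-source LLMs from four model families, using two datasets to measure functional similarity. Three findings stand out: fine-tuning does not significantly alter output distributions, LLMs within the same family do not produce similar output distributions, and there is a possible risk of training data information leakage. These findings shed light on LLM behavior and on potential risks in real-world deployment.

Link: https://arxiv.org/abs/2410.12010
Authors: Hyejun Jeong, Shiqing Ma, Amir Houmansadr
Keywords: machine learning models, models influence decision-making, Large Language Models, chronic problem, human society
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review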

Abstract:Bias in machine learning models has been a chronic problem, especially as these models influence decision-making in human society. In generative AI, such as Large Language Models, the impact of bias is even more profound compared to the classification models. LLMs produce realistic and human-like content that users may unconsciously trust, which could perpetuate harmful stereotypes to the uncontrolled public. It becomes particularly concerning when utilized in journalism or education. While prior studies have explored and quantified bias in individual AI models, no work has yet compared bias similarity across different LLMs. To fill this gap, we take a comprehensive look at ten open- and closed-source LLMs from four model families, assessing the extent of biases through output distribution. Using two datasets-one containing 4k questions and another with one million questions for each of the four bias dimensions – we measure functional similarity to understand how biases manifest across models. Our findings reveal that 1) fine-tuning does not significantly alter output distributions, which would limit its ability to mitigate bias, 2) LLMs within the same family tree do not produce similar output distributions, implying that addressing bias in one model could have limited implications for others in the same family, and 3) there is a possible risk of training data information leakage, raising concerns about privacy and data security. Our analysis provides insight into LLM behavior and highlights potential risks in real-world deployment.

[NLP-99] Toolken+: Improving LLM Tool Usage with Reranking and a Reject Option EMNLP2024

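[Quick read]: This paper addresses two main problems of the ToolkenGPT tool-learning paradigm: it cannot make use of tool documentation, and it often errs when deciding whether to use a tool at all. The proposed Toolken+ mitigates the first problem by reranking the top-k tools selected by ToolkenGPT, and the second by adding a special "Reject" option: when "Reject" is ranked first, the model generates an ordinary vocabulary token instead of calling a tool.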

Link: https://arxiv.org/abs/2410.12004
Authors: Konstantin Yakovlev, Sergey Nikolenko, Andrey Bout
Keywords: recently proposed ToolkenGPT, learning paradigm demonstrates, paradigm demonstrates promising, demonstrates promising performance, tool learning paradigm
Categories: Computation and Language (cs.CL)
Notes: EMNLP 2024 Findings

Abstract:The recently proposed ToolkenGPT tool learning paradigm demonstrates promising performance but suffers from two major issues: first, it cannot benefit from tool documentation, and second, it often makes mistakes in whether to use a tool at all. We introduce Toolken+ that mitigates the first problem by reranking top k tools selected by ToolkenGPT and the second problem with a special “Reject” option such that the model will generate a vocabulary token if “Reject” is ranked first. We demonstrate the effectiveness of Toolken+ on multistep numerical reasoning and tool selection tasks.
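
The two-stage decision described in the abstract can be sketched as follows. This is only an illustration of the control flow: the first-stage scores, the `rerank_score` callable, and the `REJECT` sentinel are hypothetical stand-ins, not the models or identifiers used in the paper.

```python
from typing import Callable, Optional

REJECT = "<reject>"  # hypothetical sentinel for the Reject option


def choose_tool(
    candidate_scores: dict[str, float],
    rerank_score: Callable[[str], float],
    k: int = 3,
) -> Optional[str]:
    """Return the selected tool name, or None when the Reject option wins."""
    # Stage 1: keep the k tools that the first-stage selector scored highest.
    top_k = sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:k]
    # Stage 2: rerank those tools together with the Reject option.
    ranked = sorted(top_k + [REJECT], key=rerank_score, reverse=True)
    # Reject ranked first means: emit a normal vocabulary token, call no tool.
    return None if ranked[0] == REJECT else ranked[0]
```

In a real system the reranker would be the component that can read tool documentation; here it is just a function from a candidate name to a score.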

[NLP-100] Impacts of Continued Legal Pre-Training and IFT on LLMs Latent Representations of Human-Defined Legal Concepts

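[Quick read]: This paper examines whether and how continued pre-training and instruction fine-tuning (IFT) of large language models (LLMs) on legal corpora increases their use of human-defined legal concepts when they build global contextual representations of input sequences. It compares three models: Mistral 7B, SaulLM-7B-Base (Mistral 7B with continued pre-training on legal corpora), and SaulLM-7B-Instruct (with further IFT). The analysis looks at the proportion of attention each model allocates to the token subsets that represent legal concepts, and at the patterns of change in raw attention scores, to assess whether legal training introduces new attention patterns that correspond to structures of human legal knowledge. The results show that the impact of legal training is unevenly distributed across legal concepts, and that the contextual representations learned during legal training do not coincide with the structures of human-defined legal concepts.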

Link: https://arxiv.org/abs/2410.12001
Authors: Shaun Ho
Keywords: human-defined legal concepts, large language models, legal corpora increases, legal, human-defined legal
Categories: Computation and Language (cs.CL)
Notes:

Abstract:This paper aims to offer AI Law researchers and practitioners a more detailed understanding of whether and how continued pre-training and instruction fine-tuning (IFT) of large language models (LLMs) on legal corpora increases their utilization of human-defined legal concepts when developing global contextual representations of input sequences. We compared three models: Mistral 7B, SaulLM-7B-Base (Mistral 7B with continued pre-training on legal corpora), and SaulLM-7B-Instruct (with further IFT). This preliminary assessment examined 7 distinct text sequences from recent AI Law literature, each containing a human-defined legal concept. We first compared the proportions of total attention the models allocated to subsets of tokens representing the legal concepts. We then visualized patterns of raw attention score alterations, evaluating whether legal training introduced novel attention patterns corresponding to structures of human legal knowledge. This inquiry revealed that (1) the impact of legal training was unevenly distributed across the various human-defined legal concepts, and (2) the contextual representations of legal knowledge learned during legal training did not coincide with structures of human-defined legal concepts. We conclude with suggestions for further investigation into the dynamics of legal LLM training.
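
The first measurement in the abstract, the proportion of total attention allocated to the tokens of a legal concept, can be illustrated with a small sketch. It assumes a single attention matrix whose rows are query positions and whose columns are key positions; how the paper aggregates over heads and layers is not stated in the abstract, and `concept_attention_share` is a hypothetical helper name.

```python
def concept_attention_share(attn, concept_idx):
    """Fraction of all attention mass that lands on the concept's token positions.

    attn: list of rows; attn[i][j] is the attention from query i to key j.
    concept_idx: key positions of the tokens that spell out the concept.
    """
    concept = set(concept_idx)
    total = sum(sum(row) for row in attn)
    on_concept = sum(
        weight for row in attn for j, weight in enumerate(row) if j in concept
    )
    return on_concept / total


# Toy causal attention over three tokens; the concept occupies positions 1 and 2.
toy_attention = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
print(concept_attention_share(toy_attention, [1, 2]))
```

Comparing this share for the same sequence across the base and legally trained models gives one number per concept, which is the kind of comparison the abstract describes.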

[NLP-101] Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data

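[Quick read]: This paper addresses how to evaluate long-context language models (LCLMs) on holistic reasoning, that is, aggregation and reasoning over information spread across multiple documents. It introduces HoloBench, a framework that brings database reasoning operations into text-based contexts and systematically varies context length, information density, information distribution, and query complexity. The experiments show that the amount of information in the context affects LCLM performance more than context length does, and that query complexity affects it more still, with the effect differing by query type. Grouping relevant information generally improves performance, but the optimal positioning varies across models.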

Link: https://arxiv.org/abs/2410.11996
Authors: Seiji Maekawa, Hayate Iso, Nikita Bhutani
Keywords: efficient methods, methods to sift, information, holistic reasoning, context length
Categories: Computation and Language (cs.CL)
Notes:

Abstract:The rapid increase in textual information means we need more efficient methods to sift through, organize, and understand it all. While retrieval-augmented generation (RAG) models excel in accessing information from large document collections, they struggle with complex tasks that require aggregation and reasoning over information spanning across multiple documents–what we call holistic reasoning. Long-context language models (LCLMs) have great potential for managing large-scale documents, but their holistic reasoning capabilities remain unclear. In this work, we introduce HoloBench, a novel framework that brings database reasoning operations into text-based contexts, making it easier to systematically evaluate how LCLMs handle holistic reasoning across large documents. Our approach adjusts key factors such as context length, information density, distribution of information, and query complexity to evaluate LCLMs comprehensively. Our experiments show that the amount of information in the context has a bigger influence on LCLM performance than the actual context length. Furthermore, the complexity of queries affects performance more than the amount of information, particularly for different types of queries. Interestingly, queries that involve finding maximum or minimum values are easier for LCLMs and are less affected by context length, even though they pose challenges for RAG systems. However, tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases. Additionally, we find that while grouping relevant information generally improves performance, the optimal positioning varies across models. Our findings surface both the advancements and the ongoing challenges in achieving a holistic understanding of long contexts.

[NLP-102] DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models NEURIPS2024

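[Quick read]: This paper targets the deployment of large language models (LLMs) on resource-limited devices, where memory and compute costs are the main obstacle. It proposes a dimension-independent structural pruning method that relaxes the constraints of conventional structural pruning and removes the structural dependence along the embedding dimension. Different blocks can then use different subsets of the feature maps, and each block can have different widths along its input and output dimensions, which makes structural pruning considerably more flexible. Across several LLMs the method outperforms other state-of-the-art approaches and shows, for the first time, that structural pruning can reach accuracy similar to semi-structural pruning.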

Link: https://arxiv.org/abs/2410.11988
Authors: Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, achieved remarkable success
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Accepted by NeurIPS 2024

Abstract:Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising solution to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures at the cost of limiting flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers several benefits. Firstly, our method enables different blocks to utilize different subsets of the feature maps. Secondly, by removing structural dependence, we facilitate each block to possess varying widths along its input and output dimensions, thereby significantly enhancing the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

[NLP-103] The Fair Language Model Paradox

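[Quick read]: This paper studies a token-level performance bias that weight decay introduces when training large language models (LLMs): as weight decay increases, performance on low-frequency tokens degrades disproportionately, even though such tokens make up the vast majority of the token distribution in most languages. The paper does not propose a fix itself; it calls for new regularization techniques that treat all tokens fairly during training.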

Link: https://arxiv.org/abs/2410.11985
Authors: Andrea Pinto, Tomer Galanti, Randall Balestriero
Keywords: Large Language Models, Large Language, real-world applications, widely deployed, deployed in real-world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.

[NLP-104] FLARE: Faithful Logic-Aided Reasoning and Exploration

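[Quick read]: This paper addresses the difficulty that LLM-based question answering (QA) and reasoning methods have in producing outputs that are faithful to their intermediate reasoning chains. It proposes Faithful Logic-Aided Reasoning and Exploration (FLARE), an interpretable approach that traverses the problem space using task decompositions. The LLM plans a solution, soft-formalises the query into facts and predicates in logic programming code, and simulates execution of that code with an exhaustive multi-hop search over the defined space. This makes it possible to compute the faithfulness of the reasoning process with respect to the generated code, and to analyse the steps of the multi-hop search without relying on an external solver. The method achieves state-of-the-art results on 7 of 9 reasoning benchmarks.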

Link: https://arxiv.org/abs/2410.11900
Authors: Erik Arakelyan, Pasquale Minervini, Pat Verga, Patrick Lewis, Isabelle Augenstein
Keywords: Modern Question Answering, Large Language Models, Large Language, Modern Question, Question Answering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Notes:

Abstract:Modern Question Answering (QA) and Reasoning approaches based on Large Language Models (LLMs) commonly use prompting techniques, such as Chain-of-Thought (CoT), assuming the resulting generation will have a more granular exploration and reasoning over the question space and scope. However, such methods struggle with generating outputs that are faithful to the intermediate chain of reasoning produced by the model. On the other end of the spectrum, neuro-symbolic methods such as Faithful CoT (F-CoT) propose to combine LLMs with external symbolic solvers. While such approaches boast a high degree of faithfulness, they usually require a model trained for code generation and struggle with tasks that are ambiguous or hard to formalise strictly. We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), a novel interpretable approach for traversing the problem space using task decompositions. We use the LLM to plan a solution, soft-formalise the query into facts and predicates using a logic programming code and simulate that code execution using an exhaustive multi-hop search over the defined space. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers. Our methods achieve SOTA results on 7 out of 9 diverse reasoning benchmarks. We also show that model faithfulness positively correlates with overall performance and further demonstrate that FLARE allows pinpointing the decisive factors sufficient for and leading to the correct answer with optimal reasoning during the multi-hop search.

[NLP-105] ChatVis: Automating Scientific Visualization with a Large Language Model

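[Quick read]: This paper addresses generating correct Python scripts for data analysis and visualization from natural-language descriptions. It presents ChatVis, an iterative assistant that uses a large language model (LLM) to generate a script and then repeatedly detects and corrects errors until the script executes correctly. On five canonical visualization scenarios, ChatVis generated the correct script in every case, whereas the unassisted LLMs it was compared against did not.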

Link: https://arxiv.org/abs/2410.11863
Authors: Tanwi Mallick, Orcun Yildiz, David Lenz, Tom Peterka
Keywords: synthetically generate Python, large language model, generate Python scripts, generate Python, develop an iterative
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:We develop an iterative assistant we call ChatVis that can synthetically generate Python scripts for data analysis and visualization using a large language model (LLM). The assistant allows a user to specify the operations in natural language, attempting to generate a Python script for the desired operations, prompting the LLM to revise the script as needed until it executes correctly. The iterations include an error detection and correction mechanism that extracts error messages from the execution of the script and subsequently prompts LLM to correct the error. Our method demonstrates correct execution on five canonical visualization scenarios, comparing results with ground truth. We also compared our results with scripts generated by several other LLMs without any assistance. In every instance, ChatVis successfully generated the correct script, whereas the unassisted LLMs failed to do so. The code is available on GitHub: this https URL.
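
The generate, execute, and repair loop described in the abstract can be sketched as below. This is an assumption-laden illustration, not the ChatVis code: `llm` is any callable that maps a prompt to script text, the prompt wording is invented, and `run_script` and `generate_until_it_runs` are hypothetical helper names.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional


def run_script(code: str, timeout: float = 30.0) -> Optional[str]:
    """Run `code` in a fresh interpreter; return stderr on failure, None on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(code)
        path = handle.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    finally:
        os.unlink(path)  # always remove the temporary script
    return None if result.returncode == 0 else result.stderr


def generate_until_it_runs(
    task: str, llm: Callable[[str], str], max_rounds: int = 5
) -> Optional[str]:
    """Ask the LLM for a script, feeding error messages back until it executes."""
    prompt = f"Write a Python script that does the following:\n{task}"
    for _ in range(max_rounds):
        code = llm(prompt)
        error = run_script(code)
        if error is None:
            return code  # the script ran without raising
        prompt = (
            f"This script failed:\n{code}\n\n"
            f"Error message:\n{error}\n"
            "Return a corrected script."
        )
    return None  # gave up after max_rounds attempts
```

A script that exits cleanly is not necessarily correct; the abstract notes that results were additionally compared with ground truth, which this sketch does not attempt.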

[NLP-106] Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations

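[Quick read]: This paper is motivated by the growing storage and compute demands of ever-larger large language models (LLMs). It analyzes how today's converged LLM architectures perform in terms of layer configurations, operational mechanisms, and model sizes under various hyperparameter settings. It briefly surveys the history of LLMs by tracing their operational improvements, summarizes performance trends measured on an RTX 6000 (Ada Lovelace architecture), and concludes that the same model can behave differently depending on the hyperparameters and on whether it is deployed in a server or an edge environment.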

Link: https://arxiv.org/abs/2410.11381
Authors: Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, Joon-Sung Yang
Keywords: Transformer architecture enables, enables contextually natural, contextually natural text, natural text generation, processing entire source
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 13 pages and 16 figures

Abstract:The advent of the Attention mechanism and Transformer architecture enables contextually natural text generation and compresses the burden of processing entire source information into singular vectors. Based on these two main ideas, model sizes gradually increases to accommodate more precise and comprehensive information, leading to the current state-of-the-art LLMs being very large, with parameters around 70 billion. As the model sizes are growing, the demand for substantial storage and computational capacity increases. This leads to the development of high-bandwidth memory and accelerators, as well as a variety of model architectures designed to meet these requirements. We note that LLM architectures have increasingly converged. This paper analyzes how these converged architectures perform in terms of layer configurations, operational mechanisms, and model sizes, considering various hyperparameter settings. In this paper, we conduct a concise survey of the history of LLMs by tracing the evolution of their operational improvements. Furthermore, we summarize the performance trends of LLMs under various hyperparameter settings using the RTX 6000, which features the state-of-the-art Ada Lovelace architecture. We conclude that even the same model can exhibit different behaviors depending on the hyperparameters or whether it is deployed in server or edge environments.

[NLP-107] OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

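[Quick read]: This paper considers coverless steganography in which a large language model (LLM) drives an arithmetic coding decoder to generate natural, fluent stego-text while embedding the secret message bits in as few tokens as possible. At the level of a single token, the problem is cast as maximizing the entropy of a replacement distribution for next-token generation, subject to a constraint on its KL divergence from the LLM's original distribution, and the paper gives a closed-form solution. It also handles practical issues, including tokenization mismatch, combination with vocabulary truncation, and combination with other sequence-level selection heuristics, to further improve efficiency and reliability.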

Link: https://arxiv.org/abs/2410.04328
Authors: Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian
Keywords: Large Language Model, arithmetic coding decoder, Language Model, Large Language, drives an arithmetic
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: 9 figures

Abstract:We consider coverless steganography where a Large Language Model (LLM) drives an arithmetic coding decoder to generate stego-texts. An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent. We show that on the individual token level, this problem is mathematically equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the KL divergence between the chosen probability distribution and the original distribution given by the LLM. A closed-form solution is provided for the optimization problem, which can be computed efficiently. Several important practical issues are also tackled: 1) An often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) The combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) The combination of the optimized distribution with other sequence-level selection heuristics to further enhance the efficiency and reliability is studied.
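
The per-token optimization in the abstract can be illustrated numerically. The sketch below assumes the constraint is KL(q ‖ p) ≤ δ, with q the replacement distribution and p the LLM's distribution; under that assumption a Lagrangian argument gives q ∝ p^β for some β in [0, 1], and β is found here by bisection. This is a worked illustration under stated assumptions, not the paper's closed-form expression, and all function names are hypothetical.

```python
import math


def tempered(p, beta):
    """Distribution proportional to p**beta (beta=1 gives p, beta=0 gives uniform)."""
    weights = [pi ** beta for pi in p]
    z = sum(weights)
    return [w / z for w in weights]


def kl(q, p):
    """KL(q || p) in nats; p must be strictly positive."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(q):
    """Shannon entropy in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


def max_entropy_within_kl(p, delta, iters=60):
    """Highest-entropy member of the family p**beta with KL(q || p) <= delta."""
    if kl(tempered(p, 0.0), p) <= delta:
        return tempered(p, 0.0)  # the uniform distribution already fits the budget
    lo, hi = 0.0, 1.0  # invariant: KL at lo exceeds delta, KL at hi does not
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(tempered(p, mid), p) > delta:
            lo = mid
        else:
            hi = mid
    return tempered(p, hi)


p = [0.7, 0.2, 0.1]  # toy next-token distribution
q = max_entropy_within_kl(p, delta=0.05)
print(entropy(p), entropy(q), kl(q, p))
```

Higher entropy per token means more message bits can be embedded per token, while the KL budget limits how far the generated text drifts from what the model would normally produce.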

[NLP-108] Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR ICASSP2025

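[Quick read]: This paper examines why automatic speech recognition (ASR) systems trained on synthetic speech still perform poorly on real speech, and asks whether the oversmoothing behaviour of common text-to-speech (TTS) models is the cause. It compares TTS models based on Denoising Diffusion Probabilistic Models (DDPM) with models trained with a Mean Squared Error (MSE) objective when their output is used to train ASR, focusing on how each scales with more data and more speakers. For a given model size, DDPM makes better use of additional data and speaker diversity than MSE models and achieves the best reported ratio between real and synthetic speech WER to date (1.46), although a large gap remains.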

Link: https://arxiv.org/abs/2410.12279
Authors: Christoph Minixhofer, Ondrej Klejch, Peter Bell
Keywords: Synthetically generated speech, rapidly approached human, approached human levels, Synthetically generated, levels of naturalness
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Under review at ICASSP 2025

Abstract:Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on the behaviour of TTS-for-ASR as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training. We test the scalability of the two approaches, varying both the number hours, and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.

[NLP-109] Automatic Screening for Children with Speech Disorder using Automatic Speech Recognition: Opportunities and Challenges AAAI

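[Quick read]: This position paper addresses the efficiency and scalability of assessing children with speech disorders (SD). It surveys AI techniques, in particular automatic speech recognition (ASR) models, for automating speech and language assessments (SLA), stresses the importance of adapting ASR models to children's speech, reviews the feasibility of AI-enhanced SLA pipelines, and discusses practical deployment concerns such as accessibility and privacy.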

Link: https://arxiv.org/abs/2410.11865
Authors: Dancheng Liu, Jason Yang, Ishan Albrecht-Buehler, Helen Qin, Sophie Li, Yuting Hu, Amir Nassereldine, Jinjun Xiong
Keywords: human life, academic development, fundamental aspect, aspect of human, SLA pipelines
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Notes: AAAI-FSS 24

Abstract:Speech is a fundamental aspect of human life, crucial not only for communication but also for cognitive, social, and academic development. Children with speech disorders (SD) face significant challenges that, if unaddressed, can result in lasting negative impacts. Traditionally, speech and language assessments (SLA) have been conducted by skilled speech-language pathologists (SLPs), but there is a growing need for efficient and scalable SLA methods powered by artificial intelligence. This position paper presents a survey of existing techniques suitable for automating SLA pipelines, with an emphasis on adapting automatic speech recognition (ASR) models for children’s speech, an overview of current SLAs and their automated counterparts to demonstrate the feasibility of AI-enhanced SLA pipelines, and a discussion of practical considerations, including accessibility and privacy concerns, associated with the deployment of AI-powered SLAs.

[NLP-110] The rotating normal form of braids is regular

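[Quick read]: This paper proves that the rotating normal form, defined on Birman-Ko-Lee monoids, is regular. The key step is the construction, for every n ≥ 2, of a finite-state automaton that recognizes rotating words on n strands. As a consequence, the σ-definite normal form defined on the whole braid group is also regular.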

Link: https://arxiv.org/abs/1606.08970
Authors: Jean Fromentin (LMPA)
Keywords: Dehornoy braid ordering, rotating normal form, Dehornoy braid, strong connections, normal form
Categories: Group Theory (math.GR); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Notes: Erratum. Lemma 4.1 of the previous version is incorrect, as pointed out by June Roupin. This lemma is not used in the rest of the paper. We have replaced it with Definition 4.1

Abstract:Defined on Birman-Ko-Lee monoids, the rotating normal form has strong connections with the Dehornoy's braid ordering. It can be seen as a process for selecting between all the representative words of a Birman-Ko-Lee braid a particular one, called rotating word. In this paper we construct, for all n ≥ 2, a finite-state automaton which recognizes rotating words on n strands, proving that the rotating normal form is regular. As a consequence we obtain the regularity of a σ-definite normal form defined on the whole braid group.

Artificial Intelligence

[AI-0] JudgeBench: A Benchmark for Evaluating LLM-based Judges

Link: https://arxiv.org/abs/2410.12784
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica
Keywords: LLM-based judges, scalable alternative, judges, LLM-based, human
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: preprint

Abstract:LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge’s alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at this https URL .

[AI-1] Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

Link: https://arxiv.org/abs/2410.12774
Authors: Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova
Keywords: depend heavily, PVI, tasks, task, PVI estimates
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: main paper 12 pages, Appendix 7 pages, 1 figure, 18 tables

Abstract:The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
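
For readers unfamiliar with pointwise V-usable information, the quantity can be computed from two probabilities: the one a model fine-tuned with the input assigns to the gold label, and the one a model fine-tuned on an empty input assigns to the same label. The sketch below uses that usual definition, in bits; the function names are hypothetical, and the statistical test the paper uses to decide that two tasks have similar PVI is not reproduced here.

```python
import math
from statistics import mean


def pvi(p_with_input: float, p_without_input: float) -> float:
    """PVI in bits: log2 of the label probability given the input,
    minus log2 of the label probability given an empty input."""
    return math.log2(p_with_input) - math.log2(p_without_input)


def dataset_pvi(pairs):
    """Average PVI over (p_with_input, p_without_input) pairs for one dataset."""
    return mean(pvi(with_x, without_x) for with_x, without_x in pairs)


# Toy probabilities for two instances of a hypothetical dataset.
print(dataset_pvi([(0.5, 0.5), (0.8, 0.2)]))
```

A higher average means the inputs carry more usable information for that model, so the dataset is easier for it; the abstract's hypothesis is that tasks whose estimates do not differ statistically are good candidates for joint training.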

[AI-2] Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions

Link: https://arxiv.org/abs/2410.12773
Authors: Zhenyu Jiang, Yuqi Xie, Jinhan Li, Ye Yuan, Yifeng Zhu, Yuke Zhu
Keywords: potential to integrate, integrate seamlessly, Vision Language Models, human-like embodiment, human environments
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: Accepted for oral presentation at 8th Annual Conference on Robot Learning. Project website: this https URL

Abstract:Humanoid robots, with their human-like embodiment, have the potential to integrate seamlessly into human environments. Critical to their coexistence and cooperation with humans is the ability to understand natural language communications and exhibit human-like behaviors. This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions. We leverage human motion priors from extensive human motion datasets to initialize humanoid motions and employ the commonsense reasoning capabilities of Vision Language Models (VLMs) to edit and refine these motions. Our approach demonstrates the capability to produce natural, expressive, and text-aligned humanoid motions, validated through both simulated and real-world experiments. More videos can be found at this https URL.

[AI-3] Vaccinating Federated Learning for Robust Modulation Classification in Distributed Wireless Networks

Link: https://arxiv.org/abs/2410.12772
Authors: Hunmin Lee,Hongju Seong,Wonbin Kim,Hyeokchan Kwon,Daehee Seo
Keywords: Automatic modulation classification, reliable communication services, Automatic modulation, existing FL-based AMC, modulation classification
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:Automatic modulation classification (AMC) serves a vital role in ensuring efficient and reliable communication services within distributed wireless networks. Recent developments have seen a surge in interest in deep neural network (DNN)-based AMC models, with Federated Learning (FL) emerging as a promising framework. Despite these advancements, the presence of various noises within the signal poses significant challenges when optimizing models to capture salient features. Furthermore, existing FL-based AMC models commonly rely on linear aggregation strategies, which face notable difficulties in integrating locally fine-tuned parameters within practical non-IID (non-independent and identically distributed) environments, thereby hindering optimal learning convergence. To address these challenges, we propose FedVaccine, a novel FL model aimed at improving generalizability across signals with varying noise levels by deliberately introducing a balanced level of noise. This is accomplished through our proposed harmonic noise resilience approach, which identifies an optimal noise tolerance for DNN models, thereby regulating the training process and mitigating overfitting. Additionally, FedVaccine overcomes the limitations of existing FL-based AMC models’ linear aggregation by employing a split-learning strategy using structural clustering topology and local queue data structure, enabling adaptive and cumulative updates to local models. Our experimental results, including IID and non-IID datasets as well as ablation studies, confirm FedVaccine’s robust performance and superiority over existing FL-based AMC approaches across different noise levels. These findings highlight FedVaccine’s potential to enhance the reliability and performance of AMC systems in practical wireless network environments.

[AI-4] SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Link: https://arxiv.org/abs/2410.12761
Authors: Jaehong Yoon,Shoubin Yu,Vaidehi Patil,Huaxiu Yao,Mohit Bansal
Keywords: Recent advances, significantly enhanced, enhanced their ability, ability to generate, increased the risk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The first two authors contributed equally; Project page: this https URL

Abstract:Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To address these, we propose SAFREE, a novel, training-free approach for safe T2I and T2V, that does not alter the model’s weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. In the end, SAFREE ensures coherent safety checking, preserving the fidelity, quality, and safety of the output. SAFREE achieves SOTA performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
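The idea of steering a prompt embedding away from a concept subspace can be illustrated with a plain orthogonal projection. The sketch below is not SAFREE's actual procedure (its token selection, self-validating filtering, and re-attention steps are not reproduced); it only assumes that the unwanted concepts are given as a few hypothetical vectors and removes the embedding's component inside their span.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: Sequence[Sequence[float]], eps: float = 1e-12) -> List[List[float]]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: List[List[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace(embedding: Sequence[float],
                    concept_vectors: Sequence[Sequence[float]],
                    strength: float = 1.0) -> List[float]:
    """Subtract the part of `embedding` that lies in the concept subspace.

    strength=1.0 removes the component fully; smaller values remove less.
    """
    out = list(embedding)
    for b in orthonormal_basis(concept_vectors):
        c = dot(embedding, b)
        out = [oi - strength * c * bi for oi, bi in zip(out, b)]
    return out
```

With `strength=1.0` the result is orthogonal to every concept vector, while components outside the subspace are left unchanged.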

[AI-5] Unitary Multi-Margin BERT for Robust Natural Language Processing

Link: https://arxiv.org/abs/2410.12759
Authors: Hao-Yuan Chang,Kang L. Wang
Keywords: natural language processing, deep learning leave, mission-critical natural language, Bidirectional Encoder Representations, Recent developments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent developments in adversarial attacks on deep learning leave many mission-critical natural language processing (NLP) systems at risk of exploitation. To address the lack of computationally efficient adversarial defense methods, this paper reports a novel, universal technique that drastically improves the robustness of Bidirectional Encoder Representations from Transformers (BERT) by combining the unitary weights with the multi-margin loss. We discover that the marriage of these two simple ideas amplifies the protection against malicious interference. Our model, the unitary multi-margin BERT (UniBERT), boosts post-attack classification accuracies significantly by 5.3% to 73.8% while maintaining competitive pre-attack accuracies. Furthermore, the pre-attack and post-attack accuracy tradeoff can be adjusted via a single scalar parameter to best fit the design requirements for the target applications.
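The abstract does not spell out the loss, so as an assumption the sketch below shows the standard multi-class margin (hinge) loss, in the form used for example by PyTorch's `MultiMarginLoss` with `p=1`. The unitary-weight constraint and the exact formulation used by UniBERT are not reproduced; the function name is hypothetical.

```python
from typing import Sequence


def multi_margin_loss(scores: Sequence[float], target: int, margin: float = 1.0) -> float:
    """Multi-class margin loss for one example.

    Each wrong class is penalized when its score comes within `margin`
    of the score of the correct class; the sum is averaged over classes.
    """
    correct = scores[target]
    total = sum(
        max(0.0, margin - correct + s)
        for i, s in enumerate(scores)
        if i != target
    )
    return total / len(scores)
```

A larger `margin` demands a wider gap between the correct class and the rest, which is one plausible reading of how a single scalar could trade pre-attack accuracy against robustness.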

[AI-6] Counterfactual Generative Modeling with Variational Causal Inference

Link: https://arxiv.org/abs/2410.12730
Authors: Yulun Wu,Louie McConnell,Claudia Iriondo
Keywords: supervised learning approaches, individual potential outcomes, counterfactual generative modeling, gene expressions, facial images
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Estimating an individual’s potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, facial images) and covariates are relatively limited. In this case, to predict one’s outcomes under counterfactual treatments, it is crucial to leverage individual information contained in its high-dimensional observed outcome in addition to the covariates. Prior works using variational inference in counterfactual generative modeling have been focusing on neural adaptations and model variants within the conditional variational autoencoder formulation, which we argue is fundamentally ill-suited to the notion of counterfactual in causal inference. In this work, we present a novel variational Bayesian causal inference framework and its theoretical backings to properly handle counterfactual generative modeling tasks, through which we are able to conduct counterfactual supervision end-to-end during training without any counterfactual samples, and encourage latent disentanglement that aids the correct identification of causal effect in counterfactual generations. In experiments, we demonstrate the advantage of our framework compared to state-of-the-art models in counterfactual generative modeling on multiple benchmarks.

[AI-7] Transformer based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches

Link: https://arxiv.org/abs/2410.12728
Authors: Antonio Pérez,Mario Santa Cruz,Daniel San Martín,José Manuel Gutiérrez
Keywords: producing high-resolution climate, promising cost-effective downscaling, cost-effective downscaling methodology, high-resolution climate information, promising cost-effective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Super-resolution (SR) is a promising cost-effective downscaling methodology for producing high-resolution climate information from coarser counterparts. A particular application is downscaling regional reanalysis outputs (predictand) from the driving global counterparts (predictor). This study conducts an intercomparison of various SR downscaling methods focusing on temperature and using the CERRA reanalysis (5.5 km resolution, produced with a regional atmospheric model driven by ERA5) as example. The method proposed in this work is the Swin transformer and two alternative methods are used as benchmark (fully convolutional U-Net and convolutional and dense DeepESD) as well as the simple bicubic interpolation. We compare two approaches, the standard one using the full domain as input and a more scalable tiling approach, dividing the full domain into tiles that are used as input. The methods are trained to downscale CERRA surface temperature, based on temperature information from the driving ERA5; in addition, the tiling approach includes static orographic information. We show that the tiling approach, which requires spatial transferability, comes at the cost of a lower performance (although it outperforms some full-domain benchmarks), but provides an efficient scalable solution that allows SR reduction on a pan-European scale and is valuable for real-time applications.

[AI-8] HEnRY: A Multi-Agent System Framework for Multi-Domain Contexts

Link: https://arxiv.org/abs/2410.12720
Authors: Emmanuele Lacavalla,Shuyi Yang,Riccardo Crupi,Joseph E. Gonzalez
Keywords: Intesa Sanpaolo, Intesa Sanpaolo context, named HEnRY, efficient resource management, Multi-Agent System
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:This project, named HEnRY, aims to introduce a Multi-Agent System (MAS) into Intesa Sanpaolo. The name HEnRY summarizes the project’s core principles: the Hierarchical organization of agents in a layered structure for efficient resource management; Efficient optimization of resources and operations to enhance overall performance; Reactive ability of agents to quickly respond to environmental stimuli; and Yielding adaptability and flexibility of agents to handle unexpected situations. The discussion covers two distinct research paths: the first focuses on the system architecture, and the second on the collaboration between agents. This work is not limited to the specific structure of the Intesa Sanpaolo context; instead, it leverages existing research in MAS to introduce a new solution. Since Intesa Sanpaolo is organized according to a model that aligns with international corporate governance best practices, this approach could also be relevant to similar scenarios.

[AI-9] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Link: https://arxiv.org/abs/2410.12707
Authors: Zhenheng Tang,Xueze Kang,Yiming Yin,Xinglin Pan,Yuxin Wang,Xin He,Qiang Wang,Rongfei Zeng,Kaiyong Zhao,Shaohuai Shi,Amelie Chi Zhou,Bo Li,Bingsheng He,Xiaowen Chu
Keywords: large deep neural, training large deep, large language models, alleviate hardware scarcity, deep neural networks
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents the data dependency between operators. Based on this design, 1) users are allowed to customize any DNN without worrying about low-level operator implementations; 2) we enable task scheduling with more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps to 10 Gbps networks. Experimental results demonstrate that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
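To make the compression step concrete, here is a generic top-k sparsification sketch: only the k largest-magnitude entries of an activation or gradient vector are sent, together with their indices. This is an illustration of the general technique rather than the AdaTopK implementation; in particular, `keep_ratio_for_link` is a hypothetical rule for choosing how aggressively to compress on a slow link.

```python
from typing import List, Sequence, Tuple


def topk_compress(values: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k entries with the largest absolute value as (index, value) pairs."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = sorted(order[:k])  # restore index order for a stable wire format
    return [(i, values[i]) for i in kept]


def topk_decompress(pairs: Sequence[Tuple[int, float]], length: int) -> List[float]:
    """Rebuild a dense vector, filling dropped entries with zeros."""
    dense = [0.0] * length
    for i, v in pairs:
        dense[i] = v
    return dense


def keep_ratio_for_link(bandwidth_mbps: float, reference_mbps: float,
                        min_ratio: float = 0.01) -> float:
    """Hypothetical rule: slower links keep a smaller fraction of entries."""
    return max(min_ratio, min(1.0, bandwidth_mbps / reference_mbps))
```

A sender would call `topk_compress` with `k` derived from the ratio for its outgoing link, and the receiver would call `topk_decompress` before continuing the forward or backward pass.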

[AI-10] WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Link: https://arxiv.org/abs/2410.12705
Authors: Genta Indra Winata,Frederikus Hudi,Patrick Amadeus Irawan,David Anugraha,Rifki Afina Putri,Yutong Wang,Adam Nohejl,Ubaidillah Ariq Prathama,Nedjma Ousidhoum,Afifa Amriani,Anar Rzayev,Anirban Das,Ashmari Pramodya,Aulia Adila,Bryan Wilie,Candy Olivia Mawalim,Ching Lam Cheng,Daud Abolade,Emmanuele Chersoni,Enrico Santus,Fariz Ikhwantri,Garry Kuwanto,Hanyang Zhao,Haryo Akbarianto Wibowo,Holy Lovenia,Jan Christian Blaise Cruz,Jan Wira Gotama Putra,Junho Myung,Lucky Susanto,Maria Angelica Riera Machin,Marina Zhukova,Michael Anugraha,Muhammad Farid Adilazuarda,Natasha Santosa,Peerat Limkonchotiwat,Raj Dabre,Rio Alexander Audino,Samuel Cahyawijaya,Shi-Xiong Zhang,Stephanie Yulia Salim,Yi Zhou,Yinxuan Gui,David Ifeoluwa Adelani,En-Shiun Annie Lee,Shogo Okada,Ayu Purwarianti,Alham Fikri Aji,Taro Watanabe,Derry Tanti Wijaya,Alice Oh,Chong-Wah Ngo
Keywords: Vision Language Models, underrepresented cultural contexts, Vision Language, Language Models, underrepresented cultural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

[AI-11] Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization

Link: https://arxiv.org/abs/2410.12700
Authors: Xingqi Wang,Xiaoyuan Yi,Xing Xie,Jia Jia
Keywords: Recent advancements, produce harmful content, harmful content misaligned, Large Language Models, indistinguishable human-level images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2024. The dataset and code can be found at this https URL

Abstract:Recent advancements in diffusion models trained on large-scale data have enabled the generation of indistinguishable human-level images, yet they often produce harmful content misaligned with human values, e.g., social bias, and offensive content. Despite extensive research on Large Language Models (LLMs), the challenge of Text-to-Image (T2I) model alignment remains largely unexplored. Addressing this problem, we propose LiVO (Lightweight Value Optimization), a novel lightweight method for aligning T2I models with human values. LiVO only optimizes a plug-and-play value encoder to integrate a specified value principle with the input prompt, allowing the control of generated images over both semantics and values. Specifically, we design a diffusion model-tailored preference optimization loss, which theoretically approximates the Bradley-Terry model used in LLM alignment but provides a more flexible trade-off between image quality and value conformity. To optimize the value encoder, we also develop a framework to automatically construct a text-image preference dataset of 86k (prompt, aligned image, violating image, value principle) samples. Without updating most model parameters and through adaptive value selection from the input prompt, LiVO significantly reduces harmful outputs and achieves faster convergence, surpassing several strong baselines and taking an initial step towards ethically aligned T2I models.

[AI-12] Automatic Mapping of Anatomical Landmarks from Free-Text Using Large Language Models : Insights from Llama-2

Link: https://arxiv.org/abs/2410.12686
Authors: Mohamad Abdi,Gerardo Hemosillo Valadez,Halid Ziya Yerebakan
Keywords: anomaly detection, navigation and anomaly, Anatomical landmarks, landmarks, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 1 table

Abstract:Anatomical landmarks are vital in medical imaging for navigation and anomaly detection. Modern large language models (LLMs), like Llama-2, offer promise for automating the mapping of these landmarks in free-text radiology reports to corresponding positions in image data. Recent studies propose LLMs may develop coherent representations of generative processes. Motivated by these insights, we investigated whether LLMs accurately represent the spatial positions of anatomical landmarks. Through experiments with Llama-2 models, we found that they can linearly represent anatomical landmarks in space with considerable robustness to different prompts. These results underscore the potential of LLMs to enhance the efficiency and accuracy of medical imaging workflows.

[AI-13] Context Matters: Leveraging Contextual Features for Time Series Forecasting

Link: https://arxiv.org/abs/2410.12672
Authors: Sameep Chattopadhyay,Pulkit Paliwal,Sai Shankar Narasimhan,Shubhankar Agarwal,Sandeep P. Chinchali
Keywords: Time series forecasts, Time series, exogenous contextual features, series forecasts, influenced by exogenous
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasts are often influenced by exogenous contextual features in addition to their corresponding history. For example, in financial settings, it is hard to accurately predict a stock price without considering public sentiments and policy decisions in the form of news articles, tweets, etc. Though this is common knowledge, the current state-of-the-art (SOTA) forecasting models fail to incorporate such contextual information, owing to its heterogeneity and multimodal nature. To address this, we introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing pre-trained forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information, to significantly enhance the performance of existing base forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.

[AI-14] Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Link: https://arxiv.org/abs/2410.12662
Authors: Shicheng Xu,Liang Pang,Yunchang Zhu,Huawei Shen,Xueqi Cheng
Keywords: Large Vision-Language Models, Vision-language alignment, safety mechanism, Vision-Language Models, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision, which leads to vulnerabilities to toxic images. To explore the cause of this problem, we explain where and how the safety mechanism of LVLMs operates and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in the successful activation of the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This results in a semantic shift for input images compared to text in hidden states, which misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves the texts related to the input vision and uses them to guide the projection of vision into the hidden-state space of LLMs. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality but also maintains the general performance on various vision tasks (Safe and Good).

[AI-15] Evaluating Morphological Compositional Generalization in Large Language Models

Link: https://arxiv.org/abs/2410.12656
Authors: Mete Ismayilzada,Defne Circi,Jonne Sälevä,Hale Sirin,Abdullatif Köksal,Bhuwan Dhingra,Antoine Bosselut,Lonneke van der Plas,Duygu Ataman
Keywords: Large language models, natural language generation, Large language, demonstrated significant progress, generation and understanding
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 33 pages

Abstract:Large language models (LLMs) have demonstrated significant progress in various natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language similarly to humans. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.

[AI-16] Constrained Posterior Sampling: Time Series Generation with Hard Constraints

Link: https://arxiv.org/abs/2410.12652
Authors: Sai Shankar Narasimhan,Shubhankar Agarwal,Litu Rout,Sanjay Shakkottai,Sandeep P. Chinchali
Keywords: protecting user privacy, synthetic data, crucial for stress-testing, stress-testing models, models and protecting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data. In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature. Consider, for example, generating electricity demand patterns with constraints on peak demand times. This can be used to stress-test the functioning of power grids during adverse weather conditions. Existing approaches for generating constrained time series are either not scalable or degrade sample quality. To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update. Notably, CPS scales to a large number of constraints (~100) without requiring additional training. We provide theoretical justifications highlighting the impact of our projection step on sampling. Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 10% and 42%, respectively, on real-world stocks, traffic, and air quality datasets.
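The "project after each denoising update" pattern described above can be sketched as a loop that alternates an update with a projection onto the constraint set. This is a simplified illustration, not the CPS algorithm itself: the `update` callable is a hypothetical stand-in for a diffusion model's denoising step, and the constraint set is assumed to be a simple per-timestep box, for which projection reduces to clipping.

```python
from typing import Callable, List, Sequence


def project_box(x: Sequence[float], lower: Sequence[float],
                upper: Sequence[float]) -> List[float]:
    """Euclidean projection onto per-coordinate bounds lower <= x <= upper."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def sample_with_projection(x_init: Sequence[float],
                           update: Callable[[List[float], int], List[float]],
                           project: Callable[[List[float]], List[float]],
                           n_steps: int) -> List[float]:
    """Run n_steps updates, projecting onto the constraint set after each one."""
    x = list(x_init)
    for t in reversed(range(n_steps)):
        x = update(x, t)   # stand-in for one denoising update
        x = project(x)     # enforce the hard constraints
    return x
```

Because the projection runs after the final update as well, the returned sample satisfies the box constraints regardless of what the update function does.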

[AI-17] Explainable Moral Values: a neuro-symbolic approach to value classification ESWC24

Link: https://arxiv.org/abs/2410.12631
Authors: Nicolas Lazzari,Stefano De Giorgis,Aldo Gangemi,Valentina Presutti
Keywords: Machine Learning techniques, Machine Learning, Ontology Design Pattern, Moral Foundations Theory, reasoning and Machine
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at ESWC24 Satellite Event

Abstract:This work explores the integration of ontology-based reasoning and Machine Learning techniques for explainable value classification. By relying on an ontological formalization of moral values as in the Moral Foundations Theory, relying on the DnS Ontology Design Pattern, the sandra neuro-symbolic reasoner is used to infer values (formalized as descriptions) that are satisfied by a certain sentence. Sentences, alongside their structured representation, are automatically generated using an open-source Large Language Model. The inferred descriptions are used to automatically detect the value associated with a sentence. We show that only relying on the reasoner’s inference results in explainable classification comparable to other more complex approaches. We show that combining the reasoner’s inferences with distributional semantics methods largely outperforms all the baselines, including complex models based on neural network architectures. Finally, we build a visualization tool to explore the potential of theory-based values classification, which is publicly available at this http URL.

[AI-18] Exploring Model Kinship for Merging Large Language Models

Link: https://arxiv.org/abs/2410.12613
Authors: Yedi Hu,Yunzhi Yao,Ningyu Zhang,Shumin Deng,Huajun Chen
Keywords: Large Language Models, Large Language, efficiency of Large, Language Models, model kinship
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Ongoing work

Abstract:Model merging has become one of the key technologies for enhancing the capabilities and efficiency of Large Language Models (LLMs). However, our understanding of the expected performance gains and principles when merging any two models remains limited. In this work, we introduce model kinship, the degree of similarity or relatedness between LLMs, analogous to biological evolution. With comprehensive empirical analysis, we find that there is a certain relationship between model kinship and the performance gains after model merging, which can help guide our selection of candidate models. Inspired by this, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets. Specifically, we discover that using model kinship as a criterion can assist us in continuously performing model merging, alleviating the degradation (local optima) in model evolution, whereas model kinship can serve as a guide to escape these traps. Code is available at this https URL.
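One way to make "degree of similarity or relatedness between LLMs" concrete is to compare how two fine-tuned models moved away from a shared base model. The sketch below assumes, without confirmation from the abstract, that kinship is measured as the cosine similarity between the two parameter deltas; the flattened parameter lists and function names are hypothetical.

```python
import math
from typing import List, Sequence


def delta(finetuned: Sequence[float], base: Sequence[float]) -> List[float]:
    """Parameter change of a fine-tuned model relative to the shared base."""
    return [f - b for f, b in zip(finetuned, base)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an unchanged model has no direction to compare
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def model_kinship(model_a: Sequence[float], model_b: Sequence[float],
                  base: Sequence[float]) -> float:
    """Assumed kinship score: cosine similarity of the two deltas."""
    return cosine(delta(model_a, base), delta(model_b, base))
```

A greedy merging loop could then use such a score as one criterion when picking the next pair of candidate models to merge.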

[AI-19] Towards Graph Foundation Models: The Perspective of Zero-shot Reasoning on Knowledge Graphs

Link: https://arxiv.org/abs/2410.12609
Authors: Kai Wang,Siqiang Luo
Keywords: artificial general intelligence, Foundation Models, Graph Foundation Models, developing Graph Foundation, general intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 Pages, 5 figures

Abstract:Inspired by the success of artificial general intelligence, there is a trend towards developing Graph Foundation Models that excel in generalization across various graph tasks and domains. However, current models often require extensive training or fine-tuning to capture structural and semantic insights on new graphs, which limits their versatility. In this work, we explore graph foundation models from the perspective of zero-shot reasoning on Knowledge Graphs (KGs). Our focus is on utilizing KGs as a unified topological structure to tackle diverse tasks, while addressing semantic isolation challenges in KG reasoning to effectively integrate diverse semantic and structural features. This brings us new methodological insights into KG reasoning, as well as high generalizability towards foundation models in practice. Methodologically, we introduce SCORE, a unified graph reasoning framework that effectively generalizes diverse graph tasks using zero-shot learning. At the core of SCORE is semantic conditional message passing, a technique designed to capture both structural and semantic invariances in graphs, with theoretical backing for its expressive power. Practically, we evaluate the zero-shot reasoning capability of SCORE using 38 diverse graph datasets, covering node-level, link-level, and graph-level tasks across multiple domains. Our experiments reveal a substantial performance improvement over prior foundation models and supervised baselines, highlighting the efficacy and adaptability of our approach.

[AI-20] Low-Rank Adversarial PGD Attack

Link: https://arxiv.org/abs/2410.12607
Authors: Dayana Savostianova,Emanuele Zangrando,Francesco Tudisco
Keywords: Projected Gradient Descent, deep neural network, neural network models, neural network, Projected Gradient
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:Adversarial attacks on deep neural network models have seen rapid development and are extensively used to study the stability of these networks. Among various adversarial strategies, Projected Gradient Descent (PGD) is a widely adopted method in computer vision due to its effectiveness and quick implementation, making it suitable for adversarial training. In this work, we observe that in many cases, the perturbations computed using PGD predominantly affect only a portion of the singular value spectrum of the original image, suggesting that these perturbations are approximately low-rank. Motivated by this observation, we propose a variation of PGD that efficiently computes a low-rank attack. We extensively validate our method on a range of standard models as well as robust models that have undergone adversarial training. Our analysis indicates that the proposed low-rank PGD can be effectively used in adversarial training due to its straightforward and fast implementation coupled with competitive performance. Notably, we find that low-rank PGD often performs comparably to, and sometimes even outperforms, the traditional full-rank PGD attack, while using significantly less memory.

[AI-21] Self-Supervised Learning of Disentangled Representations for Multivariate Time-Series NEURIPS2024

Link: https://arxiv.org/abs/2410.12606
Authors: Ching Chang, Chiao-Tung Chan, Wei-Yao Wang, Wen-Chih Peng, Tien-Fu Chen
Keywords: fields like healthcare, healthcare and industry, industry are informative, informative but challenging, challenging due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice

Abstract:Multivariate time-series data in fields like healthcare and industry are informative but challenging due to high dimensionality and lack of labels. Recent self-supervised learning methods excel in learning rich representations without labels but struggle with disentangled embeddings and inductive bias issues like transformation-invariance. To address these challenges, we introduce TimeDRL, a framework for multivariate time-series representation learning with dual-level disentangled embeddings. TimeDRL features: (i) disentangled timestamp-level and instance-level embeddings using a [CLS] token strategy; (ii) timestamp-predictive and instance-contrastive tasks for representation learning; and (iii) avoidance of augmentation methods to eliminate inductive biases. Experiments on forecasting and classification datasets show TimeDRL outperforms existing methods, with further validation in semi-supervised settings with limited labeled data.

[AI-22] Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting

Link: https://arxiv.org/abs/2410.12593
Authors: Wei Chen, Yuxuan Liang
Keywords: sensing devices leads, spatio-temporal graph neural, spatio-temporal forecasting applications, graph neural network, air quality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.

[AI-23] Rethinking Visual Counterfactual Explanations Through Region Constraint

Link: https://arxiv.org/abs/2410.12591
Authors: Bartlomiej Sobieski, Jakub Grzywaczewski, Bartlomiej Sadlej, Matthew Tivnan, Przemyslaw Biecek
Keywords: recently gained immense, gained immense popularity, Visual counterfactual explanations, Visual counterfactual, recently gained
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Visual counterfactual explanations (VCEs) have recently gained immense popularity as a tool for clarifying the decision-making process of image classifiers. This trend is largely motivated by what these explanations promise to deliver: they indicate semantically meaningful factors that change the classifier’s decision. However, we argue that current state-of-the-art approaches lack a crucial component, the region constraint, whose absence prevents drawing explicit conclusions and may even lead to faulty reasoning due to phenomena like confirmation bias. To address the issue of previous methods, which modify images in a very entangled and widely dispersed manner, we propose region-constrained VCEs (RVCEs), which assume that only a predefined image region can be modified to influence the model’s prediction. To effectively sample from this subclass of VCEs, we propose Region-Constrained Counterfactual Schrödinger Bridges (RCSB), an adaptation of a tractable subclass of Schrödinger Bridges to the problem of conditional inpainting, where the conditioning signal originates from the classifier of interest. In addition to setting a new state-of-the-art by a large margin, we extend RCSB to allow for exact counterfactual reasoning, where the predefined region contains only the factor of interest, and to let the user actively interact with the RVCE by predefining the regions manually.

[AI-24] STRUX: An LLM for Decision-Making with Structured Explanations NAACL2025

Link: https://arxiv.org/abs/2410.12583
Authors: Yiming Lu, Yebowen Hu, Hassan Foroosh, Wei Jin, Fei Liu
Keywords: Countless decisions shape, daily lives, Countless decisions, shape our daily, Countless
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, submitted to NAACL 2025

Abstract:Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.

[AI-25] On the Utility of Domain Modeling Assistance with Large Language Models

Link: https://arxiv.org/abs/2410.12577
Authors: Meriem Ben Chaaben, Lola Burgueño, Istvan David, Houari Sahraoui
Keywords: syntactic constraints hinder, incomplete domain understanding, simplifies software development, Model-driven engineering, development through abstraction
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Model-driven engineering (MDE) simplifies software development through abstraction, yet challenges such as time constraints, incomplete domain understanding, and adherence to syntactic constraints hinder the design process. This paper presents a study to evaluate the usefulness of a novel approach utilizing large language models (LLMs) and few-shot prompt learning to assist in domain modeling. The aim of this approach is to overcome the need for extensive training of AI-based completion models on scarce domain-specific datasets and to offer versatile support for various modeling activities, providing valuable recommendations to software modelers. To support this approach, we developed MAGDA, a user-friendly tool, through which we conduct a user study and assess the real-world applicability of our approach in the context of domain modeling, offering valuable insights into its usability and effectiveness.

[AI-26] Robust RL with LLM-Driven Data Synthesis and Policy Adaptation for Autonomous Driving

Link: https://arxiv.org/abs/2410.12568
Authors: Sihao Wu, Jiaxu Liu, Xiangyu Yin, Guangliang Cheng, Meng Fang, Xingyu Zhao, Xinping Yi, Xiaowei Huang
Keywords: Large Language Models, Language Models, Large Language, purely data-driven methods, strong common sense
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:The integration of Large Language Models (LLMs) into autonomous driving systems demonstrates strong common sense and reasoning abilities, effectively addressing the pitfalls of purely data-driven methods. Current LLM-based agents require lengthy inference times and face challenges in interacting with real-time autonomous driving environments. A key open question is whether we can effectively leverage the knowledge from LLMs to train an efficient and robust Reinforcement Learning (RL) agent. This paper introduces RAPID, a novel Robust Adaptive Policy Infusion and Distillation framework, which trains specialized mix-of-policy RL agents using data synthesized by an LLM-based driving agent and online adaptation. RAPID features three key designs: 1) utilization of offline data collected from an LLM agent to distil expert knowledge into RL policies for faster real-time inference; 2) introduction of robust distillation in RL to inherit both performance and robustness from LLM-based teacher; and 3) employment of a mix-of-policy approach for joint decision decoding with a policy adapter. Through fine-tuning via online environment interaction, RAPID reduces the forgetting of LLM knowledge while maintaining adaptability to different tasks. Extensive experiments demonstrate RAPID’s capability to effectively integrate LLM knowledge into scaled-down RL policies in an efficient, adaptable, and robust way. Code and checkpoints will be made publicly available upon acceptance.

[AI-27] Development of Image Collection Method Using YOLO and Siamese Network

Link: https://arxiv.org/abs/2410.12561
Authors: Chan Young Shin, Ah Hyun Lee, Jun Young Lee, Ji Min Lee, Soo Jin Park
Keywords: collecting high-quality data, enter the era, era of big, Siamese network, Siamese
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures, 2 tables

Abstract:As we enter the era of big data, collecting high-quality data is very important. However, collecting data by humans is not only very time-consuming but also expensive. Therefore, many scientists have devised various methods to collect data using computers. Among them is web crawling, but the authors found that crawling has a problem: unintended data is collected along with the data the user intended. The authors found that this can be filtered using the object recognition model YOLOv10. However, there are cases where data that is not properly filtered remains. Here, image reclassification was performed by additionally utilizing the distance output from the Siamese network, and higher performance was recorded than with other classification models (average F1 score: YOLO+MobileNet 0.678 vs. YOLO+SiameseNet 0.772). The user can specify a distance threshold to adjust the balance between data deficiency and noise-robustness. The authors also found that the Siamese network can achieve higher performance with fewer resources because the cropped images are used for object recognition when processing images in the Siamese network (Class 20 mean-based F1 score: non-crop + Siamese (MobileNetV3-Small) 80.94 vs. crop preprocessing + Siamese (MobileNetV3-Small) 82.31). In this way, the image retrieval system that utilizes two consecutive models to reduce errors can save users’ time and effort, and build better quality data faster and with fewer resources than before.
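
The distance threshold mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `filter_by_distance`, the two-dimensional embeddings, and the threshold values are all hypothetical, and Euclidean distance is assumed as the Siamese distance.

```python
import math


def filter_by_distance(reference, candidates, threshold):
    """Keep candidates whose Euclidean distance to `reference` is at most `threshold`.

    `reference` and each candidate embedding are plain lists of floats; in a real
    pipeline they would come from a Siamese network (assumed here, not shown).
    """
    kept = []
    for name, embedding in candidates.items():
        if math.dist(reference, embedding) <= threshold:
            kept.append(name)
    return kept


# Hypothetical embeddings, for illustration only.
reference = [0.0, 0.0]
candidates = {
    "img_a": [0.1, 0.1],
    "img_b": [0.6, 0.0],
    "img_c": [2.0, 2.0],
}

strict = filter_by_distance(reference, candidates, threshold=0.5)  # fewer images, less noise
loose = filter_by_distance(reference, candidates, threshold=1.0)   # more images, more noise
```

A lower threshold keeps fewer images but admits less noise; a higher one keeps more data at the cost of letting borderline images through.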

[AI-28] A Claim Decomposition Benchmark for Long-form Answer Verification

Link: https://arxiv.org/abs/2410.12558
Authors: Zhihao Zhang, Yixing Fan, Ruqing Zhang, Jiafeng Guo
Keywords: complex long-form question, long-form question answering, question answering tasks, significantly boosted, boosted the performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by CCIR 2024

Abstract:The advancement of LLMs has significantly boosted the performance of complex long-form question answering tasks. However, one prominent issue of LLMs is the generated “hallucination” responses that are not factual. Consequently, attribution for each claim in responses becomes a common solution to improve the factuality and verifiability. Existing research mainly focuses on how to provide accurate citations for the response, and largely overlooks the importance of identifying the claims or statements in each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building a system that can identify atomic and checkworthy claims in LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. The CACDD encompasses a collection of 500 human-annotated question-answer pairs, including a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experiment results on zero-shot, few-shot and fine-tuned LLMs as baselines. The results show that claim decomposition is highly challenging and requires further exploration. All code and data are publicly available.

[AI-29] LLM-based Translation Inference with Iterative Bilingual Understanding

Link: https://arxiv.org/abs/2410.12543
Authors: Andong Chen, Kehai Chen, Yang Xiang, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
Keywords: greatly improved translation, large language models, improved translation performance, Iterative Bilingual Understanding, greatly improved
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:The remarkable understanding and generation capabilities of large language models (LLMs) have greatly improved translation performance. However, incorrect understanding of the sentence to be translated can degrade translation quality. To address this issue, we propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of LLMs and the dual characteristics of translation tasks. The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. Furthermore, the dual characteristics allow IBUT to generate effective cross-lingual feedback, iteratively refining contextual understanding, thereby reducing errors and improving translation performance. Experimental results show that the proposed IBUT outperforms several strong comparison methods and generalizes to multiple domains (e.g., news, commonsense, and cultural translation benchmarks).

[AI-30] Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making

Link: https://arxiv.org/abs/2410.12539
Authors: Stelios Triantafyllou, Aleksa Sukovic, Yasaman Zolfimoselo, Goran Radanovic
Keywords: Markov decision processes, multi-agent Markov decision, multi-agent Markov, Markov decision, explaining counterfactual outcomes
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:We address the challenge of explaining counterfactual outcomes in multi-agent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent’s action on the outcome of a realized scenario through its influence on the environment dynamics and the agents’ behavior. To achieve this, we introduce a novel causal explanation formula that decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent’s action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents’ actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consider agent-specific effects – a causal concept that quantifies the counterfactual effect of an agent’s action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept of structure-preserving interventions and attribute the effect to state variables based on their “intrinsic” contributions. Through extensive experimentation, we demonstrate the interpretability of our decomposition approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator.
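
The Shapley-value attribution step mentioned in this abstract can be sketched in general form. This is a generic exact Shapley computation, not the paper's method: `toy_effect` is a hypothetical stand-in for "the effect that propagates through a subset of agents", and its numbers are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values of the set function `value` over `players`.

    `value` takes a frozenset of players and returns a number. The cost grows
    exponentially with the number of players, so this only suits small sets.
    """
    n = len(players)
    scores = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        scores[p] = total
    return scores


def toy_effect(agents):
    """Hypothetical effect function: additive terms plus an A-B interaction."""
    base = {"A": 1.0, "B": 2.0}
    bonus = 1.0 if {"A", "B"} <= agents else 0.0
    return sum(base.get(a, 0.0) for a in agents) + bonus


scores = shapley_values(["A", "B", "C"], toy_effect)
```

In this toy example the interaction bonus is split evenly between agents A and B, agent C receives zero, and the scores sum to the effect of the full set of agents.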

[AI-31] Characterizing Behavioral Differences and Adaptations of Automated Vehicles and Human Drivers at Unsignalized Intersections: Insights from Waymo and Lyft Open Datasets

Link: https://arxiv.org/abs/2410.12538
Authors: Saeed Rahmani, Zhenlin (Gavin) Xu, Simeon C. Calvert, Bart van Arem
Keywords: transportation systems presents, enhance road safety, transportation systems, systems presents, presents an unprecedented
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: This work has been submitted to Transportation Research Record for potential publication

Abstract:The integration of autonomous vehicles (AVs) into transportation systems presents an unprecedented opportunity to enhance road safety and efficiency. However, understanding the interactions between AVs and human-driven vehicles (HVs) at intersections remains an open research question. This study aims to bridge this gap by examining behavioral differences and adaptations of AVs and HVs at unsignalized intersections by utilizing two comprehensive AV datasets from Waymo and Lyft. Using a systematic methodology, the research identifies and analyzes merging and crossing conflicts by calculating key safety and efficiency metrics, including time to collision (TTC), post-encroachment time (PET), maximum required deceleration (MRD), time advantage (TA), and speed and acceleration profiles. The findings reveal a paradox in mixed traffic flow: while AVs maintain larger safety margins, their conservative behavior can lead to unexpected situations for human drivers, potentially causing unsafe conditions. From a performance point of view, human drivers exhibit more consistent behavior when interacting with AVs versus other HVs, suggesting AVs may contribute to harmonizing traffic flow patterns. Moreover, notable differences were observed between Waymo and Lyft vehicles, which highlights the importance of considering manufacturer-specific AV behaviors in traffic modeling and management strategies for the safe integration of AVs. The processed dataset utilized in this study is openly published to foster the research on AV-HV interactions.
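
Two of the safety metrics named in this abstract, time to collision (TTC) and post-encroachment time (PET), can be sketched using their common textbook definitions. The paper's exact formulation for merging and crossing conflicts may differ; the function names and numbers below are illustrative assumptions.

```python
import math


def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """TTC for a follower closing in on a leader along the same path.

    Returns infinity when the follower is not closing the gap.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


def post_encroachment_time(first_exit_s, second_entry_s):
    """PET: time between the first vehicle leaving the conflict area
    and the second vehicle entering it."""
    return second_entry_s - first_exit_s


# Illustrative values only (not taken from the Waymo or Lyft datasets).
ttc = time_to_collision(gap_m=20.0, follower_speed_mps=15.0, leader_speed_mps=10.0)
pet = post_encroachment_time(first_exit_s=12.4, second_entry_s=14.1)
```

Smaller TTC or PET values indicate a tighter safety margin between the two vehicles.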

[AI-32] Is Complex Query Answering Really Complex?

Link: https://arxiv.org/abs/2410.12537
Authors: Cosimo Gregucci, Bo Xiong, Daniel Hernandez, Lorenzo Loconte, Pasquale Minervini, Steffen Staab, Antonio Vergari
Keywords: Complex query answering, challenging reasoning task, knowledge graphs, reasoning task, gaining momentum
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Complex query answering (CQA) on knowledge graphs (KGs) is gaining momentum as a challenging reasoning task. In this paper, we show that the current benchmarks for CQA are not really complex, and the way they are built distorts our perception of progress in this field. For example, we find that in these benchmarks, most queries (up to 98% for some query types) can be reduced to simpler problems, e.g., link prediction, where only one link needs to be predicted. The performance of state-of-the-art CQA models drops significantly when such models are evaluated on queries that cannot be reduced to easier types. Thus, we propose a set of more challenging benchmarks, composed of queries that require models to reason over multiple hops and better reflect the construction of real-world KGs. In a systematic empirical investigation, the new benchmarks show that current CQA methods leave much to be desired.

[AI-33] QueensCAMP: an RGB-D dataset for robust Visual SLAM

Link: https://arxiv.org/abs/2410.12520
Authors: Hudson M. S. Bruno, Esther L. Colombini, Sidney N. Givigi Jr
Keywords: Visual Simultaneous Localization, Visual Simultaneous, Localization and Mapping, Simultaneous Localization, robotics applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:Visual Simultaneous Localization and Mapping (VSLAM) is a fundamental technology for robotics applications. While VSLAM research has achieved significant advancements, its robustness under challenging situations, such as poor lighting, dynamic environments, motion blur, and sensor failures, remains a challenging issue. To address these challenges, we introduce a novel RGB-D dataset designed for evaluating the robustness of VSLAM systems. The dataset comprises real-world indoor scenes with dynamic objects, motion blur, and varying illumination, as well as emulated camera failures, including lens dirt, condensation, underexposure, and overexposure. Additionally, we offer open-source scripts for injecting camera failures into any images, enabling further customization by the research community. Our experiments demonstrate that ORB-SLAM2, a traditional VSLAM algorithm, and TartanVO, a Deep Learning-based VO algorithm, can experience performance degradation under these challenging conditions. Therefore, this dataset and the camera failure open-source tools provide a valuable resource for developing more robust VSLAM systems capable of handling real-world challenges.

[AI-34] Benchmarking Defeasible Reasoning with Large Language Models – Initial Experiments and Future Directions KR

Link: https://arxiv.org/abs/2410.12509
Authors: Ilias Tachmazidis, Sotiris Batsakis, Grigoris Antoniou
Keywords: Large Language Models, Large Language, Language Models, exceptional performance, gained prominence
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2024 (arXiv:2410.05339)

Abstract:Large Language Models (LLMs) have gained prominence in the AI landscape due to their exceptional performance. Thus, it is essential to gain a better understanding of their capabilities and limitations, among others in terms of nonmonotonic reasoning. This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns. We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for LLMs. We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.

[AI-35] DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning ICASSP2025

Link: https://arxiv.org/abs/2410.12501
Authors: Jiabao Wei, Zhiyuan Ma
Keywords: online shopping scenarios, synthesis specific person, recently receives numerous, receives numerous attention, specific person images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 6 figures, ICASSP2025

Abstract:Virtual Try-ON (VTON) aims to synthesize images of a specific person dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task mainly lie in the fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation and effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. Building on a well-built pre-trained paint-by-example (abbr. PBE) approach, we present our DH-VTON pipeline in this work. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which can be trained to align the large-scale intrinsic knowledge with deep text semantics (e.g., “neckline” or “girdle”) to make up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce the Garment-Feature ControlNet Plus (abbr. GFC+) module and propose to leverage a fresh hybrid attention strategy for training, which can adaptively integrate fine-grained characteristics of the garments into the different layers of the VTON model, so as to achieve multi-scale feature preservation. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.

[AI-36] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective NEURIPS2024

Link: https://arxiv.org/abs/2410.12490
Authors: Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing
Keywords: Latent Diffusion Models, Latent-based image generative, Mask Image Models, achieved notable success, latent space
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2024

Abstract:Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders like VQGAN or VAE to encode pixels into a more compact latent space and learn the data distribution in the latent space instead of directly from pixels. However, this practice raises a pertinent question: Is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive model GPT has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation with the next token prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, which also exhibits substantial improvement akin to GPT when scaling up model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is publicly available.

[AI-37] Stable Object Placement Planning From Contact Point Robustness

Link: https://arxiv.org/abs/2410.12483
Authors: Philippe Nadeau, Jonathan Kelly
Keywords: stably placing objects, guide robot manipulators, intricate scenes, designed to guide, manipulators in stably
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Transactions on Robotics. Contains 14 pages, 11 figures, and 3 tables

Abstract:We introduce a planner designed to guide robot manipulators in stably placing objects within intricate scenes. Our proposed method reverses the traditional approach to object placement: our planner selects contact points first and then determines a placement pose that solicits the selected points. This is instead of sampling poses, identifying contact points, and evaluating pose quality. Our algorithm facilitates stability-aware object placement planning, imposing no restrictions on object shape, convexity, or mass density homogeneity, while avoiding combinatorial computational complexity. Our proposed stability heuristic enables our planner to find a solution about 20 times faster when compared to the same algorithm not making use of the heuristic and eight times faster than a state-of-the-art method that uses the traditional sample-and-evaluate approach. Our proposed planner is also more successful in finding stable placements than the five other benchmarked algorithms. Derived from first principles and validated in ten real robot experiments, our planner offers a general and scalable method to tackle the problem of object placement planning with rigid objects.

[AI-38] SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Link: https://arxiv.org/abs/2410.12481
Authors: Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
Keywords: Large Language Models, Large Language, Language Models, sequential decision-making tasks, solving textual sequential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.

[AI-39] KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Link: https://arxiv.org/abs/2410.12480
Authors: Yongqin Xu, Huan Li, Ke Chen, Lidan Shou
Keywords: integration and management, crucial for data, data integration, entity matching tasks, Knowledge-Compliant Matching Framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Schema and entity matching tasks are crucial for data integration and management. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. In this paper, we present the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a pseudo-code-based task decomposition strategy to adopt task-specific natural language statements that guide LLM reasoning and reduce confusion. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Additionally, we introduce a result-ensembling strategy to leverage multiple knowledge sources and suppress poorly formatted outputs. Comprehensive evaluations on schema and entity matching tasks demonstrate that KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by an average F1 score of 22.9% and competes effectively with SOTA fine-tuned LLMs. Moreover, KcMF generalizes well across different LLMs.

[AI-40] Unifying Economic and Language Models for Enhanced Sentiment Analysis of the Oil Market

Link: https://arxiv.org/abs/2410.12473
Authors: Himmet Kaplan,Ralf-Peter Mundani,Heiko Rölke,Albert Weichselbraun,Martin Tschudy
Keywords (EN): political events, Generative Pre-trained Transformer, global economy, critical component, Crude oil
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Crude oil, a critical component of the global economy, has its prices influenced by various factors such as economic trends, political events, and natural disasters. Traditional prediction methods based on historical data have their limits in forecasting, but recent advancements in natural language processing bring new possibilities for event-based analysis. In particular, Language Models (LM) and their advancement, the Generative Pre-trained Transformer (GPT), have shown potential in classifying vast amounts of natural language. However, these LMs often have difficulty with domain-specific terminology, limiting their effectiveness in the crude oil sector. Addressing this gap, we introduce CrudeBERT, a fine-tuned LM specifically for the crude oil market. The results indicate that CrudeBERT’s sentiment scores align more closely with the WTI Futures curve and significantly enhance price predictions, underscoring the crucial role of integrating economic principles into LMs.

[AI-41] Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

Link: https://arxiv.org/abs/2410.12468
Authors: Zhi Chen,Lingxiao Jiang
Keywords (EN): advanced agentic workflows, AI-based software engineering, recent years, agentic workflows, major leap
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages of main content and 2 pages of references

Abstract:In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and interacting with external environments, offer promising solutions to complex software engineering tasks. However, while much research has evaluated code generated by large language models (LLMs), comprehensive studies on agent-generated patches, particularly in real-world settings, are lacking. This study addresses that gap by evaluating 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues from SWE-Bench Verified, focusing on their impact on code quality. Our analysis shows no single agent dominated, with 170 issues unresolved, indicating room for improvement. Even for patches that passed unit tests and resolved issues, agents made different file and function modifications compared to the gold patches from repository developers, revealing limitations in the benchmark’s test case coverage. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities; while some agents increased code complexity, many reduced code duplication and minimized code smells. Finally, agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve effectiveness. This study provides the first comprehensive evaluation of agent-generated patches on real-world GitHub issues, offering insights to advance AI-driven software development.

[AI-42] Sharpness-Aware Black-Box Optimization

Link: https://arxiv.org/abs/2410.12457
Authors: Feiyang Ye,Yueming Lyu,Xuehao Wang,Masashi Sugiyama,Yu Zhang,Ivor Tsang
Keywords (EN): including reinforcement learning, machine learning problems, Black-box optimization, Black-box optimization algorithms, black-box optimization methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures

Abstract:Black-box optimization algorithms have been widely used in various machine learning problems, including reinforcement learning and prompt fine-tuning. However, directly optimizing the training loss value, as commonly done in existing black-box optimization methods, could lead to suboptimal model quality and generalization performance. To address those problems in black-box optimization, we propose a novel Sharpness-Aware Black-box Optimization (SABO) algorithm, which applies a sharpness-aware minimization strategy to improve the model generalization. Specifically, the proposed SABO method first reparameterizes the objective function by its expectation over a Gaussian distribution. Then it iteratively updates the parameterized distribution by approximated stochastic gradients of the maximum objective value within a small neighborhood around the current solution in the Gaussian distribution space. Theoretically, we prove the convergence rate and generalization bound of the proposed SABO algorithm. Empirically, extensive experiments on the black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving model generalization performance.
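
The first step named in the abstract, replacing the objective by its expectation over a Gaussian, can be illustrated with a small sketch. It only shows a standard score-function gradient estimate for the smoothed objective with respect to the Gaussian mean; the isotropic fixed `sigma`, the function names, and the toy objective are assumptions made for illustration, and the sharpness-aware inner maximization and the authors' actual update rule are not reproduced here.

```python
import numpy as np

def smoothed_gradient(f, mu, sigma, n_samples, rng):
    """Estimate d/d(mu) of E_{x ~ N(mu, sigma^2 I)}[f(x)] using only evaluations of f.

    Uses the identity grad = E[f(mu + sigma * eps) * eps] / sigma with eps ~ N(0, I),
    and subtracts the sample mean of f as a variance-reducing baseline.
    """
    eps = rng.standard_normal((n_samples, mu.size))
    values = f(mu + sigma * eps)          # f is queried as a black box, shape (n_samples,)
    centered = values - values.mean()     # baseline (hypothetical choice)
    return (centered[:, None] * eps).mean(axis=0) / sigma

def sphere(x):
    # Toy black-box objective (illustrative only): sum of squares per row.
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
grad = smoothed_gradient(sphere, mu, sigma=0.5, n_samples=200_000, rng=rng)
mu_next = mu - 0.1 * grad                 # one plain gradient step on the Gaussian mean
```

For this toy objective the smoothed gradient with respect to the mean is exactly `2 * mu`, so the estimate can be checked against `[2, -4]`.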

[AI-43] Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Link: https://arxiv.org/abs/2410.12445
Authors: Hyeonwoo Kim,Dahyun Kim,Jihoo Kim,Sukyung Lee,Yungi Kim,Chanjun Park
Keywords (EN): benchmarking Korean Large, Large Language Models, Korean Large Language, Open Ko-LLM Leaderboard, Korean Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Open Ko-LLM Leaderboard has been instrumental in benchmarking Korean Large Language Models (LLMs), yet it has certain limitations. Notably, the disconnect between quantitative improvements on the overly academic leaderboard benchmarks and the qualitative impact of the models should be addressed. Furthermore, the benchmark suite is largely composed of translated versions of their English counterparts, which may not fully capture the intricacies of the Korean language. To address these issues, we propose Open Ko-LLM Leaderboard2, an improved version of the earlier Open Ko-LLM Leaderboard. The original benchmarks are entirely replaced with new tasks that are more closely aligned with real-world capabilities. Additionally, four new native Korean benchmarks are introduced to better reflect the distinct characteristics of the Korean language. Through these refinements, Open Ko-LLM Leaderboard2 seeks to provide a more meaningful evaluation for advancing Korean LLMs.

[AI-44] Reconstruction of Differentially Private Text Sanitization via Large Language Models

Link: https://arxiv.org/abs/2410.12443
Authors: Shuchao Pang,Zhigang Lu,Haichen Wang,Peng Fu,Yongbin Zhou,Minhui Xue,Bo Li
Keywords (EN): large language models, facto privacy standard, Differential privacy, privacy leakage attacks, including many recently
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Differential privacy (DP) is the de facto privacy standard against privacy leakage attacks, including many recently discovered ones against large language models (LLMs). However, we discovered that LLMs could reconstruct the altered/removed privacy from given DP-sanitized prompts. We propose two attacks (black-box and white-box) based on the accessibility to LLMs and show that LLMs could connect the pair of DP-sanitized text and the corresponding private training data of LLMs by giving sample text pairs as instructions (in the black-box attacks) or fine-tuning data (in the white-box attacks). To illustrate our findings, we conduct comprehensive experiments on modern LLMs (e.g., LLaMA-2, LLaMA-3, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Claude-3, Claude-3.5, OPT, GPT-Neo, GPT-J, Gemma-2, and Pythia) using commonly used datasets (such as WikiMIA, Pile-CC, and Pile-Wiki) against both word-level and sentence-level DP. The experimental results show promising recovery rates, e.g., the black-box attacks against the word-level DP over WikiMIA dataset gave 72.18% on LLaMA-2 (70B), 82.39% on LLaMA-3 (70B), 75.35% on Gemma-2, 91.2% on ChatGPT-4o, and 94.01% on Claude-3.5 (Sonnet). More urgently, this study indicates that these well-known LLMs have emerged as a new security risk for existing DP text sanitization approaches in the current environment.

[AI-45] Conformity in Large Language Models

Link: https://arxiv.org/abs/2410.12428
Authors: Xiaochen Zhu,Caiqi Zhang,Tom Stafford,Nigel Collier,Andreas Vlachos
Keywords (EN): conformity effect describes, effect describes, describes the tendency, tendency of individuals, individuals to align
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages (8 pages main body), 14 figures

Abstract:The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions–Devil’s Advocate and Question Distillation–to mitigate conformity, providing insights into building more robust language models.

[AI-46] Privacy-Preserving Synthetically Augmented Knowledge Graphs with Semantic Utility

Link: https://arxiv.org/abs/2410.12418
Authors: Luigi Bellomarini,Costanza Catalano,Andrea Coletta,Michela Iezzi,Pierangela Samarati
Keywords (EN): recently gained relevant, gained relevant attention, application domains, healthcare to biotechnology, logistics to finance
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 32 pages, 5 figures

Abstract:Knowledge Graphs (KGs) have recently gained relevant attention in many application domains, from healthcare to biotechnology, from logistics to finance. Financial organisations, central banks, economic research entities, and national supervision authorities apply ontological reasoning on KGs to address crucial business tasks, such as economic policymaking, banking supervision, anti-money laundering, and economic research. Reasoning allows for the generation of derived knowledge capturing complex business semantics and the set up of effective business processes. A major obstacle in KGs sharing is represented by privacy considerations since the identity of the data subjects and their sensitive or company-confidential information may be improperly exposed. In this paper, we propose a novel framework to enable KGs sharing while ensuring that information that should remain private is not directly released nor indirectly exposed via derived knowledge, while maintaining the embedded knowledge of the KGs to support business downstream tasks. Our approach produces a privacy-preserving synthetic KG as an augmentation of the input one via the introduction of structural anonymisation. We introduce a novel privacy measure for KGs, which considers derived knowledge and a new utility metric that captures the business semantics we want to preserve, and propose two novel anonymization algorithms. Our extensive experimental evaluation, with both synthetic graphs and real-world datasets, confirms the effectiveness of our approach achieving up to a 70% improvement in the privacy of entities compared to existing methods not specifically designed for KGs.

[AI-47] Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features

Link: https://arxiv.org/abs/2410.12416
Authors: Jonghwan Hyeon,Yung-Hwan Oh,Ho-Jin Choi
Keywords (EN): Speech Emotion Recognition, analyzes human emotions, human emotions expressed, Emotion Recognition, human emotions
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Speech Emotion Recognition (SER) analyzes human emotions expressed through speech. Self-supervised learning (SSL) offers a promising approach to SER by learning meaningful representations from a large amount of unlabeled audio data. However, existing SSL-based methods rely on Global Average Pooling (GAP) to represent audio signals, treating speech and non-speech segments equally. This can lead to dilution of informative speech features by irrelevant non-speech information. To address this, the paper proposes Segmental Average Pooling (SAP), which selectively focuses on informative speech segments while ignoring non-speech segments. By applying both GAP and SAP to SSL features, our approach utilizes overall speech signal information from GAP and specific information from SAP, leading to improved SER performance. Experiments show state-of-the-art results on the IEMOCAP for English and superior performance on KEMDy19 for Korean datasets in both unweighted and weighted accuracies.
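
The difference between GAP and SAP described above can be sketched on frame-level features. This is a minimal illustration, not the authors' implementation: the boolean per-frame `speech_mask`, the fallback for utterances without speech, and the concatenation of the two pooled vectors are all assumptions, since the abstract does not say how segments are detected or how the two representations are combined.

```python
import numpy as np

def global_average_pool(features):
    # GAP: average over all frames, speech and non-speech alike.
    return features.mean(axis=0)

def segmental_average_pool(features, speech_mask):
    # SAP (sketch): average only over frames flagged as speech.
    mask = np.asarray(speech_mask, dtype=bool)
    if not mask.any():
        # Hypothetical fallback when no speech frame is present.
        return features.mean(axis=0)
    return features[mask].mean(axis=0)

def pooled_utterance_embedding(features, speech_mask):
    # Assumed combination: concatenate the GAP and SAP vectors.
    return np.concatenate([
        global_average_pool(features),
        segmental_average_pool(features, speech_mask),
    ])

# Four frames of 2-dimensional SSL features; frames 1 and 3 contain speech.
features = np.array([[1.0, 1.0], [3.0, 5.0], [0.0, 0.0], [5.0, 7.0]])
speech_mask = [False, True, False, True]
embedding = pooled_utterance_embedding(features, speech_mask)
```

In this toy input the non-speech frames pull the GAP half of the embedding toward smaller values, while the SAP half reflects only the two speech frames.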

[AI-48] Revealing the Barriers of Language Agents in Planning

Link: https://arxiv.org/abs/2410.12409
Authors: Jian Xie,Kexun Zhang,Jiangjie Chen,Siyu Yuan,Kai Zhang,Yikai Zhang,Lei Li,Yanghua Xiao
Keywords (EN): ongoing pursuit, inception of artificial, planning, Autonomous planning, artificial intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Autonomous planning has been an ongoing pursuit since the inception of artificial intelligence. Based on curated problem solvers, early planning agents could deliver precise solutions for specific tasks but lacked generalization. The emergence of large language models (LLMs) and their powerful reasoning capabilities has reignited interest in autonomous planning by automatically generating reasonable solutions for given tasks. However, prior research and our experiments show that current language agents still lack human-level planning abilities. Even the state-of-the-art reasoning model, OpenAI o1, achieves only 15.6% on one of the complex real-world planning benchmarks. This highlights a critical question: What hinders language agents from achieving human-level planning? Although existing studies have highlighted weak performance in agent planning, the deeper underlying issues and the mechanisms and limitations of the strategies proposed to address them remain insufficiently understood. In this work, we apply the feature attribution study and identify two key factors that hinder agent planning: the limited role of constraints and the diminishing influence of questions. We also find that although current strategies help mitigate these challenges, they do not fully resolve them, indicating that agents still have a long way to go before reaching human-level intelligence.

[AI-49] A Fast Convoluted Story: Scaling Probabilistic Inference for Integer Arithmetic

Link: https://arxiv.org/abs/2410.12389
Authors: Lennert De Smet,Pedro Zuidberg Dos Martires
Keywords (EN): modelling combinatorial problems, powerful tool, tool for modelling, modelling combinatorial, integer linear programming
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As illustrated by the success of integer linear programming, linear integer arithmetic is a powerful tool for modelling combinatorial problems. Furthermore, the probabilistic extension of linear programming has been used to formulate problems in neurosymbolic AI. However, two key problems persist that prevent the adoption of neurosymbolic techniques beyond toy problems. First, probabilistic inference is inherently hard, #P-hard to be precise. Second, the discrete nature of integers renders the construction of meaningful gradients challenging, which is problematic for learning. In order to mitigate these issues, we formulate linear arithmetic over integer-valued random variables as tensor manipulations that can be implemented in a straightforward fashion using modern deep learning libraries. At the core of our formulation lies the observation that the addition of two integer-valued random variables can be performed by adapting the fast Fourier transform to probabilities in the log-domain. By relying on tensor operations we obtain a differentiable data structure, which unlocks, virtually for free, gradient-based learning. In our experimental validation we show that tensorising probabilistic linear integer arithmetic and leveraging the fast Fourier transform allows us to push the state of the art by several orders of magnitude in terms of inference and learning times.
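
The core observation, that adding two independent integer-valued random variables amounts to convolving their probability mass functions and that the fast Fourier transform can do this efficiently, can be sketched as follows. This sketch works with ordinary probabilities in NumPy; the log-domain adaptation and the differentiable tensor formulation described in the abstract are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def add_integer_rvs(pmf_x, pmf_y):
    """PMF of X + Y for independent X and Y given as arrays over consecutive integers.

    If pmf_x starts at integer a and pmf_y at integer b, the result starts at a + b.
    """
    n = len(pmf_x) + len(pmf_y) - 1
    # Zero-pad to the full output length so the circular convolution equals the linear one.
    spectrum = np.fft.rfft(pmf_x, n) * np.fft.rfft(pmf_y, n)
    pmf_sum = np.fft.irfft(spectrum, n)
    # Remove tiny negative values caused by floating-point round-off.
    return np.clip(pmf_sum, 0.0, None)

die = np.full(6, 1.0 / 6.0)            # fair die, index i corresponds to face i + 1
two_dice = add_integer_rvs(die, die)   # index i corresponds to total i + 2
```

The result agrees with a direct convolution of the two mass functions; the FFT route mainly pays off when the supports are large.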

[AI-50] HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Link: https://arxiv.org/abs/2410.12381
Authors: Fengji Zhang,Linquan Wu,Huiyu Bai,Guancheng Lin,Xiao Li,Xiao Yu,Yue Wang,Bei Chen,Jacky Keung
Keywords (EN): Artificial General Intelligence, advancing Artificial General, evaluating Large Language, Large Language Models, General Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: homepage this https URL

Abstract:Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs – core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs’ visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs’ capabilities. We have open-sourced our code and benchmark at this https URL.
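
For readers unfamiliar with the pass@1 and pass@10 figures quoted above, the sketch below shows the unbiased pass@k estimator commonly used in code-generation evaluation. Whether HumanEval-V computes its scores in exactly this way is an assumption; the abstract does not say, and the sample counts in the usage lines are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated solutions of which c are correct, passes the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 20 samples for one task, 2 of them correct.
single_try = pass_at_k(n=20, c=2, k=1)
ten_tries = pass_at_k(n=20, c=2, k=10)
```

A benchmark-level score is then the mean of this per-task value over all tasks.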

[AI-51] ShapefileGPT: A Multi-Agent Large Language Model Framework for Automated Shapefile Processing

Link: https://arxiv.org/abs/2410.12376
Authors: Qingming Lin,Rui Hu,Huaxia Li,Sensen Wu,Yadong Li,Kai Fang,Hailin Feng,Zhenhong Du,Liuchang Xu
Keywords (EN): representing geospatial information, geographic information science, core data structures, Vector data, GIS vector data
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Vector data is one of the two core data structures in geographic information science (GIS), essential for accurately storing and representing geospatial information. Shapefile, the most widely used vector data format, has become the industry standard supported by all major geographic information systems. However, processing this data typically requires specialized GIS knowledge and skills, creating a barrier for researchers from other fields and impeding interdisciplinary research in spatial data analysis. Moreover, while large language models (LLMs) have made significant advancements in natural language processing and task automation, they still face challenges in handling the complex spatial and topological relationships inherent in GIS vector data. To address these challenges, we propose ShapefileGPT, an innovative framework powered by LLMs, specifically designed to automate Shapefile tasks. ShapefileGPT utilizes a multi-agent architecture, in which the planner agent is responsible for task decomposition and supervision, while the worker agent executes the tasks. We developed a specialized function library for handling Shapefiles and provided comprehensive API documentation, enabling the worker agent to operate Shapefiles efficiently through function calling. For evaluation, we developed a benchmark dataset based on authoritative textbooks, encompassing tasks in categories such as geometric operations and spatial queries. ShapefileGPT achieved a task success rate of 95.24%, outperforming the GPT series models. In comparison to traditional LLMs, ShapefileGPT effectively handles complex vector data analysis tasks, overcoming the limitations of traditional LLMs in spatial analysis. This breakthrough opens new pathways for advancing automation and intelligence in the GIS field, with significant potential in interdisciplinary data analysis and application contexts.

[AI-52] PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

Link: https://arxiv.org/abs/2410.12375
Authors: Markus J. Buehler
Keywords (EN): Modeling for Exploratory, Preference-based Recursive Language, Reinforcement Learning, Recursive Language Modeling, concepts from Reinforcement
Categories: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
Comments:

Abstract:PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning) combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach through iterative reasoning improvements. We propose a recursive learning approach that engages the model in multi-step reasoning, revisiting, and refining intermediate steps before producing a final output in training and inference phases. Through multiple training stages, the model first learns to align its reasoning with accurate decision paths by optimizing the log odds between preferred and non-preferred responses. During this process, PRefLexOR builds a dynamic knowledge graph by generating questions from random text chunks and retrieval-augmentation to contextualize relevant details from the entire training corpus. In the second stage, preference optimization enhances model performance by using rejection sampling to fine-tune reasoning quality by continually producing in-situ training data while masking the reasoning steps. Recursive optimization within a thinking token framework introduces iterative feedback loops, where the model refines reasoning, achieving deeper coherence, consistency, and adaptability. Implemented in small language models with only 3 billion parameters, we show that even tiny models can iteratively teach themselves to reason with greater depth and reflectivity. Our implementation is straightforward and can be incorporated into any existing pretrained LLM. We focus our examples on applications in biological materials science and demonstrate the method in a variety of case studies that range from in-domain to cross-domain applications. Using reasoning strategies that include thinking and reflection modalities we build a multi-agent recursive self-improving inference approach to successively improve responses via repeated sampling in inference time.

[AI-53] Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

Link: https://arxiv.org/abs/2410.12361
Authors: Yaxi Lu,Shenzhi Yang,Cheng Qian,Guirong Chen,Qinyu Luo,Yesai Wu,Huadong Wang,Xin Cong,Zhong Zhang,Yankai Lin,Weiwen Liu,Yasheng Wang,Zhiyuan Liu,Fangming Liu,Maosong Sun
Keywords (EN): shown remarkable abilities, solving complex tasks, large language models, powered by large, large language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures

Abstract:Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.

[AI-54] Towards Neural Scaling Laws for Time Series Foundation Models

Link: https://arxiv.org/abs/2410.12360
Authors: Qingren Yao,Chao-Han Huck Yang,Renhe Jiang,Yuxuan Liang,Ming Jin,Shirui Pan
Keywords (EN): offer valuable insights, time series foundation, laws offer valuable, series foundation models, Scaling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, compute budgets, and dataset sizes. Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling. The encoder-only Transformers demonstrate better scalability than the decoder-only Transformers, while the architectural enhancements in the two advanced TSFMs primarily improve ID performance but reduce OOD scalability. While scaling up TSFMs is expected to drive performance breakthroughs, the lack of a comprehensive understanding of TSFM scaling laws has hindered the development of a robust framework to guide model scaling. We fill this gap in this work by synthesizing our findings and providing practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.
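
As a rough illustration of what fitting a scaling law involves, the sketch below fits a power law of the form loss = a * N^(-b) by linear regression in log-log space. The power-law form, the function name, and the synthetic data points are assumptions for illustration only; the abstract does not state the functional form the authors fit, and none of the numbers come from the paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit losses ~ a * sizes**(-b) for positive losses; returns (a, b)."""
    # In log space the power law is a straight line: log L = log a - b * log N.
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free points generated from a = 5.0, b = 0.1 (not real measurements).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 5.0 * sizes ** -0.1
a, b = fit_power_law(sizes, losses)
```

Comparing the fitted exponent `b` across architectures, or between ID and OOD evaluation sets, is one simple way to express the kind of scalability comparison the abstract describes.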

[AI-55] GECTurk WEB: An Explainable Online Platform for Turkish Grammatical Error Detection and Correction

Link: https://arxiv.org/abs/2410.12350
Authors: Ali Gebeşçe,Gözde Gül Şahin
Keywords (EN): English and Chinese, Sophisticated grammatical error, Sophisticated grammatical, grammatical error detection, small set
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sophisticated grammatical error detection/correction tools are available for a small set of languages such as English and Chinese. However, it is not straightforward – if not impossible – to adapt them to morphologically rich languages with complex writing rules like Turkish, which has more than 80 million speakers. Even though several tools exist for Turkish, they primarily focus on spelling errors rather than grammatical errors and lack features such as web interfaces, error explanations and feedback mechanisms. To fill this gap, we introduce GECTurk WEB, a light, open-source, and flexible web-based system that can detect and correct the most common forms of Turkish writing errors, such as the misuse of diacritics, compound and foreign words, pronouns, and light verbs, along with spelling mistakes. Our system provides native speakers and second language learners an easily accessible tool to detect/correct such mistakes and also to learn from their mistakes by showing the explanation for the violated rule(s). The proposed system achieves a system usability score of 88.3, and is shown to help learn/remember a grammatical rule (confirmed by 80% of the participants). The GECTurk WEB is available both as an offline tool at this https URL and online at this http URL.

[AI-56] TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant

Link: https://arxiv.org/abs/2410.12342
Authors: Guopeng Li,Qiang Wang,Ke Yan,Shouhong Ding,Yuan Gao,Gui-Song Xia
Keywords (EN): methodologies predominantly focus, convolutional neural networks, methodologies predominantly, similar architectures, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, and 12 tables

Abstract:Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving the feature alignments in CAKD. Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released.
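
The idea of aligning features with an InfoNCE loss after spatial smoothing, instead of a pixel-wise MSE, can be sketched in NumPy. This is a generic InfoNCE over a batch with matched student/teacher pairs as positives; using global average pooling as the smoothing step, the temperature value, and all function names are assumptions, since the abstract does not give the authors' exact formulation.

```python
import numpy as np

def spatially_smoothed(feature_maps):
    # Assumed smoothing: global average over the spatial axes of (batch, channels, H, W).
    return feature_maps.mean(axis=(2, 3))

def info_nce(student, teacher, temperature=0.1):
    """InfoNCE over a batch: row i of student should match row i of teacher."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
teacher_maps = rng.normal(size=(8, 16, 4, 4))
student_maps = teacher_maps + 0.01 * rng.normal(size=teacher_maps.shape)

loss_aligned = info_nce(spatially_smoothed(student_maps), spatially_smoothed(teacher_maps))
# Pairing each student with the wrong teacher sample should give a larger loss.
loss_mismatched = info_nce(spatially_smoothed(student_maps), spatially_smoothed(teacher_maps[::-1]))
```

Because the loss is computed on spatially pooled vectors, it does not depend on where in the feature map a pattern appears, which is the property the abstract contrasts with pixel-wise MSE.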

[AI-57] A linguistic analysis of undesirable outcomes in the era of generative AI

Link: https://arxiv.org/abs/2410.12341
Authors: Daniele Gambetta, Gizem Gezici, Fosca Giannotti, Dino Pedreschi, Alistair Knott, Luca Pappalardo
Keywords: Recent research, generated content, posing scientific, research has focused, medium and long-term
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent research has focused on the medium and long-term impacts of generative AI, posing scientific and societal challenges mainly due to the detection and reliability of machine-generated information, which is projected to form the major content on the Web soon. Prior studies show that LLMs exhibit a lower performance in generation tasks (model collapse) as they undergo a fine-tuning process across multiple generations on their own generated content (self-consuming loop). In this paper, we present a comprehensive simulation framework built upon the chat version of LLama2, focusing particularly on the linguistic aspects of the generated content, which have not been fully examined in existing studies. Our results show that the model produces less lexically rich content across generations, reducing diversity. Lexical richness is measured using entropy and type-token ratio (TTR), as well as POS-tag frequencies. The generated content is also examined with an n-gram analysis, which takes word order into account, and with semantic networks, which consider the relations between different words. These findings suggest that model collapse occurs not only through decreasing content diversity but also through distortion of the underlying linguistic patterns of the generated text. Both effects highlight the critical importance of carefully choosing and curating the initial input text, which can alleviate the model collapse problem. Furthermore, we conduct a qualitative analysis of the fine-tuned models of the pipeline to compare their performance on generic NLP tasks to that of the original model. We find that autophagy transforms the initial model into a more creative, doubtful and confused one, which might provide inaccurate answers and include conspiracy theories in the model responses, spreading false and biased information on the Web.
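
The two lexical-richness measures named in the abstract are easy to state in code. The sketch below is a minimal illustration with whitespace tokenization and made-up toy sentences; it is not the authors' evaluation pipeline, and the function names are assumptions.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy "generations": the later text repeats itself far more.
gen_0 = "the cat sat on the mat while a dog slept".split()
gen_5 = "the cat the cat the cat the cat the cat".split()

for name, tokens in [("gen_0", gen_0), ("gen_5", gen_5)]:
    print(name, type_token_ratio(tokens), shannon_entropy(tokens))
```

Both measures fall from the first toy text to the second, which is the direction of change the abstract reports across self-consuming generations.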

[AI-58] Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Link: https://arxiv.org/abs/2410.12329
Authors: Botian Jiang, Lei Li, Xiaonan Li, Zhaowei Li, Xiachong Feng, Lingpeng Kong, Qi Liu, Xipeng Qiu
Keywords: Large Language Models, Multimodal Large Language, Large Language, underlying Large Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has been accompanied by the development of various benchmarks to evaluate their capabilities. However, the true nature of these evaluations and the extent to which they assess multimodal reasoning versus merely leveraging the underlying Large Language Model (LLM) backbone remain unclear. This paper presents a comprehensive investigation into the role of LLM backbones in MLLM evaluation, focusing on two critical aspects: the degree to which current benchmarks truly assess multimodal reasoning and the influence of LLM prior knowledge on performance. Specifically, we introduce a modified evaluation protocol to disentangle the contributions of the LLM backbone from multimodal integration, and an automatic knowledge identification technique for diagnosing whether LLMs possess the necessary knowledge for corresponding multimodal questions. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone, indicating a heavy reliance on language capabilities. To address knowledge deficiencies, we propose a knowledge augmentation pipeline that achieves significant performance gains, with improvements of up to 60% on certain datasets, resulting in an approximately 4x increase in performance. Our work provides crucial insights into the role of the LLM backbone in MLLMs, and highlights the need for more nuanced benchmarking approaches.

[AI-59] Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up

Link: https://arxiv.org/abs/2410.12323
Authors: Jiahao Yuan, Dehui Du, Hao Zhang, Zixiang Di, Usman Naseem
Keywords: Large language models, shown remarkable performance, Large language, language models, shown remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs’ logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a novel framework aimed at enhancing the logical reasoning abilities of LLMs. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs’ cognitive preferences shaped by Reinforcement Learning with Human Feedback (RLHF). Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs’ reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.

[AI-60] UTF: Undertrained Tokens as Fingerprints A Novel Approach to LLM Identification

Link: https://arxiv.org/abs/2410.12318
Authors: Jiacheng Cai, Jiahao Yu, Yangguang Shao, Yuhang Wu, Xinyu Xing
Keywords: Fingerprinting large language, ensuring authenticity, large language models, preventing misuse, large language
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method has minimal overhead and minimal impact on the model’s performance, and does not require white-box access to the target model for ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and more robust to fine-tuning and random guessing.
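
Because the fingerprint is a set of predetermined input-output pairs, ownership can in principle be checked with query access alone. The sketch below illustrates that black-box check under stated assumptions: `generate` is a hypothetical callable standing in for model access, the trigger and response strings are placeholders, and exact string matching is a simplification rather than the paper's verification procedure.

```python
def fingerprint_match_rate(generate, fingerprint_pairs):
    """Fraction of trigger prompts for which the model returns the expected output."""
    hits = sum(
        generate(prompt).strip() == expected
        for prompt, expected in fingerprint_pairs
    )
    return hits / len(fingerprint_pairs)


# Placeholder pairs; a real fingerprint would be built from under-trained tokens.
pairs = [("<trigger-1>", "<response-1>"), ("<trigger-2>", "<response-2>")]

# Stand-ins for black-box access to two different models.
fingerprinted_model = dict(pairs).get
unrelated_model = lambda prompt: "I am not sure what you mean."

print(fingerprint_match_rate(fingerprinted_model, pairs))
print(fingerprint_match_rate(unrelated_model, pairs))
```

A high match rate on triggers that an unrelated model is very unlikely to answer the same way is what makes the pairs usable as evidence of ownership.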

[AI-61] FaceChain-FACT: Face Adapter with Decoupled Training for Identity-preserved Personalization

Link: https://arxiv.org/abs/2410.12312
Authors: Cheng Yu, Haoyu Xie, Lei Shang, Yang Liu, Jun Dan, Baigui Sun, Liefeng Bo
Keywords: human-centric personalized image, adapter-based method obtains, personalized image generation, portrait generation training, field of human-centric
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:In the field of human-centric personalized image generation, the adapter-based method obtains the ability to customize and generate portraits by text-to-image training on facial data. This allows for identity-preserved personalization without additional fine-tuning in inference. Although there are improvements in efficiency and fidelity, there is often a significant performance decrease in text-following ability, controllability, and diversity of generated faces compared to the base model. In this paper, we attribute the performance degradation to the failure to decouple identity features from other attributes during extraction, as well as the failure to decouple the portrait generation training from the overall generation task. To address these issues, we propose the Face Adapter with deCoupled Training (FACT) framework, focusing on both model architecture and training strategy. To decouple identity features from others, we leverage a transformer-based face-export encoder and harness fine-grained identity features. To decouple the portrait generation training, we propose Face Adapting Increment Regularization (FAIR), which effectively constrains the effect of face adapters on the facial region, preserving the generative ability of the base model. Additionally, we incorporate a face condition drop and shuffle mechanism, combined with curriculum learning, to enhance facial controllability and diversity. As a result, FACT solely learns identity preservation from training data, thereby minimizing the impact on the original text-to-image capabilities of the base model. Extensive experiments show that FACT has both controllability and fidelity in both text-to-image generation and inpainting solutions for portrait generation.

[AI-62] Open Domain Question Answering with Conflicting Contexts

Link: https://arxiv.org/abs/2410.12311
Authors: Siyi Liu, Qiang Ning, Kishaloy Halder, Wei Xiao, Zheng Qi, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, Dan Roth
Keywords: systems frequently rely, answering systems frequently, question answering systems, Open domain, Open domain question
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we request our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.

[AI-63] Two Birds with One Stone: Multi-Task Semantic Communications Systems over Relay Channel

Link: https://arxiv.org/abs/2410.12302
Authors: Yujie Cao, Tong Wu, Zhiyong Chen, Yin Xu, Meixia Tao, Wenjun Zhang
Keywords: multi-link relay semantic, relay semantic communications, relay node, source node, relay node forwards
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to IEEE WCNC

Abstract:In this paper, we propose a novel multi-task, multi-link relay semantic communications (MTML-RSC) scheme that enables the destination node to simultaneously perform image reconstruction and classification with one transmission from the source node. In the MTML-RSC scheme, the source node broadcasts a signal using semantic communications, and the relay node forwards the signal to the destination. We analyze the coupling relationship between the two tasks and the two links (source-to-relay and source-to-destination) and design a semantic-focused forward method for the relay node, where it selectively forwards only the semantics of the relevant class while ignoring others. At the destination, the node combines signals from both the source node and the relay node to perform classification, and then uses the classification result to assist in decoding the signal from the relay node for image reconstruction. Experimental results demonstrate that the proposed MTML-RSC scheme achieves significant performance gains, e.g., a 1.73 dB improvement in peak signal-to-noise ratio (PSNR) for image reconstruction and an increase in classification accuracy from 64.89% to 70.31%.
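
To put the reported PSNR gain in context: PSNR is 10·log10(MAX² / MSE), so a gain of g dB corresponds to the mean squared error being multiplied by 10^(−g/10). The sketch below uses the standard definition; the helper names and the 8-bit peak value of 255 are assumptions for illustration, not details taken from the paper.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE is multiplied when PSNR rises by `gain_db`."""
    return 10.0 ** (-gain_db / 10.0)


print(mse_ratio_for_gain(1.73))
```

Under this definition, a 1.73 dB improvement amounts to roughly a one-third reduction in reconstruction MSE.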

[AI-64] Pyramid-Driven Alignment: Pyramid Principle Guided Integration of Large Language Models and Knowledge Graphs

Link: https://arxiv.org/abs/2410.12298
Authors: Lei Sun, Xinchen Wang, Youdi Li
Keywords: Large Language Models, Large Language, Language Models, generating incorrect information, possess impressive reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) possess impressive reasoning abilities but are prone to generating incorrect information, often referred to as hallucinations. While incorporating external Knowledge Graphs (KGs) can partially mitigate this issue, existing methods primarily treat KGs as static knowledge repositories, overlooking the critical disparity between KG and LLM knowledge, and failing to fully exploit the reasoning capabilities inherent in KGs. To address these limitations, we propose Pyramid-Driven Alignment (PDA), a novel framework for seamlessly integrating LLMs with KGs. PDA utilizes Pyramid Principle analysis to construct a hierarchical pyramid structure. This structure is designed to reflect the input question and generate more validated deductive knowledge, thereby enhancing the alignment of LLMs and KGs and ensuring more cohesive integration. Furthermore, PDA employs a recursive mechanism to harness the underlying reasoning abilities of KGs, resulting in more accurate knowledge retrieval for question-answering tasks. Our experimental results reveal a substantial performance advantage of PDA over state-of-the-art baselines, with improvements reaching 26.70% and 26.78%.

[AI-65] Conjunction Subspaces Test for Conformal and Selective Classification

Link: https://arxiv.org/abs/2410.12297
Authors: Zengyou He, Zerun Li, Junjie Dong, Xinying Liu, Mudi Jiang, Lianyu Hu
Keywords: integrates significance testing, significance testing results, yield consensus p-values, integrates significance, quantifying the uncertainty
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages, 9 figures

Abstract:In this paper, we present a new classifier, which integrates significance testing results over different random subspaces to yield consensus p-values for quantifying the uncertainty of the classification decision. The null hypothesis is that the test sample has no association with the target class on a randomly chosen subspace, and hence the classification problem can be formulated as a problem of testing for the conjunction of hypotheses. The proposed classifier can be easily deployed for conformal prediction and for selective classification with reject and refine options by simply thresholding the consensus p-values. A theoretical analysis of the generalization error bound of the proposed classifier is provided, and empirical studies on real data sets are conducted to demonstrate its effectiveness.

[AI-66] Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors

Link: https://arxiv.org/abs/2410.12295
Authors: Linwei Tao, Haolan Guo, Minjing Dong, Chang Xu
Keywords: deep learning applications, Expected Calibration Error, accurate confidence estimates, learning applications, autonomous driving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Calibration is crucial in deep learning applications, especially in fields like healthcare and autonomous driving, where accurate confidence estimates are vital for decision-making. However, deep neural networks often suffer from miscalibration, with reliability diagrams and Expected Calibration Error (ECE) being the only standard perspective for evaluating calibration performance. In this paper, we introduce the concept of consistency as an alternative perspective on model calibration, inspired by uncertainty estimation literature in large language models (LLMs). We highlight its advantages over the traditional reliability-based view. Building on this concept, we propose a post-hoc calibration method called Consistency Calibration (CC), which adjusts confidence based on the model’s consistency across perturbed inputs. CC is particularly effective in local uncertainty estimation, as it requires no additional data samples or label information, instead generating input perturbations directly from the source data. Moreover, we show that performing perturbations at the logit level significantly improves computational efficiency. We validate the effectiveness of CC through extensive comparisons with various post-hoc and training-time calibration methods, demonstrating state-of-the-art performance on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as on long-tailed datasets like ImageNet-LT.
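
The idea of reading confidence from consistency under logit-level perturbation can be sketched in a few lines. This is an illustration of the concept only: the Gaussian noise, its standard deviation, the sample count, and the function names are all assumptions, and the paper's actual perturbation scheme may differ.

```python
import random


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def consistency_confidence(logits, noise_std=1.0, n_samples=200, seed=0):
    """Return the predicted class and the share of noisy copies that agree with it."""
    rng = random.Random(seed)
    predicted = argmax(logits)
    agree = 0
    for _ in range(n_samples):
        noisy = [z + rng.gauss(0.0, noise_std) for z in logits]
        agree += argmax(noisy) == predicted
    return predicted, agree / n_samples


# A clear-cut prediction versus a near tie between the top two classes.
clear_label, clear_conf = consistency_confidence([10.0, 0.0, 0.0])
close_label, close_conf = consistency_confidence([1.0, 0.9, 0.0])
print(clear_conf, close_conf)
```

Perturbing logits needs no extra forward passes, which is consistent with the efficiency argument in the abstract; a near-tie prediction flips often under noise and therefore receives a lower confidence.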

[AI-67] A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning NEURIPS2024

Link: https://arxiv.org/abs/2410.12288
Authors: Yuanning Cui, Zequn Sun, Wei Hu
Keywords: facilitate knowledge-driven tasks, Extensive knowledge graphs, Extensive knowledge, constructed to facilitate, facilitate knowledge-driven
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Extensive knowledge graphs (KGs) have been constructed to facilitate knowledge-driven tasks across various scenarios. However, existing work usually develops separate reasoning models for different KGs, lacking the ability to generalize and transfer knowledge across diverse KGs and reasoning settings. In this paper, we propose a prompt-based KG foundation model via in-context learning, namely KG-ICL, to achieve a universal reasoning ability. Specifically, we introduce a prompt graph centered with a query-related example fact as context to understand the query relation. To encode prompt graphs with the generalization ability to unseen entities and relations in queries, we first propose a unified tokenizer that maps entities and relations in prompt graphs to predefined tokens. Then, we propose two message passing neural networks to perform prompt encoding and KG reasoning, respectively. We conduct evaluation on 43 different KGs in both transductive and inductive settings. Results indicate that the proposed KG-ICL outperforms baselines on most datasets, showcasing its outstanding generalization and universal reasoning capabilities. The source code is accessible on GitHub.

[AI-68] Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Link: https://arxiv.org/abs/2410.12278
Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau
Keywords: automatically generate non-trivial, generate non-trivial task-specific, automatically generate, generate non-trivial, non-trivial task-specific synthetic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns, while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text than that of baselines, which allows us to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.

[AI-69] Kallini et al. (2024) do not compare impossible languages with constituency-based ones

Link: https://arxiv.org/abs/2410.12271
Authors: Tim Hunter
Keywords: developing human child, typically developing human, Impossible Language Models, human languages, linguistic theory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:A central goal of linguistic theory is to find a precise characterization of the notion “possible human language”, in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn “impossible” human languages. Kallini et al. (2024; “Mission: Impossible Language Models”, Proc. ACL) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs’ inductive biases align with what is regarded as “possible” for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted. In this paper I explain the confound and suggest some ways forward towards constructing a comparison that appropriately tests the underlying issue.

[AI-70] CATCH: Channel-Aware multivariate Time Series Anomaly Detection via Frequency Patching

Link: https://arxiv.org/abs/2410.12261
Authors: Xingjian Wu, Xiangfei Qiu, Zhengyu Li, Yihang Wang, Jilin Hu, Chenjuan Guo, Hui Xiong, Bin Yang
Keywords: multivariate time series, Anomaly detection, heterogeneous subsequence anomalies, anomalies may occur, detection in multivariate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Anomaly detection in multivariate time series is challenging as heterogeneous subsequence anomalies may occur. Reconstruction-based methods, which focus on learning normal patterns in the frequency domain to detect diverse abnormal subsequences, achieve promising results, while still falling short on capturing fine-grained frequency characteristics and channel correlations. To contend with the limitations, we introduce CATCH, a framework based on frequency patching. We propose to patchify the frequency domain into frequency bands, which enhances its ability to capture fine-grained frequency characteristics. To perceive appropriate channel correlations, we propose a Channel Fusion Module (CFM), which features a patch-wise mask generator and a masked-attention mechanism. Driven by a bi-level multi-objective optimization algorithm, the CFM is encouraged to iteratively discover appropriate patch-wise channel correlations, and to cluster relevant channels while isolating adverse effects from irrelevant channels. Extensive experiments on 9 real-world datasets and 12 synthetic datasets demonstrate that CATCH achieves state-of-the-art performance.
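
Patchifying the frequency domain into bands can be illustrated on a single channel. The sketch below is a minimal standard-library version under stated assumptions: it uses a naive discrete Fourier transform, non-overlapping patches, and a patch size of 3 chosen only for the example; the function names are hypothetical and nothing here reproduces CATCH itself.

```python
import cmath
import math


def dft_magnitudes(series):
    """Magnitudes of the non-negative-frequency DFT bins of a real series."""
    n = len(series)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)
        )
        magnitudes.append(abs(coeff))
    return magnitudes


def frequency_patches(spectrum, patch_size):
    """Split a spectrum into consecutive, non-overlapping frequency bands."""
    return [spectrum[i:i + patch_size] for i in range(0, len(spectrum), patch_size)]


# One channel: a pure cosine with 2 cycles over 16 samples.
series = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = dft_magnitudes(series)
patches = frequency_patches(spectrum, patch_size=3)
band_energy = [sum(m * m for m in patch) for patch in patches]
print(band_energy)
```

Each patch covers a band of adjacent frequency bins, so a model operating on patches can treat low-, mid-, and high-frequency behavior separately; here almost all of the energy lands in the lowest band.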

[AI-71] Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts

Link: https://arxiv.org/abs/2410.12258
Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho
Keywords: pre-trained model, prompt learning, prompt learning problem, large-scaled pre-trained model, parameter estimation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Fanqi Yan, Huy Nguyen, Dung Le contributed equally to this work. 70 pages, 6 figures, 1 table

Abstract:We conduct a convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated by the prompt learning problem, where one utilizes prompts, which can be formulated as experts, to fine-tune a large-scale pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of the pre-trained model and the prompt may converge to zero, in which case the prompt vanishes during training; (ii) the algebraic interaction among parameters of the pre-trained model and the prompt can occur via some partial differential equation and decelerate the prompt learning. In response, we introduce a distinguishability condition to control this parameter interaction. We also consider various types of expert structures to understand their effects on parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds.

[AI-72] Dual Action Policy for Robust Sim-to-Real Reinforcement Learning

Link: https://arxiv.org/abs/2410.12250
Authors: Ng Wen Zheng Terence, Chen Jianda
Keywords: paper presents Dual, presents Dual Action, Dual Action Policy, dynamics mismatch inherent, presents Dual
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:This paper presents Dual Action Policy (DAP), a novel approach to address the dynamics mismatch inherent in the sim-to-real gap of reinforcement learning. DAP uses a single policy to predict two sets of actions: one for maximizing task rewards in simulation and another specifically for domain adaptation via reward adjustments. This decoupling makes it easier to maximize the overall reward in the source domain during training. Additionally, DAP incorporates uncertainty-based exploration during training to enhance agent robustness. Experimental results demonstrate DAP’s effectiveness in bridging the sim-to-real gap, outperforming baselines on challenging tasks in simulation, and further improvement is achieved by incorporating uncertainty estimation.

[AI-73] Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay

Link: https://arxiv.org/abs/2410.12236
Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, Xiaoguang Niu
Keywords: Large Language Models, Nowadays transformer-based Large, transformer-based Large Language, Large Language, code generation tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Nowadays, transformer-based Large Language Models (LLMs) for code generation tasks usually apply sampling and filtering pipelines. Due to the sparse reward problem in code generation tasks caused by one-token incorrectness, transformer-based models will sample redundant programs until they find a correct one, leading to low efficiency. To overcome this challenge, we incorporate Experience Replay (ER) in the fine-tuning phase, where the programs produced are stored and replayed to give the LLM agent a chance to learn from past experiences. In the spirit of ER, we introduce a novel approach called the BTP pipeline, which consists of three phases: beam search sampling, testing, and prioritized experience replay. The approach makes use of failed programs collected by code models and replays programs with a high Possibility and Pass-rate Prioritized value (P2Value) from the replay buffer to improve efficiency. P2Value jointly considers the possibility of the transformer’s output and the pass rate, and can make use of the otherwise wasted resources that arise because most programs collected by LLMs fail to pass any tests. We empirically apply our approach to several LLMs, demonstrating that it enhances their performance in code generation tasks and surpasses existing baselines.
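
The abstract does not give the P2Value formula, so the sketch below only illustrates the general idea of ranking stored programs by a score that combines output possibility and pass rate. The convex combination, the weight `alpha`, and all class and function names are assumptions made for this example and should not be read as the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str
    possibility: float  # model-assigned probability of the sampled program, in [0, 1]
    pass_rate: float    # fraction of test cases the program passes, in [0, 1]


def p2value(candidate, alpha=0.5):
    """Hypothetical priority: convex combination of possibility and pass rate."""
    return alpha * candidate.possibility + (1.0 - alpha) * candidate.pass_rate


def top_for_replay(buffer, n, alpha=0.5):
    """Pick the n highest-priority programs from the replay buffer."""
    return sorted(buffer, key=lambda c: p2value(c, alpha), reverse=True)[:n]


buffer = [
    Candidate("prog_a", possibility=0.9, pass_rate=0.0),   # likely but fails all tests
    Candidate("prog_b", possibility=0.4, pass_rate=0.75),  # less likely, mostly passes
    Candidate("prog_c", possibility=0.1, pass_rate=0.1),
]
replayed = [c.code for c in top_for_replay(buffer, n=2)]
print(replayed)
```

Under this assumed scoring, a program that fails every test can still be replayed if the model assigned it high probability, which matches the abstract's point that failed programs remain useful training signal.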

[AI-74] Improving the Generalization of Unseen Crowd Behaviors for Reinforcement Learning based Local Motion Planners

Link: https://arxiv.org/abs/2410.12232
Authors: Wen Zheng Terence Ng, Jianda Chen, Sinno Jialin Pan, Tianwei Zhang
Keywords: safe mobile robot, mobile robot policy, Deploying a safe, Current Reinforcement Learning-based, Reinforcement Learning-based motion
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deploying a safe mobile robot policy in scenarios with human pedestrians is challenging due to their unpredictable movements. Current Reinforcement Learning-based motion planners rely on a single policy to simulate pedestrian movements and could suffer from the over-fitting issue. Alternatively, framing the collision avoidance problem as a multi-agent framework, where agents generate dynamic movements while learning to reach their goals, can lead to conflicts with human pedestrians due to their homogeneity. To tackle this problem, we introduce an efficient method that enhances agent diversity within a single policy by maximizing an information-theoretic objective. This diversity enriches each agent’s experiences, improving its adaptability to unseen crowd behaviors. In assessing an agent’s robustness against unseen crowds, we propose diverse scenarios inspired by pedestrian crowd behaviors. Our behavior-conditioned policies outperform existing works in these challenging scenes, reducing potential collisions without additional time or travel.

Related DOI: https://doi.org/10.1109/ICRA57147.2024.10610641

[AI-75] Comprehending Knowledge Graphs with Large Language Models for Recommender Systems

Link: https://arxiv.org/abs/2410.12229
Authors: Ziqiang Cui, Yunpeng Weng, Xing Tang, Fuyuan Lyu, Dugang Liu, Xiuqiang He, Chen Ma
Keywords: significantly advanced recommender, advanced recommender systems, significantly advanced, advanced recommender, recommender systems
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recently, the introduction of knowledge graphs (KGs) has significantly advanced recommender systems by facilitating the discovery of potential associations between items. However, existing methods still face several limitations. First, most KGs suffer from missing facts or limited scopes. This can lead to biased knowledge representations, thereby constraining the model’s performance. Second, existing methods typically convert textual information into IDs, resulting in the loss of natural semantic connections between different items. Third, existing methods struggle to capture high-order relationships in global KGs due to their inefficient layer-by-layer information propagation mechanisms, which are prone to introducing significant noise. To address these limitations, we propose a novel method called CoLaKG, which leverages large language models (LLMs) for knowledge-aware recommendation. The extensive world knowledge and remarkable reasoning capabilities of LLMs enable them to supplement KGs. Additionally, the strong text comprehension abilities of LLMs allow for a better understanding of semantic information. Based on this, we first extract subgraphs centered on each item from the KG and convert them into textual inputs for the LLM. The LLM then outputs its comprehension of these item-centered subgraphs, which are subsequently transformed into semantic embeddings. Furthermore, to utilize the global information of the KG, we construct an item-item graph using these semantic embeddings, which can directly capture higher-order associations between items. Both the semantic embeddings and the structural information from the item-item graph are effectively integrated into the recommendation model through our designed representation alignment and neighbor augmentation modules. Extensive experiments on four real-world datasets demonstrate the superiority of our method.
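
The item-item graph built from semantic embeddings can be pictured with a small nearest-neighbor example. The sketch below assumes cosine similarity and a top-k neighbor rule, neither of which is stated in the abstract; the two-dimensional embeddings and item names are toy values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def item_item_graph(embeddings, k):
    """Map each item to its k most cosine-similar other items."""
    graph = {}
    for item, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != item
        ]
        scored.sort(reverse=True)
        graph[item] = [other for _, other in scored[:k]]
    return graph


# Toy stand-ins for LLM-derived semantic embeddings of three items.
embeddings = {
    "film_a": [1.0, 0.0],
    "film_b": [0.9, 0.1],
    "book_c": [0.0, 1.0],
}
graph = item_item_graph(embeddings, k=1)
print(graph)
```

Because edges connect items that are close in embedding space regardless of how many hops separate them in the original KG, a graph like this can link related items directly instead of relying on layer-by-layer propagation.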

[AI-76] Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Link: https://arxiv.org/abs/2410.12228
Authors: Luyi Ma,Xiaohan Li,Zezhong Fan,Jianpeng Xu,Jason Cho,Praveen Kanumala,Kaushiki Nag,Sushant Kumar,Kannan Achan
Keywords: Integrating diverse data, Integrating diverse, personalized recommendation systems, diverse data modalities, crucial for enhancing
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Abstract:Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations that fuses three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides detailed insights into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, Triple Modality Fusion (TMF), utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user’s interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and the benefits of TMF.

[AI-77] On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation

Link: https://arxiv.org/abs/2410.12222
Authors: Xiaonan Jing,Srinivas Billa,Danny Godbout
Keywords: NLG, popular topic, Abstract, generation, natural language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: 14 pages, 13 figures

Abstract:Hallucination has been a popular topic in natural language generation (NLG). In real-world applications, unfaithful content can result in bad data quality or loss of trust from end users. Thus, it is crucial to fact-check before adopting NLG for production usage, which can be expensive if done manually. In this paper, we investigate automated faithfulness evaluation in guided NLG. We developed a rubric template and used large language models (LLMs) to score the generation on quantifiable scales. We compared popular LLMs as well as the widely adopted natural language inference (NLI) models in scoring quality and sensitivity. In addition, we developed methods to generate synthetic unfaithful data, as well as a heuristic to quantify the percentage of hallucination. Our results on 4 travel-domain industry datasets show that GPT-4 can provide accurate judgement and explanation on whether a source and a generation are factually consistent. Furthermore, we found that tuning NLI models on synthetic data can improve performance. Lastly, we present insights on latency and cost for deploying such a system.

[AI-78] EdgeRL: Reinforcement Learning-driven Deep Learning Model Inference Optimization at Edge

Link: https://arxiv.org/abs/2410.12221
Authors: Motahare Mounesan,Xiaojie Zhang,Saptarshi Debroy
Keywords: Balancing mutually diverging, ad-hoc edge environments, mutually diverging performance, Balancing mutually, diverging performance metrics
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Balancing mutually diverging performance metrics, such as processing latency, outcome accuracy, and end device energy consumption, is a challenging undertaking for deep learning model inference in ad-hoc edge environments. In this paper, we propose the EdgeRL framework, which seeks to strike such a balance by using an Advantage Actor-Critic (A2C) Reinforcement Learning (RL) approach that chooses optimal run-time DNN inference parameters and aligns the performance metrics with the application requirements. Using a real-world deep learning model and a hardware testbed, we evaluate the benefits of the EdgeRL framework in terms of end device energy savings, inference accuracy improvement, and end-to-end inference latency reduction.

[AI-79] OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Link: https://arxiv.org/abs/2410.12219
Authors: Lichang Chen,Hexiang Hu,Mingda Zhang,Yiwen Chen,Zifeng Wang,Yandong Li,Pranav Shyam,Tianyi Zhou,Heng Huang,Ming-Hsuan Yang,Boqing Gong
Keywords: SoTA Omni-modality Language, Omni-modality Language Models, Omni-modality Language, benchmark SoTA Omni-modality, SoTA Omni-modality
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
*Comments: 19 pages, 6 figures, 12 tables

Abstract:We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models (OLMs), such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. In particular, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single-modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1) a synthetic subset: a synthetic dataset generated automatically by translating text into multiple modalities: audio, images, video, and hybrids (Omnify); (2) a realistic subset: a real-world dataset, manually curated and annotated by experts, for evaluating cross-modal reasoning in natural settings. OmnixR presents a unique evaluation towards assessing OLMs over a diverse mix of modalities, such as a question that involves video, audio, and text, providing a rigorous cross-modal reasoning testbed unlike any existing benchmarks. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior, underscoring the challenges of omni-modal AI alignment.

[AI-80] Order-Aware Interactive Segmentation

Link: https://arxiv.org/abs/2410.12214
Authors: Bin Wang,Anwesa Choudhuri,Meng Zheng,Zhongpai Gao,Benjamin Planche,Andong Deng,Qin Liu,Terrence Chen,Ulas Bagci,Ziyan Wu
Keywords: accurately segment target, segment target objects, Interactive segmentation aims, minimal user interactions, accurately separate target
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Interactive demo can be found in project page: this https URL

Abstract:Interactive segmentation aims to accurately segment target objects with minimal user interactions. However, current methods often fail to accurately separate target objects from the background, due to a limited understanding of order, the relative depth between objects in a scene. To address this issue, we propose OIS: order-aware interactive segmentation, where we explicitly encode the relative depth between objects into order maps. We introduce a novel order-aware attention, where the order maps seamlessly guide the user interactions (in the form of clicks) to attend to the image features. We further present an object-aware attention module to incorporate a strong object-level understanding to better differentiate objects with similar order. Our approach allows both dense and sparse integration of user clicks, enhancing both accuracy and efficiency as compared to prior works. Experimental results demonstrate that OIS achieves state-of-the-art performance, improving mIoU after one click by 7.61 on the HQSeg44K dataset and 1.32 on the DAVIS dataset as compared to the previous state-of-the-art SegNext, while also doubling inference speed compared to current leading methods. The project page is this https URL

[AI-81] Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions

Link: https://arxiv.org/abs/2410.12207
Authors: Xianren Zhang,Xianfeng Tang,Hui Liu,Zongyu Wu,Qi He,Dongwon Lee,Suhang Wang
Keywords: Recent studies show, Recent studies, struggle to follow, complex instructions, follow complex instructions
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Under review

Abstract:Recent studies show that LLMs, particularly open-source models, struggle to follow complex instructions with multiple constraints. Despite the importance, methods to improve LLMs’ adherence to such constraints remain unexplored, and current research focuses on evaluating this ability rather than developing solutions. While a few studies enhance constraint adherence through model tuning, this approach is computationally expensive and heavily reliant on training data quality. An alternative is to leverage LLMs’ self-correction capabilities, allowing them to adjust responses to better meet specified constraints. However, this self-correction ability of LLMs is limited by the feedback quality, as LLMs cannot autonomously generate reliable feedback or detect errors. Moreover, the self-refinement process heavily depends on few-shot examples that illustrate how to modify responses to meet constraints. As constraints in complex instructions are diverse and vary widely, manually crafting few-shot examples for each constraint type can be labor-intensive and sub-optimal. To deal with these two challenges, we propose the Divide-Verify-Refine (DVR) framework with three steps: (1) Divide complex instructions into single constraints and prepare appropriate tools; (2) Verify: To address the feedback quality problem, these tools will rigorously verify responses and provide reliable feedback; (3) Refine: To address the constraint diversity challenge, we design a refinement repository that collects successful refinement processes and uses them as few-shot demonstrations for future cases, allowing LLMs to learn from the past experience during inference. Additionally, we develop a new dataset of complex instructions, each containing 1-6 constraints. Experiments show that the framework significantly improves performance, doubling LLama3.1-8B’s constraint adherence on instructions with 6 constraints.
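
The verify-and-refine steps above can be sketched as a loop in which programmatic checkers produce the feedback instead of the model itself. This is an illustrative sketch only: the checker names, the `refine_loop` helper, and the `stub_model` stand-in for an LLM call are hypothetical, and the refinement repository of few-shot demonstrations described in the abstract is left out.

```python
def max_words(limit):
    """Build a checker for a word-count constraint (hypothetical tool)."""
    def check(text):
        n = len(text.split())
        ok = n <= limit
        return ok, ("" if ok else f"Response has {n} words; the limit is {limit}.")
    return check

def must_include(keyword):
    """Build a checker for a keyword-inclusion constraint (hypothetical tool)."""
    def check(text):
        ok = keyword.lower() in text.lower()
        return ok, ("" if ok else f"Response must mention '{keyword}'.")
    return check

def verify(response, checkers):
    """Run every checker and collect feedback for the failed constraints."""
    return [msg for ok, msg in (c(response) for c in checkers) if not ok]

def refine_loop(generate, checkers, max_rounds=3):
    """Call generate(feedback) until all checkers pass or rounds run out."""
    feedback, response = [], ""
    for _ in range(max_rounds):
        response = generate(feedback)
        feedback = verify(response, checkers)
        if not feedback:
            break
    return response, feedback

def stub_model(feedback):
    # Stand-in for an LLM call: it only fixes its answer after receiving feedback.
    if feedback:
        return "Paris is the capital of France."
    return "It is a large and very old city in Europe."

checkers = [max_words(8), must_include("Paris")]
answer, remaining = refine_loop(stub_model, checkers)
```

Keeping each constraint in its own checker mirrors the "divide" step: feedback stays specific to the constraint that failed, which is what the abstract argues self-generated feedback cannot reliably provide.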

[AI-82] Abnormality Forecasting: Time Series Anomaly Prediction via Future Context Modeling KDD

Link: https://arxiv.org/abs/2410.12206
Authors: Sinong Zhao,Wenrui Wang,Hongzuo Xu,Zhaoyang Yu,Qingsong Wen,Gang Wang,Xiaoguang Liu,Guansong Pang
Keywords: Identifying anomalies, intelligent operation, operation and maintenance, space exploration, series data plays
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 11 pages, 5 figures, submitted to KDD conference

Abstract:Identifying anomalies from time series data plays an important role in various fields such as infrastructure security, intelligent operation and maintenance, and space exploration. Current research focuses on detecting the anomalies after they occur, which can lead to significant financial/reputation loss or infrastructure damage. In this work we instead study a more practical yet very challenging problem, time series anomaly prediction, aiming at providing early warnings for abnormal events before their occurrence. To tackle this problem, we introduce a novel principled approach, namely future context modeling (FCM). Its key insight is that the future abnormal events in a target window can be accurately predicted if their preceding observation window exhibits any subtle difference from normal data. To effectively capture such differences, FCM first leverages long-term forecasting models to generate a discriminative future context based on the observation data, aiming to amplify those subtle but unusual differences. It then models a normality correlation of the observation data with the forecasted future context to complement the normality modeling of the observation data in foreseeing possible abnormality in the target window. A joint variate-time attention learning is also introduced in FCM to leverage both temporal signals and features of the time series data for more discriminative normality modeling in the aforementioned two views. Comprehensive experiments on five datasets demonstrate that FCM achieves a good recall rate (70%+) on multiple datasets and significantly outperforms all baselines in F1 score. Code is available at this https URL.

[AI-83] Sparse Prototype Network for Explainable Pedestrian Behavior Prediction

Link: https://arxiv.org/abs/2410.12195
Authors: Yan Feng,Alexander Carballo,Kazuya Takeda
Keywords: Predicting pedestrian behavior, Predicting pedestrian, smart city, behavior is challenging, challenging yet crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Predicting pedestrian behavior is challenging yet crucial for applications such as autonomous driving and smart cities. Recent deep learning models have achieved remarkable performance in making accurate predictions, but they fail to provide explanations of their inner workings. One reason for this problem is the multi-modal inputs. To bridge this gap, we present Sparse Prototype Network (SPN), an explainable method designed to simultaneously predict a pedestrian’s future action, trajectory, and pose. SPN leverages an intermediate prototype bottleneck layer to provide sample-based explanations for its predictions. The prototypes are modality-independent, meaning that they can correspond to any modality from the input. Therefore, SPN can extend to arbitrary combinations of modalities. Regularized by mono-semanticity and clustering constraints, the prototypes learn consistent and human-understandable features and achieve state-of-the-art performance on action, trajectory and pose prediction on TITAN and PIE. Finally, we propose a metric named Top-K Mono-semanticity Scale to quantitatively evaluate the explainability. Qualitative results show the positive correlation between sparsity and explainability. Code is available at this https URL.

[AI-84] Trajectory Manifold Optimization for Fast and Adaptive Kinodynamic Motion Planning

Link: https://arxiv.org/abs/2410.12193
Authors: Yonghyeon Lee
Keywords: dynamically changing environments, Fast kinodynamic motion, Fast kinodynamic, changing environments, crucial for systems
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: 12 pages, 11 figures

Abstract:Fast kinodynamic motion planning is crucial for systems to effectively adapt to dynamically changing environments. Despite some efforts, existing approaches still struggle with rapid planning in high-dimensional, complex problems. Not surprisingly, the primary challenge arises from the high-dimensionality of the search space, specifically the trajectory space. We address this issue with a two-step method: initially, we identify a lower-dimensional trajectory manifold offline, comprising diverse trajectories specifically relevant to the task at hand while meeting kinodynamic constraints. Subsequently, we search for solutions within this manifold online, significantly enhancing the planning speed. To encode and generate a manifold of continuous-time, differentiable trajectories, we propose a novel neural network model, Differentiable Motion Manifold Primitives (DMMP), along with a practical training strategy. Experiments with a 7-DoF robot arm tasked with dynamic throwing to arbitrary target positions demonstrate that our method surpasses existing approaches in planning speed, task success, and constraint satisfaction.

[AI-85] DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

Link: https://arxiv.org/abs/2410.12189
Authors: Shreya Shankar,Aditya G. Parameswaran,Eugene Wu
Keywords: Analyzing unstructured data, Analyzing unstructured, Large Language Models, unstructured data, Language Models
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
*Comments: 21 pages, 7 figures, 3 tables

Abstract:Analyzing unstructured data, such as complex documents, has been a persistent challenge in data processing. Large Language Models (LLMs) have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered unstructured data processing. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is. This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based framework to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives) and an optimization and evaluation framework that we introduce. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the time constraints of LLM-based plan generation and evaluation. Our evaluation on three different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 1.34 to 4.6× higher quality (e.g., more accurate, comprehensive) than well-engineered baselines, addressing a critical gap in existing declarative frameworks for unstructured data analysis. DocETL is open-source at this http URL, and as of October 2024, has amassed over 800 GitHub Stars, with users spanning a variety of domains.

[AI-86] DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

Link: https://arxiv.org/abs/2410.12187
Authors: Yingsong Luo,Ling Chen
Keywords: Large language models, face deployment challenges, deployment challenges due, Large language, hardware constraints
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 9 pages, 4 figures

Abstract:Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters (i.e., scale and zero-point) based on the impact of weights on the model output. Experiments on LLaMA and LLaMA-2 show that DAQ consistently outperforms the best baseline method, reducing perplexity loss by an average of 22.8% on LLaMA and 19.6% on LLaMA-2. Our code is available at this https URL.
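
For readers unfamiliar with the quantization parameters mentioned above, the sketch below shows plain uniform affine quantization, where a scale and a zero-point map floating-point weights to integer codes. It is a generic min/max baseline for illustration, not the DAQ method: the density-centric alignment and the learned dynamic range adjustment are not reproduced, and the function names are hypothetical.

```python
def minmax_params(weights, bits=4):
    """Derive scale and zero-point from the min/max range (generic baseline)."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant weights
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(weights, scale, zero_point, bits=4):
    """Map floats to integer codes clamped to [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    return [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]

def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
scale, zero_point = minmax_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
```

With min/max parameters the dynamic range is set by the outermost weights. The abstract describes DAQ as instead centering the range on the high-density region and then optimizing the scale and zero-point against the model output.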

[AI-87] Reinforcement Learning with LTL and omega-Regular Objectives via Optimality-Preserving Translation to Average Rewards

Link: https://arxiv.org/abs/2410.12175
Authors: Xuan-Bach Le,Dominik Wagner,Leon Witzman,Alexander Rabinovich,Luke Ong
Keywords: Linear temporal logic, traditional discount sum, Linear temporal, temporal logic, reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Linear temporal logic (LTL) and, more generally, ω-regular objectives are alternatives to the traditional discount sum and average reward objectives in reinforcement learning (RL), offering the advantage of greater comprehensibility and hence explainability. In this work, we study the relationship between these objectives. Our main result is that each RL problem for ω-regular objectives can be reduced to a limit-average reward problem in an optimality-preserving fashion, via (finite-memory) reward machines. Furthermore, we demonstrate the efficacy of this approach by showing that optimal policies for limit-average problems can be found asymptotically by solving a sequence of discount-sum problems approximately. Consequently, we resolve an open problem: optimal policies for LTL and ω-regular objectives can be learned asymptotically.
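
The link between discount-sum and limit-average objectives can be illustrated numerically for a single fixed reward stream. The sketch below is not the paper's construction (it involves no reward machines and no policy optimization); it only shows, for a hypothetical periodic reward sequence, that the discounted sum scaled by (1 - γ) approaches the long-run average as γ tends to 1. The function names are illustrative.

```python
def discounted_value(cycle, gamma):
    """Discounted sum of an infinitely repeated reward cycle (closed form)."""
    one_pass = sum(r * gamma ** i for i, r in enumerate(cycle))
    # Geometric series over repetitions of the cycle.
    return one_pass / (1.0 - gamma ** len(cycle))

def normalized_value(cycle, gamma):
    """Discounted sum scaled by (1 - gamma), comparable to an average reward."""
    return (1.0 - gamma) * discounted_value(cycle, gamma)

cycle = [0.0, 0.0, 1.0]           # reward 1 on every third step
average = sum(cycle) / len(cycle)  # long-run average reward, 1/3
gaps = [abs(normalized_value(cycle, g) - average) for g in (0.9, 0.99, 0.999)]
```

For this cycle the scaled value equals γ² / (1 + γ + γ²), so the entries of `gaps` shrink as γ grows, which is the intuition behind solving a sequence of discount-sum problems with discount factors approaching 1.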

[AI-88] The State of Robot Motion Generation

Link: https://arxiv.org/abs/2410.12172
Authors: Kostas E. Bekris,Joe Doerr,Patrick Meng,Sumanth Tangirala
Keywords: generating robot motion, robot motion proposed, robotics research culminating, years of robotics, recent developments
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: To be presented at the International Symposium of Robotics Research (ISRR), 2024

Abstract:This paper reviews the large spectrum of methods for generating robot motion proposed over the 50 years of robotics research culminating in recent developments. It crosses the boundaries of methodologies, typically not surveyed together, from those that operate over explicit models to those that learn implicit ones. The paper discusses the current state-of-the-art as well as properties of varying methodologies, highlighting opportunities for integration.

[AI-89] Reclaiming the Source of Programmatic Policies: Programmatic versus Latent Spaces ICLR2024

Link: https://arxiv.org/abs/2410.12166
Authors: Tales H. Carvalho,Kenneth Tjhia,Levi H. S. Lelis
Keywords: Markov decision processes, partially observable Markov, observable Markov decision, define programmatic policies, observable Markov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Published as a conference paper at ICLR 2024

Abstract:Recent works have introduced LEAPS and HPRL, systems that learn latent spaces of domain-specific languages, which are used to define programmatic policies for partially observable Markov decision processes (POMDPs). These systems induce a latent space while optimizing losses such as the behavior loss, which aim to achieve locality in program behavior, meaning that vectors close in the latent space should correspond to similarly behaving programs. In this paper, we show that the programmatic space, induced by the domain-specific language and requiring no training, presents values for the behavior loss similar to those observed in latent spaces presented in previous work. Moreover, algorithms searching in the programmatic space significantly outperform those in LEAPS and HPRL. To explain our results, we measured the “friendliness” of the two spaces to local search algorithms. We discovered that algorithms are more likely to stop at local maxima when searching in the latent space than when searching in the programmatic space. This implies that the optimization topology of the programmatic space, induced by the reward function in conjunction with the neighborhood function, is more conducive to search than that of the latent space. This result provides an explanation for the superior performance in the programmatic space.

[AI-90] Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution

Link: https://arxiv.org/abs/2410.12165
Authors: Timothy Wei,Hsien Xin Peng,Elaine Xu,Bryan Zhao,Lei Ding,Diji Yang
Keywords: Artificial Intelligence models, Artificial Intelligence, increasingly challenging due, grow in size, Large Video-Language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:As Artificial Intelligence models, such as Large Video-Language models (VLMs), grow in size, their deployment in real-world applications becomes increasingly challenging due to hardware limitations and computational costs. To address this, we design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary. Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model’s output is uncertain and selectively offload inference to the large model in the cloud. Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone. Our framework provides a scalable and adaptable solution for action classification in resource-constrained environments, with potential applications beyond healthcare. Notably, while DMD-generated data is used for optimizing performance and resource usage in our pipeline, we expect the concept of DMD to further support future research on knowledge alignment across multiple models.
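
The edge-cloud routing idea can be sketched with a simple confidence threshold. Note the substitution: the paper trains a switcher model on DMD-generated data, whereas this sketch uses the edge model's softmax confidence as a stand-in decision rule. The `classify` helper, the threshold value, and the stub models are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(clip, edge_model, cloud_model, threshold=0.8):
    """Answer on the edge when confident, otherwise offload to the cloud."""
    probs = softmax(edge_model(clip))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "edge"
    cloud_probs = softmax(cloud_model(clip))
    return cloud_probs.index(max(cloud_probs)), "cloud"

# Stub models returning logits; real models would consume video clips.
edge = lambda clip: [4.0, 0.0, 0.0] if clip == "easy" else [0.1, 0.0, 0.0]
cloud = lambda clip: [0.0, 3.0, 0.0]

easy_result = classify("easy", edge, cloud)
hard_result = classify("hard", edge, cloud)
```

A learned switcher replaces the fixed threshold with a prediction of when the edge output is unreliable, but the control flow (cheap local pass first, selective offload second) stays the same.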

[AI-91] NSSI-Net: Multi-Concept Generative Adversarial Network for Non-Suicidal Self-Injury Detection Using High-Dimensional EEG Signals in a Semi-Supervised Learning Framework

Link: https://arxiv.org/abs/2410.12159
Authors: Zhen Liang,Weishan Ye,Qile Liu,Li Zhang,Gan Huang,Yongjie Zhou
Keywords: widespread public concern, attracting widespread public, Non-suicidal self-injury, significantly increasing, public concern
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Non-suicidal self-injury (NSSI) is a serious threat to the physical and mental health of adolescents, significantly increasing the risk of suicide and attracting widespread public concern. Electroencephalography (EEG), as an objective tool for identifying brain disorders, holds great promise. However, extracting meaningful and reliable features from high-dimensional EEG data, especially by integrating spatiotemporal brain dynamics into informative representations, remains a major challenge. In this study, we introduce an advanced semi-supervised adversarial network, NSSI-Net, to effectively model EEG features related to NSSI. NSSI-Net consists of two key modules: a spatial-temporal feature extraction module and a multi-concept discriminator. In the spatial-temporal feature extraction module, an integrated 2D convolutional neural network (2D-CNN) and a bi-directional Gated Recurrent Unit (BiGRU) are used to capture both spatial and temporal dynamics in EEG data. In the multi-concept discriminator, signal, gender, domain, and disease levels are fully explored to extract meaningful EEG features, considering individual, demographic, disease variations across a diverse population. Based on self-collected NSSI data (n=114), the model’s effectiveness and reliability are demonstrated, with a 7.44% improvement in performance compared to existing machine learning and deep learning methods. This study advances the understanding and early diagnosis of NSSI in adolescents with depression, enabling timely intervention. The source code is available at this https URL.

[AI-92] FragNet: A Graph Neural Network for Molecular Property Prediction with Four Layers of Interpretability

Link: https://arxiv.org/abs/2410.12156
Authors: Gihan Panapitiya,Peiyuan Gao,C Mark Maupin,Emily G Saldanha
Keywords: storage material design, applications including drug, including drug discovery, energy storage material, modern-day scientific applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
*Comments:

Abstract:Molecular property prediction is a crucial step in many modern-day scientific applications including drug discovery and energy storage material design. Despite the availability of numerous machine learning models for this task, we are lacking in models that provide both high accuracies and interpretability of the predictions. We introduce the FragNet architecture, a graph neural network not only capable of achieving prediction accuracies comparable to the current state-of-the-art models, but also able to provide insight on four levels of molecular substructures. This model enables understanding of which atoms, bonds, molecular fragments, and molecular fragment connections are critical in the prediction of a given molecular property. The ability to interpret the importance of connections between fragments is of particular interest for molecules which have substructures that are not connected with regular covalent bonds. The interpretable capabilities of FragNet are key to gaining scientific insights from the model’s learned patterns between molecular structure and molecular properties.

[AI-93] Exploiting LLMs Reasoning Capability to Infer Implicit Concepts in Legal Information Retrieval KR

Link: https://arxiv.org/abs/2410.12154
Authors: Hai-Long Nguyen,Tan-Minh Nguyen,Duc-Minh Nguyen,Thi-Hai-Yen Vuong,Ha-Thanh Nguyen,Xuan-Hieu Phan
Keywords: Statutory law retrieval, Statutory law, law engineering, legal language processing, practical applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )

Abstract:Statutory law retrieval is a typical problem in legal language processing that has various practical applications in law engineering. Modern deep learning-based retrieval methods have achieved significant results for this problem. However, retrieval systems relying on semantic and lexical correlations often exhibit limitations, particularly when handling queries that involve real-life scenarios or use vocabulary that is not specific to the legal domain. In this work, we focus on overcoming these weaknesses by utilizing the logical reasoning capabilities of large language models (LLMs) to identify relevant legal terms and facts related to the situation mentioned in the query. The proposed retrieval system integrates additional information from term-based expansion and query reformulation to improve retrieval accuracy. The experiments on the COLIEE 2022 and COLIEE 2023 datasets show that extra knowledge from LLMs helps to improve the retrieval results of both lexical and semantic ranking models. The final ensemble retrieval system outperformed the highest results among all participating teams in the COLIEE 2022 and 2023 competitions.

[AI-94] Layer-of-Thoughts Prompting (LoT): Leveraging LLM-Based Retrieval with Constraint Hierarchies KR

Link: https://arxiv.org/abs/2410.12153
Authors: Wachara Fungwacharakorn,Nguyen Ha Thanh,May Myo Zin,Ken Satoh
Keywords: refine candidate responses, utilizes constraint hierarchies, approach termed, hierarchies to filter, filter and refine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*Comments: Presented at NeLaMKRR@KR, 2024 ( arXiv:2410.05339 )

Abstract:This paper presents a novel approach termed Layer-of-Thoughts Prompting (LoT), which utilizes constraint hierarchies to filter and refine candidate responses to a given query. By integrating these constraints, our method enables a structured retrieval process that enhances explainability and automation. Existing methods have explored various prompting techniques but often present overly generalized frameworks without delving into the nuances of prompts in multi-turn interactions. Our work addresses this gap by focusing on the hierarchical relationships among prompts. We demonstrate that the efficacy of thought hierarchy plays a critical role in developing efficient and interpretable retrieval algorithms. Leveraging Large Language Models (LLMs), LoT significantly improves the accuracy and comprehensibility of information retrieval tasks.

[AI-95] Facing Identity: The Formation and Performance of Identity via Face-Based Artificial Intelligence Technologies

Link: https://arxiv.org/abs/2410.12148
Authors: Wells Lucas Santo
Keywords: artificial intelligence technologies, face-based artificial intelligence, constructed and performed, intelligence technologies, artificial intelligence
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:How is identity constructed and performed in the digital via face-based artificial intelligence technologies? While questions of identity on the textual Internet have been thoroughly explored, the Internet has progressed to a multimedia form that not only centers the visual, but specifically the face. At the same time, a wealth of scholarship has centered and continues to center the topics of surveillance and control through facial recognition technologies (FRTs), which have extended the logics of the racist pseudoscience of physiognomy. Much less work has been devoted to understanding how such face-based artificial intelligence technologies have influenced the formation and performance of identity. This literature review considers how such technologies interact with faciality, which entails the construction of what a face may represent or signify, along axes of identity such as race, gender, and sexuality. In grappling with recent advances in AI such as image generation and deepfakes, I propose that we are now in an era of “post-facial” technologies that build off our existing culture of faciality while eschewing the analog face, complicating our relationship with identity vis-a-vis the face. Drawing from previous frameworks of identity play in the digital, as well as trans practices that have historically played with or transgressed the boundaries of identity classification, we can develop concepts adequate for analyzing digital faciality and identity given the current landscape of post-facial artificial intelligence technologies that allow users to interface with the digital in an entirely novel manner. To ground this framework of transgression, I conclude by proposing an interview study with VTubers – online streamers who perform using motion-captured avatars instead of their real-life faces – to gain qualitative insight on how these sociotechnical experiences.

[AI-96] Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Link: https://arxiv.org/abs/2410.12136
Authors: Yiannis Kantaros, Jun Wang
Keywords: Linear Temporal Logic, Temporal Logic, Linear Temporal, high-level control objectives, Markov Decision Process
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2205.04424

Abstract:This paper addresses the problem of learning optimal control policies for systems with uncertain dynamics and high-level control objectives specified as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace structure and the outcomes of control decisions, giving rise to an unknown Markov Decision Process (MDP). Existing reinforcement learning (RL) algorithms for LTL tasks typically rely on exploring a product MDP state-space uniformly (e.g., using an $\epsilon$-greedy policy), compromising sample-efficiency. This issue becomes more pronounced as the rewards get sparser and the MDP size or the task complexity increases. In this paper, we propose an accelerated RL algorithm that can learn control policies significantly faster than competitive approaches. Its sample-efficiency relies on a novel task-driven exploration strategy that biases exploration towards directions that may contribute to task satisfaction. We provide theoretical analysis and extensive comparative experiments demonstrating the sample-efficiency of the proposed method. The benefit of our method becomes more evident as the task complexity or the MDP size increases.
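For readers unfamiliar with the uniform exploration baseline the abstract contrasts against, here is a minimal sketch of ε-greedy action selection. It illustrates only that baseline, not the paper's task-driven strategy; the function name and the list-of-Q-values interface are assumptions made for the example.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(
    q_values: Sequence[float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return an action index for one state.

    With probability epsilon, pick an action uniformly at random (explore);
    otherwise pick the action with the highest estimated value (exploit).
    """
    rng = rng or random.Random()
    if rng.random() < epsilon:
        # Uniform exploration: every action is equally likely, regardless
        # of whether it can contribute to satisfying the task.
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because the random branch ignores the task entirely, most exploratory steps are wasted when rewards are sparse, which is the inefficiency the abstract points to.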

[AI-97] Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Link: https://arxiv.org/abs/2410.12130
Authors: Huiwen Wu, Xiaohan Li, Xiaogang Xu, Jiafei Wu, Deyi Zhang, Zhe Liu
Keywords: Large Language Models, scientific research fields, scientific literature summarization, Large Language, knowledge graph construction
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive 'positive' and 'negative' models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) fine-tuned on a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.

[AI-98] Parametric Graph Representations in the Era of Foundation Models: A Survey and Position

Link: https://arxiv.org/abs/2410.12126
Authors: Dongqi Fu, Liri Fang, Zihao Li, Hanghang Tong, Vetle I. Torvik, Jingrui He
Keywords: comprehensive relational data, model comprehensive relational, graph laws, graph, past decades
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Preprint, 15 pages

Abstract:Graphs have been widely used in the past decades of big data and AI to model comprehensive relational data. When analyzing a graph’s statistical properties, graph laws serve as essential tools for parameterizing its structure. Identifying meaningful graph laws can significantly enhance the effectiveness of various applications, such as graph generation and link prediction. Facing the large-scale foundation model developments nowadays, the study of graph laws reveals new research potential, e.g., providing multi-modal information for graph neural representation learning and breaking the domain inconsistency of different graph data. In this survey, we first review the previous study of graph laws from multiple perspectives, i.e., macroscope and microscope of graphs, low-order and high-order graphs, static and dynamic graphs, different observation spaces, and newly proposed graph parameters. After we review various real-world applications benefiting from the guidance of graph laws, we conclude the paper with current challenges and future research directions.

[AI-99] Affordance-Centric Policy Learning: Sample Efficient and Generalisable Robot Policy Learning using Affordance-Centric Task Frames

Link: https://arxiv.org/abs/2410.12124
Authors: Krishan Rana, Jad Abou-Chakra, Sourav Garg, Robert Lee, Ian Reid, Niko Suenderhauf
Keywords: central to robotic, simplified to interactions, interactions with task-specific, task-specific regions, robotic manipulation
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Video can be found on our project website

Abstract:Affordances are central to robotic manipulation, where most tasks can be simplified to interactions with task-specific regions on objects. By focusing on these key regions, we can abstract away task-irrelevant information, simplifying the learning process, and enhancing generalisation. In this paper, we propose an affordance-centric policy-learning approach that centres and appropriately orients a task frame on these affordance regions, allowing us to achieve both intra-category invariance – where policies can generalise across different instances within the same object category – and spatial invariance – which enables consistent performance regardless of object placement in the environment. We propose a method to leverage existing generalist large vision models to extract and track these affordance frames, and demonstrate that our approach can learn manipulation tasks using behaviour cloning from as little as 10 demonstrations, with equivalent generalisation to an image-based policy trained on 305 demonstrations. We provide video demonstrations on our project site.

[AI-100] Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming

Link: https://arxiv.org/abs/2410.12112
Authors: Yilun Hao, Yang Zhang, Chuchu Fan
Keywords: large language models, recently demonstrated strong, demonstrated strong potential, planning problems, solving planning problems
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 50 pages, 25 figures, 7 tables

Abstract:While large language models (LLMs) have recently demonstrated strong potential in solving planning problems, there is a trade-off between flexibility and complexity. LLMs, as zero-shot planners themselves, are still not capable of directly generating valid plans for complex planning problems such as multi-constraint or long-horizon tasks. On the other hand, many frameworks aiming to solve complex planning problems often rely on task-specific preparatory efforts, such as task-specific in-context examples and pre-defined critics/verifiers, which limits their cross-task generalization capability. In this paper, we tackle these challenges by observing that the core of many planning problems lies in optimization problems: searching for the optimal solution (best plan) with goals subject to constraints (preconditions and effects of decisions). With LLMs’ commonsense, reasoning, and programming capabilities, this opens up the possibilities of a universal LLM-based approach to planning problems. Inspired by this observation, we propose LLMFP, a general-purpose framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch, with no task-specific examples needed. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLMFP achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPT-4o and Claude 3.5 Sonnet, significantly outperforming the best baseline (direct planning with OpenAI o1-preview) with 37.6% and 40.7% improvements. We also validate components of LLMFP with ablation experiments and analyze the underlying success and failure reasons.

[AI-101] Just-In-Time Software Defect Prediction via Bi-modal Change Representation Learning

Link: https://arxiv.org/abs/2410.12107
Authors: Yuze Jiang, Beijun Shen, Xiaodong Gu
Keywords: predicting software defects, identify potential defects, researchers have proposed, early stage, predicting software
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by JSS (The Journal of Systems and Software)

Abstract:For predicting software defects at an early stage, researchers have proposed just-in-time defect prediction (JIT-DP) to identify potential defects in code commits. The prevailing approaches train models to represent code changes in historical commits and utilize the learned representations to predict the presence of defects in the latest commit. However, existing models merely learn edits in source code, without considering the natural language intentions behind the changes. This limitation hinders their ability to capture deeper semantics. To address this, we introduce a novel bi-modal change pre-training model called BiCC-BERT. BiCC-BERT is pre-trained on a code change corpus to learn bi-modal semantic representations. To incorporate commit messages from the corpus, we design a novel pre-training objective called Replaced Message Identification (RMI), which learns the semantic association between commit messages and code changes. Subsequently, we integrate BiCC-BERT into JIT-DP and propose a new defect prediction approach – JIT-BiCC. By leveraging the bi-modal representations from BiCC-BERT, JIT-BiCC captures more profound change semantics. We train JIT-BiCC using 27,391 code changes and compare its performance with 8 state-of-the-art JIT-DP approaches. The results demonstrate that JIT-BiCC outperforms all baselines, achieving a 10.8% improvement in F1-score. This highlights its effectiveness in learning the bi-modal semantics for JIT-DP.

[AI-102] The Persian Rug: solving toy models of superposition using large-scale symmetries

Link: https://arxiv.org/abs/2410.12101
Authors: Aditya Cowsik, Kfir Dolev, Alex Infanger
Keywords: complete mechanistic description, minimal non-linear sparse, non-linear sparse data, large input dimension, compresses sparse data
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)

Abstract:We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in arXiv:2209.10652, compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged) large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model’s performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results is available online.
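The architecture described in the abstract (a linear compression, a linear decompression, then ReLU) can be sketched as a forward pass in a few lines. This is a pure-Python illustration with hand-picked weights; the names `w_in`, `w_out`, and the optional output `bias` are assumptions for the example, and nothing here reproduces the paper's training setup or analysis.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one inner list per output unit


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]


def autoencode(x: Vector, w_in: Matrix, w_out: Matrix, bias: Vector) -> Vector:
    """Compress x with w_in, decompress with w_out, add bias, apply ReLU.

    w_in has shape (hidden, n) and w_out has shape (n, hidden), with
    hidden < n so that several sparse features share hidden dimensions.
    The bias term is an assumption; pass zeros to omit it.
    """
    hidden = matvec(w_in, x)  # linear compression
    pre_activation = [p + b for p, b in zip(matvec(w_out, hidden), bias)]
    return relu(pre_activation)  # non-linear decompression
```

With a single hidden unit shared by features 0 and 2, activating feature 0 leaks into feature 2, and a negative bias partly suppresses that interference; this is the kind of trade-off the toy model is used to study.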

[AI-103] Bridging Large Language Models and Graph Structure Learning Models for Robust Representation Learning

Link: https://arxiv.org/abs/2410.12096
Authors: Guangxin Su, Yifan Zhu, Wenjie Zhang, Hanchen Wang, Ying Zhang
Keywords: encounters pervasive noise, graph structure learning, graph structure, Graph representation learning, node features
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Graph structure learning, Graph representation learning, Large language models, Graph neural networks

Abstract:Graph representation learning, involving both node features and graph structures, is crucial for real-world applications but often encounters pervasive noise. State-of-the-art methods typically address noise by focusing separately on node features with large language models (LLMs) and on graph structures with graph structure learning models (GSLMs). In this paper, we introduce LangGSL, a robust framework that integrates the complementary strengths of pre-trained language models and GSLMs to jointly enhance both node feature and graph structure learning. In LangGSL, we first leverage LLMs to filter noise in the raw data and extract valuable cleaned information as features, enhancing the synergy of downstream models. During the mutual learning phase in LangGSL, the core idea is to leverage the relatively small language model (LM) to process local attributes and generate reliable pseudo-labels and informative node embeddings, which are then integrated into the GSLM’s prediction phase. This approach enriches the global context and enhances overall performance. Meanwhile, GSLM refines the evolving graph structure constructed from the LM’s output, offering updated labels back to the LM as additional guidance, thus facilitating a more effective mutual learning process. The LM and GSLM work synergistically, complementing each other’s strengths and offsetting weaknesses within a variational information-maximizing framework, resulting in enhanced node features and a more robust graph structure. Extensive experiments on diverse graph datasets of varying scales and across different task scenarios demonstrate the scalability and effectiveness of the proposed approach.

[AI-104] Generative AI's aggregated knowledge versus web-based curated knowledge

Link: https://arxiv.org/abs/2410.12091
Authors: Ted Selker, Yunzi Wu
Keywords: Large Language Models, Language Models, Large Language, web-sourced search results, search results serve
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 19 references, 8 pages of appendices, 15 figures

Abstract:This paper explores what kinds of questions are best served by generative AI (GenAI) using Large Language Models (LLMs) that aggregate and package knowledge, and when traditional curated web-sourced search results serve users better. An experiment comparing product searches using ChatGPT, the Google search engine, or both helped us understand more about the compelling nature of generated responses. The experiment showed GenAI can speed up some explorations and decisions. We describe how search can deepen the testing of facts, logic, and context. We show where existing and emerging knowledge paradigms can help knowledge exploration in different ways. Experimenting with searches, our probes showed the value that curated web search provides for very specific, less popularly-known knowledge. GenAI excelled at bringing together knowledge for broad, relatively well-known topics. The value of curated and aggregated knowledge for different kinds of knowledge is reflected in different user goals. We developed a taxonomy for distinguishing when users are best served by these two approaches.

[AI-105] Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

Link: https://arxiv.org/abs/2410.12085
Authors: Fengyu Gao, Ruida Zhou, Tianhao Wang, Cong Shen, Jing Yang
Keywords: Large Language Models, Large Language, Language Models, perform in-context learning, contextual information embedded
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is the Precision-Focused Iterative Radius Reduction technique, which dynamically refines the aggregation radius - the scope of data grouping for noise addition - based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.

[AI-106] WeatherDG: LLM-assisted Procedural Weather Generation for Domain-Generalized Semantic Segmentation

Link: https://arxiv.org/abs/2410.12075
Authors: Chenghao Qian, Yuhu Guo, Yuhong Mo, Wenjing Li
Keywords: Large Language Model, Stable Diffusion, Large Language, driving-screen images based, Language Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In this work, we propose a novel approach, namely WeatherDG, that can generate realistic, weather-diverse, and driving-screen images based on the cooperation of two foundation models, i.e., Stable Diffusion (SD) and Large Language Model (LLM). Specifically, we first fine-tune the SD with source data, aligning the content and layout of generated samples with real-world driving scenarios. Then, we propose a procedural prompt generation method based on LLM, which can enrich scenario descriptions and help SD automatically generate more diverse, detailed images. In addition, we introduce a balanced generation strategy, which encourages the SD to generate high-quality objects of tail classes, such as riders and motorcycles, under various weather conditions. This segmentation-model-agnostic method can improve the generalization ability of existing models by additionally adapting them with the generated synthetic data. Experiments on three challenging datasets show that our method can significantly improve the segmentation performance of different state-of-the-art models on target domains. Notably, in the "Cityscapes to ACDC" setting, our method improves the baseline HRDA by 13.9% in mIoU.

[AI-107] V3D-SLAM: Robust RGB-D SLAM in Dynamic Environments with 3D Semantic Geometry Voting

Link: https://arxiv.org/abs/2410.12068
Authors: Tuan Dang, Khang Nguyen, Mandfred Huber
Keywords: Simultaneous localization, highly dynamic environments, localization and mapping, environments is challenging, challenging due
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Simultaneous localization and mapping (SLAM) in highly dynamic environments is challenging due to the correlation complexity between moving objects and the camera pose. Many methods have been proposed to deal with this problem; however, the moving properties of dynamic objects with a moving camera remain unclear. Therefore, to improve SLAM’s performance, minimizing disruptive events of moving objects with a physical understanding of 3D shapes and dynamics of objects is needed. In this paper, we propose a robust method, V3D-SLAM, to remove moving objects via two lightweight re-evaluation stages: identifying potentially moving and static objects using a spatial-reasoned Hough voting mechanism, and refining static objects by detecting dynamic noise caused by intra-object motions using Chamfer distances as similarity measurements. Our experiment on the TUM RGB-D benchmark on dynamic sequences with ground-truth camera trajectories showed that our methods outperform the most recent state-of-the-art SLAM methods. Our source code is available online.
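As a reference for the similarity measure named in the abstract, the sketch below computes a symmetric Chamfer distance between two small 3D point sets. Several variants exist (squared or unsquared distances, sum or mean); this version uses the mean unsquared nearest-neighbour distance in both directions, which is an assumption and may differ from the variant used in V3D-SLAM. It is a brute-force O(|A|·|B|) illustration, not an efficient implementation.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between point sets a and b.

    For each point in one set, take the Euclidean distance to its nearest
    neighbour in the other set; average those distances per direction and
    add the two directional averages.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # math.dist (Python 3.8+) returns the Euclidean distance.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)
```

A value near zero means the two point sets overlap closely, so comparing the same object's points across frames gives a rough signal of whether its shape or pose has changed.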

[AI-108] MFC-EQ: Mean-Field Control with Envelope Q-Learning for Moving Decentralized Agents in Formation IROS2024

Link: https://arxiv.org/abs/2410.12062
Authors: Qiushi Lin, Hang Ma
Keywords: Path Finding aiming, Multi-Agent Path Finding, plan collision-free paths, Path Finding, version of Moving
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to IROS 2024

Abstract:We study a decentralized version of Moving Agents in Formation (MAiF), a variant of Multi-Agent Path Finding aiming to plan collision-free paths for multiple agents with the dual objectives of reaching their goals quickly while maintaining a desired formation. The agents must balance these objectives under conditions of partial observation and limited communication. The formation maintenance depends on the joint state of all agents, whose dimensionality increases exponentially with the number of agents, rendering the learning process intractable. Additionally, learning a single policy that can accommodate different linear preferences for these two objectives presents a significant challenge. In this paper, we propose Mean-Field Control with Envelope Q-learning (MFC-EQ), a scalable and adaptable learning framework for this bi-objective multi-agent problem. We approximate the dynamics of all agents using mean-field theory while learning a universal preference-agnostic policy through envelope Q-learning. Our empirical evaluation of MFC-EQ across numerous instances shows that it outperforms state-of-the-art centralized MAiF baselines. Furthermore, MFC-EQ effectively handles more complex scenarios where the desired formation changes dynamically – a challenge that existing MAiF planners cannot address.

[AI-109] CrediRAG: Network-Augmented Credibility-Based Retrieval for Misinformation Detection in Reddit

Link: https://arxiv.org/abs/2410.12061
Authors: Ashwin Ram, Yigit Ege Bayiz, Arash Amini, Mustafa Munir, Radu Marculescu
Keywords: accurately detecting online, divisions in society, addressing this issue, detecting online misinformation, threatens democracy
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)

Abstract:Fake news threatens democracy and exacerbates the polarization and divisions in society; therefore, accurately detecting online misinformation is the foundation of addressing this issue. We present CrediRAG, the first fake news detection model that combines language models with access to a rich external political knowledge base with a dense social network to detect fake news across social media at scale. CrediRAG uses a news retriever to initially assign a misinformation score to each post based on the source credibility of similar news articles to the post title content. CrediRAG then improves the initial retrieval estimations through a novel weighted post-to-post network connected based on shared commenters and weighted by the average stance of all shared commenters across every pair of posts. We achieve an 11% increase in the F1-score in detecting misinformative posts over state-of-the-art methods. Extensive experiments conducted on curated real-world Reddit data of over 200,000 posts demonstrate the superior performance of CrediRAG over existing baselines. Thus, our approach offers a more accurate and scalable solution to combat the spread of fake news across social media platforms.

[AI-110] Large-scale cloze evaluation reveals that token prediction tasks are neither lexically nor semantically aligned

Link: https://arxiv.org/abs/2410.12057
Authors: Cassandra L. Jacobs, Loïc Grobol, Alvin Tsang
Keywords: token prediction level, cloze task, compare the generative, generative behavior, token prediction
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this work we compare the generative behavior at the next token prediction level in several language models by comparing them to human productions in the cloze task. We find that while large models trained for longer are typically better estimators of human productions, they reliably under-estimate the probabilities of human responses, over-rank rare responses, under-rank top responses, and produce highly distinct semantic spaces. Altogether, this work demonstrates in a tractable, interpretable domain that LM generations cannot be used as replacements of or models of the cloze task.

[AI-111] Enabling Data-Driven and Empathetic Interactions: A Context-Aware 3D Virtual Agent in Mixed Reality for Enhanced Financial Customer Experience

Link: https://arxiv.org/abs/2410.12051
Authors: Cindy Xu, Mengyu Chen, Pranav Deshpande, Elvir Azanli, Runqing Yang, Joseph Ligman
Keywords: Vision Language Models, utilizing Mixed Reality, Mixed Reality, Language Models, Vision Language
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multimedia (cs.MM)
Comments: To appear at the 1st Workshop on Intelligent XR: Harnessing AI for Next-Generation XR User Experiences at the International Symposium on Mixed and Augmented Reality (ISMAR) 2024

Abstract:In this paper, we introduce a novel system designed to enhance customer service in the financial and retail sectors through a context-aware 3D virtual agent, utilizing Mixed Reality (MR) and Vision Language Models (VLMs). Our approach focuses on enabling data-driven and empathetic interactions that ensure customer satisfaction by introducing situational awareness of the physical location, personalized interactions based on customer profiles, and rigorous privacy and security standards. We discuss our design considerations critical for deployment in real-world customer service environments, addressing challenges in user data management and sensitive information handling. We also outline the system architecture and key features unique to banking and retail environments. Our work demonstrates the potential of integrating MR and VLMs in service industries, offering practical insights in customer service delivery while maintaining high standards of security and personalization.

[AI-112] Sabia-3 Technical Report

Link: https://arxiv.org/abs/2410.12049
Authors: Hugo Abonizio, Thales Sales Almeida, Thiago Laitz, Roseval Malaquias Junior, Giovana Kerche Bonás, Rodrigo Nogueira, Ramon Pires
Keywords: large brazilian-centric corpus, language model trained, flagship language model, report presents, brazilian-centric corpus
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This report presents Sabiá-3, our new flagship language model trained on a large Brazilian-centric corpus. Evaluations across diverse professional and academic benchmarks show a strong performance on Portuguese and Brazil-related tasks. Sabiá-3 shows large improvements in comparison to our previous best model, Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3’s average performance matches frontier LLMs, while it is offered at a three to four times lower cost per token, reinforcing the benefits of domain specialization.

[AI-113] Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction

Link: https://arxiv.org/abs/2410.12040
Authors: Kaiqiao Han, Tianqing Fang, Zhaowei Wang, Yangqiu Song, Mark Steedman
Keywords: Large Language Models, Winograd Schema Challenge, superficial logical chains, Language Models, showcased remarkable proficiency
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:While Large Language Models (LLMs) have showcased remarkable proficiency in reasoning, there is still a concern about hallucinations and unreliable reasoning issues due to semantic associations and superficial logical chains. To evaluate the extent to which LLMs perform robust reasoning instead of relying on superficial logical chains, we propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the famous Winograd Schema Challenge (WSC) dataset. By simply reversing the concepts to those that are more associated with the wrong answer, we find that the performance of LLMs drops significantly despite the rationale of reasoning remaining the same. Furthermore, we propose Abstraction-of-Thought (AoT), a novel prompt method for recovering adversarial cases to normal cases using conceptual abstraction to improve LLMs’ robustness and consistency in reasoning, as demonstrated by experiments on CR-WSC.

[AI-114] A Survey on Deep Tabular Learning

Link: https://arxiv.org/abs/2410.12034
Authors: Shriyank Somvanshi, Subasish Das, Syed Aaqib Javed, Gian Antariksa, Ahmed Hossain
Keywords: presents unique challenges, deep learning due, Tabular data, deep learning models, industries like healthcare
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 43 pages, 18 figures, 3 tables

Abstract:Tabular data, widely used in industries like healthcare, finance, and transportation, presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure. This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet. These models incorporate attention mechanisms, feature embeddings, and hybrid architectures to address tabular data complexities. TabNet uses sequential attention for instance-wise feature selection, improving interpretability, while SAINT combines self-attention and intersample attention to capture complex interactions across features and data points, both advancing scalability and reducing computational overhead. Hybrid architectures such as TabTransformer and FT-Transformer integrate attention mechanisms with multi-layer perceptrons (MLPs) to handle categorical and numerical data, with FT-Transformer adapting transformers for tabular datasets. Research continues to balance performance and efficiency for large datasets. Graph-based models like GNN4TDL and GANDALF combine neural networks with decision trees or graph structures, enhancing feature representation and mitigating overfitting in small datasets through advanced regularization techniques. Diffusion-based models like the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) generate synthetic data to address data scarcity, improving model robustness. Similarly, models like TabPFN and Ptab leverage pre-trained language models, incorporating transfer learning and self-supervised techniques into tabular tasks. This survey highlights key advancements and outlines future research directions on scalability, generalization, and interpretability in diverse tabular data applications.

[AI-115] A Learning Search Algorithm for the Restricted Longest Common Subsequence Problem

Link: https://arxiv.org/abs/2410.12031
Authors: Marko Djukanović, Jaume Reixach, Ana Nikolikj, Tome Eftimov, Aleksandar Kartelj, Christian Blum
Keywords: Longest Common Subsequence, well-known Longest Common, Restricted Longest Common, Common Subsequence, Longest Common
Categories: Artificial Intelligence (cs.AI)
Comments: 33 pages, 12 figures

Abstract:This paper addresses the Restricted Longest Common Subsequence (RLCS) problem, an extension of the well-known Longest Common Subsequence (LCS) problem. This problem has significant applications in bioinformatics, particularly for identifying similarities and discovering mutual patterns and important motifs among DNA, RNA, and protein sequences. Building on recent advancements in solving this problem through a general search framework, this paper introduces two novel heuristic approaches designed to enhance the search process by steering it towards promising regions in the search space. The first heuristic employs a probabilistic model to evaluate partial solutions during the search process. The second heuristic is based on a neural network model trained offline using a genetic algorithm. A key aspect of this approach is extracting problem-specific features of partial solutions and the complete problem instance. An effective hybrid method, referred to as the learning beam search, is developed by combining the trained neural network model with a beam search framework. An important contribution of this paper is found in the generation of real-world instances where scientific abstracts serve as input strings, and a set of frequently occurring academic words from the literature are used as restricted patterns. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed approaches in solving the RLCS problem. Finally, an empirical explainability analysis is applied to the obtained results. In this way, key feature combinations and their respective contributions to the success or failure of the algorithms across different problem types are identified.
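
To make the search setting concrete, here is a minimal beam-search sketch for two input strings. It assumes the usual reading of RLCS, in which a solution must be a common subsequence of the inputs and must not contain any restricted pattern as a subsequence. The guidance function is a simple length bound chosen for illustration; it is a stand-in for, not a reproduction of, the probabilistic and neural heuristics the abstract describes. The names `is_subsequence` and `rlcs_beam_search` are hypothetical.

```python
def is_subsequence(pattern, text):
    """Return True if `pattern` can be read left to right inside `text`."""
    chars = iter(text)
    return all(ch in chars for ch in pattern)


def rlcs_beam_search(s1, s2, restricted, beam_width=10):
    """Beam search for a common subsequence of s1 and s2 that embeds
    none of the `restricted` patterns (assumed problem definition)."""
    alphabet = sorted(set(s1) & set(s2))
    beam = [(0, 0, "")]  # (next index in s1, next index in s2, partial solution)
    best = ""
    while beam:
        candidates = set()
        for i, j, seq in beam:
            for ch in alphabet:
                p, q = s1.find(ch, i), s2.find(ch, j)
                if p < 0 or q < 0:
                    continue  # ch is no longer available in one of the strings
                extended = seq + ch
                if any(is_subsequence(r, extended) for r in restricted):
                    continue  # extension would embed a restricted pattern
                candidates.add((p + 1, q + 1, extended))

        # Placeholder guidance: current length plus an optimistic bound
        # on how many more characters could still be appended.
        def score(c):
            return len(c[2]) + min(len(s1) - c[0], len(s2) - c[1])

        ranked = sorted(candidates, key=lambda c: (-score(c), c))
        beam = ranked[:beam_width]
        if beam:
            best = beam[0][2]  # every candidate at this level has the same length
    return best
```

For example, with `s1 = s2 = "abc"` and the restricted pattern `"ac"`, the unrestricted answer `"abc"` is infeasible and the search returns a length-2 string instead. A learned heuristic would replace `score` while the rest of the loop stays the same.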

[AI-116] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

Link: https://arxiv.org/abs/2410.12013
Authors: Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu
Keywords: architectures face challenges, high memory consumption, architectures face, redundancy in experts, face challenges
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. Pruning MoE can reduce network weights while maintaining model performance. Motivated by the recent observation of emergent large magnitude features in Large Language Models (LLM) and MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights, on each output neuron. Our pruning method is one-shot, requiring no retraining or weight updates. We evaluate our method on Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks. Experimental results show that our pruning method significantly outperforms state-of-the-art LLM pruning methods. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance post-pruning. Experimental results demonstrate that the Mixtral-8x7B model with 50% sparsity maintains 99% of the performance of the original model after the expert-wise knowledge distillation.
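
The pruning criterion in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the score of a weight is its magnitude times the norm of the corresponding input feature, with each token's activation scaled by that token's router (gate) weight, and that the lowest-scoring fraction is zeroed independently in each output row. The function names and the plain-list data layout are hypothetical.

```python
import math


def gated_input_norms(tokens, gates):
    """Per-feature L2 norm of the inputs, each token scaled by its gate weight."""
    n_features = len(tokens[0])
    return [
        math.sqrt(sum((g * x[j]) ** 2 for x, g in zip(tokens, gates)))
        for j in range(n_features)
    ]


def prune_rows(weights, tokens, gates, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row (one-shot, no retraining)."""
    norms = gated_input_norms(tokens, gates)
    pruned = []
    for row in weights:
        # Assumed score: |w_ij| * gate-weighted norm of input feature j.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        k = int(len(row) * sparsity)  # number of weights to drop in this row
        drop = set(sorted(range(len(row)), key=scores.__getitem__)[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned
```

With `weights = [[0.5, -2.0, 0.1, 1.0]]`, the largest-magnitude weight `-2.0` is still pruned when its input feature is almost always near zero, which is the point of weighting magnitudes by activations rather than pruning on magnitude alone.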

[AI-117] Bias Similarity Across Large Language Models

Link: https://arxiv.org/abs/2410.12010
Authors: Hyejun Jeong, Shiqing Ma, Amir Houmansadr
Keywords: machine learning models, models influence decision-making, Large Language Models, chronic problem, human society
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review

Abstract:Bias in machine learning models has been a chronic problem, especially as these models influence decision-making in human society. In generative AI, such as Large Language Models, the impact of bias is even more profound than in classification models. LLMs produce realistic and human-like content that users may unconsciously trust, which could perpetuate harmful stereotypes to the uncontrolled public. This becomes particularly concerning when LLMs are used in journalism or education. While prior studies have explored and quantified bias in individual AI models, no work has yet compared bias similarity across different LLMs. To fill this gap, we take a comprehensive look at ten open- and closed-source LLMs from four model families, assessing the extent of biases through output distribution. Using two datasets (one containing 4k questions and another with one million questions for each of the four bias dimensions), we measure functional similarity to understand how biases manifest across models. Our findings reveal that 1) fine-tuning does not significantly alter output distributions, which would limit its ability to mitigate bias, 2) LLMs within the same family tree do not produce similar output distributions, implying that addressing bias in one model could have limited implications for others in the same family, and 3) there is a possible risk of training data information leakage, raising concerns about privacy and data security. Our analysis provides insight into LLM behavior and highlights potential risks in real-world deployment.

[AI-118] Beyond Labels: A Self-Supervised Framework with Masked Autoencoders and Random Cropping for Breast Cancer Subtype Classification

Link: https://arxiv.org/abs/2410.12006
Authors: Annalisa Chiocchetti, Marco Dossena, Christopher Irwin, Luigi Portinale
Keywords: work contributes, contributes to breast, breast cancer sub-type, histopathological images, breast cancer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This work contributes to breast cancer sub-type classification using histopathological images. We utilize masked autoencoders (MAEs) to learn a self-supervised embedding tailored for computer vision tasks in this domain. This embedding captures informative representations of histopathological data, facilitating feature learning without extensive labeled datasets. During pre-training, we investigate employing a random crop technique to automatically generate a large dataset from whole-slide images (WSIs). Additionally, we assess the performance of linear probes for multi-class classification of cancer sub-types using the representations learnt by the MAE. Our approach aims to achieve strong performance on downstream tasks by leveraging the complementary strengths of ViTs and autoencoders. We evaluate our model’s performance on the BRACS dataset and compare it with existing benchmarks.

[AI-119] The Fair Language Model Paradox

Link: https://arxiv.org/abs/2410.11985
Authors: Andrea Pinto, Tomer Galanti, Randall Balestriero
Keywords: Large Language Models, Large Language, real-world applications, widely deployed, deployed in real-world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.
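
The gap between batch-level and token-level evaluation can be illustrated with a small sketch. It assumes per-token losses are already available and splits tokens into "rare" and "frequent" by a hypothetical count threshold computed within the batch itself; a real analysis would more plausibly use corpus-level frequencies. The function name and threshold are illustrative only.

```python
from collections import Counter, defaultdict


def loss_by_frequency(tokens, losses, rare_threshold=2):
    """Compare the aggregate mean loss with mean losses per frequency bucket."""
    counts = Counter(tokens)
    buckets = defaultdict(list)
    for tok, loss in zip(tokens, losses):
        # Hypothetical split: a token is "rare" if it occurs at most
        # `rare_threshold` times in this batch.
        name = "rare" if counts[tok] <= rare_threshold else "frequent"
        buckets[name].append(loss)
    report = {name: sum(vals) / len(vals) for name, vals in buckets.items()}
    report["aggregate"] = sum(losses) / len(losses)
    return report
```

In a batch with eight occurrences of a common token at loss 1.0 and two rare tokens at losses 5.0 and 7.0, the aggregate mean is 2.0 while the rare-token mean is 6.0. The aggregate figure is dominated by the frequent token, so it can look healthy while rare tokens are poorly fit.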

[AI-120] Generative AI Policies under the Microscope: How CS Conferences Are Navigating the New Frontier in Scholarly Writing

Link: https://arxiv.org/abs/2410.11977
Authors: Mahjabin Nahar, Sian Lee, Becky Guillen, Dongwon Lee
Keywords: computer science conferences, policy adoption, paper explores, explores the current, current state
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper explores the current state of generative AI policies of computer science conferences and offers guidelines for policy adoption.

[AI-121] DDIL: Improved Diffusion Distillation With Imitation Learning

Link: https://arxiv.org/abs/2410.11971
Authors: Risheek Garrepalli, Shweta Mahajan, Munawar Hayat, Fatih Porikli
Keywords: sampling requires multiple, requires multiple denoising, multiple denoising network, denoising network passes, limiting practicality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple denoising network passes, limiting practicality. Efforts such as progressive distillation or consistency distillation have shown promise by reducing the number of passes at the expense of the quality of the generated samples. In this work we identify covariate shift as one of the reasons for the poor performance of multi-step distilled models, arising from compounding error at inference time. To address covariate shift, we formulate diffusion distillation within an imitation learning (DDIL) framework and enhance the training distribution for distilling diffusion models on both the data distribution (forward diffusion) and student-induced distributions (backward diffusion). Training on the data distribution helps to diversify the generations by preserving the marginal data distribution, and training on the student distribution addresses compounding error by correcting covariate shift. In addition, we adopt a reflected diffusion formulation for distillation and demonstrate improved performance and stable training across different distillation methods. We show that DDIL consistently improves on the baseline algorithms of progressive distillation (PD), latent consistency models (LCM), and Distribution Matching Distillation (DMD2).

[AI-122] CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Link: https://arxiv.org/abs/2410.11963
Authors: Qingqing Cao, Mahyar Najibi, Sachin Mehta
Keywords: Pretraining robust vision, Pretraining robust, potentially misaligned, relies on large-scale, long-tail distributions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models such as large language models or diffusion models to reason and recompose basic elements such that synthetic samples are natural and composed in diverse ways. CtrlSynth is a closed-loop, training-free, and modular framework, making it easy to support different pretrained models. With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.

[AI-123] ChatHouseDiffusion: Prompt-Guided Generation and Editing of Floor Plans

Link: https://arxiv.org/abs/2410.11908
Authors: Sizhong Qin, Chengyu He, Qiaoyun Chen, Sen Yang, Wenjie Liao, Yi Gu, Xinzheng Lu
Keywords: architectural planning, requiring a high, generation and editing, critical in architectural, high degree
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The generation and editing of floor plans are critical in architectural planning, requiring a high degree of flexibility and efficiency. Existing methods demand extensive input information and lack the capability for interactive adaptation to user modifications. This paper introduces ChatHouseDiffusion, which leverages large language models (LLMs) to interpret natural language input, employs graphormer to encode topological relationships, and uses diffusion models to flexibly generate and edit floor plans. This approach allows iterative design adjustments based on user ideas, significantly enhancing design efficiency. Compared to existing models, ChatHouseDiffusion achieves higher Intersection over Union (IoU) scores, permitting precise, localized adjustments without the need for complete redesigns, thus offering greater practicality. Experiments demonstrate that our model not only strictly adheres to user specifications but also facilitates a more intuitive design process through its interactive capabilities.
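
For readers unfamiliar with the metric, Intersection over Union is the number of cells two masks share divided by the number of cells either mask covers. The sketch below computes it for two same-shaped binary grids, such as a generated room mask and a reference room mask. The grid representation and the convention of returning 1.0 when both masks are empty are assumptions made for illustration and are not taken from the paper.

```python
def iou(mask_a, mask_b):
    """Intersection over Union of two same-shaped binary grids (rows of 0/1)."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                inter += 1  # cell covered by both masks
            if a or b:
                union += 1  # cell covered by at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

Two 2x3 masks that overlap in one column out of three, for instance `[[1, 1, 0], [1, 1, 0]]` and `[[0, 1, 1], [0, 1, 1]]`, share 2 cells out of 6 covered cells, giving an IoU of 1/3.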

[AI-124] Empowering Users in Digital Privacy Management through Interactive LLM-Based Agents

Link: https://arxiv.org/abs/2410.11906
Authors: Bolun Sun, Yifan Zhou, Haiyun Jiang
Keywords: Data Practice Identification, interactive dialogue agent, Privacy Question Answering, large language models, Choice Identification
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:This paper presents a novel application of large language models (LLMs) to enhance user comprehension of privacy policies through an interactive dialogue agent. We demonstrate that LLMs significantly outperform traditional models in tasks like Data Practice Identification, Choice Identification, Policy Summarization, and Privacy Question Answering, setting new benchmarks in privacy policy analysis. Building on these findings, we introduce an innovative LLM-based agent that functions as an expert system for processing website privacy policies, guiding users through complex legal language without requiring them to pose specific questions. A user study with 100 participants showed that users assisted by the agent had higher comprehension levels (mean score of 2.6 out of 3 vs. 1.8 in the control group), reduced cognitive load (task difficulty ratings of 3.2 out of 10 vs. 7.8), increased confidence in managing privacy, and completed tasks in less time (5.5 minutes vs. 15.8 minutes). This work highlights the potential of LLM-based agents to transform user interaction with privacy policies, leading to more informed consent and empowering users in the digital services landscape.

[AI-125] A Scalable Communication Protocol for Networks of Large Language Models

Link: https://arxiv.org/abs/2410.11905
Authors: Samuele Marro, Emanuele La Malfa, Jesse Wright, Guohao Li, Nigel Shadbolt, Michael Wooldridge, Philip Torr
Keywords: Agent Communication Trilemma, prerequisite for collaboration, Communication Trilemma, Communication, Agent Communication
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Communication is a prerequisite for collaboration. When scaling networks of AI-powered agents, communication must be versatile, efficient, and portable. These requisites, which we refer to as the Agent Communication Trilemma, are hard to achieve in large networks of agents. We introduce Agora, a meta protocol that leverages existing communication standards to make LLM-powered agents solve complex problems efficiently. In Agora, agents typically use standardised routines for frequent communications, natural language for rare communications, and LLM-written routines for everything in between. Agora sidesteps the Agent Communication Trilemma and robustly handles changes in interfaces and members, allowing unprecedented scalability with full decentralisation and minimal involvement of human beings. On large Agora networks, we observe the emergence of self-organising, fully automated protocols that achieve complex goals without human intervention.

[AI-126] Personalised Feedback Framework for Online Education Programmes Using Generative AI

Link: https://arxiv.org/abs/2410.11904
Authors: Ievgeniia Kuzminykh, Tareita Nawaz, Shihao Shenzhang, Bogdan Ghita, Jeffery Raphael, Hannan Xiao
Keywords: large language modules, online education programmes, learning management systems, language modules, education programmes
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Submitted to journal

Abstract:AI tools, particularly large language models, have recently proven their effectiveness within learning management systems and online education programmes. As feedback continues to play a crucial role in learning and assessment in schools, educators must carefully customise the use of AI tools in order to optimally support students in their learning journey. Numerous research studies have attempted to improve educational feedback systems, but most have focused on qualitatively benchmarking AI feedback against human-generated feedback. This paper explores an alternative feedback framework which extends the capabilities of ChatGPT by integrating embeddings, enabling a more nuanced understanding of educational materials and facilitating topic-targeted feedback for quiz-based assessments. As part of the study, we proposed and developed a proof-of-concept solution, achieving an efficacy rate of 90% and 100% for open-ended and multiple-choice questions, respectively. The results showed that our framework not only surpasses expectations but also rivals human narratives, highlighting the potential of AI in revolutionising educational feedback mechanisms.

[AI-127] FLARE: Faithful Logic-Aided Reasoning and Exploration

Link: https://arxiv.org/abs/2410.11900
Authors: Erik Arakelyan, Pasquale Minervini, Pat Verga, Patrick Lewis, Isabelle Augenstein
Keywords: Modern Question Answering, Large Language Models, Large Language, Modern Question, Question Answering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract:Modern Question Answering (QA) and Reasoning approaches based on Large Language Models (LLMs) commonly use prompting techniques, such as Chain-of-Thought (CoT), assuming the resulting generation will have a more granular exploration and reasoning over the question space and scope. However, such methods struggle with generating outputs that are faithful to the intermediate chain of reasoning produced by the model. On the other end of the spectrum, neuro-symbolic methods such as Faithful CoT (F-CoT) propose to combine LLMs with external symbolic solvers. While such approaches boast a high degree of faithfulness, they usually require a model trained for code generation and struggle with tasks that are ambiguous or hard to formalise strictly. We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), a novel interpretable approach for traversing the problem space using task decompositions. We use the LLM to plan a solution, soft-formalise the query into facts and predicates using a logic programming code and simulate that code execution using an exhaustive multi-hop search over the defined space. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers. Our methods achieve SOTA results on 7 out of 9 diverse reasoning benchmarks. We also show that model faithfulness positively correlates with overall performance and further demonstrate that FLARE allows pinpointing the decisive factors sufficient for and leading to the correct answer with optimal reasoning during the multi-hop search.

[AI-128] Study on the Helpfulness of Explainable Artificial Intelligence

Link: https://arxiv.org/abs/2410.11896
Authors: Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
Keywords: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, machine learning-powered applications, building advanced machine
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: World Conference on Explainable Artificial Intelligence

Abstract:Explainable Artificial Intelligence (XAI) is essential for building advanced machine learning-powered applications, especially in critical domains such as medical diagnostics or autonomous driving. Legal, business, and ethical requirements motivate using effective XAI, but the increasing number of different methods makes it challenging to pick the right ones. Further, as explanations are highly context-dependent, measuring the effectiveness of XAI methods without users can only reveal a limited amount of information, excluding human factors such as the ability to understand it. We propose to evaluate XAI methods via the user’s ability to successfully perform a proxy task, designed such that a good performance is an indicator for the explanation to provide helpful information. In other words, we address the helpfulness of XAI for human decision-making. Further, a user study on state-of-the-art methods was conducted, showing differences in their ability to generate trust and skepticism and the ability to judge the rightfulness of an AI decision correctly. Based on the results, we highly recommend using and extending this approach for more objective-based human-centered user studies to measure XAI performance in an end-to-end fashion.

[AI-129] Neural Metamorphosis ECCV2024

Link: https://arxiv.org/abs/2410.11878
Authors: Xingyi Yang, Xinchao Wang
Keywords: termed Neural Metamorphosis, learning paradigm termed, paradigm termed Neural, build self-morphable neural, Neural Metamorphosis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in ECCV2024

Abstract:This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks. Contrary to crafting separate models for different architectures or sizes, NeuMeta directly learns the continuous weight manifold of neural networks. Once trained, we can sample weights for any-sized network directly from the manifold, even for previously unseen configurations, without retraining. To achieve this ambitious goal, NeuMeta trains neural implicit functions as hypernetworks. They accept coordinates within the model space as input and generate the corresponding weight values on the manifold. In other words, the implicit function is learned such that the predicted weights perform well across various model sizes. In training those models, we notice that the final performance closely relates to the smoothness of the learned manifold. In pursuit of enhancing this smoothness, we employ two strategies. First, we permute weight matrices to achieve intra-model smoothness by solving the Shortest Hamiltonian Path problem. Second, we add noise to the input coordinates when training the implicit function, ensuring that models of various sizes show consistent outputs. As such, NeuMeta shows promising results in synthesizing parameters for various network configurations. Our extensive tests in image classification, semantic segmentation, and image generation reveal that NeuMeta sustains full-size performance even at a 75% compression rate.
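
The permutation step can be pictured as reordering the rows of a weight matrix so that neighbouring rows are as similar as possible, which is a shortest-Hamiltonian-path problem over the rows. The sketch below uses a greedy nearest-neighbour heuristic as a stand-in; the abstract does not say which solver the authors use, and both function names are hypothetical. In a real network the same permutation would also have to be applied to the matching input dimension of the following layer so that the network's function is unchanged.

```python
def row_distance(a, b):
    """Euclidean distance between two weight rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def greedy_row_order(rows):
    """Order rows so adjacent rows are close (nearest-neighbour heuristic).

    This approximates a short Hamiltonian path starting from row 0;
    it is not guaranteed to find the shortest one.
    """
    order = [0]
    remaining = set(range(1, len(rows)))
    while remaining:
        last = rows[order[-1]]
        # Break distance ties by index so the result is deterministic.
        nxt = min(remaining, key=lambda k: (row_distance(last, rows[k]), k))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def path_length(rows, order):
    """Total distance between consecutive rows under a given ordering."""
    return sum(row_distance(rows[a], rows[b]) for a, b in zip(order, order[1:]))
```

For the toy rows `[[0.0], [10.0], [1.0], [9.0]]`, the greedy order is `[0, 2, 3, 1]`, which shortens the path between consecutive rows from 27 to 10 and so gives a smoother sequence of weights along the row index.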

[AI-130] A Framework for Collaborating a Large Language Model Tool in Brainstorming for Triggering Creative Thoughts

Link: https://arxiv.org/abs/2410.11877
Authors: Hung-Fu Chang, Tong Li
Keywords: synthesizing previous insights, redefining existing concepts, Large Language Models, previous insights, redefining existing
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures

Abstract:Creativity involves not only generating new ideas from scratch but also redefining existing concepts and synthesizing previous insights. Among various techniques developed to foster creative thinking, brainstorming is widely used. With recent advancements in Large Language Models (LLMs), tools like ChatGPT have significantly impacted various fields by using prompts to facilitate complex tasks. While current research primarily focuses on generating accurate responses, there is a need to explore how prompt engineering can enhance creativity, particularly in brainstorming. Therefore, this study addresses this gap by proposing a framework called GPS, which employs goals, prompts, and strategies to guide designers to systematically work with an LLM tool for improving the creativity of ideas generated during brainstorming. Additionally, we adapted the Torrance Tests of Creative Thinking (TTCT) for measuring the creativity of the ideas generated by AI. Our framework, tested through a design example and a case study, demonstrates its effectiveness in stimulating creativity and its seamless LLM tool integration into design practices. The results indicate that our framework can benefit brainstorming sessions with LLM tools, enhancing both the creativity and usefulness of generated ideas.

[AI-131] Rescriber: Smaller-LLM-Powered User-Led Data Minimization for Navigating Privacy Trade-offs in LLM-Based Conversational Agent

Link: https://arxiv.org/abs/2410.11876
Authors: Jijie Zhou, Eryue Xu, Yaoyao Wu, Tianshi Li
Keywords: LLM-based conversational agents, resulted in excessive, identifiable or sensitive, LLM-based conversational, conversational agents
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:The proliferation of LLM-based conversational agents has resulted in excessive disclosure of identifiable or sensitive information. However, existing technologies fail to offer perceptible control or account for users’ personal preferences about privacy-utility tradeoffs due to the lack of user involvement. To bridge this gap, we designed, built, and evaluated Rescriber, a browser extension that supports user-led data minimization in LLM-based conversational agents by helping users detect and sanitize personal information in their prompts. Our studies (N=12) showed that Rescriber helped users reduce unnecessary disclosure and addressed their privacy concerns. Users’ subjective perceptions of the system powered by Llama3-8B were on par with that by GPT-4. The comprehensiveness and consistency of the detection and sanitization emerge as essential factors that affect users’ trust and perceived protection. Our findings confirm the viability of smaller-LLM-powered, user-facing, on-device privacy controls, presenting a promising approach to address the privacy and trust challenges of AI.

[AI-132] A Framework for SLO Carbon and Wastewater-Aware Sustainable FaaS Cloud Platform Management

Link: https://arxiv.org/abs/2410.11875
Authors: Sirui Qi, Hayden Moore, Ninad Hogade, Dejan Milojicic, Cullen Bash, Sudeep Pasricha
Keywords: traditional serverful approaches, growing cloud computing, cloud computing paradigm, serverful approaches, growing cloud
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Function-as-a-Service (FaaS) is a growing cloud computing paradigm that is expected to reduce the user cost of service over traditional serverful approaches. However, the environmental impact of FaaS has not received much attention. We investigate FaaS scheduling and scaling from a sustainability perspective in this work. We find that the service-level objectives (SLOs) of FaaS and carbon emissions conflict with each other. We also find that SLO-focused FaaS scheduling can exacerbate water use in a datacenter. We propose a novel sustainability-focused FaaS scheduling and scaling framework to co-optimize SLO performance, carbon emissions, and wastewater generation.

[AI-133] Enhancing UI Location Capabilities of Autonomous Agents

Link: https://arxiv.org/abs/2410.11872
Authors: Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
Keywords: graphical user interfaces, digital devices equipped, effective automation tools, user interfaces, increasingly important
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. Although multimodal large language models (MLLMs) like GPT-4V excel at tasks such as drafting emails, they struggle with GUI interactions, which limits their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent significantly outperforms other prompt-based autonomous agents (such as CogAgent, AppAgent, and Auto-UI) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.

[AI-134] TinyClick: Single-Turn Agent for Empowering GUI Automation

Link: https://arxiv.org/abs/2410.11871
Authors: Pawel Pawlowski, Krystian Zawistowski, Wojciech Lapacz, Marcin Skorupa, Adam Wiacek, Sebastien Postansque, Jakub Hoscilowicz
Keywords: graphical user interface, interaction tasks, present a single-turn, user interface, GUI
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 4 pages without references, 2 figures

Abstract:We present a single-turn agent for graphical user interface (GUI) interaction tasks, using the vision-language model Florence-2-Base. The main goal of the agent is to click on the desired UI element based on the screenshot and user command. It demonstrates strong performance on Screenspot and OmniAct, while maintaining a compact size of 0.27B parameters and minimal latency. The main improvement comes from multitask training and MLLM-based data augmentation. Manually annotated corpora are scarce, but we show that re-annotating annotated data with an MLLM for multitask training might produce much better results. On Screenspot and OmniAct, our model outperforms both GUI-specific models (e.g., SeeClick) and MLLMs (e.g., GPT-4V).

[AI-135] An Innovative Solution: AI-Based Digital Screen-Integrated Tables for Educational Settings

Link: https://arxiv.org/abs/2410.11866
Authors: S. Tamang, D. J. Bora
Keywords: customized assignment allotment, digital customized assignment, identifying slow-learners, slow-learners and fast-learners, AI-Based frameworks
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In this paper, we review different AI-based frameworks used for educational tasks such as digital customized assignment allotment, performance monitoring, and identifying slow learners and fast learners. This application describes a novel invention, digital screen-integrated tables, designed specifically for educational settings. The tables feature integrated digital screens controlled by a central processing unit (CPU), enabling synchronized display of educational content such as textbooks, presentations, exam questions, and interactive learning materials. Additionally, the invention facilitates the collection of student performance data during classroom activities and assessments. The gathered data is analyzed using machine learning models to identify patterns and trends in student learning behaviours. By leveraging machine learning algorithms, educators can ascertain whether a student is a fast learner or a slow learner, so that the teacher can allocate more resources to the slow learners. This innovative approach aims to address the evolving needs of modern classrooms by providing a dynamic and data-driven learning environment. The unique integration of digital screens into traditional classroom furniture represents a significant advancement in educational technology. This patent filing encompasses the design, functionality, and method of operation of the digital screen-integrated tables, emphasizing their innovative features and applications in educational institutions.

[AI-136] Shifting the Human-AI Relationship: Toward a Dynamic Relational Learning-Partner Model

Link: https://arxiv.org/abs/2410.11864
Authors: Julia Mossbridge
Keywords: longer suffices, current paradigm, paradigm of treating, passive tool, tool no longer
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: White Paper

Abstract:As artificial intelligence (AI) continues to evolve, the current paradigm of treating AI as a passive tool no longer suffices. As a human-AI team, we together advocate for a shift toward viewing AI as a learning partner, akin to a student who learns from interactions with humans. Drawing from interdisciplinary concepts such as ecorithms, order from chaos, and cooperation, we explore how AI can evolve and adapt in unpredictable environments. Arising from these brief explorations, we present two key recommendations: (1) foster ethical, cooperative treatment of AI to benefit both humans and AI, and (2) leverage the inherent heterogeneity between human and AI minds to create a synergistic hybrid intelligence. By reframing AI as a dynamic partner, a model emerges in which AI systems develop alongside humans, learning from human interactions and feedback loops including reflections on team conversations. Drawing from a transpersonal and interdependent approach to consciousness, we suggest that a “third mind” emerges through collaborative human-AI relationships. Through design interventions such as interactive learning and conversational debriefing and foundational interventions allowing AI to model multiple types of minds, we hope to provide a path toward more adaptive, ethical, and emotionally healthy human-AI relationships. We believe this dynamic relational learning-partner (DRLP) model for human-AI teaming, if enacted carefully, will improve our capacity to address powerful solutions to seemingly intractable problems.

[AI-137] ChatVis: Automating Scientific Visualization with a Large Language Model

Link: https://arxiv.org/abs/2410.11863
Authors: Tanwi Mallick, Orcun Yildiz, David Lenz, Tom Peterka
Keywords: synthetically generate Python, large language model, generate Python scripts, generate Python, develop an iterative
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We develop an iterative assistant we call ChatVis that can synthetically generate Python scripts for data analysis and visualization using a large language model (LLM). The assistant allows a user to specify the operations in natural language; it attempts to generate a Python script for the desired operations and prompts the LLM to revise the script as needed until it executes correctly. The iterations include an error detection and correction mechanism that extracts error messages from the execution of the script and subsequently prompts the LLM to correct the error. Our method demonstrates correct execution on five canonical visualization scenarios, comparing results with ground truth. We also compared our results with scripts generated by several other LLMs without any assistance. In every instance, ChatVis successfully generated the correct script, whereas the unassisted LLMs failed to do so. The code is available on GitHub.
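
The generate, execute, and repair loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the ChatVis implementation: `run_script`, `repair_loop`, and `fake_llm` are hypothetical names, and `fake_llm` is a hard-coded stand-in for a real LLM call. For simplicity the script is executed in-process with `exec`; a real system would isolate generated code in a sandbox or subprocess.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_script(source: str) -> Optional[str]:
    """Execute source in a fresh namespace; return None on success, else the traceback text."""
    try:
        exec(compile(source, "<generated>", "exec"), {"__name__": "__generated__"})
        return None
    except Exception:
        return traceback.format_exc()


def repair_loop(
    request: str,
    generate: Callable[[str, str, Optional[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Ask `generate` for a script, run it, and feed errors back until it executes cleanly."""
    script = ""
    error: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        # The generator sees the request, the previous attempt, and the last error (if any).
        script = generate(request, script, error)
        error = run_script(script)
        if error is None:
            return script, round_no
    raise RuntimeError(f"no working script after {max_rounds} rounds:\n{error}")


def fake_llm(request: str, previous_script: str, error: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: the first draft has a bug, the revision fixes it."""
    if error is None:
        return "values = [1, 2, 3]\nprint(sum(values) / count)"  # NameError: count
    return "values = [1, 2, 3]\nprint(sum(values) / len(values))"


if __name__ == "__main__":
    final_script, rounds = repair_loop("print the mean of three values", fake_llm)
    print(f"succeeded after {rounds} round(s)")
```

Passing the full traceback back to the generator, rather than a bare success flag, is what lets the revision step target the specific failure.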

[AI-138] Towards using Reinforcement Learning for Scaling and Data Replication in Cloud Systems

Link: https://arxiv.org/abs/2410.11862
Authors: Riad Mokadem (IRIT-PYRAMIDE), Fahem Arar (IRIT-PYRAMIDE, ESI), Djamel Eddine Zegour
Keywords: Cloud providers opt, enable automatic resource, intuitive nature, providers opt, opt for threshold-based
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:Given its intuitive nature, many cloud providers opt for threshold-based data replication to enable automatic resource scaling. However, setting thresholds effectively requires human intervention to calibrate them for each metric, as well as deep knowledge of current workload trends, which can be challenging to obtain. Reinforcement learning is used in many areas related to cloud computing, and it is a promising approach for deriving automatic data replication strategies. In this work, we survey data replication strategies and data scaling based on reinforcement learning (RL).

[AI-139] Investigating Role of Big Five Personality Traits in Audio-Visual Rapport Estimation

Link: https://arxiv.org/abs/2410.11861
Authors: Takato Hayashi, Ryusei Kimura, Ryo Ishii, Shogo Okada
Keywords: Automatic rapport estimation, Automatic rapport, estimation performance, estimation, affective computing
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:Automatic rapport estimation in social interactions is a central component of affective computing. Recent reports have shown that the estimation performance of rapport in initial interactions can be improved by using the participant’s personality traits as the model’s input. In this study, we investigate whether this finding applies to interactions between friends by developing rapport estimation models that utilize nonverbal cues (audio and facial expressions) as inputs. Our experimental results show that adding Big Five features (BFFs) to nonverbal features can improve the estimation performance of self-reported rapport in dyadic interactions between friends. Next, we demystify how BFFs improve the estimation performance of rapport through a comparative analysis between models with and without BFFs. We decompose rapport ratings into perceiver effects (people’s tendency to rate other people), target effects (people’s tendency to be rated by other people), and relationship effects (people’s unique ratings for a specific person) using the social relations model. We then analyze the extent to which BFFs contribute to capturing each effect. Our analysis demonstrates that the perceiver’s and the target’s BFFs lead estimation models to capture the perceiver and the target effects, respectively. Furthermore, our experimental results indicate that the combinations of facial expression features and BFFs achieve the best estimation performance not only in estimating rapport ratings, but also in estimating the three effects. Our study is the first step toward understanding why personality-aware estimation models of interpersonal perception accomplish high estimation performance.

[AI-140] Comparing Zealous and Restrained AI Recommendations in a Real-World Human-AI Collaboration Task

Link: https://arxiv.org/abs/2410.11860
Authors: Chengyuan Xu, Kuo-Chin Lien, Tobias Höllerer
Keywords: AI-assisted decision-making system, decision-making system, designing an AI-assisted, AI-assisted decision-making, tradeoff between precision
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, accepted to ACM CHI 2023

Abstract:When designing an AI-assisted decision-making system, there is often a tradeoff between precision and recall in the AI’s recommendations. We argue that careful exploitation of this tradeoff can harness the complementary strengths in the human-AI collaboration to significantly improve team performance. We investigate a real-world video anonymization task for which recall is paramount and more costly to improve. We analyze the performance of 78 professional annotators working with a) no AI assistance, b) a high-precision “restrained” AI, and c) a high-recall “zealous” AI in over 3,466 person-hours of annotation work. In comparison, the zealous AI helps human teammates achieve significantly shorter task completion time and higher recall. In a follow-up study, we remove AI assistance for everyone and find negative training effects on annotators trained with the restrained AI. These findings and our analysis point to important implications for the design of AI assistance in recall-demanding scenarios.
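
To make the precision and recall tradeoff in this abstract concrete, the sketch below scores the same toy detector at two decision thresholds. The scores, labels, and threshold values are invented for illustration and do not come from the paper; the "restrained" and "zealous" names only mirror the abstract's terminology.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is flagged as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Toy detector outputs (made up): confidence score and ground-truth label per item.
SCORES = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
LABELS = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

RESTRAINED_THRESHOLD = 0.75  # flags only confident detections: high precision
ZEALOUS_THRESHOLD = 0.35     # flags liberally: high recall

if __name__ == "__main__":
    print("restrained:", precision_recall(SCORES, LABELS, RESTRAINED_THRESHOLD))
    print("zealous:   ", precision_recall(SCORES, LABELS, ZEALOUS_THRESHOLD))
```

On this toy data the restrained setting flags nothing wrong but misses two of the five true positives, while the zealous setting finds all five at the cost of two false alarms that a human teammate would then have to reject.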

[AI-141] Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Link: https://arxiv.org/abs/2410.11855
Authors: Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, Kai Shu
Keywords: small wearable devices, future computing architectures, leadership computing facilities, large-scale leadership computing, critical design metric
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high performance computing (HPC) systems. Moreover, existing methods typically rely on either purely offline training or a hybrid of offline and online training, which are impractical and lead to energy loss during data collection. Therefore, this paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios. The problem is challenging due to the inherent performance-energy trade-offs of GPUs, the exploration-exploitation dilemma across frequencies, and the lack of explicit performance counters in GPUs. To address these challenges, we formulate the online energy consumption optimization problem as a multi-armed bandit framework and develop a novel bandit-based framework, EnergyUCB. EnergyUCB is designed to dynamically adjust GPU core frequencies in real-time, reducing energy consumption with minimal impact on performance. Specifically, the proposed framework EnergyUCB (1) balances the performance-energy trade-off in the reward function, (2) effectively navigates the exploration-exploitation dilemma when adjusting GPU core frequencies online, and (3) leverages the ratio of GPU core utilization to uncore utilization as a real-time GPU performance metric. Experiments on a wide range of real-world HPC benchmarks demonstrate that EnergyUCB can achieve substantial energy savings. The code of EnergyUCB is publicly available.
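
As a rough illustration of the bandit formulation in this abstract, the sketch below runs the textbook UCB1 algorithm over a handful of candidate core frequencies. It is not the EnergyUCB code: the frequency list, the mean-reward table (standing in for a reward that trades energy against performance), and the Gaussian noise model are all hypothetical, and a real controller would measure rewards from hardware rather than simulate them.

```python
import math
import random
from typing import Callable, List


def ucb1(n_arms: int, pull: Callable[[int], float], rounds: int) -> List[int]:
    """Run UCB1 for `rounds` steps and return how often each arm was chosen."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: try every arm once
        else:
            # Choose the arm with the highest optimistic estimate (mean + exploration bonus).
            arm = max(
                range(n_arms),
                key=lambda k: means[k] + math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental running mean
    return counts


# Hypothetical core frequencies (MHz) and made-up mean rewards for each one.
FREQS_MHZ = [800, 1000, 1200, 1400, 1600]
MEAN_REWARD = [0.2, 0.5, 0.9, 0.5, 0.2]


def make_simulator(seed: int = 0) -> Callable[[int], float]:
    """Return a noisy reward function; a stand-in for real energy/performance measurements."""
    rng = random.Random(seed)

    def pull(arm: int) -> float:
        return MEAN_REWARD[arm] + rng.gauss(0.0, 0.05)

    return pull


def most_used_frequency(rounds: int = 5000, seed: int = 0) -> int:
    """Frequency that UCB1 selected most often in the simulation."""
    counts = ucb1(len(FREQS_MHZ), make_simulator(seed), rounds)
    return FREQS_MHZ[counts.index(max(counts))]


if __name__ == "__main__":
    print("most used frequency (MHz):", most_used_frequency())
```

The exploration bonus shrinks as an arm is sampled more often, so the controller keeps occasionally probing other frequencies while spending most steps on the one with the best observed reward.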

[AI-142] From Commands to Prompts: LLM-based Semantic File System for AIOS

Link: https://arxiv.org/abs/2410.11843
Authors: Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang
Keywords: Large language models, file, Large language, demonstrated significant potential, semantic file
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS). However, when these applications and systems interact with the underlying file system, the file system still follows the traditional paradigm: reliant on manual navigation through precise commands. This paradigm poses a bottleneck to the usability of these systems, as users are required to navigate complex folder hierarchies and remember cryptic file names. To address this limitation, we propose an LLM-based semantic file system (LSFS) for prompt-driven file management. Unlike conventional approaches, LSFS incorporates LLMs to enable users or agents to interact with files through natural language prompts, facilitating semantic file management. At the macro-level, we develop a comprehensive API set to achieve semantic file management functionalities, such as semantic file retrieval, file update monitoring and summarization, and semantic file rollback. At the micro-level, we store files by constructing semantic indexes for them, and design and implement syscalls of different semantic operations (e.g., CRUD, group by, join) powered by a vector database. Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations. Additionally, with the integration of LLMs, our system enables more intelligent file management tasks, such as content summarization and version comparison, further enhancing its capabilities.

[AI-143] Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations

Link: https://arxiv.org/abs/2410.11381
Authors: Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, Joon-Sung Yang
Keywords: Transformer architecture enables, enables contextually natural, contextually natural text, natural text generation, processing entire source
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages and 16 figures

Abstract:The advent of the Attention mechanism and Transformer architecture enables contextually natural text generation and compresses the burden of processing entire source information into singular vectors. Based on these two main ideas, model sizes have gradually increased to accommodate more precise and comprehensive information, leading to the current state-of-the-art LLMs being very large, with parameters around 70 billion. As model sizes grow, the demand for substantial storage and computational capacity increases. This has led to the development of high-bandwidth memory and accelerators, as well as a variety of model architectures designed to meet these requirements. We note that LLM architectures have increasingly converged. This paper analyzes how these converged architectures perform in terms of layer configurations, operational mechanisms, and model sizes, considering various hyperparameter settings. In this paper, we conduct a concise survey of the history of LLMs by tracing the evolution of their operational improvements. Furthermore, we summarize the performance trends of LLMs under various hyperparameter settings using the RTX 6000, which features the state-of-the-art Ada Lovelace architecture. We conclude that even the same model can exhibit different behaviors depending on the hyperparameters or whether it is deployed in server or edge environments.

[AI-144] OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

Link: https://arxiv.org/abs/2410.04328
Authors: Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian
Keywords: Large Language Model, arithmetic coding decoder, Language Model, Large Language, drives an arithmetic
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 9 figures

Abstract:We consider coverless steganography where a Large Language Model (LLM) drives an arithmetic coding decoder to generate stego-texts. An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent. We show that on the individual token level, this problem is mathematically equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the KL divergence between the chosen probability distribution and the original distribution given by the LLM. A closed-form solution is provided for the optimization problem, which can be computed efficiently. Several important practical issues are also tackled: 1) An often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) The combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) The combination of the optimized distribution with other sequence-level selection heuristics to further enhance the efficiency and reliability is studied.
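
The per-token optimization in this abstract can be illustrated with a small numerical sketch. It assumes the constraint takes the form KL(q‖p) ≤ ε with every p_i > 0; under that assumption a standard Lagrangian argument gives a maximizer of the form q_i ∝ p_i^β with β in [0, 1], and because KL(q‖p) decreases as β goes from 0 to 1, β can be found by bisection. The paper's own closed-form solution may be stated differently, and the function names and example distribution here are hypothetical.

```python
import math
from typing import List


def temper(p: List[float], beta: float) -> List[float]:
    """Return q with q_i proportional to p_i ** beta (beta=1 gives p, beta=0 gives uniform)."""
    weights = [pi ** beta for pi in p]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(q: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


def kl(q: List[float], p: List[float]) -> float:
    """KL(q || p) in nats; assumes p_i > 0 wherever q_i > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def max_entropy_within_kl(p: List[float], eps: float, iters: int = 60) -> List[float]:
    """Flatten p as much as the budget KL(q || p) <= eps allows (assumes all p_i > 0)."""
    uniform = temper(p, 0.0)
    if kl(uniform, p) <= eps:
        return uniform  # the budget is loose enough for the maximum-entropy distribution
    lo, hi = 0.0, 1.0  # invariant: beta=lo violates the budget, beta=hi satisfies it
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl(temper(p, mid), p) > eps:
            lo = mid
        else:
            hi = mid
    return temper(p, hi)


if __name__ == "__main__":
    p = [0.7, 0.2, 0.08, 0.02]  # hypothetical next-token distribution
    q = max_entropy_within_kl(p, eps=0.05)
    print("entropy of p:", entropy(p), "entropy of q:", entropy(q), "KL(q||p):", kl(q, p))
```

Higher entropy in the replacement distribution means each generated token can carry more hidden bits, while the KL budget limits how far the text drifts from what the model would naturally produce.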

[AI-145] Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

Link: https://arxiv.org/abs/2410.12771
Authors: Luis Barroso-Luque, Muhammed Shuaibi, Xiang Fu, Brandon M. Wood, Misko Dzamba, Meng Gao, Ammar Rizvi, C. Lawrence Zitnick, Zachary W. Ulissi
Keywords: generation computing hardware, helping mitigate climate, mitigate climate change, computing hardware, ability to discover
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 19 pages

Abstract:The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.

[AI-146] Generative Neural Reparameterization for Differentiable PDE-constrained Optimization NEURIPS2024

Link: https://arxiv.org/abs/2410.12683
Authors: Archis S. Joglekar
Keywords: acquiring optimal parameters, optimal parameters, systems governed, constrained optimization, neural network
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA); Plasma Physics (physics.plasm-ph)
Comments: Accepted to D3S3: Data-driven and Differentiable Simulations, Surrogates, and Solvers - Workshop @ NeurIPS 2024

Abstract:Partial-differential-equation (PDE)-constrained optimization is a well-worn technique for acquiring optimal parameters of systems governed by PDEs. However, this approach is limited to providing a single set of optimal parameters per optimization. Given a differentiable PDE solver, if the free parameters are reparameterized as the output of a neural network, that neural network can be trained to learn a map from a probability distribution to the distribution of optimal parameters. This proves useful in the case where there are many well performing local minima for the PDE. We apply this technique to train a neural network that generates optimal parameters that minimize laser-plasma instabilities relevant to laser fusion and show that the neural network generates many well performing and diverse minima.

[AI-147] Hamiltonian bridge: A physics-driven generative framework for targeted pattern control

Link: https://arxiv.org/abs/2410.12665
Authors: Vishaal Krishnan, Sumit Sinha, L. Mahadevan
Keywords: Patterns arise spontaneously, study typically focuses, spanning the sciences, evolution in space-time, arise spontaneously
Categories: Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: 29 pages, 8 figures

Abstract:Patterns arise spontaneously in a range of systems spanning the sciences, and their study typically focuses on mechanisms to understand their evolution in space-time. Increasingly, there has been a transition towards controlling these patterns in various functional settings, with implications for engineering. Here, we combine our knowledge of a general class of dynamical laws for pattern formation in non-equilibrium systems, and the power of stochastic optimal control approaches to present a framework that allows us to control patterns at multiple scales, which we dub the “Hamiltonian bridge”. We use a mapping between stochastic many-body Lagrangian physics and deterministic Eulerian pattern forming PDEs to leverage our recent approach utilizing the Feynman-Kac-based adjoint path integral formulation for the control of interacting particles and generalize this to the active control of patterning fields. We demonstrate the applicability of our computational framework via numerical experiments on the control of phase separation with and without a conserved order parameter, self-assembly of fluid droplets, coupled reaction-diffusion equations and finally a phenomenological model for spatio-temporal tissue differentiation. We interpret our numerical experiments in terms of a theoretical understanding of how the underlying physics shapes the geometry of the pattern manifold, altering the transport paths of patterns and the nature of pattern interpolation. We finally conclude by showing how optimal control can be utilized to generate complex patterns via an iterative control protocol over pattern forming PDEs which can be cast as gradient flows. Altogether, our study shows how we can systematically build physical priors into a generative framework for pattern control in non-equilibrium systems across multiple length and time scales.

[AI-148] Cascade learning in multi-task encoder-decoder networks for concurrent bone segmentation and glenohumeral joint assessment in shoulder CT scans

Link: https://arxiv.org/abs/2410.12641
Authors: Luca Marsilio, Davide Marzorati, Matteo Rossi, Andrea Moglia, Luca Mainardi, Alfonso Manzotti, Pietro Cerveri
Keywords: degenerative condition affecting, bone density loss, condition affecting bones, joint space narrowing, density loss
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Osteoarthritis is a degenerative condition affecting bones and cartilage, often leading to osteophyte formation, bone density loss, and joint space narrowing. Treatment options to restore normal joint function vary depending on the severity of the condition. This work introduces an innovative deep-learning framework processing shoulder CT scans. It features the semantic segmentation of the proximal humerus and scapula, the 3D reconstruction of bone surfaces, the identification of the glenohumeral (GH) joint region, and the staging of three common osteoarthritic-related pathologies: osteophyte formation (OS), GH space reduction (JS), and humeroscapular alignment (HSA). The pipeline comprises two cascaded CNN architectures: 3D CEL-UNet for segmentation and 3D Arthro-Net for threefold classification. A retrospective dataset of 571 CT scans featuring patients with various degrees of GH osteoarthritic-related pathologies was used to train, validate, and test the pipeline. Root mean squared error and Hausdorff distance median values for 3D reconstruction were 0.22mm and 1.48mm for the humerus and 0.24mm and 1.48mm for the scapula, outperforming state-of-the-art architectures and making it potentially suitable for a PSI-based shoulder arthroplasty preoperative plan context. The classification accuracy for OS, JS, and HSA consistently reached around 90% across all three categories. The computational time for the inference pipeline was less than 15s, showcasing the framework’s efficiency and compatibility with orthopedic radiology practice. The outcomes represent a promising advancement toward the medical translation of artificial intelligence tools. This progress aims to streamline the preoperative planning pipeline delivering high-quality bone surfaces and supporting surgeons in selecting the most suitable surgical approach according to the unique patient joint conditions.

[AI-149] Spectrum Sharing using Deep Reinforcement Learning in Vehicular Networks

Link: https://arxiv.org/abs/2410.12521
Authors: Riya Dinesh Deshpande, Faheem A. Khan, Qasim Zeeshan Ahmed
Keywords: network grows exponentially, effectively allocating spectrum, vehicular network grows, grows exponentially, addressing the numerous
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Abstract:As the number of devices connected to the vehicular network grows exponentially, addressing the numerous challenges of effectively allocating spectrum in dynamic vehicular environments becomes increasingly difficult. Traditional methods may not suffice to tackle this issue. Vehicular networks carry safety-critical messages, so it is important to implement an efficient spectrum allocation paradigm that provides hassle-free communication and manages congestion in the network. To tackle this, a Deep Q Network (DQN) model is proposed as a solution, leveraging its ability to learn optimal strategies over time and make decisions. The paper presents a few results and analyses, demonstrating the efficacy of the DQN model in enhancing spectrum sharing efficiency. Deep Reinforcement Learning methods for sharing spectrum in vehicular networks have shown promising outcomes, demonstrating the system’s ability to adjust to dynamic communication environments. Both single-agent (SARL) and multi-agent (MARL) reinforcement learning models have exhibited successful rates of V2V communication, with the cumulative reward of the RL model reaching its maximum as training progresses.

[AI-150] Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR ICASSP2025

Link: https://arxiv.org/abs/2410.12279
Authors: Christoph Minixhofer, Ondrej Klejch, Peter Bell
Keywords: Synthetically generated speech, rapidly approached human, approached human levels, Synthetically generated, levels of naturalness
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review at ICASSP 2025

Abstract:Synthetically generated speech has rapidly approached human levels of naturalness. However, the paradox remains that ASR systems, when trained on TTS output that is judged as natural by humans, continue to perform badly on real speech. In this work, we explore whether this phenomenon is due to the oversmoothing behaviour of models commonly used in TTS, with a particular focus on the behaviour of TTS-for-ASR as the amount of TTS training data is scaled up. We systematically compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based models for TTS, when used for ASR model training. We test the scalability of the two approaches, varying both the number of hours and the number of different speakers. We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models. We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.

[AI-151] Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

Link: https://arxiv.org/abs/2410.11933
Authors: Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao
Keywords: developing RNA-based therapeutics, Accurate prediction, RNA, stability and interactions, RNA-based therapeutics
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Accurate prediction of RNA properties, such as stability and interactions, is crucial for advancing our understanding of biological processes and developing RNA-based therapeutics. RNA structures can be represented as 1D sequences, 2D topological graphs, or 3D all-atom models, each offering different insights into RNA function. Existing works predominantly focus on 1D sequence-based models, which overlook the geometric context provided by 2D and 3D geometries. This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across the RNA tasks and excelling in low-data and partial labeling regimes, underscoring the value of explicitly incorporating geometric context. On the other hand, geometry-unaware sequence-based models are more robust under sequencing noise but often require around 2-5x as much training data to match the performance of geometry-aware models. Our study offers further insights into the trade-offs between different RNA representations in practical applications and addresses a significant gap in evaluating deep learning models for RNA tasks.

[AI-152] Transfer Learning Adapts to Changing PSD in Gravitational Wave Data

Link: https://arxiv.org/abs/2410.11911
Authors: Beka Modrekiladze
Keywords: black hole inspirals, opened unparalleled opportunities, observing the universe, hole inspirals, opened unparalleled
Categories: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Notes: 7 pages, 3 figures

Abstract:The detection of gravitational waves has opened unparalleled opportunities for observing the universe, particularly through the study of black hole inspirals. These events serve as unique laboratories to explore the laws of physics under conditions of extreme energies. However, significant noise in gravitational wave (GW) data from observatories such as Advanced LIGO and Virgo poses major challenges in signal identification. Traditional noise suppression methods often fall short in fully addressing the non-Gaussian effects in the data, including the fluctuations in noise power spectral density (PSD) over short time intervals. These challenges have led to the exploration of an AI approach that, while overcoming previous obstacles, introduced its own challenges, such as scalability, reliability issues, and the vanishing gradient problem. Our approach addresses these issues through a simplified architecture. To compensate for the potential limitations of a simpler model, we have developed a novel training methodology that enables it to accurately detect gravitational waves amidst highly complex noise. Employing this strategy, our model achieves over 99% accuracy in non-white noise scenarios and shows remarkable adaptability to changing noise PSD conditions. By leveraging the principles of transfer learning, our model quickly adapts to new noise profiles with just a few epochs of fine-tuning, facilitating real-time applications in dynamically changing noise environments.

[AI-153] Explainable AI Methods for Multi-Omics Analysis: A Survey

Link: https://arxiv.org/abs/2410.11910
Authors: Ahmad Hussein, Mukesh Prasad, Ali Braytee
Keywords: traditional hypothesis-driven methodologies, Advancements in high-throughput, data-driven approaches, high-throughput technologies, technologies have led
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Advancements in high-throughput technologies have led to a shift from traditional hypothesis-driven methodologies to data-driven approaches. Multi-omics refers to the integrative analysis of data derived from multiple ‘omes’, such as genomics, proteomics, transcriptomics, metabolomics, and microbiomics. This approach enables a comprehensive understanding of biological systems by capturing different layers of biological information. Deep learning methods are increasingly utilized to integrate multi-omics data, offering insights into molecular interactions and enhancing research into complex diseases. However, these models, with their numerous interconnected layers and nonlinear relationships, often function as black boxes, lacking transparency in decision-making processes. To overcome this challenge, explainable artificial intelligence (xAI) methods are crucial for creating transparent models that allow clinicians to interpret and work with complex data more effectively. This review explores how xAI can improve the interpretability of deep learning models in multi-omics research, highlighting its potential to provide clinicians with clear insights, thereby facilitating the effective application of such models in clinical settings.

[AI-154] Are Grid Cells Hexagonal for Performance or by Convenience?

Link: https://arxiv.org/abs/2410.11886
Authors: Taahaa Mir, Peipei Yao, Kateri Duranceau, Isabeau Prémont-Schwarz
Keywords: biologically convenient configuration, grid cells, hexagonal grid cells, square grid cells, grid
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes: 5 pages, accepted at Montreal AI and Neuroscience Conference 2024

Abstract:This paper investigates whether the hexagonal structure of grid cells provides any performance benefits or if it merely represents a biologically convenient configuration. Utilizing the Vector-HaSH content addressable memory model as a model of the grid cell – place cell network of the mammalian brain, we compare the performance of square and hexagonal grid cells in tasks of storing and retrieving spatial memories. Our experiments across different path types, path lengths and grid configurations, reveal that hexagonal grid cells perform similarly to square grid cells with respect to spatial representation and memory recall. Our results show comparable accuracy and robustness across different datasets and noise levels on images to recall. These findings suggest that the brain’s use of hexagonal grids may be more a matter of biological convenience and ease of implementation rather than because they provide superior performance over square grid cells (which are easier to implement in silico).

[AI-155] Neuropsychology of AI: Relationship Between Activation Proximity and Categorical Proximity Within Neural Categories of Synthetic Cognition

Link: https://arxiv.org/abs/2410.11868
Authors: Michael Pichat, Enola Campoli, William Pogrund, Jourdan Wilson, Michael Veillet-Guillem, Anton Melkozerov, Paloma Pichat, Armanouche Gasparian, Samuel Demarchi, Judicael Poumay
Keywords: artificial intelligence focuses, neural cognition, synthetic neural cognition, artificial neural cognition, Neuropsychology of artificial
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Neuropsychology of artificial intelligence focuses on synthetic neural cognition as a new type of study object within cognitive psychology. With the goal of making artificial neural networks of language models more explainable, this approach involves transposing concepts from cognitive psychology to the interpretive construction of artificial neural cognition. The human cognitive concept involved here is categorization, serving as a heuristic for thinking about the process of segmentation and construction of reality carried out by the neural vectors of synthetic cognition.

[AI-156] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
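The two ideas named in the abstract (per-token head selection and a weighted rather than plain summation over heads) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `softmax` and `mixture_of_heads`, the top-k selection rule, and the softmax normalisation of the router scores are not taken from the paper, which may route heads differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_heads(head_outputs, router_logits, k):
    """Combine the per-head outputs of ONE token using only its top-k heads.

    head_outputs  : list of H vectors, one output vector per attention head
    router_logits : list of H scores (hypothetical router output for this token)
    k             : number of heads to keep active for this token
    """
    # Rank heads by router score and keep the k best (ties keep original order).
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    active = ranked[:k]

    # Weighted summation over the active heads replaces the plain summation.
    weights = softmax([router_logits[i] for i in active])
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for weight, head in zip(weights, active):
        for d in range(dim):
            combined[d] += weight * head_outputs[head][d]
    return combined, active
```

With `k` equal to the number of heads and equal router scores, the result is a uniformly scaled version of the ordinary sum over heads, which is one way to see the standard mechanism as a special case of the weighted form.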

Computer Vision

[CV-0] Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.12790
Authors: Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
Keywords: holds significant, real-world scenarios, generalize to diverse, diverse data, data with unlabeled
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted by NeurIPS 2024. Project page: this https URL

Abstract:Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes–textual and visual–to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at this https URL.

[CV-1] The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Link: https://arxiv.org/abs/2410.12787
Authors: Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
Keywords: large multimodal models, Recent advancements, integrate additional modalities, significantly enhanced performance, diverse tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this http URL

Abstract:Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

[CV-2] Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Link: https://arxiv.org/abs/2410.12781
Authors: Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu
Keywords: Gaussian reconstruction model, capable of reconstructing, long sequence, Gaussian reconstruction, reconstruction model
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We propose Long-LRM, a generalizable 3D Gaussian reconstruction model that is capable of reconstructing a large scene from a long sequence of input images. Specifically, our model can process 32 source images at 960x540 resolution within only 1.3 seconds on a single A100 80G GPU. Our architecture features a mixture of the recent Mamba2 blocks and the classical transformer blocks which allowed many more tokens to be processed than prior work, enhanced by efficient token merging and Gaussian pruning steps that balance between quality and efficiency. Unlike previous feed-forward models that are limited to processing 1~4 input images and can only reconstruct a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, our method achieves performance comparable to optimization-based approaches while being two orders of magnitude more efficient. Project page: this https URL

[CV-3] Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Link: https://arxiv.org/abs/2410.12777
Authors: Hongcheng Gao, Tianyu Pang, Chao Du, Taihang Hu, Zhijie Deng, Min Lin
Keywords: diffusion-based content generation, potential model misuse, prevent potential model, content generation, significant efforts
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies. Our code is available at this https URL.

[CV-4] Towards Zero-Shot Camera Trap Image Categorization

Link: https://arxiv.org/abs/2410.12769
Authors: Jiří Vyskočil, Lukas Picek
Keywords: paper describes, describes the search, camera trap images, alternative approach, trap image categorization
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper describes the search for an alternative approach to the automatic categorization of camera trap images. First, we benchmark state-of-the-art classifiers using a single model for all images. Next, we evaluate methods combining MegaDetector with one or more classifiers and Segment Anything to assess their impact on reducing location-specific overfitting. Last, we propose and test two approaches using large language and foundational models, such as DINOv2, BioCLIP, BLIP, and ChatGPT, in a zero-shot scenario. Evaluation carried out on two publicly available datasets (WCT from New Zealand, CCT20 from the Southwestern US) and a private dataset (CEF from Central Europe) revealed that combining MegaDetector with two separate classifiers achieves the highest accuracy. This approach reduced the relative error of a single BEiTV2 classifier by approximately 42% on CCT20, 48% on CEF, and 75% on WCT. Besides, as the background is removed, the error in terms of accuracy in new locations is reduced to half. The proposed zero-shot pipeline based on DINOv2 and FAISS achieved competitive results (1.0% and 4.7% smaller on CCT20, and CEF, respectively), which highlights the potential of zero-shot approaches for camera trap image categorization.
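One plausible reading of a pipeline "based on DINOv2 and FAISS" is nearest-neighbour retrieval over image embeddings: embed a query image, find the most similar labelled embeddings, and vote. The sketch below illustrates only that general idea under loud assumptions: the embeddings are toy two-dimensional vectors instead of DINOv2 features, a brute-force cosine search stands in for a FAISS index, and the names `cosine` and `knn_label` are hypothetical. It is not the authors' pipeline.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_label(query, gallery, labels, k=3):
    """Return the majority label among the k gallery vectors closest to query.

    A brute-force scan is used here; a vector index would replace it at scale.
    """
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

No classifier is trained in this scheme: adding a new category only requires adding labelled reference embeddings to the gallery.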

[CV-5] Gravity-aligned Rotation Averaging with Circular Regression ECCV2024

Link: https://arxiv.org/abs/2410.12763
Authors: Linfei Pan, Marc Pollefeys, Dániel Baráth
Keywords: applications spanning crowd-sourced, spanning crowd-sourced mapping, scene from unordered, vision and robotics, unordered images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: accepted at ECCV2024

Abstract:Reconstructing a 3D scene from unordered images is pivotal in computer vision and robotics, with applications spanning crowd-sourced mapping and beyond. While global Structure-from-Motion (SfM) techniques are scalable and fast, they often compromise on accuracy. To address this, we introduce a principled approach that integrates gravity direction into the rotation averaging phase of global pipelines, enhancing camera orientation accuracy and reducing the degrees of freedom. This additional information is commonly available in recent consumer devices, such as smartphones, mixed-reality devices and drones, making the proposed method readily accessible. Rooted in circular regression, our algorithm has similar convergence guarantees as linear regression. It also supports scenarios where only a subset of cameras have known gravity. Additionally, we propose a mechanism to refine error-prone gravity. We achieve state-of-the-art accuracy on four large-scale datasets. Particularly, the proposed method improves upon the SfM baseline by 13 AUC@1° points, on average, while running eight times faster. It also outperforms the standard planar pose graph optimization technique by 23 AUC@1° points. The code is at this https URL.
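Once the gravity direction is known, a camera's orientation has one remaining degree of freedom: an angle about the gravity axis. Angles wrap around, so ordinary averaging breaks (the arithmetic mean of 350° and 10° is 180°), which is why circular statistics are needed. The sketch below shows only the simplest circular estimator, the circular mean; it is an illustration of the problem setting and not the circular regression algorithm proposed in the paper. The function name `circular_mean` is hypothetical.

```python
import math


def circular_mean(angles_deg):
    """Mean direction of a list of angles in degrees, result in [0, 360).

    Each angle is mapped to a unit vector; the vectors are summed and the
    direction of the sum is returned, which respects wrap-around.
    """
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0
```

The result is undefined when the unit vectors cancel exactly (for example 0° and 180°), a case a real estimator would have to handle explicitly.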

[CV-6] SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Link: https://arxiv.org/abs/2410.12761
Authors: Jaehong Yoon, Shoubin Yu, Vaidehi Patil, Huaxiu Yao, Mohit Bansal
Keywords: Recent advances, significantly enhanced, enhanced their ability, ability to generate, increased the risk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: The first two authors contributed equally; Project page: this https URL

Abstract:Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To address these, we propose SAFREE, a novel, training-free approach for safe T2I and T2V, that does not alter the model’s weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. In the end, SAFREE ensures coherent safety checking, preserving the fidelity, quality, and safety of the output. SAFREE achieves SOTA performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
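The core geometric step described above, steering an embedding away from a subspace spanned by unwanted concept vectors, can be sketched as an orthogonal projection. This is a simplified illustration under stated assumptions: the vectors are toy three-dimensional lists, the whole embedding is projected (the paper describes a more selective, self-validating mechanism), and the helper names `dot`, `orthonormal_basis`, and `remove_subspace` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for vector in vectors:
        residual = list(vector)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > eps:  # skip vectors already contained in the span
            basis.append([r / norm for r in residual])
    return basis


def remove_subspace(embedding, concept_vectors):
    """Subtract from embedding its component inside span(concept_vectors)."""
    result = list(embedding)
    for b in orthonormal_basis(concept_vectors):
        coeff = dot(result, b)
        result = [r - coeff * bi for r, bi in zip(result, b)]
    return result
```

The returned vector is orthogonal to every concept vector, while any component outside the concept subspace is left unchanged, which is the sense in which the intended semantics are preserved.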

[CV-7] PND-Net: Plant Nutrition Deficiency and Disease Classification using Graph Convolutional Network

Link: https://arxiv.org/abs/2410.12742
Authors: Asish Bera, Debotosh Bhattacharjee, Ondrej Krejcar
Keywords: Crop yield production, plant nutrition deficiencies, Crop yield, plant nutrition, nutrition deficiencies
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Crop yield production could be enhanced for agricultural growth if various plant nutrition deficiencies and diseases are identified and detected at early stages. Deep learning methods have proven their superior performance in the automated detection of plant diseases and nutrition deficiencies from visual symptoms in leaves. This article proposes a new deep learning method for plant nutrition deficiencies and disease classification using a graph convolutional network (GCN), added upon a base convolutional neural network (CNN). Sometimes, a global feature descriptor might fail to capture the vital region of a diseased leaf, which causes inaccurate classification of disease. To address this issue, regional feature learning is crucial for a holistic feature aggregation. In this work, region-based feature summarization at multi-scales is explored using spatial pyramidal pooling for discriminative feature representation. A GCN is developed to capacitate learning of finer details for classifying plant diseases and insufficiency of nutrients. The proposed method, called Plant Nutrition Deficiency and Disease Network (PND-Net), is evaluated on two public datasets for nutrition deficiency, and two for disease classification using four CNNs. The best classification performances are: (a) 90.00% Banana and 90.54% Coffee nutrition deficiency; and (b) 96.18% Potato diseases and 84.30% on PlantDoc datasets using Xception backbone. Furthermore, additional experiments have been carried out for generalization, and the proposed method has achieved state-of-the-art performances on two public datasets, namely the Breast Cancer Histopathology Image Classification (BreakHis 40X: 95.50%, and BreakHis 100X: 96.79% accuracy) and Single cells in Pap smear images for cervical cancer classification (SIPaKMeD: 99.18% accuracy). Also, PND-Net achieves improved performances using five-fold cross validation.
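Spatial pyramidal pooling, mentioned above as the mechanism for multi-scale regional summarization, divides a feature map into grids of increasing resolution and pools each cell, so that inputs of any size yield a fixed-length descriptor. The sketch below is a generic single-channel version; the choice of max pooling, the grid levels `(1, 2)`, and the function name `spatial_pyramid_pool` are assumptions for illustration and may differ from the configuration used in PND-Net.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map over n x n grids for each n in levels.

    feature_map : list of rows (each a list of floats), single channel
    levels      : grid sizes; each n must not exceed the map height or width
    Returns a flat list of length sum(n * n for n in levels).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for n in levels:
        for row in range(n):
            # Integer boundaries split the rows into n roughly equal bands.
            r0, r1 = (row * height) // n, ((row + 1) * height) // n
            for col in range(n):
                c0, c1 = (col * width) // n, ((col + 1) * width) // n
                pooled.append(max(feature_map[i][j]
                                  for i in range(r0, r1)
                                  for j in range(c0, c1)))
    return pooled
```

The coarsest level captures a global summary and the finer levels capture regional detail; the concatenated vector has the same length regardless of the input resolution.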

[CV-8] Optimizing 3D Geometry Reconstruction from Implicit Neural Representations

Link: https://arxiv.org/abs/2410.12725
Authors: Shen Fan, Przemyslaw Musialski
Keywords: offering unparalleled advantages, Implicit neural representations, tool in learning, offering unparalleled, Implicit neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Implicit neural representations have emerged as a powerful tool in learning 3D geometry, offering unparalleled advantages over conventional representations like mesh-based methods. A common type of INR implicitly encodes a shape’s boundary as the zero-level set of the learned continuous function and learns a mapping from a low-dimensional latent space to the space of all possible shapes represented by its signed distance function. However, most INRs struggle to retain high-frequency details, which are crucial for accurate geometric depiction, and they are computationally expensive. To address these limitations, we present a novel approach that both reduces computational expenses and enhances the capture of fine details. Our method integrates periodic activation functions, positional encodings, and normals into the neural network architecture. This integration significantly enhances the model’s ability to learn the entire space of 3D shapes while preserving intricate details and sharp features, areas where conventional representations often fall short.

[CV-9] RAFA-Net: Region Attention Network For Food Items And Agricultural Stress Recognition

Link: https://arxiv.org/abs/2410.12718
Authors: Asish Bera, Ondrej Krejcar, Debotosh Bhattacharjee
Keywords: Deep Convolutional Neural, Convolutional Neural Networks, Deep Convolutional, Convolutional Neural, facilitated remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep Convolutional Neural Networks (CNNs) have facilitated remarkable success in recognizing various food items and agricultural stress. A decent performance boost has been witnessed in solving the agro-food challenges by mining and analyzing region-based partial feature descriptors. Also, computationally expensive ensemble learning schemes using multiple CNNs have been studied in earlier works. This work proposes a region attention scheme for modelling long-range dependencies by building a correlation among different regions within an input image. The attention method enhances feature representation by learning the usefulness of context information from complementary regions. Spatial pyramidal pooling and average pooling pair aggregate partial descriptors into a holistic representation. Both pooling methods establish spatial and channel-wise relationships without incurring extra parameters. A context gating scheme is applied to refine the descriptiveness of weighted attentional features, which is relevant for classification. The proposed Region Attention network for Food items and Agricultural stress recognition method, dubbed RAFA-Net, has been evaluated on three public food datasets, and has achieved state-of-the-art performances with distinct margins. The highest top-1 accuracies of RAFA-Net are 91.69%, 91.56%, and 96.97% on the UECFood-100, UECFood-256, and MAFood-121 datasets, respectively. In addition, better accuracies have been achieved on two benchmark agricultural stress datasets. The best top-1 accuracies on the Insect Pest (IP-102) and PlantDoc-27 plant disease datasets are 92.36% and 85.54%, respectively, implying RAFA-Net’s generalization capability.

[CV-10] WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Link: https://arxiv.org/abs/2410.12705
Authors: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
Keywords: Vision Language Models, underrepresented cultural contexts, Vision Language, Language Models, underrepresented cultural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

[CV-11] Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization

Link: https://arxiv.org/abs/2410.12700
Authors: Xingqi Wang, Xiaoyuan Yi, Xing Xie, Jia Jia
Keywords: Recent advancements, produce harmful content, harmful content misaligned, Large Language Models, indistinguishable human-level images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes: Accepted by ACM Multimedia 2024. The dataset and code can be found at this https URL

Abstract:Recent advancements in diffusion models trained on large-scale data have enabled the generation of indistinguishable human-level images, yet they often produce harmful content misaligned with human values, e.g., social bias, and offensive content. Despite extensive research on Large Language Models (LLMs), the challenge of Text-to-Image (T2I) model alignment remains largely unexplored. Addressing this problem, we propose LiVO (Lightweight Value Optimization), a novel lightweight method for aligning T2I models with human values. LiVO only optimizes a plug-and-play value encoder to integrate a specified value principle with the input prompt, allowing the control of generated images over both semantics and values. Specifically, we design a diffusion model-tailored preference optimization loss, which theoretically approximates the Bradley-Terry model used in LLM alignment but provides a more flexible trade-off between image quality and value conformity. To optimize the value encoder, we also develop a framework to automatically construct a text-image preference dataset of 86k (prompt, aligned image, violating image, value principle) samples. Without updating most model parameters and through adaptive value selection from the input prompt, LiVO significantly reduces harmful outputs and achieves faster convergence, surpassing several strong baselines and taking an initial step towards ethically aligned T2I models.

[CV-12] AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing

Link: https://arxiv.org/abs/2410.12696
Authors: DuoSheng Chen, Binghui Chen, Yifeng Geng, Liefeng Bo
Keywords: point-based image editing, high-quality results based, yielding precise, image editing methods, precise and high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, several point-based image editing methods (e.g., DragDiffusion, FreeDrag, DragNoise) have emerged, yielding precise and high-quality results based on user instructions. However, these methods often make insufficient use of semantic information, leading to less desirable results. In this paper, we propose a novel mask-free point-based image editing method, AdaptiveDrag, which provides a more flexible editing approach and generates images that better align with user intent. Specifically, we design an auto mask generation module using super-pixel division for user-friendliness. Next, we leverage a pre-trained diffusion model to optimize the latent, enabling the dragging of features from handle points to target points. To ensure a comprehensive connection between the input image and the drag process, we have developed a semantic-driven optimization. We design adaptive steps that are supervised by the positions of the points and the semantic regions derived from super-pixel segmentation. This refined optimization process also leads to more realistic and accurate drag results. Furthermore, to address the limitations in the generative consistency of the diffusion model, we introduce an innovative corresponding loss during the sampling process. Building on these effective designs, our method delivers superior generation results using only the single input image and the handle-target point pairs. Extensive experiments have been conducted and demonstrate that the proposed method outperforms others in handling various drag instructions (e.g., resize, movement, extension) across different domains (e.g., animals, human face, land space, clothing).

[CV-13] MultiCamCows2024 – A Multi-view Image Dataset for AI-driven Holstein-Friesian Cattle Re-Identification on a Working Farm

Link: https://arxiv.org/abs/2410.12695
Authors: Phoenix Yu, Tilo Burghardt, Andrew W Dowsey, Neill W Campbell
Keywords: Holstein-Friesian cattle exploiting, farm-scale image dataset, image dataset filmed, white coat-patterns, exploiting their unique
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 10 figures

Abstract: We present MultiCamCows2024, a farm-scale image dataset filmed across multiple cameras for the biometric identification of individual Holstein-Friesian cattle exploiting their unique black and white coat-patterns. Captured by three ceiling-mounted visual sensors covering adjacent barn areas over seven days on a working dairy farm, the dataset comprises 101,329 images of 90 cows, plus the underlying original CCTV footage. The dataset is provided alongside full computer vision recognition baselines, namely both a supervised and a self-supervised learning framework for individual cow identification trained on cattle tracklets. We report a performance above 96% single image identification accuracy from the dataset and demonstrate that combining data from multiple cameras during learning enhances self-supervised identification. We show that our framework enables fully automatic cattle identification, barring only the simple human verification of tracklet integrity during data collection. Crucially, our study highlights that multi-camera, supervised and self-supervised components in tandem not only deliver highly accurate individual cow identification but also achieve this efficiently with no labelling of cattle identities by humans at all. We argue that this improvement in efficacy has practical implications for livestock management, behaviour analysis, and agricultural monitoring. For full reproducibility and practical ease of use, we publish all key software and code including re-identification components and the species detector with this paper.

[CV-14] VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Link: https://arxiv.org/abs/2410.12694
Authors: Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen
Keywords: visually grounded responses, demonstrated remarkable promise, Recent advancements, generating visually grounded, grounded responses
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at this https URL.

[CV-15] Machine Learning Approach to Brain Tumor Detection and Classification

Link: https://arxiv.org/abs/2410.12692
Authors: Alice Oh, Inyoung Noh, Jian Choo, Jihoo Lee, Justin Park, Kate Hwang, Sanghyeon Kim, Soo Min Oh
Keywords: improve treatment outcomes, significantly improve treatment, brain MRI images, machine learning models, Brain tumor detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures, 2 tables

Abstract:Brain tumor detection and classification are critical tasks in medical image analysis, particularly in early-stage diagnosis, where accurate and timely detection can significantly improve treatment outcomes. In this study, we apply various statistical and machine learning models to detect and classify brain tumors using brain MRI images. We explore a variety of statistical models including linear, logistic, and Bayesian regressions, and the machine learning models including decision tree, random forest, single-layer perceptron, multi-layer perceptron, convolutional neural network (CNN), recurrent neural network, and long short-term memory. Our findings show that CNN outperforms other models, achieving the best performance. Additionally, we confirm that the CNN model can also work for multi-class classification, distinguishing between four categories of brain MRI images such as normal, glioma, meningioma, and pituitary tumor images. This study demonstrates that machine learning approaches are suitable for brain tumor detection and classification, facilitating real-world medical applications in assisting radiologists with early and accurate diagnosis.

[CV-16] Automatic Mapping of Anatomical Landmarks from Free-Text Using Large Language Models: Insights from Llama-2

Link: https://arxiv.org/abs/2410.12686
Authors: Mohamad Abdi, Gerardo Hemosillo Valadez, Halid Ziya Yerebakan
Keywords: anomaly detection, navigation and anomaly, Anatomical landmarks, landmarks, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 1 table

Abstract:Anatomical landmarks are vital in medical imaging for navigation and anomaly detection. Modern large language models (LLMs), like Llama-2, offer promise for automating the mapping of these landmarks in free-text radiology reports to corresponding positions in image data. Recent studies propose LLMs may develop coherent representations of generative processes. Motivated by these insights, we investigated whether LLMs accurately represent the spatial positions of anatomical landmarks. Through experiments with Llama-2 models, we found that they can linearly represent anatomical landmarks in space with considerable robustness to different prompts. These results underscore the potential of LLMs to enhance the efficiency and accuracy of medical imaging workflows.

[CV-17] MambaBEV: An efficient 3D detection model with Mamba2

Link: https://arxiv.org/abs/2410.12673
Authors: Zihan You, Hao Wang, Qichao Zhao, Jinxiang Wang
Keywords: autonomous driving systems, object detection model, important for autonomous, object detection, BEV
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: A stable 3D object detection model based on the BEV paradigm with temporal information is very important for autonomous driving systems. However, current temporal fusion models that use convolutional layers or deformable self-attention are not conducive to the exchange of global information across the BEV space and incur a higher computational cost. Recently, a newly proposed model specialized in sequence processing, called Mamba, has shown great potential in multiple downstream tasks. In this work, we propose a Mamba2-based BEV 3D object detection model named MambaBEV. We also adapt an end-to-end self-driving paradigm to test the performance of the model. Our work achieves good results on the nuScenes dataset: our base version achieves 51.7% NDS. Our code will be available soon.

[CV-18] 3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Link: https://arxiv.org/abs/2410.12669
Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
Keywords: Decoupled Instance Synthesis, allowing users, increasing demand, demand for controllable, controllable outputs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: this https URL.

[CV-19] Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

Link: https://arxiv.org/abs/2410.12662
Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng
Keywords: Large Vision-Language Models, Vision-language alignment, safety mechanism, Vision-Language Models, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision, which leaves the models vulnerable to toxic images. To explore the cause of this problem, we explain where and how the safety mechanism of LVLMs operates and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in the successful activation of the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This results in a semantic shift for input images compared to text in hidden states, which in turn misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves the texts related to the input vision and uses them to guide the projection of vision into the hidden-state space of LLMs. Experiments show that TGA not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality but also maintains the general performance on various vision tasks (Safe and Good).

[CV-20] DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Link: https://arxiv.org/abs/2410.12628
Authors: Zhiyuan Zhao, Hengrui Kang, Bin Wang, Conghui He
Keywords: Document Layout Analysis, visual features achieve, visual features offer, Layout Analysis, multimodal methods leveraging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GitHub repo: this https URL

Abstract:Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at this https URL.
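
The abstract frames document synthesis as two-dimensional bin packing but does not spell out the Mesh-candidate BestFit algorithm. The sketch below is therefore not the authors' method; it only illustrates the general best-fit idea under stated assumptions: a free layout cell and candidate elements are plain `(width, height)` rectangles, and the hypothetical helper `best_fit` picks the candidate that fits the cell and fills the largest share of its area.

```python
def best_fit(cell, candidates):
    """Pick the candidate rectangle that fits in `cell` with the highest fill ratio.

    `cell` is (width, height); `candidates` is a list of (width, height).
    Returns (index, fill_ratio), or (None, 0.0) when nothing fits.
    Hypothetical helper for illustration only, not the paper's algorithm.
    """
    cell_w, cell_h = cell
    cell_area = cell_w * cell_h
    best_index, best_ratio = None, 0.0
    for index, (w, h) in enumerate(candidates):
        # Skip elements that would overflow the cell in either dimension.
        if w > cell_w or h > cell_h:
            continue
        ratio = (w * h) / cell_area
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index, best_ratio


if __name__ == "__main__":
    elements = [(12, 5), (5, 5), (9, 10), (10, 11)]
    print(best_fit((10, 10), elements))
```

A full synthesizer would presumably repeat this choice over many cells and remove each placed element from the candidate pool; those details are left out here because the abstract does not describe them.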

[CV-21] Exploring Model Kinship for Merging Large Language Models

Link: https://arxiv.org/abs/2410.12613
Authors: Yedi Hu, Yunzhi Yao, Ningyu Zhang, Shumin Deng, Huajun Chen
Keywords: Large Language Models, Large Language, efficiency of Large, Language Models, model kinship
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Ongoing work

Abstract:Model merging has become one of the key technologies for enhancing the capabilities and efficiency of Large Language Models (LLMs). However, our understanding of the expected performance gains and principles when merging any two models remains limited. In this work, we introduce model kinship, the degree of similarity or relatedness between LLMs, analogous to biological evolution. With comprehensive empirical analysis, we find that there is a certain relationship between model kinship and the performance gains after model merging, which can help guide our selection of candidate models. Inspired by this, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets. Specifically, we discover that using model kinship as a criterion can assist us in continuously performing model merging, alleviating the degradation (local optima) in model evolution, whereas model kinship can serve as a guide to escape these traps. Code is available at this https URL.
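
The abstract describes model kinship only as a degree of similarity or relatedness between LLMs and does not give a formula. As one possible reading, and purely as an assumption, the sketch below measures kinship as the cosine similarity between the weight changes of two fine-tuned models relative to a shared base model. The function name `kinship` and the flat-list weight representation are hypothetical.

```python
import math


def kinship(base, model_a, model_b):
    """Cosine similarity between two models' weight deltas from a shared base.

    All three arguments are flat lists of floats of equal length.
    Assumed metric for illustration; the paper may define kinship differently.
    """
    delta_a = [a - b for a, b in zip(model_a, base)]
    delta_b = [m - b for m, b in zip(model_b, base)]
    dot = sum(x * y for x, y in zip(delta_a, delta_b))
    norm_a = math.sqrt(sum(x * x for x in delta_a))
    norm_b = math.sqrt(sum(y * y for y in delta_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A model identical to the base has no direction to compare.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0]
    print(kinship(base, [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction
    print(kinship(base, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Under this assumed metric, a score near 1 means the two models moved away from the base in the same direction, while a score near 0 means their updates are unrelated.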

[CV-22] CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Link: https://arxiv.org/abs/2410.12595
Authors: Zhiyuan Ma, Jianjun Li, Guohui Li, Kaiyan Huang
Keywords: social media platforms, received great attention, vision-language pre-training, media platforms, recently has received
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: vision-language pre-training, contrastive learning, cross-modal, associative learning, associative mapping classification

Abstract: With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention, and remarkable progress has been achieved. The success of VLP largely benefits from the information complementation and enhancement between different modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL) to promote image-text alignment by pulling embeddings of positive sample pairs together while pushing those of negative pairs apart, which ignores the natural asymmetry between different modalities and requires a large-scale image-text corpus to make progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer to perform anchor point masking and swap feature filling for constructing a hybrid cross-modal associative prompt. Afterwards, we exploit a unified semantic encoder to learn their cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn potential associative mappings between modalities at anchor points, within which we develop a fresh self-supervised associative mapping classification task to boost CMAL’s performance. Experimental results verify the effectiveness of CMAL, showing that it achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks, with a significantly smaller corpus. In particular, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).
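
For readers unfamiliar with the cross-modal contrastive learning (CMCL) baseline that the abstract contrasts itself with, here is a minimal sketch of a symmetric InfoNCE-style loss: matching image-text pairs are pulled together and mismatched pairs pushed apart. This illustrates the baseline family, not CMAL itself. The function names, the list-of-lists embedding format, and the temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair k is (image_embs[k], text_embs[k])."""
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def mean_cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the positive class.
        total = 0.0
        for k, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_denominator = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_denominator - row[k]
        return total / len(rows)

    text_to_image = [list(column) for column in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(text_to_image))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
    print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

The loss is small when each image is most similar to its own caption and large when the pairs are swapped, which is the alignment pressure the abstract refers to.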

[CV-23] Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Link: https://arxiv.org/abs/2410.12592
Authors: Minkyoung Cho, Yulong Cao, Jiachen Sun, Qingzhao Zhang, Marco Pavone, Jeong Joon Park, Heng Yang, Z. Morley Mao
Keywords: long-tail scenarios, important paradigm, enhance accuracy, separate detection pipelines, distinct object configurations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages

Abstract:An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.

[CV-24] Rethinking Visual Counterfactual Explanations Through Region Constraint

Link: https://arxiv.org/abs/2410.12591
Authors: Bartlomiej Sobieski, Jakub Grzywaczewski, Bartlomiej Sadlej, Matthew Tivnan, Przemyslaw Biecek
Keywords: recently gained immense, gained immense popularity, Visual counterfactual explanations, Visual counterfactual, recently gained
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract: Visual counterfactual explanations (VCEs) have recently gained immense popularity as a tool for clarifying the decision-making process of image classifiers. This trend is largely motivated by what these explanations promise to deliver: indicating semantically meaningful factors that change the classifier’s decision. However, we argue that current state-of-the-art approaches lack a crucial component, the region constraint, whose absence prevents drawing explicit conclusions and may even lead to faulty reasoning due to phenomena like confirmation bias. To address the issue of previous methods, which modify images in a very entangled and widely dispersed manner, we propose region-constrained VCEs (RVCEs), which assume that only a predefined image region can be modified to influence the model’s prediction. To effectively sample from this subclass of VCEs, we propose Region-Constrained Counterfactual Schrödinger Bridges (RCSB), an adaptation of a tractable subclass of Schrödinger Bridges to the problem of conditional inpainting, where the conditioning signal originates from the classifier of interest. In addition to setting a new state-of-the-art by a large margin, we extend RCSB to allow for exact counterfactual reasoning, where the predefined region contains only the factor of interest, and to let the user actively interact with the RVCE by predefining the regions manually.

[CV-25] FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Link: https://arxiv.org/abs/2410.12564
Authors: Jiacheng Ruan, Yebin Yang, Zehao Lin, Feiyu Xiong, Zeyun Tang, Zhiyu Li
Keywords: large language models, large vision-language models, made significant progress, foundational vision models, large language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress. 9 pages, 3 figures

Abstract: Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluate only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs’ potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion task (FTII). This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the candidates to insert after the corresponding paragraph. Constructing a benchmark for such a task is highly challenging, particularly in determining the sequence of flowing text and images. To address this challenge, we turn to professional news reports, which naturally contain a gold standard for image-text sequences. Based on this, we introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains. Using these 625 high-quality articles, we construct problems of two different types with multiple levels of difficulty. Furthermore, we establish two different evaluation pipelines based on the CLIP model and existing LVLMs. We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models. Results indicate that even the most advanced models (e.g., GPT-4o) face significant challenges when tackling the FTII task.

[CV-26] Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation

Link: https://arxiv.org/abs/2410.12562
Authors: Yao Shen, Ziwei Wei, Chunmeng Liu, Shuming Wei, Qi Zhao, Kaiyang Zeng, Guangyao Li
Keywords: Segment Anything Model, Scanning Probe Microscope, SPM image segmentation, natural scene images, Adaptive Prompt Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract: The Segment Anything Model (SAM) has demonstrated strong performance in image segmentation of natural scene images. However, its effectiveness diminishes markedly when applied to specific scientific domains, such as Scanning Probe Microscope (SPM) images. This decline in accuracy can be attributed to the distinct data distribution and limited availability of data inherent in scientific images. Moreover, the acquisition of adequate SPM datasets is time-intensive, laborious, and skill-dependent. To address these challenges, we propose an Adaptive Prompt Learning with SAM (APL-SAM) framework tailored for few-shot SPM image segmentation. Our approach incorporates two key innovations to enhance SAM: 1) An Adaptive Prompt Learning module leverages few-shot embeddings derived from a limited support set to adaptively learn central representatives, which serve as visual prompts. This innovation eliminates the need for time-consuming online user interactions for providing prompts, such as exhaustively marking points and bounding boxes slice by slice; 2) A multi-source, multi-level mask decoder specifically designed for few-shot SPM image segmentation is introduced, which can effectively capture the correspondence between the support and query images. To facilitate comprehensive training and evaluation, we introduce a new dataset, SPM-Seg, curated for SPM image segmentation. Extensive experiments on this dataset reveal that the proposed APL-SAM framework significantly outperforms the original SAM, achieving over a 30% improvement in terms of Dice Similarity Coefficient with only one-shot guidance. Moreover, APL-SAM surpasses state-of-the-art few-shot segmentation methods and even fully supervised approaches in performance. Code and dataset used in this study will be made available upon acceptance.
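
The abstract reports its gains in terms of the Dice Similarity Coefficient. As a reminder of what that metric measures, the sketch below computes the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary masks. The flat 0/1 list representation and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treated here as a perfect match by convention.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_coefficient(predicted, ground_truth))
```

A score of 1 means the predicted and reference masks coincide exactly, and a score of 0 means they share no foreground pixels.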

[CV-27] Development of Image Collection Method Using YOLO and Siamese Network

Link: https://arxiv.org/abs/2410.12561
Authors: Chan Young Shin, Ah Hyun Lee, Jun Young Lee, Ji Min Lee, Soo Jin Park
Keywords: collecting high-quality data, enter the era, era of big, Siamese network, Siamese
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures, 2 tables

Abstract: As we enter the era of big data, collecting high-quality data is very important. However, collecting data by hand is not only very time-consuming but also expensive. Therefore, many scientists have devised various methods to collect data using computers. Among them is web crawling, but the authors found that crawling collects unintended data along with the data the user wants. The authors found that this can be filtered using the object recognition model YOLOv10. However, there are cases where data that is not properly filtered remains. Here, image reclassification was performed by additionally utilizing the distance output from the Siamese network, and higher performance was recorded than with other classification models (average F1 score: YOLO+MobileNet 0.678 vs. YOLO+SiameseNet 0.772). The user can specify a distance threshold to adjust the balance between data deficiency and noise-robustness. The authors also found that the Siamese network can achieve higher performance with fewer resources because the cropped images are used for object recognition when processing images in the Siamese network (Class 20 mean-based F1 score: non-crop + Siamese (MobileNetV3-Small) 80.94 vs. crop preprocessing + Siamese (MobileNetV3-Small) 82.31). In this way, the image retrieval system that utilizes two consecutive models to reduce errors can save users’ time and effort, and build better quality data faster and with fewer resources than before.
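
The user-specified distance threshold mentioned in the abstract can be pictured with a small sketch. It assumes, hypothetically, that each crawled image has already been turned into an embedding by the Siamese network and that similarity is measured by Euclidean distance to a reference embedding; the abstract does not state the distance function, and the names `euclidean` and `filter_by_distance` are invented for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_by_distance(reference, candidates, threshold):
    """Split (name, embedding) pairs into kept and rejected lists.

    An image is kept when its embedding lies within `threshold` of the reference.
    Hypothetical post-filter; embeddings are assumed to be precomputed.
    """
    kept, rejected = [], []
    for name, embedding in candidates:
        if euclidean(reference, embedding) <= threshold:
            kept.append(name)
        else:
            rejected.append(name)
    return kept, rejected


if __name__ == "__main__":
    reference = [0.0, 0.0]
    crawled = [("a.jpg", [0.1, 0.0]), ("b.jpg", [3.0, 4.0]), ("c.jpg", [0.2, 0.2])]
    print(filter_by_distance(reference, crawled, threshold=0.5))
    print(filter_by_distance(reference, crawled, threshold=0.15))
```

Lowering the threshold removes more noise but also discards more usable images, which is the trade-off between data deficiency and noise-robustness the abstract describes.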

[CV-28] One Step Diffusion via Shortcut Models

Link: https://arxiv.org/abs/2410.12557
Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
Keywords: enabled generating diverse, Diffusion models, enabled generating, generating diverse, diverse and realistic
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
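
To make the idea of conditioning on step size concrete, the sketch below shows a sampling loop in which the model is queried with the current sample, the current time, and the desired step size, so the same function can serve any step budget. It is a toy under stated assumptions, not the authors' implementation: time runs from 0 (noise) to 1 (data), the update is `x + d * s(x, t, d)`, and `make_toy_shortcut` is a hand-written stand-in for a trained network that always points at a single fixed target.

```python
def sample(shortcut_fn, x0, num_steps):
    """Integrate from t=0 to t=1 in `num_steps` equal steps of size d."""
    d = 1.0 / num_steps
    x = list(x0)
    for i in range(num_steps):
        t = i * d
        # The model sees both the current time and the step size it must cover.
        direction = shortcut_fn(x, t, d)
        x = [xi + d * si for xi, si in zip(x, direction)]
    return x


def make_toy_shortcut(target):
    """Stand-in for a trained model: exact direction toward one fixed target.

    For a straight path ending at `target`, the average velocity over any
    remaining interval is (target - x) / (1 - t), so this toy ignores d.
    A learned network would generally depend on d.
    """
    def shortcut_fn(x, t, d):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return shortcut_fn


if __name__ == "__main__":
    model = make_toy_shortcut([1.0, -2.0])
    noise = [0.3, 0.7]
    print(sample(model, noise, num_steps=1))
    print(sample(model, noise, num_steps=4))
```

In this toy, one step and four steps land on the same point, which mirrors the abstract's claim that a single model can be sampled under varying step budgets at inference time.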

[CV-29] Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing

Link: https://arxiv.org/abs/2410.12526
Authors: Mingce Guo, Jingxuan He, Shengeng Tang, Zhangye Wang, Lechao Cheng
Keywords: garnered significant attention, significant attention due, Text-driven video editing, editing utilizing generative, utilizing generative diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-driven video editing utilizing generative diffusion models has garnered significant attention due to their potential applications. However, existing approaches are constrained by the limited word embeddings provided in pre-training, which hinders nuanced editing targeting open concepts with specific attributes. Directly altering the keywords in target prompts often results in unintended disruptions to the attention mechanisms. To achieve more flexible editing easily, this work proposes an improved concept-augmented video editing approach that generates diverse and stable target videos flexibly by devising abstract conceptual pairs. Specifically, the framework involves concept-augmented textual inversion and a dual prior supervision mechanism. The former enables plug-and-play guidance of stable diffusion for video editing, effectively capturing target attributes for more stylized results. The dual prior supervision mechanism significantly enhances video stability and fidelity. Comprehensive evaluations demonstrate that our approach generates more stable and lifelike videos, outperforming state-of-the-art methods.

[CV-30] MambaPainter: Neural Stroke-Based Rendering in a Single Step (SIGGRAPH Asia 2024 Posters)

Link: https://arxiv.org/abs/2410.12524
Authors: Tomoya Sawada, Marie Katsurai
Keywords: Stroke-based rendering aims, oil painting style, Stroke-based rendering, aims to reconstruct, painting style
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to SIGGRAPH Asia 2024 posters

Abstract:Stroke-based rendering aims to reconstruct an input image into an oil painting style by predicting brush stroke sequences. Conventional methods perform this prediction stroke-by-stroke or require multiple inference steps due to the limitations of a predictable number of strokes. This procedure leads to inefficient translation speed, limiting their practicality. In this study, we propose MambaPainter, capable of predicting a sequence of over 100 brush strokes in a single inference step, resulting in rapid translation. We achieve this sequence prediction by incorporating the selective state-space model. Additionally, we introduce a simple extension to patch-based rendering, which we use to translate high-resolution images, improving the visual quality with a minimal increase in computational cost. Experimental results demonstrate that MambaPainter can efficiently translate inputs to oil painting-style images compared to state-of-the-art methods. The codes are available at this https URL.

[CV-31] QueensCAMP: an RGB-D dataset for robust Visual SLAM

Link: https://arxiv.org/abs/2410.12520
Authors: Hudson M. S. Bruno, Esther L. Colombini, Sidney N. Givigi Jr
Keywords: Visual Simultaneous Localization, Visual Simultaneous, Localization and Mapping, Simultaneous Localization, robotics applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages

Abstract:Visual Simultaneous Localization and Mapping (VSLAM) is a fundamental technology for robotics applications. While VSLAM research has achieved significant advancements, its robustness under challenging situations, such as poor lighting, dynamic environments, motion blur, and sensor failures, remains a challenging issue. To address these challenges, we introduce a novel RGB-D dataset designed for evaluating the robustness of VSLAM systems. The dataset comprises real-world indoor scenes with dynamic objects, motion blur, and varying illumination, as well as emulated camera failures, including lens dirt, condensation, underexposure, and overexposure. Additionally, we offer open-source scripts for injecting camera failures into any images, enabling further customization by the research community. Our experiments demonstrate that ORB-SLAM2, a traditional VSLAM algorithm, and TartanVO, a Deep Learning-based VO algorithm, can experience performance degradation under these challenging conditions. Therefore, this dataset and the camera failure open-source tools provide a valuable resource for developing more robust VSLAM systems capable of handling real-world challenges.

[CV-32] DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning ICASSP2025

Link: https://arxiv.org/abs/2410.12501
Authors: Jiabao Wei, Zhiyuan Ma
Keywords: online shopping scenarios, synthesis specific person, recently receives numerous, receives numerous attention, specific person images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 6 figures, ICASSP2025

Abstract:Virtual Try-ON (VTON) aims to synthesize specific person images dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task mainly lie in the fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation and effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. Building on a well-established pre-trained paint-by-example (abbr. PBE) approach, we present our DH-VTON pipeline in this work. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which can be trained to align with the large-scale intrinsic knowledge with deep text semantics (e.g., “neckline” or “girdle”) to make up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce a Garment-Feature ControlNet Plus (abbr. GFC+) module and propose to leverage a fresh hybrid attention strategy for training, which can adaptively integrate fine-grained characteristics of the garments into the different layers of the VTON model, so as to achieve multi-scale feature preservation. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.

[CV-33] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective NEURIPS2024

Link: https://arxiv.org/abs/2410.12490
Authors: Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing
Keywords: Latent Diffusion Models, Latent-based image generative, Mask Image Models, achieved notable success, latent space
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2024

Abstract:Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders like VQGAN or VAE to encode pixels into a more compact latent space and learn the data distribution in the latent space instead of directly from pixels. However, this practice raises a pertinent question: Is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive model GPT has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation with the next token prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, which also exhibits substantial improvement akin to GPT when scaling up model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is available at this https URL.

[CV-34] Synthetic Augmentation for Anatomical Landmark Localization using DDPMs

Link: https://arxiv.org/abs/2410.12489
Authors: Arnela Hadzic, Lea Bogensperger, Simon Johannes Joham, Martin Urschler
Keywords: shown great success, Deep learning techniques, anatomical landmark localization, medical data acquisition, great success
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning techniques for anatomical landmark localization (ALL) have shown great success, but their reliance on large annotated datasets remains a problem due to the tedious and costly nature of medical data acquisition and annotation. While traditional data augmentation, variational autoencoders (VAEs), and generative adversarial networks (GANs) have already been used to synthetically expand medical datasets, diffusion-based generative models have recently started to gain attention for their ability to generate high-quality synthetic images. In this study, we explore the use of denoising diffusion probabilistic models (DDPMs) for generating medical images and their corresponding heatmaps of landmarks to enhance the training of a supervised deep learning model for ALL. Our novel approach involves a DDPM with a 2-channel input, incorporating both the original medical image and its heatmap of annotated landmarks. We also propose a novel way to assess the quality of the generated images using a Markov Random Field (MRF) model for landmark matching and a Statistical Shape Model (SSM) to check landmark plausibility, before we evaluate the DDPM-augmented dataset in the context of an ALL task involving hand X-Rays.

[CV-35] Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

Link: https://arxiv.org/abs/2410.12474
Authors: Hongduan Tian, Feng Liu, Zhanke Zhou, Tongliang Liu, Chengqi Zhang, Bo Han
Keywords: cross-domain few-shot classification, task-specific metric space, image instance embeddings, frozen pre-trained backbone, image instance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In cross-domain few-shot classification (CFC), recent works mainly focus on adapting a simple transformation head on top of a frozen pre-trained backbone with few labeled data to project embeddings into a task-specific metric space where classification can be performed by measuring similarities between image instance and prototype representations. Technically, an assumption implicitly adopted in such a framework is that the prototype and image instance embeddings share the same representation transformation. However, in this paper, we find that there naturally exists a gap, which resembles the modality gap, between the prototype and image instance embeddings extracted from the frozen pre-trained backbone, and simply applying the same transformation during the adaptation phase constrains exploring the optimal representations and shrinks the gap between prototype and image representations. To solve this problem, we propose a simple yet effective method, contrastive prototype-image adaptation (CoPA), to adapt different transformations respectively for prototypes and images similarly to CLIP by treating prototypes as text prompts. Extensive experiments on Meta-Dataset demonstrate that CoPA achieves the state-of-the-art performance more efficiently. Meanwhile, further analyses also indicate that CoPA can learn better representation clusters, enlarge the gap, and achieve minimal validation loss at the enlarged gap.
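
The similarity-based classification step that this abstract describes can be sketched as follows. This is only a minimal illustration and not the authors' CoPA method: it assumes that prototypes are per-class means of support embeddings and that similarity is cosine similarity. The function names (`build_prototypes`, `classify`) and the toy 2-D embeddings are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings (assumed prototype definition)."""
    return {label: mean_vector(embeddings) for label, embeddings in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is most similar to the query embedding."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy support set: two classes, two hypothetical 2-D embeddings each.
support = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
prototypes = build_prototypes(support)
print(classify([0.8, 0.2], prototypes))
```

In this sketch the prototypes and the query pass through the same (identity) transformation, which is the shared-transformation assumption the abstract questions.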

[CV-36] Triplet: Triangle Patchlet for Mesh-Based Inverse Rendering and Scene Parameters Approximation

Link: https://arxiv.org/abs/2410.12414
Authors: Jiajie Yang
Keywords: improved novel-view synthesis, significantly improved novel-view, Recent advancements, novel-view synthesis, significantly improved
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Recent advancements in Radiance Fields have significantly improved novel-view synthesis. However, in many real-world applications, the more advanced challenge lies in inverse rendering, which seeks to derive the physical properties of a scene, including light, geometry, textures, and materials. Meshes, a traditional representation adopted by many simulation pipelines, still show limited influence in radiance fields for inverse rendering. This paper introduces a novel framework called Triangle Patchlet (abbr. Triplet), a mesh-based representation, to comprehensively approximate these scene parameters. We begin by assembling Triplets with either randomly generated points or sparse points obtained from camera calibration, where each face is treated as an independent element. Next, we simulate the physical interaction of light and optimize the scene parameters using traditional graphics rendering techniques like rasterization and ray tracing, accompanied by density control and propagation. An iterative mesh extraction process is also suggested, in which we continue to optimize geometry and materials with graph-based operations. We also introduce several regularization terms to enable better generalization of material properties. Our framework can precisely estimate light, materials, and geometry with meshes in a unified framework, without priors on light, materials, or geometry. Experiments demonstrate that our approach can achieve state-of-the-art visual quality while reconstructing high-quality geometry and accurate material properties.

[CV-37] AdaCropFollow: Self-Supervised Online Adaptation for Visual Under-Canopy Navigation

Link: https://arxiv.org/abs/2410.12411
Authors: Arun N. Sivakumar, Federico Magistri, Mateus V. Gasparino, Jens Behley, Cyrill Stachniss, Girish Chowdhary
Keywords: plant manipulation tasks, precise monitoring, growing season, applications like precise, plant manipulation
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Under-canopy agricultural robots can enable various applications like precise monitoring, spraying, weeding, and plant manipulation tasks throughout the growing season. Autonomous navigation under the canopy is challenging due to the degradation in accuracy of RTK-GPS and the large variability in the visual appearance of the scene over time. In prior work, we developed a supervised learning-based perception system with semantic keypoint representation and deployed this in various field conditions. A large number of failures of this system can be attributed to the inability of the perception model to adapt to the domain shift encountered during deployment. In this paper, we propose a self-supervised online adaptation method for adapting the semantic keypoint representation using a visual foundational model, geometric prior, and pseudo labeling. Our preliminary experiments show that with minimal data and fine-tuning of parameters, the keypoint prediction model trained with labels on the source domain can be adapted in a self-supervised manner to various challenging target domains onboard the robot computer using our method. This can enable fully autonomous row-following capability in under-canopy robots across fields and crops without requiring human intervention.

[CV-38] Beyond Coarse-Grained Matching in Video-Text Retrieval ACCV2024

Link: https://arxiv.org/abs/2410.12407
Authors: Aozhu Chen, Hazel Doughty, Xirong Li, Cees G. M. Snoek
Keywords: Video-text retrieval, significant advancements, requires verification, Video-text, fine-grained
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted to ACCV 2024

Abstract:Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model’s ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model’s ability to understand fine-grained differences.

[CV-39] Feature Augmentation for Self-supervised Contrastive Learning: A Closer Look IJCNN2024

Link: https://arxiv.org/abs/2410.12396
Authors: Yong Zhang, Rui Zhu, Shifeng Zhang, Xu Zhou, Shifeng Chen, Xiaofan Chen
Keywords: Self-supervised contrastive learning, view-invariant pre-trained representation, view variance brought, Self-supervised contrastive, contrastive learning heavily
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCNN 2024

Abstract:Self-supervised contrastive learning heavily relies on the view variance brought by data augmentation, so that it can learn a view-invariant pre-trained representation. Beyond increasing the view variance for contrast, this work focuses on improving the diversity of training data, to improve the generalization and robustness of the pre-trained models. To this end, we propose a unified framework to conduct data augmentation in the feature space, known as feature augmentation. This strategy is domain-agnostic, which augments similar features to the original ones and thus improves the data diversity. We perform a systematic investigation of various feature augmentation architectures, the gradient-flow skill, and the relationship between feature augmentation and traditional data augmentation. Our study reveals some practical principles for feature augmentation in self-contrastive learning. By integrating feature augmentation on the instance discrimination or the instance similarity paradigm, we consistently improve the performance of pre-trained feature learning and gain better generalization over the downstream image classification and object detection task.

[CV-40] Real-time Stereo-based 3D Object Detection for Streaming Perception NEURIPS2024

Link: https://arxiv.org/abs/2410.12394
Authors: Changcai Li, Zonghua Gu, Gang Chen, Libo Huang, Wei Zhang, Huihui Zhou
Keywords: autonomous driving, ability to promptly, promptly respond, respond to environmental, system of autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Streaming Perception, 3D Object Detection, NeurIPS2024 poster

Abstract:The ability to promptly respond to environmental changes is crucial for the perception system of autonomous driving. Recently, a new task called streaming perception was proposed. It jointly evaluates latency and accuracy in a single metric for online video perception. In this work, we introduce StreamDSGN, the first real-time stereo-based 3D object detection framework designed for streaming perception. StreamDSGN is an end-to-end framework that directly predicts the 3D properties of objects in the next moment by leveraging historical information, thereby alleviating the accuracy degradation of streaming perception. Further, StreamDSGN applies three strategies to enhance the perception accuracy: (1) A feature-flow-based fusion method, which generates a pseudo-next feature at the current moment to address the misalignment issue between feature and ground truth. (2) An extra regression loss for explicit supervision of object motion consistency in consecutive frames. (3) A large kernel backbone with a large receptive field for effectively capturing long-range spatial contextual features caused by changes in object positions. Experiments on the KITTI Tracking dataset show that, compared with the strong baseline, StreamDSGN significantly improves the streaming average precision by up to 4.33%. Our code is available at this https URL.

[CV-41] HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Link: https://arxiv.org/abs/2410.12381
Authors: Fengji Zhang, Linquan Wu, Huiyu Bai, Guancheng Lin, Xiao Li, Xiao Yu, Yue Wang, Bei Chen, Jacky Keung
Keywords: Artificial General Intelligence, advancing Artificial General, evaluating Large Language, Large Language Models, General Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: homepage this https URL

Abstract:Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs – core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs’ visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs’ capabilities. We have open-sourced our code and benchmark at this https URL.
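
For readers unfamiliar with the pass@1 and pass@10 figures quoted above, the sketch below shows one common way such a metric is computed. The abstract does not say which estimator the authors use, so this assumes the widely used unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where `n` is the number of generated samples per task and `c` the number that pass all tests. The function name and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total samples generated for a task
    c: number of those samples that pass every test case
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 20 samples generated, 1 of them correct.
print(pass_at_k(20, 1, 1))   # chance a single draw is the correct one
print(pass_at_k(20, 1, 10))  # chance the correct one is among 10 draws
```

A benchmark-level score would then be the mean of this per-task value over all tasks.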

[CV-42] Stylistic Multi-Task Analysis of Ukiyo-e Woodblock Prints

Link: https://arxiv.org/abs/2410.12379
Authors: Selina Khan, Nanne van Noord
Keywords: Computer Vision approaches, focused Computer Vision, Computer Vision, Ukiyo-e, Japanese art form
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this work we present a large-scale dataset of Ukiyo-e woodblock prints. Unlike previous works and datasets in the artistic domain that primarily focus on western art, this paper explores this pre-modern Japanese art form with the aim of broadening the scope for stylistic analysis and to provide a benchmark to evaluate a variety of art focused Computer Vision approaches. Our dataset consists of over 175,000 prints with corresponding metadata (e.g., artist, era, and creation date) from the 17th century to present day. By approaching stylistic analysis as a Multi-Task problem we aim to more efficiently utilize the available metadata, and learn more general representations of style. We show results for well-known baselines and state-of-the-art multi-task learning frameworks to enable future comparison, and to encourage stylistic analysis on this artistic domain.

[CV-43] GAN Based Top-Down View Synthesis in Reinforcement Learning Environments

Link: https://arxiv.org/abs/2410.12372
Authors: Usama Younus, Vinoj Jayasundara, Shivam Mishra, Suleyman Aslan
Keywords: top-down view, environment, internal mental model, view, generated top-down view
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Abstract:Human actions are based on the mental perception of the environment. Even when all the aspects of an environment are not visible, humans have an internal mental model that can generalize the partially visible scenes to fully constructed and connected views. This internal mental model uses learned abstract representations of spatial and temporal aspects of the environments encountered in the past. Artificial agents in reinforcement learning environments also benefit by learning a representation of the environment from experience. It provides the agent with viewpoints that are not directly visible to it, helping it make better policy decisions. It can also be used to predict the future state of the environment. This project explores learning the top-down view of an RL environment based on the artificial agent’s first-person view observations with a generative adversarial network (GAN). The top-down view is useful as it provides a complete overview of the environment by building a map of the entire environment. It provides information about the objects’ dimensions and shapes along with their relative positions with one another. Initially, when only a partial observation of the environment is visible to the agent, only a partial top-down view is generated. As the agent explores the environment through a set of actions, the generated top-down view becomes complete. This generated top-down view can assist the agent in deducing better policy decisions. The focus of the project is to learn the top-down view of an RL environment. It doesn’t deal with any Reinforcement Learning task.

[CV-44] Context-Infused Visual Grounding for Art

Link: https://arxiv.org/abs/2410.12369
Authors: Selina Khan, Nanne van Noord
Keywords: collections contain textual, textual attributes, attributes that provide, provide rich, rich and contextualised
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Many artwork collections contain textual attributes that provide rich and contextualised descriptions of artworks. Visual grounding offers the potential for localising subjects within these descriptions on images, however, existing approaches are trained on natural images and generalise poorly to art. In this paper, we present CIGAr (Context-Infused GroundingDINO for Art), a visual grounding approach which utilises the artwork descriptions during training as context, thereby enabling visual grounding on art. In addition, we present a new dataset, Ukiyo-eVG, with manually annotated phrase-grounding annotations, and we set a new state-of-the-art for object detection on two artwork datasets.

[CV-45] Towards Flexible and Efficient Diffusion Low Light Enhancer

Link: https://arxiv.org/abs/2410.12346
Authors: Guanzhou Lan, Qianli Ma, Yuqi Yang, Zhigang Wang, Dong Wang, Yuan Yuan, Bin Zhao
Keywords: Low-Light Image Enhancement, Image Enhancement, Diffusion-based Low-Light Image, Low-Light Image, demonstrated significant success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Diffusion-based Low-Light Image Enhancement (LLIE) has demonstrated significant success in improving the visibility of low-light images. However, the substantial computational burden introduced by the iterative sampling process remains a major concern. Current acceleration methods, whether training-based or training-free, often lead to significant performance degradation. As a result, to achieve an efficient student model with performance comparable to that of existing multi-step teacher model, it is usually necessary to retrain a more capable teacher model. This approach introduces inflexibility, as it requires additional training to enhance the teacher’s performance. To address these challenges, we propose Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT), a step distillation framework specifically designed for LLIE. ReDDiT trains a student model to replicate the teacher’s trajectory in fewer steps while also possessing the ability to surpass the teacher’s performance. Specifically, we first introduce a trajectory decoder from the teacher model to provide guidance. Subsequently, a reflectance-aware trajectory refinement module is incorporated into the distillation process to enable more deterministic guidance from the teacher model. Our framework achieves comparable performance to previous diffusion-based methods with redundant steps in just 2 steps while establishing new state-of-the-art (SOTA) results with 8 or 4 steps. Comprehensive experimental evaluations on 10 benchmark datasets validate the effectiveness of our method, consistently outperforming existing SOTA methods.

[CV-46] TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant

Link: https://arxiv.org/abs/2410.12342
Authors: Guopeng Li, Qiang Wang, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia
Keywords: methodologies predominantly focus, convolutional neural networks, methodologies predominantly, similar architectures, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, and 12 tables

Abstract:Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving the feature alignments in CAKD. Our proposed method is evaluated across some homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released.
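
The idea of replacing a pixel-wise MSE loss with a spatial-agnostic InfoNCE loss can be illustrated with the sketch below. It is not the authors' implementation: it assumes that "spatial smoothing" is approximated by global average pooling over spatial positions, that student and teacher features already share the same channel dimension, and that matching batch indices form the positive pairs. All function names and the temperature value are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average a list of per-position channel vectors into one vector (assumed smoothing)."""
    n = len(feature_map)
    return [sum(channel) / n for channel in zip(*feature_map)]


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(student, teacher, temperature=0.1):
    """Mean InfoNCE loss; student[i] and teacher[i] are treated as the positive pair."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student sample i to every teacher sample.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(s)


# Toy pooled features for a batch of four samples with three channels.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
print(info_nce(features, features))        # aligned pairs give a small loss
print(info_nce(features, features[::-1]))  # misaligned pairs give a larger loss
```

Because the pooled vectors carry no spatial layout, the loss depends only on which samples match, not on where activations occur in the feature map.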

[CV-47] ARIC: An Activity Recognition Dataset in Classroom Surveillance Images

Link: https://arxiv.org/abs/2410.12337
Authors: Linfeng Xu, Fanman Meng, Qingbo Wu, Lili Pan, Heqian Qiu, Lanxiao Wang, Kailong Chen, Kanglei Geng, Yilei Qian, Haojie Wang, Shuchang Zhou, Shimou Ling, Zejia Liu, Nanlin Chen, Yingjie Xu, Shaoxu Cheng, Bowen Tan, Ziyong Xu, Hongliang Li
Keywords: gaining increasing attention, activity recognition, field is gaining, activity, recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2409.03354

Abstract:The application of activity recognition in the “AI + Education” field is gaining increasing attention. However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. Activity recognition in classroom surveillance images faces multiple challenges, such as class imbalance and high activity similarity. To address this gap, we constructed a novel multimodal dataset focused on classroom surveillance image activity recognition called ARIC (Activity Recognition In Classroom). The ARIC dataset has advantages of multiple perspectives, 32 activity categories, three modalities, and real-world classroom scenarios. In addition to the general activity recognition tasks, we also provide settings for continual learning and few-shot continual learning. We hope that the ARIC dataset can act as a facilitator for future analysis and research for open teaching scenarios. You can download preliminary data from this https URL.

[CV-48] MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Link: https://arxiv.org/abs/2410.12332
Authors: Yunqiu Xu, Linchao Zhu, Yi Yang
Keywords: multimodal large language, demonstrated extraordinary vision-language, extraordinary vision-language understanding, vision-language understanding capabilities, visual grounding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities and shown potential to serve as general-purpose assistants, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. In order to assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. To facilitate this research, we meticulously construct a new dataset MC-Bench for benchmarking the visual grounding capabilities of MLLMs. MC-Bench features 2K high-quality and manually annotated samples, consisting of instance-level labeled image pairs and corresponding text prompts that indicate the target instances in the images. In total, there are three distinct styles of text prompts, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans across all metrics. We also observe that existing MLLMs typically outperform foundation models without LLMs only on image-level metrics, and the specialist MLLMs trained on single images often struggle to generalize to multi-image scenarios. Moreover, a simple stepwise baseline integrating advanced MLLM and a detector can significantly surpass prior end-to-end MLLMs. We hope our MC-Bench and empirical findings can encourage the research community to further explore and enhance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: this https URL.

[CV-49] Improved Anomaly Detection through Conditional Latent Space VAE Ensembles

Link: https://arxiv.org/abs/2410.12328
Authors: Oskar Åström, Alexandros Sopasakis
Keywords: unknown outlier classes, Conditional Latent space, space Variational Autoencoder, perform improved pre-processing, Variational Autoencoder
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR)
Comments: 13 pages of main article, 19 pages including references and appendix, 4 figures

Abstract:We propose a novel Conditional Latent space Variational Autoencoder (CL-VAE) to perform improved pre-processing for anomaly detection on data with known inlier classes and unknown outlier classes. This proposed variational autoencoder (VAE) improves latent space separation by conditioning on information within the data. The method fits a unique prior distribution to each class in the dataset, effectively expanding the classic prior distribution for VAEs to include a Gaussian mixture model. An ensemble of these VAEs is merged in the latent spaces to form a group consensus that greatly improves the accuracy of anomaly detection across data sets. Our approach is compared against the capabilities of a typical VAE, a CNN, and a PCA, with regard to AUC for anomaly detection. The proposed model shows increased accuracy in anomaly detection, achieving an AUC of 97.4% on the MNIST dataset compared to 95.7% for the second best model. In addition, the CL-VAE shows increased benefits from ensembling, a more interpretable latent space, and an increased ability to learn patterns in complex data with limited model sizes.

[CV-50] PAPL-SLAM: Principal Axis-Anchored Monocular Point-Line SLAM

Link: https://arxiv.org/abs/2410.12324
Authors: Guanghao Li, Yu Cao, Qi Chen, Yifan Yang, Jian Pu
Keywords: point-line SLAM systems, point-line SLAM, SLAM systems, SLAM, line structural information
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract:In point-line SLAM systems, the utilization of line structural information and the optimization of lines are two significant problems. The former is usually addressed through structural regularities, while the latter typically involves using minimal parameter representations of lines in optimization. However, separating these two steps leads to the loss of constraint information to each other. We anchor lines with similar directions to a principal axis and optimize them with n+2 parameters for n lines, solving both problems together. Our method considers scene structural information, which can be easily extended to different world hypotheses while significantly reducing the number of line parameters to be optimized, enabling rapid and accurate mapping and tracking. To further enhance the system’s robustness and avoid mismatch, we have modeled the line-axis probabilistic data association and provided the algorithm for axis creation, updating, and optimization. Additionally, considering that most real-world scenes conform to the Atlanta World hypothesis, we provide a structural line detection strategy based on vertical priors and vanishing points. Experimental results and ablation studies on various indoor and outdoor datasets demonstrate the effectiveness of our system.

[CV-51] FaceChain-FACT: Face Adapter with Decoupled Training for Identity-preserved Personalization

Link: https://arxiv.org/abs/2410.12312
Authors: Cheng Yu, Haoyu Xie, Lei Shang, Yang Liu, Jun Dan, Baigui Sun, Liefeng Bo
Keywords: human-centric personalized image, adapter-based method obtains, personalized image generation, portrait generation training, field of human-centric
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:In the field of human-centric personalized image generation, the adapter-based method obtains the ability to customize and generate portraits by text-to-image training on facial data. This allows for identity-preserved personalization without additional fine-tuning in inference. Although there are improvements in efficiency and fidelity, there is often a significant performance decrease in text-following ability, controllability, and diversity of generated faces compared to the base model. In this paper, we analyze that the performance degradation is attributed to the failure to decouple identity features from other attributes during extraction, as well as the failure to decouple the portrait generation training from the overall generation task. To address these issues, we propose the Face Adapter with deCoupled Training (FACT) framework, focusing on both model architecture and training strategy. To decouple identity features from others, we leverage a transformer-based face-export encoder and harness fine-grained identity features. To decouple the portrait generation training, we propose Face Adapting Increment Regularization (FAIR), which effectively constrains the effect of face adapters on the facial region, preserving the generative ability of the base model. Additionally, we incorporate a face condition drop and shuffle mechanism, combined with curriculum learning, to enhance facial controllability and diversity. As a result, FACT solely learns identity preservation from training data, thereby minimizing the impact on the original text-to-image capabilities of the base model. Extensive experiments show that FACT has both controllability and fidelity in both text-to-image generation and inpainting solutions for portrait generation.

[CV-52] DAT: Improving Adversarial Robustness via Generative Amplitude Mix-up in Frequency Domain

Link: https://arxiv.org/abs/2410.12307
Authors: Fengpeng Li, Kemou Li, Haiwei Wu, Jinyu Tian, Jiantao Zhou
Keywords: deep neural networks, protect deep neural, adversarial attacks, neural networks, protect deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:To protect deep neural networks (DNNs) from adversarial attacks, adversarial training (AT) is developed by incorporating adversarial examples (AEs) into model training. Recent studies show that adversarial attacks disproportionately impact the patterns within the phase of the sample’s frequency spectrum – typically containing crucial semantic information – more than those in the amplitude, resulting in the model’s erroneous categorization of AEs. We find that, by mixing the amplitude of training samples’ frequency spectrum with those of distractor images for AT, the model can be guided to focus on phase patterns unaffected by adversarial perturbations. As a result, the model’s robustness can be improved. Unfortunately, it is still challenging to select appropriate distractor images, which should mix the amplitude without affecting the phase patterns. To this end, in this paper, we propose an optimized Adversarial Amplitude Generator (AAG) to achieve a better tradeoff between improving the model’s robustness and retaining phase patterns. Based on this generator, together with an efficient AE production procedure, we design a new Dual Adversarial Training (DAT) strategy. Experiments on various datasets show that our proposed DAT leads to significantly improved robustness against diverse adversarial attacks.
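
The amplitude mix-up idea in this abstract can be illustrated with a small, self-contained sketch. It is not the paper's Adversarial Amplitude Generator, which is a learned component. It only shows the basic operation of blending the amplitude spectrum of a sample with that of a distractor while keeping the sample's phase. The 1D signals, the naive DFT, the function names, and the fixed mixing weight `lam` are all assumptions made for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (illustrative, not optimized)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def amplitude_mixup(sample, distractor, lam):
    """Blend amplitude spectra with weight `lam`; keep the sample's phase.

    `lam` is a hypothetical fixed mixing weight in [0, 1]; lam = 0 returns
    the original sample (up to floating-point error).
    """
    mixed_spectrum = []
    for s, d in zip(dft(sample), dft(distractor)):
        amplitude = (1.0 - lam) * abs(s) + lam * abs(d)
        # Rebuild each frequency bin from the mixed amplitude and the
        # sample's own phase.
        mixed_spectrum.append(cmath.rect(amplitude, cmath.phase(s)))
    # For real-valued inputs the mixed spectrum stays conjugate-symmetric,
    # so the imaginary part of the inverse transform is numerical noise.
    return [value.real for value in idft(mixed_spectrum)]
```

Calling `amplitude_mixup(sample, distractor, 0.5)` on two equal-length lists of floats yields a signal whose per-frequency phase matches `sample` while its amplitudes sit halfway between the two inputs. A 2D image version would follow the same pattern with a 2D FFT.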

[CV-53] Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors

Link: https://arxiv.org/abs/2410.12295
Authors: Linwei Tao, Haolan Guo, Minjing Dong, Chang Xu
Keywords: deep learning applications, Expected Calibration Error, accurate confidence estimates, learning applications, autonomous driving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Calibration is crucial in deep learning applications, especially in fields like healthcare and autonomous driving, where accurate confidence estimates are vital for decision-making. However, deep neural networks often suffer from miscalibration, with reliability diagrams and Expected Calibration Error (ECE) being the only standard perspective for evaluating calibration performance. In this paper, we introduce the concept of consistency as an alternative perspective on model calibration, inspired by uncertainty estimation literature in large language models (LLMs). We highlight its advantages over the traditional reliability-based view. Building on this concept, we propose a post-hoc calibration method called Consistency Calibration (CC), which adjusts confidence based on the model’s consistency across perturbed inputs. CC is particularly effective in local uncertainty estimation, as it requires no additional data samples or label information, instead generating input perturbations directly from the source data. Moreover, we show that performing perturbations at the logit level significantly improves computational efficiency. We validate the effectiveness of CC through extensive comparisons with various post-hoc and training-time calibration methods, demonstrating state-of-the-art performance on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as on long-tailed datasets like ImageNet-LT.
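
A rough sketch of the consistency idea may help make the abstract concrete. It assumes, without confirmation from the paper, that logit-level perturbation means adding Gaussian noise to the logits and that confidence is the fraction of perturbed predictions agreeing with the original one. The function name, `sigma`, `num_samples`, and `seed` are hypothetical parameters chosen for illustration.

```python
import random


def consistency_confidence(logits, sigma=1.0, num_samples=200, seed=0):
    """Estimate confidence as agreement under logit-level perturbations.

    Assumption: perturbations are i.i.d. Gaussian noise with standard
    deviation `sigma` added to every logit. The paper's exact perturbation
    scheme may differ.
    """
    rng = random.Random(seed)
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    agreements = 0
    for _ in range(num_samples):
        noisy = [z + rng.gauss(0.0, sigma) for z in logits]
        noisy_prediction = max(range(len(noisy)), key=lambda i: noisy[i])
        if noisy_prediction == predicted:
            agreements += 1
    # Fraction of perturbed copies that keep the original prediction.
    return predicted, agreements / num_samples
```

Under this sketch, a prediction with a large logit margin keeps its label under almost every perturbation and receives a confidence near 1, while a near-tie flips often and receives a much lower confidence. No extra data or labels are needed, which matches the property the abstract emphasizes.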

[CV-54] Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting

Link: https://arxiv.org/abs/2410.12284
Authors: Maxime Kayser, Bayar Menzat, Cornelius Emde, Bogdan Bercean, Alex Novak, Abdala Espinosa, Bartlomiej W. Papiez, Susanne Gaube, Thomas Lukasiewicz, Oana-Maria Camburu
Keywords: including in safety-critical, safety-critical domains, growing capabilities, explanations, models are leading
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, we conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. We evaluated three types of explanations: visual explanations (saliency maps), natural language explanations, and a combination of both modalities. We specifically examined how different explanation types influence users depending on whether the AI advice and explanations are factually correct. We find that text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. We also observe that the quality of explanations, that is, how much factually correct information they entail, and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.

[CV-55] Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Link: https://arxiv.org/abs/2410.12278
Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau
Keywords: automatically generate non-trivial, generate non-trivial task-specific, automatically generate, generate non-trivial, non-trivial task-specific synthetic
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and a language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text versus baselines, to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.

[CV-56] Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond

Link: https://arxiv.org/abs/2410.12274
Authors: Pengwei Liang, Junjun Jiang, Qing Ma, Xianming Liu, Jiayi Ma
Keywords: single degraded image, fusion, Image fusion, Image, source images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages

Abstract:Image fusion is well known as an alternative solution to generate one high-quality image from multiple images in addition to image restoration from a single degraded image. The essence of image fusion is to integrate complementary information from source images. Existing fusion methods struggle with generalization across various tasks and often require labor-intensive designs, in which it is difficult to identify and extract useful information from source images due to the diverse requirements of each fusion task. Additionally, these methods develop highly specialized features for different downstream applications, hindering the adaptation to new and diverse downstream tasks. To address these limitations, we introduce DeFusion++, a novel framework that leverages self-supervised learning (SSL) to enhance the versatility of feature representation for different image fusion tasks. DeFusion++ captures the image fusion task-friendly representations from large-scale data in a self-supervised way, overcoming the constraints of limited fusion datasets. Specifically, we introduce two innovative pretext tasks: common and unique decomposition (CUD) and masked feature modeling (MFM). CUD decomposes source images into abstract common and unique components, while MFM refines these components into robust fused features. Joint training of these tasks enables DeFusion++ to produce adaptable representations that can effectively extract useful information from various source images, regardless of the fusion task. The resulting fused representations are also highly adaptable for a wide range of downstream tasks, including image segmentation and object detection. DeFusion++ stands out by producing versatile fused representations that can enhance both the quality of image fusion and the effectiveness of downstream high-level vision tasks, simplifying the process with the elegant fusion framework.

[CV-57] DaDiff: Domain-aware Diffusion Model for Nighttime UAV Tracking

Link: https://arxiv.org/abs/2410.12270
Authors: Haobo Zuo, Changhong Fu, Guangze Zheng, Liangliang Yao, Kunhan Lu, Jia Pan
Keywords: Domain adaptation, nighttime UAV tracking, night image features, issue of day, nighttime UAV
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Domain adaptation is an inspiring solution to the misalignment issue of day/night image features for nighttime UAV tracking. However, the one-step adaptation paradigm is inadequate in addressing the prevalent difficulties posed by low-resolution (LR) objects when viewed from the UAVs at night, owing to the blurry edge contour and limited detail information. Moreover, these approaches struggle to perceive LR objects disturbed by nighttime noise. To address these challenges, this work proposes a novel progressive alignment paradigm, named domain-aware diffusion model (DaDiff), aligning nighttime LR object features to the daytime by virtue of progressive and stable generations. The proposed DaDiff includes an alignment encoder to enhance the detail information of nighttime LR objects, a tracking-oriented layer designed to achieve close collaboration with tracking tasks, and a successive distribution discriminator presented to distinguish different feature distributions at each diffusion timestep successively. Furthermore, an elaborate nighttime UAV tracking benchmark is constructed for LR objects, namely NUT-LR, consisting of 100 annotated sequences. Exhaustive experiments have demonstrated the robustness and feature alignment ability of the proposed DaDiff. The source code and video demo are available at this https URL.

[CV-58] LoD-Loc: Aerial Visual Localization using LoD 3D Map with Neural Wireframe Alignment NEURIPS2024

Link: https://arxiv.org/abs/2410.12269
Authors: Juelin Zhu, Shen Yan, Long Wang, Shengyue Zhang, Yu Liu, Maojun Zhang
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, method named LoD-Loc, visual localization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024; for Project page, see this https URL

Abstract:We propose a new method named LoD-Loc for visual localization in the air. Unlike existing localization algorithms, LoD-Loc does not rely on complex 3D representations and can estimate the pose of an Unmanned Aerial Vehicle (UAV) using a Level-of-Detail (LoD) 3D map. LoD-Loc mainly achieves this goal by aligning the wireframe derived from the LoD projected model with that predicted by the neural network. Specifically, given a coarse pose provided by the UAV sensor, LoD-Loc hierarchically builds a cost volume for uniformly sampled pose hypotheses to describe pose probability distribution and selects a pose with maximum probability. Each cost within this volume measures the degree of line alignment between projected and predicted wireframes. LoD-Loc also devises a 6-DoF pose optimization algorithm to refine the previous result with a differentiable Gauss-Newton method. As no public dataset exists for the studied problem, we collect two datasets with map levels of LoD3.0 and LoD2.0, along with real RGB queries and ground-truth pose annotations. We benchmark our method and demonstrate that LoD-Loc achieves excellent performance, even surpassing current state-of-the-art methods that use textured 3D models for localization. The code and dataset are available at this https URL.

[CV-59] Optimizing YOLOv5s Object Detection through Knowledge Distillation algorithm

Link: https://arxiv.org/abs/2410.12259
Authors: Guanming Huang, Aoran Shen, Yuxiang Hu, Junliang Du, Jiacheng Hu, Yingbin Liang
Keywords: knowledge distillation technology, target detection tasks, student detection accuracy, knowledge distillation, detection tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This paper explores the application of knowledge distillation technology in target detection tasks, especially the impact of different distillation temperatures on the performance of student models. By using YOLOv5l as the teacher network and a smaller YOLOv5s as the student network, we found that with the increase of distillation temperature, the student’s detection accuracy gradually improved, and finally achieved mAP50 and mAP50-95 indicators that were better than the original YOLOv5s model at a specific temperature. Experimental results show that appropriate knowledge distillation strategies can not only improve the accuracy of the model but also help improve the reliability and stability of the model in practical applications. This paper also records in detail the accuracy curve and loss function descent curve during the model training process and shows that the model converges to a stable state after 150 training cycles. These findings provide a theoretical basis and technical reference for further optimizing target detection algorithms.
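
To show what the distillation temperature controls, here is a minimal sketch of the standard temperature-scaled distillation loss for classification logits (softened softmax plus KL divergence, scaled by the squared temperature). The abstract does not state how the loss is applied inside YOLOv5, so this is an assumption for illustration only, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures. This is the generic formulation, not
    necessarily the one used in the paper.
    """
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2
```

Raising the temperature flattens the teacher distribution, so the student also receives signal about the relative ranking of non-top classes. That is the knob the paper varies when it studies the effect of distillation temperature on the student model.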

[CV-60] EG-HumanNeRF: Efficient Generalizable Human NeRF Utilizing Human Prior for Sparse View

Link: https://arxiv.org/abs/2410.12242
Authors: Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo
Keywords: enables neural-based digital, neural radiance field, neural-based digital human, rendering, Generalizable neural radiance
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page: this https URL

Abstract:Generalizable neural radiance field (NeRF) enables neural-based digital human rendering without per-scene retraining. When combined with human prior knowledge, high-quality human rendering can be achieved even with sparse input views. However, the inference of these methods is still slow, as a large number of neural network queries on each ray are required to ensure the rendering quality. Moreover, occluded regions often suffer from artifacts, especially when the input views are sparse. To address these issues, we propose a generalizable human NeRF framework that achieves high-quality and real-time rendering with sparse input views by extensively leveraging human prior knowledge. We accelerate the rendering with a two-stage sampling reduction strategy: first constructing boundary meshes around the human geometry to reduce the number of ray samples for sampling guidance regression, and then volume rendering using fewer guided samples. To improve rendering quality, especially in occluded regions, we propose an occlusion-aware attention mechanism to extract occlusion information from the human priors, followed by an image space refinement network to improve rendering quality. Furthermore, for volume rendering, we adopt a signed ray distance function (SRDF) formulation, which allows us to propose an SRDF loss at every sample position to improve the rendering quality further. Our experiments demonstrate that our method outperforms the state-of-the-art methods in rendering quality and has a competitive rendering speed compared with speed-prioritized novel view synthesis methods.

[CV-61] Leveraging Spatial Attention and Edge Context for Optimized Feature Selection in Visual Localization

Link: https://arxiv.org/abs/2410.12240
Authors: Nanda Febri Istighfarin, HyungGi Jo
Keywords: agent precise position, precise position, position and orientation, visual data, Visual localization determines
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual localization determines an agent’s precise position and orientation within an environment using visual data. It has become a critical task in the field of robotics, particularly in applications such as autonomous navigation. This is due to the ability to determine an agent’s pose using cost-effective sensors such as RGB cameras. Recent methods in visual localization employ scene coordinate regression to determine the agent’s pose. However, these methods face challenges as they attempt to regress 2D-3D correspondences across the entire image region, despite not all regions providing useful information. To address this issue, we introduce an attention network that selectively targets informative regions of the image. Using this network, we identify the highest-scoring features to improve the feature selection process and combine the result with edge detection. This integration ensures that the features chosen for the training buffer are located within robust regions, thereby improving 2D-3D correspondence and overall localization performance. Our approach was tested on the outdoor benchmark dataset, demonstrating superior results compared to previous methods.

[CV-62] Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety

Link: https://arxiv.org/abs/2410.12225
Authors: Lucas Choi, Ross Greer
Keywords: enhance construction safety, paper evaluates, enhance construction, Hardhat Safety Detection, zero-shot detection
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper evaluates the use of vision-language models (VLMs) for zero-shot detection and association of hardhats to enhance construction safety. Given the significant risk of head injuries in construction, proper enforcement of hardhat use is critical. We investigate the applicability of foundation models, specifically OWLv2, for detecting hardhats in real-world construction site images. Our contributions include the creation of a new benchmark dataset, Hardhat Safety Detection Dataset, by filtering and combining existing datasets and the development of a cascaded detection approach. Experimental results on 5,210 images demonstrate that the OWLv2 model achieves an average precision of 0.6493 for hardhat detection. We further analyze the limitations and potential improvements for real-world applications, highlighting the strengths and weaknesses of current foundation models in safety perception domains.

[CV-63] Order-Aware Interactive Segmentation

Link: https://arxiv.org/abs/2410.12214
Authors: Bin Wang, Anwesa Choudhuri, Meng Zheng, Zhongpai Gao, Benjamin Planche, Andong Deng, Qin Liu, Terrence Chen, Ulas Bagci, Ziyan Wu
Keywords: accurately segment target, segment target objects, Interactive segmentation aims, minimal user interactions, accurately separate target
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Interactive demo can be found in project page: this https URL

Abstract:Interactive segmentation aims to accurately segment target objects with minimal user interactions. However, current methods often fail to accurately separate target objects from the background, due to a limited understanding of order, the relative depth between objects in a scene. To address this issue, we propose OIS: order-aware interactive segmentation, where we explicitly encode the relative depth between objects into order maps. We introduce a novel order-aware attention, where the order maps seamlessly guide the user interactions (in the form of clicks) to attend to the image features. We further present an object-aware attention module to incorporate a strong object-level understanding to better differentiate objects with similar order. Our approach allows both dense and sparse integration of user clicks, enhancing both accuracy and efficiency as compared to prior works. Experimental results demonstrate that OIS achieves state-of-the-art performance, improving mIoU after one click by 7.61 on the HQSeg44K dataset and 1.32 on the DAVIS dataset as compared to the previous state-of-the-art SegNext, while also doubling inference speed compared to current leading methods. The project page is this https URL

[CV-64] Sparse Prototype Network for Explainable Pedestrian Behavior Prediction

Link: https://arxiv.org/abs/2410.12195
Authors: Yan Feng, Alexander Carballo, Kazuya Takeda
Keywords: Predicting pedestrian behavior, Predicting pedestrian, smart city, behavior is challenging, challenging yet crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Predicting pedestrian behavior is challenging yet crucial for applications such as autonomous driving and smart cities. Recent deep learning models have achieved remarkable performance in making accurate predictions, but they fail to provide explanations of their inner workings. One reason for this problem is the multi-modal inputs. To bridge this gap, we present Sparse Prototype Network (SPN), an explainable method designed to simultaneously predict a pedestrian’s future action, trajectory, and pose. SPN leverages an intermediate prototype bottleneck layer to provide sample-based explanations for its predictions. The prototypes are modality-independent, meaning that they can correspond to any modality from the input. Therefore, SPN can extend to arbitrary combinations of modalities. Regularized by mono-semanticity and clustering constraints, the prototypes learn consistent and human-understandable features and achieve state-of-the-art performance on action, trajectory and pose prediction on TITAN and PIE. Finally, we propose a metric named Top-K Mono-semanticity Scale to quantitatively evaluate the explainability. Qualitative results show the positive correlation between sparsity and explainability. Code available at this https URL.

[CV-65] Test-time adaptation for image compression with distribution regularization

Link: https://arxiv.org/abs/2410.12191
Authors: Kecheng Chen, Pingping Zhang, Tiexin Qin, Shiqi Wang, Hong Yan, Haoliang Li
Keywords: adaptation image compression, image compression models, learned image compression, compression-time adaptation image, screen content images
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Current test- or compression-time adaptation image compression (TTA-IC) approaches, which leverage both latent and decoder refinements as a two-step adaptation scheme, have potentially enhanced the rate-distortion (R-D) performance of learned image compression models on cross-domain compression tasks, e.g., from natural to screen content images. However, compared with the emergence of various decoder refinement variants, the latent refinement, as an inseparable ingredient, is barely tailored to cross-domain scenarios. To this end, we aim to develop an advanced latent refinement method by extending the effective hybrid latent refinement (HLR) method, which is designed for in-domain inference improvement but shows noticeable degradation of the rate cost in cross-domain tasks. Specifically, we first provide theoretical analyses, in a cue of marginalization approximation from in- to cross-domain scenarios, to uncover that the vanilla HLR suffers from an underlying mismatch between refined Gaussian conditional and hyperprior distributions, leading to deteriorated joint probability approximation of marginal distribution with increased rate consumption. To remedy this issue, we introduce a simple Bayesian approximation-endowed distribution regularization to encourage learning a better joint probability approximation in a plug-and-play manner. Extensive experiments on six in- and cross-domain datasets demonstrate that our proposed method not only improves the R-D performance compared with other latent refinement counterparts, but also can be flexibly integrated into existing TTA-IC methods with incremental benefits.

[CV-66] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration NEURIPS2024

Link: https://arxiv.org/abs/2410.12183
Authors: Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang
Keywords: large-scale image-text pre-training, transfer learning, owing to large-scale, recently shown, shown their power
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024

Abstract:Vision-language foundation models (such as CLIP) have recently shown their power in transfer learning, owing to large-scale image-text pre-training. However, target domain data in the downstream tasks can be highly different from the pre-training phase, which makes it hard for such a single model to generalize well. Alternatively, there exists a wide range of expert models that contain diversified vision and/or language knowledge pre-trained on different modalities, tasks, networks, and datasets. Unfortunately, these models are “isolated agents” with heterogeneous structures, and how to integrate their knowledge for generalizing CLIP-like models has not been fully explored. To bridge this gap, we propose a general and concise TransAgent framework, which transports the knowledge of the isolated agents in a unified manner, and effectively guides CLIP to generalize with multi-source knowledge distillation. With such a distinct framework, we flexibly collaborate with 11 heterogeneous agents to empower vision-language foundation models, without further cost in the inference phase. Finally, our TransAgent achieves state-of-the-art performance on 11 visual recognition datasets. Under the same low-shot setting, it outperforms the popular CoOp by around 10% on average, and 20% on EuroSAT which contains large domain shifts.

[CV-67] Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution

Link: https://arxiv.org/abs/2410.12165
Authors: Timothy Wei, Hsien Xin Peng, Elaine Xu, Bryan Zhao, Lei Ding, Diji Yang
Keywords: Artificial Intelligence models, Artificial Intelligence, increasingly challenging due, grow in size, Large Video-Language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:As Artificial Intelligence models, such as Large Video-Language models (VLMs), grow in size, their deployment in real-world applications becomes increasingly challenging due to hardware limitations and computational costs. To address this, we design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary. Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model’s output is uncertain and selectively offload inference to the large model in the cloud. Experimental results on the action classification task show that our framework not only requires less computational overhead, but also improves accuracy compared to using a large model alone. Our framework provides a scalable and adaptable solution for action classification in resource-constrained environments, with potential applications beyond healthcare. Notably, while DMD-generated data is used for optimizing performance and resource usage in our pipeline, we expect the concept of DMD to further support future research on knowledge alignment across multiple models.

[CV-68] SAM-Guided Masked Token Prediction for 3D Scene Understanding NEURIPS2024

Link: https://arxiv.org/abs/2410.12158
Authors: Zhimin Chen,Liang Yang,Yingwei Li,Longlong Jing,Bing Li
Keywords: marking considerable advancements, knowledge distillation, Foundation models, significantly enhanced, scene understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models. To tackle these issues, we introduce a novel SAM-guided tokenization method that seamlessly aligns 3D transformer structures with region-level knowledge distillation, replacing the traditional KNN-based tokenization techniques. Additionally, we implement a group-balanced re-weighting strategy to effectively address the long-tail problem in knowledge distillation. Furthermore, inspired by the recent success of masked feature prediction, our framework incorporates a two-stage masked token prediction process in which the student model predicts both the global embeddings and the token-wise local embeddings derived from the teacher models trained in the first stage. Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current State-of-the-art self-supervised methods, establishing new benchmarks in this field.

[CV-69] Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection

Link: https://arxiv.org/abs/2410.12143
Authors: Qishun Wang,Zhengzheng Tu,Kunpeng Wang,Le Gu,Chuanwang Guo
Keywords: Video Object Detection, RGB-Thermal Video Object, Object Detection, Video Object, Current RGB-Thermal Video
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Current RGB-Thermal Video Object Detection (RGBT VOD) methods still depend on manually aligning data at the image level, which hampers their practical application in real-world scenarios, since image pairs captured by multispectral sensors often differ in both field of view and resolution. To address this limitation, we propose a Multi-modal Dynamic Local fusion Network (MDLNet) designed to handle unaligned RGBT image pairs. Specifically, our proposed Multi-modal Dynamic Local Fusion (MDLF) module includes a set of predefined boxes, each enhanced with random Gaussian noise to generate a dynamic box. Each box selects a local region from the original high-resolution RGB image. This region is then fused with the corresponding information from another modality and reinserted into the RGB. This method adapts to various data alignment scenarios by interacting with local features across different ranges. Simultaneously, we introduce a Cascaded Temporal Scrambler (CTS) within an end-to-end architecture. This module leverages consistent spatiotemporal information from consecutive frames to enhance the representation capability of the current frame while maintaining network efficiency. We have curated an open dataset called UVT-VOD2024 for unaligned RGBT VOD. It consists of 30,494 pairs of unaligned RGBT images captured directly from a multispectral camera. We conduct a comprehensive evaluation and comparison of MDLNet with state-of-the-art (SOTA) models, demonstrating the superior effectiveness of MDLNet. We will release our code and UVT-VOD2024 to the public for further research.

[CV-70] OMCAT: Omni Context Aware Transformer

Link: https://arxiv.org/abs/2410.12109
Authors: Arushi Goel,Karan Sapra,Matthieu Le,Rafael Valle,Andrew Tao,Bryan Catanzaro
Keywords: Large Language Models, Large Language, recent advancements extending, Language Models, Temporal Audio Video
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Demo page: this https URL

Abstract: Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and a new model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is this https URL.

[CV-71] SplatPose+: Real-time Image-Based Pose-Agnostic 3D Anomaly Detection

Link: https://arxiv.org/abs/2410.12080
Authors: Yizhe Liu,Yan Song Hu,Yuhao Chen,John Zelek
Keywords: Anomaly Detection, Image-based Pose-Agnostic, Pose-agnostic Anomaly Detection, industrial quality control, important task
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Image-based Pose-Agnostic 3D Anomaly Detection is an important task that has emerged in industrial quality control. This task seeks to find anomalies from query images of a tested object given a set of reference images of an anomaly-free object. The challenge is that the query views (a.k.a. poses) are unknown and can be different from the reference views. Currently, new methods such as OmniposeAD and SplatPose have emerged to bridge the gap by synthesizing pseudo reference images at the query views for pixel-to-pixel comparison. However, none of these methods can infer in real time, which is critical in industrial quality control for mass production. For this reason, we propose SplatPose+, which employs a hybrid representation consisting of a Structure from Motion (SfM) model for localization and a 3D Gaussian Splatting (3DGS) model for Novel View Synthesis. Although our proposed pipeline requires the computation of an additional SfM model, it offers real-time inference speeds and faster training compared to SplatPose. Quality-wise, we achieved a new SOTA on the Pose-agnostic Anomaly Detection benchmark with the Multi-Pose Anomaly Detection (MAD-SIM) dataset.

[CV-72] WeatherDG: LLM-assisted Procedural Weather Generation for Domain-Generalized Semantic Segmentation

Link: https://arxiv.org/abs/2410.12075
Authors: Chenghao Qian,Yuhu Guo,Yuhong Mo,Wenjing Li
Keywords: Large Language Model, Stable Diffusion, Large Language, driving-screen images based, Language Model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: In this work, we propose a novel approach, namely WeatherDG, that can generate realistic, weather-diverse, driving-scene images based on the cooperation of two foundation models, i.e., Stable Diffusion (SD) and a Large Language Model (LLM). Specifically, we first fine-tune the SD with source data, aligning the content and layout of generated samples with real-world driving scenarios. Then, we propose a procedural prompt generation method based on the LLM, which can enrich scenario descriptions and help SD automatically generate more diverse, detailed images. In addition, we introduce a balanced generation strategy, which encourages the SD to generate high-quality objects of tail classes, such as riders and motorcycles, under various weather conditions. This segmentation-model-agnostic method can improve the generalization ability of existing models by additionally adapting them with the generated synthetic data. Experiments on three challenging datasets show that our method can significantly improve the segmentation performance of different state-of-the-art models on target domains. Notably, in the "Cityscapes to ACDC" setting, our method improves the baseline HRDA by 13.9% in mIoU.

[CV-73] nvTorchCam: An Open-source Library for Camera-Agnostic Differentiable Geometric Vision

Link: https://arxiv.org/abs/2410.12074
Authors: Daniel Lichy,Hang Su,Abhishek Badki,Jan Kautz,Orazio Gallo
Keywords: algorithms camera model-independent, designed to make, make deep learning, open-source library, learning algorithms camera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Source code and installation instructions are available at this https URL

Abstract: We introduce nvTorchCam, an open-source library under the Apache 2.0 license, designed to make deep learning algorithms camera model-independent. nvTorchCam abstracts critical camera operations such as projection and unprojection, allowing developers to implement algorithms once and apply them across diverse camera models, including pinhole, fisheye, and 360 equirectangular panoramas, which are commonly used in automotive and real estate capture applications. Built on PyTorch, nvTorchCam is fully differentiable and supports GPU acceleration and batching for efficient computation. Furthermore, deep learning models trained for one camera type can be directly transferred to other camera types without requiring additional modification. In this paper, we provide an overview of nvTorchCam and its functionality, and present various code examples and diagrams to demonstrate its usage. Source code and installation instructions can be found on the nvTorchCam GitHub page at this https URL.
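
To make "projection and unprojection" concrete, here is a minimal sketch for the pinhole case only, written in plain Python. It does not use or reproduce the nvTorchCam API; the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration.

```python
def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point (x, y, z) with z > 0 to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def unproject(pixel, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth back to a 3D camera-frame point."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


p = (0.5, -0.25, 2.0)
uv = project(p, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
back = unproject(uv, depth=p[2], fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A camera-agnostic library would hide these two operations behind a common interface, so that a fisheye or equirectangular model could be swapped in without changing the calling code.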

[CV-74] V3D-SLAM: Robust RGB-D SLAM in Dynamic Environments with 3D Semantic Geometry Voting

Link: https://arxiv.org/abs/2410.12068
Authors: Tuan Dang,Khang Nguyen,Mandfred Huber
Keywords: Simultaneous localization, highly dynamic environments, localization and mapping, environments is challenging, challenging due
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Simultaneous localization and mapping (SLAM) in highly dynamic environments is challenging due to the correlation complexity between moving objects and the camera pose. Many methods have been proposed to deal with this problem; however, the moving properties of dynamic objects with a moving camera remain unclear. Therefore, to improve SLAM’s performance, minimizing disruptive events of moving objects with a physical understanding of 3D shapes and dynamics of objects is needed. In this paper, we propose a robust method, V3D-SLAM, to remove moving objects via two lightweight re-evaluation stages, including identifying potentially moving and static objects using a spatial-reasoned Hough voting mechanism and refining static objects by detecting dynamic noise caused by intra-object motions using Chamfer distances as similarity measurements. Our experiment on the TUM RGB-D benchmark on dynamic sequences with ground-truth camera trajectories showed that our methods outperform the most recent state-of-the-art SLAM methods. Our source code is available at this https URL.
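
The abstract uses Chamfer distances as a similarity measure between point sets. The sketch below shows one common definition (mean nearest-neighbour distance in each direction, summed); conventions vary, for example some use squared distances, and this is not taken from the V3D-SLAM code. The function name and the sample points are illustrative.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points.

    Averages each point's distance to its nearest neighbour in the other set,
    then sums the two directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
same = chamfer_distance(cloud, cloud)      # identical sets
moved = chamfer_distance(cloud, shifted)   # every point displaced by 1 along z
```

A small distance between an object's point clouds in consecutive frames suggests it is static; a larger one hints at motion within the object.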

[CV-75] SOE: SO(3)-Equivariant 3D MRI Encoding

Link: https://arxiv.org/abs/2410.12053
Authors: Shizhe He,Magdalini Paschali,Jiahong Ouyang,Adnan Masood,Akshay Chaudhari,Ehsan Adeli
Keywords: increasingly important, learning latent representations, Representation, space, representation space
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Representation learning has become increasingly important, especially as powerful models have shifted towards learning latent representations before fine-tuning for downstream tasks. This approach is particularly valuable in leveraging the structural information within brain anatomy. However, a common limitation of recent models developed for MRIs is their tendency to ignore or remove geometric information, such as translation and rotation, thereby creating invariance with respect to geometric operations. We contend that incorporating knowledge about these geometric transformations into the model can significantly enhance its ability to learn more detailed anatomical information within brain structures. As a result, we propose a novel method for encoding 3D MRIs that enforces equivariance with respect to all rotations in 3D space, in other words, SO(3)-equivariance (SOE). By explicitly modeling this geometric equivariance in the representation space, we ensure that any rotational operation applied to the input image space is also reflected in the embedding representation space. This approach requires moving beyond traditional representation learning methods, as we need a representation vector space that allows for the application of the same SO(3) operation in that space. To facilitate this, we leverage the concept of vector neurons. The representation space formed by our method captures the brain’s structural and anatomical information more effectively. We evaluate SOE pretrained on the structural MRIs of two public data sets with respect to the downstream task of predicting age and diagnosing Alzheimer’s Disease from T1-weighted brain scans of the ADNI data set. We demonstrate that our approach not only outperforms other methods but is also robust against various degrees of rotation along different axes. The code is available at this https URL.
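
Equivariance here means that rotating the input and then encoding gives the same result as encoding and then rotating. The sketch below checks this for a single channel-mixing linear layer acting on vector-valued features, in the spirit of vector neurons. It is a toy illustration, not the SOE encoder; the weights, features, and function names are hypothetical.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3-vector about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def channel_mix(weights, features):
    """Linear layer that mixes channels; each channel is a 3-vector."""
    return [
        tuple(sum(w * f[k] for w, f in zip(row, features)) for k in range(3))
        for row in weights
    ]


weights = [[0.5, -1.0], [2.0, 0.25]]            # hypothetical 2x2 channel weights
features = [(1.0, 0.0, 2.0), (0.0, 3.0, -1.0)]  # two vector-valued channels
angle = 0.7

rotate_then_mix = channel_mix(weights, [rotate_z(f, angle) for f in features])
mix_then_rotate = [rotate_z(f, angle) for f in channel_mix(weights, features)]
```

The two results agree up to floating-point error because the layer only combines channels and never mixes the x, y, z components, so any rotation commutes with it.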

[CV-76] Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction

Link: https://arxiv.org/abs/2410.12023
Authors: Mykhaylo Andriluka,Baruch Tabanpour,C. Daniel Freeman,Cristian Sminchisescu
Keywords: Learned Articulated Rigid, Articulated Rigid body, neural network approach, Learned Articulated, Rigid body Physics
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Abstract: We propose a novel neural network approach, LARP (Learned Articulated Rigid body Physics), to model the dynamics of articulated human motion with contact. Our goal is to develop a faster and more convenient methodological alternative to traditional physics simulators for use in computer vision tasks such as human motion reconstruction from video. To that end, we introduce a training procedure and model components that support the construction of a recurrent neural architecture to accurately simulate articulated rigid body dynamics. Our neural architecture supports features typically found in traditional physics simulators, such as modeling of joint motors, variable dimensions of body parts, and contact between body parts and objects, and is an order of magnitude faster than traditional systems when multiple simulations are run in parallel. To demonstrate the value of LARP, we use it as a drop-in replacement for a state-of-the-art classical non-differentiable simulator in an existing video-based reconstruction framework and show comparable or better 3D human pose reconstruction accuracy.

[CV-77] LocoMotion: Learning Motion-Focused Video-Language Representations ACCV2024

Link: https://arxiv.org/abs/2410.12018
Authors: Hazel Doughty,Fida Mohammad Thoker,Cees G. M. Snoek
Keywords: paper strives, video-language representations, motion-focused video-language representations, learn video-language representations, motion-focused video-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: ACCV 2024

Abstract:This paper strives for motion-focused video-language representations. Existing methods to learn video-language representations use spatial-focused data, where identifying the objects and scene is often enough to distinguish the relevant caption. We instead propose LocoMotion to learn from motion-focused captions that describe the movement and temporal progression of local object motions. We achieve this by adding synthetic motions to videos and using the parameters of these motions to generate corresponding captions. Furthermore, we propose verb-variation paraphrasing to increase the caption variety and learn the link between primitive motions and high-level verbs. With this, we are able to learn a motion-focused video-language representation. Experiments demonstrate our approach is effective for a variety of downstream tasks, particularly when limited data is available for fine-tuning. Code is available: this https URL

[CV-78] Beyond Labels: A Self-Supervised Framework with Masked Autoencoders and Random Cropping for Breast Cancer Subtype Classification

Link: https://arxiv.org/abs/2410.12006
Authors: Annalisa Chiocchetti,Marco Dossena,Christopher Irwin,Luigi Portinale
Keywords: work contributes, contributes to breast, breast cancer sub-type, histopathological images, breast cancer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This work contributes to breast cancer sub-type classification using histopathological images. We utilize masked autoencoders (MAEs) to learn a self-supervised embedding tailored for computer vision tasks in this domain. This embedding captures informative representations of histopathological data, facilitating feature learning without extensive labeled datasets. During pre-training, we investigate employing a random crop technique to generate a large dataset from WSIs automatically. Additionally, we assess the performance of linear probes for multi-class classification tasks of cancer sub-types using the representations learnt by the MAE. Our approach aims to achieve strong performance on downstream tasks by leveraging the complementary strengths of ViTs and autoencoders. We evaluate our model’s performance on the BRACS dataset and compare it with existing benchmarks.

[CV-79] DDIL: Improved Diffusion Distillation With Imitation Learning

Link: https://arxiv.org/abs/2410.11971
Authors: Risheek Garrepalli,Shweta Mahajan,Munawar Hayat,Fatih Porikli
Keywords: sampling requires multiple, requires multiple denoising, multiple denoising network, denoising network passes, limiting practicality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple denoising network passes, limiting practicality. Efforts such as progressive distillation or consistency distillation have shown promise by reducing the number of passes at the expense of the quality of the generated samples. In this work, we identify covariate shift as one of the reasons for the poor performance of multi-step distilled models, arising from compounding error at inference time. To address covariate shift, we formulate diffusion distillation within an imitation learning (DDIL) framework and enhance the training distribution for distilling diffusion models on both the data distribution (forward diffusion) and student-induced distributions (backward diffusion). Training on the data distribution helps to diversify the generations by preserving the marginal data distribution, and training on the student distribution addresses compounding error by correcting covariate shift. In addition, we adopt a reflected diffusion formulation for distillation and demonstrate improved performance and stable training across different distillation methods. We show that DDIL consistently improves on the baseline algorithms of progressive distillation (PD), Latent consistency models (LCM), and Distribution Matching Distillation (DMD2).

[CV-80] Integrating Artificial Intelligence Models and Synthetic Image Data for Enhanced Asset Inspection and Defect Identification

Link: https://arxiv.org/abs/2410.11967
Authors: Reddy Mandati,Vladyslav Anderson,Po-chen Chen,Ankush Agarwal,Tatjana Dokic,David Barnard,Michael Finn,Jesse Cromer,Andrew Mccauley,Clay Tutaj,Neha Dave,Bobby Besharati,Jamie Barnett,Timothy Krall
Keywords: past utilities relied, defect detection, identify asset defects, relied on in-field, images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: In the past, utilities relied on in-field inspections to identify asset defects. Recently, utilities have started using drone-based inspections to enhance the field-inspection process. We consider a vast repository of drone images, providing a wealth of information about asset health and potential issues. However, making the collected imagery data useful for automated defect detection requires significant manual labeling effort. We propose a novel solution that combines synthetic asset defect images with manually labeled drone images. This solution has several benefits: it improves defect detection performance, reduces the number of hours spent on manual labeling, and makes it possible to generate realistic images of rare defects for which not enough real-world data is available. We employ a workflow that combines 3D modeling tools such as Maya and Unreal Engine to create photorealistic 3D models and 2D renderings of defective assets and their surroundings. These synthetic images are then integrated into our training pipeline, augmenting the real data. This study implements an end-to-end Artificial Intelligence solution to detect assets and asset defects from the combined imagery repository. The unique contribution of this research lies in the application of advanced computer vision models and the generation of photorealistic 3D renderings of defective assets, aiming to transform the asset inspection process. Our asset detection model has achieved an accuracy of 92 percent, and we achieved a performance lift of 67 percent when introducing approximately 2,000 synthetic images of 2k resolution. In our tests, the defect detection model achieved an accuracy of 73 percent across two batches of images. Our analysis demonstrated that synthetic data can be successfully used in place of real-world manually labeled data to train a defect detection model.

[CV-81] CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Link: https://arxiv.org/abs/2410.11963
Authors: Qingqing Cao,Mahyar Najibi,Sachin Mehta
Keywords: Pretraining robust vision, Pretraining robust, potentially misaligned, relies on large-scale, long-tail distributions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models such as large language models or diffusion models to reason and recompose basic elements such that synthetic samples are natural and composed in diverse ways. CtrlSynth is a closed-loop, training-free, and modular framework, making it easy to support different pretrained models. With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.

[CV-82] Dual-frame Fluid Motion Estimation with Test-time Optimization and Zero-divergence Loss NEURIPS2024

Link: https://arxiv.org/abs/2410.11934
Authors: Yifei Zhang,Huan-ang Gao,Zhou Jiang,Hao Zhao
Keywords: challenging computational problems, particle tracking velocimetry, dual-frame fluid motion, fluid motion estimation, analyzing turbulent flow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024

Abstract:3D particle tracking velocimetry (PTV) is a key technique for analyzing turbulent flow, one of the most challenging computational problems of our century. At the core of 3D PTV is the dual-frame fluid motion estimation algorithm, which tracks particles across two consecutive frames. Recently, deep learning-based methods have achieved impressive accuracy in dual-frame fluid motion estimation; however, they heavily depend on large volumes of labeled data. In this paper, we introduce a new method that is completely self-supervised and notably outperforms its fully-supervised counterparts while requiring only 1% of the training samples (without labels) used by previous methods. Our method features a novel zero-divergence loss that is specific to the domain of turbulent flow. Inspired by the success of splat operation in high-dimensional filtering and random fields, we propose a splat-based implementation for this loss which is both efficient and effective. The self-supervised nature of our method naturally supports test-time optimization, leading to the development of a tailored Dynamic Velocimetry Enhancer (DVE) module. We demonstrate that strong cross-domain robustness is achieved through test-time optimization on unseen leave-one-out synthetic domains and real physical/biological domains. Code, data and models are available at this https URL.
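
A zero-divergence penalty can be sketched as follows. This is a simplified stand-in, not the paper's splat-based implementation: it assumes the velocity field is available as a function and estimates divergence with central finite differences. All names and the two example fields are hypothetical.

```python
def divergence(field, point, h=1e-4):
    """Central-difference estimate of div(field) at a 3D point."""
    total = 0.0
    for axis in range(3):
        plus = list(point)
        minus = list(point)
        plus[axis] += h
        minus[axis] -= h
        total += (field(plus)[axis] - field(minus)[axis]) / (2.0 * h)
    return total


def zero_divergence_loss(field, points):
    """Mean squared divergence over sample points; 0 for a divergence-free field."""
    return sum(divergence(field, p) ** 2 for p in points) / len(points)


def swirl(p):
    return (-p[1], p[0], 0.0)   # rotation about z: divergence-free


def source(p):
    return (p[0], p[1], p[2])   # radial expansion: divergence 3 everywhere


points = [(0.1, 0.2, 0.3), (1.0, -1.0, 0.5)]
swirl_loss = zero_divergence_loss(swirl, points)
source_loss = zero_divergence_loss(source, points)
```

The rotational field incurs essentially no penalty, while the expanding field is penalised, which is the behaviour one would want when the flow is assumed incompressible.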

[CV-83] Development and Testing of a Wood Panels Bark Removal Equipment Based on Deep Learning

Link: https://arxiv.org/abs/2410.11913
Authors: Rijun Wang,Guanghao Zhang,Hongyang Chen,Xinye Yu,Yesheng Chen,Fulong Liang,Xiangwei Mou,Bo Wang
Keywords: panels bark removal, wood panels bark, bark removal, bark removal equipment, Attempting to apply
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract: Applying deep learning methods to wood panel bark removal equipment to enhance the quality and efficiency of bark removal is a significant and challenging endeavor. This study develops and tests deep learning-based wood panel bark removal equipment. In accordance with the practical requirements of sawmills, bark removal equipment with a vision inspection system is designed. Based on a substantial collection of wood panel images obtained using the vision inspection system, the first general wood panel semantic segmentation dataset is constructed for training the BiSeNetV1 model employed in this study. Furthermore, the calculation methods and processes for the key data required in the bark removal process are presented in detail. Comparative experiments of the BiSeNetV1 model and tests of bark removal effectiveness are conducted in both laboratory and sawmill environments. The results of the comparative experiments indicate that the application of the BiSeNetV1 segmentation model is rational and feasible. The results of the bark removal effectiveness tests demonstrate a significant improvement in both the quality and efficiency of bark removal. The developed equipment fully meets the sawmill’s requirements for precision and efficiency in bark removal processing.

[CV-84] Neural Metamorphosis ECCV2024

Link: https://arxiv.org/abs/2410.11878
Authors: Xingyi Yang,Xinchao Wang
Keywords: termed Neural Metamorphosis, learning paradigm termed, paradigm termed Neural, build self-morphable neural, Neural Metamorphosis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in ECCV2024, this https URL

Abstract: This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks. Contrary to crafting separate models for different architectures or sizes, NeuMeta directly learns the continuous weight manifold of neural networks. Once trained, we can sample weights for any-sized network directly from the manifold, even for previously unseen configurations, without retraining. To achieve this ambitious goal, NeuMeta trains neural implicit functions as hypernetworks. They accept coordinates within the model space as input, and generate corresponding weight values on the manifold. In other words, the implicit function is learned so that the predicted weights perform well across various model sizes. In training those models, we notice that the final performance closely relates to the smoothness of the learned manifold. In pursuit of enhancing this smoothness, we employ two strategies. First, we permute weight matrices to achieve intra-model smoothness by solving the Shortest Hamiltonian Path problem. Second, we add noise to the input coordinates when training the implicit function, ensuring that models of various sizes show consistent outputs. As such, NeuMeta shows promising results in synthesizing parameters for various network configurations. Our extensive tests in image classification, semantic segmentation, and image generation reveal that NeuMeta sustains full-size performance even at a 75% compression rate.

[CV-85] Comparing Zealous and Restrained AI Recommendations in a Real-World Human-AI Collaboration Task

Link: https://arxiv.org/abs/2410.11860
Authors: Chengyuan Xu,Kuo-Chin Lien,Tobias Höllerer
Keywords: AI-assisted decision-making system, decision-making system, designing an AI-assisted, AI-assisted decision-making, tradeoff between precision
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, accepted to ACM CHI 2023

Abstract:When designing an AI-assisted decision-making system, there is often a tradeoff between precision and recall in the AI’s recommendations. We argue that careful exploitation of this tradeoff can harness the complementary strengths in the human-AI collaboration to significantly improve team performance. We investigate a real-world video anonymization task for which recall is paramount and more costly to improve. We analyze the performance of 78 professional annotators working with a) no AI assistance, b) a high-precision “restrained” AI, and c) a high-recall “zealous” AI in over 3,466 person-hours of annotation work. In comparison, the zealous AI helps human teammates achieve significantly shorter task completion time and higher recall. In a follow-up study, we remove AI assistance for everyone and find negative training effects on annotators trained with the restrained AI. These findings and our analysis point to important implications for the design of AI assistance in recall-demanding scenarios.
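
The precision-recall tradeoff behind the "restrained" and "zealous" settings can be illustrated with the standard definitions. The counts below are invented purely for illustration and are not results from the study.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical counts, chosen only to illustrate the tradeoff.
restrained = precision_recall(tp=60, fp=5, fn=40)   # few false alarms, many misses
zealous = precision_recall(tp=95, fp=60, fn=5)      # many false alarms, few misses
```

With these made-up numbers the restrained setting has the higher precision and the zealous setting the higher recall, which is the regime the abstract describes for a task where missed detections are the costlier error.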

[CV-86] A Robust Multisource Remote Sensing Image Matching Method Utilizing Attention and Feature Enhancement Against Noise Interference

Link: https://arxiv.org/abs/2410.11848
Authors: Yuan Li,Dapeng Wu,Yaping Cui,Peng He,Yuan Zhang,Ruyan Wang
Keywords: remote sensing image, multisource remote sensing, remote sensing, sensing image applications, sensing image
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 13 figures

Abstract:Image matching is a fundamental and critical task of multisource remote sensing image applications. However, remote sensing images are susceptible to various kinds of noise. Accordingly, achieving accurate matching in noisy images is a challenging problem. To solve this issue, we propose a robust multisource remote sensing image matching method utilizing attention and feature enhancement against noise interference. In the first stage, we combine deep convolution with the attention mechanism of transformer to perform dense feature extraction, constructing feature descriptors with higher discriminability and robustness. Subsequently, we employ a coarse-to-fine matching strategy to achieve dense matches. In the second stage, we introduce an outlier removal network based on a binary classification mechanism, which can establish effective and geometrically consistent correspondences between images; by weighting each correspondence, inliers are separated from outliers, and the outliers are removed from the dense matches. Ultimately, we can accomplish more efficient and accurate matches. To validate the performance of the proposed method, we conduct experiments using multisource remote sensing image datasets for comparison with other state-of-the-art methods under different scenarios, including noise-free, additive random noise, and periodic stripe noise. Comparative results indicate that the proposed method offers better-balanced performance and robustness. The proposed method contributes a valuable reference for solving the difficult problem of noisy image matching.

[CV-87] Cascade learning in multi-task encoder-decoder networks for concurrent bone segmentation and glenohumeral joint assessment in shoulder CT scans

Link: https://arxiv.org/abs/2410.12641
Authors: Luca Marsilio, Davide Marzorati, Matteo Rossi, Andrea Moglia, Luca Mainardi, Alfonso Manzotti, Pietro Cerveri
Keywords: degenerative condition affecting, bone density loss, condition affecting bones, joint space narrowing, density loss
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Osteoarthritis is a degenerative condition affecting bones and cartilage, often leading to osteophyte formation, bone density loss, and joint space narrowing. Treatment options to restore normal joint function vary depending on the severity of the condition. This work introduces an innovative deep-learning framework processing shoulder CT scans. It features the semantic segmentation of the proximal humerus and scapula, the 3D reconstruction of bone surfaces, the identification of the glenohumeral (GH) joint region, and the staging of three common osteoarthritic-related pathologies: osteophyte formation (OS), GH space reduction (JS), and humeroscapular alignment (HSA). The pipeline comprises two cascaded CNN architectures: 3D CEL-UNet for segmentation and 3D Arthro-Net for threefold classification. A retrospective dataset of 571 CT scans featuring patients with various degrees of GH osteoarthritic-related pathologies was used to train, validate, and test the pipeline. Root mean squared error and Hausdorff distance median values for 3D reconstruction were 0.22mm and 1.48mm for the humerus and 0.24mm and 1.48mm for the scapula, outperforming state-of-the-art architectures and making it potentially suitable for a PSI-based shoulder arthroplasty preoperative plan context. The classification accuracy for OS, JS, and HSA consistently reached around 90% across all three categories. The computational time for the inference pipeline was less than 15s, showcasing the framework’s efficiency and compatibility with orthopedic radiology practice. The outcomes represent a promising advancement toward the medical translation of artificial intelligence tools. This progress aims to streamline the preoperative planning pipeline delivering high-quality bone surfaces and supporting surgeons in selecting the most suitable surgical approach according to the unique patient joint conditions.

[CV-88] From Lab to Pocket: A Novel Continual Learning-based Mobile Application for Screening COVID-19

Link: https://arxiv.org/abs/2410.12589
Authors: Danny Falero, Muhammad Ashad Kabir, Nusrat Homaira
Keywords: Artificial intelligence, continual learning, medical images, learning, continual
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:Artificial intelligence (AI) has emerged as a promising tool for predicting COVID-19 from medical images. In this paper, we propose a novel continual learning-based approach and present the design and implementation of a mobile application for screening COVID-19. Our approach demonstrates the ability to adapt to evolving datasets, including data collected from different locations or hospitals, varying virus strains, and diverse clinical presentations, without retraining from scratch. We have evaluated state-of-the-art continual learning methods for detecting COVID-19 from chest X-rays and selected the best-performing model for our mobile app. We evaluated various deep learning architectures to select the best-performing one as a foundation model for continual learning. Both regularization and memory-based methods for continual learning were tested, using different memory sizes to develop the optimal continual learning model for our app. DenseNet161 emerged as the best foundation model with 96.87% accuracy, and Learning without Forgetting (LwF) was the top continual learning method with an overall performance of 71.99%. The mobile app design considers both patient and doctor perspectives. It incorporates the continual learning DenseNet161 LwF model on a cloud server, enabling the model to learn from new instances of chest X-rays and their classifications as they are submitted. The app is designed, implemented, and evaluated to ensure it provides an efficient tool for COVID-19 screening. The app is available to download from this https URL.

[CV-89] Self-DenseMobileNet: A Robust Framework for Lung Nodule Classification using Self-ONN and Stacking-based Meta-Classifier

Link: https://arxiv.org/abs/2410.12584
Authors: Md. Sohanur Rahman, Muhammad E. H. Chowdhury, Hasib Ryan Rahman, Mosabber Uddin Ahmed, Muhammad Ashad Kabir, Sanjiban Sekhar Roy, Rusab Sarmun
Keywords: chest radiographs, non-nodules in chest, designed to enhance, improving classification accuracy, classification
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:In this study, we propose a novel and robust framework, Self-DenseMobileNet, designed to enhance the classification of nodules and non-nodules in chest radiographs (CXRs). Our approach integrates advanced image standardization and enhancement techniques to optimize the input quality, thereby improving classification accuracy. To enhance predictive accuracy and leverage the strengths of multiple models, the prediction probabilities from Self-DenseMobileNet were transformed into tabular data and used to train eight classical machine learning (ML) models; the top three performers were then combined via a stacking algorithm, creating a robust meta-classifier that integrates their collective insights for superior classification performance. To enhance the interpretability of our results, we employed class activation mapping (CAM) to visualize the decision-making process of the best-performing model. Our proposed framework demonstrated remarkable performance on internal validation data, achieving an accuracy of 99.28% using a Meta-Random Forest Classifier. When tested on an external dataset, the framework maintained strong generalizability with an accuracy of 89.40%. These results highlight a significant improvement in the classification of CXRs with lung nodules.

[CV-90] Evaluating Utility of Memory Efficient Medical Image Generation: A Study on Lung Nodule Segmentation

Link: https://arxiv.org/abs/2410.12542
Authors: Kathrin Khadra, Utku Türkbey
Keywords: imaging data limits, scarcity of publicly, limits the development, development of effective, synthetic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The scarcity of publicly available medical imaging data limits the development of effective AI models. This work proposes a memory-efficient patch-wise denoising diffusion probabilistic model (DDPM) for generating synthetic medical images, focusing on CT scans with lung nodules. Our approach generates high-utility synthetic images with nodule segmentation while efficiently managing memory constraints, enabling the creation of training datasets. We evaluate the method in two scenarios: training a segmentation model exclusively on synthetic data, and augmenting real-world training data with synthetic images. In the first case, models trained solely on synthetic data achieve Dice scores comparable to those trained on real-world data benchmarks. In the second case, augmenting real-world data with synthetic images significantly improves segmentation performance. The generated images demonstrate their potential to enhance medical image datasets in scenarios with limited real-world data.

[CV-91] A Primal-dual algorithm for image reconstruction with ICNNs

Link: https://arxiv.org/abs/2410.12441
Authors: Hok Shing Wong, Matthias J. Ehrhardt, Subhadip Mukherjee
Keywords: variational reconstruction framework, data-driven variational reconstruction, input-convex neural network, reconstruction framework, regularizer is parameterized
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

Abstract:We address the optimization problem in a data-driven variational reconstruction framework, where the regularizer is parameterized by an input-convex neural network (ICNN). While gradient-based methods are commonly used to solve such problems, they struggle to effectively handle non-smoothness which often leads to slow convergence. Moreover, the nested structure of the neural network complicates the application of standard non-smooth optimization techniques, such as proximal algorithms. To overcome these challenges, we reformulate the problem and eliminate the network’s nested structure. By relating this reformulation to epigraphical projections of the activation functions, we transform the problem into a convex optimization problem that can be efficiently solved using a primal-dual algorithm. We also prove that this reformulation is equivalent to the original variational problem. Through experiments on several imaging tasks, we demonstrate that the proposed approach outperforms subgradient methods in terms of both speed and stability.

[CV-92] Attention-Guided Perturbation for Consistency Regularization in Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2410.12419
Authors: Yuxuan Cheng, Chenxi Shao, Jie Ma, Guoliang Li
Keywords: Medical image segmentation, image segmentation, therapeutic processes, Medical image, pivotal step
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical image segmentation is a pivotal step in diagnostic and therapeutic processes. However, the acquisition of high-quality annotated data is often constrained by scarcity and cost. Semi-supervised learning offers a promising approach to enhance model performance by using unlabeled data. While consistency regularization is a prevalent method in semi-supervised image segmentation, there is a dearth of research on perturbation strategies tailored for semi-supervised medical image segmentation tasks. This paper introduces an attention-guided perturbation strategy for semi-supervised consistency regularization in the context of medical image segmentation. We add the perturbation based on the attention from the model in the image and feature level to achieve consistency regularization. The method is adept at accommodating the intricate structures and high-dimensional semantics inherent in medical images, thereby enhancing the performance of semi-supervised segmentation tasks. Our method achieved state-of-the-art results on benchmark datasets, including a 90.4% Dice score on the ACDC dataset in the 7-case scenario.
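
For readers unfamiliar with the Dice score quoted in this abstract, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), for two binary masks. The helper name `dice_score`, the toy masks, and the convention that two empty masks score 1.0 are assumptions for illustration; the paper's evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal size."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 3x4 masks: the prediction misplaces one foreground pixel.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
target = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]

print(dice_score(pred, target))
```

Both masks contain five foreground pixels and share four of them, so the score is 2·4 / (5 + 5) = 0.8.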

[CV-93] De-Identification of Medical Imaging Data: A Comprehensive Tool for Ensuring Patient Privacy

Link: https://arxiv.org/abs/2410.12402
Authors: Moritz Rempe, Lukas Heine, Constantin Seibold, Fabian Hörst, Jens Kleesiek
Keywords: Health Insurance Portability, Data Protection Regulation, General Data Protection, patient health information, sensitive patient health
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical data employed in research frequently comprises sensitive patient health information (PHI), which is subject to rigorous legal frameworks such as the General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA). Consequently, these types of data must be pseudonymized prior to utilisation, which presents a significant challenge for many researchers. Given the vast array of medical data, it is necessary to employ a variety of de-identification techniques. To facilitate the anonymization process for medical imaging data, we have developed an open-source tool that can be used to de-identify DICOM magnetic resonance images, computer tomography images, whole slide images and magnetic resonance twix raw data. Furthermore, the implementation of a neural network enables the removal of text within the images. The proposed tool automates an elaborate anonymization pipeline for multiple types of inputs, reducing the need for additional tools used for de-identification of imaging data. We make our code publicly available at this https URL.

[CV-94] Advancing Healthcare: Innovative ML Approaches for Improved Medical Imaging in Data-Constrained Environments

Link: https://arxiv.org/abs/2410.12245
Authors: Al Amin, Kamrul Hasan, Saleh Zein-Sabatto, Liang Hong, Sachin Shetty, Imtiaz Ahmed, Tariqul Islam
Keywords: Healthcare industries face, industries face challenges, experiencing rare diseases, rare diseases due, Healthcare industries
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 7 figures

Abstract:Healthcare industries face challenges with rare diseases because samples are limited. Artificial Intelligence (AI) communities often address this by creating synthetic data, which raises ethical and privacy issues in the medical domain. This research introduces the CAT-U-Net framework as a new approach to overcome these limitations, which enhances feature extraction from medical images without the need for large datasets. The proposed framework adds an extra concatenation layer with downsampling parts, thereby improving its ability to learn from limited data while maintaining patient privacy. To validate the proposed framework's robustness, datasets for different medical conditions were used, including COVID-19, brain tumors, and wrist fractures. The framework achieved nearly 98% reconstruction accuracy, with a Dice coefficient close to 0.946. The proposed CAT-U-Net has the potential to make a big difference in medical image diagnostics in settings with limited data.

[CV-95] Method for Evaluating the Number of Signal Sources and Application to Non-invasive Brain-computer Interface

Link: https://arxiv.org/abs/2410.11844
Authors: Alexandra Bernadotte, Victor Buchstaber
Keywords: time series unfolding, brain-computer interface, non-invasive brain-computer interface, brain-computer interface sensors
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: 13 pages, 8 figures

Abstract:This paper provides a brief introduction of the mathematical theory behind the time series unfolding method. The algorithms presented serve as a valuable mathematical and analytical tool for analyzing data collected from brain-computer interfaces. In our study, we implement a mathematical model based on polyharmonic signals to interpret the data from brain-computer interface sensors. The analysis of data coming to the brain-computer interface sensors is based on a mathematical model of the signal in the form of a polyharmonic signal. Our main focus is on addressing the problem of evaluating the number of sources, or active brain oscillators. The efficiency of our approach is demonstrated through analysis of data recorded from a non-invasive brain-computer interface developed by the author.

[CV-96] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
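
The two ideas named in this abstract, per-token head selection and a weighted rather than plain summation over heads, can be sketched as follows. This is a simplified illustration and not the authors' implementation: the top-k routing rule, the softmax over the selected scores, and all names and values (`mixture_of_heads`, `router_scores`, `k`) are assumptions, and the per-head outputs are taken as given rather than computed by attention.

```python
import math


def standard_sum(head_outputs):
    """Summation form: the token output is the plain sum of per-head outputs."""
    dim = len(head_outputs[0])
    return [sum(head[d] for head in head_outputs) for d in range(dim)]


def mixture_of_heads(head_outputs, router_scores, k):
    """Weighted sum over the k heads with the highest router scores.

    Returns the combined output and the weight assigned to each selected head.
    """
    ranked = sorted(range(len(head_outputs)),
                    key=lambda i: router_scores[i], reverse=True)
    selected = ranked[:k]
    # Softmax restricted to the selected heads (assumed normalization).
    top = max(router_scores[i] for i in selected)
    exps = {i: math.exp(router_scores[i] - top) for i in selected}
    norm = sum(exps.values())
    weights = {i: exps[i] / norm for i in selected}
    dim = len(head_outputs[0])
    output = [sum(weights[i] * head_outputs[i][d] for i in selected)
              for d in range(dim)]
    return output, weights


# Hypothetical per-head outputs for one token (4 heads, dimension 2).
head_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
router_scores = [2.0, 0.5, 2.0, -1.0]

dense = standard_sum(head_outputs)
sparse, weights = mixture_of_heads(head_outputs, router_scores, k=2)
print("all heads, plain sum:", dense)
print("2 of 4 heads, weighted sum:", sparse, weights)
```

With `k=2` only heads 0 and 2 are evaluated for this token, which is where the efficiency gain described in the abstract would come from; unselected heads contribute nothing to the output.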

Machine Learning

[LG-0] Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.12790
Authors: Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
Keywords: holds significant, real-world scenarios, generalize to diverse, diverse data, data with unlabeled
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024. Project page: this https URL

Abstract:Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes–textual and visual–to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at this https URL.

[LG-1] Metal Price Spike Prediction via a Neurosymbolic Ensemble Approach

Link: https://arxiv.org/abs/2410.12785
Authors: Nathaniel Lee, Noel Ngu, Harshdeep Singh Sahdev, Pramod Motaganahall, Al Mehdi Saadat Chowdhury, Bowen Xi, Paulo Shakarian
Keywords: Predicting price spikes, mitigating economic risks, Predicting price, Nickel is crucial, reshoring of manufacturing
Categories: Machine Learning (cs.LG)

Abstract:Predicting price spikes in critical metals such as Cobalt, Copper, Magnesium, and Nickel is crucial for mitigating economic risks associated with global trends like the energy transition and reshoring of manufacturing. While traditional models have focused on regression-based approaches, our work introduces a neurosymbolic ensemble framework that integrates multiple neural models with symbolic error detection and correction rules. This framework is designed to enhance predictive accuracy by correcting individual model errors and offering interpretability through rule-based explanations. We show that our method provides up to a 6.42% improvement in precision, a 29.41% increase in recall, and a 13.24% increase in F1 over the best performing neural models. Further, our method, as it is based on logical rules, has the benefit of affording an explanation as to which combination of neural models directly contributes to a given prediction.

[LG-2] JudgeBench: A Benchmark for Evaluating LLM-based Judges

Link: https://arxiv.org/abs/2410.12784
Authors: Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica
Keywords: LLM-based judges, scalable alternative, judges, LLM-based, human
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint

Abstract:LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them. Existing benchmarks primarily focus on a judge’s alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges. Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing. Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges. Data and code are available at this https URL .

[LG-3] Context-Scaling versus Task-Scaling in In-Context Learning

Link: https://arxiv.org/abs/2410.12783
Authors: Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin
Keywords: model performance improves, exhibit In-Context Learning, Transformers exhibit In-Context, additional training, prompt without additional
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Transformers exhibit In-Context Learning (ICL), where these models solve new tasks by using examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer architecture without key, query, value weights. We show that it performs ICL comparably to the original GPT-2 model in various statistical learning tasks including linear regression, teacher-student settings. Furthermore, a single block of our simplified transformer can be viewed as data dependent feature map followed by an MLP. This feature map on its own is a powerful predictor that is capable of context-scaling but is not capable of task-scaling. We show empirically that concatenating the output of this feature map with vectorized data as an input to MLPs enables both context-scaling and task-scaling. This finding provides a simple setting to study context and task-scaling for ICL.

[LG-4] Geometry-Aware Generative Autoencoders for Warped Riemannian Metric Learning and Generative Modeling on Data Manifolds

Link: https://arxiv.org/abs/2410.12779
Authors: Xingzhi Sun, Danqi Liao, Kincaid MacDonald, Yanlei Zhang, Chen Liu, Guillaume Huguet, Guy Wolf, Ian Adelstein, Tim G. J. Rudner, Smita Krishnaswamy
Keywords: presents unique computational, single-cell RNA sequencing, Rapid growth, RNA sequencing, scientific discovery
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)

Abstract:Rapid growth of high-dimensional datasets in fields such as single-cell RNA sequencing and spatial genomics has led to unprecedented opportunities for scientific discovery, but it also presents unique computational and statistical challenges. Traditional methods struggle with geometry-aware data generation, interpolation along meaningful trajectories, and transporting populations via feasible paths. To address these issues, we introduce Geometry-Aware Generative Autoencoder (GAGA), a novel framework that combines extensible manifold learning with generative modeling. GAGA constructs a neural network embedding space that respects the intrinsic geometries discovered by manifold learning and learns a novel warped Riemannian metric on the data space. This warped metric is derived from both the points on the data manifold and negative samples off the manifold, allowing it to characterize a meaningful geometry across the entire latent space. Using this metric, GAGA can uniformly sample points on the manifold, generate points along geodesics, and interpolate between populations across the learned manifold. GAGA shows competitive performance in simulated and real world datasets, including a 30% improvement over the state-of-the-art methods in single-cell population-level trajectory inference.

[LG-5] Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Link: https://arxiv.org/abs/2410.12777
Authors: Hongcheng Gao, Tianyu Pang, Chao Du, Taihang Hu, Zhijie Deng, Min Lin
Keywords: diffusion-based content generation, potential model misuse, prevent potential model, content generation, significant efforts
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies. Our code is available at this https URL.

[LG-6] The Non-Local Model Merging Problem: Permutation Symmetries and Variance Collapse

Link: https://arxiv.org/abs/2410.12766
Authors: Ekansh Sharma, Daniel M. Roy, Gintare Karolina Dziugaite
Keywords: multiple expert models, expert models, common foundation model, Model merging aims, combine expert models
Categories: Machine Learning (cs.LG)

Abstract:Model merging aims to efficiently combine the weights of multiple expert models, each trained on a specific task, into a single multi-task model, with strong performance across all tasks. When applied to all but the last layer of weights, existing methods – such as Task Arithmetic, TIES-merging, and TALL mask merging – work well to combine expert models obtained by fine-tuning a common foundation model, operating within a “local” neighborhood of the foundation model. This work explores the more challenging scenario of “non-local” merging, which we find arises when an expert model changes significantly during pretraining or where the expert models do not even share a common foundation model. We observe that standard merging techniques often fail to generalize effectively in this non-local setting, even when accounting for permutation symmetries using standard techniques. We identify that this failure is, in part, due to “variance collapse”, a phenomenon identified also in the setting of linear mode connectivity by Jordan et al. (2023). To address this, we propose a multi-task technique to re-scale and shift the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models. Our experiments demonstrate that this correction significantly improves the performance of various model merging approaches in non-local settings, providing a strong baseline for future research on this problem.
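To make the two ideas in this abstract concrete, here is a minimal sketch in plain Python. It assumes weights and activations are flat lists of floats; `task_arithmetic` and `rescale_and_shift` are hypothetical helper names, and matching the mean and standard deviation is only one reading of "aligning output statistics", not necessarily the paper's exact procedure.

```python
from statistics import mean, pstdev


def task_arithmetic(base, experts, alpha=1.0):
    """Add the scaled task vectors (expert - base) of every expert to the base weights."""
    merged = list(base)
    for expert in experts:
        for i, (w_expert, w_base) in enumerate(zip(expert, base)):
            merged[i] += alpha * (w_expert - w_base)
    return merged


def rescale_and_shift(merged_out, expert_out):
    """Affinely map merged-model activations so their mean and std match the expert's.

    Assumption: statistics are computed over a batch of scalar activations.
    """
    mu_m, sd_m = mean(merged_out), pstdev(merged_out)
    mu_e, sd_e = mean(expert_out), pstdev(expert_out)
    scale = sd_e / sd_m if sd_m > 0 else 1.0
    return [(x - mu_m) * scale + mu_e for x in merged_out]


merged = task_arithmetic([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]])
corrected = rescale_and_shift([0.0, 0.1, 0.2], [1.0, 2.0, 3.0])
```

In the second call the merged activations have a much smaller spread than the expert's, which is the kind of mismatch the abstract calls variance collapse; the affine map restores the expert's spread and location.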

[LG-7] SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Link: https://arxiv.org/abs/2410.12761
Authors: Jaehong Yoon, Shoubin Yu, Vaidehi Patil, Huaxiu Yao, Mohit Bansal
Keywords: Recent advances, significantly enhanced, enhanced their ability, ability to generate, increased the risk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The first two authors contributed equally; Project page: this https URL

Abstract:Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To address these, we propose SAFREE, a novel, training-free approach for safe T2I and T2V, that does not alter the model’s weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. In the end, SAFREE ensures coherent safety checking, preserving the fidelity, quality, and safety of the output. SAFREE achieves SOTA performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
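The core geometric step described above, steering an embedding away from a concept subspace, can be illustrated with an orthogonal projection. This is a sketch under stated assumptions: embeddings are plain lists of floats, the subspace is spanned by a few hypothetical concept vectors, and `steer_away` is an illustrative name. It does not reproduce SAFREE's self-validating filtering or re-attention steps.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def steer_away(embedding, concept_vectors, strength=1.0):
    """Remove (a fraction of) the component of `embedding` lying in the concept subspace."""
    out = list(embedding)
    for b in orthonormal_basis(concept_vectors):
        c = dot(embedding, b)
        out = [oi - strength * c * bi for oi, bi in zip(out, b)]
    return out


filtered = steer_away([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

With `strength=1.0` the result is orthogonal to every concept vector; a smaller value keeps part of the original component, which is one simple way to trade filtering against preserving the prompt's meaning.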

[LG-8] StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

Link: https://arxiv.org/abs/2410.12757
Authors: Ajay Patel, Jiacheng Zhu, Justin Qiu, Zachary Horvitz, Marianna Apidianaki, Kathleen McKeown, Chris Callison-Burch
Keywords: similar writing styles, writing styles closely, embed texts, texts with similar, Style representations aim
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Our model can be found at this https URL .
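As an illustration of the contrastive triplets mentioned above, the sketch below computes a margin-based triplet loss over cosine distances. The loss form, the margin value, and the function names are assumptions for illustration; the abstract does not state the exact objective used for StyleDistance.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Penalize triplets where the same-style text is not closer than the other-style text."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # positive is closer
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])   # positive is farther
```

When the positive and negative are near-exact paraphrases that differ only in style, as in the synthetic data described above, the only signal left to separate them is style, which is the point of the content-controlled construction.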

[LG-9] Initialization Method for Factorization Machine Based on Low-Rank Approximation for Constructing a Corrected Approximate Ising Model

Link: https://arxiv.org/abs/2410.12747
Authors: Yuya Seki, Hyakka Nakada, Shu Tanaka
Keywords: approximate Ising model, machine learning model, Factorization Machine, Ising model, approximate Ising
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 5 figures

Abstract:This paper presents an initialization method that can approximate a given approximate Ising model with a high degree of accuracy using the Factorization Machine (FM), a machine learning model. The construction of Ising models using FM is applied to the combinatorial optimization problem using the factorization machine with quantum annealing. It is anticipated that the optimization performance of FMQA will be enhanced through the implementation of the warm-start method. Nevertheless, the optimal initialization method for leveraging the warm-start approach in FMQA remains undetermined. Consequently, the present study compares a number of initialization methods and identifies the most appropriate for use with a warm-start in FMQA through numerical experimentation. Furthermore, the properties of the proposed FM initialization method are analyzed using random matrix theory, demonstrating that the approximation accuracy of the proposed method is not significantly influenced by the specific Ising model under consideration. The findings of this study will facilitate the advancement of combinatorial optimization problem-solving through the use of Ising machines.

[LG-10] CREAM: Consistency Regularized Self-Rewarding Language Models

Link: https://arxiv.org/abs/2410.12735
Authors: Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
Keywords: Recent self-rewarding large, Recent self-rewarding, large language models, preference data, successfully applied
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Recent self-rewarding large language models (LLM) have successfully applied LLM-as-a-Judge to iteratively improve the alignment performance without the need of human annotations for preference data. These methods commonly utilize the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment technologies (e.g. DPO). However, it is noteworthy that throughout this process, there is no guarantee of accuracy in the rewarding and ranking, which is critical for ensuring accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze the generalized iterative preference fine-tuning framework for self-rewarding language model. We then introduce the regularization to this generalized framework to mitigate the overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the rewarding consistency across different iterations to regularize the self-rewarding training, helping the model to learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at this https URL.
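One simple way to quantify "rewarding consistency across different iterations" is the fraction of response pairs that two reward models order the same way. The sketch below is an assumption-laden illustration: `pairwise_agreement` is a hypothetical helper, and the abstract does not say that CREAM measures consistency exactly this way.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of response pairs ranked in the same order by two sets of reward scores."""
    pairs = list(combinations(range(len(scores_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        1
        for i, j in pairs
        if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) > 0
    )
    return agree / len(pairs)


# Rewards for three responses from the previous and current iteration (made-up numbers).
consistency = pairwise_agreement([3.0, 2.0, 1.0], [3.0, 1.0, 2.0])
```

A low agreement value would flag preference pairs whose ordering is unstable between iterations, and such pairs could be down-weighted during training.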

[LG-11] Counterfactual Generative Modeling with Variational Causal Inference

Link: https://arxiv.org/abs/2410.12730
Authors: Yulun Wu, Louie McConnell, Claudia Iriondo
Keywords: supervised learning approaches, individual potential outcomes, counterfactual generative modeling, gene expressions, facial images
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Estimating an individual’s potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, facial images) and covariates are relatively limited. In this case, to predict one’s outcomes under counterfactual treatments, it is crucial to leverage individual information contained in its high-dimensional observed outcome in addition to the covariates. Prior works using variational inference in counterfactual generative modeling have been focusing on neural adaptations and model variants within the conditional variational autoencoder formulation, which we argue is fundamentally ill-suited to the notion of counterfactual in causal inference. In this work, we present a novel variational Bayesian causal inference framework and its theoretical backings to properly handle counterfactual generative modeling tasks, through which we are able to conduct counterfactual supervision end-to-end during training without any counterfactual samples, and encourage latent disentanglement that aids the correct identification of causal effect in counterfactual generations. In experiments, we demonstrate the advantage of our framework compared to state-of-the-art models in counterfactual generative modeling on multiple benchmarks.

[LG-12] Transformer based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches

Link: https://arxiv.org/abs/2410.12728
Authors: Antonio Pérez, Mario Santa Cruz, Daniel San Martín, José Manuel Gutiérrez
Keywords: producing high-resolution climate, promising cost-effective downscaling, cost-effective downscaling methodology, high-resolution climate information, promising cost-effective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Super-resolution (SR) is a promising cost-effective downscaling methodology for producing high-resolution climate information from coarser counterparts. A particular application is downscaling regional reanalysis outputs (predictand) from the driving global counterparts (predictor). This study conducts an intercomparison of various SR downscaling methods focusing on temperature and using the CERRA reanalysis (5.5 km resolution, produced with a regional atmospheric model driven by ERA5) as example. The method proposed in this work is the Swin transformer and two alternative methods are used as benchmark (fully convolutional U-Net and convolutional and dense DeepESD) as well as the simple bicubic interpolation. We compare two approaches, the standard one using the full domain as input and a more scalable tiling approach, dividing the full domain into tiles that are used as input. The methods are trained to downscale CERRA surface temperature, based on temperature information from the driving ERA5; in addition, the tiling approach includes static orographic information. We show that the tiling approach, which requires spatial transferability, comes at the cost of a lower performance (although it outperforms some full-domain benchmarks), but provides an efficient scalable solution that allows SR reduction on a pan-European scale and is valuable for real-time applications.
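The tiling approach contrasted with full-domain input above can be pictured as splitting a 2D field into tiles, processing each tile independently, and stitching the results back together. The sketch below assumes non-overlapping tiles over a list-of-lists grid and uses the identity in place of the SR model; the function names and tile sizes are illustrative, not taken from the paper.

```python
def split_into_tiles(grid, tile_h, tile_w):
    """Return (row, col, tile) triples for non-overlapping tiles; edge tiles may be smaller."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    tiles = []
    for r in range(0, n_rows, tile_h):
        for c in range(0, n_cols, tile_w):
            tile = [row[c:c + tile_w] for row in grid[r:r + tile_h]]
            tiles.append((r, c, tile))
    return tiles


def stitch_tiles(tiles, n_rows, n_cols):
    """Place each tile back at its (row, col) offset to rebuild the full field."""
    out = [[None] * n_cols for _ in range(n_rows)]
    for r, c, tile in tiles:
        for i, row in enumerate(tile):
            for j, value in enumerate(row):
                out[r + i][c + j] = value
    return out


field = [[10 * r + c for c in range(6)] for r in range(4)]  # toy 4x6 temperature field
tiles = split_into_tiles(field, tile_h=2, tile_w=3)
rebuilt = stitch_tiles(tiles, n_rows=4, n_cols=6)
```

Because every tile is handled by the same model regardless of its location, the model must transfer across regions, which is the spatial transferability requirement the abstract refers to.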

[LG-13] Optimizing 3D Geometry Reconstruction from Implicit Neural Representations

Link: https://arxiv.org/abs/2410.12725
Authors: Shen Fan, Przemyslaw Musialski
Keywords: offering unparalleled advantages, Implicit neural representations, tool in learning, offering unparalleled, Implicit neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Implicit neural representations have emerged as a powerful tool in learning 3D geometry, offering unparalleled advantages over conventional representations like mesh-based methods. A common type of INR implicitly encodes a shape’s boundary as the zero-level set of the learned continuous function and learns a mapping from a low-dimensional latent space to the space of all possible shapes represented by its signed distance function. However, most INRs struggle to retain high-frequency details, which are crucial for accurate geometric depiction, and they are computationally expensive. To address these limitations, we present a novel approach that both reduces computational expenses and enhances the capture of fine details. Our method integrates periodic activation functions, positional encodings, and normals into the neural network architecture. This integration significantly enhances the model’s ability to learn the entire space of 3D shapes while preserving intricate details and sharp features, areas where conventional representations often fall short.

[LG-14] How Does Variance Shape the Regret in Contextual Bandits? NEURIPS2024

Link: https://arxiv.org/abs/2410.12713
Authors: Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei
Keywords: realizable contextual bandits
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024

Abstract:We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension $d_{\text{elu}}$ - a complexity measure of the function class - plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner’s action. In this setting, we prove that a regret of $\Omega(\sqrt{\min\{A, d_{\text{elu}}\}\Lambda} + d_{\text{elu}})$ is unavoidable when $d_{\text{elu}} \leq \sqrt{AT}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A \leq d_{\text{elu}}$ regime, we derive a nearly matching upper bound $\tilde{O}(\sqrt{A\Lambda} + d_{\text{elu}})$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner’s action. We show that a regret of $\Omega(\sqrt{d_{\text{elu}}\Lambda} + d_{\text{elu}})$ is unavoidable when $\sqrt{d_{\text{elu}}\Lambda} + d_{\text{elu}} \leq \sqrt{AT}$. In this setting, we provide an upper bound of order $\tilde{O}(d_{\text{elu}}\sqrt{\Lambda} + d_{\text{elu}})$. Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_{\text{elu}}\Lambda} + d_{\text{elu}})$ established in their work is unimprovable when $\sqrt{d_{\text{elu}}\Lambda} + d_{\text{elu}} \leq \sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda} + d_{\text{elu}})$.

[LG-15] FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Link: https://arxiv.org/abs/2410.12707
Authors: Zhenheng Tang, Xueze Kang, Yiming Yin, Xinglin Pan, Yuxin Wang, Xin He, Qiang Wang, Rongfei Zeng, Kaiyong Zhao, Shaohuai Shi, Amelie Chi Zhou, Bo Li, Bingsheng He, Xiaowen Chu
Keywords: large deep neural, training large deep, large language models, alleviate hardware scarcity, deep neural networks
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents the data dependency between operators. Based on this design, 1) users are allowed to customize any DNN without caring about low-level operator implementations; 2) we enable task scheduling with more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps to 10 Gbps networks. Experimental results demonstrate that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence.
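The compression idea behind a Top-K style compressor can be sketched as keeping only the largest-magnitude entries of a tensor and sending them as index-value pairs. The code below is a generic illustration, not the paper's AdaTopK: `adaptive_k` in particular is a hypothetical rule that ties the kept fraction to a link's relative bandwidth, which the abstract does not specify.

```python
def topk_compress(values, k):
    """Keep the k largest-magnitude entries as sorted (index, value) pairs."""
    k = max(0, min(k, len(values)))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return sorted((i, values[i]) for i in order[:k])


def topk_decompress(pairs, length):
    """Rebuild a dense list, filling dropped entries with zeros."""
    dense = [0.0] * length
    for i, v in pairs:
        dense[i] = v
    return dense


def adaptive_k(length, bandwidth, fastest_bandwidth, min_ratio=0.01):
    """Hypothetical rule: slower links keep a smaller fraction of entries."""
    ratio = max(min_ratio, min(1.0, bandwidth / fastest_bandwidth))
    return max(1, int(length * ratio))


gradient = [0.1, -3.0, 2.0, 0.05]
sparse = topk_compress(gradient, k=2)
restored = topk_decompress(sparse, len(gradient))
```

Sending two pairs instead of four floats halves the payload here; on real activations and gradients the kept fraction is typically far smaller, and only the slowest links need the most aggressive setting.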

[LG-16] Sarcasm Detection in a Less-Resourced Language

Link: https://arxiv.org/abs/2410.12704
Authors: Lazar Đoković, Marko Robnik-Šikonja
Keywords: natural language processing, sarcasm, sarcasm detection, detection
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 4 pages, published in the Slovenian Conference on Artificial Intelligence

Abstract:The sarcasm detection task in natural language processing tries to classify whether an utterance is sarcastic or not. It is related to sentiment analysis since it often inverts surface sentiment. Because sarcastic sentences are highly dependent on context, and they are often accompanied by various non-verbal cues, the task is challenging. Most related work focuses on high-resourced languages like English. To build a sarcasm detection dataset for a less-resourced language, such as Slovenian, we leverage two modern techniques: a machine translation specific medium-size transformer model, and a very large generative language model. We explore the viability of translated datasets and how the size of a pretrained transformer affects its ability to detect sarcasm. We train ensembles of detection models and evaluate models’ performance. The results show that larger models generally outperform smaller ones and that ensembling can slightly improve sarcasm detection performance. Our best ensemble approach achieves an $F_1$-score of 0.765 which is close to annotators’ agreement in the source language.
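For readers unfamiliar with the metric reported above, the sketch below computes the binary F1 score and combines several detectors by majority vote. The labels and predictions are made up for illustration, and majority voting is an assumed ensembling rule; the abstract does not say how the paper's ensembles are combined.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (sarcastic) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_vote(predictions_per_model):
    """Per-example majority vote over binary predictions (ties count as positive)."""
    n_models = len(predictions_per_model)
    return [1 if 2 * sum(votes) >= n_models else 0 for votes in zip(*predictions_per_model)]


labels = [1, 1, 0, 0]
single = f1_score(labels, [1, 0, 1, 0])
ensemble = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
```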

[LG-17] Neural-based Control for CubeSat Docking Maneuvers

Link: https://arxiv.org/abs/2410.12703
Authors: Matteo Stoisa, Federica Paganelli Azza, Luca Romanelli, Mattia Varile
Keywords: spacecraft dynamics variations, Artificial Neural Networks, GNC systems, limitations of GNC, employing Artificial Neural
Categories: Machine Learning (cs.LG)

Abstract:Autonomous Rendezvous and Docking (RVD) have been extensively studied in recent years, addressing the stringent requirements of spacecraft dynamics variations and the limitations of GNC systems. This paper presents an innovative approach employing Artificial Neural Networks (ANN) trained through Reinforcement Learning (RL) for autonomous spacecraft guidance and control during the final phase of the rendezvous maneuver. The proposed strategy is easily implementable onboard and offers fast adaptability and robustness to disturbances by learning control policies from experience rather than relying on predefined models. Extensive Monte Carlo simulations within a relevant environment are conducted in 6DoF settings to validate our approach, along with hardware tests that demonstrate deployment feasibility. Our findings highlight the efficacy of RL in assuring the adaptability and efficiency of spacecraft RVD, offering insights into future mission expectations.

[LG-18] Embedding an Ethical Mind: Aligning Text-to-Image Synthesis via Lightweight Value Optimization

Link: https://arxiv.org/abs/2410.12700
Authors: Xingqi Wang, Xiaoyuan Yi, Xing Xie, Jia Jia
Keywords: Recent advancements, produce harmful content, harmful content misaligned, Large Language Models, indistinguishable human-level images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by ACM Multimedia 2024. The dataset and code can be found at this https URL

Abstract:Recent advancements in diffusion models trained on large-scale data have enabled the generation of indistinguishable human-level images, yet they often produce harmful content misaligned with human values, e.g., social bias, and offensive content. Despite extensive research on Large Language Models (LLMs), the challenge of Text-to-Image (T2I) model alignment remains largely unexplored. Addressing this problem, we propose LiVO (Lightweight Value Optimization), a novel lightweight method for aligning T2I models with human values. LiVO only optimizes a plug-and-play value encoder to integrate a specified value principle with the input prompt, allowing the control of generated images over both semantics and values. Specifically, we design a diffusion model-tailored preference optimization loss, which theoretically approximates the Bradley-Terry model used in LLM alignment but provides a more flexible trade-off between image quality and value conformity. To optimize the value encoder, we also develop a framework to automatically construct a text-image preference dataset of 86k (prompt, aligned image, violating image, value principle) samples. Without updating most model parameters and through adaptive value selection from the input prompt, LiVO significantly reduces harmful outputs and achieves faster convergence, surpassing several strong baselines and taking an initial step towards ethically aligned T2I models.

[LG-19] Machine Learning Approach to Brain Tumor Detection and Classification

Link: https://arxiv.org/abs/2410.12692
Authors: Alice Oh, Inyoung Noh, Jian Choo, Jihoo Lee, Justin Park, Kate Hwang, Sanghyeon Kim, Soo Min Oh
Keywords: improve treatment outcomes, significantly improve treatment, brain MRI images, machine learning models, Brain tumor detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures, 2 tables

Abstract:Brain tumor detection and classification are critical tasks in medical image analysis, particularly in early-stage diagnosis, where accurate and timely detection can significantly improve treatment outcomes. In this study, we apply various statistical and machine learning models to detect and classify brain tumors using brain MRI images. We explore a variety of statistical models including linear, logistic, and Bayesian regressions, and the machine learning models including decision tree, random forest, single-layer perceptron, multi-layer perceptron, convolutional neural network (CNN), recurrent neural network, and long short-term memory. Our findings show that CNN outperforms other models, achieving the best performance. Additionally, we confirm that the CNN model can also work for multi-class classification, distinguishing between four categories of brain MRI images such as normal, glioma, meningioma, and pituitary tumor images. This study demonstrates that machine learning approaches are suitable for brain tumor detection and classification, facilitating real-world medical applications in assisting radiologists with early and accurate diagnosis.

[LG-20] Automatic Mapping of Anatomical Landmarks from Free-Text Using Large Language Models: Insights from Llama-2

Link: https://arxiv.org/abs/2410.12686
Authors: Mohamad Abdi, Gerardo Hemosillo Valadez, Halid Ziya Yerebakan
Keywords: anomaly detection, navigation and anomaly, Anatomical landmarks, landmarks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 1 table

Abstract:Anatomical landmarks are vital in medical imaging for navigation and anomaly detection. Modern large language models (LLMs), like Llama-2, offer promise for automating the mapping of these landmarks in free-text radiology reports to corresponding positions in image data. Recent studies propose LLMs may develop coherent representations of generative processes. Motivated by these insights, we investigated whether LLMs accurately represent the spatial positions of anatomical landmarks. Through experiments with Llama-2 models, we found that they can linearly represent anatomical landmarks in space with considerable robustness to different prompts. These results underscore the potential of LLMs to enhance the efficiency and accuracy of medical imaging workflows.

[LG-21] Optimizing Multi-Task Learning for Accurate Spacecraft Pose Estimation

Link: https://arxiv.org/abs/2410.12679
Authors: Francesco Evangelisti, Francesco Rossi, Tobia Giani, Ilaria Bloise, Mattia Varile
Keywords: Accurate satellite pose, satellite pose estimation, pose estimation, Accurate satellite, direct pose estimation
Categories: Machine Learning (cs.LG)

Abstract:Accurate satellite pose estimation is crucial for autonomous guidance, navigation, and control (GNC) systems in in-orbit servicing (IOS) missions. This paper explores the impact of different tasks within a multi-task learning (MTL) framework for satellite pose estimation using monocular images. By integrating tasks such as direct pose estimation, keypoint prediction, object localization, and segmentation into a single network, the study aims to evaluate the reciprocal influence between tasks by testing different multi-task configurations thanks to the modularity of the convolutional neural network (CNN) used in this work. The trends of mutual bias between the analyzed tasks are found by employing different weighting strategies to further test the robustness of the findings. A synthetic dataset was developed to train and test the MTL network. Results indicate that direct pose estimation and heatmap-based pose estimation positively influence each other in general, while both the bounding box and segmentation tasks do not provide significant contributions and tend to degrade the overall estimation accuracy.

[LG-22] Context Matters: Leveraging Contextual Features for Time Series Forecasting

Link: https://arxiv.org/abs/2410.12672
Authors: Sameep Chattopadhyay, Pulkit Paliwal, Sai Shankar Narasimhan, Shubhankar Agarwal, Sandeep P. Chinchali
Keywords: Time series forecasts, Time series, exogenous contextual features, series forecasts, influenced by exogenous
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasts are often influenced by exogenous contextual features in addition to their corresponding history. For example, in financial settings, it is hard to accurately predict a stock price without considering public sentiments and policy decisions in the form of news articles, tweets, etc. Though this is common knowledge, the current state-of-the-art (SOTA) forecasting models fail to incorporate such contextual information, owing to its heterogeneity and multimodal nature. To address this, we introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing pre-trained forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information, to significantly enhance the performance of existing base forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.

[LG-23] New Paradigm of Adversarial Training: Breaking Inherent Trade-Off between Accuracy and Robustness via Dummy Classes

Link: https://arxiv.org/abs/2410.12671
Authors: Yanyun Wang, Li Liu, Zi Liang, Qingqing Ye, Haibo Hu
Keywords: adversarial samples, Classes-based Adversarial Training, Adversarial, Adversarial Training, adversarial robustness
Categories: Machine Learning (cs.LG)
Comments: Preprint. Work in progress. The code is available at this https URL

Abstract:Adversarial Training (AT) is one of the most effective methods to enhance the robustness of DNNs. However, existing AT methods suffer from an inherent trade-off between adversarial robustness and clean accuracy, which seriously hinders their real-world deployment. While this problem has been widely studied within the current AT paradigm, existing AT methods still typically experience a reduction in clean accuracy by over 10% to date, without significant improvements in robustness compared with simple baselines like PGD-AT. This inherent trade-off raises a question: whether the current AT paradigm, which assumes to learn the corresponding benign and adversarial samples as the same class, inappropriately combines clean and robust objectives that may be essentially inconsistent. In this work, we surprisingly reveal that up to 40% of CIFAR-10 adversarial samples always fail to satisfy such an assumption across various AT methods and robust models, explicitly indicating the improvement room for the current AT paradigm. Accordingly, to relax the tension between clean and robust learning derived from this overstrict assumption, we propose a new AT paradigm by introducing an additional dummy class for each original class, aiming to accommodate the hard adversarial samples with shifted distribution after perturbation. The robustness w.r.t. these adversarial samples can be achieved by runtime recovery from the predicted dummy classes to their corresponding original ones, eliminating the compromise with clean learning. Building on this new paradigm, we propose a novel plug-and-play AT technology named DUmmy Classes-based Adversarial Training (DUCAT). Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that the DUCAT concurrently improves clean accuracy and adversarial robustness compared with state-of-the-art benchmarks, effectively breaking the existing inherent trade-off.
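The runtime recovery step described above can be sketched as a classifier head with twice as many outputs as original classes, where a prediction of a dummy class is mapped back to its original class. The index layout below (the dummy for class c sits at index c + C) and the function name are assumptions for illustration, not details confirmed by the abstract.

```python
def recover_class(logits, num_classes):
    """Map an argmax over 2*C outputs (C original + C dummy classes) back to an original class.

    Assumed layout: indices 0..C-1 are original classes, index c + C is the dummy for class c.
    """
    if len(logits) != 2 * num_classes:
        raise ValueError("expected one dummy output per original class")
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return predicted % num_classes


# Three original classes; the highest logit is the dummy of class 0 (index 3).
adversarial_logits = [0.1, 0.2, 0.3, 0.9, 0.0, 0.1]
clean_logits = [0.1, 0.8, 0.3, 0.2, 0.0, 0.1]
recovered = recover_class(adversarial_logits, num_classes=3)
```

The same function handles clean inputs unchanged, since an argmax that already falls on an original class maps to itself.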

[LG-24] Explanation-Preserving Augmentation for Semi-Supervised Graph Representation Learning

Link: https://arxiv.org/abs/2410.12657
Authors: Zhuomin Chen, Jingchao Ni, Hojat Allah Salehi, Xu Zheng, Esteban Schafir, Farhad Shirani, Dongsheng Luo
Keywords: achieving performance improvements, self-supervised GRL, effective technique achieving, technique achieving performance, Graph
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, 7 tables

Abstract:Graph representation learning (GRL), enhanced by graph augmentation methods, has emerged as an effective technique that improves performance in a wide range of tasks such as node classification and graph classification. In self-supervised GRL, paired graph augmentations are generated from each graph. The objective is to infer similar representations for augmentations of the same graph, but maximally distinguishable representations for augmentations of different graphs. Analogous to image and language domains, the desiderata of an ideal augmentation method include both (1) semantics-preservation; and (2) data-perturbation; i.e., an augmented graph should preserve the semantics of its original graph while carrying sufficient variance. However, most existing (un-)/self-supervised GRL methods focus on data perturbation but largely neglect semantics preservation. To address this challenge, in this paper, we propose a novel method, Explanation-Preserving Augmentation (EPA), that leverages graph explanation techniques for generating augmented graphs that can bridge the gap between semantics-preservation and data-perturbation. EPA first uses a small number of labels to train a graph explainer to infer the sub-structures (explanations) that are most relevant to a graph’s semantics. These explanations are then used to generate semantics-preserving augmentations for self-supervised GRL, namely EPA-GRL. We demonstrate theoretically, using an analytical example, and through extensive experiments on a variety of benchmark datasets that EPA-GRL outperforms the state-of-the-art (SOTA) GRL methods, which are built upon semantics-agnostic data augmentations.

[LG-25] Position Specific Scoring Is All You Need? Revisiting Protein Sequence Classification Tasks

Link: https://arxiv.org/abs/2410.12655
Authors: Sarwan Ali, Taslim Murad, Prakash Chourasia, Haris Mansoor, Imdad Ullah Khan, Pin-Yu Chen, Murray Patterson
Keywords: Understanding the structural, policy development, structural and functional, developing preventative, preventative and curative
Categories: Machine Learning (cs.LG)

Abstract:Understanding the structural and functional characteristics of proteins is crucial for developing preventative and curative strategies that impact fields from drug discovery to policy development. An important and popular technique for examining how amino acids make up these characteristics of protein sequences is position-specific scoring (PSS). While the string kernel is crucial in natural language processing (NLP), it is unclear whether string kernels can extract biologically meaningful information from protein sequences, despite the fact that they have been shown to be effective in general sequence analysis tasks. In this work, we propose a weighted PSS kernel matrix (W-PSSKM) that combines a PSS representation of protein sequences, which encodes the frequency information of each amino acid in a sequence, with the notion of the string kernel. This results in a novel kernel function that outperforms many other approaches for protein sequence classification. We perform extensive experimentation to evaluate the proposed method. Our findings demonstrate that the W-PSSKM significantly outperforms existing baselines and state-of-the-art methods and achieves up to 45.1% improvement in classification accuracy.

[LG-26] Constrained Posterior Sampling: Time Series Generation with Hard Constraints

Link: https://arxiv.org/abs/2410.12652
Authors: Sai Shankar Narasimhan, Shubhankar Agarwal, Litu Rout, Sanjay Shakkottai, Sandeep P. Chinchali
Keywords: protecting user privacy, synthetic data, crucial for stress-testing, stress-testing models, models and protecting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data. In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature. Consider, for example, generating electricity demand patterns with constraints on peak demand times. This can be used to stress-test the functioning of power grids during adverse weather conditions. Existing approaches for generating constrained time series are either not scalable or degrade sample quality. To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update. Notably, CPS scales to a large number of constraints (~100) without requiring additional training. We provide theoretical justifications highlighting the impact of our projection step on sampling. Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 10% and 42%, respectively, on real-world stocks, traffic, and air quality datasets.
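
The idea of projecting into the constraint set after each denoising update can be illustrated with a toy loop. The sketch below is heavily simplified and is not the CPS algorithm: the denoiser is a hypothetical placeholder that just moves the series toward a fixed target, the projection is applied to the iterate itself rather than to a posterior mean estimate, and the only constraint is a box (every value must stay between a lower and an upper bound), for which the Euclidean projection is simple clipping.

```python
def project_onto_box(series, lower, upper):
    """Euclidean projection onto the set {x : lower <= x_i <= upper for all i}."""
    return [min(max(value, lower), upper) for value in series]


def toy_denoise_step(series, target, rate=0.5):
    """Placeholder for a learned denoiser (hypothetical): step toward a target."""
    return [value + rate * (goal - value) for value, goal in zip(series, target)]


def constrained_sampling_sketch(initial, target, lower, upper, steps=20):
    """Alternate an update step with a projection back into the constraint set."""
    series = list(initial)
    for _ in range(steps):
        series = toy_denoise_step(series, target)
        series = project_onto_box(series, lower, upper)
    return series


# The unconstrained target violates the box [0, 10] in two positions.
sample = constrained_sampling_sketch(
    initial=[0.0, 0.0, 0.0], target=[5.0, 12.0, -3.0], lower=0.0, upper=10.0
)
```

Because the projection runs after every update, each intermediate series already satisfies the hard constraint, rather than being repaired once at the end.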

[LG-27] Optimization and Application of Cloud-based Deep Learning Architecture for Multi-Source Data Prediction

Link: https://arxiv.org/abs/2410.12642
Authors: Yang Zhang, Fa Wang, Xin Huang, Xintao Li, Sibei Liu, Hansong Zhang
Keywords: AWS cloud platform, cloud-based deep learning, deep learning technologies, accurate risk assessment, distributed computing capabilities
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Databases (cs.DB); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 6 Pages, 5 Figures, 3 Tables. The final version will be published in the proceedings of the IEEE conference

Abstract:This study develops a cloud-based deep learning system for early prediction of diabetes, leveraging the distributed computing capabilities of the AWS cloud platform and deep learning technologies to achieve efficient and accurate risk assessment. The system utilizes EC2 p3.8xlarge GPU instances to accelerate model training, reducing training time by 93.2% while maintaining a prediction accuracy of 94.2%. With an automated data processing and model training pipeline built using Apache Airflow, the system can complete end-to-end updates within 18.7 hours. In clinical applications, the system demonstrates a prediction accuracy of 89.8%, sensitivity of 92.3%, and specificity of 95.1%. Early interventions based on predictions lead to a 37.5% reduction in diabetes incidence among the target population. The system’s high performance and scalability provide strong support for large-scale diabetes prevention and management, showcasing significant public health value.

[LG-28] An Exact Finite-dimensional Explicit Feature Map for Kernel Functions

Link: https://arxiv.org/abs/2410.12635
Authors: Kamaledin Ghiasi-Shirazi, Mohammadreza Qaraei
Keywords: kernel function, Hilbert space, arbitrary kernel function, Kernel, Gaussian and Laplacian
Categories: Machine Learning (cs.LG)

Abstract:Kernel methods in machine learning use a kernel function that takes two data points as input and returns their inner product after mapping them to a Hilbert space, implicitly and without actually computing the mapping. For many kernel functions, such as Gaussian and Laplacian kernels, the feature space is known to be infinite-dimensional, making operations in this space possible only implicitly. This implicit nature requires algorithms to be expressed using dual representations and the kernel trick. In this paper, we introduce an explicit, finite-dimensional feature map for an arbitrary kernel function that ensures the inner product of data points in the feature space equals the kernel function value, during both training and testing. The existence of this explicit mapping allows kernelized algorithms to be formulated in their primal form, without the need for the kernel trick or the dual representation. As a first application, we demonstrate how to derive kernelized machine learning algorithms directly, without resorting to the dual representation, and apply this method specifically to PCA. As another application, without any changes to the t-SNE algorithm and its implementation, we use it for visualizing the feature space of kernel functions.
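
For intuition only: on a fixed, finite set of points, explicit finite-dimensional features that reproduce kernel values exactly can be obtained by factoring the Gram matrix. This is a standard linear-algebra fact and not necessarily the construction used in the paper, which the abstract does not spell out and which also covers test points. The sketch below assumes a Gaussian kernel and distinct points (so the Gram matrix is positive definite), and uses a hand-written Cholesky factorization.

```python
import math


def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an illustrative choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(points, kernel):
    return [[kernel(p, q) for q in points] for p in points]


def cholesky(matrix):
    """Return lower-triangular L with L @ L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
gram = gram_matrix(points, gaussian_kernel)
# Row i of the factor is an explicit 4-dimensional feature vector for point i.
features = cholesky(gram)
```

Here `dot(features[i], features[j])` matches `gram[i][j]` up to floating-point error, so an algorithm that only needs inner products among these points could work directly on `features` in primal form.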

[LG-29] Explainable Moral Values: a neuro-symbolic approach to value classification ESWC24

Link: https://arxiv.org/abs/2410.12631
Authors: Nicolas Lazzari, Stefano De Giorgis, Aldo Gangemi, Valentina Presutti
Keywords: Machine Learning techniques, Machine Learning, Ontology Design Pattern, Moral Foundations Theory, reasoning and Machine
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at ESWC24 Satellite Event

Abstract:This work explores the integration of ontology-based reasoning and Machine Learning techniques for explainable value classification. Building on an ontological formalization of moral values, as in the Moral Foundations Theory, that relies on the DnS Ontology Design Pattern, the sandra neuro-symbolic reasoner is used to infer values (formalized as descriptions) that are satisfied by a certain sentence. Sentences, alongside their structured representation, are automatically generated using an open-source Large Language Model. The inferred descriptions are used to automatically detect the value associated with a sentence. We show that relying only on the reasoner’s inference results in explainable classification comparable to other more complex approaches. We show that combining the reasoner’s inferences with distributional semantics methods largely outperforms all the baselines, including complex models based on neural network architectures. Finally, we build a visualization tool to explore the potential of theory-based values classification, which is publicly available at this http URL.

[LG-30] Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning

Link: https://arxiv.org/abs/2410.12621
Authors: Ruimeng Ye, Yang Xiao, Bo Hui
Keywords: large language models, continue to advance, increasingly critical, large language, alignment
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:As large language models (LLMs) continue to advance, ensuring their alignment with human values becomes increasingly critical. Traditional alignment methods heavily rely on human feedback to fine-tune models. With the emergence of superhuman models whose outputs may surpass human understanding, evaluating and aligning these models using human judgments poses significant challenges. To address these challenges, recent works use weak supervisors to elicit knowledge from much stronger models. However, there are important disanalogies between the empirical setup in the existing works and the genuine goal of alignment. We remark that existing works investigate the phenomenon of weak-to-strong generalization in an analogous setup (i.e., binary classification), rather than in practical alignment-relevant tasks (e.g., safety). In this paper, we bridge this gap by extending weak-to-strong generalization to the context of practical alignment. We empirically demonstrate the widespread phenomenon of weak-to-strong generalization in three complicated alignment tasks: safety, toxicity, and legal reasoning. Furthermore, we explore efficient strategies for improving alignment performance to enhance the quality of model outcomes. Lastly, we summarize and analyze the challenges and potential solutions for specific alignment tasks, which we hope will catalyze research progress on the topic of weak-to-strong generalization. Our code is released at this https URL.

[LG-31] Exploring Model Kinship for Merging Large Language Models

Link: https://arxiv.org/abs/2410.12613
Authors: Yedi Hu, Yunzhi Yao, Ningyu Zhang, Shumin Deng, Huajun Chen
Keywords: Large Language Models, Large Language, efficiency of Large, Language Models, model kinship
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Ongoing work

Abstract:Model merging has become one of the key technologies for enhancing the capabilities and efficiency of Large Language Models (LLMs). However, our understanding of the expected performance gains and principles when merging any two models remains limited. In this work, we introduce model kinship, the degree of similarity or relatedness between LLMs, analogous to biological evolution. With comprehensive empirical analysis, we find that there is a certain relationship between model kinship and the performance gains after model merging, which can help guide our selection of candidate models. Inspired by this, we propose a new model merging strategy: Top-k Greedy Merging with Model Kinship, which can yield better performance on benchmark datasets. Specifically, we find that using model kinship as a criterion helps us merge models continually and alleviates degradation (local optima) in model evolution, with model kinship serving as a guide for escaping these traps. Code is available at this https URL.

[LG-32] Towards Graph Foundation Models: The Perspective of Zero-shot Reasoning on Knowledge Graphs

Link: https://arxiv.org/abs/2410.12609
Authors: Kai Wang, Siqiang Luo
Keywords: artificial general intelligence, Foundation Models, Graph Foundation Models, developing Graph Foundation, general intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 Pages, 5 figures

Abstract:Inspired by the success of artificial general intelligence, there is a trend towards developing Graph Foundation Models that excel in generalization across various graph tasks and domains. However, current models often require extensive training or fine-tuning to capture structural and semantic insights on new graphs, which limits their versatility. In this work, we explore graph foundation models from the perspective of zero-shot reasoning on Knowledge Graphs (KGs). Our focus is on utilizing KGs as a unified topological structure to tackle diverse tasks, while addressing semantic isolation challenges in KG reasoning to effectively integrate diverse semantic and structural features. This brings us new methodological insights into KG reasoning, as well as high generalizability towards foundation models in practice. Methodologically, we introduce SCORE, a unified graph reasoning framework that effectively generalizes diverse graph tasks using zero-shot learning. At the core of SCORE is semantic conditional message passing, a technique designed to capture both structural and semantic invariances in graphs, with theoretical backing for its expressive power. Practically, we evaluate the zero-shot reasoning capability of SCORE using 38 diverse graph datasets, covering node-level, link-level, and graph-level tasks across multiple domains. Our experiments reveal a substantial performance improvement over prior foundation models and supervised baselines, highlighting the efficacy and adaptability of our approach.

[LG-33] Low-Rank Adversarial PGD Attack

Link: https://arxiv.org/abs/2410.12607
Authors: Dayana Savostianova, Emanuele Zangrando, Francesco Tudisco
Keywords: Projected Gradient Descent, deep neural network, neural network models, neural network, Projected Gradient
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:Adversarial attacks on deep neural network models have seen rapid development and are extensively used to study the stability of these networks. Among various adversarial strategies, Projected Gradient Descent (PGD) is a widely adopted method in computer vision due to its effectiveness and quick implementation, making it suitable for adversarial training. In this work, we observe that in many cases, the perturbations computed using PGD predominantly affect only a portion of the singular value spectrum of the original image, suggesting that these perturbations are approximately low-rank. Motivated by this observation, we propose a variation of PGD that efficiently computes a low-rank attack. We extensively validate our method on a range of standard models as well as robust models that have undergone adversarial training. Our analysis indicates that the proposed low-rank PGD can be effectively used in adversarial training due to its straightforward and fast implementation coupled with competitive performance. Notably, we find that low-rank PGD often performs comparably to, and sometimes even outperforms, the traditional full-rank PGD attack, while using significantly less memory.

[LG-34] Self-Supervised Learning of Disentangled Representations for Multivariate Time-Series NEURIPS2024

Link: https://arxiv.org/abs/2410.12606
Authors: Ching Chang, Chiao-Tung Chan, Wei-Yao Wang, Wen-Chih Peng, Tien-Fu Chen
Keywords: fields like healthcare, healthcare and industry, industry are informative, informative but challenging, challenging due
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice

Abstract:Multivariate time-series data in fields like healthcare and industry are informative but challenging due to high dimensionality and lack of labels. Recent self-supervised learning methods excel in learning rich representations without labels but struggle with disentangled embeddings and inductive bias issues like transformation-invariance. To address these challenges, we introduce TimeDRL, a framework for multivariate time-series representation learning with dual-level disentangled embeddings. TimeDRL features: (i) disentangled timestamp-level and instance-level embeddings using a [CLS] token strategy; (ii) timestamp-predictive and instance-contrastive tasks for representation learning; and (iii) avoidance of augmentation methods to eliminate inductive biases. Experiments on forecasting and classification datasets show TimeDRL outperforms existing methods, with further validation in semi-supervised settings with limited labeled data.

[LG-35] The Bayesian Confidence (BACON) Estimator for Deep Neural Networks

Link: https://arxiv.org/abs/2410.12604
Authors: Patrick D. Kee, Max J. Brown, Jonathan C. Rice, Christian A. Howell
Keywords: Bayesian Confidence Estimator, Bayesian Confidence, deep neural networks, introduces the Bayesian, Confidence Estimator
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 15 figures (10 of which include sub-figures)

Abstract:This paper introduces the Bayesian Confidence Estimator (BACON) for deep neural networks. Current practice of interpreting Softmax values in the output layer as probabilities of outcomes is prone to extreme predictions of class probability. In this work we extend Waagen’s method of representing the terminal layers with a geometric model, where the probability associated with an output vector is estimated with Bayes’ Rule using validation data to provide likelihood and normalization values. This estimator provides superior ECE and ACE calibration error compared to Softmax for ResNet-18 at 85% network accuracy, and EfficientNet-B0 at 95% network accuracy, on the CIFAR-10 dataset with an imbalanced test set, except for very high accuracy edge cases. In addition, when using the ACE metric, BACON demonstrated improved calibration error when estimating probabilities for the imbalanced test set when using actual class distribution fractions.
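
As background for the calibration metrics named in the abstract, the sketch below computes the Expected Calibration Error (ECE) in its common equal-width binning form: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is averaged with weights proportional to bin size. The bin count and the left-inclusive bin edges are illustrative choices, the function name is hypothetical, and ACE (which uses a different binning scheme) is not shown.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for confidence, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece


# Four predictions at 90% confidence, three of them correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A model that is right 75% of the time while reporting 90% confidence has a gap of about 0.15 in that bin, which is the kind of overconfidence the abstract attributes to raw Softmax outputs.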

[LG-36] Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Link: https://arxiv.org/abs/2410.12598
Authors: Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes
Keywords: learning rate, higher learning rates, Deep Reinforcement Learning, Learning models trained, Reinforcement Learning models
Categories: Machine Learning (cs.LG)

Abstract:In Deep Reinforcement Learning models trained using gradient-based techniques, the choice of optimizer and its learning rate are crucial to achieving good performance: higher learning rates can prevent the model from learning effectively, while lower ones might slow convergence. Additionally, due to the non-stationarity of the objective function, the best-performing learning rate can change over the training steps. To adapt the learning rate, a standard technique consists of using decay schedulers. However, these schedulers assume that the model is progressively approaching convergence, which may not always be true, leading to delayed or premature adjustments. In this work, we propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent’s performance during training. LRRL is based on a multi-armed bandit algorithm, where each arm represents a different learning rate, and the bandit feedback is provided by the cumulative returns of the RL policy to update the arms’ probability distribution. Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
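
The bandit view of learning-rate selection can be sketched as follows. This is a generic Exp3-style selector, used here as an assumption: the abstract does not say which bandit algorithm LRRL uses or how returns are normalized. The class name, the `gamma` exploration parameter, and the requirement that rewards be scaled to [0, 1] are all illustrative.

```python
import math
import random


class Exp3LearningRateSelector:
    """Exp3-style bandit in which each arm is one candidate learning rate."""

    def __init__(self, learning_rates, gamma=0.1, seed=0):
        self.learning_rates = list(learning_rates)
        self.gamma = gamma  # exploration rate (illustrative default)
        self.weights = [1.0] * len(self.learning_rates)
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        arms = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / arms for w in self.weights]

    def select(self):
        """Sample an arm; return its index and the learning rate it stands for."""
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, self.learning_rates[arm]

    def update(self, arm, reward):
        """Update with a reward in [0, 1], e.g. a rescaled cumulative return."""
        probs = self.probabilities()
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.weights))
        # Rescale to avoid overflow; the probabilities are unchanged by this.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


# Toy feedback: pretend only the middle learning rate yields good returns.
selector = Exp3LearningRateSelector([1e-2, 1e-3, 1e-4])
for _ in range(200):
    chosen_arm, chosen_lr = selector.select()
    selector.update(chosen_arm, reward=1.0 if chosen_arm == 1 else 0.0)
```

In a real training loop, `chosen_lr` would be handed to the optimizer for the next block of updates, and the reward would come from the policy's observed returns rather than from a fixed rule.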

[LG-37] Personalized Prediction Models for Changes in Knee Pain among Patients with Osteoarthritis Participating in Supervised Exercise and Education

Link: https://arxiv.org/abs/2410.12597
Authors: M. Rafiei, S. Das, M. Bakhtiari, E.M. Roos, S.T. Skou, D.T. Grønne, J. Baumbach, L. Baumbach
Keywords: widespread chronic condition, knee pain, quality of life, pain, models
Categories: Machine Learning (cs.LG)

Abstract:Knee osteoarthritis (OA) is a widespread chronic condition that impairs mobility and diminishes quality of life. Despite the proven benefits of exercise therapy and patient education in managing the OA symptoms of pain and functional limitation, these strategies are often underutilized. Personalized outcome prediction models can help motivate and engage patients, but the accuracy of existing models in predicting changes in knee pain remains insufficiently examined. We aim to validate existing models and introduce a concise personalized model that predicts the change in knee pain from before to after participation in a supervised education and exercise therapy program (GLA:D) for knee OA patients. Our models use self-reported patient information and functional measures. To reduce the number of variables, we evaluated variable importance and applied clinical reasoning. We trained random forest regression models and compared the rate of true predictions of our models with that of predictions based on average values. We evaluated the performance of a full, a continuous, and a concise model including all 34 variables, all 11 continuous variables, and the six most predictive variables, respectively. All three models performed similarly and were comparable to the existing model, with R-squared values of 0.31-0.32 and RMSEs of 18.65-18.85, despite our increased sample size. Allowing a deviation of 15 VAS points from the true change in pain, our concise model and the average-value approach estimated the change in pain correctly in 58% and 51% of cases, respectively. Our supplementary analysis led to similar outcomes. Our concise personalized prediction model predicts changes in knee pain following the GLA:D program more accurately than average pain improvement values do. Neither the increase in sample size nor the inclusion of additional variables improved on previous models. To improve predictions, new variables beyond those in the GLA:D are required.
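
The "rate of true predictions" with a 15-point tolerance can be made concrete with a small sketch. It assumes that a prediction counts as correct when it lies within 15 VAS points of the observed change, boundary included; the abstract does not say whether the boundary is inclusive. The function names and the numbers in the example are made up for illustration and are not taken from the study.

```python
def within_tolerance_rate(predicted, observed, tolerance=15.0):
    """Share of predictions within `tolerance` points of the observed value."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


def mean_baseline(observed):
    """Average-value baseline: predict the mean change for every patient."""
    mean_change = sum(observed) / len(observed)
    return [mean_change] * len(observed)


# Illustrative changes in pain (VAS points); negative means improvement.
observed_change = [-40.0, -10.0, 0.0, -30.0]
model_prediction = [-20.0, -15.0, -5.0, -28.0]

model_rate = within_tolerance_rate(model_prediction, observed_change)
baseline_rate = within_tolerance_rate(mean_baseline(observed_change), observed_change)
```

Comparing `model_rate` against `baseline_rate` mirrors the comparison reported in the abstract between the concise model and the average-value approach.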

[LG-38] Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting

Link: https://arxiv.org/abs/2410.12593
Authors: Wei Chen, Yuxuan Liang
Keywords: sensing devices leads, spatio-temporal graph neural, spatio-temporal forecasting applications, graph neural network, air quality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.

[LG-39] Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Link: https://arxiv.org/abs/2410.12592
Authors: Minkyoung Cho, Yulong Cao, Jiachen Sun, Qingzhao Zhang, Marco Pavone, Jeong Joon Park, Heng Yang, Z. Morley Mao
Keywords: long-tail scenarios, important paradigm, enhance accuracy, separate detection pipelines, distinct object configurations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages

Abstract:An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.

[LG-40] On the Role of Activation Functions in EEG-To-Text Decoder

Link: https://arxiv.org/abs/2410.12572
Authors: Zenon Lamprou, Iakovos Tenedios, Yashar Moshfeghi
Keywords: conducted exploring potential, recent years, information retrieval, conducted exploring, exploring potential
Categories: Machine Learning (cs.LG)

Abstract:In recent years, much interdisciplinary research has explored potential use cases of neuroscience to advance the field of information retrieval. Initial research concentrated on the use of fMRI data, but fMRI was deemed unsuitable for real-world applications, and research soon shifted towards using EEG data. In this paper, we try to improve on the performance of a first attempt at generating text from EEG by focusing on the less explored area of optimising neural network performance. We test a set of different activation functions and compare their performance. Our results show that introducing a higher-degree polynomial activation function can enhance model performance without changing the model architecture. We also show that the learnable 3rd-degree activation function performs better on the 1-gram evaluation than a 3rd-degree non-learnable function. However, when the model is evaluated on 2-grams and above, the polynomial function underperforms, whilst the leaky ReLU activation function outperforms the baseline.

[LG-41] One Step Diffusion via Shortcut Models

Link: https://arxiv.org/abs/2410.12557
Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
Keywords: enabled generating diverse, Diffusion models, enabled generating, generating diverse, diverse and realistic
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.

[LG-42] Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Link: https://arxiv.org/abs/2410.12555
Authors: Daniel J. Lee, Stefan Heimersheim
Keywords: token prediction probabilities, prediction probabilities change, Sensitive directions experiments, directions experiments attempt, Language Models
Categories: Machine Learning (cs.LG)

Abstract:Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE’s sparsity, with lower L0 SAE feature directions exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs compared to traditional SAEs.
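
The quantity at the heart of such experiments, the change in the next-token distribution, is typically measured with the KL divergence. The sketch below computes it for two small logit vectors. The logits are invented stand-ins for a model's outputs before and after a perturbation, and the direction KL(original || perturbed) is an assumption, since the abstract does not state which direction is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exponentials = [math.exp(value - peak) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits before and after perturbing an activation.
original_logits = [2.0, 1.0, 0.1]
perturbed_logits = [1.5, 1.4, 0.1]

shift = kl_divergence(softmax(original_logits), softmax(perturbed_logits))
```

A larger `shift` for the same perturbation size marks a direction the model's output is more sensitive to, which is the comparison the experiments in the abstract rely on.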

[LG-43] Is Complex Query Answering Really Complex?

Link: https://arxiv.org/abs/2410.12537
Authors: Cosimo Gregucci, Bo Xiong, Daniel Hernandez, Lorenzo Loconte, Pasquale Minervini, Steffen Staab, Antonio Vergari
Keywords: Complex query answering, challenging reasoning task, knowledge graphs, reasoning task, gaining momentum
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Complex query answering (CQA) on knowledge graphs (KGs) is gaining momentum as a challenging reasoning task. In this paper, we show that the current benchmarks for CQA are not really complex, and the way they are built distorts our perception of progress in this field. For example, we find that in these benchmarks, most queries (up to 98% for some query types) can be reduced to simpler problems, e.g., link prediction, where only one link needs to be predicted. The performance of state-of-the-art CQA models drops significantly when such models are evaluated on queries that cannot be reduced to easier types. Thus, we propose a set of more challenging benchmarks, composed of queries that require models to reason over multiple hops and better reflect the construction of real-world KGs. In a systematic empirical investigation, the new benchmarks show that current CQA methods leave much to be desired.

[LG-44] Disentangling data distribution for Federated Learning

Link: https://arxiv.org/abs/2410.12530
Authors: Xinyuan Zhao,Hanlin Gu,Lixin Fan,Qiang Yang,Yuxing Han
Keywords-EN: facilitates collaborative training, private data owned, facilitates collaborative, collaborative training, performance is boosted
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Federated Learning (FL) facilitates collaborative training of a global model whose performance is boosted by private data owned by distributed clients, without compromising data privacy. Yet the wide applicability of FL is hindered by entanglement of data distributions across different clients. This paper demonstrates for the first time that by disentangling data distributions FL can in principle achieve efficiencies comparable to those of distributed systems, requiring only one round of communication. To this end, we propose a novel FedDistr algorithm, which employs stable diffusion models to decouple and recover data distributions. Empirical results on the CIFAR100 and DomainNet datasets show that FedDistr significantly enhances model utility and efficiency in both disentangled and near-disentangled scenarios while ensuring privacy, outperforming traditional federated learning methods.

[LG-45] MING: A Functional Approach to Learning Molecular Generative Models

Link: https://arxiv.org/abs/2410.12522
Authors: Van Khoa Nguyen,Maciej Falkiewicz,Giangiacomo Mercatali,Alexandros Kalousis
Keywords-EN: require complex permutation-equivariant, Traditional molecule generation, Traditional molecule, complex permutation-equivariant architectures, rely on sequence
Categories: Machine Learning (cs.LG)

Abstract:Traditional molecule generation methods often rely on sequence or graph-based representations, which can limit their expressive power or require complex permutation-equivariant architectures. This paper introduces a novel paradigm for learning molecule generative models based on functional representations. Specifically, we propose Molecular Implicit Neural Generation (MING), a diffusion-based model that learns molecular distributions in function space. Unlike standard diffusion processes in data space, MING employs a novel functional denoising probabilistic process, which jointly denoises the information in both the function’s input and output spaces by leveraging an expectation-maximization procedure for latent implicit neural representations of data. This approach allows for a simple yet effective model design that accurately captures underlying function distributions. Experimental results on molecule-related datasets demonstrate MING’s superior performance and ability to generate plausible molecular samples, surpassing state-of-the-art data-space methods while offering a more streamlined architecture and significantly faster generation times.

[LG-46] End-to-end Planner Training for Language Modeling

Link: https://arxiv.org/abs/2410.12492
Authors: Nathan Cornille,Florian Mai,Jingyuan Sun,Marie-Francine Moens
Keywords-EN: valuable tools, predict abstract labels, language modeling, planner, enhance language modeling
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:Through end-to-end training to predict the next token, LLMs have become valuable tools for various tasks. Enhancing their core training in language modeling can improve numerous downstream applications. A successful approach to enhance language modeling uses a separate planning module to predict abstract labels of future sentences and conditions the LM on these predictions. However, this method is non-differentiable, preventing joint end-to-end tuning of the planner with the LM. We propose an effective method to improve this approach by enabling joint fine-tuning of the planner and the LM. We show that a naive way of approximating the gradient of selecting a label via the straight-through estimator is not effective. Instead, we propose to use the predicted label probabilities as mixing weights to condition the LM on a weighted average of label embeddings in a differentiable manner. This not only enables joint fine-tuning of the planner and the LM, but also allows the LM to draw on the full label distribution predicted by the planner, retaining more information. Our experimental results show consistent improvements in perplexity.
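
The contrast the abstract draws between selecting a single label and mixing label embeddings by their predicted probabilities can be illustrated with a minimal sketch. This is not the authors' code: the function names, the three-label setup, and the two-dimensional embeddings are all hypothetical, chosen only to show why the weighted average stays differentiable with respect to the planner's probabilities while the argmax selection does not.

```python
import math

def softmax(logits):
    """Turn planner logits into label probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_condition(probs, label_embeddings):
    """Pick the embedding of the most likely label (argmax, non-differentiable)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return list(label_embeddings[best])

def soft_condition(probs, label_embeddings):
    """Probability-weighted average of label embeddings (smooth in probs)."""
    dim = len(label_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, label_embeddings))
        for d in range(dim)
    ]

# Hypothetical planner output over three abstract labels.
planner_logits = [2.0, 1.0, 0.0]
# Hypothetical 2-d embeddings, one per label.
label_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

probs = softmax(planner_logits)
hard_vec = hard_condition(probs, label_embeddings)
soft_vec = soft_condition(probs, label_embeddings)
```

The soft vector carries information about all three labels, which matches the abstract's remark that the LM can draw on the full label distribution; how the resulting vector is fed to the LM is not specified here and is left out of the sketch.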

[LG-47] Data-Driven Gyroscope Calibration

Link: https://arxiv.org/abs/2410.12485
Authors: Zeev Yampolsky,Itzik Klein
Keywords-EN: inertial sensors, sensors that measure, measure the angular, angular velocity, calibration
Categories: Machine Learning (cs.LG)
Comments: 19 Pages, 5 Figures, 3 Tables

Abstract:Gyroscopes are inertial sensors that measure the angular velocity of the platforms to which they are attached. To estimate the gyroscope deterministic error terms prior to mission start, a calibration procedure is performed. When considering low-cost gyroscopes, the calibration requires a turntable as the gyros are incapable of sensing the Earth turn rate. In this paper, we propose a data-driven framework to estimate the scale factor and bias of a gyroscope. To train and validate our approach, a dataset of 56 minutes was recorded using a turntable. We demonstrated that our proposed approach outperforms the model-based approach, in terms of accuracy and convergence time. Specifically, we improved the scale factor and bias estimation by an average of 72% during six seconds of calibration time, demonstrating an average of 75% calibration time improvement. That is, instead of minutes, our approach requires only several seconds for the calibration.

[LG-48] SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Link: https://arxiv.org/abs/2410.12481
Authors: Loris Gaven,Clement Romac,Thomas Carta,Sylvain Lamprier,Olivier Sigaud,Pierre-Yves Oudeyer
Keywords-EN: Large Language Models, Large Language, Language Models, sequential decision-making tasks, solving textual sequential
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The past years have seen Large Language Models (LLMs) strive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.

[LG-49] KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs

Link: https://arxiv.org/abs/2410.12480
Authors: Yongqin Xu,Huan Li,Ke Chen,Lidan Shou
Keywords-EN: integration and management, crucial for data, data integration, entity matching tasks, Knowledge-Compliant Matching Framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Schema and entity matching tasks are crucial for data integration and management. While large language models (LLMs) have shown promising results in these tasks, they suffer from hallucinations and confusion about task instructions. In this paper, we present the Knowledge-Compliant Matching Framework (KcMF), an LLM-based approach that addresses these issues without the need for domain-specific fine-tuning. KcMF employs a pseudo-code-based task decomposition strategy to adopt task-specific natural language statements that guide LLM reasoning and reduce confusion. We also propose two mechanisms, Dataset as Knowledge (DaK) and Example as Knowledge (EaK), to build domain knowledge sets when unstructured domain knowledge is lacking. Additionally, we introduce a result-ensembling strategy to leverage multiple knowledge sources and suppress poorly formatted outputs. Comprehensive evaluations on schema and entity matching tasks demonstrate that KcMF outperforms previous non-LLM state-of-the-art (SOTA) methods by an average F1 score of 22.9% and competes effectively with SOTA fine-tuned LLMs. Moreover, KcMF generalizes well across different LLMs.

[LG-50] Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Link: https://arxiv.org/abs/2410.12476
Authors: Zerui Xu,Fang Wu,Tianfan Fu,Yue Zhao
Keywords-EN: Machine learning, clinical, clinical trials, Machine, synthetic clinical trial
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the this http URL database demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at this https URL.

[LG-51] Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

Link: https://arxiv.org/abs/2410.12474
Authors: Hongduan Tian,Feng Liu,Zhanke Zhou,Tongliang Liu,Chengqi Zhang,Bo Han
Keywords-EN: cross-domain few-shot classification, task-specific metric space, image instance embeddings, frozen pre-trained backbone, image instance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In cross-domain few-shot classification (CFC), recent works mainly focus on adapting a simple transformation head on top of a frozen pre-trained backbone with few labeled data to project embeddings into a task-specific metric space where classification can be performed by measuring similarities between image instance and prototype representations. Technically, an assumption implicitly adopted in such a framework is that the prototype and image instance embeddings share the same representation transformation. However, in this paper, we find that there naturally exists a gap, which resembles the modality gap, between the prototype and image instance embeddings extracted from the frozen pre-trained backbone, and simply applying the same transformation during the adaptation phase constrains exploring the optimal representations and shrinks the gap between prototype and image representations. To solve this problem, we propose a simple yet effective method, contrastive prototype-image adaptation (CoPA), to adapt different transformations respectively for prototypes and images similarly to CLIP by treating prototypes as text prompts. Extensive experiments on Meta-Dataset demonstrate that CoPA achieves the state-of-the-art performance more efficiently. Meanwhile, further analyses also indicate that CoPA can learn better representation clusters, enlarge the gap, and achieve minimal validation loss at the enlarged gap.

[LG-52] Challenges, Methods, Data – a Survey of Machine Learning in Water Distribution Networks (ICANN 2024)

Link: https://arxiv.org/abs/2410.12461
Authors: Valerie Vaquet,Fabian Hinder,André Artelt,Inaam Ashraf,Janine Strotherm,Jonas Vaquet,Johannes Brinkrolf,Barbara Hammer
Keywords-EN: Research on methods, gains increasing relevance, distribution networks gains, controlling water distribution, climate change
Categories: Machine Learning (cs.LG)
Comments: This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in Artificial Neural Networks and Machine Learning – ICANN 2024

Abstract:Research on methods for planning and controlling water distribution networks gains increasing relevance as the availability of drinking water will decrease as a consequence of climate change. So far, the majority of approaches is based on hydraulics and engineering expertise. However, with the increasing availability of sensors, machine learning techniques constitute a promising tool. This work presents the main tasks in water distribution networks, discusses how they relate to machine learning and analyses how the particularities of the domain pose challenges to and can be leveraged by machine learning approaches. Besides, it provides a technical toolkit by presenting evaluation benchmarks and a structured survey of the exemplary task of leakage detection and localization.

[LG-53] HELM: Hierarchical Encoding for mRNA Language Modeling

Link: https://arxiv.org/abs/2410.12459
Authors: Mehdi Yazdani-Jahromi,Mangal Prakash,Tommaso Mansi,Artem Moskalev,Rui Liao
Keywords-EN: Messenger RNA, impacting biological properties, structure directly impacting, directly impacting biological, plays a crucial
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Messenger RNA (mRNA) plays a crucial role in protein synthesis, with its codon structure directly impacting biological properties. While Language Models (LMs) have shown promise in analyzing biological sequences, existing approaches fail to account for the hierarchical nature of mRNA’s codon structure. We introduce Hierarchical Encoding for mRNA Language Modeling (HELM), a novel pre-training strategy that incorporates codon-level hierarchical structure into language model training. HELM modulates the loss function based on codon synonymity, aligning the model’s learning process with the biological reality of mRNA sequences. We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training as well as existing foundation model baselines on six diverse downstream property prediction tasks and an antibody region annotation task, on average by around 8%. Additionally, HELM enhances the generative capabilities of the language model, producing diverse mRNA sequences that better align with the underlying true data distribution compared to non-hierarchical baselines.

[LG-54] Sharpness-Aware Black-Box Optimization

Link: https://arxiv.org/abs/2410.12457
Authors: Feiyang Ye,Yueming Lyu,Xuehao Wang,Masashi Sugiyama,Yu Zhang,Ivor Tsang
Keywords-EN: including reinforcement learning, machine learning problems, Black-box optimization, Black-box optimization algorithms, black-box optimization methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 5 figures

Abstract:Black-box optimization algorithms have been widely used in various machine learning problems, including reinforcement learning and prompt fine-tuning. However, directly optimizing the training loss value, as commonly done in existing black-box optimization methods, could lead to suboptimal model quality and generalization performance. To address those problems in black-box optimization, we propose a novel Sharpness-Aware Black-box Optimization (SABO) algorithm, which applies a sharpness-aware minimization strategy to improve the model generalization. Specifically, the proposed SABO method first reparameterizes the objective function by its expectation over a Gaussian distribution. Then it iteratively updates the parameterized distribution by approximated stochastic gradients of the maximum objective value within a small neighborhood around the current solution in the Gaussian distribution space. Theoretically, we prove the convergence rate and generalization bound of the proposed SABO algorithm. Empirically, extensive experiments on the black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving model generalization performance.
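
The first ingredient mentioned in the abstract, replacing the objective by its expectation over a Gaussian and following stochastic gradient estimates of that expectation, can be sketched as below. This is only a generic Gaussian-smoothing (evolution-strategies-style) gradient estimator and explicitly not SABO itself: the sharpness-aware maximization over a neighborhood in distribution space is omitted, the covariance is fixed rather than learned, and the function names, test objective, step size, and sample count are all illustrative assumptions.

```python
import random

def smoothed_gradient(f, mu, sigma=0.1, n_samples=20, rng=None):
    """Estimate the gradient of J(mu) = E[f(mu + sigma * eps)], eps ~ N(0, I).

    Uses antithetic pairs (mu + sigma*eps, mu - sigma*eps) and only
    function evaluations of f, so f can be a black box.
    """
    rng = rng or random.Random(0)
    dim = len(mu)
    grad = [0.0] * dim
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([m + sigma * e for m, e in zip(mu, eps)])
        minus = f([m - sigma * e for m, e in zip(mu, eps)])
        scale = (plus - minus) / (2.0 * sigma)
        for d in range(dim):
            grad[d] += scale * eps[d] / n_samples
    return grad

def sphere(x):
    """Toy black-box objective (assumption for illustration only)."""
    return sum(v * v for v in x)

rng = random.Random(0)
mu = [3.0, -2.0]                 # mean of the search distribution
initial_value = sphere(mu)
for _ in range(100):
    g = smoothed_gradient(sphere, mu, rng=rng)
    mu = [m - 0.05 * gi for m, gi in zip(mu, g)]
final_value = sphere(mu)
```

On this toy quadratic the mean moves toward the minimizer at the origin; the paper's method would additionally seek flat regions, which this sketch makes no attempt to do.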

[LG-55] Training Neural Samplers with Reverse Diffusive KL Divergence

Link: https://arxiv.org/abs/2410.12456
Authors: Jiajun He,Wenlin Chen,Mingtian Zhang,David Barber,José Miguel Hernández-Lobato
Keywords-EN: unnormalized density functions, Training generative models, machine learning, unnormalized density, density functions
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages, 6 figures, 3 tables, 1 algorithm

Abstract:Training generative models to sample from unnormalized density functions is an important and challenging task in machine learning. Traditional training methods often rely on the reverse Kullback-Leibler (KL) divergence due to its tractability. However, the mode-seeking behavior of reverse KL hinders effective approximation of multi-modal target distributions. To address this, we propose to minimize the reverse KL along diffusion trajectories of both model and target densities. We refer to this objective as the reverse diffusive KL divergence, which allows the model to capture multiple modes. Leveraging this objective, we train neural samplers that can efficiently generate samples from the target distribution in one step. We demonstrate that our method enhances sampling performance across various Boltzmann distributions, including both synthetic multi-modal densities and n-body particle systems.

[LG-56] Loss Landscape Characterization of Neural Networks without Over-Parametrization

Link: https://arxiv.org/abs/2410.12455
Authors: Rustem Islamov,Niccolò Ajroldi,Antonio Orvieto,Aurelien Lucchi
Keywords-EN: Optimization methods play, remarkable empirical achievements, play a crucial, crucial role, Optimization methods
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Lojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
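
For readers unfamiliar with the Polyak-Lojasiewicz (PL) inequality mentioned in the abstract, its standard textbook form is given below, together with the classical linear convergence rate it implies for gradient descent. The symbols $f^*$ (the minimum value of $f$), $\mu$ (the PL constant), and $L$ (the smoothness constant) are standard notation and are not taken from the paper; the new function class the paper proposes is not reproduced here.

```latex
% PL inequality: f satisfies it with constant \mu > 0 if, for all x,
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl( f(x) - f^{*} \bigr).

% Classical consequence: if f is also L-smooth, gradient descent with
% step size 1/L, i.e. x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k), satisfies
f(x_{k}) - f^{*} \;\le\; \Bigl( 1 - \frac{\mu}{L} \Bigr)^{k} \bigl( f(x_{0}) - f^{*} \bigr).
```

Because the inequality forces every stationary point to be a global minimizer, it rules out saddle points, which is consistent with the abstract's emphasis on a function class that can include them.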

[LG-57] FairGLVQ: Fairness in Partition-Based Classification

Link: https://arxiv.org/abs/2410.12452
Authors: Felix Störck,Fabian Hinder,Johannes Brinkrolf,Benjamin Paassen,Valerie Vaquet,Barbara Hammer
Keywords-EN: objective throughout society, important objective, fair machine learning, machine learning, fair machine
Categories: Machine Learning (cs.LG)
Comments: This preprint has not undergone any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Self-Organizing Maps, Learning Vector Quantization, Interpretable Machine Learning, and Beyond

Abstract:Fairness is an important objective throughout society. From the distribution of limited goods such as education, over hiring and payment, to taxes, legislation, and jurisprudence. Due to the increasing importance of machine learning approaches in all areas of daily life including those related to health, security, and equity, an increasing amount of research focuses on fair machine learning. In this work, we focus on the fairness of partition- and prototype-based models. The contribution of this work is twofold: 1) we develop a general framework for fair machine learning of partition-based models that does not depend on a specific fairness definition, and 2) we derive a fair version of learning vector quantization (LVQ) as a specific instantiation. We compare the resulting algorithm against other algorithms from the literature on theoretical and real-world data showing its practical relevance.

[LG-58] Reconstruction of Differentially Private Text Sanitization via Large Language Models

Link: https://arxiv.org/abs/2410.12443
Authors: Shuchao Pang,Zhigang Lu,Haichen Wang,Peng Fu,Yongbin Zhou,Minhui Xue,Bo Li
Keywords-EN: large language models, facto privacy standard, Differential privacy, privacy leakage attacks, including many recently
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Differential privacy (DP) is the de facto privacy standard against privacy leakage attacks, including many recently discovered ones against large language models (LLMs). However, we discovered that LLMs could reconstruct the altered/removed privacy from given DP-sanitized prompts. We propose two attacks (black-box and white-box) based on the accessibility to LLMs and show that LLMs could connect the pair of DP-sanitized text and the corresponding private training data of LLMs by giving sample text pairs as instructions (in the black-box attacks) or fine-tuning data (in the white-box attacks). To illustrate our findings, we conduct comprehensive experiments on modern LLMs (e.g., LLaMA-2, LLaMA-3, ChatGPT-3.5, ChatGPT-4, ChatGPT-4o, Claude-3, Claude-3.5, OPT, GPT-Neo, GPT-J, Gemma-2, and Pythia) using commonly used datasets (such as WikiMIA, Pile-CC, and Pile-Wiki) against both word-level and sentence-level DP. The experimental results show promising recovery rates, e.g., the black-box attacks against the word-level DP over WikiMIA dataset gave 72.18% on LLaMA-2 (70B), 82.39% on LLaMA-3 (70B), 75.35% on Gemma-2, 91.2% on ChatGPT-4o, and 94.01% on Claude-3.5 (Sonnet). More urgently, this study indicates that these well-known LLMs have emerged as a new security risk for existing DP text sanitization approaches in the current environment.

[LG-59] ConLUX: Concept-Based Local Unified Explanations

Link: https://arxiv.org/abs/2410.12439
Authors: Junhao Liu,Haonan Yu,Xin Zhang
Keywords-EN: models, image models, model-agnostic explanation techniques, explanations, explanation techniques
Categories: Machine Learning (cs.LG)

Abstract:With the rapid advancements of various machine learning models, there is a significant demand for model-agnostic explanation techniques, which can explain these models across different architectures. Mainstream model-agnostic explanation techniques generate local explanations based on basic features (e.g., words for text models and (super-)pixels for image models). However, these explanations often do not align with the decision-making processes of the target models and end-users, resulting in explanations that are unfaithful and difficult for users to understand. On the other hand, concept-based techniques provide explanations based on high-level features (e.g., topics for text models and objects for image models), but most are model-specific or require additional pre-defined external concept knowledge. To address this limitation, we propose ConLUX, a general framework to provide concept-based local explanations for any machine learning models. Our key insight is that we can automatically extract high-level concepts from large pre-trained models, and uniformly extend existing local model-agnostic techniques to provide unified concept-based explanations. We have instantiated ConLUX on four different types of explanation techniques: LIME, Kernel SHAP, Anchor, and LORE, and applied these techniques to text and image models. Our evaluation results demonstrate that 1) compared to the vanilla versions, ConLUX offers more faithful explanations and makes them more understandable to users, and 2) by offering multiple forms of explanations, ConLUX outperforms state-of-the-art concept-based explanation techniques specifically designed for text and image models, respectively.

[LG-60] Approaching Metaheuristic Deep Learning Combos for Automated Data Mining

Link: https://arxiv.org/abs/2410.12435
Authors: Gustavo Assunção,Paulo Menezes
Keywords-EN: areas of research, machine learning, recurring issue, automated data mining, data
Categories: Machine Learning (cs.LG)
Comments: Tentative submission for data mining and knowledge discovery

Abstract:Lack of data on which to perform experimentation is a recurring issue in many areas of research, particularly in machine learning. The inability of most automated data mining techniques to be generalized to all types of data is inherently related with their dependency on those types which deems them ineffective against anything slightly different. Meta-heuristics are algorithms which attempt to optimize some solution independently of the type of data used, whilst classifiers or neural networks focus on feature extrapolation and dimensionality reduction to fit some model onto data arranged in a particular way. These two algorithmic fields encompass a group of characteristics which when combined are seemingly capable of achieving data mining regardless of how it is arranged. To this end, this work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining. Experiments on the MNIST dataset for handwritten digit recognition were performed and it was empirically observed that using a ground truth labeled dataset’s validation accuracy is inadequate for correcting labels of other previously unseen data instances.

[LG-61] Perseus: Leveraging Common Data Patterns with Curriculum Learning for More Robust Graph Neural Networks

Link: https://arxiv.org/abs/2410.12425
Authors: Kaiwen Xia,Huijun Wu,Duanyu Li,Min Xie,Ruibo Wang,Wenzhe Zhang
Keywords-EN: Graph Neural Networks, excel at handling, neural network models, remain vulnerable, handling graph data
Categories: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) excel at handling graph data but remain vulnerable to adversarial attacks. Existing defense methods typically rely on assumptions like graph sparsity and homophily to either preprocess the graph or guide structure learning. However, preprocessing methods often struggle to accurately distinguish between normal edges and adversarial perturbations, leading to suboptimal results due to the loss of valuable edge information. Robust graph neural network models train directly on graph data affected by adversarial perturbations, without preprocessing. This can cause the model to get stuck in poor local optima, negatively affecting its performance. To address these challenges, we propose Perseus, a novel adversarial defense method based on curriculum learning. Perseus assesses edge difficulty using global homophily and applies a curriculum learning strategy to adjust the learning order, guiding the model to learn the full graph structure while adaptively focusing on common data patterns. This approach mitigates the impact of adversarial perturbations. Experiments show that models trained with Perseus achieve superior performance and are significantly more robust to adversarial attacks.

[LG-62] Tracking Universal Features Through Fine-Tuning and Model Merging

Link: https://arxiv.org/abs/2410.12391
Authors: Niels Horn,Desmond Elliott
Keywords-EN: domains of text, base one-layer Transformer, one-layer Transformer language, Transformer language model, features emerge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus and a collection of Python code from The Stack. This base model is adapted to two new domains of text: TinyStories, and the Lua programming language, respectively; and then these two models are merged using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.
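
Spherical linear interpolation, the merging operation named in the abstract, can be sketched for two flat weight vectors as follows. This is a generic illustration, not the authors' merging code: the abstract does not say whether weights are merged per tensor or as one flattened vector, so the flat-list representation, the function name `slerp`, and the fallback to linear interpolation for near-parallel or zero vectors are assumptions.

```python
import math

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t = 0 returns w_a, t = 1 returns w_b. Falls back to plain linear
    interpolation when the angle between the vectors is (near) zero
    or when either vector has (near) zero norm.
    """
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    if norm_a < eps or norm_b < eps:
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    cos_omega = sum(a * b for a, b in zip(w_a, w_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)                # angle between the vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
    coef_a = math.sin((1 - t) * omega) / sin_omega
    coef_b = math.sin(t * omega) / sin_omega
    return [coef_a * a + coef_b * b for a, b in zip(w_a, w_b)]

# Hypothetical 2-d "weights" of the two fine-tuned models.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For unit-norm inputs the result stays on the unit sphere, unlike a plain average, which would shrink the norm of the merged vector.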

[LG-63] Towards Neural Scaling Laws for Time Series Foundation Models

Link: https://arxiv.org/abs/2410.12360
Authors: Qingren Yao,Chao-Han Huck Yang,Renhe Jiang,Yuxuan Liang,Ming Jin,Shirui Pan
Keywords-EN: offer valuable insights, time series foundation, laws offer valuable, series foundation models, Scaling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, compute budgets, and dataset sizes. Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling. The encoder-only Transformers demonstrate better scalability than the decoder-only Transformers, while the architectural enhancements in the two advanced TSFMs primarily improve ID performance but reduce OOD scalability. While scaling up TSFMs is expected to drive performance breakthroughs, the lack of a comprehensive understanding of TSFM scaling laws has hindered the development of a robust framework to guide model scaling. We fill this gap in this work by synthesizing our findings and providing practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.

[LG-64] Federated Temporal Graph Clustering

Link: https://arxiv.org/abs/2410.12343
Authors: Yang Liu,Zihao Zhou,Xianghong Xu,Qian Li
Keywords-EN: involves discovering meaningful, discovering meaningful structures, Temporal graph clustering, complex task, task that involves
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 1 figure

Abstract:Temporal graph clustering is a complex task that involves discovering meaningful structures in dynamic graphs where relationships and entities change over time. Existing methods typically require centralized data collection, which poses significant privacy and communication challenges. In this work, we introduce a novel Federated Temporal Graph Clustering (FTGC) framework that enables decentralized training of graph neural networks (GNNs) across multiple clients, ensuring data privacy throughout the process. Our approach incorporates a temporal aggregation mechanism to effectively capture the evolution of graph structures over time and a federated optimization strategy to collaboratively learn high-quality clustering representations. By preserving data privacy and reducing communication overhead, our framework achieves competitive performance on temporal graph datasets, making it a promising solution for privacy-sensitive, real-world applications involving dynamic data.

[LG-65] MAX: Masked Autoencoder for X-ray Fluorescence in Geological Investigation

Link: https://arxiv.org/abs/2410.12330
Authors: An-Sheng Lee, Yu-Wen Pao, Hsuan-Tien Lin, Sofia Ya Hsuan Liou
Keywords: deep learning approaches, application remains limited, learning approaches, de-facto procedure, procedure for deep
Categories: Machine Learning (cs.LG)

Abstract:Pre-training foundation models has become the de facto procedure for deep learning approaches, yet its application remains limited in geological studies, where model transferability is needed to break the shackles of data scarcity. Here we target X-ray fluorescence (XRF) scanning data, a standard high-resolution measurement in extensive scientific drilling projects. We propose a scalable self-supervised learner, masked autoencoders on XRF spectra (MAX), to pre-train a foundation model covering geological records from multiple regions of the Pacific and Southern Ocean. In pre-training, we find that masking a high proportion of the input spectrum (50%) yields a nontrivial and meaningful self-supervisory task. For downstream tasks, we select the quantification of XRF spectra into two costly geochemical measurements, CaCO₃ and total organic carbon, due to their importance in understanding the paleo-oceanic carbon system. Our results show that MAX, requiring only one-third of the data, outperforms models without pre-training in terms of quantification accuracy. Additionally, the model's generalizability improves by more than 60% in zero-shot tests on new materials, with explainability further ensuring its robustness. Thus, our approach offers a promising pathway to overcome data scarcity in geological discovery by leveraging the self-supervised foundation model and fast-acquired XRF scanning data.

[LG-66] Improved Anomaly Detection through Conditional Latent Space VAE Ensembles

Link: https://arxiv.org/abs/2410.12328
Authors: Oskar Åström, Alexandros Sopasakis
Keywords: unknown outlier classes, Conditional Latent space, space Variational Autoencoder, perform improved pre-processing, Variational Autoencoder
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR)
Comments: 13 pages of main article, 19 pages including references and appendix, 4 figures

Abstract:We propose a novel Conditional Latent space Variational Autoencoder (CL-VAE) to perform improved pre-processing for anomaly detection on data with known inlier classes and unknown outlier classes. This proposed variational autoencoder (VAE) improves latent space separation by conditioning on information within the data. The method fits a unique prior distribution to each class in the dataset, effectively expanding the classic prior distribution for VAEs to include a Gaussian mixture model. An ensemble of these VAEs is merged in the latent space to form a group consensus that greatly improves the accuracy of anomaly detection across data sets. Our approach is compared against the capabilities of a typical VAE, a CNN, and a PCA, with regard to AUC for anomaly detection. The proposed model shows increased accuracy in anomaly detection, achieving an AUC of 97.4% on the MNIST dataset compared to 95.7% for the second-best model. In addition, the CL-VAE shows increased benefits from ensembling, a more interpretable latent space, and an increased ability to learn patterns in complex data with limited model sizes.

[LG-67] Revisited Large Language Model for Time Series Analysis through Modality Alignment

Link: https://arxiv.org/abs/2410.12326
Authors: Liangwei Nathan Zheng, Chang George Dong, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Keywords: pivotal web applications, Large Language Models, time series tasks, time series, demonstrated impressive performance
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models have demonstrated impressive performance in many pivotal web applications such as sensor data analysis. However, since LLMs are not designed for time series tasks, simpler models like linear regressions can often achieve comparable performance with far less complexity. In this study, we perform extensive experiments to assess the effectiveness of applying LLMs to key time series tasks, including forecasting, classification, imputation, and anomaly detection. We compare the performance of LLMs against simpler baseline models, such as single-layer linear models and randomly initialized LLMs. Our results reveal that LLMs offer minimal advantages for these core time series tasks and may even distort the temporal structure of the data. In contrast, simpler models consistently outperform LLMs while requiring far fewer parameters. Furthermore, we analyze existing reprogramming techniques and show, through data manifold analysis, that these methods fail to effectively align time series data with language and display pseudo-alignment behaviour in embedding space. Our findings suggest that the performance of LLM-based methods in time series tasks arises from the intrinsic characteristics and structure of time series data, rather than any meaningful alignment with the language model architecture.

[LG-68] TPFL: A Trustworthy Personalized Federated Learning Framework via Subjective Logic

Link: https://arxiv.org/abs/2410.12316
Authors: Jinqian Chen, Jihua Zhu
Keywords: enables collaborative model, enables collaborative, preserving data privacy, Trustworthy Personalized Federated, Federated learning
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 17 Pages with Appendix

Abstract:Federated learning (FL) enables collaborative model training across distributed clients while preserving data privacy. Despite its widespread adoption, most FL approaches focusing solely on privacy protection fall short in scenarios where trustworthiness is crucial, necessitating advancements in secure training, dependable decision-making mechanisms, robustness on corruptions, and enhanced performance with Non-IID data. To bridge this gap, we introduce Trustworthy Personalized Federated Learning (TPFL) framework designed for classification tasks via subjective logic in this paper. Specifically, TPFL adopts a unique approach by employing subjective logic to construct federated models, providing probabilistic decisions coupled with an assessment of uncertainty rather than mere probability assignments. By incorporating a trainable heterogeneity prior to the local training phase, TPFL effectively mitigates the adverse effects of data heterogeneity. Model uncertainty and instance uncertainty are further utilized to ensure the safety and reliability of the training and inference stages. Through extensive experiments on widely recognized federated learning benchmarks, we demonstrate that TPFL not only achieves competitive performance compared with advanced methods but also exhibits resilience against prevalent malicious attacks, robustness on domain shifts, and reliability in high-stake scenarios.

[LG-69] DAT: Improving Adversarial Robustness via Generative Amplitude Mix-up in Frequency Domain

Link: https://arxiv.org/abs/2410.12307
Authors: Fengpeng Li, Kemou Li, Haiwei Wu, Jinyu Tian, Jiantao Zhou
Keywords: deep neural networks, protect deep neural, adversarial attacks, neural networks, protect deep
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:To protect deep neural networks (DNNs) from adversarial attacks, adversarial training (AT) is developed by incorporating adversarial examples (AEs) into model training. Recent studies show that adversarial attacks disproportionately impact the patterns within the phase of the sample’s frequency spectrum – typically containing crucial semantic information – more than those in the amplitude, resulting in the model’s erroneous categorization of AEs. We find that, by mixing the amplitude of training samples’ frequency spectrum with those of distractor images for AT, the model can be guided to focus on phase patterns unaffected by adversarial perturbations. As a result, the model’s robustness can be improved. Unfortunately, it is still challenging to select appropriate distractor images, which should mix the amplitude without affecting the phase patterns. To this end, in this paper, we propose an optimized Adversarial Amplitude Generator (AAG) to achieve a better tradeoff between improving the model’s robustness and retaining phase patterns. Based on this generator, together with an efficient AE production procedure, we design a new Dual Adversarial Training (DAT) strategy. Experiments on various datasets show that our proposed DAT leads to significantly improved robustness against diverse adversarial attacks.
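
The amplitude mix-up idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' DAT or AAG: it assumes a 1-D signal instead of an image, a fixed distractor instead of a generated one, and a hypothetical mixing weight `lam`. The function names `dft`, `idft`, and `amplitude_mixup` are chosen for this illustration only.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each reconstructed value."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def amplitude_mixup(sample, distractor, lam=0.5):
    """Blend the amplitude spectra of two signals while keeping the sample's phase.

    lam is a hypothetical mixing weight: 0 keeps the sample unchanged,
    1 uses only the distractor's amplitude.
    """
    mixed = []
    for s, d in zip(dft(sample), dft(distractor)):
        amplitude = (1.0 - lam) * abs(s) + lam * abs(d)
        # Rebuild each frequency bin from the mixed amplitude and the original phase.
        mixed.append(cmath.rect(amplitude, cmath.phase(s)))
    return idft(mixed)
```

With `lam=0` the sample is returned unchanged; larger values blend in more of the distractor's amplitude while the phase remains that of the original sample.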

[LG-70] Continuous Pupillography: A Case for Visual Health Ecosystem

Link: https://arxiv.org/abs/2410.12303
Authors: Usama Younus, Nirupam Roy
Keywords: ophthalmological diagnostic applications, cover pupillography, biomedical space, article aims, aims to cover
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:This article covers pupillography and its potential use in a number of ophthalmological diagnostic applications in the biomedical space. With the ever-increasing incorporation of technology into our daily lives and ever-growing active research into smart devices and technologies, we make a case for a health ecosystem that revolves around continuous eye monitoring. We summarize the design constraints and requirements for an IoT-based continuous pupil detection system and attempt to develop a pipeline for a wearable pupillographic device, while comparing two compact mini-camera modules currently available on the market. We use a light algorithm that can be directly adopted on current micro-controllers, and share our results for different lighting conditions and scenarios. Lastly, we present our findings, along with an analysis of the challenges faced and a way ahead towards successfully building this ecosystem.

[LG-71] Two Birds with One Stone: Multi-Task Semantic Communications Systems over Relay Channel

Link: https://arxiv.org/abs/2410.12302
Authors: Yujie Cao, Tong Wu, Zhiyong Chen, Yin Xu, Meixia Tao, Wenjun Zhang
Keywords: multi-link relay semantic, relay semantic communications, relay node, source node, relay node forwards
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to IEEE WCNC

Abstract:In this paper, we propose a novel multi-task, multi-link relay semantic communications (MTML-RSC) scheme that enables the destination node to simultaneously perform image reconstruction and classification with one transmission from the source node. In the MTML-RSC scheme, the source node broadcasts a signal using semantic communications, and the relay node forwards the signal to the destination. We analyze the coupling relationship between the two tasks and the two links (source-to-relay and source-to-destination) and design a semantic-focused forward method for the relay node, where it selectively forwards only the semantics of the relevant class while ignoring others. At the destination, the node combines signals from both the source node and the relay node to perform classification, and then uses the classification result to assist in decoding the signal from the relay node for image reconstructing. Experimental results demonstrate that the proposed MTML-RSC scheme achieves significant performance gains, e.g., 1.73 dB improvement in peak-signal-to-noise ratio (PSNR) for image reconstruction and increasing the accuracy from 64.89% to 70.31% for classification.

[LG-72] Conjunction Subspaces Test for Conformal and Selective Classification

Link: https://arxiv.org/abs/2410.12297
Authors: Zengyou He, Zerun Li, Junjie Dong, Xinying Liu, Mudi Jiang, Lianyu Hu
Keywords: integrates significance testing, significance testing results, yield consensus p-values, integrates significance, quantifying the uncertainty
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages, 9 figures

Abstract:In this paper, we present a new classifier, which integrates significance testing results over different random subspaces to yield consensus p-values for quantifying the uncertainty of classification decision. The null hypothesis is that the test sample has no association with the target class on a randomly chosen subspace, and hence the classification problem can be formulated as a problem of testing for the conjunction of hypotheses. The proposed classifier can be easily deployed for the purpose of conformal prediction and selective classification with reject and refine options by simply thresholding the consensus p-values. The theoretical analysis on the generalization error bound of the proposed classifier is provided and empirical studies on real data sets are conducted as well to demonstrate its effectiveness.

[LG-73] Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors

Link: https://arxiv.org/abs/2410.12295
Authors: Linwei Tao, Haolan Guo, Minjing Dong, Chang Xu
Keywords: deep learning applications, Expected Calibration Error, accurate confidence estimates, learning applications, autonomous driving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Calibration is crucial in deep learning applications, especially in fields like healthcare and autonomous driving, where accurate confidence estimates are vital for decision-making. However, deep neural networks often suffer from miscalibration, with reliability diagrams and Expected Calibration Error (ECE) being the only standard perspective for evaluating calibration performance. In this paper, we introduce the concept of consistency as an alternative perspective on model calibration, inspired by uncertainty estimation literature in large language models (LLMs). We highlight its advantages over the traditional reliability-based view. Building on this concept, we propose a post-hoc calibration method called Consistency Calibration (CC), which adjusts confidence based on the model's consistency across perturbed inputs. CC is particularly effective in local uncertainty estimation, as it requires no additional data samples or label information, instead generating input perturbations directly from the source data. Moreover, we show that performing perturbations at the logit level significantly improves computational efficiency. We validate the effectiveness of CC through extensive comparisons with various post-hoc and training-time calibration methods, demonstrating state-of-the-art performance on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as on long-tailed datasets like ImageNet-LT.
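
A rough sketch of the consistency idea at the logit level: perturb the logits many times and use the share of perturbed copies that keep the original prediction as the confidence. This is an illustration under stated assumptions, not the authors' CC implementation. The Gaussian noise model and the parameters `noise_std`, `n_samples`, and `seed` are hypothetical choices.

```python
import random


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def consistency_confidence(logits, noise_std=1.0, n_samples=200, seed=0):
    """Return (predicted class, fraction of noisy copies agreeing with it).

    Assumption: perturbations are i.i.d. Gaussian noise added to each logit.
    """
    rng = random.Random(seed)
    predicted = argmax(logits)
    agree = 0
    for _ in range(n_samples):
        noisy = [z + rng.gauss(0.0, noise_std) for z in logits]
        if argmax(noisy) == predicted:
            agree += 1
    return predicted, agree / n_samples
```

A prediction with a large logit margin keeps its label under almost every perturbation and so receives a confidence near 1, while a prediction with a narrow margin flips often and receives a lower confidence.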

[LG-74] Towards LLM-based Cognitive Models of Students with Misconceptions

Link: https://arxiv.org/abs/2410.12294
Authors: Shashank Sonkar, Xinghe Chen, Naiming Liu, Richard G. Baraniuk, Mrinmaya Sachan
Keywords: AI-driven educational technologies, Accurately modeling student, modeling student cognition, Accurately modeling, developing effective AI-driven
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Accurately modeling student cognition is crucial for developing effective AI-driven educational technologies. A key challenge is creating realistic student models that satisfy two essential properties: (1) accurately replicating specific misconceptions, and (2) correctly solving problems where these misconceptions are not applicable. This dual requirement reflects the complex nature of student understanding, where misconceptions coexist with correct knowledge. This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement and effectively simulate student thinking in algebra. We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns through a graph-based representation of algebraic problem-solving. Utilizing MalAlgoPy, we define and examine Cognitive Student Models (CSMs) - LLMs instruction tuned to faithfully emulate realistic student behavior. Our findings reveal that LLMs trained on misconception examples can efficiently learn to replicate errors. However, the training diminishes the model’s ability to solve problems correctly, particularly for problem types where the misconceptions are not applicable, thus failing to satisfy second property of CSMs. We demonstrate that by carefully calibrating the ratio of correct to misconception examples in the training data - sometimes as low as 0.25 - it is possible to develop CSMs that satisfy both properties. Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.

[LG-75] Discovering Leitmotifs in Multidimensional Time Series

Link: https://arxiv.org/abs/2410.12293
Authors: Patrick Schäfer, Ulf Leser
Keywords: carries symbolic significance, theme in literature, movies or music, recurring theme, symbolic significance
Categories: Machine Learning (cs.LG)

Abstract:A leitmotif is a recurring theme in literature, movies or music that carries symbolic significance for the piece it is contained in. When this piece can be represented as a multi-dimensional time series (MDTS), such as acoustic or visual observations, finding a leitmotif is equivalent to the pattern discovery problem, which is an unsupervised and complex problem in time series analytics. Compared to the univariate case, it carries additional complexity because patterns typically do not occur in all dimensions but only in a few - which are, however, unknown and must be detected by the method itself. In this paper, we present the novel, efficient and highly effective leitmotif discovery algorithm LAMA for MDTS. LAMA rests on two core principles: (a) a leitmotif manifests solely given a yet unknown number of sub-dimensions - neither too few, nor too many, and (b) the set of sub-dimensions is not independent of the best pattern found therein, necessitating both problems to be approached in a joint manner. In contrast to most previous methods, LAMA tackles both problems jointly - instead of independently selecting dimensions (or leitmotifs) and finding the best leitmotifs (or dimensions). Our experimental evaluation on a novel ground-truth annotated benchmark of 14 distinct real-life data sets shows that LAMA, when compared to four state-of-the-art baselines, shows superior performance in detecting meaningful patterns without increased computational complexity.

[LG-76] AI-Aided Kalman Filters

Link: https://arxiv.org/abs/2410.12289
Authors: Nir Shlezinger, Guy Revach, Anubhab Ghosh, Saikat Chatterjee, Shuo Tang, Tales Imbiriba, Jindrich Dunik, Ondrej Straka, Pau Closas, Yonina C. Eldar
Keywords: Kalman filter, signal processing, Kalman, celebrated algorithms, state estimation
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: Submitted to IEEE Signal Processing Magazine

Abstract:The Kalman filter (KF) and its variants are among the most celebrated algorithms in signal processing. These methods are used for state estimation of dynamic systems by relying on mathematical representations in the form of simple state-space (SS) models, which may be crude and inaccurate descriptions of the underlying dynamics. Emerging data-centric artificial intelligence (AI) techniques tackle these tasks using deep neural networks (DNNs), which are model-agnostic. Recent developments illustrate the possibility of fusing DNNs with classic Kalman-type filtering, obtaining systems that learn to track in partially known dynamics. This article provides a tutorial-style overview of design approaches for incorporating AI in aiding KF-type algorithms. We review both generic and dedicated DNN architectures suitable for state estimation, and provide a systematic presentation of techniques for fusing AI tools with KFs and for leveraging partial SS modeling and data, categorizing design approaches into task-oriented and SS model-oriented. The usefulness of each approach in preserving the individual strengths of model-based KFs and data-driven DNNs is investigated in a qualitative and quantitative study, whose code is publicly available, illustrating the gains of hybrid model-based/data-driven designs. We also discuss existing challenges and future research directions that arise from fusing AI and Kalman-type algorithms.
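
For readers unfamiliar with the model-based baseline this tutorial builds on, here is a minimal sketch of a textbook scalar Kalman filter. It assumes a random-walk state-space model (the state stays put up to process noise) and hypothetical noise variances `q` and `r`. It does not represent any of the hybrid AI-aided designs surveyed in the article.

```python
def kalman_1d(measurements, q=1e-3, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, z_t = x_t + v_t.

    q: process-noise variance, r: measurement-noise variance (both assumed known).
    Returns the filtered state estimate after each measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the random-walk model leaves the mean unchanged and inflates the variance.
        p = p + q
        # Update: the Kalman gain weighs the prediction against the new measurement.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

The quality of the estimate depends on how well `q` and `r` describe the real system, and these are the kinds of quantities that are hard to specify when the state-space model is only partially known.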

[LG-77] A Numerical Study of Chaotic Dynamics of K-S Equation with FNOs

Link: https://arxiv.org/abs/2410.12280
Authors: Surbhi Khetrapal, Jaswin Kasi
Keywords: financial market risk, predicting weather extremes, Solving non-linear partial, partial differential equations, solving partial differential
Categories: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments: 8 pages, 5 figures. Submitted to CASML 2024

Abstract:Solving non-linear partial differential equations which exhibit chaotic dynamics is an important problem with a wide-range of applications such as predicting weather extremes and financial market risk. Fourier neural operators (FNOs) have been shown to be efficient in solving partial differential equations (PDEs). In this work we demonstrate simulation of dynamics in the chaotic regime of the two-dimensional (2d) Kuramoto-Sivashinsky equation using FNOs. Particularly, we analyze the effect of Fourier mode cutoff on the results obtained by using FNOs vs those obtained using traditional PDE solvers. We compare the outputs using metrics such as the 2d power spectrum and the radial power spectrum. In addition we propose the normalised error power spectrum which measures the percentage error in the FNO model outputs. We conclude that FNOs capture the dynamics in the chaotic regime of the 2d K-S equation, provided the Fourier mode cutoff is kept sufficiently high.

[LG-78] Stress Assessment with Convolutional Neural Network Using PPG Signals

Link: https://arxiv.org/abs/2410.12273
Authors: Yasin Hasanpoor, Bahram Tarvirdizadeh, Khalil Alipour, Mohammad Ghamari
Keywords: nowadays lifestyle, main issues, issues of nowadays, Stress, stressful events
Categories: Machine Learning (cs.LG)
Comments: 5 figures, 2 tables

Abstract:Stress is one of the main issues of today's lifestyle. If it becomes chronic, it can have adverse effects on the human body. Thus, the early detection of stress is crucial to prevent its harmful effects on the human body and to support a healthier life. Stress can be assessed using physiological signals. To this end, photoplethysmography (PPG) is one of the most favorable physiological signals for stress assessment. This research focuses on developing a novel technique to assess stressful events using raw PPG signals recorded by the Empatica E4 sensor. To achieve this goal, an adaptive convolutional neural network (CNN) combined with a multilayer perceptron (MLP) is used to detect stressful events. The research uses the publicly available Wearable Stress and Affect Detection (WESAD) dataset to simulate the proposed model and to examine its advantages. The proposed model distinguishes between normal events and stressful events, detecting stressful events with an accuracy of 96.7%.

[LG-79] iFuzzyTL: Interpretable Fuzzy Transfer Learning for SSVEP BCI System

Link: https://arxiv.org/abs/2410.12267
Authors: Xiaowei Jiang, Beining Cao, Liang Ou, Yu-Cheng Chang, Thomas Do, Chin-Teng Lin
Keywords: Visual Evoked Potentials, Steady-State Visual Evoked, Evoked Potentials, Visual Evoked, Brain-Computer Interfaces
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:The rapid evolution of Brain-Computer Interfaces (BCIs) has significantly influenced the domain of human-computer interaction, with Steady-State Visual Evoked Potentials (SSVEP) emerging as a notably robust paradigm. This study explores advanced classification techniques leveraging interpretable fuzzy transfer learning (iFuzzyTL) to enhance the adaptability and performance of SSVEP-based systems. Recent efforts have strengthened to reduce calibration requirements through innovative transfer learning approaches, which refine cross-subject generalizability and minimize calibration through strategic application of domain adaptation and few-shot learning strategies. Pioneering developments in deep learning also offer promising enhancements, facilitating robust domain adaptation and significantly improving system responsiveness and accuracy in SSVEP classification. However, these methods often require complex tuning and extensive data, limiting immediate applicability. iFuzzyTL introduces an adaptive framework that combines fuzzy logic principles with neural network architectures, focusing on efficient knowledge transfer and domain adaptation. iFuzzyTL refines input signal processing and classification in a human-interpretable format by integrating fuzzy inference systems and attention mechanisms. This approach bolsters the model’s precision and aligns with real-world operational demands by effectively managing the inherent variability and uncertainty of EEG data. The model’s efficacy is demonstrated across three datasets: 12JFPM (89.70% accuracy for 1s with an information transfer rate (ITR) of 149.58), Benchmark (85.81% accuracy for 1s with an ITR of 213.99), and eldBETA (76.50% accuracy for 1s with an ITR of 94.63), achieving state-of-the-art results and setting new benchmarks for SSVEP BCI performance.

[LG-80] Game Theory Meets Statistical Mechanics in Deep Learning Design

Link: https://arxiv.org/abs/2410.12264
Authors: Djamel Bouchaffra, Fayçal Ykhlef, Bilal Faye, Hanane Azzag, Mustapha Lebbah
Keywords: seamlessly merges principles, deep graphical representation, graphical representation, representation that seamlessly, seamlessly merges
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:We present a novel deep graphical representation that seamlessly merges principles of game theory with laws of statistical mechanics. It performs feature extraction, dimensionality reduction, and pattern classification within a single learning framework. Our approach draws an analogy between neurons in a network and players in a game theory model. Furthermore, each neuron viewed as a classical particle (subject to statistical physics’ laws) is mapped to a set of actions representing specific activation value, and neural network layers are conceptualized as games in a sequential cooperative game theory setting. The feed-forward process in deep learning is interpreted as a sequential game, where each game comprises a set of players. During training, neurons are iteratively evaluated and filtered based on their contributions to a payoff function, which is quantified using the Shapley value driven by an energy function. Each set of neurons that significantly contributes to the payoff function forms a strong coalition. These neurons are the only ones permitted to propagate the information forward to the next layers. We applied this methodology to the task of facial age estimation and gender classification. Experimental results demonstrate that our approach outperforms both multi-layer perceptron and convolutional neural network models in terms of efficiency and accuracy.
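
The Shapley value mentioned in this abstract can be computed exactly for a small set of players by averaging each player's marginal contribution over all orderings. The sketch below shows that generic computation only. The toy payoff, the player names, and the selection threshold are hypothetical, since the abstract does not specify the energy-driven payoff used in the paper.

```python
from itertools import permutations


def shapley_values(players, payoff):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += payoff(with_p) - payoff(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}


def toy_payoff(coalition):
    """Hypothetical payoff: individual weights plus a bonus when 'a' and 'b' cooperate."""
    weights = {"a": 1.0, "b": 2.0, "c": 0.0}
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


values = shapley_values(["a", "b", "c"], toy_payoff)
# Hypothetical filtering rule: keep players whose contribution exceeds a threshold.
strong_coalition = sorted(p for p, v in values.items() if v > 0.5)
```

In this toy game the bonus is split evenly between `a` and `b`, and `c`, which contributes nothing, is filtered out. Exact enumeration grows factorially with the number of players, so it is only practical for very small sets.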

[LG-81] CATCH: Channel-Aware multivariate Time Series Anomaly Detection via Frequency Patching

Link: https://arxiv.org/abs/2410.12261
Authors: Xingjian Wu, Xiangfei Qiu, Zhengyu Li, Yihang Wang, Jilin Hu, Chenjuan Guo, Hui Xiong, Bin Yang
Keywords: multivariate time series, Anomaly detection, heterogeneous subsequence anomalies, anomalies may occur, detection in multivariate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Anomaly detection in multivariate time series is challenging as heterogeneous subsequence anomalies may occur. Reconstruction-based methods, which focus on learning normal patterns in the frequency domain to detect diverse abnormal subsequences, achieve promising results, while still falling short on capturing fine-grained frequency characteristics and channel correlations. To contend with the limitations, we introduce CATCH, a framework based on frequency patching. We propose to patchify the frequency domain into frequency bands, which enhances its ability to capture fine-grained frequency characteristics. To perceive appropriate channel correlations, we propose a Channel Fusion Module (CFM), which features a patch-wise mask generator and a masked-attention mechanism. Driven by a bi-level multi-objective optimization algorithm, the CFM is encouraged to iteratively discover appropriate patch-wise channel correlations, and to cluster relevant channels while isolating adverse effects from irrelevant channels. Extensive experiments on 9 real-world datasets and 12 synthetic datasets demonstrate that CATCH achieves state-of-the-art performance.

[LG-82] Optimizing YOLOv5s Object Detection through Knowledge Distillation algorithm

Link: https://arxiv.org/abs/2410.12259
Authors: Guanming Huang, Aoran Shen, Yuxiang Hu, Junliang Du, Jiacheng Hu, Yingbin Liang
Keywords: knowledge distillation technology, target detection tasks, student detection accuracy, knowledge distillation, detection tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This paper explores the application of knowledge distillation technology in target detection tasks, especially the impact of different distillation temperatures on the performance of student models. By using YOLOv5l as the teacher network and a smaller YOLOv5s as the student network, we found that with the increase of distillation temperature, the student’s detection accuracy gradually improved, and finally achieved mAP50 and mAP50-95 indicators that were better than the original YOLOv5s model at a specific temperature. Experimental results show that appropriate knowledge distillation strategies can not only improve the accuracy of the model but also help improve the reliability and stability of the model in practical applications. This paper also records in detail the accuracy curve and loss function descent curve during the model training process and shows that the model converges to a stable state after 150 training cycles. These findings provide a theoretical basis and technical reference for further optimizing target detection algorithms.
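
To make the role of the distillation temperature concrete, here is a minimal sketch of the classic temperature-softened distillation loss for classification logits. It is an assumption that this form applies here: the abstract does not state the exact loss used for YOLOv5 detection outputs, and the function names and default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The T^2 factor is the conventional rescaling that keeps gradient magnitudes comparable.
    return temperature ** 2 * kl
```

A higher temperature flattens the teacher's distribution, so the student also receives signal about which non-top classes the teacher considers plausible.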

[LG-83] Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts

Link: https://arxiv.org/abs/2410.12258
Authors: Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho
Keywords: pre-trained model, prompt learning, prompt learning problem, large-scaled pre-trained model, parameter estimation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Fanqi Yan, Huy Nguyen, Dung Le contributed equally to this work. 70 pages, 6 figures, 1 table

Abstract:We conduct the convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated from the prompt learning problem where ones utilize prompts, which can be formulated as experts, to fine-tune a large-scaled pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of the pre-trained model and the prompt may converge to zero where the prompt vanishes during the training; (ii) the algebraic interaction among parameters of the pre-trained model and the prompt can occur via some partial differential equation and decelerate the prompt learning. In response, we introduce a distinguishability condition to control the previous parameter interaction. Additionally, we also consider various types of expert structures to understand their effects on the parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds.

[LG-84] Irregularity-Informed Time Series Analysis: Adaptive Modelling of Spatial and Temporal Dynamics

Link: https://arxiv.org/abs/2410.12257
Authors: Liangwei Nathan Zheng, Zhengyang Li, Chang George Dong, Wei Emma Zhang, Lin Yue, Miao Xu, Olaf Maennel, Weitong Chen
Keywords: Irregular Time Series, Time Series Data, Time Series, Natural Irregular Time, Accidental Irregular Time
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Irregular Time Series Data (IRTS) has shown increasing prevalence in real-world applications. We observe that IRTS can be divided into two specialized types: Natural Irregular Time Series (NIRTS) and Accidental Irregular Time Series (AIRTS). Existing methods either ignore the impact of irregular patterns or statically learn the irregular dynamics of NIRTS and AIRTS data, and they suffer from limited data availability due to the sparsity of IRTS. We propose a novel transformer-based framework for general irregular time series data that treats IRTS from four views (Locality, Time, Spatio, and Irregularity) to make the fullest use of the available data. Moreover, we design an irregularity-gate mechanism that adaptively selects task-relevant information from irregularity, which improves generalization to various IRTS data. We conduct extensive experiments to demonstrate the robustness of our approach on three datasets with high missing ratios (88.4%, 94.9%, and 60% missing values) and investigate the significance of irregularity information for both NIRTS and AIRTS through an additional ablation study. We release our implementation in this https URL

[LG-85] Dual Action Policy for Robust Sim-to-Real Reinforcement Learning

Link: https://arxiv.org/abs/2410.12250
Authors: Ng Wen Zheng Terence, Chen Jianda
Keywords: paper presents Dual, presents Dual Action, Dual Action Policy, dynamics mismatch inherent, presents Dual
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:This paper presents Dual Action Policy (DAP), a novel approach to address the dynamics mismatch inherent in the sim-to-real gap of reinforcement learning. DAP uses a single policy to predict two sets of actions: one for maximizing task rewards in simulation and another specifically for domain adaptation via reward adjustments. This decoupling makes it easier to maximize the overall reward in the source domain during training. Additionally, DAP incorporates uncertainty-based exploration during training to enhance agent robustness. Experimental results demonstrate DAP’s effectiveness in bridging the sim-to-real gap, outperforming baselines on challenging tasks in simulation, and further improvement is achieved by incorporating uncertainty estimation.

[LG-86] Devil in the Tail: A Multi-Modal Framework for Drug-Drug Interaction Prediction in Long Tail Distinction

Link: https://arxiv.org/abs/2410.12249
Authors: Liangwei Nathan Zheng, Chang George Dong, Wei Emma Zhang, Xin Chen, Lin Yue, Weitong Chen
Keywords: Drug-drug interaction, pharmacology research, DDI, crucial aspect, aspect of pharmacology
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Drug-drug interaction (DDI) identification is a crucial aspect of pharmacology research. There are hundreds of DDI types, and they do not occur with equal frequency. Some rarely occurring DDI types are high risk and could be life-critical if overlooked, exemplifying the long-tailed distribution problem. Existing models falter against this distribution challenge and overlook the multi-faceted nature of drugs in DDI prediction. In this paper, a novel multi-modal deep learning-based framework, namely TFDM, is introduced to leverage multiple properties of a drug to achieve DDI classification. The proposed framework fuses multimodal features of drugs, including graph-based, molecular structure, Target, and Enzyme features, for DDI identification. To tackle the challenge posed by the distribution skewness across categories, a novel loss function called Tailed Focal Loss is introduced, aimed at further enhancing model performance and addressing the gradient vanishing problem of focal loss on extremely long-tailed datasets. Extensive experiments on 4 challenging long-tailed datasets demonstrate that TFDM outperforms the most recent SOTA methods in long-tailed DDI classification tasks. The source code is released to reproduce our experiment results: this https URL
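For context on the loss that the abstract above builds on, here is a minimal sketch of the standard focal loss, -(1 - p_t)^gamma * log(p_t), compared with plain cross-entropy. It is not the paper's Tailed Focal Loss, whose exact form is not given in the abstract; the probabilities and the default gamma are illustrative assumptions.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one sample, given the predicted probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Standard focal loss: down-weights samples that are already well classified."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# Hypothetical predicted probabilities of the true class.
for p in (0.1, 0.5, 0.9, 0.99):
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 6))
```

The modulating factor shrinks the loss of confident predictions sharply, and that shrinking loss signal is what the abstract refers to as the gradient vanishing problem on extremely long-tailed data.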

[LG-87] Transfer Learning on Multi-Dimensional Data: A Novel Approach to Neural Network-Based Surrogate Modeling

Link: https://arxiv.org/abs/2410.12241
Authors: Adrienne M. Propp, Daniel M. Tartakovsky
Keywords: partial differential equations, differential equations, modeling of complex, development of efficient, partial differential
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:The development of efficient surrogates of partial differential equations (PDEs) is a critical step towards scalable modeling of complex, multiscale systems-of-systems. Convolutional neural networks (CNNs) have gained popularity as the basis for such surrogate models due to their success in capturing high-dimensional input-output mappings and the negligible cost of a forward pass. However, the high cost of generating training data – typically via classical numerical solvers – raises the question of whether these models are worth pursuing over more straightforward alternatives with well-established theoretical foundations, such as Monte Carlo methods. To reduce the cost of data generation, we propose training a CNN surrogate model on a mixture of numerical solutions to both the d-dimensional problem and its (d-1)-dimensional approximation, taking advantage of the efficiency savings guaranteed by the curse of dimensionality. We demonstrate our approach on a multiphase flow test problem, using transfer learning to train a dense fully-convolutional encoder-decoder CNN on the two classes of data. Numerical results from a sample uncertainty quantification task demonstrate that our surrogate model outperforms Monte Carlo with several times the data generation budget.

[LG-88] Off-dynamics Conditional Diffusion Planners

Link: https://arxiv.org/abs/2410.12238
Authors: Wen Zheng Terence Ng, Jianda Chen, Tianwei Zhang
Keywords: Offline Reinforcement Learning, Reinforcement Learning, interactive data acquisition, Offline Reinforcement, leveraging pre-existing datasets
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Offline Reinforcement Learning (RL) offers an attractive alternative to interactive data acquisition by leveraging pre-existing datasets. However, its effectiveness hinges on the quantity and quality of the data samples. This work explores the use of more readily available, albeit off-dynamics datasets, to address the challenge of data scarcity in Offline RL. We propose a novel approach using conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of the large-scale off-dynamics dataset and the limited target dataset. To enable the model to capture the underlying dynamics structure, we introduce two contexts for the conditional model: (1) a continuous dynamics score allows for partial overlap between trajectories from both datasets, providing the model with richer information; (2) an inverse-dynamics context guides the model to generate trajectories that adhere to the target environment’s dynamic constraints. Empirical results demonstrate that our method significantly outperforms several strong baselines. Ablation studies further reveal the critical role of each dynamics context. Additionally, our model demonstrates that by modifying the context, we can interpolate between source and target dynamics, making it more robust to subtle shifts in the environment.

[LG-89] Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay

Link: https://arxiv.org/abs/2410.12236
Authors: Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, Xiaoguang Niu
Keywords: Large Language Models, Nowadays transformer-based Large, transformer-based Large Language, Large Language, code generation tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transformer-based Large Language Models (LLMs) for code generation usually rely on sampling-and-filtering pipelines. Because a single incorrect token can make a program fail, rewards in code generation are sparse, and transformer-based models sample many redundant programs until they find a correct one, which is inefficient. To overcome this challenge, we incorporate Experience Replay (ER) in the fine-tuning phase, where the programs produced are stored and replayed to give the LLM agent a chance to learn from past experiences. In the spirit of ER, we introduce a novel approach called the BTP pipeline, which consists of three phases: beam search sampling, testing, and prioritized experience replay. The approach makes use of failed programs collected by code models and replays programs with a high Possibility and Pass-rate Prioritized value (P2Value) from the replay buffer to improve efficiency. P2Value jointly considers the possibility of the transformer's output and the pass rate, and it makes use of the otherwise wasted resources arising from the fact that most programs collected by LLMs fail to pass any tests. We apply our approach to several LLMs and show empirically that it enhances their performance in code generation tasks and surpasses existing baselines.

[LG-90] Improving the Generalization of Unseen Crowd Behaviors for Reinforcement Learning based Local Motion Planners

Link: https://arxiv.org/abs/2410.12232
Authors: Wen Zheng Terence Ng, Jianda Chen, Sinno Jialin Pan, Tianwei Zhang
Keywords: safe mobile robot, mobile robot policy, Deploying a safe, Current Reinforcement Learning-based, Reinforcement Learning-based motion
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deploying a safe mobile robot policy in scenarios with human pedestrians is challenging due to their unpredictable movements. Current Reinforcement Learning-based motion planners rely on a single policy to simulate pedestrian movements and could suffer from the over-fitting issue. Alternatively, framing the collision avoidance problem as a multi-agent framework, where agents generate dynamic movements while learning to reach their goals, can lead to conflicts with human pedestrians due to their homogeneity. To tackle this problem, we introduce an efficient method that enhances agent diversity within a single policy by maximizing an information-theoretic objective. This diversity enriches each agent’s experiences, improving its adaptability to unseen crowd behaviors. In assessing an agent’s robustness against unseen crowds, we propose diverse scenarios inspired by pedestrian crowd behaviors. Our behavior-conditioned policies outperform existing works in these challenging scenes, reducing potential collisions without additional time or travel. Related DOI: https://doi.org/10.1109/ICRA57147.2024.10610641

[LG-91] Causally-Aware Unsupervised Feature Selection Learning

Link: https://arxiv.org/abs/2410.12224
Authors: Zongxin Shen, Yanyong Huang, Minbo Ma, Tianrui Li
Keywords: recently gained attention, processing unlabeled high-dimensional, unlabeled high-dimensional data, Unsupervised feature selection, feature selection
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization.

[LG-92] Exploring the impact of virtual reality user engagement on tourist behavioral response integrated an environment concern of touristic travel perspective: A new hybrid machine learning approach

Link: https://arxiv.org/abs/2410.12223
Authors: D. W. Shang
Keywords: provide tours product, user engagement, in-person tour intentions, key user engagement, user engagement drivers
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Owing to the impact of the COVID-19 pandemic, compelling sites have tended to adopt new ways, such as virtual reality (VR), of providing tour products and services to visitors. Drawing on a systematic human-computer interaction (HCI) view of user engagement and on narrative transportation theory, we develop and test a theoretical framework using a hybrid partial least squares structural equation model (PLS-SEM) and artificial neural network (ANN) machine learning approach that examines key user engagement drivers of visitors’ imagery and in-person tour intentions (ITI) during COVID-19. Further, we propose a novel hybrid approach called Reflective and Formative PLS-SEM-ANN (FRPSA), which considers both reflective and second-order formative constructs in PLS-SEM and exploits their different advantages in a complex model. Based on a sample of visitors’ responses, the results demonstrate that a) user engagement, including felt involvement, aesthetic appeal, perceived usability, focused attention, endurability, and novelty, all directly affect in-person tour intentions; b) environment concern of touristic travel (EC) positively moderates the relationships between user engagement and ITI; c) EC negatively moderates the relationships between imagery and ITI; d) imagery mediates the effect of user engagement on ITI; e) felt involvement and aesthetic appeal show both linear significance and nonlinear importance. Finally, contributions to theory and practical implications are discussed.

[LG-93] EdgeRL: Reinforcement Learning-driven Deep Learning Model Inference Optimization at Edge

Link: https://arxiv.org/abs/2410.12221
Authors: Motahare Mounesan, Xiaojie Zhang, Saptarshi Debroy
Keywords: Balancing mutually diverging, ad-hoc edge environments, mutually diverging performance, Balancing mutually, diverging performance metrics
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Balancing mutually diverging performance metrics, such as processing latency, outcome accuracy, and end device energy consumption, is a challenging undertaking for deep learning model inference in ad-hoc edge environments. In this paper, we propose the EdgeRL framework, which seeks to strike such a balance by using an Advantage Actor-Critic (A2C) Reinforcement Learning (RL) approach that chooses optimal run-time DNN inference parameters and aligns the performance metrics with the application requirements. Using a real-world deep learning model and a hardware testbed, we evaluate the benefits of the EdgeRL framework in terms of end device energy savings, inference accuracy improvement, and end-to-end inference latency reduction.

[LG-94] Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions

Link: https://arxiv.org/abs/2410.12207
Authors: Xianren Zhang, Xianfeng Tang, Hui Liu, Zongyu Wu, Qi He, Dongwon Lee, Suhang Wang
Keywords: Recent studies show, Recent studies, struggle to follow, complex instructions, follow complex instructions
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Click to view abstract

Abstract:Recent studies show that LLMs, particularly open-source models, struggle to follow complex instructions with multiple constraints. Despite the importance, methods to improve LLMs’ adherence to such constraints remain unexplored, and current research focuses on evaluating this ability rather than developing solutions. While a few studies enhance constraint adherence through model tuning, this approach is computationally expensive and heavily reliant on training data quality. An alternative is to leverage LLMs’ self-correction capabilities, allowing them to adjust responses to better meet specified constraints. However, this self-correction ability of LLMs is limited by the feedback quality, as LLMs cannot autonomously generate reliable feedback or detect errors. Moreover, the self-refinement process heavily depends on few-shot examples that illustrate how to modify responses to meet constraints. As constraints in complex instructions are diverse and vary widely, manually crafting few-shot examples for each constraint type can be labor-intensive and sub-optimal. To deal with these two challenges, we propose the Divide-Verify-Refine (DVR) framework with three steps: (1) Divide complex instructions into single constraints and prepare appropriate tools; (2) Verify: To address the feedback quality problem, these tools will rigorously verify responses and provide reliable feedback; (3) Refine: To address the constraint diversity challenge, we design a refinement repository that collects successful refinement processes and uses them as few-shot demonstrations for future cases, allowing LLMs to learn from the past experience during inference. Additionally, we develop a new dataset of complex instructions, each containing 1-6 constraints. Experiments show that the framework significantly improves performance, doubling LLama3.1-8B’s constraint adherence on instructions with 6 constraints.

[LG-95] Abnormality Forecasting: Time Series Anomaly Prediction via Future Context Modeling KDD

Link: https://arxiv.org/abs/2410.12206
Authors: Sinong Zhao, Wenrui Wang, Hongzuo Xu, Zhaoyang Yu, Qingsong Wen, Gang Wang, xiaoguang Liu, Guansong Pang
Keywords: Identifying anomalies, intelligent operation, operation and maintenance, space exploration, series data plays
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, submitted to KDD conference

Click to view abstract

Abstract:Identifying anomalies in time series data plays an important role in various fields such as infrastructure security, intelligent operation and maintenance, and space exploration. Current research focuses on detecting anomalies after they occur, which can lead to significant financial/reputation loss or infrastructure damage. In this work we instead study a more practical yet very challenging problem, time series anomaly prediction, which aims to provide early warnings for abnormal events before they occur. To tackle this problem, we introduce a novel principled approach, namely future context modeling (FCM). Its key insight is that future abnormal events in a target window can be accurately predicted if the preceding observation window exhibits any subtle difference from normal data. To effectively capture such differences, FCM first leverages long-term forecasting models to generate a discriminative future context based on the observation data, aiming to amplify those subtle but unusual differences. It then models a normality correlation between the observation data and the forecasted future context to complement the normality modeling of the observation data in foreseeing possible abnormality in the target window. A joint variate-time attention learning is also introduced in FCM to leverage both temporal signals and features of the time series data for more discriminative normality modeling in the aforementioned two views. Comprehensive experiments on five datasets demonstrate that FCM achieves a good recall rate (70%+) on multiple datasets and significantly outperforms all baselines in F1 score. Code is available at this https URL.

[LG-96] SAT: Data-light Uncertainty Set Merging via Synthetics Aggregation and Test Inversion

Link: https://arxiv.org/abs/2410.12201
Authors: Shenghao Qin, Jianliang He, Bowen Gang, Yin Xia
Keywords: presents challenges, potential dependencies, diverse applications, sets, uncertainty sets
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:The integration of uncertainty sets has diverse applications but also presents challenges, particularly when only initial sets and their control levels are available, along with potential dependencies. Examples include merging confidence sets from different distributed sites with communication constraints, as well as combining conformal prediction sets generated by different learning algorithms or data splits. In this article, we introduce an efficient and flexible Synthetic, Aggregation, and Test inversion (SAT) approach to merge various potentially dependent uncertainty sets into a single set. The proposed method constructs a novel class of synthetic test statistics, aggregates them, and then derives merged sets through test inversion. Our approach leverages the duality between set estimation and hypothesis testing, ensuring reliable coverage in dependent scenarios. The procedure is data-light, meaning it relies solely on initial sets and control levels without requiring raw data, and it adapts to any user-specified initial uncertainty sets, accommodating potentially varying coverage levels. Theoretical analyses and numerical experiments confirm that SAT provides finite-sample coverage guarantees and achieves small set sizes.

[LG-97] Potential-Based Intrinsic Motivation: Preserving Optimality With Complex Non-Markovian Shaping Rewards

Link: https://arxiv.org/abs/2410.12197
Authors: Grant C. Forbes, Leonardo Villalobos-Arias, Jianxun Wang, Arnav Jhala, David L. Roberts
Keywords: set of optimal, intrinsic motivation, Cliff Walking environments, set, Potential-Based Intrinsic Motivation
Subjects: Machine Learning (cs.LG)
Comments: To be submitted to the joint AIJ-JAIR special track for award-winning papers. arXiv admin note: substantial text overlap with arXiv:2402.07411

Click to view abstract

Abstract:Recently there has been a proliferation of intrinsic motivation (IM) reward-shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for. We present an extension to PBRS that we prove preserves the set of optimal policies under a more general set of functions than has been previously proven. We also present Potential-Based Intrinsic Motivation (PBIM) and Generalized Reward Matching (GRM), methods for converting IM rewards into a potential-based form that are useable without altering the set of optimal policies. Testing in the MiniGrid DoorKey and Cliff Walking environments, we demonstrate that PBIM and GRM successfully prevent the agent from converging to a suboptimal policy and can speed up training. Additionally, we prove that GRM is sufficiently general as to encompass all potential-based reward shaping functions. This paper expands on previous work introducing the PBIM method, and provides an extension to the more general method of GRM, as well as additional proofs, experimental results, and discussion.
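As background for the abstract above, the following sketch illustrates classic potential-based reward shaping, F(s, s') = gamma * Phi(s') - Phi(s), and why it leaves the ranking of trajectories unchanged: the shaping terms telescope to an offset that depends only on the start and end potentials. It does not implement PBIM or GRM; the rewards and potential values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def shaped_rewards(rewards, potentials, gamma):
    """Add gamma * Phi(s_{t+1}) - Phi(s_t) to each reward.

    potentials[t] is Phi(s_t), so it needs len(rewards) + 1 entries.
    """
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]

gamma = 0.9
rewards = [0.0, 0.0, 1.0]            # hypothetical sparse-reward trajectory
potentials = [0.2, 0.5, 0.8, 0.0]    # hypothetical Phi values, terminal Phi = 0

plain = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(rewards, potentials, gamma), gamma)
# Telescoping sum: the difference is gamma^T * Phi(s_T) - Phi(s_0).
offset = gamma ** len(rewards) * potentials[-1] - potentials[0]
print(plain, shaped, shaped - plain, offset)
```

When the terminal potential is zero, every trajectory from the same start state is shifted by the same constant -Phi(s_0), so the set of optimal policies is unchanged.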

[LG-98] LPUF-AuthNet: A Lightweight PUF-Based IoT Authentication via Tandem Neural Networks and Split Learning

Link: https://arxiv.org/abs/2410.12190
Authors: Brahim Mefgouda, Raviha Khan, Omar Alhussein, Hani Saleh, Hossien B. Eldeeb, Anshul Pandey, Sami Muhaidat
Keywords: billion devices globally, internet of things, fundamentally altering, rural settings, projected to connect
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted to Proc. IEEE Globecom 2024

Click to view abstract

Abstract:By 2025, the internet of things (IoT) is projected to connect over 75 billion devices globally, fundamentally altering how we interact with our environments in both urban and rural settings. However, IoT device security remains challenging, particularly in the authentication process. Traditional cryptographic methods often struggle with the constraints of IoT devices, such as limited computational power and storage. This paper considers physical unclonable functions (PUFs) as robust security solutions, utilizing their inherent physical uniqueness to authenticate devices securely. However, traditional PUF systems are vulnerable to machine learning (ML) attacks and burdened by large datasets. Our proposed solution introduces a lightweight PUF mechanism, called LPUF-AuthNet, combining tandem neural networks (TNN) with a split learning (SL) paradigm. The proposed approach provides scalability, supports mutual authentication, and enhances security by resisting various types of attacks, paving the way for secure integration into future 6G technologies.

[LG-99] DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

Link: https://arxiv.org/abs/2410.12187
Authors: Yingsong Luo, Ling Chen
Keywords: Large language models, face deployment challenges, deployment challenges due, Large language, hardware constraints
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Click to view abstract

Abstract:Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters (i.e., scale and zero-point) based on the impact of weights on the model output. Experiments on LLaMA and LLaMA-2 show that DAQ consistently outperforms the best baseline method, reducing perplexity loss by an average of 22.8% on LLaMA and 19.6% on LLaMA-2. Our code is available at this https URL.
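The quantization parameters named in the abstract above (scale and zero-point) can be made concrete with a minimal sketch of plain uniform weight quantization. It derives the dynamic range from the min and max weight, which is an assumption for illustration. DAQ itself re-centers and learns the range, and that part is not reproduced here. All function names and weight values are hypothetical.

```python
def quant_params(w_min, w_max, bits=4):
    """Scale and zero-point mapping [w_min, w_max] onto integer codes 0..2^bits - 1.

    Assumes w_min < w_max.
    """
    qmax = 2 ** bits - 1
    scale = (w_max - w_min) / qmax
    zero_point = round(-w_min / scale)
    return scale, zero_point

def quantize(weights, scale, zero_point, bits=4):
    """Round each weight to the nearest code and clamp to the valid range."""
    qmax = 2 ** bits - 1
    return [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]

def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]

weights = [-0.8, -0.1, 0.0, 0.05, 0.3, 1.2]   # hypothetical weight group
scale, zero_point = quant_params(min(weights), max(weights), bits=4)
codes = quantize(weights, scale, zero_point, bits=4)
restored = dequantize(codes, scale, zero_point)
print(scale, zero_point, codes)
print([round(v, 4) for v in restored])
```

With a min/max range, a few outlier weights stretch the scale and waste resolution on sparsely populated regions, which is the kind of mismatch a density-aware range is meant to reduce.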

[LG-100] ExoTST: Exogenous-Aware Temporal Sequence Transformer for Time Series Prediction ICDM2024

Link: https://arxiv.org/abs/2410.12184
Authors: Kshitij Tayal, Arvind Renganathan, Xiaowei Jia, Vipin Kumar, Dan Lu
Keywords: Accurate long-term predictions, machine learning applications, Accurate long-term, decision-making processes, current exogenous variables
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICDM 2024

Click to view abstract

Abstract:Accurate long-term predictions are the foundations for many machine learning applications and decision-making processes. Traditional time series approaches for prediction often focus on either autoregressive modeling, which relies solely on past observations of the target “endogenous variables”, or forward modeling, which considers only current covariate drivers “exogenous variables”. However, effectively integrating past endogenous and past exogenous with current exogenous variables remains a significant challenge. In this paper, we propose ExoTST, a novel transformer-based framework that effectively incorporates current exogenous variables alongside past context for improved time series prediction. To integrate exogenous information efficiently, ExoTST leverages the strengths of attention mechanisms and introduces a novel cross-temporal modality fusion module. This module enables the model to jointly learn from both past and current exogenous series, treating them as distinct modalities. By considering these series separately, ExoTST provides robustness and flexibility in handling data uncertainties that arise from the inherent distribution shift between historical and current exogenous variables. Extensive experiments on real-world carbon flux datasets and time series benchmarks demonstrate ExoTST’s superior performance compared to state-of-the-art baselines, with improvements of up to 10% in prediction accuracy. Moreover, ExoTST exhibits strong robustness against missing values and noise in exogenous drivers, maintaining consistent performance in real-world situations where these imperfections are common.

[LG-101] Model Balancing Helps Low-data Training and Fine-tuning EMNLP2024

Link: https://arxiv.org/abs/2410.12178
Authors: Zihang Liu, Yuanzhe Hu, Tianyu Pang, Yefan Zhou, Pu Ren, Yaoqing Yang
Keywords: Recent advances, align pre-trained models, curated datasets, domains using small, align pre-trained
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: EMNLP 2024 Oral. First two authors contributed equally

Click to view abstract

Abstract:Recent advances in foundation models have emphasized the need to align pre-trained models with specialized domains using small, curated datasets. Studies on these foundation models underscore the importance of low-data training and fine-tuning. This topic, well-known in natural language processing (NLP), has also gained increasing attention in the emerging field of scientific machine learning (SciML). To address the limitations of low-data training and fine-tuning, we draw inspiration from Heavy-Tailed Self-Regularization (HT-SR) theory, analyzing the shape of empirical spectral densities (ESDs) and revealing an imbalance in training quality across different model layers. To mitigate this issue, we adapt a recently proposed layer-wise learning rate scheduler, TempBalance, which effectively balances training quality across layers and enhances low-data training and fine-tuning for both NLP and SciML tasks. Notably, TempBalance demonstrates increasing performance gains as the amount of available tuning data decreases. Comparative analyses further highlight the effectiveness of TempBalance and its adaptability as an “add-on” method for improving model performance.

[LG-102] Expected Sliced Transport Plans

Link: https://arxiv.org/abs/2410.12176
Authors: Xinran Liu, Rocío Díaz Martín, Yikun Bai, Ashkan Shahbazi, Matthew Thorpe, Akram Aldroubi, Soheil Kolouri
Keywords: gained significant traction, modern machine learning, Wasserstein distances, provide versatile metrics, determine optimal couplings
Subjects: Machine Learning (cs.LG); Metric Geometry (math.MG)
Comments:

Click to view abstract

Abstract:The optimal transport (OT) problem has gained significant traction in modern machine learning for its ability to: (1) provide versatile metrics, such as Wasserstein distances and their variants, and (2) determine optimal couplings between probability measures. To reduce the computational complexity of OT solvers, methods like entropic regularization and sliced optimal transport have been proposed. The sliced OT framework improves efficiency by comparing one-dimensional projections (slices) of high-dimensional distributions. However, despite their computational efficiency, sliced-Wasserstein approaches lack a transportation plan between the input measures, limiting their use in scenarios requiring explicit coupling. In this paper, we address two key questions: Can a transportation plan be constructed between two probability measures using the sliced transport framework? If so, can this plan be used to define a metric between the measures? We propose a “lifting” operation to extend one-dimensional optimal transport plans back to the original space of the measures. By computing the expectation of these lifted plans, we derive a new transportation plan, termed expected sliced transport (EST) plans. We prove that using the EST plan to weight the sum of the individual Euclidean costs for moving from one point to another results in a valid metric between the input discrete probability measures. We demonstrate the connection between our approach and the recently proposed min-SWGG, along with illustrative numerical examples that support our theoretical findings.
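The sliced OT idea described in the abstract above can be sketched as a Monte Carlo estimate of the sliced Wasserstein distance between two equal-size point clouds with uniform weights. In one dimension the optimal coupling matches sorted points, which is what makes each slice cheap. This is not the paper's expected sliced transport plan; the function names, slice count, and point clouds are illustrative assumptions.

```python
import math
import random

def wasserstein_1d(xs, ys, p=2):
    """W_p^p between two uniform empirical measures of equal size: sort and match."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)

def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere via a normalized Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sliced_wasserstein(X, Y, n_slices=200, p=2, seed=0):
    """Average the 1D costs over random projections, then take the p-th root."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_slices):
        theta = random_direction(dim, rng)
        px = [sum(a * b for a, b in zip(x, theta)) for x in X]
        py = [sum(a * b for a, b in zip(y, theta)) for y in Y]
        total += wasserstein_1d(px, py, p)
    return (total / n_slices) ** (1.0 / p)

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[x0 + 3.0, x1 + 4.0] for x0, x1 in X]   # X translated by (3, 4)
print(sliced_wasserstein(X, Y))
```

The estimate is a single number. The per-slice matchings live on the projected points only, which is the gap the abstract describes when it says sliced approaches lack a transportation plan between the input measures.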

[LG-103] Reinforcement Learning with LTL and omega-Regular Objectives via Optimality-Preserving Translation to Average Rewards

Link: https://arxiv.org/abs/2410.12175
Authors: Xuan-Bach Le, Dominik Wagner, Leon Witzman, Alexander Rabinovich, Luke Ong
Keywords: Linear temporal logic, traditional discount sum, Linear temporal, temporal logic, reinforcement learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Linear temporal logic (LTL) and, more generally, $\omega$-regular objectives are alternatives to the traditional discount sum and average reward objectives in reinforcement learning (RL), offering the advantage of greater comprehensibility and hence explainability. In this work, we study the relationship between these objectives. Our main result is that each RL problem for $\omega$-regular objectives can be reduced to a limit-average reward problem in an optimality-preserving fashion, via (finite-memory) reward machines. Furthermore, we demonstrate the efficacy of this approach by showing that optimal policies for limit-average problems can be found asymptotically by solving a sequence of discount-sum problems approximately. Consequently, we resolve an open problem: optimal policies for LTL and $\omega$-regular objectives can be learned asymptotically.
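
The step from discount-sum problems to limit-average problems rests on a standard fact: for a periodic reward stream, the normalized discounted sum $(1-\gamma)\sum_t \gamma^t r_t$ approaches the long-run average as $\gamma \to 1$. The sketch below only illustrates that numerical fact on a fixed periodic reward sequence. It is not the paper's reduction or learning algorithm, and the function names are hypothetical.

```python
def normalized_discounted_value(cycle, gamma):
    """(1 - gamma) * sum_t gamma^t r_t for the infinite repetition of `cycle`."""
    period = len(cycle)
    one_cycle = sum(gamma ** k * r for k, r in enumerate(cycle))
    # Geometric series over repetitions: sum_m gamma^(m * period) = 1 / (1 - gamma^period).
    return (1.0 - gamma) * one_cycle / (1.0 - gamma ** period)


def limit_average(cycle):
    """Long-run average reward of the infinite repetition of `cycle`."""
    return sum(cycle) / len(cycle)


if __name__ == "__main__":
    cycle = [1.0, 0.0, 0.0]
    for gamma in (0.9, 0.99, 0.999):
        print(gamma, normalized_discounted_value(cycle, gamma), limit_average(cycle))
```

For the cycle `[1, 0, 0]` the normalized value equals `1 / (1 + gamma + gamma**2)`, which tends to the average `1/3` as `gamma` approaches 1.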

[LG-104] The State of Robot Motion Generation

Link: https://arxiv.org/abs/2410.12172
Authors: Kostas E. Bekris, Joe Doerr, Patrick Meng, Sumanth Tangirala
Keywords: generating robot motion, robot motion proposed, robotics research culminating, years of robotics, recent developments
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be presented at the International Symposium of Robotics Research (ISRR), 2024

Abstract:This paper reviews the large spectrum of methods for generating robot motion proposed over the 50 years of robotics research culminating in recent developments. It crosses the boundaries of methodologies, typically not surveyed together, from those that operate over explicit models to those that learn implicit ones. The paper discusses the current state-of-the-art as well as properties of varying methodologies, highlighting opportunities for integration.

[LG-105] COMET: Towards Partical W4A4KV4 LLMs Serving

Link: https://arxiv.org/abs/2410.12168
Authors: Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang
Keywords: widely-used compression technology, serving large language, cloud data centers, large language models, widely-used compression
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 14 pages, 12 figures

Abstract:Quantization is a widely-used compression technology to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or 4-bit weight-only quantization, achieve limited performance improvements due to poor support for low-precision (e.g., 4-bit) activation. This work, for the first time, realizes practical W4A4KV4 serving for LLMs, fully utilizing the INT4 tensor cores on modern GPUs and reducing the memory bottleneck caused by the KV cache. Specifically, we propose a novel fine-grained mixed-precision quantization algorithm (FMPQ) that compresses most activations into 4-bit with negligible accuracy loss. To support mixed-precision matrix multiplication for W4A4 and W4A8, we develop a highly optimized W4Ax kernel. Our approach introduces a novel mixed-precision data layout to facilitate access and fast dequantization for activation and weight tensors, utilizing the GPU’s software pipeline to hide the overhead of data loading and conversion. Additionally, we propose fine-grained streaming multiprocessor (SM) scheduling to achieve load balance across different SMs. We integrate the optimized W4Ax kernel into our inference framework, COMET, and provide efficient management to support popular LLMs such as LLaMA-3-70B. Extensive evaluations demonstrate that, when running LLaMA family models on a single A100-80G-SMX4, COMET achieves a kernel-level speedup of $2.88\times$ over cuBLAS and a $2.02\times$ throughput improvement compared to TensorRT-LLM from an end-to-end framework perspective.
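
To make the mixed-precision idea concrete, here is a minimal sketch of symmetric integer quantization with a per-block choice between 4-bit and 8-bit codes. The selection rule (fall back to 8 bits when the 4-bit round-trip error exceeds a tolerance) is an assumption made for illustration. The abstract does not say how FMPQ chooses precisions, and none of the names below come from COMET.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes with a single per-block scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    # Clamp to the representable signed range [-qmax - 1, qmax].
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def max_roundtrip_error(values, bits):
    """Largest absolute error after quantizing and dequantizing a block."""
    codes, scale = quantize_symmetric(values, bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(codes, scale)))


def choose_block_bits(activations, block_size, tolerance):
    """Hypothetical rule: use 4 bits unless the block's error exceeds `tolerance`."""
    bits_per_block = []
    for start in range(0, len(activations), block_size):
        block = activations[start:start + block_size]
        error = max_roundtrip_error(block, 4)
        bits_per_block.append(4 if error <= tolerance else 8)
    return bits_per_block
```

A block containing a large outlier forces a coarse 4-bit scale, so small values in the same block round to zero. Under this toy rule such blocks fall back to 8 bits.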

[LG-106] Reclaiming the Source of Programmatic Policies: Programmatic versus Latent Spaces ICLR2024

Link: https://arxiv.org/abs/2410.12166
Authors: Tales H. Carvalho, Kenneth Tjhia, Levi H. S. Lelis
Keywords: Markov decision processes, partially observable Markov, observable Markov decision, define programmatic policies, observable Markov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2024

Abstract:Recent works have introduced LEAPS and HPRL, systems that learn latent spaces of domain-specific languages, which are used to define programmatic policies for partially observable Markov decision processes (POMDPs). These systems induce a latent space while optimizing losses such as the behavior loss, which aim to achieve locality in program behavior, meaning that vectors close in the latent space should correspond to similarly behaving programs. In this paper, we show that the programmatic space, induced by the domain-specific language and requiring no training, presents values for the behavior loss similar to those observed in latent spaces presented in previous work. Moreover, algorithms searching in the programmatic space significantly outperform those in LEAPS and HPRL. To explain our results, we measured the “friendliness” of the two spaces to local search algorithms. We discovered that algorithms are more likely to stop at local maxima when searching in the latent space than when searching in the programmatic space. This implies that the optimization topology of the programmatic space, induced by the reward function in conjunction with the neighborhood function, is more conducive to search than that of the latent space. This result provides an explanation for the superior performance in the programmatic space.

[LG-107] Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning

Link: https://arxiv.org/abs/2410.12164
Authors: Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Dongmei Zhang, Surajit Chaudhuri
Keywords: self-trained fine-tuning paradigm, fine-tuning paradigm specifically, paradigm specifically designed, self-trained fine-tuning, specifically designed
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:In this work, we propose Table-LLM-Specialist, or Table-Specialist for short, as a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm, to iteratively generate-then-validate training data from language-models, to fine-tune stronger Table-Specialist models that can specialize in a given task, without requiring manually-labeled data. Our extensive evaluations suggest that our Table-Specialist has (1) strong performance on diverse table tasks over vanilla language-models – for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality, (2) lower cost to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency and inference cost, with comparable quality, and (3) better generalizability when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code and data will be available at this https URL.

[LG-108] When to Trust Your Data: Enhancing Dyna-Style Model-Based Reinforcement Learning With Data Filter

Link: https://arxiv.org/abs/2410.12160
Authors: Yansong Li, Zeyu Dong, Ertai Luo, Yu Wu, Shuo Wu, Shuo Han
Keywords: Reinforcement learning, data, OOD data filter, estimated model, simulated data
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Reinforcement learning (RL) algorithms can be divided into two classes: model-free algorithms, which are sample-inefficient, and model-based algorithms, which suffer from model bias. Dyna-style algorithms combine these two approaches by using simulated data from an estimated environmental model to accelerate model-free training. However, their efficiency is compromised when the estimated model is inaccurate. Previous works address this issue by using model ensembles or pretraining the estimated model with data collected from the real environment, increasing computational and sample complexity. To tackle this issue, we introduce an out-of-distribution (OOD) data filter that removes simulated data from the estimated model that significantly diverges from data collected in the real environment. We show theoretically that this technique enhances the quality of simulated data. With the help of the OOD data filter, the data simulated from the estimated model better mimics the data collected by interacting with the real model. This improvement is evident in the critic updates compared to using the simulated data without the OOD data filter. Our experiment integrates the data filter into the model-based policy optimization (MBPO) algorithm. The results demonstrate that our method requires fewer interactions with the real environment to achieve a higher level of optimality than MBPO, even without a model ensemble.
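
The filtering step can be pictured with a small sketch. The criterion used here (drop a simulated sample when its Euclidean distance to the nearest real sample exceeds a threshold) is an assumption chosen for illustration. The abstract does not state the actual filter rule, and `filter_simulated` is a hypothetical name.

```python
import math


def filter_simulated(real, simulated, threshold):
    """Keep simulated samples that lie within `threshold` of some real sample.

    `real` and `simulated` are lists of equal-length numeric tuples, for
    example flattened (state, action, next_state) transitions.
    """
    kept = []
    for sample in simulated:
        nearest = min(math.dist(sample, r) for r in real)
        if nearest <= threshold:
            kept.append(sample)
    return kept
```

In a Dyna-style loop, a filter of this kind would sit between the learned model's rollouts and the replay buffer used for the model-free update, so that only simulated data close to real experience reaches the critic.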

[LG-109] NSSI-Net: Multi-Concept Generative Adversarial Network for Non-Suicidal Self-Injury Detection Using High-Dimensional EEG Signals in a Semi-Supervised Learning Framework

Link: https://arxiv.org/abs/2410.12159
Authors: Zhen Liang, Weishan Ye, Qile Liu, Li Zhang, Gan Huang, Yongjie Zhou
Keywords: widespread public concern, attracting widespread public, Non-suicidal self-injury, significantly increasing, public concern
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Non-suicidal self-injury (NSSI) is a serious threat to the physical and mental health of adolescents, significantly increasing the risk of suicide and attracting widespread public concern. Electroencephalography (EEG), as an objective tool for identifying brain disorders, holds great promise. However, extracting meaningful and reliable features from high-dimensional EEG data, especially by integrating spatiotemporal brain dynamics into informative representations, remains a major challenge. In this study, we introduce an advanced semi-supervised adversarial network, NSSI-Net, to effectively model EEG features related to NSSI. NSSI-Net consists of two key modules: a spatial-temporal feature extraction module and a multi-concept discriminator. In the spatial-temporal feature extraction module, an integrated 2D convolutional neural network (2D-CNN) and a bi-directional Gated Recurrent Unit (BiGRU) are used to capture both spatial and temporal dynamics in EEG data. In the multi-concept discriminator, signal, gender, domain, and disease levels are fully explored to extract meaningful EEG features, considering individual, demographic, and disease variations across a diverse population. Based on self-collected NSSI data (n=114), the model’s effectiveness and reliability are demonstrated, with a 7.44% improvement in performance compared to existing machine learning and deep learning methods. This study advances the understanding and early diagnosis of NSSI in adolescents with depression, enabling timely intervention. The source code is available at this https URL.

[LG-110] FragNet: A Graph Neural Network for Molecular Property Prediction with Four Layers of Interpretability

Link: https://arxiv.org/abs/2410.12156
Authors: Gihan Panapitiya, Peiyuan Gao, C Mark Maupin, Emily G Saldanha
Keywords: storage material design, applications including drug, including drug discovery, energy storage material, modern-day scientific applications
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)

Abstract:Molecular property prediction is a crucial step in many modern-day scientific applications including drug discovery and energy storage material design. Despite the availability of numerous machine learning models for this task, we are lacking in models that provide both high accuracies and interpretability of the predictions. We introduce the FragNet architecture, a graph neural network not only capable of achieving prediction accuracies comparable to the current state-of-the-art models, but also able to provide insight on four levels of molecular substructures. This model enables understanding of which atoms, bonds, molecular fragments, and molecular fragment connections are critical in the prediction of a given molecular property. The ability to interpret the importance of connections between fragments is of particular interest for molecules which have substructures that are not connected with regular covalent bonds. The interpretable capabilities of FragNet are key to gaining scientific insights from the model’s learned patterns between molecular structure and molecular properties.

[LG-111] Preference Optimization with Multi-Sample Comparisons

Link: https://arxiv.org/abs/2410.12138
Authors: Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang
Keywords: Recent advancements, large language models, large language, driven by extensive, extensive pretraining
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: preprint

Abstract:Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.

[LG-112] Parametric Graph Representations in the Era of Foundation Models: A Survey and Position

Link: https://arxiv.org/abs/2410.12126
Authors: Dongqi Fu, Liri Fang, Zihao Li, Hanghang Tong, Vetle I. Torvik, Jingrui He
Keywords: comprehensive relational data, model comprehensive relational, graph laws, graph, past decades
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Preprint, 15 pages

Abstract:Graphs have been widely used in the past decades of big data and AI to model comprehensive relational data. When analyzing a graph’s statistical properties, graph laws serve as essential tools for parameterizing its structure. Identifying meaningful graph laws can significantly enhance the effectiveness of various applications, such as graph generation and link prediction. Facing the large-scale foundation model developments nowadays, the study of graph laws reveals new research potential, e.g., providing multi-modal information for graph neural representation learning and breaking the domain inconsistency of different graph data. In this survey, we first review the previous study of graph laws from multiple perspectives, i.e., macroscope and microscope of graphs, low-order and high-order graphs, static and dynamic graphs, different observation spaces, and newly proposed graph parameters. After we review various real-world applications benefiting from the guidance of graph laws, we conclude the paper with current challenges and future research directions.

[LG-113] Scaling laws for post-training quantized large language models

Link: https://arxiv.org/abs/2410.12119
Authors: Zifei Xu, Alexander Lan, Wanzin Yazar, Tristan Webb, Sayeh Sharify, Xin Wang
Keywords: well-trained large language, Generalization abilities, large language models, abilities of well-trained, well-trained large
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Generalization abilities of well-trained large language models (LLMs) are known to scale predictably as a function of model size. In contrast to the existence of practical scaling laws governing pre-training, the quality of LLMs after post-training compression remains highly unpredictable, often requiring case-by-case validation in practice. In this work, we attempted to close this gap for post-training weight quantization of LLMs by conducting a systematic empirical study on multiple LLM families quantized to numerous low-precision tensor data types using popular weight quantization techniques. We identified key scaling factors pertaining to characteristics of the local loss landscape, based on which the performance of quantized LLMs can be reasonably well predicted by a statistical model.

[LG-114] To Err is AI: A Case Study Informing LLM Flaw Reporting Practices

Link: https://arxiv.org/abs/2410.12104
Authors: Sean McGregor, Allyson Ettinger, Nick Judd, Paul Albee, Liwei Jiang, Kavel Rao, Will Smith, Shayne Longpre, Avijit Ghosh, Christopher Fiorelli, Michelle Hoang, Sven Cattell, Nouha Dziri
Keywords: hackers generated evaluations, Open Language Model, open-ended bug bounty, bug bounty targeting, Allen Institute
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 8 pages, 5 figures

Abstract:In August of 2024, 495 hackers generated evaluations in an open-ended bug bounty targeting the Open Language Model (OLMo) from The Allen Institute for AI. A vendor panel staffed by representatives of OLMo’s safety program adjudicated changes to OLMo’s documentation and awarded cash bounties to participants who successfully demonstrated a need for public disclosure clarifying the intent, capacities, and hazards of model deployment. This paper presents a collection of lessons learned, illustrative of flaw reporting best practices intended to reduce the likelihood of incidents and produce safer large language models (LLMs). These include best practices for safety reporting processes, their artifacts, and safety program staffing.

[LG-115] The Persian Rug: solving toy models of superposition using large-scale symmetries

Link: https://arxiv.org/abs/2410.12101
Authors: Aditya Cowsik, Kfir Dolev, Alex Infanger
Keywords: complete mechanistic description, minimal non-linear sparse, non-linear sparse data, large input dimension, compresses sparse data
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)

Abstract:We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in arXiv:2209.10652, compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged) large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model’s performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results can be found at this https URL .
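
The architecture described above (a linear compression followed by a linear decompression and a ReLU) can be written as a short forward pass. This is a minimal sketch with hand-written matrix products. The bias term and all names are assumptions for illustration, and nothing here reproduces the paper's training setup or its analysis.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Elementwise max(0, x)."""
    return [max(0.0, v) for v in vector]


def autoencoder_forward(w_in, w_out, bias, x):
    """Compress x with w_in, then reconstruct with w_out, a bias, and a ReLU.

    w_in has shape (hidden, inputs) and w_out has shape (inputs, hidden).
    The bias is an assumption of this sketch.
    """
    hidden = matvec(w_in, x)
    pre_activation = [p + b for p, b in zip(matvec(w_out, hidden), bias)]
    return relu(pre_activation)
```

With fewer hidden units than inputs, several sparse features must share hidden directions. The ReLU and a negative bias can then suppress the small interference terms in the reconstruction.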

[LG-116] Bridging Large Language Models and Graph Structure Learning Models for Robust Representation Learning

Link: https://arxiv.org/abs/2410.12096
Authors: Guangxin Su, Yifan Zhu, Wenjie Zhang, Hanchen Wang, Ying Zhang
Keywords: encounters pervasive noise, graph structure learning, graph structure, Graph representation learning, node features
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Graph structure learning, Graph representation learning, Large language models, Graph neural networks

Abstract:Graph representation learning, involving both node features and graph structures, is crucial for real-world applications but often encounters pervasive noise. State-of-the-art methods typically address noise by focusing separately on node features with large language models (LLMs) and on graph structures with graph structure learning models (GSLMs). In this paper, we introduce LangGSL, a robust framework that integrates the complementary strengths of pre-trained language models and GSLMs to jointly enhance both node feature and graph structure learning. In LangGSL, we first leverage LLMs to filter noise in the raw data and extract valuable cleaned information as features, enhancing the synergy of downstream models. During the mutual learning phase in LangGSL, the core idea is to leverage the relatively small language model (LM) to process local attributes and generate reliable pseudo-labels and informative node embeddings, which are then integrated into the GSLM’s prediction phase. This approach enriches the global context and enhances overall performance. Meanwhile, GSLM refines the evolving graph structure constructed from the LM’s output, offering updated labels back to the LM as additional guidance, thus facilitating a more effective mutual learning process. The LM and GSLM work synergistically, complementing each other’s strengths and offsetting weaknesses within a variational information-maximizing framework, resulting in enhanced node features and a more robust graph structure. Extensive experiments on diverse graph datasets of varying scales and across different task scenarios demonstrate the scalability and effectiveness of the proposed approach.

[LG-117] Comparative Performance of Collaborative Bandit Algorithms: Effect of Sparsity and Exploration Intensity

Link: https://arxiv.org/abs/2410.12086
Authors: Eren Ozbay
Keywords: paper offers, offers a comprehensive, comprehensive analysis, collaborative bandit algorithms, Collaborative bandits aim
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 6 figures

Abstract:This paper offers a comprehensive analysis of collaborative bandit algorithms and provides a thorough comparison of their performance. Collaborative bandits aim to improve the performance of contextual bandits by introducing relationships between arms (or items), allowing effective propagation of information. Collaboration among arms allows the feedback obtained through a single user (item) to be shared across related users (items). Introducing collaboration also alleviates the cold user (item) problem, i.e., the lack of historical information when a new user (item) arrives on the platform with no prior record of interactions. In the context of modeling the relationships between arms (items), there are two main approaches: hard and soft clustering. We refer to approaches that model the relationship between arms in an absolute manner as hard clustering, i.e., the relationship is binary. Soft clustering relaxes membership constraints, allowing fuzzy assignment. Focusing on the latter, we provide extensive experiments on the state-of-the-art collaborative contextual bandit algorithms and investigate the effect of sparsity and how the exploration intensity acts as a correction mechanism. Our numerical experiments demonstrate that controlling for sparsity in collaboration improves data efficiency and performance as it better informs learning. Meanwhile, increasing the exploration intensity acts as a correction because it effectively reduces variance due to potentially misspecified relationships among users. We observe that this misspecification is further remedied by introducing latent factors, and thus, increasing the dimensionality of the bandit parameters.

[LG-118] Learning to rumble: Automated elephant call classification detection and endpointing using deep architectures

Link: https://arxiv.org/abs/2410.12082
Authors: Christiaan M. Geldenhuys, Thomas R. Niesler
Keywords: continuously recorded audio, problem of detecting, isolating and classifying, continuously recorded, call
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Quantitative Methods (q-bio.QM)

Abstract:We consider the problem of detecting, isolating and classifying elephant calls in continuously recorded audio. Such automatic call characterisation can assist conservation efforts and inform environmental management strategies. In contrast to previous work in which call detection was performed at a segment level, we perform call detection at a frame level which implicitly also allows call endpointing, the isolation of a call in a longer recording. For experimentation, we employ two annotated datasets, one containing Asian and the other African elephant vocalisations. We evaluate several shallow and deep classifier models, and show that the current best performance can be improved by using an audio spectrogram transformer (AST), a neural architecture which has not been used for this purpose before, and which we have configured in a novel sequence-to-sequence manner. We also show that using transfer learning by pre-training leads to further improvements both in terms of computational complexity and performance. Finally, we consider sub-call classification using an accepted taxonomy of call types, a task which has not previously been considered. We show that also in this case the transformer architectures provide the best performance. Our best classifiers achieve an average precision (AP) of 0.962 for framewise binary call classification, and an area under the receiver operating characteristic (AUC) of 0.957 and 0.979 for call classification with 5 classes and sub-call classification with 7 classes respectively. All of these represent either new benchmarks (sub-call classifications) or improvements on previously best systems. We conclude that a fully-automated elephant call detection and subcall classification system is within reach. Such a system would provide valuable information on the behaviour and state of elephant herds for the purposes of conservation and management.
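
Frame-level detection yields endpoints almost for free: once every frame carries a call probability, contiguous runs of active frames give the start and end of each call. The sketch below shows that post-processing step only. The threshold, the frame duration, and the function name are assumptions, and the classifier that produces the probabilities is out of scope.

```python
def frames_to_segments(frame_probs, threshold=0.5, frame_seconds=0.01):
    """Turn per-frame call probabilities into (start, end) times in seconds.

    A frame is active when its probability is at least `threshold`.
    Each maximal run of active frames becomes one segment.
    """
    segments = []
    start = None
    for index, prob in enumerate(frame_probs):
        active = prob >= threshold
        if active and start is None:
            start = index  # a new call begins at this frame
        elif not active and start is not None:
            segments.append((start * frame_seconds, index * frame_seconds))
            start = None
    if start is not None:
        # The recording ended while a call was still active.
        segments.append((start * frame_seconds, len(frame_probs) * frame_seconds))
    return segments
```

A practical system would likely smooth the probabilities or merge segments separated by very short gaps. That refinement is omitted here.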

[LG-119] Taking off the Rose-Tinted Glasses: A Critical Look at Adversarial ML Through the Lens of Evasion Attacks

Link: https://arxiv.org/abs/2410.12076
Authors: Kevin Eykholt, Farhan Ahmed, Pratik Vaishnavi, Amir Rahmati
Keywords: garnered significant interest, garnered significant, significant interest, machine learning, machine learning models
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:The vulnerability of machine learning models in adversarial scenarios has garnered significant interest in the academic community over the past decade, resulting in a myriad of attacks and defenses. However, while the community appears to be overtly successful in devising new attacks across new contexts, the development of defenses has stalled. After a decade of research, we appear no closer to securing AI applications beyond additional training. Despite a lack of effective mitigations, AI development and its incorporation into existing systems charge full speed ahead with the rise of generative AI and large language models. Will our ineffectiveness in developing solutions to adversarial threats further extend to these new technologies? In this paper, we argue that overly permissive attack and overly restrictive defensive threat models have hampered defense development in the ML domain. Through the lens of adversarial evasion attacks against neural networks, we critically examine common attack assumptions, such as the ability to bypass any defense not explicitly built into the model. We argue that these flawed assumptions, seen as reasonable by the community based on paper acceptance, have encouraged the development of adversarial attacks that map poorly to real-world scenarios. In turn, new defenses evaluated against these very attacks are inadvertently required to be almost perfect and incorporated as part of the model. But do they need to? In practice, machine learning models are deployed as a small component of a larger system. We analyze adversarial machine learning from a system security perspective rather than an AI perspective and its implications for emerging AI paradigms.

[LG-120] Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products

Link: https://arxiv.org/abs/2410.12071
Authors: Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, Christian Bird
Keywords: Large Language Models, enhancing user experiences, Large Language, Language Models, time introducing numerous
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 10 pages, 2 tables

Abstract:Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences, but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the encountered challenges. Leveraging mixed-method research, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.

[LG-121] LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text

Link: https://arxiv.org/abs/2410.12064
Authors: Ben Hagag, Liav Harpaz, Gil Semo, Dor Bernsohn, Rohit Saha, Pashootan Vaezipoor, Kyryl Truskovskyi, Gerasimos Spanakis
Keywords: LegalLens Shared Task, detecting legal violations, identifying legal violation, legal violation entities, relevant legal contexts
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.

[LG-122] MFC-EQ: Mean-Field Control with Envelope Q-Learning for Moving Decentralized Agents in Formation IROS2024

Link: https://arxiv.org/abs/2410.12062
Authors: Qiushi Lin, Hang Ma
Keywords: Path Finding aiming, Multi-Agent Path Finding, plan collision-free paths, Path Finding, version of Moving
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted to IROS 2024

Abstract: We study a decentralized version of Moving Agents in Formation (MAiF), a variant of Multi-Agent Path Finding aiming to plan collision-free paths for multiple agents with the dual objectives of reaching their goals quickly while maintaining a desired formation. The agents must balance these objectives under conditions of partial observation and limited communication. The formation maintenance depends on the joint state of all agents, whose dimensionality increases exponentially with the number of agents, rendering the learning process intractable. Additionally, learning a single policy that can accommodate different linear preferences for these two objectives presents a significant challenge. In this paper, we propose Mean-Field Control with Envelope Q-learning (MFC-EQ), a scalable and adaptable learning framework for this bi-objective multi-agent problem. We approximate the dynamics of all agents using mean-field theory while learning a universal preference-agnostic policy through envelope Q-learning. Our empirical evaluation of MFC-EQ across numerous instances shows that it outperforms state-of-the-art centralized MAiF baselines. Furthermore, MFC-EQ effectively handles more complex scenarios where the desired formation changes dynamically, a challenge that existing MAiF planners cannot address.

[LG-123] Testing Causal Explanations: A Case Study for Understanding the Effect of Interventions on Chronic Kidney Disease

Link: https://arxiv.org/abs/2410.12047
Authors: Panayiotis Petousis, David Gordon, Susanne B. Nicholas, Alex A. T. Bui (on behalf of CURE-CKD)
Keywords: Randomized controlled trials, controlled trials, standard for evaluating, evaluating the effectiveness, effectiveness of clinical
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract: Randomized controlled trials (RCTs) are the standard for evaluating the effectiveness of clinical interventions. To address the limitations of RCTs on real-world populations, we developed a methodology that uses a large observational electronic health record (EHR) dataset. Principles of regression discontinuity (RD) were used to derive randomized data subsets to test expert-driven interventions using dynamic Bayesian Networks (DBNs) do-operations. This combined method was applied to a chronic kidney disease (CKD) cohort of more than two million individuals and used to understand the associational and causal relationships of CKD variables with respect to a surrogate outcome of ≥40% decline in estimated glomerular filtration rate (eGFR). The associational and causal analyses depicted similar findings across DBNs from two independent healthcare systems. The associational analysis showed that the most influential variables were eGFR, urine albumin-to-creatinine ratio, and pulse pressure, whereas the causal analysis showed eGFR as the most influential variable, followed by modifiable factors such as medications that may impact kidney function over time. This methodology demonstrates how real-world EHR data can be used to provide population-level insights to inform improved healthcare delivery.

[LG-124] Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings

Link: https://arxiv.org/abs/2410.12046
Authors: Petr Tsvetkov, Aleksandra Eliseeva, Danny Dig, Alexander Bezzubov, Yaroslav Golubev, Timofey Bryksin, Yaroslav Zharov
Keywords: CMG system, CMG, Commit message generation, evaluate correctly, crucial task
Categories: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Abstract: Commit message generation (CMG) is a crucial task in software engineering that is challenging to evaluate correctly. When a CMG system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric (the number of edits users introduce before committing the generated messages to the VCS) to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience. Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers operating within controlled research environments. We release all the code and the dataset for researchers.
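The abstract above reports that edit distance correlates best with the online metric. As a rough illustration only, the following sketch computes a character-level Levenshtein distance between a generated message and its edited version. The choice of character-level granularity and the example strings are assumptions made here, not details taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: a generated message and a user-edited version.
generated = "Fix bug"
edited = "Fix login bug"
print(edit_distance(generated, edited))  # 6, the inserted "login "
```

A token-level variant would follow the same recurrence over lists of words instead of characters.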

[LG-125] Differential Privacy on Trust Graphs

Link: https://arxiv.org/abs/2410.12045
Authors: Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Serena Wang
Keywords: study differential privacy, differential privacy, party trusts, party, study differential
Categories: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:We study differential privacy (DP) in a multi-party setting where each party only trusts a (known) subset of the other parties with its data. Specifically, given a trust graph where vertices correspond to parties and neighbors are mutually trusting, we give a DP algorithm for aggregation with a much better privacy-utility trade-off than in the well-studied local model of DP (where each party trusts no other party). We further study a robust variant where each party trusts all but an unknown subset of at most t of its neighbors (where t is a given parameter), and give an algorithm for this setting. We complement our algorithms with lower bounds, and discuss implications of our work to other tasks in private learning and analytics.
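For context on the privacy-utility trade-off the abstract mentions, the sketch below contrasts two standard baselines for summing values in [0, 1] with the Laplace mechanism: the local model, where every party adds its own noise, and a fully trusted central aggregator, where noise is added once. This is not the paper's trust-graph algorithm; the function names are hypothetical and the code only illustrates the two extremes that a trust graph sits between.

```python
import numpy as np


def noisy_sum_local(values, epsilon, rng):
    """Local model: each party perturbs its own value in [0, 1] before sharing."""
    noise = rng.laplace(0.0, 1.0 / epsilon, size=len(values))
    return float(np.sum(values + noise))


def noisy_sum_central(values, epsilon, rng):
    """Central model: one trusted aggregator adds a single noise term to the sum."""
    return float(np.sum(values) + rng.laplace(0.0, 1.0 / epsilon))


def laplace_noise_variance(num_noise_terms, epsilon):
    """Variance of the summed noise: each Laplace(1/epsilon) term contributes 2/epsilon^2."""
    return num_noise_terms * 2.0 / epsilon**2


rng = np.random.default_rng(0)
values = np.full(100, 0.5)  # 100 parties, each holding the value 0.5
print(noisy_sum_local(values, 1.0, rng), noisy_sum_central(values, 1.0, rng))
print(laplace_noise_variance(100, 1.0), laplace_noise_variance(1, 1.0))
```

With n parties, the local model's noise variance is n times that of the central model, which is the gap that partial trust between parties can narrow.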

[LG-126] A Survey on Deep Tabular Learning

Link: https://arxiv.org/abs/2410.12034
Authors: Shriyank Somvanshi, Subasish Das, Syed Aaqib Javed, Gian Antariksa, Ahmed Hossain
Keywords: presents unique challenges, deep learning due, Tabular data, deep learning models, industries like healthcare
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 43 pages, 18 figures, 3 tables

Abstract:Tabular data, widely used in industries like healthcare, finance, and transportation, presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure. This survey reviews the evolution of deep learning models for tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet. These models incorporate attention mechanisms, feature embeddings, and hybrid architectures to address tabular data complexities. TabNet uses sequential attention for instance-wise feature selection, improving interpretability, while SAINT combines self-attention and intersample attention to capture complex interactions across features and data points, both advancing scalability and reducing computational overhead. Hybrid architectures such as TabTransformer and FT-Transformer integrate attention mechanisms with multi-layer perceptrons (MLPs) to handle categorical and numerical data, with FT-Transformer adapting transformers for tabular datasets. Research continues to balance performance and efficiency for large datasets. Graph-based models like GNN4TDL and GANDALF combine neural networks with decision trees or graph structures, enhancing feature representation and mitigating overfitting in small datasets through advanced regularization techniques. Diffusion-based models like the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) generate synthetic data to address data scarcity, improving model robustness. Similarly, models like TabPFN and Ptab leverage pre-trained language models, incorporating transfer learning and self-supervised techniques into tabular tasks. This survey highlights key advancements and outlines future research directions on scalability, generalization, and interpretability in diverse tabular data applications.

[LG-127] MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI

Link: https://arxiv.org/abs/2410.12032
Authors: Arya Tschand (1), Arun Tejusve Raghunath Rajan (2), Sachin Idgunji (3), Anirban Ghosh (3), Jeremy Holleman (4), Csaba Kiraly (5), Pawan Ambalkar (6), Ritika Borkar (3), Ramesh Chukka (7), Trevor Cockrell (6), Oliver Curtis (8), Grigori Fursin (9), Miro Hodak (10), Hiwot Kassa (11), Anton Lokhmotov (12), Dejan Miskovic (3), Yuechao Pan (13), Manu Prasad Manmathan (7), Liz Raymond (6), Tom St. John (14), Arjun Suresh (15), Rowan Taubitz (8), Sean Zhan (8), Scott Wasson (16), David Kanter (16), Vijay Janapa Reddi (1) ((1) Harvard University, (2) Self / Meta, (3) NVIDIA, (4) UNC Charlotte / Syntiant, (5) Codex, (6) Dell, (7) Intel, (8) SMC, (9) FlexAI / cTuning, (10) AMD, (11) Meta, (12) KRAI, (13) Google, (14) Decompute, (15) GATE Overflow, (16) MLCommons)
Keywords: massive datacenter clusters, Rapid adoption, tiny IoT devices, energy efficiency, machine learning
Categories: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 14 pages, 11 figures, 1 table

Abstract:Rapid adoption of machine learning (ML) technologies has led to a surge in power consumption across diverse systems, from tiny IoT devices to massive datacenter clusters. Benchmarking the energy efficiency of these systems is crucial for optimization, but presents novel challenges due to the variety of hardware platforms, workload characteristics, and system-level interactions. This paper introduces MLPerf Power, a comprehensive benchmarking methodology with capabilities to evaluate the energy efficiency of ML systems at power levels ranging from microwatts to megawatts. Developed by a consortium of industry professionals from more than 20 organizations, MLPerf Power establishes rules and best practices to ensure comparability across diverse architectures. We use representative workloads from the MLPerf benchmark suite to collect 1,841 reproducible measurements from 60 systems across the entire range of ML deployment scales. Our analysis reveals trade-offs between performance, complexity, and energy efficiency across this wide range of systems, providing actionable insights for designing optimized ML solutions from the smallest edge devices to the largest cloud infrastructures. This work emphasizes the importance of energy efficiency as a key metric in the evaluation and comparison of the ML system, laying the foundation for future research in this critical area. We discuss the implications for developing sustainable AI solutions and standardizing energy efficiency benchmarking for ML systems.

[LG-128] EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation

Link: https://arxiv.org/abs/2410.12028
Authors: Mithun Manivannan (1), Vignesh Nethrapalli (1), Mark Cartwright (1) ((1) New Jersey Institute of Technology)
Keywords: Recent progress, audio-language modeling, progress in audio-language, benefited from training, aid of large-language
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:Recent progress in audio-language modeling, such as automated audio captioning, has benefited from training on synthetic data generated with the aid of large-language models. However, such approaches for environmental sound captioning have primarily focused on audio event tags and have not explored leveraging emotional information that may be present in recordings. In this work, we explore the benefit of generating emotion-augmented synthetic audio caption data by instructing ChatGPT with additional acoustic information in the form of estimated soundscape emotion. To do so, we introduce EmotionCaps, an audio captioning dataset comprised of approximately 120,000 audio clips with paired synthetic descriptions enriched with soundscape emotion recognition (SER) information. We hypothesize that this additional information will result in higher-quality captions that match the emotional tone of the audio recording, which will, in turn, improve the performance of captioning models trained with this data. We test this hypothesis through both objective and subjective evaluation, comparing models trained with the EmotionCaps dataset to multiple baseline models. Our findings challenge current approaches to captioning and suggest new directions for developing and assessing captioning models.

[LG-129] Geometric Inductive Biases of Deep Networks: The Role of Data and Architecture

Link: https://arxiv.org/abs/2410.12025
Authors: Sajad Movahedi, Antonio Orvieto, Seyed-Mohsen Moosavi-Dezfooli
Keywords: geometric invariance hypothesis, curvature remains invariant, average geometry, input space, average geometry evolution
Categories: Machine Learning (cs.LG)

Abstract: In this paper, we propose the geometric invariance hypothesis (GIH), which argues that when training a neural network, the input space curvature remains invariant under transformation in certain directions determined by its architecture. Starting with a simple non-linear binary classification problem residing on a plane in a high dimensional space, we observe that while an MLP can solve this problem regardless of the orientation of the plane, this is not the case for a ResNet. Motivated by this example, we define two maps that provide a compact architecture-dependent summary of the input space geometry of a neural network and its evolution during training, which we dub the average geometry and average geometry evolution, respectively. By investigating average geometry evolution at initialization, we discover that the geometry of a neural network evolves according to the projection of data covariance onto average geometry. As a result, in cases where the average geometry is low-rank (such as in a ResNet), the geometry only changes in a subset of the input space. This causes an architecture-dependent invariance property in input-space curvature, which we dub GIH. Finally, we present extensive experimental results to observe the consequences of GIH and how it relates to generalization in neural networks.

[LG-130] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router

Link: https://arxiv.org/abs/2410.12013
Authors: Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu
Keywords: architectures face challenges, high memory consumption, architectures face, redundancy in experts, face challenges
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. Pruning MoE can reduce network weights while maintaining model performance. Motivated by the recent observation of emergent large magnitude features in Large Language Models (LLM) and MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights, on each output neuron. Our pruning method is one-shot, requiring no retraining or weight updates. We evaluate our method on Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks. Experimental results show that our pruning method significantly outperforms state-of-the-art LLM pruning methods. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance post-pruning. Experimental results demonstrate that the Mixtral-8x7B model with 50% sparsity maintains 99% of the performance of the original model after the expert-wise knowledge distillation.
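The pruning criterion described above (weight magnitude multiplied by input activations and router weights, compared within each output neuron) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the use of an L2 norm over calibration tokens, and the way the router weight scales each token's activations are all choices made here for the sake of a runnable example.

```python
import numpy as np


def prune_expert_weights(W, X, gate, sparsity=0.5):
    """One-shot pruning sketch for a single expert's linear layer.

    W:    (out_features, in_features) weight matrix
    X:    (tokens, in_features) calibration inputs routed to this expert
    gate: (tokens,) router weight assigned to this expert for each token
    """
    # Assumed importance of each input feature: norm of gate-scaled activations.
    act_norm = np.linalg.norm(X * gate[:, None], axis=0)  # (in_features,)
    score = np.abs(W) * act_norm[None, :]                  # (out, in)

    k = int(W.shape[1] * sparsity)  # weights to remove per output neuron
    if k == 0:
        return W.copy()
    # Indices of the k lowest-scoring weights in each row (output neuron).
    drop = np.argsort(score, axis=1)[:, :k]
    pruned = W.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
gate = rng.uniform(0.1, 1.0, size=16)
print((prune_expert_weights(W, X, gate) == 0).sum(axis=1))  # zeros per row
```

Comparing scores within each row, rather than across the whole matrix, keeps the same sparsity for every output neuron, and no weights are updated after pruning.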

[LG-131] Bias Similarity Across Large Language Models

Link: https://arxiv.org/abs/2410.12010
Authors: Hyejun Jeong, Shiqing Ma, Amir Houmansadr
Keywords: machine learning models, models influence decision-making, Large Language Models, chronic problem, human society
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review

Abstract: Bias in machine learning models has been a chronic problem, especially as these models influence decision-making in human society. In generative AI, such as Large Language Models, the impact of bias is even more profound compared to the classification models. LLMs produce realistic and human-like content that users may unconsciously trust, which could perpetuate harmful stereotypes to the uncontrolled public. It becomes particularly concerning when utilized in journalism or education. While prior studies have explored and quantified bias in individual AI models, no work has yet compared bias similarity across different LLMs. To fill this gap, we take a comprehensive look at ten open- and closed-source LLMs from four model families, assessing the extent of biases through output distribution. Using two datasets (one containing 4k questions and another with one million questions for each of the four bias dimensions), we measure functional similarity to understand how biases manifest across models. Our findings reveal that 1) fine-tuning does not significantly alter output distributions, which would limit its ability to mitigate bias, 2) LLMs within the same family tree do not produce similar output distributions, implying that addressing bias in one model could have limited implications for others in the same family, and 3) there is a possible risk of training data information leakage, raising concerns about privacy and data security. Our analysis provides insight into LLM behavior and highlights potential risks in real-world deployment.

[LG-132] Beyond Labels: A Self-Supervised Framework with Masked Autoencoders and Random Cropping for Breast Cancer Subtype Classification

Link: https://arxiv.org/abs/2410.12006
Authors: Annalisa Chiocchetti, Marco Dossena, Christopher Irwin, Luigi Portinale
Keywords: work contributes, contributes to breast, breast cancer sub-type, histopathological images, breast cancer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This work contributes to breast cancer sub-type classification using histopathological images. We utilize masked autoencoders (MAEs) to learn a self-supervised embedding tailored for computer vision tasks in this domain. This embedding captures informative representations of histopathological data, facilitating feature learning without extensive labeled datasets. During pre-training, we investigate employing a random crop technique to generate a large dataset from WSIs automatically. Additionally, we assess the performance of linear probes for multi-class classification tasks of cancer sub-types using the representations learnt by the MAE. Our approach aims to achieve strong performance on downstream tasks by leveraging the complementary strengths of ViTs and autoencoders. We evaluate our model’s performance on the BRACS dataset and compare it with existing benchmarks.

[LG-133] From promise to practice: realizing high-performance decentralized training

Link: https://arxiv.org/abs/2410.11998
Authors: Zesen Wang, Jiaojiao Zhang, Xuyang Wu, Mikael Johansson
Keywords: deep neural networks, attracted significant attention, theoretically superior scalability, synchronous data-parallel methods, deep neural
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce. However, realizing this potential in multi-node training is challenging due to the complex design space that involves communication topologies, computation patterns, and optimization algorithms. This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes. Furthermore, to support the decentralized training of transformer-based models, we study a decentralized Adam algorithm that allows for overlapping communications and computations, prove its convergence, and propose an accumulation technique to mitigate the high variance caused by small local batch sizes. We deploy the proposed approach in clusters with up to 64 GPUs and demonstrate its practicality and advantages in both runtime and generalization performance under a fixed iteration budget.

[LG-134] DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models NEURIPS2024

Link: https://arxiv.org/abs/2410.11988
Authors: Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, achieved remarkable success
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024

Abstract:Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising solution to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures at the cost of limiting flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers several benefits. Firstly, our method enables different blocks to utilize different subsets of the feature maps. Secondly, by removing structural dependence, we facilitate each block to possess varying widths along its input and output dimensions, thereby significantly enhancing the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

[LG-135] Age-of-Gradient Updates for Federated Learning over Random Access Channels

Link: https://arxiv.org/abs/2410.11986
Authors: Yu Heng Wu, Houman Asgari, Stefano Rini, Andrea Munari
Keywords: deep neural network, random access channel, wireless networks, neural network, computer networks
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract: This paper studies the problem of federated training of a deep neural network (DNN) over a random access channel (RACH) such as in computer networks, wireless networks, and cellular systems. More precisely, a set of remote users participate in training a centralized DNN model using SGD under the coordination of a parameter server (PS). The local model updates are transmitted from the remote users to the PS over a RACH using a slotted ALOHA protocol. The PS collects the updates from the remote users, accumulates them, and sends central model updates to the users at regular time intervals. We refer to this setting as the RACH-FL setting. The RACH-FL setting crucially addresses the problem of jointly designing (i) a client selection strategy and (ii) a gradient compression strategy that address the communication constraints between the remote users and the PS when transmission occurs over a RACH. For the RACH-FL setting, we propose a policy, which we term the "age-of-gradient" (AoG) policy, in which (i) gradient sparsification is performed using top-K sparsification, (ii) the error correction is performed using memory accumulation, and (iii) the slot transmission probability is obtained by comparing the current local memory magnitude minus the magnitude of the gradient update to a threshold. Intuitively, the AoG measure of "freshness" of the memory state is reminiscent of the concept of age-of-information (AoI) in the context of communication theory and provides a rather natural interpretation of this policy. Numerical simulations show the superior performance of the AoG policy as compared to other RACH-FL policies.
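Two of the ingredients named in the abstract, top-K sparsification and error correction through memory accumulation, are standard and can be sketched as below. The function name is hypothetical, and the slot transmission probability rule of the AoG policy is deliberately left out because the abstract does not give enough detail to reproduce it faithfully.

```python
import numpy as np


def top_k_with_memory(gradient, memory, k):
    """Send the k largest-magnitude entries; keep the rest as local memory.

    Requires 0 < k <= len(gradient).
    """
    accumulated = memory + gradient  # add back what was not sent earlier
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]
    update = np.zeros_like(accumulated)
    update[idx] = accumulated[idx]       # sparse update that is transmitted
    new_memory = accumulated - update    # residual kept for later rounds
    return update, new_memory


gradient = np.array([0.1, -2.0, 0.3, 1.5])
memory = np.zeros(4)
update, memory = top_k_with_memory(gradient, memory, k=2)
print(update)  # only the entries -2.0 and 1.5 are sent
print(memory)  # the entries 0.1 and 0.3 are carried over
```

Because the unsent residual is added to the next gradient, small components are delayed rather than discarded.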

[LG-136] The Fair Language Model Paradox

Link: https://arxiv.org/abs/2410.11985
Authors: Andrea Pinto, Tomer Galanti, Randall Balestriero
Keywords: Large Language Models, Large Language, real-world applications, widely deployed, deployed in real-world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.

[LG-137] Generative AI Policies under the Microscope: How CS Conferences Are Navigating the New Frontier in Scholarly Writing

Link: https://arxiv.org/abs/2410.11977
Authors: Mahjabin Nahar, Sian Lee, Becky Guillen, Dongwon Lee
Keywords: computer science conferences, policy adoption, paper explores, explores the current, current state
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper explores the current state of generative AI policies of computer science conferences and offers guidelines for policy adoption.

[LG-138] Heterogeneous Graph Generation: A Hierarchical Approach using Node Feature Pooling

Link: https://arxiv.org/abs/2410.11972
Authors: Hritaban Ghosh (Indian Institute of Technology Kharagpur, India), Chen Changyu (Singapore Management University, Singapore), Arunesh Sinha (Rutgers University, Newark, USA), Shamik Sural (Indian Institute of Technology Kharagpur, India)
Keywords: social networks, biological networks, Heterogeneous graphs, recommendation systems, heterogeneous graphs consist
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract: Heterogeneous graphs are present in various domains, such as social networks, recommendation systems, and biological networks. Unlike homogeneous graphs, heterogeneous graphs consist of multiple types of nodes and edges, each representing different entities and relationships. Generating realistic heterogeneous graphs that capture the complex interactions among diverse entities is a difficult task for several reasons. The generator has to model both the node type distribution and the feature distribution for each node type. In this paper, we look into solving challenges in heterogeneous graph generation by employing a two-phase hierarchical structure: in the first phase, a prior diffusion-based model creates a skeleton graph with node types, and in the second phase, we use an encoder and a sampler structure as a generator to assign node-type-specific features to the nodes. A discriminator is used to guide training of the generator, and feature vectors are sampled from a node feature pool. We conduct extensive experiments with subsets of the IMDB and DBLP datasets to show the effectiveness of our method and also the need for various architecture components.

[LG-139] DDIL: Improved Diffusion Distillation With Imitation Learning

Link: https://arxiv.org/abs/2410.11971
Authors: Risheek Garrepalli, Shweta Mahajan, Munawar Hayat, Fatih Porikli
Keywords: sampling requires multiple, requires multiple denoising, multiple denoising network, denoising network passes, limiting practicality
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Diffusion models excel at generative modeling (e.g., text-to-image), but sampling requires multiple denoising network passes, limiting practicality. Efforts such as progressive distillation or consistency distillation have shown promise by reducing the number of passes at the expense of quality of the generated samples. In this work, we identify covariate shift as one reason for the poor performance of multi-step distilled models, arising from compounding error at inference time. To address covariate shift, we formulate diffusion distillation within an imitation learning framework (DDIL) and enhance the training distribution for distilling diffusion models on both the data distribution (forward diffusion) and student-induced distributions (backward diffusion). Training on the data distribution helps to diversify the generations by preserving the marginal data distribution, and training on the student distribution addresses compounding error by correcting covariate shift. In addition, we adopt a reflected diffusion formulation for distillation and demonstrate improved performance and stable training across different distillation methods. We show that DDIL consistently improves on baseline algorithms of progressive distillation (PD), Latent consistency models (LCM) and Distribution Matching Distillation (DMD2).

[LG-140] Integrating Artificial Intelligence Models and Synthetic Image Data for Enhanced Asset Inspection and Defect Identification

Link: https://arxiv.org/abs/2410.11967
Authors: Reddy Mandati, Vladyslav Anderson, Po-chen Chen, Ankush Agarwal, Tatjana Dokic, David Barnard, Michael Finn, Jesse Cromer, Andrew Mccauley, Clay Tutaj, Neha Dave, Bobby Besharati, Jamie Barnett, Timothy Krall
Keywords: past utilities relied, defect detection, identify asset defects, relied on in-field, images
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: In the past, utilities relied on in-field inspections to identify asset defects. Recently, utilities have started using drone-based inspections to enhance the field-inspection process. We consider a vast repository of drone images, providing a wealth of information about asset health and potential issues. However, making the collected imagery data useful for automated defect detection requires significant manual labeling effort. We propose a novel solution that combines synthetic asset defect images with manually labeled drone images. This solution has several benefits: it improves the performance of defect detection, reduces the number of hours spent on manual labeling, and makes it possible to generate realistic images of rare defects for which not enough real-world data is available. We employ a workflow that combines 3D modeling tools such as Maya and Unreal Engine to create photorealistic 3D models and 2D renderings of defective assets and their surroundings. These synthetic images are then integrated into our training pipeline, augmenting the real data. This study implements an end-to-end Artificial Intelligence solution to detect assets and asset defects from the combined imagery repository. The unique contribution of this research lies in the application of advanced computer vision models and the generation of photorealistic 3D renderings of defective assets, aiming to transform the asset inspection process. Our asset detection model achieved an accuracy of 92 percent, and we achieved a performance lift of 67 percent when introducing approximately 2,000 synthetic images of 2k resolution. In our tests, the defect detection model achieved an accuracy of 73 percent across two batches of images. Our analysis demonstrated that synthetic data can be successfully used in place of real-world manually labeled data to train a defect detection model.

[LG-141] A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection

Link: https://arxiv.org/abs/2410.11964
Authors: James Enouen, Mahito Sugiyama
Keywords: learning probability distributions, Markov graphical models, received a significant, theoretical attention, attention in previous
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its large popularity in statistical mechanics and high-dimensional statistics, the vast majority of such energy-based modeling approaches only focus on the two-variable relationships, such as Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure which exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice. On both synthetic and real-world datasets, we demonstrate our algorithm’s effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.

[LG-142] A Prompt-Guided Spatio-Temporal Transformer Model for National-Wide Nuclear Radiation Forecasting

Link: https://arxiv.org/abs/2410.11924
Authors: Tengfei Lyu, Jindong Han, Hao Liu
Keywords: poses substantial risks, Nuclear radiation, nuclei during decay, poses substantial, energy emitted
Categories: Machine Learning (cs.LG)

Abstract:Nuclear radiation (NR), which refers to the energy emitted from atomic nuclei during decay, poses substantial risks to human health and environmental safety. Accurate forecasting of nuclear radiation levels is crucial for informed decision-making by both individuals and governments. However, this task is challenging due to the imbalanced distribution of monitoring stations over a wide spatial range and the non-stationary radiation variation patterns. In this study, we introduce NRFormer, an innovative framework tailored for national-wide prediction of nuclear radiation variations. By integrating a non-stationary temporal attention module, an imbalance-aware spatial attention module, and a radiation propagation prompting module, NRFormer collectively captures complex spatio-temporal dynamics of nuclear radiation. Extensive experiments on two real-world datasets demonstrate the superiority of our proposed framework against seven baselines. This research not only enhances the accuracy and reliability in nuclear radiation forecasting but also contributes to advancing emergency response strategies and monitoring systems, thereby safeguarding environmental and public health.

[LG-143] Spatial-Temporal Bearing Fault Detection Using Graph Attention Networks and LSTM

Link: https://arxiv.org/abs/2410.11923
Authors: Moirangthem Tiken Singh, Rabinder Kumar Prasad, Gurumayum Robert Michael, N. Hemarjit Singh, N. K. Kaphungkui
Keywords: Long Short-Term Memory, Graph Attention Network, combines Graph Attention, Attention Network, Short-Term Memory
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Purpose: This paper aims to enhance bearing fault diagnosis in industrial machinery by introducing a novel method that combines Graph Attention Network (GAT) and Long Short-Term Memory (LSTM) networks. This approach captures both spatial and temporal dependencies within sensor data, improving the accuracy of bearing fault detection under various conditions. Methodology: The proposed method converts time series sensor data into graph representations. GAT captures spatial relationships between components, while LSTM models temporal patterns. The model is validated using the Case Western Reserve University (CWRU) Bearing Dataset, which includes data under different horsepower levels and both normal and faulty conditions. Its performance is compared with methods such as K-Nearest Neighbors (KNN), Local Outlier Factor (LOF), Isolation Forest (IForest) and GNN-based method for bearing fault detection (GNNBFD). Findings: The model achieved outstanding results, with precision, recall, and F1-scores reaching 100% across various testing conditions. It not only identifies faults accurately but also generalizes effectively across different operational scenarios, outperforming traditional methods. Originality: This research presents a unique combination of GAT and LSTM for fault detection, overcoming the limitations of traditional time series methods by capturing complex spatial-temporal dependencies. Its superior performance demonstrates significant potential for predictive maintenance in industrial applications.

[LG-144] A Scalable Communication Protocol for Networks of Large Language Models

Link: https://arxiv.org/abs/2410.11905
Authors: Samuele Marro, Emanuele La Malfa, Jesse Wright, Guohao Li, Nigel Shadbolt, Michael Wooldridge, Philip Torr
Keywords: Agent Communication Trilemma, prerequisite for collaboration, Communication Trilemma, Communication, Agent Communication
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Communication is a prerequisite for collaboration. When scaling networks of AI-powered agents, communication must be versatile, efficient, and portable. These requisites, which we refer to as the Agent Communication Trilemma, are hard to achieve in large networks of agents. We introduce Agora, a meta protocol that leverages existing communication standards to make LLM-powered agents solve complex problems efficiently. In Agora, agents typically use standardised routines for frequent communications, natural language for rare communications, and LLM-written routines for everything in between. Agora sidesteps the Agent Communication Trilemma and robustly handles changes in interfaces and members, allowing unprecedented scalability with full decentralisation and minimal involvement of human beings. On large Agora networks, we observe the emergence of self-organising, fully automated protocols that achieve complex goals without human intervention.
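
The three-tier split described in the abstract (standardised routines for frequent communications, natural language for rare ones, LLM-written routines in between) can be pictured with a toy dispatcher. This is only a sketch under stated assumptions: the thresholds `FREQUENT_MIN` and `RARE_MAX`, the function `choose_channel`, and the use of a simple per-task counter as the frequency measure are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical thresholds: the abstract does not say how "frequent" or "rare" is decided.
FREQUENT_MIN = 10  # seen at least this often: use a standardised routine
RARE_MAX = 2       # seen at most this often: fall back to natural language


def choose_channel(task: str, seen: Counter) -> str:
    """Pick a communication mode from how often this task type has been seen so far."""
    seen[task] += 1
    n = seen[task]
    if n >= FREQUENT_MIN:
        return "standard_routine"
    if n <= RARE_MAX:
        return "natural_language"
    # Everything in between: an LLM-written routine negotiated by the agents.
    return "llm_written_routine"
```

Calling `choose_channel("weather", Counter())` repeatedly with the same counter moves a task from natural language, to an LLM-written routine, to a standardised routine as it becomes more common.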

[LG-145] FLARE: Faithful Logic-Aided Reasoning and Exploration

Link: https://arxiv.org/abs/2410.11900
Authors: Erik Arakelyan, Pasquale Minervini, Pat Verga, Patrick Lewis, Isabelle Augenstein
Keywords: Modern Question Answering, Large Language Models, Large Language, Modern Question, Question Answering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract: Modern Question Answering (QA) and Reasoning approaches based on Large Language Models (LLMs) commonly use prompting techniques, such as Chain-of-Thought (CoT), assuming the resulting generation will have a more granular exploration and reasoning over the question space and scope. However, such methods struggle with generating outputs that are faithful to the intermediate chain of reasoning produced by the model. On the other end of the spectrum, neuro-symbolic methods such as Faithful CoT (F-CoT) propose to combine LLMs with external symbolic solvers. While such approaches boast a high degree of faithfulness, they usually require a model trained for code generation and struggle with tasks that are ambiguous or hard to formalise strictly. We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), a novel interpretable approach for traversing the problem space using task decompositions. We use the LLM to plan a solution, soft-formalise the query into facts and predicates using a logic programming code and simulate that code execution using an exhaustive multi-hop search over the defined space. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers. Our methods achieve SOTA results on 7 out of 9 diverse reasoning benchmarks. We also show that model faithfulness positively correlates with overall performance and further demonstrate that FLARE allows pinpointing the decisive factors sufficient for and leading to the correct answer with optimal reasoning during the multi-hop search.

[LG-146] Automated Discovery of Continuous Dynamics from Videos

Link: https://arxiv.org/abs/2410.11894
Authors: Kuang Huang, Dong Heon Cho, Boyuan Chen
Keywords: Dynamical systems form, predefined state variables, system dynamics equation, Dynamical systems, traditionally modeled
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Chaotic Dynamics (nlin.CD)

Abstract:Dynamical systems form the foundation of scientific discovery, traditionally modeled with predefined state variables such as the angle and angular velocity, and differential equations such as the equation of motion for a single pendulum. We propose an approach to discover a set of state variables that preserve the smoothness of the system dynamics and to construct a vector field representing the system’s dynamics equation, automatically from video streams without prior physical knowledge. The prominence and effectiveness of the proposed approach are demonstrated through both quantitative and qualitative analyses of various dynamical systems, including the prediction of characteristic frequencies and the identification of chaotic and limit cycle behaviors. This shows the potential of our approach to assist human scientists in scientific discovery.

[LG-147] Simulation-based inference with scattering representations: scattering is all you need NEURIPS

Link: https://arxiv.org/abs/2410.11883
Authors: Kiyam Lin, Benjamin Joachimi, Jason D. McEwen
Keywords: cosmological case study, simulation-based inference, case study, SBI, compression for simulation-based
Categories: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (stat.ML)
Comments: 9 pages, 2 figures, accepted by NeurIPS workshop on Machine Learning and the Physical Sciences

Abstract:We demonstrate the first successful use of scattering representations without further compression for simulation-based inference (SBI) with images (i.e. field-level), illustrated with a cosmological case study. Scattering representations provide a highly effective representational space for subsequent learning tasks, although the higher dimensional compressed space introduces challenges. We overcome these through spatial averaging, coupled with more expressive density estimators. Compared to alternative methods, such an approach does not require additional simulations for either training or computing derivatives, is interpretable, and resilient to covariate shift. As expected, we show that a scattering only approach extracts more information than traditional second order summary statistics.

[LG-148] Neural Metamorphosis ECCV2024

Link: https://arxiv.org/abs/2410.11878
Authors: Xingyi Yang, Xinchao Wang
Keywords: termed Neural Metamorphosis, learning paradigm termed, paradigm termed Neural, build self-morphable neural, Neural Metamorphosis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in ECCV2024, this https URL

Abstract: This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks. Contrary to crafting separate models for different architectures or sizes, NeuMeta directly learns the continuous weight manifold of neural networks. Once trained, we can sample weights for any-sized network directly from the manifold, even for previously unseen configurations, without retraining. To achieve this ambitious goal, NeuMeta trains neural implicit functions as hypernetworks. They accept coordinates within the model space as input and generate corresponding weight values on the manifold. In other words, the implicit function is learned in such a way that the predicted weights perform well across various model sizes. In training those models, we notice that the final performance closely relates to the smoothness of the learned manifold. In pursuit of enhancing this smoothness, we employ two strategies. First, we permute weight matrices to achieve intra-model smoothness by solving the Shortest Hamiltonian Path problem. Second, we add noise to the input coordinates when training the implicit function, ensuring that models of various sizes show consistent outputs. As such, NeuMeta shows promising results in synthesizing parameters for various network configurations. Our extensive tests in image classification, semantic segmentation, and image generation reveal that NeuMeta sustains full-size performance even at a 75% compression rate.
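
To make the weight-permutation idea concrete, the sketch below reorders the rows of a small weight matrix so that neighbouring rows are similar, which is one way to read "shortest Hamiltonian path" over rows. It is an illustration under assumptions, not the paper's procedure: it uses a greedy nearest-neighbour heuristic instead of an exact solver, an L1 distance between rows, and the helper names `row_distance`, `total_variation`, and `greedy_order` are hypothetical.

```python
def row_distance(a, b):
    """L1 distance between two weight rows (an assumed choice of metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_variation(rows):
    """Sum of distances between consecutive rows; lower means a smoother ordering."""
    return sum(row_distance(r1, r2) for r1, r2 in zip(rows, rows[1:]))


def greedy_order(rows):
    """Greedy nearest-neighbour path over rows, starting from row 0.

    This is a heuristic for the shortest Hamiltonian path, not an exact solution.
    """
    remaining = list(range(1, len(rows)))
    order = [0]
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda j: row_distance(last, rows[j]))
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

For `rows = [[0, 0], [10, 10], [1, 1], [9, 9]]`, the greedy path visits similar rows consecutively and lowers the total variation. In a real network, permuting the output units of one layer also requires permuting the matching input dimension of the next layer so that the function computed is unchanged.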

[LG-149] A Framework for SLO Carbon and Wastewater-Aware Sustainable FaaS Cloud Platform Management

Link: https://arxiv.org/abs/2410.11875
Authors: Sirui Qi, Hayden Moore, Ninad Hogade, Dejan Milojicic, Cullen Bash, Sudeep Pasricha
Keywords: traditional serverful approaches, growing cloud computing, cloud computing paradigm, serverful approaches, growing cloud
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Function-as-a-Service (FaaS) is a growing cloud computing paradigm that is expected to reduce the user cost of service over traditional serverful approaches. However, the environmental impact of FaaS has not received much attention. We investigate FaaS scheduling and scaling from a sustainability perspective in this work. We find that the service-level objectives (SLOs) of FaaS and carbon emissions conflict with each other. We also find that SLO-focused FaaS scheduling can exacerbate water use in a datacenter. We propose a novel sustainability-focused FaaS scheduling and scaling framework to co-optimize SLO performance, carbon emissions, and wastewater generation.

[LG-150] Enhancing UI Location Capabilities of Autonomous Agents

Link: https://arxiv.org/abs/2410.11872
Authors: Jakub Hoscilowicz, Bartosz Maj, Bartosz Kozakiewicz, Oleksii Tymoschuk, Artur Janicki
Keywords: graphical user interfaces, digital devices equipped, effective automation tools, user interfaces, increasingly important
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. Although multimodal large language models (MLLMs) like GPT-4V excel at tasks such as drafting emails, they struggle with GUI interactions, which limits their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent significantly outperforms other prompt-based autonomous agents (such as CogAgent, AppAgent, and Auto-UI) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.

[LG-151] LLMProxy: Reducing Cost to Access Large Language Models

Link: https://arxiv.org/abs/2410.11857
Authors: Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar
Keywords: large language models, proxy for large, large language, language models, explicit support
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:In this paper, we make a case for a proxy for large language models which has explicit support for cost-saving optimizations. We design LLMProxy, which supports three key optimizations: model selection, context management, and caching. These optimizations present tradeoffs in terms of cost, inference time, and response quality, which applications can navigate through our high level, bidirectional interface. As a case study, we implement a WhatsApp-based QA service that uses LLMProxy to provide a rich set of features to the users. This service is deployed on a small scale (100+ users) leveraging the cloud; it has been operational for 15+ weeks and users have asked 1400+ questions so far. We report on the experiences of running this service as well as microbenchmark the specific benefits of the various cost-optimizations we present in this paper.
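
The three optimizations named in the abstract (model selection, context management, caching) can be illustrated with a toy proxy. This is a minimal sketch and not the LLMProxy interface, which the abstract does not describe: the class `ToyLLMProxy`, the `prefer_quality` flag, the exact-match cache key, and the "keep the last few turns" context policy are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # hypothetical model interface: prompt in, answer out


class ToyLLMProxy:
    """Illustrative proxy with an exact-match cache, two-model selection, and context trimming."""

    def __init__(self, cheap: Model, strong: Model, max_turns: int = 3) -> None:
        self.cheap = cheap
        self.strong = strong
        self.max_turns = max_turns  # context management: keep only recent turns
        self.history: List[str] = []
        self.cache: Dict[Tuple[str, bool], str] = {}

    def ask(self, question: str, prefer_quality: bool = False) -> str:
        key = (question.strip().lower(), prefer_quality)
        if key in self.cache:  # caching: skip the model call entirely
            return self.cache[key]
        recent = self.history[-self.max_turns:] if self.max_turns > 0 else []
        prompt = "\n".join(recent + [question])
        model = self.strong if prefer_quality else self.cheap  # model selection
        answer = model(prompt)
        self.history.append(question)
        self.cache[key] = answer
        return answer
```

Each knob trades cost against quality or latency: a cache hit costs nothing but may return a stale answer, the cheaper model lowers cost at some risk to quality, and a shorter context reduces tokens sent per request.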

[LG-152] Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Link: https://arxiv.org/abs/2410.11855
Authors: Xiongxiao Xu, Solomon Abera Bekele, Brice Videau, Kai Shu
Keywords: small wearable devices, future computing architectures, leadership computing facilities, large-scale leadership computing, critical design metric
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract: Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high performance computing (HPC) systems. Moreover, existing methods typically rely on either purely offline training or a hybrid of offline and online training, which are impractical and lead to energy loss during data collection. Therefore, this paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios. The problem is challenging due to the inherent performance-energy trade-offs of GPUs, the exploration-exploitation dilemma across frequencies, and the lack of explicit performance counters in GPUs. To address these challenges, we formulate the online energy consumption optimization problem as a multi-armed bandit problem and develop a novel bandit-based framework, EnergyUCB. EnergyUCB is designed to dynamically adjust GPU core frequencies in real time, reducing energy consumption with minimal impact on performance. Specifically, the proposed framework EnergyUCB (1) balances the performance-energy trade-off in the reward function, (2) effectively navigates the exploration-exploitation dilemma when adjusting GPU core frequencies online, and (3) leverages the ratio of GPU core utilization to uncore utilization as a real-time GPU performance metric. Experiments on a wide range of real-world HPC benchmarks demonstrate that EnergyUCB can achieve substantial energy savings. The code of EnergyUCB is available at this https URL.
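
To show what a bandit formulation over frequency settings looks like, here is a generic UCB1 loop in which each arm stands for one candidate GPU core frequency. It is a sketch under assumptions and not the EnergyUCB algorithm: the reward function is supplied by the caller (the paper's reward, which balances performance and energy, is not reproduced here), and the names `ucb_select` and `run_bandit` and the exploration constant `c` are hypothetical.

```python
import math
import random


def ucb_select(counts, means, t, c=2.0):
    """Return the arm with the highest upper confidence bound at round t."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before trusting the bound
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def run_bandit(reward_fn, n_arms, steps, seed=0):
    """Run UCB1 for `steps` rounds; reward_fn(arm, rng) returns a reward in [0, 1]."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        arm = ucb_select(counts, means, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return counts, means
```

With a reward that is highest for one frequency, the loop concentrates its pulls on that arm while still sampling the others occasionally, which is the exploration-exploitation balance the abstract refers to.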

[LG-153] GeoLife+: Large-Scale Simulated Trajectory Datasets Calibrated to the GeoLife Dataset

Link: https://arxiv.org/abs/2410.11853
Authors: Hossein Amiri, Richard Yang, Andreas Zufle
Keywords: Analyzing individual human, data, Analyzing individual, academic applications, finds many commercial
Categories: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted paper at this https URL

Abstract: Analyzing individual human trajectory data helps our understanding of human mobility and finds many commercial and academic applications. There are two main approaches to accessing trajectory data for research: one involves using real-world datasets like GeoLife, while the other employs simulations to synthesize data. Real-world data provides insights from real human activities, but such data is generally sparse due to voluntary participation. Conversely, simulated data can be more comprehensive but may capture unrealistic human behavior. In this Data and Resource paper, we combine the benefits of both by leveraging the statistical features of real-world data and the comprehensiveness of simulated data. Specifically, we extract features from the real-world GeoLife dataset such as the average number of individual daily trips, average radius of gyration, and maximum and minimum trip distances. We calibrate the Pattern of Life Simulation, a realistic simulation of human mobility, to reproduce these features. To this end, we use a genetic algorithm to calibrate the parameters of the simulation to mimic the GeoLife features. For this calibration, we simulated numerous random simulation settings, measured the similarity of generated trajectories to GeoLife, and iteratively (over many generations) combined parameter settings of trajectory datasets most similar to GeoLife. Using the calibrated simulation, we simulate large trajectory datasets that we call GeoLife+, where + denotes the Kleene Plus, indicating unlimited replication with at least one occurrence. We provide simulated GeoLife+ data with 182, 1k, and 5k users over 5 years; 10k and 50k users over one year; and 100k users over 6 months of simulation lifetime.
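
One of the calibration features mentioned above, the radius of gyration, has a standard definition: the root-mean-square distance of a user's visited points from their centroid. The sketch below computes it for points already projected to planar coordinates; that projection, the function name `radius_of_gyration`, and the unweighted form (every point counts equally) are assumptions for illustration, and the paper may compute the feature differently.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of (x, y) points from their centroid.

    Assumes planar coordinates in a common unit (for example metres),
    not raw latitude/longitude.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)
```

A user who only moves between two places 2 km apart has a radius of gyration of 1 km, while a user who never moves has a radius of 0.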

[LG-154] A Robust Multisource Remote Sensing Image Matching Method Utilizing Attention and Feature Enhancement Against Noise Interference

Link: https://arxiv.org/abs/2410.11848
Authors: Yuan Li, Dapeng Wu, Yaping Cui, Peng He, Yuan Zhang, Ruyan Wang
Keywords: remote sensing image, multisource remote sensing, remote sensing, sensing image applications, sensing image
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 13 figures

Abstract: Image matching is a fundamental and critical task in multisource remote sensing image applications. However, remote sensing images are susceptible to various kinds of noise. Accordingly, achieving accurate matching in noisy images is a challenging problem. To solve this issue, we propose a robust multisource remote sensing image matching method utilizing attention and feature enhancement against noise interference. In the first stage, we combine deep convolution with the attention mechanism of the transformer to perform dense feature extraction, constructing feature descriptors with higher discriminability and robustness. Subsequently, we employ a coarse-to-fine matching strategy to achieve dense matches. In the second stage, we introduce an outlier removal network based on a binary classification mechanism, which can establish effective and geometrically consistent correspondences between images; by weighting each correspondence, it classifies inliers versus outliers and removes the outliers from the dense matches. Ultimately, we can accomplish more efficient and accurate matches. To validate the performance of the proposed method, we conduct experiments using multisource remote sensing image datasets for comparison with other state-of-the-art methods under different scenarios, including noise-free, additive random noise, and periodic stripe noise. Comparative results indicate that the proposed method has better-balanced performance and robustness. The proposed method provides a valuable reference for solving the difficult problem of noisy image matching.

[LG-155] From Commands to Prompts: LLM-based Semantic File System for AIOS

Link: https://arxiv.org/abs/2410.11843
Authors: Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang
Keywords: Large language models, file, Large language, demonstrated significant potential, semantic file
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Abstract: Large language models (LLMs) have demonstrated significant potential in the development of intelligent applications and systems such as LLM-based agents and agent operating systems (AIOS). However, when these applications and systems interact with the underlying file system, the file system still follows the traditional paradigm: reliant on manual navigation through precise commands. This paradigm poses a bottleneck to the usability of these systems, as users are required to navigate complex folder hierarchies and remember cryptic file names. To address this limitation, we propose an LLM-based semantic file system (LSFS) for prompt-driven file management. Unlike conventional approaches, LSFS incorporates LLMs to enable users or agents to interact with files through natural language prompts, facilitating semantic file management. At the macro level, we develop a comprehensive API set to achieve semantic file management functionalities, such as semantic file retrieval, file update monitoring and summarization, and semantic file rollback. At the micro level, we store files by constructing semantic indexes for them, and we design and implement syscalls for different semantic operations (e.g., CRUD, group by, join) powered by a vector database. Our experiments show that LSFS offers significant improvements over traditional file systems in terms of user convenience, the diversity of supported functions, and the accuracy and efficiency of file operations. Additionally, with the integration of LLMs, our system enables more intelligent file management tasks, such as content summarization and version comparison, further enhancing its capabilities.

[LG-156] Survey and Evaluation of Converging Architecture in LLMs based on Footsteps of Operations

Link: https://arxiv.org/abs/2410.11381
Authors: Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, Joon-Sung Yang
Keywords: Transformer architecture enables, enables contextually natural, contextually natural text, natural text generation, processing entire source
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages and 16 figures

Abstract: The advent of the Attention mechanism and Transformer architecture enables contextually natural text generation and compresses the burden of processing entire source information into singular vectors. Based on these two main ideas, model sizes have gradually increased to accommodate more precise and comprehensive information, leading to the current state-of-the-art LLMs being very large, with parameters around 70 billion. As model sizes grow, the demand for substantial storage and computational capacity increases. This has led to the development of high-bandwidth memory and accelerators, as well as a variety of model architectures designed to meet these requirements. We note that LLM architectures have increasingly converged. This paper analyzes how these converged architectures perform in terms of layer configurations, operational mechanisms, and model sizes, considering various hyperparameter settings. In this paper, we conduct a concise survey of the history of LLMs by tracing the evolution of their operational improvements. Furthermore, we summarize the performance trends of LLMs under various hyperparameter settings using the RTX 6000, which features the state-of-the-art Ada Lovelace architecture. We conclude that even the same model can exhibit different behaviors depending on the hyperparameters or whether it is deployed in server or edge environments.

[LG-157] Improving Numerical Stability of Normalized Mutual Information Estimator on High Dimensions

Link: https://arxiv.org/abs/2410.07642
Authors: Marko Tuononen, Ville Hautamäki
Keywords: normalized mutual information, Mutual information, general-purpose metric, metric for quantifying, quantifying the amount
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 4+1 pages, 2 figures, 20 equations

Abstract: Mutual information provides a powerful, general-purpose metric for quantifying the amount of shared information between variables. Estimating normalized mutual information using a k-Nearest Neighbor (k-NN) based approach involves the calculation of the scaling-invariant k-NN radius. Calculation of the radius suffers from numerical overflow when the joint dimensionality of the data becomes high, typically in the range of several hundred dimensions. To address this issue, we propose a logarithmic transformation technique that improves the numerical stability of the radius calculation in high-dimensional spaces. By applying the proposed transformation during the calculation of the radius, numerical overflow is avoided, and precision is maintained. The proposed transformation is validated through both theoretical analysis and empirical evaluation, demonstrating its ability to stabilize the calculation without compromising the precision of the results.
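
The general reason a logarithmic transformation avoids overflow in several hundred dimensions can be seen with a stand-in quantity. The abstract does not give the radius formula, so the sketch below uses the volume of a d-dimensional ball, which is an assumption chosen only because it contains the same kind of terms (a radius raised to the power d and a gamma function). The function names are hypothetical.

```python
import math


def ball_volume_naive(r: float, d: int) -> float:
    """Volume of a d-ball computed directly; overflows for large d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


def ball_log_volume(r: float, d: int) -> float:
    """Logarithm of the same volume, computed without forming any huge intermediate."""
    return 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1) + d * math.log(r)
```

For `d = 400` the direct formula raises `OverflowError` in double precision, while the log-space version returns a finite number that can be compared, differenced, or exponentiated later if the final result is in range.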

[LG-158] OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

Link: https://arxiv.org/abs/2410.04328
Authors: Yu-Shin Huang,Peter Just,Krishna Narayanan,Chao Tian
Keywords: Large Language Model, arithmetic coding decoder, Language Model, Large Language, drives an arithmetic
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 9 figures

Abstract:We consider coverless steganography where a Large Language Model (LLM) drives an arithmetic coding decoder to generate stego-texts. An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent. We show that on the individual token level, this problem is mathematically equivalent to maximizing the entropy of a replacement probability distribution of the next token generation, subject to a constraint on the KL divergence between the chosen probability distribution and the original distribution given by the LLM. A closed-form solution is provided for the optimization problem, which can be computed efficiently. Several important practical issues are also tackled: 1) An often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) The combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) The combination of the optimized distribution with other sequence-level selection heuristics to further enhance the efficiency and reliability is studied.
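
The abstract does not state the closed-form solution, so the sketch below is an illustration rather than the authors' method. It assumes the constraint is KL(q‖p) ≤ delta with all probabilities strictly positive, and uses the standard Lagrangian result that the entropy maximizer then has the tempered form q ∝ p^beta for some beta in [0, 1]. The exponent is found here by bisection; `max_entropy_within_kl` and the other names are hypothetical.

```python
import math

def tempered(p, beta):
    """Return q proportional to p**beta (beta=1 gives p, beta=0 gives uniform)."""
    weights = [pi ** beta for pi in p]
    z = sum(weights)
    return [w / z for w in weights]

def kl(q, p):
    """KL(q || p) in nats; assumes p is strictly positive."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def max_entropy_within_kl(p, delta, iters=60):
    """Highest-entropy tempered distribution with KL(q || p) <= delta."""
    if kl(tempered(p, 0.0), p) <= delta:
        return tempered(p, 0.0)  # the uniform distribution already fits the budget
    lo, hi = 0.0, 1.0            # KL is too large at lo and within budget at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tempered(p, mid), p) > delta:
            lo = mid
        else:
            hi = mid
    return tempered(p, hi)

p = [0.7, 0.2, 0.1]              # illustrative next-token distribution
q = max_entropy_within_kl(p, delta=0.05)
```

Bisection is valid because KL(q‖p) decreases monotonically as beta goes from 0 to 1. A flatter q lets an arithmetic decoder embed more bits per token, which is the efficiency argument the abstract makes.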

[LG-159] On the sample complexity of purity and inner product estimation

Link: https://arxiv.org/abs/2410.12712
Authors: Weiyuan Gong,Jonas Haferkamp,Qi Ye,Zhihan Zhang
Keywords: product estimation, quantum, epsilon, quantum communication, estimation
Categories: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 33 pages, 1 figure

Abstract:We study the sample complexity of the prototypical tasks quantum purity estimation and quantum inner product estimation. In purity estimation, we are to estimate $\mathrm{tr}(\rho^2)$ of an unknown quantum state $\rho$ to additive error $\epsilon$. Meanwhile, for quantum inner product estimation, Alice and Bob are to estimate $\mathrm{tr}(\rho\sigma)$ to additive error $\epsilon$ given copies of unknown quantum state $\rho$ and $\sigma$ using classical communication and restricted quantum communication. In this paper, we show a strong connection between the sample complexity of purity estimation with bounded quantum memory and inner product estimation with bounded quantum communication and unentangled measurements. We propose a protocol that solves quantum inner product estimation with $k$-qubit one-way quantum communication and unentangled local measurements using $O(\mathrm{median}\{1/\epsilon^2, 2^{n/2}/\epsilon, 2^{n-k}/\epsilon^2\})$ copies of $\rho$ and $\sigma$. Our protocol can be modified to estimate the purity of an unknown quantum state $\rho$ using $k$-qubit quantum memory with the same complexity. We prove that arbitrary protocols with $k$-qubit quantum memory that estimate purity to error $\epsilon$ require $\Omega(\mathrm{median}\{1/\epsilon^2, 2^{n/2}/\sqrt{\epsilon}, 2^{n-k}/\epsilon^2\})$ copies of $\rho$. This indicates the same lower bound for quantum inner product estimation with one-way $k$-qubit quantum communication and classical communication, and unentangled local measurements. For purity estimation, we further improve the lower bound to $\Omega(\max\{1/\epsilon^2, 2^{n/2}/\epsilon\})$ for any protocols using an identical single-copy projection-valued measurement. Additionally, we investigate a decisional variant of quantum distributed inner product estimation without quantum communication for mixed state and provide a lower bound on the sample complexity.

[LG-160] Local transfer learning Gaussian process modeling with applications to surrogate modeling of expensive computer simulators

Link: https://arxiv.org/abs/2410.12690
Authors: Xinming Wang,Simon Mak,John Miller,Jianguo Wu
Keywords: transfer, costly nature, nature of computer, computer simulations, simulations for complex
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

Abstract:A critical bottleneck for scientific progress is the costly nature of computer simulations for complex systems. Surrogate models provide an appealing solution: such models are trained on simulator evaluations, then used to emulate and quantify uncertainty on the expensive simulator at unexplored inputs. In many applications, one often has available data on related systems. For example, in designing a new jet turbine, there may be existing studies on turbines with similar configurations. A key question is how information from such “source” systems can be transferred for effective surrogate training on the “target” system of interest. We thus propose a new LOcal transfer Learning Gaussian Process (LOL-GP) model, which leverages a carefully-designed Gaussian process to transfer such information for surrogate modeling. The key novelty of the LOL-GP is a latent regularization model, which identifies regions where transfer should be performed and regions where it should be avoided. This “local transfer” property is desirable in scientific systems: at certain parameters, such systems may behave similarly and thus transfer is beneficial; at other parameters, they may behave differently and thus transfer is detrimental. By accounting for local transfer, the LOL-GP can rectify a critical limitation of “negative transfer” in existing transfer learning models, where the transfer of information worsens predictive performance. We derive a Gibbs sampling algorithm for efficient posterior predictive sampling on the LOL-GP, for both the multi-source and multi-fidelity transfer settings. We then show, via a suite of numerical experiments and an application for jet turbine design, the improved surrogate performance of the LOL-GP over existing methods.

[LG-161] A distance function for stochastic matrices

Link: https://arxiv.org/abs/2410.12689
Authors: Antony Lee,Peter Tino,Iain Bruce Styles
Keywords: Motivated by information, information geometry, Markov chain runs, Markov chains, Markov
Categories: Probability (math.PR); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures

Abstract:Motivated by information geometry, a distance function on the space of stochastic matrices is advocated. Starting with sequences of Markov chains, the Bhattacharyya angle is advocated as the natural tool for comparing both short and long term Markov chain runs. Bounds on the convergence of the distance and mixing times are derived. Guided by the desire to compare different Markov chain models, especially in the setting of healthcare processes, a new distance function on the space of stochastic matrices is presented. It is a true distance measure which has a closed form and is efficient to implement for numerical evaluation. In the case of ergodic Markov chains, it is shown that considering either the Bhattacharyya angle on Markov sequences or the new stochastic matrix distance leads to the same distance between models.
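
For readers unfamiliar with the Bhattacharyya angle, here is a minimal sketch of it for discrete distributions, together with one possible way to apply it to finite runs of two Markov chains. The path-level reading (comparing the distributions over whole state sequences from a shared start distribution) is an assumption made for illustration. The paper's new stochastic matrix distance is not given in the abstract and is not reproduced here. All function names are hypothetical.

```python
import math

def bhattacharyya_angle(p, q):
    """arccos of the Bhattacharyya coefficient sum_i sqrt(p_i * q_i)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.acos(min(1.0, max(0.0, bc)))  # clamp against rounding error

def path_bhattacharyya_angle(start, P, Q, steps):
    """Angle between the path distributions of chains P and Q after `steps` moves.

    The coefficient over all paths factorizes, so it can be accumulated with the
    elementwise matrix M[i][j] = sqrt(P[i][j] * Q[i][j]).
    """
    n = len(P)
    v = list(start)  # shared start distribution: sqrt(s_i * s_i) = s_i
    for _ in range(steps):
        v = [sum(v[i] * math.sqrt(P[i][j] * Q[i][j]) for i in range(n))
             for j in range(n)]
    return math.acos(min(1.0, max(0.0, sum(v))))

P = [[0.9, 0.1], [0.2, 0.8]]   # illustrative transition matrices
Q = [[0.7, 0.3], [0.4, 0.6]]
short_run = path_bhattacharyya_angle([0.5, 0.5], P, Q, steps=1)
long_run = path_bhattacharyya_angle([0.5, 0.5], P, Q, steps=50)
```

The angle is 0 for identical models and approaches π/2 as the path distributions stop overlapping, so longer runs of two different chains give larger angles.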

[LG-162] Generative Neural Reparameterization for Differentiable PDE-constrained Optimization NEURIPS2024

Link: https://arxiv.org/abs/2410.12683
Authors: Archis S. Joglekar
Keywords: acquiring optimal parameters, optimal parameters, systems governed, constrained optimization, neural network
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA); Plasma Physics (physics.plasm-ph)
Comments: Accepted to D3S3: Data-driven and Differentiable Simulations, Surrogates, and Solvers - Workshop @ NeurIPS 2024

Abstract:Partial-differential-equation (PDE)-constrained optimization is a well-worn technique for acquiring optimal parameters of systems governed by PDEs. However, this approach is limited to providing a single set of optimal parameters per optimization. Given a differentiable PDE solver, if the free parameters are reparameterized as the output of a neural network, that neural network can be trained to learn a map from a probability distribution to the distribution of optimal parameters. This proves useful in the case where there are many well performing local minima for the PDE. We apply this technique to train a neural network that generates optimal parameters that minimize laser-plasma instabilities relevant to laser fusion and show that the neural network generates many well performing and diverse minima.

[LG-163] Efficient Optimization Algorithms for Linear Adversarial Training

Link: https://arxiv.org/abs/2410.12677
Authors: Antônio H. Ribeiro,Thomas B. Schön,Dave Zahariah,Francis Bach
Keywords: robust against perturbations, learn models, Adversarial training, linear models, models
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Adversarial training can be used to learn models that are robust against perturbations. For linear models, it can be formulated as a convex optimization problem. Compared to methods proposed in the context of deep learning, leveraging the optimization structure allows significantly faster convergence rates. Still, the use of generic convex solvers can be inefficient for large-scale problems. Here, we propose tailored optimization algorithms for the adversarial training of linear models, which render large-scale regression and classification problems more tractable. For regression problems, we propose a family of solvers based on iterative ridge regression and, for classification, a family of solvers based on projected gradient descent. The methods are based on extended variable reformulations of the original problem. We illustrate their efficiency in numerical examples.
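
To make the convexity claim concrete, the sketch below illustrates a well-known property of linear regression: when the input perturbation is bounded in the ℓ∞ norm by eps, the worst-case squared error has the closed form (|y − xᵀβ| + eps·‖β‖₁)², which is convex in β. The ℓ∞ threat model and the function names are assumptions chosen for illustration. This is not one of the solvers proposed in the paper.

```python
import itertools

def adversarial_sq_loss(x, y, beta, eps):
    """Closed-form worst-case squared error under an l-infinity perturbation."""
    residual = y - sum(xi * bi for xi, bi in zip(x, beta))
    l1_norm = sum(abs(b) for b in beta)  # dual norm of l-infinity
    return (abs(residual) + eps * l1_norm) ** 2

def brute_force_sq_loss(x, y, beta, eps):
    """Check by enumerating the corners of the perturbation box.

    The squared error is convex in the perturbation, so its maximum over the
    box is attained at a corner.
    """
    worst = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        pred = sum((xi + eps * s) * bi for xi, s, bi in zip(x, signs, beta))
        worst = max(worst, (y - pred) ** 2)
    return worst

x, y = [1.0, -2.0, 0.5], 0.3          # illustrative data point
beta, eps = [0.4, -0.1, 2.0], 0.1     # illustrative weights and radius
closed = adversarial_sq_loss(x, y, beta, eps)
brute = brute_force_sq_loss(x, y, beta, eps)
```

Because the inner maximization collapses to this expression, the training objective needs no inner attack loop, which is the structure that tailored solvers can exploit.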

[LG-164] Towards Arbitrary QUBO Optimization: Analysis of Classical and Quantum-Activated Feedforward Neural Networks

Link: https://arxiv.org/abs/2410.12636
Authors: Chia-Tso Lai,Carsten Blank,Peter Schmelcher,Rick Mukherjee
Keywords: Quadratic Unconstrained Binary, Quadratic Unconstrained, Unconstrained Binary Optimization, Unconstrained Binary, supply chain
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Quadratic Unconstrained Binary Optimization (QUBO) sits at the heart of many industries and academic fields such as logistics, supply chain, finance, pharmaceutical science, chemistry, IT, and energy sectors, among others. These problems typically involve optimizing a large number of binary variables, which makes finding exact solutions exponentially more difficult. Consequently, most QUBO problems are classified as NP-hard. To address this challenge, we developed a powerful feedforward neural network (FNN) optimizer for arbitrary QUBO problems. In this work, we demonstrate that the FNN optimizer can provide high-quality approximate solutions for large problems, including dense 80-variable weighted MaxCut and random QUBOs, achieving an average accuracy of over 99% in less than 1.1 seconds on an 8-core CPU. Additionally, the FNN optimizer outperformed the Gurobi optimizer by 72% on 200-variable random QUBO problems within a 100-second computation time limit, exhibiting strong potential for real-time optimization tasks. Building on this model, we explored the novel approach of integrating FNNs with a quantum annealer-based activation function to create a quantum-classical encoder-decoder (QCED) optimizer, aiming to further enhance the performance of FNNs in QUBO optimization.
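
As a reminder of what a QUBO instance looks like, here is a minimal sketch of the objective xᵀQx over binary vectors, with an exhaustive solver. Exhaustive search enumerates all 2^n assignments, so it only works as a ground-truth reference on very small instances. The matrix below is an illustrative example, not data from the paper, and the FNN optimizer itself is not reproduced.

```python
import itertools

def qubo_energy(Q, x):
    """Objective x^T Q x for a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def brute_force_qubo(Q):
    """Exact minimizer by enumerating all 2**n binary assignments."""
    n = len(Q)
    best_x = min(itertools.product((0, 1), repeat=n),
                 key=lambda x: qubo_energy(Q, x))
    return best_x, qubo_energy(Q, best_x)

# Upper-triangular example: diagonal entries reward switching a variable on,
# off-diagonal entries penalize switching on neighbouring variables together.
Q = [[-1, 2, 0],
     [0, -1, 2],
     [0, 0, -1]]
best_x, best_energy = brute_force_qubo(Q)
```

The exponential cost of this reference solver is the reason approximate optimizers are evaluated by how close they get to the exact optimum on instances where it is still computable.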

[LG-165] From Lab to Pocket: A Novel Continual Learning-based Mobile Application for Screening COVID-19

Link: https://arxiv.org/abs/2410.12589
Authors: Danny Falero,Muhammad Ashad Kabir,Nusrat Homaira
Keywords: Artificial intelligence, continual learning, medical images, learning, continual
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:Artificial intelligence (AI) has emerged as a promising tool for predicting COVID-19 from medical images. In this paper, we propose a novel continual learning-based approach and present the design and implementation of a mobile application for screening COVID-19. Our approach demonstrates the ability to adapt to evolving datasets, including data collected from different locations or hospitals, varying virus strains, and diverse clinical presentations, without retraining from scratch. We have evaluated state-of-the-art continual learning methods for detecting COVID-19 from chest X-rays and selected the best-performing model for our mobile app. We evaluated various deep learning architectures to select the best-performing one as a foundation model for continual learning. Both regularization and memory-based methods for continual learning were tested, using different memory sizes to develop the optimal continual learning model for our app. DenseNet161 emerged as the best foundation model with 96.87% accuracy, and Learning without Forgetting (LwF) was the top continual learning method with an overall performance of 71.99%. The mobile app design considers both patient and doctor perspectives. It incorporates the continual learning DenseNet161 LwF model on a cloud server, enabling the model to learn from new instances of chest X-rays and their classifications as they are submitted. The app is designed, implemented, and evaluated to ensure it provides an efficient tool for COVID-19 screening. The app is available to download from this https URL.
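
For context on Learning without Forgetting (LwF), the sketch below shows the usual shape of its per-example objective: a cross-entropy term on the new data plus a distillation term that keeps the old output head close to the responses recorded before the update. The weighting `lam`, the temperature of 2, and all function names are assumptions for illustration. The abstract does not describe the exact configuration used in the app.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs) if t > 0)

def lwf_loss(new_logits, label, old_head_logits, recorded_old_logits,
             lam=1.0, temperature=2.0):
    """New-task cross-entropy plus a distillation penalty on the old head."""
    new_task = -math.log(softmax(new_logits)[label])
    distill = cross_entropy(softmax(recorded_old_logits, temperature),
                            softmax(old_head_logits, temperature))
    return new_task + lam * distill

recorded = [2.0, 0.5, -1.0]   # old-head outputs saved before the update
drifted = [-1.0, 0.5, 2.0]    # old-head outputs after forgetting
new_logits = [1.0, 0.2, -0.5]
loss_kept = lwf_loss(new_logits, 0, recorded, recorded)
loss_forgot = lwf_loss(new_logits, 0, drifted, recorded)
```

The distillation term is smallest when the current old-head outputs match the recorded ones, so drifting away from earlier behaviour is penalized without storing earlier training images.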

[LG-166] Self-DenseMobileNet: A Robust Framework for Lung Nodule Classification using Self-ONN and Stacking-based Meta-Classifier

Link: https://arxiv.org/abs/2410.12584
Authors: Md. Sohanur Rahman,Muhammad E. H. Chowdhury,Hasib Ryan Rahman,Mosabber Uddin Ahmed,Muhammad Ashad Kabir,Sanjiban Sekhar Roy,Rusab Sarmun
Keywords: chest radiographs, non-nodules in chest, designed to enhance, improving classification accuracy, classification
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:In this study, we propose a novel and robust framework, Self-DenseMobileNet, designed to enhance the classification of nodules and non-nodules in chest radiographs (CXRs). Our approach integrates advanced image standardization and enhancement techniques to optimize the input quality, thereby improving classification accuracy. To enhance predictive accuracy and leverage the strengths of multiple models, the prediction probabilities from Self-DenseMobileNet were transformed into tabular data and used to train eight classical machine learning (ML) models; the top three performers were then combined via a stacking algorithm, creating a robust meta-classifier that integrates their collective insights for superior classification performance. To enhance the interpretability of our results, we employed class activation mapping (CAM) to visualize the decision-making process of the best-performing model. Our proposed framework demonstrated remarkable performance on internal validation data, achieving an accuracy of 99.28% using a Meta-Random Forest Classifier. When tested on an external dataset, the framework maintained strong generalizability with an accuracy of 89.40%. These results highlight a significant improvement in the classification of CXRs with lung nodules.

[LG-167] Evaluating Utility of Memory Efficient Medical Image Generation: A Study on Lung Nodule Segmentation

Link: https://arxiv.org/abs/2410.12542
Authors: Kathrin Khadra,Utku Türkbey
Keywords: imaging data limits, scarcity of publicly, limits the development, development of effective, synthetic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The scarcity of publicly available medical imaging data limits the development of effective AI models. This work proposes a memory-efficient patch-wise denoising diffusion probabilistic model (DDPM) for generating synthetic medical images, focusing on CT scans with lung nodules. Our approach generates high-utility synthetic images with nodule segmentation while efficiently managing memory constraints, enabling the creation of training datasets. We evaluate the method in two scenarios: training a segmentation model exclusively on synthetic data, and augmenting real-world training data with synthetic images. In the first case, models trained solely on synthetic data achieve Dice scores comparable to those trained on real-world data benchmarks. In the second case, augmenting real-world data with synthetic images significantly improves segmentation performance. The generated images demonstrate their potential to enhance medical image datasets in scenarios with limited real-world data.
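
The Dice score used to compare the segmentation models can be sketched as follows for flattened binary masks. Returning 1.0 when both masks are empty is a common convention and an assumption here, since the abstract does not say how that case is handled. The function name is hypothetical.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed)
    return 2.0 * intersection / total

predicted = [1, 1, 0, 0]  # illustrative flattened prediction
reference = [1, 0, 0, 0]  # illustrative flattened ground truth
score = dice_score(predicted, reference)
```

A score of 1 means the predicted and reference nodule masks coincide, and 0 means they do not overlap at all.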

[LG-168] SiFiSinger: A High-Fidelity End-to-End Singing Voice Synthesizer based on Source-filter Model ICASSP2024

Link: https://arxiv.org/abs/2410.12536
Authors: Jianwei Cui,Yu Gu,Chao Weng,Jie Zhang,Liping Chen,Lirong Dai
Keywords: high-fidelity human-like singing, directly translates lyrical, singing voice synthesis, singing voice, human-like singing
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted by ICASSP 2024, Synthesized audio samples are available at: this https URL

Abstract:This paper presents an advanced end-to-end singing voice synthesis (SVS) system based on the source-filter mechanism that directly translates lyrical and melodic cues into expressive and high-fidelity human-like singing. Similarly to VISinger 2, the proposed system also utilizes training paradigms evolved from VITS and incorporates elements like the fundamental pitch (F0) predictor and waveform generation decoder. To address the issue that the coupling of mel-spectrogram features with F0 information may introduce errors during F0 prediction, we consider two strategies. Firstly, we leverage mel-cepstrum (mcep) features to decouple the intertwined mel-spectrogram and F0 characteristics. Secondly, inspired by the neural source-filter models, we introduce source excitation signals as the representation of F0 in the SVS system, aiming to capture pitch nuances more accurately. Meanwhile, differentiable mcep and F0 losses are employed as the waveform decoder supervision to fortify the prediction accuracy of speech envelope and pitch in the generated speech. Experiments on the Opencpop dataset demonstrate efficacy of the proposed model in synthesis quality and intonation accuracy.

[LG-169] Nonlinear Bayesian tomography of ion temperature and velocity for Doppler coherence imaging spectroscopy in RT-1

Link: https://arxiv.org/abs/2410.12424
Authors: Kenji Ueda,Masaki Nishiura
Keywords: Coherence Imaging Spectroscopy, Imaging Spectroscopy, Bayesian tomography approach, Coherence Imaging, approach for Coherence
Categories: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures

Abstract:We present a novel Bayesian tomography approach for Coherence Imaging Spectroscopy (CIS) that simultaneously reconstructs ion temperature and velocity distributions in plasmas. Utilizing nonlinear Gaussian Process Tomography (GPT) with the Laplace approximation, we model prior distributions of log-emissivity, temperature, and velocity as Gaussian processes. This framework rigorously incorporates nonlinear effects and temperature dependencies often neglected in conventional CIS tomography, enabling robust reconstruction even in the region of high temperature and velocity. By applying a log-Gaussian process, we also address issues like velocity divergence in low-emissivity regions. Validated with phantom simulations and experimental data from the RT-1 device, our method reveals detailed spatial structures of ion temperature and toroidal ion flow characteristic of magnetospheric plasma. This work significantly broadens the scope of CIS tomography, offering a robust tool for plasma diagnostics and facilitating integration with complementary measurement techniques.

[LG-170] Adaptive and Stratified Subsampling Techniques for High Dimensional Non-Standard Data Environments

Link: https://arxiv.org/abs/2410.12367
Authors: Prateek Mittal,Jai Dalmotra,Joohi Chauhan
Keywords: non-standard data environments, estimating high-dimensional parameters, Adaptive Importance Sampling, specifically Adaptive Importance, data environments
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:This paper addresses the challenge of estimating high-dimensional parameters in non-standard data environments, where traditional methods often falter due to issues such as heavy-tailed distributions, data contamination, and dependent observations. We propose robust subsampling techniques, specifically Adaptive Importance Sampling (AIS) and Stratified Subsampling, designed to enhance the reliability and efficiency of parameter estimation. Under some clearly outlined conditions, we establish consistency and asymptotic normality for the proposed estimators, providing non-asymptotic error bounds that quantify their performance. Our theoretical foundations are complemented by controlled experiments demonstrating the superiority of our methods over conventional approaches. By bridging the gap between theory and practice, this work offers significant contributions to robust statistical estimation, paving the way for advancements in various applied domains.

[LG-171] Global Censored Quantile Random Forest

Link: https://arxiv.org/abs/2410.12209
Authors: Siyu Zhou,Limin Peng
Keywords: censored quantile regression, Global Censored Quantile, Censored Quantile Random, Quantile Random Forest, censored quantile
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:In recent years, censored quantile regression has enjoyed an increasing popularity for survival analysis while many existing works rely on linearity assumptions. In this work, we propose a Global Censored Quantile Random Forest (GCQRF) for predicting a conditional quantile process on data subject to right censoring, a forest-based flexible, competitive method able to capture complex nonlinear relationships. Taking into account the randomness in trees and connecting the proposed method to a randomized incomplete infinite degree U-process (IDUP), we quantify the prediction process’ variation without assuming an infinite forest and establish its weak convergence. Moreover, feature importance ranking measures based on out-of-sample predictive accuracy are proposed. We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives and illustrate the use of the proposed importance ranking measures on both simulated and real data.

[LG-172] Deep Optimal Sensor Placement for Black Box Stochastic Simulations

Link: https://arxiv.org/abs/2410.12036
Authors: Paula Cordero-Encinar,Tobias Schröder,Peter Yatsyshin,Andrew Duncan
Keywords: Selecting cost-effective optimal, systems faces significant, Selecting cost-effective, significant computational barriers, black-box stochastic systems
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 23 pages

Abstract:Selecting cost-effective optimal sensor configurations for subsequent inference of parameters in black-box stochastic systems faces significant computational barriers. We propose a novel and robust approach, modelling the joint distribution over input parameters and solution with a joint energy-based model, trained on simulation data. Unlike existing simulation-based inference approaches, which must be tied to a specific set of point evaluations, we learn a functional representation of parameters and solution. This is used as a resolution-independent plug-and-play surrogate for the joint distribution, which can be conditioned over any set of points, permitting an efficient approach to sensor placement. We demonstrate the validity of our framework on a variety of stochastic problems, showing that our method provides highly informative sensor locations at a lower computational cost compared to conventional approaches.

[LG-173] Learning with Importance Weighted Variational Inference: Asymptotics for Gradient Estimators of the VR-IWAE Bound

Link: https://arxiv.org/abs/2410.12035
Authors: Kamélia Daudel,François Roueff
Keywords: Evidence Lower BOund, Evidence Lower, maximum likelihood optimization, involving importance weighting, importance weighting ideas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Several popular variational bounds involving importance weighting ideas have been proposed to generalize and improve on the Evidence Lower BOund (ELBO) in the context of maximum likelihood optimization, such as the Importance Weighted Auto-Encoder (IWAE) and the Variational Rényi (VR) bounds. The methodology to learn the parameters of interest using these bounds typically amounts to running gradient-based variational inference algorithms that incorporate the reparameterization trick. However, the way the choice of the variational bound impacts the outcome of variational inference algorithms can be unclear. Recently, the VR-IWAE bound was introduced as a variational bound that unifies the ELBO, IWAE and VR bounds methodologies. In this paper, we provide two analyses for the reparameterized and doubly-reparameterized gradient estimators of the VR-IWAE bound, which reveal the advantages and limitations of these gradient estimators while enabling us to compare the ELBO, IWAE and VR bounds methodologies. Our work advances the understanding of importance weighted variational inference methods and we illustrate our theoretical findings empirically.
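
To show how one bound can interpolate between the others, the sketch below estimates the VR-IWAE objective from a batch of N log importance weights. It assumes the form (1/(1−α))·log((1/N)·Σ w_j^(1−α)) with α in [0, 1), as recalled from the earlier VR-IWAE literature rather than stated in this abstract. Under that form, α = 0 recovers the IWAE estimate and N = 1 recovers the log-weight itself. The function names are hypothetical, and the gradient estimators analysed in the paper are not reproduced.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) that avoids overflow for large log-weights."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def vr_iwae_estimate(log_weights, alpha):
    """One Monte Carlo estimate of the (assumed) VR-IWAE bound, 0 <= alpha < 1."""
    n = len(log_weights)
    scaled = [(1.0 - alpha) * lw for lw in log_weights]
    return (logsumexp(scaled) - math.log(n)) / (1.0 - alpha)

log_w = [0.0, math.log(3.0)]             # illustrative log importance weights
iwae_like = vr_iwae_estimate(log_w, alpha=0.0)
tempered = vr_iwae_estimate(log_w, alpha=0.5)
```

Raising alpha flattens the weights before averaging, which pulls the estimate down toward the plain average of the log-weights (the ELBO-style quantity).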

[LG-174] Parametric model reduction of mean-field and stochastic systems via higher-order action matching

Link: https://arxiv.org/abs/2410.12000
Authors: Jules Berman,Tobias Blickhan,Benjamin Peherstorfer
Keywords: physics parameters, feature stochastic, stochastic and mean-field, mean-field effects, population dynamics
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:The aim of this work is to learn models of population dynamics of physical systems that feature stochastic and mean-field effects and that depend on physics parameters. The learned models can act as surrogates of classical numerical models to efficiently predict the system behavior over the physics parameters. Building on the Benamou-Brenier formula from optimal transport and action matching, we use a variational problem to infer parameter- and time-dependent gradient fields that represent approximations of the population dynamics. The inferred gradient fields can then be used to rapidly generate sample trajectories that mimic the dynamics of the physical system on a population level over varying physics parameters. We show that combining Monte Carlo sampling with higher-order quadrature rules is critical for accurately estimating the training objective from sample data and for stabilizing the training process. We demonstrate on Vlasov-Poisson instabilities as well as on high-dimensional particle and chaotic systems that our approach accurately predicts population dynamics over a wide range of parameters and outperforms state-of-the-art diffusion-based and flow-based modeling that simply condition on time and physics parameters.

[LG-175] Agnostic Process Tomography

Link: https://arxiv.org/abs/2410.11957
Authors: Chirag Wadhwa,Laura Lewis,Elham Kashefi,Mina Doosti
Keywords: agnostic process tomography, agnostic state tomography, quantum, concept class, agnostic
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 11+52 pages, 2 figures, 1 table

Abstract:Characterizing a quantum system by learning its state or evolution is a fundamental problem in quantum physics and learning theory with a myriad of applications. Recently, as a new approach to this problem, the task of agnostic state tomography was defined, in which one aims to approximate an arbitrary quantum state by a simpler one in a given class. Generalizing this notion to quantum processes, we initiate the study of agnostic process tomography: given query access to an unknown quantum channel $\Phi$ and a known concept class $\mathcal{C}$ of channels, output a quantum channel that approximates $\Phi$ as well as any channel in the concept class $\mathcal{C}$, up to some error. In this work, we propose several natural applications for this new task in quantum machine learning, quantum metrology, classical simulation, and error mitigation. In addition, we give efficient agnostic process tomography algorithms for a wide variety of concept classes, including Pauli strings, Pauli channels, quantum junta channels, low-degree channels, and a class of channels produced by $\mathsf{QAC}^0$ circuits. The main technical tool we use is Pauli spectrum analysis of operators and superoperators. We also prove that, using ancilla qubits, any agnostic state tomography algorithm can be extended to one solving agnostic process tomography for a compatible concept class of unitaries, immediately giving us efficient agnostic learning algorithms for Clifford circuits, Clifford circuits with few T gates, and circuits consisting of a tensor product of single-qubit gates. Together, our results provide insight into the conditions and new algorithms necessary to extend the learnability of a concept class from the standard tomographic setting to the agnostic one.

[LG-176] Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

Link: https://arxiv.org/abs/2410.11933
Authors: Junjie Xu,Artem Moskalev,Tommaso Mansi,Mangal Prakash,Rui Liao
Keywords: developing RNA-based therapeutics, Accurate prediction, RNA, stability and interactions, RNA-based therapeutics
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Accurate prediction of RNA properties, such as stability and interactions, is crucial for advancing our understanding of biological processes and developing RNA-based therapeutics. RNA structures can be represented as 1D sequences, 2D topological graphs, or 3D all-atom models, each offering different insights into RNA function. Existing works predominantly focus on 1D sequence-based models, which overlook the geometric context provided by 2D and 3D geometries. This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across all RNA tasks, and excel in low-data and partial labeling regimes, underscoring the value of explicitly incorporating geometric context. On the other hand, geometry-unaware sequence-based models are more robust under sequencing noise but often require around 2-5x training data to match the performance of geometry-aware models. Our study offers further insights into the trade-offs between different RNA representations in practical applications and addresses a significant gap in evaluating deep learning models for RNA tasks.

[LG-177] Deep vectorised operators for pulsatile hemodynamics estimation in coronary arteries from a steady-state prior

Link: https://arxiv.org/abs/2410.11920
Authors: Julian Suk, Guido Nannini, Patryk Rygiel, Christoph Brune, Gianluca Pontone, Alberto Redaelli, Jelmer M. Wolterink
Keywords: provide valuable medical, valuable medical decision, medical decision markers, coronary artery disease, fields provide valuable
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Notes: Preprint. Under Review

Abstract:Cardiovascular hemodynamic fields provide valuable medical decision markers for coronary artery disease. Computational fluid dynamics (CFD) is the gold standard for accurate, non-invasive evaluation of these quantities in vivo. In this work, we propose a time-efficient surrogate model, powered by machine learning, for the estimation of pulsatile hemodynamics based on steady-state priors. We introduce deep vectorised operators, a modelling framework for discretisation independent learning on infinite-dimensional function spaces. The underlying neural architecture is a neural field conditioned on hemodynamic boundary conditions. Importantly, we show how relaxing the requirement of point-wise action to permutation-equivariance leads to a family of models that can be parametrised by message passing and self-attention layers. We evaluate our approach on a dataset of 74 stenotic coronary arteries extracted from coronary computed tomography angiography (CCTA) with patient-specific pulsatile CFD simulations as ground truth. We show that our model produces accurate estimates of the pulsatile velocity and pressure while being agnostic to re-sampling of the source domain (discretisation independence). This shows that deep vectorised operators are a powerful modelling tool for cardiovascular hemodynamics estimation in coronary arteries and beyond.

[LG-178] Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction

Link: https://arxiv.org/abs/2410.11914
Authors: Yasir Ghunaim, Robert Hoehndorf
Keywords: Pre-training machine learning, machine learning models, materials science, machine learning, properties has proven
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Notes: Accepted as a short paper at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy 2024)

Abstract:Pre-training machine learning models on molecular properties has proven effective for generating robust and generalizable representations, which is critical for advancements in drug discovery and materials science. While recent work has primarily focused on data-driven approaches, the KANO model introduces a novel paradigm by incorporating knowledge-enhanced pre-training. In this work, we expand upon KANO by integrating the large-scale ChEBI knowledge graph, which includes 2,840 functional groups – significantly more than the original 82 used in KANO. We explore two approaches, Replace and Integrate, to incorporate this extensive knowledge into the KANO framework. Our results demonstrate that including ChEBI leads to improved performance on 9 out of 14 molecular property prediction datasets. This highlights the importance of utilizing a larger and more diverse set of functional groups to enhance molecular representations for property predictions. Code: this http URL

[LG-179] Explainable AI Methods for Multi-Omics Analysis: A Survey

Link: https://arxiv.org/abs/2410.11910
Authors: Ahmad Hussein, Mukesh Prasad, Ali Braytee
Keywords: traditional hypothesis-driven methodologies, Advancements in high-throughput, data-driven approaches, high-throughput technologies, technologies have led
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Advancements in high-throughput technologies have led to a shift from traditional hypothesis-driven methodologies to data-driven approaches. Multi-omics refers to the integrative analysis of data derived from multiple ‘omes’, such as genomics, proteomics, transcriptomics, metabolomics, and microbiomics. This approach enables a comprehensive understanding of biological systems by capturing different layers of biological information. Deep learning methods are increasingly utilized to integrate multi-omics data, offering insights into molecular interactions and enhancing research into complex diseases. However, these models, with their numerous interconnected layers and nonlinear relationships, often function as black boxes, lacking transparency in decision-making processes. To overcome this challenge, explainable artificial intelligence (xAI) methods are crucial for creating transparent models that allow clinicians to interpret and work with complex data more effectively. This review explores how xAI can improve the interpretability of deep learning models in multi-omics research, highlighting its potential to provide clinicians with clear insights, thereby facilitating the effective application of such models in clinical settings.

[LG-180] Are Grid Cells Hexagonal for Performance or by Convenience?

Link: https://arxiv.org/abs/2410.11886
Authors: Taahaa Mir, Peipei Yao, Kateri Duranceau, Isabeau Prémont-Schwarz
Keywords: biologically convenient configuration, grid cells, hexagonal grid cells, square grid cells, grid
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes: 5 pages, accepted at Montreal AI and Neuroscience Conference 2024

Abstract: This paper investigates whether the hexagonal structure of grid cells provides any performance benefits or if it merely represents a biologically convenient configuration. Utilizing the Vector-HaSH content addressable memory model as a model of the grid cell – place cell network of the mammalian brain, we compare the performance of square and hexagonal grid cells in tasks of storing and retrieving spatial memories. Our experiments across different path types, path lengths, and grid configurations reveal that hexagonal grid cells perform similarly to square grid cells with respect to spatial representation and memory recall. Our results show comparable accuracy and robustness across different datasets and noise levels on images to recall. These findings suggest that the brain’s use of hexagonal grids may be more a matter of biological convenience and ease of implementation than of superior performance over square grid cells (which are easier to implement in silico).

[LG-181] MoH: Multi-Head Attention as Mixture-of-Head Attention

Link: https://arxiv.org/abs/2410.11842
Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
Keywords: multi-head attention, attention, previous accuracy level, attention heads, Transformer model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 23 pages, code: this https URL

Abstract:In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
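
The weighted-summation idea described in the abstract above can be illustrated for a single token with a minimal sketch. Everything named here is an assumption made for illustration: the function `moh_combine`, the top-k selection rule, and the softmax normalisation over the selected heads are not taken from the paper, and the abstract does not say how routing scores are produced. The head outputs are assumed to be already projected to the model dimension, so that standard multi-head attention would simply sum them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moh_combine(head_outputs, router_logits, top_k):
    """Combine per-head outputs for one token with routed weights.

    head_outputs:  list of h vectors (lists of floats), one per attention
                   head, already projected to the model dimension
                   (hypothetical inputs).
    router_logits: list of h routing scores for this token (hypothetical).
    top_k:         number of heads kept active for this token.
    Returns (combined_vector, active_head_indices).
    """
    h = len(head_outputs)
    # Keep the top_k heads with the largest routing scores.
    active = sorted(range(h), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Normalise routing scores over the active heads only (an assumption).
    weights = softmax([router_logits[i] for i in active])
    dim = len(head_outputs[0])
    out = [0.0] * dim
    # Weighted sum replaces the plain sum of standard multi-head attention.
    for w, i in zip(weights, active):
        for d in range(dim):
            out[d] += w * head_outputs[i][d]
    return out, active
```

Heads outside the active set contribute nothing for this token, which is where the efficiency gain mentioned in the abstract would come from in this simplified picture.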

Information Retrieval

[IR-0] RosePO: Aligning LLM-based Recommenders with Human Values

Link: https://arxiv.org/abs/2410.12519
Authors: Jiayi Liao, Xiangnan He, Ruobing Xie, Jiancan Wu, Yancheng Yuan, Xingwu Sun, Zhanhui Kang, Xiang Wang
Keywords: leveraging Large Language, Large Language Models, Large Language, leveraging Large, Language Models
Categories: Information Retrieval (cs.IR)

Abstract:Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for recommendation systems, which usually adapt a pre-trained LLM to the recommendation scenario through supervised fine-tuning (SFT). However, both the pre-training and SFT stages fail to explicitly model the comparative relationships of a user’s preferences on different items. To construct a “helpful and harmless” LLM-based recommender, we propose a general framework – Recommendation with smoothing personalized Preference Optimization (RosePO), which better aligns with customized human values during the post-training stage. Specifically, in addition to the input and chosen response that naturally align with SFT data, we design a rejected sampling strategy tailored for enhancing helpfulness, along with two strategies aimed at mitigating biases to promote harmlessness. To ensure robustness against uncertain labels present in automatically constructed preference data, we introduce a personalized smoothing factor predicted by a preference oracle into the optimization objective. Evaluation on three real-world datasets demonstrates the effectiveness of our method, showcasing not only improved recommendation performance but also mitigation of semantic hallucination and popularity bias.
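
To make the idea of a smoothing factor in a preference objective concrete, here is a minimal sketch of a generic label-smoothed pairwise preference loss in the style of DPO. This is not claimed to be the RosePO objective: the abstract does not give the formula, and the function name, the parameters `beta` and `eps`, and the use of a reference model are all assumptions. In this sketch `eps` stands in for the per-sample smoothing factor that the abstract says is predicted by a preference oracle.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta, eps):
    """Label-smoothed pairwise preference loss for one example.

    logp_*:     log-probabilities of the chosen / rejected response under
                the model being trained (hypothetical inputs).
    ref_logp_*: the same quantities under a frozen reference model.
    beta:       scale of the implicit reward margin.
    eps:        smoothing factor in [0, 1]; the assumed probability that
                the preference label is flipped.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # With eps = 0 this reduces to the plain pairwise logistic loss.
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)
```

A larger `eps` weakens the push toward the chosen response, which is one simple way to limit the damage from an uncertain label in automatically constructed preference data.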

[IR-1] Unifying Economic and Language Models for Enhanced Sentiment Analysis of the Oil Market

Link: https://arxiv.org/abs/2410.12473
Authors: Himmet Kaplan, Ralf-Peter Mundani, Heiko Rölke, Albert Weichselbraun, Martin Tschudy
Keywords: political events, Generative Pre-trained Transformer, global economy, critical component, Crude oil
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Crude oil, a critical component of the global economy, has its prices influenced by various factors such as economic trends, political events, and natural disasters. Traditional prediction methods based on historical data have their limits in forecasting, but recent advancements in natural language processing bring new possibilities for event-based analysis. In particular, Language Models (LM) and their advancement, the Generative Pre-trained Transformer (GPT), have shown potential in classifying vast amounts of natural language. However, these LMs often have difficulty with domain-specific terminology, limiting their effectiveness in the crude oil sector. Addressing this gap, we introduce CrudeBERT, a fine-tuned LM specifically for the crude oil market. The results indicate that CrudeBERT’s sentiment scores align more closely with the WTI Futures curve and significantly enhance price predictions, underscoring the crucial role of integrating economic principles into LMs.

[IR-2] Mitigating Dual Latent Confounding Biases in Recommender Systems

Link: https://arxiv.org/abs/2410.12451
Authors: Jianfeng Deng, Qingfeng Chen, Debo Cheng, Jiuyong Li, Lin Liu, Xiaojing Du
Keywords: predict user preferences, enhanced user engagement, latent confounders, Recommender systems, Traditional recommender systems
Categories: Information Retrieval (cs.IR)

Abstract:Recommender systems are extensively utilised across various areas to predict user preferences for personalised experiences and enhanced user engagement and satisfaction. Traditional recommender systems, however, are complicated by confounding bias, particularly in the presence of latent confounders that affect both item exposure and user feedback. Existing debiasing methods often fail to capture the complex interactions caused by latent confounders in interaction data, especially when dual latent confounders affect both the user and item sides. To address this, we propose a novel debiasing method that jointly integrates the Instrumental Variables (IV) approach and identifiable Variational Auto-Encoder (iVAE) for Debiased representation learning in Recommendation systems, referred to as IViDR. Specifically, IViDR leverages the embeddings of user features as IVs to address confounding bias caused by latent confounders between items and user feedback, and reconstructs the embedding of items to obtain debiased interaction data. Moreover, IViDR employs an Identifiable Variational Auto-Encoder (iVAE) to infer identifiable representations of latent confounders between item exposure and user feedback from both the original and debiased interaction data. Additionally, we provide theoretical analyses of the soundness of using IV and the identifiability of the latent representations. Extensive experiments on both synthetic and real-world datasets demonstrate that IViDR outperforms state-of-the-art models in reducing bias and providing reliable recommendations.

[IR-3] QUIDS: Query Intent Generation via Dual Space Modeling

Link: https://arxiv.org/abs/2410.12400
Authors: Yumeng Wang, Xiuying Chen, Suzan Verberne
Keywords: query intent, underlying search intent, intent, Query, query intent generation
Categories: Information Retrieval (cs.IR)

Abstract:Query understanding is a crucial component of Information Retrieval (IR), aimed at identifying the underlying search intent of textual queries. However, most existing approaches oversimplify this task into query classification or clustering, which fails to fully capture the nuanced intent behind the query. In this paper, we address the task of query intent generation: to automatically generate detailed and precise intent descriptions for search queries using relevant and irrelevant documents given a query. These intent descriptions can help users understand why the search engine considered the top-ranked documents relevant, and provide more transparency to the retrieval process. We propose a dual-space model that uses semantic relevance and irrelevance information in the returned documents to explain the understanding of the query intent. Specifically, in the encoding process, we project, separate, and distinguish relevant and irrelevant documents in the representation space. Then, we introduce a semantic decoupling model in the novel disentangling space, where the semantics of irrelevant information are removed from the relevant space, ensuring that only the essential and relevant intent is captured. This process refines the understanding of the query and provides more accurate explanations for the search results. Experiments on benchmark data demonstrate that our methods produce high-quality query intent descriptions, outperforming existing methods for this task, as well as state-of-the-art query-based summarization methods. A token-level visualization of attention scores reveals that our model effectively reduces the focus on irrelevant intent topics. Our findings open up promising research and application directions for query intent generation, particularly in exploratory search.

[IR-4] Multi-Cause Deconfounding for Recommender Systems with Latent Confounders

Link: https://arxiv.org/abs/2410.12366
Authors: Zhirong Huang, Shichao Zhang, Debo Cheng, Jiuyong Li, Lin Liu, Guixian Zhang
Keywords: latent confounders, item public attractiveness, affect user behavior, user social environment, latent confounding factors
Categories: Information Retrieval (cs.IR)

Abstract:In recommender systems, various latent confounding factors (e.g., user social environment and item public attractiveness) can affect user behavior, item exposure, and feedback in distinct ways. These factors may directly or indirectly impact user feedback and are often shared across items or users, making them multi-cause latent confounders. However, existing methods typically fail to account for latent confounders between users and their feedback, as well as those between items and user feedback simultaneously. To address the problem of multi-cause latent confounders, we propose a multi-cause deconfounding method for recommender systems with latent confounders (MCDCF). MCDCF leverages multi-cause causal effect estimation to learn substitutes for latent confounders associated with both users and items, using user behaviour data. Specifically, MCDCF treats the multiple items that users interact with and the multiple users that interact with items as treatment variables, enabling it to learn substitutes for the latent confounders that influence the estimation of causality between users and their feedback, as well as between items and user feedback. Additionally, we theoretically demonstrate the soundness of our MCDCF method. Extensive experiments on three real-world datasets demonstrate that our MCDCF method effectively recovers latent confounders related to users and items, reducing bias and thereby improving recommendation accuracy.

[IR-5] Comprehending Knowledge Graphs with Large Language Models for Recommender Systems

Link: https://arxiv.org/abs/2410.12229
Authors: Ziqiang Cui, Yunpeng Weng, Xing Tang, Fuyuan Lyu, Dugang Liu, Xiuqiang He, Chen Ma
Keywords: significantly advanced recommender, advanced recommender systems, significantly advanced, advanced recommender, recommender systems
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recently, the introduction of knowledge graphs (KGs) has significantly advanced recommender systems by facilitating the discovery of potential associations between items. However, existing methods still face several limitations. First, most KGs suffer from missing facts or limited scopes. This can lead to biased knowledge representations, thereby constraining the model’s performance. Second, existing methods typically convert textual information into IDs, resulting in the loss of natural semantic connections between different items. Third, existing methods struggle to capture high-order relationships in global KGs due to their inefficient layer-by-layer information propagation mechanisms, which are prone to introducing significant noise. To address these limitations, we propose a novel method called CoLaKG, which leverages large language models (LLMs) for knowledge-aware recommendation. The extensive world knowledge and remarkable reasoning capabilities of LLMs enable them to supplement KGs. Additionally, the strong text comprehension abilities of LLMs allow for a better understanding of semantic information. Based on this, we first extract subgraphs centered on each item from the KG and convert them into textual inputs for the LLM. The LLM then outputs its comprehension of these item-centered subgraphs, which are subsequently transformed into semantic embeddings. Furthermore, to utilize the global information of the KG, we construct an item-item graph using these semantic embeddings, which can directly capture higher-order associations between items. Both the semantic embeddings and the structural information from the item-item graph are effectively integrated into the recommendation model through our designed representation alignment and neighbor augmentation modules. Extensive experiments on four real-world datasets demonstrate the superiority of our method.

[IR-6] Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Link: https://arxiv.org/abs/2410.12228
Authors: Luyi Ma, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan
Keywords: Integrating diverse data, Integrating diverse, personalized recommendation systems, diverse data modalities, crucial for enhancing
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides detailed insights into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user’s interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and the benefits of TMF.

[IR-7] The Moral Case for Using Language Model Agents for Recommendation

Link: https://arxiv.org/abs/2410.12123
Authors: Seth Lazar, Luke Thorburn, Tian Jin, Luca Belli
Keywords: networked global communication, communication environment, global communication, environment has fallen, fallen short
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)

Abstract:Our information and communication environment has fallen short of the ideals that networked global communication might have served. Identifying all the causes of its pathologies is difficult, but existing recommender systems very likely play a contributing role. In this paper, which draws on the normative tools of philosophy of computing, informed by empirical and technical insights from natural language processing and recommender systems, we make the moral case for an alternative approach. We argue that existing recommenders incentivise mass surveillance, concentrate power, fall prey to narrow behaviourism, and compromise user agency. Rather than just trying to avoid algorithms entirely, or to make incremental improvements to the current paradigm, researchers and engineers should explore an alternative paradigm: the use of language model (LM) agents to source and curate content that matches users’ preferences and values, expressed in natural language. The use of LM agents for recommendation poses its own challenges, including those related to candidate generation, computational efficiency, preference modelling, and prompt injection. Nonetheless, if implemented successfully LM agents could: guide us through the digital public sphere without relying on mass surveillance; shift power away from platforms towards users; optimise for what matters instead of just for behavioural proxies; and scaffold our agency instead of undermining it.

[IR-8] Online Digital Investigative Journalism using SociaLens

Link: https://arxiv.org/abs/2410.11890
Authors: Hasan M. Jamil, Sajratul Y. Rubaiat
Keywords: Media companies witnessed, Media companies, machine learning, companies witnessed, witnessed a significant
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract: Media companies have witnessed a significant transformation with the rise of the internet, big data, machine learning (ML), and AI. The recent emergence of large language models (LLMs) has added another aspect to this transformation. Researchers believe that with the help of these technologies, investigative digital journalism will enter a new era. Using a smart set of data gathering and analysis tools, journalists will be able to create data-driven content and insights in unprecedented ways. In this paper, we introduce a versatile and autonomous investigative journalism tool, called SociaLens, for identifying and extracting query-specific data from online sources, responding to probing queries, and drawing conclusions entailed by large volumes of data using ML analytics, fully autonomously. We envision its use in investigative journalism, law enforcement, and social policy planning. The proposed system capitalizes on the integration of ML technology with LLMs and advanced big data search techniques. We illustrate the functionality of SociaLens using a focused case study on rape incidents in a developing country and demonstrate that journalists can gain nuanced insights without requiring coding expertise they might lack. SociaLens is designed as a chatbot capable of contextual conversation that can find and collect data relevant to queries, initiate ML tasks to respond to queries, and generate textual and visual reports, all fully autonomously within the chatbot environment.

[IR-9] Post-Userist Recommender Systems: A Manifesto

Link: https://arxiv.org/abs/2410.11870
Authors: Robin Burke, Morgan Sylvester
Keywords: systems framed solely, recommender systems framed, define userist recommendation, framed solely, solely in terms
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Notes: Extended abstract for paper presented at AltRecSys Workshop 2024. Held at the 18th ACM Conference on Recommender Systems, Bari, Italy. October 18, 2024

Abstract:We define userist recommendation as an approach to recommender systems framed solely in terms of the relation between the user and system. Post-userist recommendation posits a larger field of relations in which stakeholders are embedded and distinguishes the recommendation function (which can potentially connect creators with audiences) from generative media. We argue that in the era of generative media, userist recommendation becomes indistinguishable from personalized media generation, and therefore post-userist recommendation is the only path forward for recommender systems research.

[IR-10] GeoLife+: Large-Scale Simulated Trajectory Datasets Calibrated to the GeoLife Dataset

Link: https://arxiv.org/abs/2410.11853
Authors: Hossein Amiri, Richard Yang, Andreas Zufle
Keywords: Analyzing individual human, data, Analyzing individual, academic applications, finds many commercial
Categories: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Notes: Accepted paper at this https URL

Abstract: Analyzing individual human trajectory data helps our understanding of human mobility and finds many commercial and academic applications. There are two main approaches to accessing trajectory data for research: one involves using real-world datasets like GeoLife, while the other employs simulations to synthesize data. Real-world data provides insights from real human activities, but such data is generally sparse due to voluntary participation. Conversely, simulated data can be more comprehensive but may capture unrealistic human behavior. In this Data and Resource paper, we combine the benefits of both by leveraging the statistical features of real-world data and the comprehensiveness of simulated data. Specifically, we extract features from the real-world GeoLife dataset such as the average number of individual daily trips, average radius of gyration, and maximum and minimum trip distances. We calibrate the Pattern of Life Simulation, a realistic simulation of human mobility, to reproduce these features. To do so, we use a genetic algorithm to calibrate the parameters of the simulation to mimic the GeoLife features. For this calibration, we simulated numerous random simulation settings, measured the similarity of generated trajectories to GeoLife, and iteratively (over many generations) combined parameter settings of trajectory datasets most similar to GeoLife. Using the calibrated simulation, we simulate large trajectory datasets that we call GeoLife+, where + denotes the Kleene Plus, indicating unlimited replication with at least one occurrence. We provide simulated GeoLife+ data with 182, 1k, and 5k users over 5 years; 10k and 50k users over one year; and 100k users over 6 months of simulation lifetime.
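
One of the calibration features named in the abstract above, the radius of gyration, can be sketched as follows. This uses the common definition (root-mean-square distance of a user's visited points from their centroid); the abstract does not state the exact formula used in the paper, and the function name and the planar-coordinate input are assumptions. Real GeoLife latitude/longitude points would first need projecting to a planar coordinate system or a great-circle distance.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of points from their centroid.

    points: non-empty list of (x, y) tuples in planar coordinates with
            consistent units, e.g. metres (hypothetical input format).
    """
    if not points:
        raise ValueError("points must be non-empty")
    n = len(points)
    # Centroid of all visited locations.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Mean squared distance from the centroid.
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)
```

A small value means the user stays close to one area; averaging this value over users gives a single number that a simulation can be tuned to match.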
