This post presents the latest paper list retrieved from arxiv.org on 2024-10-28. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-10-28)
382 papers are updated today, including:
- Natural Language Processing: 56 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 108 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 64 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 148 papers (Machine Learning, cs.LG)
Natural Language Processing
[NLP-0] Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
[Quick Read]: This paper addresses the performance degradation of Large Vision-Language Models (LVLMs) on long-context reasoning tasks, which stems from over-reliance on textual information at the expense of visual dependency. The key to the solution is a training-free context pruning method that selectively removes less critical textual information, strengthening visual dependency and reducing textual noise, thereby improving LVLM performance in long-context reasoning. The method is validated on a purpose-built long-context dataset, with further analysis of the robustness of different pruning strategies and of the scaling behavior between pruning rate and context length.
Link: https://arxiv.org/abs/2410.19732
Authors: Yucheng Zhou, Zhi Rao, Jun Wan, Jianbing Shen
Keywords (EN): Large Vision-Language Models, Large Vision-Language, Vision-Language Models, experience performance declines, reduced visual dependency
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning due to overreliance on textual information and reduced visual dependency. In this study, we empirically analyze LVLMs in long-context reasoning, revealing that increased context length leads to a higher dependence on language at the expense of visual dependency. To address this issue, we propose a novel training-free context pruning method that selectively removes less critical textual information. Our approach enhances visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate our method by constructing a long-context dataset, demonstrating its effectiveness across various LVLMs. Moreover, further analysis confirms the robustness of different token pruning strategies and preliminarily explores scaling laws between pruning rates and context length.
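To make the idea concrete, here is a minimal sketch (not the authors' code) of training-free context pruning, under the illustrative assumption that token importance is scored by cosine similarity to a pooled visual embedding:

```python
import numpy as np

def prune_context(text_emb: np.ndarray, image_emb: np.ndarray, keep_ratio: float = 0.7):
    """Training-free pruning sketch: keep the text tokens most aligned
    with the image embedding, drop the rest as textual noise.

    text_emb:  (n_tokens, d) embeddings of the textual context
    image_emb: (d,) pooled visual embedding
    """
    # Cosine similarity of each text token to the visual representation.
    sims = text_emb @ image_emb
    sims /= (np.linalg.norm(text_emb, axis=1) * np.linalg.norm(image_emb) + 1e-8)
    k = max(1, int(keep_ratio * len(sims)))
    keep = np.sort(np.argsort(-sims)[:k])  # top-k tokens, original order preserved
    return keep

rng = np.random.default_rng(0)
kept = prune_context(rng.normal(size=(12, 8)), rng.normal(size=8), keep_ratio=0.5)
print(kept)  # indices of the retained context tokens
```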
[NLP-1] Counting Ability of Large Language Models and Impact of Tokenization
[Quick Read]: This paper examines the inherent limitations of the Transformer architecture on tasks that demand deep reasoning, such as counting. The key contribution is an analysis of how tokenization affects the counting ability of large language models (LLMs). Unlike specially trained expert models, LLMs typically use byte-pair encoding (BPE) tokenizers, which fundamentally changes how reasoning is processed and leads to substantial performance variation on counting tasks. Through theoretical and experimental analysis, the paper shows how tokenization choices affect models' theoretical computability and motivates the design of new tokenization methods to strengthen LLM reasoning.
Link: https://arxiv.org/abs/2410.19730
Authors: Xiang Zhang, Juntai Cao, Chenyu You
Keywords (EN): modern large language, face inherent architectural, large language models, face inherent, backbone of modern
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: Transformers, the backbone of modern large language models (LLMs), face inherent architectural limitations that impede their reasoning capabilities. Unlike recurrent networks, Transformers lack recurrent connections, confining them to constant-depth computation. This restriction places them in the complexity class TC^0, making them theoretically incapable of solving tasks that demand increasingly deep reasoning as input length grows. Counting, a fundamental component of many reasoning tasks, also requires reasoning depth to grow linearly to be performed inductively. While previous studies have established the upper limits of counting ability in Transformer-based expert models (i.e., models specifically trained for counting tasks), these findings do not directly extend to general-purpose LLMs due to differences in reasoning mechanisms. Recent work has highlighted how Chain of Thought (CoT) reasoning can help alleviate some of the architectural limitations of Transformers in counting tasks. However, little attention has been paid to the role of tokenization in these models. Unlike expert models that often use character-level tokenization, LLMs typically rely on byte-level (BPE) tokenizers, which fundamentally alters the way reasoning is processed. Our work investigates the impact of tokenization on the counting abilities of LLMs, uncovering substantial performance variations based on input tokenization differences. We provide both theoretical and experimental analyses, offering insights into how tokenization choices can undermine models' theoretical computability, thereby inspiring the design of new tokenization methods to enhance reasoning in LLMs.
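A quick illustration of why tokenization matters for counting (our example, not the paper's; it assumes the `tiktoken` package is installed):

```python
# BPE merges characters, so the model never directly "sees" the
# individual letters it is asked to count.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a byte-level BPE encoding
word = "strawberry"
tokens = [enc.decode([t]) for t in enc.encode(word)]
print(tokens)          # a handful of multi-character tokens, not 10 letters
print(list(word))      # character-level view: the counting-friendly granularity
print(word.count("r")) # 3, trivial at character level
```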
[NLP-2] FISHNET: Financial Intelligence from Sub-querying Harmonizing Neural-Conditioning Expert Swarms and Task Planning
[Quick Read]: This paper targets the limitations of traditional approaches to generating financial intelligence from vast data sources (such as knowledge-graph construction or database engineering), in particular high inference costs, hallucinations, and the complexity of analyzing high-dimensional financial data concurrently. The key to the solution is FISHNET (Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert swarming, and Task planning), an agentic architecture whose modular components (sub-querying, harmonizing, neural conditioning, expert swarming, and task planning) efficiently handle more than 98,000 regulatory filings that vary immensely in semantics, data hierarchy, and format, raising the success rate of financial insight generation to 61.8% and demonstrating the scalability, flexibility, and data-integrity benefits of a modular architecture.
Link: https://arxiv.org/abs/2410.19727
Authors: Nicole Cho, Nishan Srishankar, Lucas Cecchi, William Watson
Keywords (EN): Large Language Models, vast data sources, domain-specific Large Language, database engineering, sources has typically
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted at the 5th ACM International Conference on AI in Finance (ICAIF '24)
Abstract:Financial intelligence generation from vast data sources has typically relied on traditional methods of knowledge-graph construction or database engineering. Recently, fine-tuned financial domain-specific Large Language Models (LLMs), have emerged. While these advancements are promising, limitations such as high inference costs, hallucinations, and the complexity of concurrently analyzing high-dimensional financial data, emerge. This motivates our invention FISHNET (Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert swarming, and Task planning), an agentic architecture that accomplishes highly complex analytical tasks for more than 98,000 regulatory filings that vary immensely in terms of semantics, data hierarchy, or format. FISHNET shows remarkable performance for financial insight generation (61.8% success rate over 5.0% Routing, 45.6% RAG R-Precision). We conduct rigorous ablations to empirically prove the success of FISHNET, each agent’s importance, and the optimized performance of assembling all agents. Our modular architecture can be leveraged for a myriad of use-cases, enabling scalability, flexibility, and data integrity that are critical for financial tasks.
[NLP-3] 2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
[Quick Read]: This paper addresses the fact that existing Direct Preference Optimization (DPO) methods for aligning Large Language Models (LLMs) overlook the multi-dimensional nature of human preferences. The key idea is to extend preference optimization along two dimensions: segments and aspects. The authors first introduce HelpSteer-2D, a 2D supervision dataset in which every sentence of a response is scored, while the aspect dimension covers several carefully designed criteria for response quality. Building on these 2D signals, they propose a 2D-DPO framework that decomposes the overall optimization objective into multi-segment and multi-aspect sub-objectives. Experiments on popular benchmarks show that 2D-DPO outperforms methods that optimize only scalar or one-dimensional preferences.
Link: https://arxiv.org/abs/2410.19720
Authors: Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Jiaheng Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng
Keywords (EN): Large Language Models, Direct Preference Optimization, Language Models, Large Language, Recent advancements
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The first four authors contributed equally, 25 pages
Abstract:Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to its simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign scores to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework, decomposing the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
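As a rough, unofficial reading of the idea, a segment-weighted DPO-style loss might look as follows in PyTorch; the weighting scheme, tensor shapes, and aggregation over aspects are our illustrative assumptions, not the paper's published objective:

```python
import torch
import torch.nn.functional as F

def dpo_loss_2d(chosen_seg_logps, rejected_seg_logps, seg_weights, beta=0.1):
    """Schematic 2D-DPO-style loss.

    chosen_seg_logps / rejected_seg_logps: (batch, n_segments) policy-minus-reference
        log-probabilities of each response segment (sentence).
    seg_weights: (batch, n_segments) per-segment scores aggregated over aspects,
        e.g. normalized HelpSteer-2D-style annotations (illustrative assumption).
    """
    # Weighted per-segment margin instead of a single sequence-level scalar.
    margin = (seg_weights * (chosen_seg_logps - rejected_seg_logps)).sum(dim=-1)
    return -F.logsigmoid(beta * margin).mean()

b, s = 4, 6  # batch of 4 preference pairs, 6 segments each
chosen = torch.randn(b, s, requires_grad=True)
loss = dpo_loss_2d(chosen, torch.randn(b, s), torch.rand(b, s).softmax(dim=-1))
loss.backward()  # gradients flow to the per-segment log-probabilities
print(float(loss))
```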
[NLP-4] IPPON: Common Sense Guided Informative Path Planning for Object Goal Navigation
[Quick Read]: This paper tackles efficient navigation to a target object in an unexplored environment, a critical skill for general-purpose intelligent robots. The key to the solution is a novel informative path planning and 3D object probability mapping approach. The mapping module computes the probability of the object of interest via semantic segmentation and a Bayes filter, and also stores probabilities for common objects so that exploration can be semantically guided by common-sense priors from a large language model. The planner terminates once the current viewpoint captures enough voxels identified with high confidence as the target object. Although the approach is zero-shot, it achieves state-of-the-art Success weighted by Path Length (SPL) and Soft SPL in the Habitat ObjectNav Challenge 2023, outperforming other methods by more than 20%, and its effectiveness is validated on real robots.
Link: https://arxiv.org/abs/2410.19697
Authors: Kaixian Qu, Jie Tan, Tingnan Zhang, Fei Xia, Cesar Cadena, Marco Hutter
Keywords (EN): Navigating efficiently, general-purpose intelligent robots, unexplored environment, critical skill, skill for general-purpose
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Navigating efficiently to an object in an unexplored environment is a critical skill for general-purpose intelligent robots. Recent approaches to this object goal navigation problem have embraced a modular strategy, integrating classical exploration algorithms-notably frontier exploration-with a learned semantic mapping/exploration module. This paper introduces a novel informative path planning and 3D object probability mapping approach. The mapping module computes the probability of the object of interest through semantic segmentation and a Bayes filter. Additionally, it stores probabilities for common objects, which semantically guides the exploration based on common sense priors from a large language model. The planner terminates when the current viewpoint captures enough voxels identified with high confidence as the object of interest. Although our planner follows a zero-shot approach, it achieves state-of-the-art performance as measured by the Success weighted by Path Length (SPL) and Soft SPL in the Habitat ObjectNav Challenge 2023, outperforming other works by more than 20%. Furthermore, we validate its effectiveness on real robots. Project webpage: this https URL
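The per-voxel Bayes filter can be sketched as a standard binary Bayesian update; the sensor-model probabilities below are made-up values for illustration:

```python
import numpy as np

def bayes_update(prior: np.ndarray, detected: np.ndarray,
                 p_det_given_obj=0.9, p_det_given_bg=0.05) -> np.ndarray:
    """Per-voxel Bayes filter update for P(voxel belongs to the target object).

    prior:    (n_voxels,) current probabilities
    detected: (n_voxels,) bool, whether segmentation flagged the voxel this frame
    """
    # Likelihood of the observation under "object" vs. "background".
    like_obj = np.where(detected, p_det_given_obj, 1 - p_det_given_obj)
    like_bg = np.where(detected, p_det_given_bg, 1 - p_det_given_bg)
    return like_obj * prior / (like_obj * prior + like_bg * (1 - prior))

p = np.full(5, 0.1)                          # weak prior on every voxel
for _ in range(3):                           # three consistent detections
    p = bayes_update(p, np.array([True, True, False, False, True]))
print(p.round(3))                            # detected voxels approach 1.0
```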
[NLP-5] Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs
[Quick Read]: This paper addresses the high computational complexity and resource demands of fine-tuning large language models (LLMs), and in particular the gap between the practical performance of Low-Rank Adaptation (LoRA) and its theoretical optimum. The key contribution is eXtreme Gradient Boosting LoRA (XGBLoRA), a framework that leverages ensemble learning: inspired by gradient boosting, it iteratively learns and merges a sequence of LoRA adaptations to refine model predictions. XGBLoRA outperforms standard LoRA while retaining the computational efficiency of rank-1 adaptations. Theoretical analysis establishes convergence and optimality, and extensive experiments on a range of NLP tasks show performance comparable to full fine-tuning with far fewer trainable parameters.
Link: https://arxiv.org/abs/2410.19694
Authors: Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King
Keywords (EN): Large Language Models, Fine-tuning Large Language, Large Language, Fine-tuning Large, crucial technique
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages
Abstract:Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks. However, the enormous size of LLMs poses significant challenges in terms of computational complexity and resource requirements. Low-Rank Adaptation (LoRA) has emerged as a promising solution. However, there exists a gap between the practical performance of low-rank adaptations and its theoretical optimum. In this work, we propose eXtreme Gradient Boosting LoRA (XGBLoRA), a novel framework that bridges this gap by leveraging the power of ensemble learning. Inspired by gradient boosting, XGBLoRA iteratively learns and merges a sequence of LoRA adaptations to refine model predictions. It achieves better performance than the standard LoRA, while enjoying the computational efficiency of rank-1 adaptations. We provide theoretical analysis to show the convergence and optimality of our approach, and conduct extensive experiments on a range of natural language processing tasks. The results demonstrate that XGBLoRA consistently outperforms standard LoRA and achieves performance comparable to full fine-tuning with significantly fewer trainable parameters. This work advances parameter-efficient fine-tuning for LLMs, and offers a promising solution for adapting LLMs to downstream tasks while optimizing performance and efficiency.
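A conceptual NumPy sketch of the boosting idea (not the paper's training loop): each round fits a rank-1 correction to the current residual, mirroring how XGBLoRA iteratively learns and merges rank-1 LoRA adaptations:

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(16, 16))   # stands in for the ideal weight update
W = np.zeros_like(target)            # accumulated adaptation

for t in range(8):                   # boosting rounds
    residual = target - W
    # Best rank-1 approximation of the residual (top singular triplet),
    # analogous to fitting one rank-1 adapter per round and merging it.
    U, S, Vt = np.linalg.svd(residual)
    W += S[0] * np.outer(U[:, 0], Vt[0])
    print(f"round {t}: residual norm = {np.linalg.norm(target - W):.3f}")
```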
[NLP-6] AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
[Quick Read]: This paper addresses the generation of diverse and effective clarifying questions in open-domain conversational search systems. The key to the solution is AGENT-CQ, an end-to-end LLM-based framework with two stages: a generation stage that uses LLM prompting strategies to produce clarifying questions, and an evaluation stage (CrowdLLM) that simulates human crowdsourcing judgments by using multiple LLM instances to assess the generated questions and answers against comprehensive quality metrics. Experiments show that CrowdLLM is effective at evaluating question and answer quality, and that the AGENT-CQ generation stage consistently outperforms baselines across multiple quality dimensions. In retrieval-based evaluation, LLM-generated questions significantly improve retrieval effectiveness for both BM25 and cross-encoder models compared with human-generated questions.
Link: https://arxiv.org/abs/2410.19692
Authors: Clemencia Siro, Yifei Yuan, Mohammad Aliannejadi, Maarten de Rijke
Keywords (EN): open-domain conversational search, improving query understanding, Generating diverse, effective clarifying questions, clarifying questions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 23 pages
Abstract: Generating diverse and effective clarifying questions is crucial for improving query understanding and retrieval performance in open-domain conversational search (CS) systems. We propose AGENT-CQ (Automatic GENeration, and evaluaTion of Clarifying Questions), an end-to-end LLM-based framework addressing the challenges of scalability and adaptability faced by existing methods that rely on manual curation or template-based approaches. AGENT-CQ consists of two stages: a generation stage employing LLM prompting strategies to generate clarifying questions, and an evaluation stage (CrowdLLM) that simulates human crowdsourcing judgments using multiple LLM instances to assess generated questions and answers based on comprehensive quality metrics. Extensive experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality. Human evaluation and CrowdLLM show that the AGENT-CQ generation stage consistently outperforms baselines in various aspects of question and answer quality. In retrieval-based evaluation, LLM-generated questions significantly enhance retrieval effectiveness for both BM25 and cross-encoder models compared to human-generated questions.
[NLP-7] ProvocationProbe: Instigating Hate Speech Dataset from Twitter
[Quick Read]: This paper asks what distinguishes instigating hate speech from general hate speech. The key to the solution is ProvocationProbe, a dataset of roughly twenty thousand tweets collected from Twitter covering nine global controversies. Through detailed annotation, the paper identifies features characteristic of instigating hate speech, such as targeted identity attacks and stated reasons for hate, thereby revealing how instigating hate speech differs from general hate speech.
Link: https://arxiv.org/abs/2410.19687
Authors: Abhay Kumar, Vigneshwaran Shankaran, Rajesh Sharma
Keywords (EN): recent years online, social media platforms, years online social, online social media, social media
Subjects: Computation and Language (cs.CL)
Comments:
Abstract: In recent years, online social media platforms have been flooded with hateful remarks such as racism, sexism, and homophobia. As a result, many measures have been taken by various social media platforms to mitigate the spread of hate speech over the internet. One particular concept within the domain of hate speech is instigating hate, which involves provoking hatred against a particular community, race, colour, gender, religion or ethnicity. In this work, we introduce ProvocationProbe, a dataset designed to explore what distinguishes instigating hate speech from general hate speech. For this study, we collected around twenty thousand tweets from Twitter, encompassing a total of nine global controversies. These controversies span various themes including racism, politics, and religion. In this paper, i) we present an annotated dataset after comprehensive examination of all the controversies, ii) we also highlight the difference between hate speech and instigating hate speech by identifying distinguishing features, such as targeted identity attacks and reasons for hate.
[NLP-8] A distributional simplicity bias in the learning dynamics of transformers NEURIPS2024
[Quick Read]: This paper asks whether Transformer models trained with self-supervised techniques also exhibit a "simplicity bias", that is, whether they first learn simple classifiers before progressing to more complex, non-linear functions. The key to the solution is a procedure that generates clones of a given natural language dataset which rigorously capture interactions between input tokens up to a specified order, making it possible to analyze the order in which Transformers learn many-body interactions on natural language and where their prediction error saturates. The analysis shows that Transformers saturate on low-degree interactions while continuing to learn high-degree ones, opening a new route for studying how interactions of different orders in the data affect learning.
Link: https://arxiv.org/abs/2410.19637
Authors: Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt
Keywords (EN): over-parameterised neural networks, neural networks prevent, networks prevent overfitting, initially learning simple, learning simple classifiers
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures, Accepted at NeurIPS 2024
Abstract: The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a "simplicity bias": neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate clones of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
[NLP-9] OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration Feedback and Optimization
[Quick Read]: This paper addresses the poor performance of existing open-source autonomous agents on real-world multimodal tasks, especially in realistic environments that lack well-defined reward signals. The key to the solution is an open-source framework in which a base model is first trained with imitation learning, after which the agent autonomously explores the open web and collects feedback on its trajectories; its policy is then improved by learning from well-performing trajectories as judged by another general-purpose model. This exploration-feedback-optimization cycle can be iterated, allowing the agent to improve itself after each round and ultimately achieve strong performance across multiple test sets.
Link: https://arxiv.org/abs/2410.19609
Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
Keywords (EN): sparked significant interest, handling real-world scenarios, develop autonomous agents, autonomous agents capable, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they are building text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agent that can autonomously conduct real-world exploration and improve itself. We first train the base model with imitation learning to gain the basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.
[NLP-10] ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems
[Quick Read]: This paper addresses the inaccurate responses that Retrieval-Augmented Generation (RAG) systems built on large language models (LLMs) produce when they retrieve irrelevant or only loosely related information. The key to the solution is ChunkRAG, an LLM-driven chunk-filtering framework: semantic chunking divides documents into coherent sections, and LLM-based relevance scoring assesses how well each chunk aligns with the user query. Filtering out irrelevant chunks before the generation phase significantly reduces hallucinations and improves factual accuracy. Experiments show the method outperforms existing RAG models on tasks that require precise information retrieval, making RAG systems more reliable for applications such as fact-checking and multi-hop reasoning.
Link: https://arxiv.org/abs/2410.19572
Authors: Ritvik Aggarwal, Ishneet Sukhvinder Singh, Ibrahim Allahverdiyev, Muhammad Taha, Aslihan Akalin, Kevin Zhu
Keywords (EN): generate inaccurate responses, inaccurate responses due, large language models, loosely related information, large language
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems using large language models (LLMs) often generate inaccurate responses due to the retrieval of irrelevant or loosely related information. Existing methods, which operate at the document level, fail to effectively filter out such content. We propose LLM-driven chunk filtering, ChunkRAG, a framework that enhances RAG systems by evaluating and filtering retrieved information at the chunk level. Our approach employs semantic chunking to divide documents into coherent sections and utilizes LLM-based relevance scoring to assess each chunk’s alignment with the user’s query. By filtering out less pertinent chunks before the generation phase, we significantly reduce hallucinations and improve factual accuracy. Experiments show that our method outperforms existing RAG models, achieving higher accuracy on tasks requiring precise information retrieval. This advancement enhances the reliability of RAG systems, making them particularly beneficial for applications like fact-checking and multi-hop reasoning.
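A skeletal version of chunk-level filtering; `semantic_chunks` and `llm_relevance` are placeholders for the paper's semantic chunker and LLM relevance scorer:

```python
def semantic_chunks(document: str) -> list[str]:
    # Placeholder chunker: in ChunkRAG this is semantic chunking,
    # not a naive paragraph split.
    return [c.strip() for c in document.split("\n\n") if c.strip()]

def llm_relevance(chunk: str, query: str) -> float:
    # Placeholder for an LLM prompt like "Score 0-1 how relevant this
    # chunk is to the query"; here, a toy lexical-overlap score.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def filter_chunks(document: str, query: str, threshold: float = 0.5) -> list[str]:
    """Keep only chunks the scorer deems relevant, *before* generation."""
    return [c for c in semantic_chunks(document)
            if llm_relevance(c, query) >= threshold]

doc = "Paris is the capital of France.\n\nThe Eiffel Tower opened in 1889."
print(filter_chunks(doc, "capital of France"))  # the second chunk is dropped
```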
[NLP-11] Mirror Matrix on the Wall: coding and vector notation as tools for introspection
[Quick Read]: This paper explores how adopting vector notation strengthens GNU Octave as an expressive programming language for representing and solving complex problems. The key lies in an in-depth analysis of Octave's operators and functions that brings the language closer to mathematical notation, together with fundamental concepts such as indexing, broadcasting, and function handles, reinforced through case studies. With vector notation, GNU Octave becomes a powerful tool that helps mathematicians, scientists, and engineers express and solve complex problems more effectively and intuitively.
Link: https://arxiv.org/abs/2410.19549
Authors: Leonardo Araújo
Keywords (EN): Kenneth E. Iverson, GNU Octave plays, vision of Kenneth, GNU Octave, vector notation adopted
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 22 pages, 1 figure (3 subfigures)
Abstract:The vector notation adopted by GNU Octave plays a significant role as a tool for introspection, aligning itself with the vision of Kenneth E. Iverson. He believed that, just like mathematics, a programming language should be an effective thinking tool for representing and reasoning about problems we wish to address. This work aims to explore the use of vector notation in GNU Octave through the analysis of operators and functions, providing a closer alignment with mathematical notation and enhancing code efficiency. We will delve into fundamental concepts such as indexing, broadcasting, and function handles, and present case studies for a deeper understanding of these concepts. By adopting vector notation, GNU Octave becomes a powerful tool for mathematicians, scientists and engineers, enabling them to express and solve complex problems more effectively and intuitively.
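The paper's examples are in GNU Octave; to keep one language across this post, here is the same trio of ideas (indexing, broadcasting, function handles) transliterated to NumPy, which shares Octave's array semantics:

```python
import numpy as np

A = np.arange(1, 13).reshape(3, 4)       # 3x4 matrix, like reshape(1:12, 3, 4)'

# Indexing: whole columns and logical masks replace explicit loops.
col = A[:, 1]                            # second column (Octave: A(:, 2))
evens = A[A % 2 == 0]                    # logical indexing (Octave: A(mod(A,2)==0))

# Broadcasting: a row vector is expanded across every row automatically.
centered = A - A.mean(axis=0)            # subtract per-column means in one expression

# Function handles: functions as first-class values (Octave: f = @(x) x.^2 + 1).
f = lambda x: x ** 2 + 1
print(col, evens, centered.sum(), f(np.array([1, 2, 3])), sep="\n")
```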
[NLP-12] Detection of Human and Machine-Authored Fake News in Urdu
[Quick Read]: This paper addresses the spread of misinformation in the social media era, where large language models (LLMs) such as ChatGPT make it easy to generate highly convincing, error-free machine-written content against which traditional linguistic-cue-based fake news detection fails. The key to the solution is an updated detection schema that covers machine-generated news, with a particular focus on low-resource languages such as Urdu, together with a hierarchical detection strategy whose effectiveness and robustness are validated across four datasets in various settings.
Link: https://arxiv.org/abs/2410.19517
Authors: Muhammad Zain Ali, Yuxia Wang, Bernhard Pfahringer, Tony Smith
Keywords (EN): large language models, error-free misinformation, highly convincing, making it increasingly, truth from falsehood
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:The rise of social media has amplified the spread of fake news, now further complicated by large language models (LLMs) like ChatGPT, which ease the generation of highly convincing, error-free misinformation, making it increasingly challenging for the public to discern truth from falsehood. Traditional fake news detection methods relying on linguistic cues also becomes less effective. Moreover, current detectors primarily focus on binary classification and English texts, often overlooking the distinction between machine-generated true vs. fake news and the detection in low-resource languages. To this end, we updated detection schema to include machine-generated news with focus on the Urdu language. We further propose a hierarchical detection strategy to improve the accuracy and robustness. Experiments show its effectiveness across four datasets in various settings.
[NLP-13] SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
[Quick Read]: This paper addresses the high inference costs and memory requirements of Large Language Models (LLMs). The key contribution is SWITCH (Studying WIth TeaCHer for Knowledge Distillation), which strategically involves the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences where student-generated outputs are prone to teacher misguidance. Experiments across three model families and five instruction-following datasets show that SWITCH surpasses traditional knowledge distillation (KD) methods, especially in generating long sequential data.
Link: https://arxiv.org/abs/2410.19503
Authors: Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung
Keywords (EN): Large Language Models, Large Language, high inference costs, success of Large, face challenges related
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student’s sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.
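A toy sketch of the selective-intervention idea: during decoding, compare the teacher's and student's next-token distributions and let the teacher take over when they disagree strongly. The divergence measure and threshold below are illustrative assumptions, not the paper's exact rule:

```python
import numpy as np

def generate_with_switch(student_probs, teacher_probs, threshold=0.5):
    """Pick each next token from the student unless the teacher disagrees strongly.

    student_probs / teacher_probs: (steps, vocab) next-token distributions
    at each decoding step (precomputed here for illustration).
    """
    tokens, interventions = [], 0
    for ps, pt in zip(student_probs, teacher_probs):
        tv = 0.5 * np.abs(ps - pt).sum()           # total-variation distance
        if tv > threshold:                         # discrepancy: teacher intervenes
            tokens.append(int(pt.argmax()))
            interventions += 1
        else:
            tokens.append(int(ps.argmax()))
    return tokens, interventions

rng = np.random.default_rng(1)
s = rng.dirichlet(np.ones(10), size=6)
t = rng.dirichlet(np.ones(10), size=6)
print(generate_with_switch(s, t))
```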
[NLP-14] Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization
[Quick Read]: This paper addresses the efficiency and effectiveness of prompt optimization for large language models (LLMs). The key contribution is Momentum-Aided Prompt Optimization (MAPO), which builds on ProTeGi's positive natural language "gradients" with a momentum-based extension to refine prompts. By tracking gradient history, MAPO avoids local minima and oscillations, and it balances candidate expansion and selection with beam search and an Upper Confidence Bound (UCB) algorithm, achieving faster convergence with fewer API calls and higher F1 scores than ProTeGi.
Link: https://arxiv.org/abs/2410.19499
Authors: Anthony Cui, Pranav Nandyalam, Kevin Zhu
Keywords (EN): Large Language Models, Momentum-Aided Prompt Optimization, Prompt Optimization, Language Models, Large Language
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Momentum-Aided Prompt Optimization (MAPO) enhances the efficiency and efficacy of prompt optimization for Large Language Models (LLMs). Building on ProTeGi, MAPO uses positive natural language “gradients” and a momentum-based extension to refine prompts effectively. By tracking gradient history, MAPO avoids local minima and oscillations. It also utilizes beam search and an Upper Confidence Bound (UCB) algorithm for balanced candidate expansion and selection. Benchmark testing shows that MAPO achieves faster convergence time with fewer API calls and higher F1 scores than ProTeGi, proving it as a robust and scalable solution for automated prompt engineering in LLMs.
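The abstract mentions beam search plus an Upper Confidence Bound (UCB) rule for candidate selection; a standard UCB selector over candidate prompts might look like this (the bookkeeping around it is our assumption):

```python
import math

def ucb_select(stats, c=1.4):
    """Pick the candidate prompt with the highest upper confidence bound.

    stats: {prompt: (mean_score, n_evaluations)}
    """
    total = sum(n for _, n in stats.values())
    def ucb(item):
        mean, n = item[1]
        # Unevaluated candidates get priority; otherwise mean + exploration bonus.
        return float("inf") if n == 0 else mean + c * math.sqrt(math.log(total) / n)
    return max(stats.items(), key=ucb)[0]

stats = {"prompt-A": (0.62, 30), "prompt-B": (0.70, 5), "prompt-C": (0.0, 0)}
print(ucb_select(stats))  # the unexplored prompt-C wins first
```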
[NLP-15] Graph Linearization Methods for Reasoning on Graphs with Large Language Models
[Quick Read]: This paper asks how to transform graph data into linear token sequences so that large language models (LLMs) can handle graph machine learning tasks naturally. The key is graph linearization: converting graph structure into sequences that reflect properties of natural language text, such as local dependency and global alignment. To this end, the paper develops several linearization methods based on graph centrality, degeneracy, and node relabeling schemes, and shows experimentally that they outperform random linearization baselines on graph reasoning tasks. These novel graph representations help connect graph machine learning with the trend toward multimodal processing with a unified Transformer model.
Link: https://arxiv.org/abs/2410.19494
Authors: Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis
Keywords (EN): process multiple modalities, Large language models, Large language, images and audio, multiple modalities
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph machine learning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term graph linearization, so that LLMs can handle graphs naturally. We consider that graphs should be linearized meaningfully to reflect certain properties of natural language text, such as local dependency and global alignment, in order to ease contemporary LLMs, trained on trillions of textual tokens, better understand graphs. To achieve this, we developed several graph linearization methods based on graph centrality, degeneracy, and node relabeling schemes. We then investigated their effect on LLM performance in graph reasoning tasks. Experimental results on synthetic graphs demonstrate the effectiveness of our methods compared to random linearization baselines. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multi-modal processing using a unified transformer model.
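A sketch of centrality- and degeneracy-based linearization using `networkx`; the textual template for rendering edges is our assumption:

```python
import networkx as nx

def linearize(G: nx.Graph, order: str = "centrality") -> str:
    """Render a graph as text, visiting nodes in a principled order."""
    if order == "centrality":
        rank = nx.degree_centrality(G)              # hubs first
    elif order == "degeneracy":
        rank = nx.core_number(G)                    # k-core index
    else:
        raise ValueError(order)
    nodes = sorted(G, key=lambda v: rank[v], reverse=True)
    relabel = {v: i for i, v in enumerate(nodes)}   # node relabeling scheme
    lines = [f"node {relabel[u]} is connected to node {relabel[v]}"
             for u, v in G.edges()]
    return ". ".join(lines) + "."

G = nx.karate_club_graph()
print(linearize(G, "degeneracy")[:120], "...")
```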
[NLP-16] A Debate-Driven Experiment on LLM Hallucinations and Accuracy
[Quick Read]: This paper addresses hallucination in large language models (LLMs): generating information unsupported by the input or external knowledge. The key to the solution is a multi-model interaction mechanism in which several GPT-4o-Mini instances engage in a debate-style exchange prompted with questions from the TruthfulQA dataset; one model is instructed to produce plausible but false answers while the others are asked to answer truthfully. The design tests whether injected misinformation pushes the truthful majority to justify their reasoning better and thereby improve performance on the TruthfulQA benchmark. The findings suggest that inter-model interaction offers valuable insights for improving the accuracy and robustness of LLM outputs, complementing existing hallucination mitigation strategies.
Link: https://arxiv.org/abs/2410.19485
Authors: Ray Li, Tanishka Bagade, Kevin Martinez, Flora Yasmin, Grant Ayala, Michael Lam, Kevin Zhu
Keywords (EN): Large language models, contextually relevant text, Large language, relevant text, producing information
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have achieved a degree of success in generating coherent and contextually relevant text, yet they remain prone to a significant challenge known as hallucination: producing information that is not substantiated by the input or external knowledge. Previous efforts to mitigate hallucinations have focused on techniques such as fine-tuning models on high-quality datasets, incorporating fact-checking mechanisms, and developing adversarial training methods. While these approaches have shown some promise, they often address the issue at the level of individual model outputs, leaving unexplored the effects of inter-model interactions on hallucination. This study investigates the phenomenon of hallucination in LLMs through a novel experimental framework where multiple instances of GPT-4o-Mini models engage in a debate-like interaction prompted with questions from the TruthfulQA dataset. One model is deliberately instructed to generate plausible but false answers while the other models are asked to respond truthfully. The experiment is designed to assess whether the introduction of misinformation by one model can challenge the truthful majority to better justify their reasoning, improving performance on the TruthfulQA benchmark. The findings suggest that inter-model interactions can offer valuable insights into improving the accuracy and robustness of LLM outputs, complementing existing mitigation strategies.
[NLP-17] ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework
[Quick Read]: This paper addresses the weaker performance of multilingual large language models (LLMs) on non-dominant languages (those other than, for example, English), caused by the imbalance of training data across languages. The key contribution is ShifCon, a shift-based contrastive framework that shifts the internal representations of non-dominant languages into the dominant language's subspace, giving them access to the richer information encoded in the model parameters, and then shifts them back into the original language subspace before generation. A subspace distance metric pinpoints the optimal layer area for shifting, and multilingual contrastive learning further improves representation alignment within that area. Experiments show that ShifCon significantly improves non-dominant languages, particularly low-resource ones.
Link: https://arxiv.org/abs/2410.19453
Authors: Hengyuan Zhang, Chenming Shang, Sizhe Wang, Dongdong Zhang, Feng Yao, Renliang Sun, Yiyao Yu, Yujiu Yang, Furu Wei
Keywords (EN): fine-tuning Large Language, Large Language Models, fine-tuning Large, Large Language, non-dominant languages
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 11 figures
Abstract: Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance the multilingual capabilities of LLMs, they still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
[NLP-18] Intelligent Understanding of Large Language Models in Traditional Chinese Medicine Based on Prompt Engineering Framework
[Quick Read]: This paper addresses how prompt engineering can improve the performance of large language models (LLMs) in the domain of Traditional Chinese Medicine (TCM). The key contribution is TCM-Prompt, a framework that integrates various pre-trained language models (PLMs), templates, tokenization, and verbalization methods, letting researchers easily construct and fine-tune models for specific TCM tasks. Experiments on disease classification, syndrome identification, herbal medicine recommendation, and general NLP tasks demonstrate the method's effectiveness and superiority over baselines.
Link: https://arxiv.org/abs/2410.19451
Authors: Yirui Chen, Qinyu Xiao, Jia Yi, Jing Chen, Mengyang Wang
Keywords (EN): Traditional Chinese Medicine, Traditional Chinese, large language models, Chinese Medicine, paper explores
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper explores the application of prompt engineering to enhance the performance of large language models (LLMs) in the domain of Traditional Chinese Medicine (TCM). We propose TCM-Prompt, a framework that integrates various pre-trained language models (PLMs), templates, tokenization, and verbalization methods, allowing researchers to easily construct and fine-tune models for specific TCM-related tasks. We conducted experiments on disease classification, syndrome identification, herbal medicine recommendation, and general NLP tasks, demonstrating the effectiveness and superiority of our approach compared to baseline methods. Our findings suggest that prompt engineering is a promising technique for improving the performance of LLMs in specialized domains like TCM, with potential applications in digitalization, modernization, and personalized medicine.
[NLP-19] KAHANI: Culturally-Nuanced Visual Storytelling Pipeline for Non-Western Cultures
[Quick Read]: This paper addresses the tendency of large language models (LLMs) and text-to-image (T2I) models to reflect the sensibilities of the Global North, which forces non-Western communities to put extra effort into generating culturally specific stories. The key contribution is KAHANI, a visual storytelling pipeline that generates culturally grounded visual stories for non-Western cultures using off-the-shelf models (GPT-4 Turbo and Stable Diffusion XL). Chain of Thought (CoT) and T2I prompting techniques capture the cultural context of the user's prompt and produce vivid descriptions of characters and scene compositions. In a comparative user study with participants from different regions of India, KAHANI captured and incorporated more Culturally Specific Items (CSIs) than ChatGPT-4 (with DALL-E3) and outperformed it in both cultural competence and visual story quality in 27 of 36 comparisons.
Link: https://arxiv.org/abs/2410.19419
Authors: Hamna, Deepthi Sudharsan, Agrima Seth, Ritvik Budhiraja, Deepika Khullar, Vyshak Jain, Kalika Bali, Aditya Vashistha, Sameer Segal
Keywords (EN): Large Language Models, Large Language, generate compelling text, Language Models, demonstrated the ability
Subjects: Computation and Language (cs.CL)
Comments: Under review
Abstract:Large Language Models (LLMs) and Text-To-Image (T2I) models have demonstrated the ability to generate compelling text and visual stories. However, their outputs are predominantly aligned with the sensibilities of the Global North, often resulting in an outsider’s gaze on other cultures. As a result, non-Western communities have to put extra effort into generating culturally specific stories. To address this challenge, we developed a visual storytelling pipeline called KAHANI that generates culturally grounded visual stories for non-Western cultures. Our pipeline leverages off-the-shelf models GPT-4 Turbo and Stable Diffusion XL (SDXL). By using Chain of Thought (CoT) and T2I prompting techniques, we capture the cultural context from user’s prompt and generate vivid descriptions of the characters and scene compositions. To evaluate the effectiveness of KAHANI, we conducted a comparative user study with ChatGPT-4 (with DALL-E3) in which participants from different regions of India compared the cultural relevance of stories generated by the two tools. Results from the qualitative and quantitative analysis performed on the user study showed that KAHANI was able to capture and incorporate more Culturally Specific Items (CSIs) compared to ChatGPT-4. In terms of both its cultural competence and visual story generation quality, our pipeline outperformed ChatGPT-4 in 27 out of the 36 comparisons.
[NLP-20] Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models
[Quick Read]: This paper addresses hallucinations produced by large language models (LLMs) on natural language processing tasks. The key to the solution is prompt engineering: designing and formulating instructions to reduce hallucinations. A systematic empirical evaluation of different prompting strategies and frameworks on a broad set of benchmark datasets shows that the optimal technique depends on the type of problem, and that simpler prompting techniques often outperform more complex ones at reducing hallucinations. The paper also examines tool-calling agents (LLMs augmented with external tools) and finds that, owing to the added complexity of external tool usage, such agents can exhibit significantly higher hallucination rates.
Link: https://arxiv.org/abs/2410.19385
Authors: Liam Barkley, Brink van der Merwe
Keywords (EN): Large Language Models, powerful computational models, computational models trained, general-purpose language understanding, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) are powerful computational models trained on extensive corpora of human-readable text, enabling them to perform general-purpose language understanding and generation. LLMs have garnered significant attention in both industry and academia due to their exceptional performance across various natural language processing (NLP) tasks. Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations. Prompt engineering, the process of designing and formulating instructions for LLMs to perform specific tasks, has emerged as a key approach to mitigating hallucinations. This paper provides a comprehensive empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs. Various prompting techniques are applied to a broad set of benchmark datasets to assess the accuracy and hallucination rate of each method. Additionally, the paper investigates the influence of tool-calling agents (LLMs augmented with external tools to enhance their capabilities beyond language generation) on hallucination rates in the same benchmarks. The findings demonstrate that the optimal prompting technique depends on the type of problem, and that simpler techniques often outperform more complex methods in reducing hallucinations. Furthermore, it is shown that LLM agents can exhibit significantly higher hallucination rates due to the added complexity of external tool usage.
[NLP-21] Interleaving Text and Number Embeddings to Solve Mathemathics Problems
[Quick Read]: This paper addresses how to integrate text and numbers effectively in large language models (LLMs); current approaches rely on discrete tokenization of numbers (e.g., scientific notation or base-10 decomposition), which can introduce numerical artefacts and limit the range of magnitudes handled. The key contributions are (i) more expressive numerical embeddings, with an MLP assigning distinct directions in embedding space to different numbers, and (ii) a routing layer that differentiates numerical from text embeddings, enabling the model to distinguish text and number distributions while retaining its capacity for arithmetic. With only a 45M-parameter encoder-decoder, the method achieves R^2 = 0.9988 across magnitudes from 10^-3 to 10^8 and reduces the numerical artefacts and biases observed in the baselines.
Link: https://arxiv.org/abs/2410.19353
Authors: Marvin Alberts, Gianmarco Gabrieli, Irina Espejo Morales
Keywords (EN): enhancing Large Language, Large Language Models, enhancing Large, Large Language, capabilities in assisting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract: Integrating text and numbers effectively is a crucial step towards enhancing Large Language Models (LLMs) capabilities in assisting in scientific tasks. While most current approaches rely on discrete tokenization of numbers, for instance, conversion to scientific notation or base 10-decomposition, a recent approach proposed a continuous numerical encoding as an inductive bias. In this paper, we build upon this approach by introducing more expressive numerical embeddings. Our method addresses key shortcomings, including the elimination of numerical artefacts and the ability to handle a wide range of magnitudes without clipping. Our work presents two key contributions. First, we employ an MLP to assign distinct directions in the embedding space to different numbers. Our second contribution is the introduction of a routing layer that differentiates between numerical and text embeddings. We hypothesise that this combined approach enables the model to distinguish between text and number distributions while maintaining its capacity for arithmetic operations. Using only a 45M parameter encoder-decoder architecture, our method achieves R^2 = 0.9988 over a wide range of magnitudes (10^-3 to 10^8). In addition, we empirically observe a reduction of the numerical artefacts and biases observed compared to the baselines.
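A minimal PyTorch sketch of the two contributions, an MLP number embedding plus a routing layer that switches between numeric and text embeddings; the shapes, the `asinh` scaling, and the MLP design are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class NumberAwareEmbedding(nn.Module):
    """Sketch: continuous MLP embedding for numbers, routed away from text."""
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d)
        self.num_mlp = nn.Sequential(nn.Linear(1, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, token_ids, values, is_number):
        # Routing layer: numeric positions take the MLP embedding of the
        # (asinh-scaled) value; all other positions take the text embedding.
        text = self.text_emb(token_ids)
        nums = self.num_mlp(torch.asinh(values).unsqueeze(-1))  # tames 1e-3..1e8
        return torch.where(is_number.unsqueeze(-1), nums, text)

emb = NumberAwareEmbedding()
ids = torch.tensor([[5, 7, 9]])
vals = torch.tensor([[0.0, 3.14, 0.0]])
mask = torch.tensor([[False, True, False]])
print(emb(ids, vals, mask).shape)  # torch.Size([1, 3, 64])
```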
[NLP-22] AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
[Quick Read]: This paper addresses how to evaluate large language models (LLMs) in complex social interactions, specifically their capacity to simulate human behavior across multi-layered social scenarios. The key contribution is AgentSense, a benchmark grounded in Dramaturgical Theory that uses a bottom-up approach to build 1,225 diverse social scenarios from extensive scripts and evaluates LLM-driven agents through multi-turn interactions. Evaluation emphasizes both goal completion and implicit reasoning, with goals analyzed via ERG theory and comprehensive experiments. The findings show that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and that even GPT-4o needs improvement in private-information reasoning.
Link: https://arxiv.org/abs/2410.19346
Authors: Xinyi Mou, Jingcong Liang, Jiayu Lin, Xinnong Zhang, Xiawei Liu, Shiyue Yang, Rong Ye, Lei Chen, Haoyu Kuang, Xuanjing Huang, Zhongyu Wei
Keywords (EN): Large language models, empower autonomous agents, Large language, behavioral research, increasingly leveraged
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Large language models (LLMs) are increasingly leveraged to empower autonomous agents to simulate human beings in various fields of behavioral research. However, evaluating their capacity to navigate complex social interactions remains a challenge. Previous studies face limitations due to insufficient scenario diversity, complexity, and a single-perspective focus. To this end, we introduce AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios. Drawing on Dramaturgical Theory, AgentSense employs a bottom-up approach to create 1,225 diverse social scenarios constructed from extensive scripts. We evaluate LLM-driven agents through multi-turn interactions, emphasizing both goal completion and implicit reasoning. We analyze goals using ERG theory and conduct comprehensive experiments. Our findings highlight that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and even GPT-4o requires improvement in private information reasoning.
[NLP-23] Two are better than one: Context window extension with multi-grained self-injection
[Quick Read]: This paper addresses the limited context window that restricts the broad application of contemporary large language models (LLMs). The key contribution is SharedLLM, grounded in multi-grained context compression and query-aware information retrieval. SharedLLM pairs two short-context LLMs (e.g., LLaMA-2), an upper model and a lower model: the lower model acts as a compressor and the upper model as a decoder, receiving compressed multi-grained context information from the lower model and performing context-aware modeling on the running text. Information is transferred only at the lowest layers, avoiding long forward paths in the lower model and redundant cross-attention modules in the upper model. A specialized tree-style data structure, combined with a search algorithm, efficiently encodes, stores, and retrieves multi-grained context for text chunks based on the input query; the whole procedure is termed self-injection.
Link: https://arxiv.org/abs/2410.19318
Authors: Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan
Keywords (EN): contemporary large language, large language models, limited context window, upper model, lower model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The code is available at this https URL
Abstract:The limited context window of contemporary large language models (LLMs) remains a huge barrier to their broader application across various domains. While continual pre-training on long-context data is a straightforward and effective solution, it incurs substantial costs in terms of data acquisition and computational resources. To alleviate this issue, we propose SharedLLM, a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval. SharedLLM is composed of two short-context LLMs such as LLaMA-2, termed upper model and lower model. The lower model functions as a compressor while the upper model acts as a decoder. The upper model receives compressed, multi-grained context information from the lower model and performs context-aware modeling on the running text. Information transfer between the compressor and decoder occurs only at the lowest layers to refrain from long forward paths in the lower model and redundant cross-attention modules in the upper model. Based on this architecture, we introduce a specialized tree-style data structure to efficiently encode, store and retrieve multi-grained contextual information for text chunks. This structure, combined with a search algorithm, enables rapid encoding and retrieval of relevant information from various levels of the tree based on the input query. This entire process, wherein the sender and receiver are derived from the same LLM layer, is referred to as self-injection.
[NLP-24] FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs
[Quick Read]: This paper addresses fairness concerns in LLM-based chatbots, which are especially acute in multi-turn dialogue scenarios. The key contribution is FairMT-Bench, a comprehensive fairness benchmark targeting three stages of multi-turn dialogue (context understanding, user interaction, and instruction trade-offs), with two tasks per stage. The authors construct FairMT-10K, a multi-turn dialogue dataset of 10,000 samples, and evaluate with GPT-4 alongside bias classifiers such as Llama-Guard-3 and human validation. Results show that current LLMs are more likely to generate biased responses in multi-turn dialogue, with large performance variation across tasks and models. Based on this, a more challenging subset, FairMT-1K, is curated and used to test 15 state-of-the-art LLMs, demonstrating the approach's value for assessing fairness in realistic multi-turn settings and calling for future work on LLM fairness improvement using FairMT-1K.
Link: https://arxiv.org/abs/2410.19317
Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu
Keywords (EN): large language model, fairness, multi-turn dialogue scenarios, large language, chatbots has raised
Subjects: Computation and Language (cs.CL)
Comments:
Abstract: The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing fairness datasets and employ our template to construct a multi-turn dialogue dataset, FairMT-10K. For evaluation, GPT-4 is applied, alongside bias classifiers including Llama-Guard-3 and human validation to ensure robustness. Experiments and analyses on FairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models. Based on this, we curate a challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset. The results show the current state of fairness in LLMs and showcase the utility of this novel approach for assessing fairness in more realistic multi-turn dialogue contexts, calling for future work to focus on LLM fairness improvement and the adoption of FairMT-1K in such efforts.
[NLP-25] Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
【Quick Read】: This paper addresses gender bias in multimodal models that pair pre-trained large language models (LLMs) with visual input. It evaluates 22 popular open-source vision-language assistants (VLAs) for gender bias in personality traits, skills, and occupations, finding that these models reproduce biases likely present in human data, such as real-world occupational imbalances. The key finding on mitigation is that finetuning-based debiasing methods achieve the best trade-off between debiasing and retaining downstream task performance. The paper argues for gender-bias assessment before deployment and calls for further development of debiasing strategies to ensure equitable societal outcomes.
Link: https://arxiv.org/abs/2410.19314
Authors: Leander Girrbach, Yiran Huang, Stephan Alaniz, Trevor Darrell, Zeynep Akata
Keywords: pre-trained large language models, reliably integrated
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:
Abstract:Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.
[NLP-26] Any Other Thoughts, Hedgehog? Linking Deliberation Chains in Collaborative Dialogues EMNLP2024
【Quick Read】: This paper addresses the modeling of probing questions in collaborative dialogue, specifically the causal relations that lead from earlier utterances in the dialogue to the emergence of a probing question. The key to the solution is a novel graph-based framework of deliberation chains, which reframes the construction of such chains as a coreference-style clustering problem. The framework jointly models probing questions, their causal utterances, and the links between them; evaluation on two challenging collaborative task datasets (the Weights Task and DeliData) shows that this theoretically grounded approach outperforms both baselines and stronger coreference methods, establishing a standard of performance for this novel task.
Link: https://arxiv.org/abs/2410.19301
Authors: Abhijnan Nath, Videep Venkatesha, Mariah Bradford, Avyakta Chelle, Austin Youngren, Carlos Mabrey, Nathaniel Blanchard, Nikhil Krishnaswamy
Keywords: collaborative problem solving, knowledge construction
Subjects: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2024
Abstract:Question-asking in collaborative dialogue has long been established as key to knowledge construction, both in internal and collaborative problem solving. In this work, we examine probing questions in collaborative dialogues: questions that explicitly elicit responses from the speaker’s interlocutors. Specifically, we focus on modeling the causal relations that lead directly from utterances earlier in the dialogue to the emergence of the probing question. We model these relations using a novel graph-based framework of deliberation chains, and reframe the problem of constructing such chains as a coreference-style clustering problem. Our framework jointly models probing and causal utterances and the links between them, and we evaluate on two challenging collaborative task datasets: the Weights Task and DeliData. Our results demonstrate the effectiveness of our theoretically-grounded approach compared to both baselines and stronger coreference approaches, and establish a standard of performance in this novel task.
[NLP-27] Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning
【Quick Read】: This paper tackles hallucinations in large language models (LLMs) caused by the knowledge inconsistency between pre-training and fine-tuning data. The key to the solution is Prereq-Tune, a novel fine-tuning strategy that disentangles skill learning from knowledge learning, so the model acquires only the task skills during fine-tuning without being affected by the inconsistency. Concretely, Prereq-Tune introduces an additional prerequisite learning stage that learns the knowledge necessary for SFT, allowing the subsequent SFT stage to focus purely on task skills. Prereq-Tune can also be combined with fictitious synthetic data to strengthen the grounding of LLM outputs in their internal knowledge. Experiments show that Prereq-Tune outperforms existing baselines in improving LLM factuality on short QA and long-form generation tasks, opening new possibilities for knowledge-controlled generation in LLMs.
Link: https://arxiv.org/abs/2410.19290
Authors: Yujian Liu, Shiyu Chang, Tommi Jaakkola, Yang Zhang
Keywords: knowledge inconsistency, unfamiliar fine-tuning data
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent studies have identified one aggravating factor of LLM hallucinations as the knowledge inconsistency between pre-training and fine-tuning, where unfamiliar fine-tuning data mislead the LLM to fabricate plausible but wrong outputs. In this paper, we propose a novel fine-tuning strategy called Prereq-Tune to address this knowledge inconsistency and reduce hallucinations. Fundamentally, Prereq-Tune disentangles the learning of skills and knowledge, so the model learns only the task skills without being impacted by the knowledge inconsistency. To achieve this, Prereq-Tune introduces an additional prerequisite learning stage to learn the necessary knowledge for SFT, allowing subsequent SFT to focus only on task skills. Prereq-Tune can also be combined with fictitious synthetic data to enhance the grounding of LLM outputs to their internal knowledge. Experiments show that Prereq-Tune outperforms existing baselines in improving LLM’s factuality across short QA and long-form generation tasks. It also opens new possibilities for knowledge-controlled generation in LLMs. Our code is available at this https URL.
[NLP-28] Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning ICLR2025
【Quick Read】: This paper addresses the rapidly growing memory overhead of key-value (KV) caching in large language models (LLMs) as input length increases. The key to the solution is head-level KV cache compression: HeadKV and HeadKV-R2 estimate the importance of each attention head for contextual QA tasks that require both retrieval and reasoning abilities, and selectively retain critical information accordingly. Experiments across benchmarks, model architectures, and long-context ability tests show that head-level compression significantly outperforms strong baselines, especially in low-resource settings (KV cache sizes of 64-128), retaining just 1.5% of the KV cache while reaching 97% of the full-cache performance.
Link: https://arxiv.org/abs/2410.19258
Authors: Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao
Keywords: Large Language Models, memory overhead, efficiency
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, submitted to ICLR 2025
Abstract:Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context abilities tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 and 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark.
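As a rough illustration of compressing at the head level rather than the layer level, the sketch below splits a shared KV budget across heads in proportion to an importance score and keeps the highest-attention positions within each head. The proportional allocation rule and the inputs are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def head_level_kv_compress(keys, values, attn_mass, head_importance, total_budget):
    """keys/values: [num_heads, seq_len, head_dim]
    attn_mass: [num_heads, seq_len], attention each cached position received
    head_importance: [num_heads], e.g. retrieval-reasoning scores
    total_budget: total cache entries shared across all heads"""
    num_heads, seq_len, _ = keys.shape
    # More important heads receive a larger share of the shared budget
    # (rounding means the shares only approximately sum to the budget).
    share = head_importance / head_importance.sum()
    budgets = (share * total_budget).long().clamp(min=1)
    kept = []
    for h in range(num_heads):
        k = min(int(budgets[h]), seq_len)
        idx = attn_mass[h].topk(k).indices.sort().values  # preserve positional order
        kept.append((keys[h, idx], values[h, idx]))
    return kept  # ragged per-head caches
```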
[NLP-29] The Reopening of Pandora's Box: Analyzing the Role of LLMs in the Evolving Battle Against AI-Generated Fake News
【Quick Read】: This paper addresses the spread of fake news enabled by generative AI producing content at scale. The key to the solution is a university-level competition that explores how humans use large language models (LLMs) to create fake news and assesses the ability of human annotators and AI models to detect it. The study finds that LLMs are roughly 68% more effective than humans at detecting real news, while for fake news detection LLMs and humans perform comparably (about 60% accuracy). It further examines how visual elements in news affect detection accuracy and what strategies fake-news creators use to make AI-generated content more credible, highlighting the growing complexity of detecting AI-generated fake news in collaborative human-AI settings.
Link: https://arxiv.org/abs/2410.19250
Authors: Xinyu Wang, Wenbo Zhang, Sai Koneru, Hangzhi Guo, Bonam Mingole, S. Shyam Sundar, Sarah Rajtmajer, Amulya Yadav
Keywords: large language models, fake news, genuine concerns
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:With the rise of AI-generated content spewed at scale from large language models (LLMs), genuine concerns about the spread of fake news have intensified. The perceived ability of LLMs to produce convincing fake news at scale poses new challenges for both human and automated fake news detection systems. To address this gap, this work presents the findings from a university-level competition which aimed to explore how LLMs can be used by humans to create fake news, and to assess the ability of human annotators and AI models to detect it. A total of 110 participants used LLMs to create 252 unique fake news stories, and 84 annotators participated in the detection tasks. Our findings indicate that LLMs are ~68% more effective at detecting real news than humans. However, for fake news detection, the performance of LLMs and humans remains comparable (~60% accuracy). Additionally, we examine the impact of visual elements (e.g., pictures) in news on the accuracy of detecting fake news stories. Finally, we also examine various strategies used by fake news creators to enhance the credibility of their AI-generated content. This work highlights the increasing complexity of detecting AI-generated fake news, particularly in collaborative human-AI settings.
[NLP-30] Developing a Tutoring Dialog Dataset to Optimize LLM s for Educational Use
【Quick Read】: This paper addresses the challenges of using large language models (LLMs) for dialog-based tutoring systems at educational scale, namely the need for effective pedagogical strategies and the high cost of expert-curated datasets. The key to the solution is developing and fine-tuning smaller, more affordable LLMs for one-on-one tutoring on reading comprehension problems: the authors build a synthetic tutoring dialog dataset evaluated by human teachers, fine-tune a smaller LLM on it, and show through an interactive experiment that the fine-tuned model performs on par with a larger model in real-world tutoring scenarios at lower cost, offering a viable, cost-effective path to LLM-based tutoring in educational settings.
Link: https://arxiv.org/abs/2410.19231
Authors: Menna Fateen, Tsunenori Mine
Keywords: effective pedagogical strategies, scalable educational applications
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in large language models (LLMs) have shown promise for scalable educational applications, but their use in dialog-based tutoring systems remains challenging due to the need for effective pedagogical strategies and the high costs associated with expert-curated datasets. Our study explores the use of smaller, more affordable LLMs for one-on-one tutoring in the context of solving reading comprehension problems. We developed a synthetic tutoring dialog dataset, evaluated by human teachers, and fine-tuned a smaller LLM using this dataset. Furthermore, we conducted an interactive experiment comparing the performance of the fine-tuned model with a larger model in real-world tutoring scenarios. Our results show that the fine-tuned model performs on par with the larger model but at a lower cost, demonstrating a viable, cost-effective approach for implementing LLM-based tutoring systems in educational settings.
[NLP-31] Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors
【Quick Read】: This paper studies how to attack detectors of LLM-generated text so that machine-generated output becomes indistinguishable from human writing. The key to the solution is a proxy-attack strategy that attacks the source model by leveraging a reinforcement learning (RL) fine-tuned, humanized small language model (SLM) during the decoding phase. The strategy produces responses that detectors cannot distinguish from human-written text, deceiving leading detectors with an average AUROC drop of 70.4% (up to 90.3% on a single dataset), and it also bypasses detectors in cross-discipline and cross-language scenarios while preserving the generation quality of the attacked models.
Link: https://arxiv.org/abs/2410.19230
Authors: Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng
Keywords: large language models, human-like writing
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 26 pages
Abstract:The advent of large language models (LLMs) has revolutionized the field of text generation, producing outputs that closely mimic human-like writing. Although academic and industrial institutions have developed detectors to prevent the malicious usage of LLM-generated texts, other research has cast doubt on the robustness of these systems. To stress test these detectors, we introduce a proxy-attack strategy that effortlessly compromises LLMs, causing them to produce outputs that align with human-written text and mislead detection systems. Our method attacks the source model by leveraging a reinforcement learning (RL) fine-tuned humanized small language model (SLM) in the decoding phase. Through an in-depth analysis, we demonstrate that our attack strategy is capable of generating responses that are indistinguishable to detectors, preventing them from differentiating between machine-generated and human-written text. We conduct systematic evaluations on extensive datasets using proxy-attacked open-source models, including Llama2-13B, Llama3-70B, and Mixtral-8x7B in both white- and black-box settings. Our findings show that the proxy-attack strategy effectively deceives the leading detectors, resulting in an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3% on a single dataset. Furthermore, in cross-discipline scenarios, our strategy also bypasses these detectors, leading to a significant relative decrease of up to 90.9%, while in the cross-language scenario, the drop reaches 91.3%. Despite our proxy-attack strategy successfully bypassing the detectors with such significant relative drops, we find that the generation quality of the attacked models remains preserved, even within a modest utility budget, when compared to the text produced by the original, unattacked source model.
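The abstract does not spell out the decoding-phase mechanics; one plausible reading is that the humanized SLM reshapes the source model's next-token distribution at every step. The sketch below shows such a log-linear mix; the mixing rule, the `alpha` weight, and the assumption that both models share a vocabulary are ours, not the paper's.

```python
import torch

def proxy_attack_step(source_logits, humanizer_logits, alpha=0.5, temperature=1.0):
    """One decoding step guided by an RL-finetuned 'humanized' SLM.
    Assumes source_logits and humanizer_logits are over the same vocabulary."""
    mixed = (1.0 - alpha) * source_logits + alpha * humanizer_logits
    probs = torch.softmax(mixed / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sampled next-token id
```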
[NLP-32] Can Stories Help LLMs Reason? Curating Information Space Through Narrative
【Quick Read】: This paper asks whether incorporating narrative elements can help large language models (LLMs) solve complex problems more effectively. The key to the solution is Story of Thought (SoT), a novel method that integrates narrative structures into problem-solving prompts: SoT constructs a narrative around the problem statement and creates a framework for identifying and organizing relevant information. Experiments show that various LLMs using SoT consistently outperform the same models with other techniques on physics, chemistry, math, and biology questions from the GPQA and JEEBench datasets. SoT's narrative-based information curation improves problem comprehension by contextualizing critical in-domain information and highlighting causal relationships within the problem space.
Link: https://arxiv.org/abs/2410.19221
Authors: Vahid Sadiri Javadi, Johanne R. Trippas, Yash Kumar Lal, Lucie Flek
Keywords: Large Language Models, science communication
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Narratives are widely recognized as a powerful tool for structuring information and facilitating comprehension of complex ideas in various domains such as science communication. This paper investigates whether incorporating narrative elements can assist Large Language Models (LLMs) in solving complex problems more effectively. We propose a novel approach, Story of Thought (SoT), integrating narrative structures into prompting techniques for problem-solving. This approach involves constructing narratives around problem statements and creating a framework to identify and organize relevant information. Our experiments show that using various LLMs with SoT consistently surpasses using them with other techniques on physics, chemistry, math, and biology questions in both the GPQA and JEEBench datasets. The narrative-based information curation process in SoT enhances problem comprehension by contextualizing critical in-domain information and highlighting causal relationships within the problem space.
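Since SoT is a prompting technique, a toy prompt builder conveys the idea; the wording below is our illustration of "narrativize, then solve", not the paper's actual template.

```python
def story_of_thought_prompt(problem: str) -> str:
    """Build an SoT-style prompt: ask the model to tell the problem as a
    story (entities, known facts, causal links) before solving it."""
    return (
        "Read the following problem and first retell it as a short story: "
        "introduce the entities involved, what is known about each, and how "
        "they causally relate to one another. Then, using only the "
        "information organized in your story, solve the problem step by "
        f"step.\n\nProblem: {problem}\n\nStory:"
    )
```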
[NLP-33] Inference time LLM alignment in single and multidomain preference spectrum
【Quick Read】: This paper addresses the flexibility and control needed to align large language models (LLMs) with subjectivity and nuanced preference levels, which is usually a resource-intensive and time-consuming process. The key to the solution is an inference-time model alignment method that learns encoded representations of preference dimensions, called Alignment Vectors (AV). An AV is computed by subtracting the base model from the aligned model, as in model editing, so model behavior can be adjusted dynamically during inference through simple linear operations. The method exposes adjustable preference knobs at inference and halves the inference cost compared with prompt engineering; moreover, AVs transfer across fine-tuning stages of the same model and support multidomain, diverse preference alignment, making the process 12x faster than retraining.
Link: https://arxiv.org/abs/2410.19206
Authors: Sadat Shahriar, Zheng Qi, Nikolaos Pappas, Srikanth Doss, Monica Sunkara, Kishaloy Halder, Manuel Mager, Yassine Benajiba
Keywords: Aligning Large Language Models
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Aligning Large Language Models (LLM) to address subjectivity and nuanced preference levels requires adequate flexibility and control, which can be a resource-intensive and time-consuming procedure. Existing training-time alignment methods require full re-training when a change is needed and inference-time ones typically require access to the reward model at each inference step. To address these limitations, we introduce an inference-time model alignment method that learns encoded representations of preference dimensions, called Alignment Vectors (AV). These representations are computed by subtraction of the base model from the aligned model, as in model editing, enabling dynamic adjustment of the model behavior during inference through simple linear operations. Even though the preference dimensions can span various granularity levels, here we focus on three gradual response levels across three specialized domains: medical, legal, and financial, exemplifying its practical potential. This new alignment paradigm introduces adjustable preference knobs during inference, allowing users to tailor their LLM outputs while reducing the inference cost by half compared to the prompt engineering approach. Additionally, we find that AVs are transferable across different fine-tuning stages of the same model, demonstrating their flexibility. AVs also facilitate multidomain, diverse preference alignment, making the process 12x faster than the retraining approach.
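The core arithmetic follows directly from the abstract: an Alignment Vector is the parameter-wise difference between the aligned and base models, and inference-time control is a linear combination of such vectors. A minimal PyTorch sketch (the knob interface is our illustration):

```python
import torch

def extract_alignment_vector(base_sd, aligned_sd):
    """AV = aligned - base, computed per parameter tensor (model-editing style)."""
    return {name: aligned_sd[name] - base_sd[name] for name in base_sd}

def apply_alignment(base_sd, avs, knobs):
    """Steer behavior at inference: W = W_base + sum_d lambda_d * AV_d,
    with one adjustable knob (lambda) per preference dimension."""
    steered = {}
    for name, w in base_sd.items():
        w = w.clone()
        for av, lam in zip(avs, knobs):
            w = w + lam * av[name]
        steered[name] = w
    return steered
```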
[NLP-34] Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis
【Quick Read】: This paper addresses the barriers to emotional expression faced by blind or visually impaired and less-literate users of social networks. Traditional emotional speech generation relies on manually supplying the expected emotion alongside the text, and suffers from information loss due to data simplification and from inaccurate phoneme durations, weakening the naturalness and accuracy of emotional rendering. The key to the solution is an end-to-end context-aware text-to-speech (TTS) synthesis system that derives the conveyed emotion from the text input itself and synthesizes audio focused on emotion and speaker features, integrating advanced natural language processing (NLP) and speech synthesis techniques for natural, expressive speech. The system also shows competitive inference-time performance, making it suitable for real-time accessibility applications.
Link: https://arxiv.org/abs/2410.19199
Authors: Suparna De, Ionut Bostan, Nishanth Sastry
Keywords: accessibility challenges, visually impaired, less-literate people, social networks
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Recent studies have outlined the accessibility challenges faced by blind or visually impaired, and less-literate people, in interacting with social networks, in spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text to speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference time performance when benchmarked against the state-of-the-art TTS models, making it suitable for real-time accessibility applications.
[NLP-35] Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models
【Quick Read】: This paper studies how label options (lexical choice, order, and elaboration) affect zero-shot classification performance under in-context learning (ICL), an influence that has been largely overlooked. The key to the solution is Label set Optimization via Activation Distribution kurtosiS (LOADS), a post-hoc method that requires no gradient propagation. An analysis of model internal states shows that optimal label names tend to activate fewer outlier neurons in the feed-forward network, and LOADS exploits this signal to identify better label names. It proves effective with only 100 unlabelled samples across different model types and sizes and also transfers across languages.
Link: https://arxiv.org/abs/2410.19195
Authors: Yue Li, Zhixue Zhao, Carolina Scarton
Keywords: in-context learning, zero-shot ICL classification, class label options
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In-context learning (ICL) performance is known to be sensitive to the prompt design, yet the impact of class label options in zero-shot classification has been largely overlooked. This study presents the first comprehensive empirical study investigating how label option (e.g., lexical choice, order, and elaboration) influences zero-shot ICL classification performance. Our findings reveal that lexical choices for label names (e.g., agree this http URL in stance classification) play an important role, with effects also linked to label orders. An analysis of the model internal states further shows that optimal label names tend to activate fewer outlier neurons in the feed forward network. Based on this observation, we propose Label set Optimization via Activation Distribution kurtosiS (LOADS), a post-hoc approach requiring no gradient propagation. LOADS not only demonstrates effectiveness with only 100 unlabelled samples across different model types and sizes, but also shows cross-lingual transferability.
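Building on the observation that good label names activate fewer outlier FFN neurons, a gradient-free selection rule can be sketched: score each candidate label set by the excess kurtosis of the activations it induces and keep the lowest-kurtosis set. How the activations are collected (`ffn_acts_for`) is an assumption here, and the scoring rule is our reading of the abstract rather than the paper's exact procedure.

```python
import torch

def activation_kurtosis(acts: torch.Tensor) -> float:
    """Excess kurtosis of flattened FFN activations; a heavy-tailed
    (high-kurtosis) distribution signals more outlier neurons firing."""
    x = acts.flatten().float()
    z = (x - x.mean()) / (x.std() + 1e-8)
    return (z.pow(4).mean() - 3.0).item()

def choose_label_set(candidate_label_sets, ffn_acts_for):
    """Pick the label set whose names trigger the fewest outlier neurons.
    ffn_acts_for(labels) is assumed to run the model on probe inputs with
    those label names and return the feed-forward activations."""
    return min(candidate_label_sets,
               key=lambda labels: activation_kurtosis(ffn_acts_for(labels)))
```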
[NLP-36] Enriching GNNs with Text Contextual Representations for Detecting Disinformation Campaigns on Social Media
【Quick Read】: This paper addresses disinformation detection on social media, in particular how textual features can improve detection performance. The key to the solution is integrating high-quality contextual text representations from Transformer-based language models into Graph Neural Networks (GNNs) for fake news detection. Experiments show that contextual representations improve Macro F1 by 9.3% over static representations and by 33.8% over GNNs without textual features, while poorly handled noise in data augmentation degrades performance and increases instability.
Link: https://arxiv.org/abs/2410.19193
Authors: Bruno Croso Cunha da Silva, Thomas Palmeira Ferraz, Roseli De Deus Lopes
Keywords: disinformation, social media, Graph Neural Networks
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: Work in progress
Abstract:Disinformation on social media poses both societal and technical challenges. While previous studies have integrated textual information into propagation networks, they have yet to fully leverage the advancements in Transformer-based language models for high-quality contextual text representations. This work investigates the impact of incorporating textual features into Graph Neural Networks (GNNs) for fake news detection. Our experiments demonstrate that contextual representations improve performance by 9.3% in Macro F1 over static ones and 33.8% over GNNs without textual features. However, noisy data augmentation degrades performance and increases instability. We expect our methodology to open avenues for further research, and all code is made publicly available.
[NLP-37] No Argument Left Behind: Overlapping Chunks for Faster Processing of Arbitrarily Long Legal Texts
【Quick Read】: This paper addresses the crisis facing the Brazilian judiciary, the largest in the world, due to the slow processing of millions of cases, and in particular the need for efficient analysis of large volumes of legal text. The key to the solution is uBERT, a hybrid model combining Transformer and Recurrent Neural Network (RNN) architectures that handles long legal texts effectively. uBERT processes the full text regardless of its length while keeping computational overhead reasonable; it outperforms BERT+LSTM when overlapping input is used and is significantly faster than ULMFiT on long legal documents.
Link: https://arxiv.org/abs/2410.19184
Authors: Israel Fama, Bárbara Bueno, Alexandre Alcoforado, Thomas Palmeira Ferraz, Arnold Moya, Anna Helena Reali Costa
Keywords: Brazilian judiciary system, Recurrent Neural Network, analyzing legal texts
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: To appear at 15th STIL @ BRACIS'24
Abstract:In a context where the Brazilian judiciary system, the largest in the world, faces a crisis due to the slow processing of millions of cases, it becomes imperative to develop efficient methods for analyzing legal texts. We introduce uBERT, a hybrid model that combines Transformer and Recurrent Neural Network architectures to effectively handle long legal texts. Our approach processes the full text regardless of its length while maintaining reasonable computational overhead. Our experiments demonstrate that uBERT achieves superior performance compared to BERT+LSTM when overlapping input is used and is significantly faster than ULMFiT for processing long legal documents.
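The overlapping-chunk idea in the title reduces to a few lines: split the token sequence into windows whose overlap keeps boundary-straddling arguments intact; each window is then encoded by the Transformer and a recurrent layer aggregates the per-chunk embeddings. The window sizes below are illustrative defaults, not the paper's reported configuration.

```python
def overlapping_chunks(token_ids, chunk_size=512, overlap=128):
    """Split an arbitrarily long token sequence into overlapping windows
    (stride = chunk_size - overlap) so no text is lost at chunk boundaries."""
    stride = chunk_size - overlap
    return [token_ids[start:start + chunk_size]
            for start in range(0, max(len(token_ids) - overlap, 1), stride)]

# Downstream (sketch): doc_repr = rnn(stack(encoder(c) for c in chunks)),
# i.e. the RNN reads the sequence of chunk embeddings produced by BERT.
```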
[NLP-38] Indication Finding: a novel use case for representation learning
【Quick Read】: This paper addresses how to prioritize potential new indications for a mechanism of action (MoA). The key to the solution is applying representation-learning methods from natural language processing to generate embeddings of indications and prioritizing them by their proximity to the indications with the strongest available evidence for the MoA. Specifically, embeddings are generated with SPPMI (Shifted Positive Pointwise Mutual Information), the approach is demonstrated in a deployment for anti-IL-17A, and an evaluation framework is presented for judging the quality of the indication-finding results and the derived embeddings.
Link: https://arxiv.org/abs/2410.19174
Authors: Maren Eckhoff, Valmir Selimi, Alexander Aranovitch, Ian Lyons, Emily Briggs, Jennifer Hou, Alex Devereson, Matej Macak, David Champagne, Chris Anagnostopoulos
Keywords: therapies, treating multiple diseases
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Many therapies are effective in treating multiple diseases. We present an approach that leverages methods developed in natural language processing and real-world data to prioritize potential, new indications for a mechanism of action (MoA). We specifically use representation learning to generate embeddings of indications and prioritize them based on their proximity to the indications with the strongest available evidence for the MoA. We demonstrate the successful deployment of our approach for anti-IL-17A using embeddings generated with SPPMI and present an evaluation framework to determine the quality of indication finding results and the derived embeddings.
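SPPMI is a standard construction: shift the pointwise mutual information by log k and clip at zero; embeddings are then typically obtained by factorizing this matrix (e.g., truncated SVD), and candidate indications are ranked by proximity to well-evidenced ones. The co-occurrence unit below (which records are counted together) is an assumption for illustration.

```python
import numpy as np

def sppmi_matrix(cooc: np.ndarray, k: int = 5) -> np.ndarray:
    """Shifted Positive PMI: max(PMI(i, j) - log k, 0).
    cooc[i, j] = how often indications i and j co-occur (e.g., within the
    same patient record or document)."""
    total = cooc.sum()
    p_i = cooc.sum(axis=1, keepdims=True) / total
    p_j = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_i * p_j))
    # Zero co-occurrence yields -inf (clipped) or nan (mapped to 0).
    return np.nan_to_num(np.maximum(pmi - np.log(k), 0.0))
```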
[NLP-39] Adversarial Attacks on Large Language Models Using Regularized Relaxation
【Quick Read】: This paper addresses the efficiency and effectiveness limitations of existing adversarial attack methods on large language models (LLMs): attacks that optimize discrete tokens are inefficient, while continuous optimization methods cannot produce valid tokens from the model's vocabulary, making them impractical. The key to the solution is combining regularized gradients with continuous optimization. The resulting attack is two orders of magnitude faster than the state-of-the-art greedy coordinate gradient method, markedly improves the attack success rate on aligned language models, and generates valid tokens, overcoming a fundamental limitation of existing continuous optimization approaches.
Link: https://arxiv.org/abs/2410.19160
Authors: Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu
Keywords: powerful Large Language Models, practical applications
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 8 pages, 6 figures
Abstract:As powerful Large Language Models (LLMs) are now widely used for numerous practical applications, their safety is of critical importance. While alignment techniques have significantly improved overall safety, LLMs remain vulnerable to carefully crafted adversarial inputs. Consequently, adversarial attack methods are extensively used to study and understand these vulnerabilities. However, current attack methods face significant limitations. Those relying on optimizing discrete tokens suffer from limited efficiency, while continuous optimization techniques fail to generate valid tokens from the model’s vocabulary, rendering them impractical for real-world applications. In this paper, we propose a novel technique for adversarial attacks that overcomes these limitations by leveraging regularized gradients with continuous optimization methods. Our approach is two orders of magnitude faster than the state-of-the-art greedy coordinate gradient-based method, significantly improving the attack success rate on aligned language models. Moreover, it generates valid tokens, addressing a fundamental limitation of existing continuous optimization methods. We demonstrate the effectiveness of our attack on five state-of-the-art LLMs using four datasets.
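A bare-bones version of the continuous relaxation: optimize a soft distribution over the vocabulary in embedding space, with a regularizer pulling the relaxation back toward valid (near one-hot) tokens. The entropy penalty here is our illustrative stand-in; the paper's actual regularizer may differ.

```python
import torch

def relaxed_attack_step(logits, embed_matrix, loss_fn, lr=0.1, reg=0.05):
    """One optimization step over relaxed adversarial tokens.
    logits: [suffix_len, vocab]; embed_matrix: [vocab, dim]."""
    logits = logits.detach().requires_grad_(True)
    soft = torch.softmax(logits, dim=-1)          # relaxed token choices
    emb = soft @ embed_matrix                     # differentiable embedding lookup
    entropy = -(soft * soft.clamp_min(1e-9).log()).sum(-1).mean()
    (loss_fn(emb) + reg * entropy).backward()     # low entropy ~ near one-hot
    with torch.no_grad():
        return logits - lr * logits.grad

# Valid vocabulary tokens are recovered afterwards with logits.argmax(dim=-1).
```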
[NLP-40] Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use
【Quick Read】: This paper addresses detecting adverse drug reactions (ADRs) from psychiatric medications and generating mitigation strategies. The key to the solution is the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework, which systematically evaluate how well large language models (LLMs) detect ADR expressions and deliver expert-aligned harm-reduction strategies. The study finds that although LLMs match experts in expressed emotion and tone, they struggle with the nuances of ADRs and with distinguishing ADR types; their strategies are more complex and harder to read, align with expert strategies only 70.86% of the time, and provide on average 12.32% less actionable advice. The work offers a comprehensive benchmark and evaluation framework for strategy-driven LLM tasks in high-risk domains.
Link: https://arxiv.org/abs/2410.19155
Authors: Mohit Chandra, Siddharth Sriraman, Gaurav Verma, Harneet Singh Khanuja, Jose Suarez Campayo, Zihang Li, Michael L. Birnbaum, Munmun De Choudhury
Keywords: mental health patients, Large Language Models, Adverse Drug Reactions
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 27 pages, 8 figures, 15 tables
Abstract:Adverse Drug Reactions (ADRs) from psychiatric medications are the leading cause of hospitalizations among mental health patients. With healthcare systems and online communities facing limitations in resolving ADR-related issues, Large Language Models (LLMs) have the potential to fill this gap. Despite the increasing capabilities of LLMs, past research has not explored their capabilities in detecting ADRs related to psychiatric medications or in providing effective harm reduction strategies. To address this, we introduce the Psych-ADR benchmark and the Adverse Drug Reaction Response Assessment (ADRA) framework to systematically evaluate LLM performance in detecting ADR expressions and delivering expert-aligned mitigation strategies. Our analyses show that LLMs struggle with understanding the nuances of ADRs and differentiating between types of ADRs. While LLMs align with experts in terms of expressed emotions and tone of the text, their responses are more complex, harder to read, and only 70.86% aligned with expert strategies. Furthermore, they provide less actionable advice by a margin of 12.32% on average. Our work provides a comprehensive benchmark and evaluation framework for assessing LLMs in strategy-driven tasks within high-risk domains.
[NLP-41] A Test of Time: Predicting the Sustainable Success of Online Collaboration in Wikipedia
【Quick Read】: This paper addresses how online collaborative projects can maintain high quality standards over the long term. The key to the solution is a new metric, Sustainable Success, which measures the ability of collaborative efforts to maintain their quality over time. Taking Wikipedia as a case study, the authors build the SustainPedia dataset, which compiles data from more than 40K Wikipedia articles, including each article's sustainable-success label and over 300 explanatory features such as edit history, user experience, and team composition. Machine learning models trained on this data predict the sustainable success of Wikipedia articles, with the best model achieving an average AU-ROC of 0.88. The analysis shows that the longer an article takes to be recognized as high quality, the more likely it is to maintain that status over time, and that user experience is the most critical predictor of sustainability.
Link: https://arxiv.org/abs/2410.19150
Authors: Abraham Israeli, David Jurgens, Daniel Romero
Keywords: Internet, global collaboration
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:
Abstract:The Internet has significantly expanded the potential for global collaboration, allowing millions of users to contribute to collective projects like Wikipedia. While prior work has assessed the success of online collaborations, most approaches are time-agnostic, evaluating success without considering its longevity. Research on the factors that ensure the long-term preservation of high-quality standards in online collaboration is scarce. In this study, we address this gap. We propose a novel metric, "Sustainable Success," which measures the ability of collaborative efforts to maintain their quality over time. Using Wikipedia as a case study, we introduce the SustainPedia dataset, which compiles data from over 40K Wikipedia articles, including each article's sustainable success label and more than 300 explanatory features such as edit history, user experience, and team composition. Using this dataset, we develop machine learning models to predict the sustainable success of Wikipedia articles. Our best-performing model achieves a high AU-ROC score of 0.88 on average. Our analysis reveals important insights. For example, we find that the longer an article takes to be recognized as high-quality, the more likely it is to maintain that status over time (i.e., be sustainable). Additionally, user experience emerged as the most critical predictor of sustainability. Our analysis provides insights into broader collective actions beyond Wikipedia (e.g., online activism, crowdsourced open-source software), where the same social dynamics that drive success on Wikipedia might play a role. We make all data and code used for this study publicly available for further research.
[NLP-42] Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant EMNLP
【Quick Read】: This paper revisits knowledge-aware text-based visual question answering (Text-KVQA) in light of modern large multimodal models (LMMs). The key to the solution is two modules: (i) VisTEL, a principled visual text entity linking approach that combines a state-of-the-art visual text recognition engine with the power of an LMM to jointly reason over textual and visual context obtained from surrounding cues in the image, linking the visual text entity to the correct knowledge-base entity; and (ii) KaLMA, a knowledge-aware LMM assistant that augments the LMM with knowledge associated with the visual text entity in the image to arrive at an accurate answer. Averaged over the three splits of Text-KVQA, the approach surpasses the previous best method by a substantial 23.3% on an absolute scale, establishing a new state of the art.
Link: https://arxiv.org/abs/2410.19144
Authors: Abhirama Subramanyam Penamakuri, Anand Mishra
Keywords: visual text entity linking, large multimodal models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to EMNLP (Main) 2024
Abstract:We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL - a principled approach to perform visual text entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal model to jointly reason using textual and visual context obtained using surrounding cues in the image to link the visual text entity to the correct knowledge base entity. (ii) We present KaLMA - a knowledge-aware large multimodal assistant that augments an LMM with knowledge associated with visual text entity in the image to arrive at an accurate answer. Further, we provide a comprehensive experimental analysis and comparison of our approach with traditional visual question answering, pre-large multimodal models, and large multimodal models, as well as prior top-performing approaches. Averaging over three splits of Text-KVQA, our proposed approach surpasses the previous best approach by a substantial 23.3% on an absolute scale and establishes a new state of the art. We make our implementation publicly available.
[NLP-43] AlignCap: Aligning Speech Emotion Captioning to Human Preferences EMNLP2024
【Quick Read】: This paper addresses the hallucinations and poor generalization to unseen speech exhibited by existing speech emotion captioning (SEC) methods. The key to the solution is AlignCap, which aligns speech emotion captioning to human preferences on top of a large language model (LLM) with two properties: 1) Speech-Text Alignment, which uses knowledge distillation (KD) regularization to minimize the divergence between the LLM's response prediction distributions for speech and text inputs; and 2) Human Preference Alignment, which designs Preference Optimization (PO) regularization to eliminate factuality and faithfulness hallucinations. AlignCap additionally extracts emotional clues as prompts to enrich fine-grained information under KD regularization. Experiments show that AlignCap outperforms other state-of-the-art methods on the zero-shot SEC task.
Link: https://arxiv.org/abs/2410.19134
Authors: Ziqi Liang, Haoxiang Shi, Hanhui Chen
Keywords: Speech Emotion Captioning, active research task
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to EMNLP 2024 main conference
Abstract:Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech are often complex, and classifying them into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which Aligning Speech Emotion Captioning to Human Preferences based on large language model (LLM) with two properties: 1) Speech-Text Alignment, which minimizing the divergence between the LLM’s response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization. 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD-Regularization. Experiments demonstrate that AlignCap presents stronger performance to other state-of-the-art methods on Zero-shot SEC task.
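The Speech-Text Alignment term can be written down directly from the abstract: minimize the divergence between the LLM's next-token distributions for a speech input and its paired transcript. A standard temperature-scaled KL distillation form is sketched below; the exact divergence used in the paper may differ.

```python
import torch.nn.functional as F

def speech_text_kd_loss(speech_logits, text_logits, tau=1.0):
    """Pull the distribution predicted from speech toward the one predicted
    from the paired text (the text branch acts as the teacher)."""
    p_text = F.softmax(text_logits / tau, dim=-1).detach()
    log_p_speech = F.log_softmax(speech_logits / tau, dim=-1)
    return F.kl_div(log_p_speech, p_text, reduction="batchmean") * tau ** 2
```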
[NLP-44] Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
【Quick Read】: This paper addresses the expense, slowness, and high variance of directly collecting human preference data for training language models (LMs). The key to the solution is a routing framework that combines inputs from humans and LMs to achieve better annotation quality while reducing the total cost of human annotation. The crux is identifying which preference instances benefit from human annotation, formulated as an optimization problem: given a preference dataset and an evaluation metric, a performance prediction model is trained to predict a reward model's performance on an arbitrary combination of human and LM annotations, and a routing strategy selects the combination that maximizes predicted performance. Using a hybrid of LM and direct human preferences chosen by this framework yields better reward-model performance than using either source exclusively.
Link: https://arxiv.org/abs/2410.19133
Authors: Lester James V. Miranda, Yizhong Wang, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, Pradeep Dasigi
Keywords: human preferences, alignment of language models
Subjects: Computation and Language (cs.CL)
Comments: Code at this https URL, MultiPref dataset at this https URL
Abstract:Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, directly collecting human preferences can be expensive, time-consuming, and can have high variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations as they are more consistent, cheaper, and scale better than human annotation; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality, while reducing the total cost of human annotation. The crux of our approach is to identify preference instances that will benefit from human annotations. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model’s performance on an arbitrary combination of human and LM annotations and employ a routing strategy that selects a combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the selected hybrid mixture of LM and direct human preferences using our routing framework achieves better reward model performance compared to using either one exclusively. We simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. We analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.
[NLP-45] Retrieving Implicit and Explicit Emotional Events Using Large Language Models
【Quick Read】: This paper addresses evaluating how well large language models (LLMs) perform implicit and explicit emotion retrieval in commonsense settings. The key to the solution is a supervised contrastive probing method that systematically evaluates LLM performance on implicit and explicit emotion retrieval as well as the diversity of the emotional events retrieved, offering valuable insights into the strengths and limitations of LLMs in handling emotion retrieval.
Link: https://arxiv.org/abs/2410.19128
Authors: Guimin Hu
Keywords: large language models, emotion retrieval
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have garnered significant attention in recent years due to their impressive performance. While considerable research has evaluated these models from various perspectives, the extent to which LLMs can perform implicit and explicit emotion retrieval remains largely unexplored. To address this gap, this study investigates LLMs’ emotion retrieval capabilities in commonsense. Through extensive experiments involving multiple models, we systematically evaluate the ability of LLMs on emotion retrieval. Specifically, we propose a supervised contrastive probing method to verify LLMs’ performance for implicit and explicit emotion retrieval, as well as the diversity of the emotional events they retrieve. The results offer valuable insights into the strengths and limitations of LLMs in handling emotion retrieval.
[NLP-46] Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design NEURIPS2024
【Quick Read】: This paper addresses the inefficient memory management and suboptimal batching that Mixture-of-Experts (MoE) LLMs suffer during inference, caused by misaligned design choices between the model architecture and system policies. The key to the solution is Read-ME, a framework that transforms pre-trained dense LLMs into smaller MoE models, avoiding the high cost of training from scratch. Read-ME uses activation sparsity to extract experts and, after showing the redundancy of the widely adopted layer-wise router design, introduces a pre-gating router decoupled from the MoE backbone that enables system-friendly pre-computation and lookahead scheduling, improving expert-aware batching and caching. This algorithm-system codesign provides a scalable, efficient alternative for LLM inference in resource-constrained settings.
Link: https://arxiv.org/abs/2410.19123
Authors: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang
Keywords: large language models, specialized subnetworks, efficiency and performance
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models (in contrast to “upcycling” generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely-adopted layer-wise router design and show its redundancy, and thus we introduce the pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computing and lookahead scheduling, enhancing expert-aware batching and caching. Our codesign therefore addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency up to 6.1%. Codes are available at: this https URL.
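Because the pre-gating router is decoupled from the MoE backbone, expert choices for a whole batch can be computed before the backbone runs, which is what makes lookahead scheduling and expert-aware batching possible. A minimal sketch of that pre-computation (the function name and plan layout are ours):

```python
import torch

def pregate_schedule(router, hidden_states, top_k=2):
    """Compute a batch's expert assignments up front so the system can
    prefetch expert weights and group tokens by expert.
    hidden_states: [num_tokens, dim]; router returns [num_tokens, num_experts]."""
    with torch.no_grad():
        scores = router(hidden_states)
        chosen = scores.topk(top_k, dim=-1).indices       # experts per token
    plan = {}                                             # expert id -> token ids
    for e in chosen.unique().tolist():
        plan[e] = (chosen == e).any(dim=-1).nonzero(as_tuple=True)[0]
    return plan
```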
[NLP-47] LLM Tree Search
【Quick Read】: This paper addresses generating high-quality, diverse sequences with large language models (LLMs). The key to the solution is adapting the AlphaGo paradigm: building search trees over possible completions and scoring paths according to the model's confidence in each completion, using confidence as a proxy for response quality in a manner akin to beam search. By considering many possible variations, the model can find an ideal completion; the authors argue this can raise output quality, reduce errors, eliminate or mitigate compound-error problems, generate diverse and creative completions, and enable iterative problem-solving and self-training.
Link: https://arxiv.org/abs/2410.19117
Authors: Dylan Wilson
Keywords: sequence generation method, large language models, model confidence
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:This project aims to investigate a novel sequence generation method inspired by the AlphaGo paradigm, adapting it for use with large language models (LLMs). The proposed approach involves creating search trees of different possible completions and evaluating these completions based on model confidence. By considering various paths in the search tree and scoring them according to the model's confidence in each completion, we can generate diverse and high-quality sequences. This research explores the implementation of this paradigm by using confidence as a proxy for response quality akin to beam search (Vijayakumar et al., 2016). The primary goal of this paper is to outline the paradigm and demonstrate its potential, rather than focusing on achieving perfect results. The paper will outline the reasons why we believe this paradigm has the potential to improve LLMs in the following manners: 1) increase output quality, 2) decrease errors, 3) eliminate or reduce the compound error problems, 4) generate diverse and creative completions, 5) allow for iterative problem-solving, and 6) self-training. We expect this approach to yield a set of diverse and coherent sequences, offering insights into balancing exploration and exploitation in sequence generation. Potential applications include creative text generation tasks, such as storytelling and content creation, as well as other natural language processing domains, like machine translation and automated summarization. The goal is that the model will be far more effective as it will be able to consider many possible variations allowing it to find the ideal completion. This research aims to contribute to the understanding of effective search strategies in sequence generation and their impact on generating high-quality, varied textual outputs.
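A compact version of the proposed confidence-scored search: grow a tree of candidate continuations and keep the paths with the highest cumulative log-probability. `expand` stands in for a call to the LM returning its top next-token candidates with log-probabilities; token-level expansion is a simplifying assumption (chunk- or sentence-level steps would work the same way).

```python
import heapq

def confidence_tree_search(expand, prompt, beam=4, max_depth=3):
    """Search a tree of completions scored by model confidence
    (summed token log-probabilities); higher score = more confident."""
    frontier = [(0.0, prompt)]               # (negative cumulative logprob, text)
    for _ in range(max_depth):
        candidates = []
        for neg_lp, seq in frontier:
            for tok, lp in expand(seq)[:beam]:
                candidates.append((neg_lp - lp, seq + tok))
        frontier = heapq.nsmallest(beam, candidates)   # most confident paths
    return [seq for _, seq in sorted(frontier)]        # best completion first
```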
[NLP-48] RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework EMNLP2024
【Quick Read】: This paper addresses the challenge of controlling language models to generate text with desired attributes. The key to the solution is RSA-Control, a training-free controllable text generation framework grounded in pragmatics. RSA-Control directs generation by recursively reasoning between imaginary speakers and listeners, increasing the likelihood that target attributes are correctly interpreted by listeners amid distractors, and it introduces a self-adjustable rationality parameter that automatically tunes control strength based on context. Experiments with two task types and two types of language models show strong attribute control while maintaining language fluency and content consistency.
Link: https://arxiv.org/abs/2410.19109
Authors: Yifan Wang, Vera Demberg
Keywords: natural language generation, desired attributes
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2024 (main conference)
Abstract:Despite significant advancements in natural language generation, controlling language models to produce texts with desired attributes remains a formidable challenge. In this work, we introduce RSA-Control, a training-free controllable text generation framework grounded in pragmatics. RSA-Control directs the generation process by recursively reasoning between imaginary speakers and listeners, enhancing the likelihood that target attributes are correctly interpreted by listeners amidst distractors. Additionally, we introduce a self-adjustable rationality parameter, which allows for automatic adjustment of control strength based on context. Our experiments, conducted with two task types and two types of language models, demonstrate that RSA-Control achieves strong attribute control while maintaining language fluency and content consistency. Our code is available at this https URL.
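The recursive speaker-listener reasoning behind RSA-Control follows the Rational Speech Acts model; a minimal numeric sketch of one reasoning step is below. The matrix layout is our illustration; in RSA-Control the distributions come from the language model and the rationality parameter is self-adjusted rather than fixed.

```python
import numpy as np

def rsa_step(literal, prior, alpha=2.0):
    """literal[u, a]: base probability of utterance u given attribute a;
    prior[a]: prior over attributes. Returns a pragmatic speaker P(u | a)
    that favors utterances a listener would decode as the intended
    attribute among distractors."""
    listener = literal * prior                       # ~ P(u|a) * P(a)
    listener /= listener.sum(axis=1, keepdims=True)  # P(a|u): infer the attribute
    speaker = literal * listener ** alpha            # alpha sharpens the control
    return speaker / speaker.sum(axis=0, keepdims=True)
```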
[NLP-49] Watermarking Large Language Models and the Generated Content: Opportunities and Challenges
【Quick Read】: This paper addresses intellectual-property infringement and the spread of machine-generated misinformation caused by generative large language models (LLMs). The key to the solution is watermarking, used to establish ownership, prevent unauthorized use, and trace the origins of LLM-generated content. The paper surveys techniques for watermarking LLMs themselves under different threat models and scenarios, as well as watermarking methods for LLM-generated content, assessing their effectiveness and resilience against various attacks. It also highlights the importance of watermarking domain-specific models and data (such as those used in code generation, chip design, and medical applications) and explores ways to make watermarking more efficient, such as hardware acceleration. Finally, it discusses the limitations of current approaches and outlines future research directions for the responsible use and protection of these generative AI tools.
Link: https://arxiv.org/abs/2410.19096
Authors: Ruisi Zhang, Farinaz Koushanfar
Keywords-EN: large language models, powerful generative large, generative large language, machine-generated misinformation, widely adopted
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Invited paper to the Asilomar Conference on Signals, Systems, and Computers
Abstract:The widely adopted and powerful generative large language models (LLMs) have raised concerns about intellectual property rights violations and the spread of machine-generated misinformation. Watermarking serves as a promising approach to establish ownership, prevent unauthorized use, and trace the origins of LLM-generated content. This paper summarizes and shares the challenges and opportunities we found when watermarking LLMs. We begin by introducing techniques for watermarking LLMs themselves under different threat models and scenarios. Next, we investigate watermarking methods designed for the content generated by LLMs, assessing their effectiveness and resilience against various attacks. We also highlight the importance of watermarking domain-specific models and data, such as those used in code generation, chip design, and medical applications. Furthermore, we explore methods like hardware acceleration to improve the efficiency of the watermarking process. Finally, we discuss the limitations of current approaches and outline future research directions for the responsible use and protection of these generative AI tools.
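As one concrete instance of the content-watermarking family the survey covers, here is a minimal sketch of "green list" watermark detection in the style of Kirchenbauer et al. (2023). The hashing scheme, vocabulary size, and z-score test below are illustrative assumptions rather than the survey's own algorithm.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5

def is_green(prev_token_id: int, token_id: int) -> bool:
    # Pseudorandomly partition the vocabulary, seeded by the previous token.
    h = hashlib.sha256(f"{prev_token_id}:{token_id}".encode()).digest()
    return int.from_bytes(h[:8], "big") % 100 < GREEN_FRACTION * 100

def detect(token_ids: list) -> float:
    # z-score: how far the observed green-token count deviates from what
    # unwatermarked text would produce by chance.
    green = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    mean = n * GREEN_FRACTION
    var = n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (green - mean) / math.sqrt(var)

# Text whose sampler boosted green-list logits shows z >> 2; random ids do not.
print(detect([random.randrange(VOCAB_SIZE) for _ in range(200)]))
```

At generation time the corresponding embedder would add a small bias to green-list logits, so watermarked text accumulates a statistically detectable excess of green tokens.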
[NLP-50] GCoder: Improving Large Language Model for Generalized Graph Problem Solving
【Quick Read】: This paper targets the unverifiable steps, limited long-term reasoning, and poor generalization to graph variations of the traditional reasoning-steps paradigm for graph computation problems. The key to the solution is GCoder, a code-based large language model (LLM) designed to strengthen problem solving on generalized graph computation tasks, built by constructing an extensive training dataset, GraphWild, and applying a multi-stage training process combining Supervised Fine-Tuning (SFT) and Reinforcement Learning from Compiler Feedback (RLCF). For unseen tasks, a hybrid retrieval technique further boosts performance. Experiments show that GCoder outperforms GPT-4o with an average accuracy improvement of 16.42% across various graph computation problems, while efficiently handling large-scale graphs and diverse input formats.
Link: https://arxiv.org/abs/2410.19084
Authors: Qifan Zhang, Xiaobin Hong, Jianheng Tang, Nuo Chen, Yuhan Li, Wenzhong Li, Jing Tang, Jia Li
Keywords-EN: Large Language Models, Large Language, strong reasoning abilities, demonstrated strong reasoning, Language Models
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated strong reasoning abilities, making them suitable for complex tasks such as graph computation. Traditional reasoning steps paradigm for graph problems is hindered by unverifiable steps, limited long-term reasoning, and poor generalization to graph variations. To overcome these limitations, we introduce GCoder, a code-based LLM designed to enhance problem-solving in generalized graph computation problems. Our method involves constructing an extensive training dataset, GraphWild, featuring diverse graph formats and algorithms. We employ a multi-stage training process, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Compiler Feedback (RLCF), to refine model capabilities. For unseen tasks, a hybrid retrieval technique is used to augment performance. Experiments demonstrate that GCoder outperforms GPT-4o, with an average accuracy improvement of 16.42% across various graph computational problems. Furthermore, GCoder efficiently manages large-scale graphs with millions of nodes and diverse input formats, overcoming the limitations of previous models focused on the reasoning steps paradigm. This advancement paves the way for more intuitive and effective graph problem-solving using LLMs. Code and data are available at here: this https URL.
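The execute-and-score loop behind compiler/interpreter feedback can be illustrated with a toy example. The candidate programs below stand in for LLM samples, and the reward scheme (0 for crashes, partial credit for running, full credit for the right answer on a check graph) is an assumed simplification of RLCF's actual reward design.

```python
CHECK_ADJ = {0: [1], 1: [0], 2: []}   # a graph with two connected components
EXPECTED = 2

candidates = [
    # Stand-ins for LLM-sampled programs; each must define solve(adj).
    "def solve(adj):\n    return len(adj)",   # runs, but wrong answer
    """
def solve(adj):
    seen, comps = set(), 0
    for s in adj:
        if s in seen:
            continue
        comps += 1
        stack = [s]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(adj[u])
    return comps
""",                                          # correct DFS component count
]

def reward(program: str) -> float:
    ns = {}
    try:
        exec(program, ns)                 # "compile + run" feedback signal
        return 1.0 if ns["solve"](CHECK_ADJ) == EXPECTED else 0.2
    except Exception:
        return 0.0                        # does not even execute

for i, prog in enumerate(candidates):
    print(f"candidate {i}: reward {reward(prog)}")
```

A reinforcement-learning stage can then push the model toward programs that compile, run, and verify, which is exactly what makes code an attractive, checkable output format for graph problems.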
[NLP-51] Infogent: An Agent-Based Framework for Web Information Aggregation
【Quick Read】: This paper addresses web information aggregation for complex queries, where an agent must explore different websites to gather the information needed to answer them. The key to the solution is Infogent, a modular framework with three core components: a Navigator that moves between webpages, an Extractor that pulls information from them, and an Aggregator that consolidates what was extracted. Evaluated under two information-access settings, Direct API-Driven Access and Interactive Visual Access, Infogent beats an existing state-of-the-art multi-agent search framework by 7% on FRAMES and improves over an existing information-seeking web agent by 4.3% on AssistantBench, demonstrating its effectiveness.
Link: https://arxiv.org/abs/2410.19054
Authors: Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, Heng Ji
Keywords-EN: marks task completion, navigation task consists, Interactive Visual Access, seemingly performant web, web navigation task
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint
Abstract:Despite seemingly performant web agents on the task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-driven Access relies on a text-only view of the Web, leveraging external tools such as Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.
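A structural sketch of the Navigator/Extractor/Aggregator decomposition may help. All three components below are toy stand-ins (the real ones are LLM-driven and operate on live webpages or screenshots), and the stopping rule is an assumed placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Aggregator:
    notes: list = field(default_factory=list)

    def add(self, snippet: str) -> None:
        if snippet and snippet not in self.notes:   # keep only novel information
            self.notes.append(snippet)

    def satisfied(self, query: str) -> bool:
        return len(self.notes) >= 3                 # toy sufficiency criterion

def navigator(query: str, step: int) -> str:
    # Decide which page to visit next; a real Navigator browses or calls a search API.
    return f"https://example.org/search?q={query}&page={step}"

def extractor(url: str) -> str:
    # Pull query-relevant text from the page; a real Extractor reads HTML or screenshots.
    return f"fact extracted from {url}"

def aggregate_information(query: str, max_steps: int = 10) -> list:
    agg = Aggregator()
    for step in range(max_steps):
        url = navigator(query, step)
        agg.add(extractor(url))
        if agg.satisfied(query):
            break
    return agg.notes

print(aggregate_information("capital of Australia"))
```

The point of the decomposition is that navigation, extraction, and aggregation can each fail or be swapped independently, which is what makes the framework work across both API-driven and visual access settings.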
[NLP-52] O1 Replication Journey: A Strategic Progress Report – Part 1
【Quick Read】: This paper tackles several critical issues in modern AI research, including the insularity of team projects, delayed information sharing, and insufficient recognition of diverse contributions. The key to the solution is the transparent, real-time research methodology of the "O1 Replication Journey", which comprehensively documents the replication effort as it happens, including both successes and failures, to foster open science, accelerate collective progress, and lay the groundwork for AI-driven scientific discovery. Technically, the paper proposes the journey learning paradigm, which encourages models to learn not just shortcuts but the complete exploration process, including trial and error, reflection, and backtracking. With only 327 training samples and no additional tricks, journey learning outperforms conventional supervised learning on the MATH dataset by over 8%, demonstrating its strong potential.
Link: https://arxiv.org/abs/2410.18982
Authors: Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, Pengfei Liu
Keywords-EN: artificial intelligence research, introduces a pioneering, pioneering approach, approach to artificial, artificial intelligence
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey. In response to the announcement of OpenAI’s groundbreaking O1 model, we embark on a transparent, real-time exploration to replicate its capabilities while reimagining the process of conducting and communicating AI research. Our methodology addresses critical challenges in modern AI research, including the insularity of prolonged team-based projects, delayed information sharing, and the lack of recognition for diverse contributions. By providing comprehensive, real-time documentation of our replication efforts, including both successes and failures, we aim to foster open science, accelerate collective advancement, and lay the groundwork for AI-driven scientific discovery. Our research progress report diverges significantly from traditional research papers, offering continuous updates, full process transparency, and active community engagement throughout the research journey. Technologically, we proposed the journey learning paradigm, which encourages models to learn not just shortcuts, but the complete exploration process, including trial and error, reflection, and backtracking. With only 327 training samples and without any additional tricks, journey learning outperformed conventional supervised learning by over 8% on the MATH dataset, demonstrating its extremely powerful potential. We believe this to be the most crucial component of O1 technology that we have successfully decoded. We share valuable resources including technical hypotheses and insights, cognitive exploration maps, custom-developed tools, etc., at this https URL.
[NLP-53] Stick-breaking Attention
【Quick Read】: This paper addresses the limited length generalization of the conventional self-attention mechanism on long sequences, in particular its reliance on the softmax operation and positional embeddings such as RoPE. The key to the solution is an alternative attention mechanism based on the stick-breaking process: for each token preceding the current one, a break point \beta_{i,j} determines the proportion of the remaining stick allocated to that token, and the process repeats until the stick is fully allocated, yielding a sequence of attention weights. This naturally introduces a recency bias that has linguistic motivations for grammar parsing, and the method is made numerically stable and adapted to Flash Attention. Experiments show that stick-breaking attention performs competitively on length generalization and downstream tasks, with models trained on a 2^11 context window generalizing well to 2^14, with perplexity improvements.
Link: https://arxiv.org/abs/2410.17980
Authors: Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda
Keywords-EN: necessitating positional embeddings, self-attention mechanism traditionally, mechanism traditionally relies, necessitating positional, traditionally relies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods using these still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: For each token before the current one, we determine a break point \beta_{i,j}, which represents the proportion of the remaining stick to allocate to the current token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss the implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with a 2^11 context window to perform well at 2^14, with perplexity improvements.
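The stick-breaking computation itself is small enough to show directly. Below is a numerically naive NumPy sketch for a single query position; as the abstract notes, a practical implementation would work in log space and integrate with Flash Attention.

```python
import numpy as np

def stick_breaking_weights(logits: np.ndarray) -> np.ndarray:
    """logits[j] is the attention logit from the current token i to token j < i."""
    beta = 1.0 / (1.0 + np.exp(-logits))      # break points beta_{i,j}
    weights = np.zeros_like(beta)
    remaining = 1.0
    for j in reversed(range(len(beta))):      # most recent token claims first
        weights[j] = beta[j] * remaining      # A_{i,j} = beta_{i,j} * prod_{k>j} (1 - beta_{i,k})
        remaining *= 1.0 - beta[j]
    return weights                            # sums to <= 1; no softmax needed

w = stick_breaking_weights(np.array([0.5, 0.5, 0.5, 0.5]))
print(w, w.sum())   # at equal logits, more recent positions get larger weights
```

The built-in recency bias is visible in the output: with identical logits, the most recent token takes the largest share of the stick, and each earlier token only gets a fraction of what remains.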
[NLP-54] Scaling Law with Learning Rate Annealing DATE
【Quick Read】: This paper addresses loss-curve prediction during neural language model training: how to accurately predict the loss curve under an arbitrary learning-rate scheduler (LRS) without extensive computation. The key to the solution is a new loss-curve formulation, L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2, where L(s) is the validation loss at training step s, S_1 is the area under the learning-rate curve, S_2 is the learning-rate annealing area, and L_0, A, C, \alpha are constant parameters. The formulation combines power-law scaling over data size with the additional loss reduction during learning-rate annealing, so it describes the loss curve over the whole of training rather than only the final loss point. Fitting just one or two training curves suffices to accurately predict the loss at any step under any learning-rate scheduler, greatly reducing computational cost while improving the accuracy and expressiveness of training-dynamics modeling.
Link: https://arxiv.org/abs/2408.11029
Authors: Howe Tissue, Venus Wang, Lu Wang
Keywords-EN: language models empirically, models empirically adhere, neural language models, cross-entropy loss curves, constant parameters
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Add more experiments to consolidate our scaling laws. 29 pages, 29 figures
Abstract:We find that the cross-entropy loss curves of neural language models empirically adhere to a scaling law with learning rate (LR) annealing over training steps: L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2, where L(s) is the validation loss at step s, S_1 is the area under the LR curve, S_2 is the LR annealing area, and L_0, A, C, \alpha are constant parameters. This formulation takes into account two factors: (1) power-law scaling over data size, and (2) the additional loss reduction during LR annealing. Therefore, this formulation can describe the full loss curve at each step, rather than the single loss point at the end of training. Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step across any learning rate scheduler (LRS). This approach significantly reduces computational cost in formulating scaling laws while providing more accuracy and expressiveness for training dynamics. Extensive experiments demonstrate that our findings hold across a range of hyper-parameters and model architectures, and our equation can extend to the scaling effect of model sizes. Moreover, our formulation provides accurate theoretical verification and explanation for empirical results observed in numerous previous studies, particularly those focusing on LR schedule and annealing. We believe that this work is promising to enhance the understanding of LLM training dynamics while greatly democratizing scaling laws, and it can guide researchers in refining training strategies (e.g. critical LRS) for further LLMs.
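Fitting the formula to a measured loss curve is straightforward with scipy. The sketch below reads S_1 as the cumulative sum of the learning rate and S_2 as the cumulative LR decrease, which is one simple reading of the abstract's definitions (the paper's full form may decay the annealing term), and fits the four constants on a synthetic curve standing in for real training logs.

```python
import numpy as np
from scipy.optimize import curve_fit

np.random.seed(0)
steps = np.arange(1, 10_001)
# Constant LR of 3e-4, annealed linearly to zero over the last 2000 steps.
lr = np.where(steps < 8000, 3e-4, 3e-4 * (10_000 - steps) / 2000)
S1 = np.cumsum(lr)                                            # area under LR curve
S2 = np.cumsum(np.maximum(np.concatenate([[0.0], -np.diff(lr)]), 0.0))  # LR decrease

def law(X, L0, A, C, alpha):
    s1, s2 = X
    return L0 + A * s1 ** (-alpha) - C * s2

# Synthetic "observed" losses standing in for a real training curve.
observed = law((S1, S2), 2.0, 0.5, 50.0, 0.4) + np.random.normal(0, 1e-3, steps.size)
params, _ = curve_fit(law, (S1, S2), observed, p0=(1.0, 1.0, 1.0, 0.5), maxfev=20_000)
print(dict(zip(["L0", "A", "C", "alpha"], params.round(3))))
```

Once the four constants are fitted, L(s) can be evaluated under any candidate LR schedule by recomputing S_1 and S_2 for that schedule, which is what makes the law useful for comparing schedulers without retraining.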
[NLP-55] MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
【Quick Read】: This paper addresses the shortcomings of audio understanding models on complex tasks that demand expert-level knowledge. The key to the solution is MMAU (Multimodal Audio Understanding), a new benchmark of 10,000 carefully curated audio clips paired with human-annotated natural-language questions and answers spanning speech, environmental sounds, and music. MMAU emphasizes advanced perception and domain-knowledge-grounded reasoning, requiring models to demonstrate 27 distinct skills on unique and challenging tasks akin to those faced by experts. An evaluation of 18 open-source and proprietary audio-language models shows how challenging MMAU is: even the advanced Gemini Pro v1.5 reaches only 52.97% accuracy and the open-source Qwen2-Audio only 52.50%, indicating that audio understanding models still have substantial room for improvement on complex tasks.
Link: https://arxiv.org/abs/2410.19168
Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
Keywords-EN: ability to comprehend, agents to interact, interact effectively, audio understanding models, non-speech sounds
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Project Website: this https URL
Abstract:The ability to comprehend audio–which includes speech, non-speech sounds, and music–is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
Artificial Intelligence
[AI-0] The Potential and Value of AI Chatbot in Personalized Cognitive Training
Link: https://arxiv.org/abs/2410.19733
Authors: Zilong Wang, Nan Chen, Luna K. Qiu, Ling Yue, Geli Guo, Yang Ou, Shiqi Jiang, Yuqing Yang, Lili Qiu
Keywords-EN: presenting significant public, public health challenges, Alzheimer disease, significant public health, recent years
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, the rapid aging of the global population has led to an increase in cognitive disorders, such as Alzheimer’s disease, presenting significant public health challenges. Although no effective treatments currently exist to reverse Alzheimer’s, prevention and early intervention, including cognitive training, are critical. This report explores the potential of AI chatbots in enhancing personalized cognitive training. We introduce ReMe, a web-based framework designed to create AI chatbots that facilitate cognitive training research, specifically targeting episodic memory tasks derived from personal life logs. By leveraging large language models, ReMe provides enhanced user-friendly, interactive, and personalized training experiences. Case studies demonstrate ReMe’s effectiveness in engaging users through life recall and open-ended language puzzles, highlighting its potential to improve cognitive training design. Despite promising results, further research is needed to validate training effectiveness through large-scale studies that include cognitive ability evaluations. Overall, ReMe offers a promising approach to personalized cognitive training, utilizing AI capabilities to meet the growing demand for non-pharmacological interventions in cognitive health, with future research aiming to expand its applications and efficacy.
[AI-1] Sparse Decomposition of Graph Neural Networks
Link: https://arxiv.org/abs/2410.19723
Authors: Yaochen Hu, Mai Zeng, Ge Zhang, Pavel Rumiantsev, Liheng Ma, Yingxue Zhang, Mark Coates
Keywords-EN: exhibit superior performance, Graph Neural Networks, exhibit superior, inference cost, superior performance
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Graph Neural Networks (GNN) exhibit superior performance in graph representation learning, but their inference cost can be high, due to an aggregation operation that can require a memory fetch for a very large number of nodes. This inference cost is the major obstacle to deploying GNN models with online prediction to reflect the potentially dynamic node features. To address this, we propose an approach to reduce the number of nodes that are included during aggregation. We achieve this through a sparse decomposition, learning to approximate node representations using a weighted sum of linearly transformed features of a carefully selected subset of nodes within the extended neighbourhood. The approach achieves linear complexity with respect to the average node degree and the number of layers in the graph neural network. We introduce an algorithm to compute the optimal parameters for the sparse decomposition, ensuring an accurate approximation of the original GNN model, and present effective strategies to reduce the training time and improve the learning process. We demonstrate via extensive experiments that our method outperforms other baselines designed for inference speedup, achieving significant accuracy gains with comparable inference times for both node classification and spatio-temporal forecasting tasks.
[AI-2] Arabic Music Classification and Generation using Deep Learning
Link: https://arxiv.org/abs/2410.19719
Authors: Mohamed Elshaarawy, Ashrakat Saeed, Mariam Sheta, Abdelrahman Said, Asem Bakr, Omar Bahaa, Walid Gomaa
Keywords-EN: classical Egyptian music, Egyptian music, machine learning approach, music, classical Egyptian
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:This paper proposes a machine learning approach for classifying classical and new Egyptian music by composer and generating new similar music. The proposed system utilizes a convolutional neural network (CNN) for classification and a CNN autoencoder for generation. The dataset used in this project consists of new and classical Egyptian music pieces composed by different composers. To classify the music by composer, each sample is normalized and transformed into a mel spectrogram. The CNN model is trained on the dataset using the mel spectrograms as input features and the composer labels as output classes. The model achieves 81.4% accuracy in classifying the music by composer, demonstrating the effectiveness of the proposed approach. To generate new music similar to the original pieces, a CNN autoencoder is trained on a similar dataset. The model is trained to encode the mel spectrograms of the original pieces into a lower-dimensional latent space and then decode them back into the original mel spectrogram. The generated music is produced by sampling from the latent space and decoding the samples back into mel spectrograms, which are then transformed into audio. In conclusion, the proposed system provides a promising approach to classifying and generating classical Egyptian music, which can be applied in various musical applications, such as music recommendation systems, music production, and music education.
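The audio-to-mel-spectrogram-to-CNN pipeline described here can be sketched in a few lines of PyTorch. The file path, composer count, and network size below are placeholders rather than the paper's architecture; librosa's melspectrogram is one standard way to compute the input feature.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_input(path: str, sr: int = 22050, n_mels: int = 128) -> torch.Tensor:
    y, _ = librosa.load(path, sr=sr, duration=10.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    mel_db = (mel_db - mel_db.mean()) / (mel_db.std() + 1e-6)   # normalize
    return torch.from_numpy(mel_db).float()[None, None]          # (1, 1, mels, T)

class ComposerCNN(nn.Module):
    def __init__(self, n_composers: int = 8):    # assumed number of composers
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_composers)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = ComposerCNN()
logits = model(mel_input("track.wav"))   # "track.wav" is a placeholder path
print(logits.softmax(dim=-1))
```

The generation half of the system would replace the classification head with a decoder that reconstructs the mel spectrogram from a latent code, following the same input representation.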
[AI-3] Adversarial Environment Design via Regret-Guided Diffusion Models
Link: https://arxiv.org/abs/2410.19715
Authors: Hojun Chung, Junseo Lee, Minsoo Kim, Dohyeong Kim, Songhwai Oh
Keywords-EN: deep reinforcement learning, reinforcement learning, environmental changes remains, remains a significant, significant challenge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 38th Conference on Neural Information Processing Systems
Abstract:Training agents that are robust to environmental changes remains a significant challenge in deep reinforcement learning (RL). Unsupervised environment design (UED) has recently emerged to address this issue by generating a set of training environments tailored to the agent’s capabilities. While prior works demonstrate that UED has the potential to learn a robust policy, their performance is constrained by the capabilities of the environment generation. To this end, we propose a novel UED algorithm, adversarial environment design via regret-guided diffusion models (ADD). The proposed method guides the diffusion-based environment generator with the regret of the agent to produce environments that the agent finds challenging but conducive to further improvement. By exploiting the representation power of diffusion models, ADD can directly generate adversarial environments while maintaining the diversity of training environments, enabling the agent to effectively learn a robust policy. Our experimental results demonstrate that the proposed method successfully generates an instructive curriculum of environments, outperforming UED baselines in zero-shot generalization across novel, out-of-distribution environments. Project page: this https URL
[AI-4] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Link: https://arxiv.org/abs/2410.19702
Authors: Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang
Keywords-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive performance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in short video understanding. However, understanding long-form videos still remains challenging for MLLMs. This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework to process long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction tuning task to explicitly incorporate grounding supervision in the traditional QA format. Specifically, based on VideoChat, we propose our long-video MLLM, coined VideoChat-T, by implementing token shuffling to compress long video tokens and introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of visual representations. Meanwhile, we introduce TimePro, a comprehensive grounding-centric instruction tuning dataset composed of 9 tasks and 349k high-quality grounded annotations. Notably, we design a new instruction tuning task type, called Temporal Grounded Caption, to perform detailed video descriptions with corresponding timestamp prediction. This explicit temporal location prediction will guide the MLLM to correctly attend to the visual content when generating descriptions, and thus reduce the hallucination risk caused by the LLMs. Experimental results demonstrate that our TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs, achieving improvements of 5.6% and 6.8% on the benchmarks of Egoschema and VideoMME, respectively. In addition, VideoChat-T exhibits robust zero-shot temporal grounding capabilities, significantly outperforming the existing state-of-the-art MLLMs. After fine-tuning, it performs on par with traditional supervised expert models.
[AI-5] Enhancing Resilience and Scalability in Travel Booking Systems: A Microservices Approach to Fault Tolerance, Load Balancing and Service Discovery
Link: https://arxiv.org/abs/2410.19701
Authors: Biman Barua, M. Shamim Kaiser
Keywords-EN: reliable airline reservation, airline reservation systems, paper investigates, investigates the inclusion, scalable and reliable
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 18 pages, 3 figures
Abstract:This paper investigates the inclusion of microservices architecture in the development of scalable and reliable airline reservation systems. Most traditional reservation systems are rigid and centralized, which makes them prone to bottlenecks and single points of failure; such systems do not meet the requirements of modern airlines, which are dynamic. Microservices offer better resiliency and scalability because the services do not depend on one another and can be deployed independently. The approach is grounded in the Circuit Breaker Pattern to maintain fault tolerance while consuming foreign resources such as flight APIs and payment systems. This reduced failure propagation across the system by 60%, enabling the system to keep functioning under external failures. Traffic rerouting bolstered this further, guaranteeing above 99.95% uptime in systems where high availability was demanded. Load balancing was also used, particularly the Round-Robin method, which enhanced performance by 35% through the equal distribution of user requests among the service instances. Health checks and real-time monitoring helped with failure management as well, containing failures before users of the system were affected. The results suggest that the use of microservices led to a 40% increase in system scalability, a 50% decrease in downtime, and support for 30% more concurrent users than monolithic architectures. These findings affirm the capability of microservices for developing robust and flexible airline ticket booking systems that are responsive to change and can recover from external system unavailability.
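The Circuit Breaker pattern and Round-Robin balancing that the paper relies on are both compact enough to sketch. The thresholds, cooldown, and flight-API stub below are illustrative assumptions, not the paper's configuration.

```python
import time
from itertools import cycle

class CircuitBreaker:
    """After `max_failures` consecutive errors the breaker opens and calls fail
    fast, until a cooldown elapses and a trial call is allowed through."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None                  # half-open: allow one trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                          # success closes the breaker
        return result

def flight_api(route: str) -> dict:               # stand-in external dependency
    raise TimeoutError("upstream flight API down")

breaker = CircuitBreaker()
for _ in range(5):
    try:
        breaker.call(flight_api, "JFK-LHR")
    except Exception as e:
        print(type(e).__name__, e)

# Round-Robin load balancing is equally small at its core:
instances = cycle(["svc-a", "svc-b", "svc-c"])
print([next(instances) for _ in range(5)])   # svc-a, svc-b, svc-c, svc-a, svc-b
```

After three timeouts the breaker trips, so the remaining calls fail fast instead of blocking on the dead dependency; this is the mechanism behind the reduced failure propagation the abstract reports.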
[AI-6] MILES: Making Imitation Learning Easy with Self-Supervision
Link: https://arxiv.org/abs/2410.19693
Authors: Georgios Papagiannis, Edward Johns
Keywords-EN: laborious human supervision, frequent environment resets, Data collection, single demonstration, self-supervised data collection
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at the Conference on Robot Learning (CoRL) 2024
Abstract:Data collection in imitation learning often requires significant, laborious human supervision, such as numerous demonstrations, and/or frequent environment resets for methods that incorporate reinforcement learning. In this work, we propose an alternative approach, MILES: a fully autonomous, self-supervised data collection paradigm, and we show that this enables efficient policy learning from just a single demonstration and a single environment reset. MILES autonomously learns a policy for returning to and then following the single demonstration, whilst being self-guided during data collection, eliminating the need for additional human interventions. We evaluated MILES across several real-world tasks, including tasks that require precise contact-rich manipulation such as locking a lock with a key. We found that, under the constraints of a single demonstration and no repeated environment resetting, MILES significantly outperforms state-of-the-art alternatives like imitation learning methods that leverage reinforcement learning. Videos of our experiments and code can be found on our webpage: this http URL.
[AI-7] Deep learning-based identification of patients at increased risk of cancer using routine laboratory markers
Link: https://arxiv.org/abs/2410.19646
Authors: Vivek Singh, Shikha Chaganti, Matthias Siebert, Soumya Rajesh, Andrei Puiu, Raj Gopalan, Jamie Gramz, Dorin Comaniciu, Ali Kamen
Keywords-EN: costly treatments due, Early screening, late diagnosis, proven to improve, improve the survival
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Early screening for cancer has proven to improve the survival rate and spare patients from intensive and costly treatments due to late diagnosis. Cancer screening in the healthy population involves an initial risk stratification step to determine the screening method and frequency, primarily to optimize resource allocation by targeting screening towards individuals who draw most benefit. For most screening programs, age and clinical risk factors such as family history are part of the initial risk stratification algorithm. In this paper, we focus on developing a blood marker-based risk stratification approach, which could be used to identify patients with elevated cancer risk to be encouraged for taking a diagnostic test or participate in a screening program. We demonstrate that the combination of simple, widely available blood tests, such as complete blood count and complete metabolic panel, could potentially be used to identify patients at risk for colorectal, liver, and lung cancers with areas under the ROC curve of 0.76, 0.85, 0.78, respectively. Furthermore, we hypothesize that such an approach could not only be used as pre-screening risk assessment for individuals but also as population health management tool, for example to better interrogate the cancer risk in certain sub-populations.
[AI-8] Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites
Link: https://arxiv.org/abs/2410.19643
Authors: Nicolás Nieto, Simon B. Eickhoff, Christian Jung, Martin Reuter, Kersten Diers, Malte Kelm, Artur Lichtenberg, Federico Raimondo, Kaustubh R. Patil
Keywords-EN: Machine learning, models benefit, benefit from large, Machine, data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine learning (ML) models benefit from large datasets. Collecting data in biomedical domains is costly and challenging, hence, combining datasets has become a common practice. However, datasets obtained under different conditions could present undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of popularly used ComBat-based methods for harmonizing data in scenarios where the class balance is not equal across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
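The leakage mechanism studied here is easy to see in code: a harmonizer fitted on the full dataset has already "seen" the test subjects. The sketch below fits the harmonization step inside each cross-validation fold on training data only, using a per-site StandardScaler as a simple stand-in for ComBat-style harmonization (an assumed simplification, not the paper's method).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
site = rng.integers(0, 2, 200)
y = rng.integers(0, 2, 200)
X[site == 1] += 0.5                # site-specific shift we want to remove

def harmonize(X_fit, sites_fit, X_apply, sites_apply):
    out = X_apply.copy()
    for s in np.unique(sites_fit):
        sel = sites_apply == s
        if sel.any():
            scaler = StandardScaler().fit(X_fit[sites_fit == s])  # train rows only
            out[sel] = scaler.transform(X_apply[sel])
    return out

for tr, te in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
    X_tr = harmonize(X[tr], site[tr], X[tr], site[tr])
    X_te = harmonize(X[tr], site[tr], X[te], site[te])   # no peeking at test rows
    clf = LogisticRegression().fit(X_tr, y[tr])
    print(f"fold accuracy: {clf.score(X_te, y[te]):.2f}")
```

The leaky variant would call `harmonize` once on the full `X` before splitting; when class balance differs across sites, that single change can inflate apparent performance, which is the failure mode the paper documents.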
[AI-9] VARS: Vision-based Assessment of Risk in Security Systems
Link: https://arxiv.org/abs/2410.19642
Authors: Pranav Gupta, Pratham Gohil, Sridhar S
Keywords-EN: security systems, content is critical, critical for enhancing, enhancing safety, safety and security
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The accurate prediction of danger levels in video content is critical for enhancing safety and security systems, particularly in environments where quick and reliable assessments are essential. In this study, we perform a comparative analysis of various machine learning and deep learning models to predict danger ratings in a custom dataset of 100 videos, each containing 50 frames, annotated with human-rated danger scores ranging from 0 to 10. The danger ratings are further classified into two categories: no alert (less than 7) and high alert (greater than or equal to 7). Our evaluation covers classical machine learning models, such as Support Vector Machines, as well as Neural Networks and transformer-based models. Model performance is assessed using standard metrics such as accuracy, F1-score, and mean absolute error (MAE), and the results are compared to identify the most robust approach. This research contributes to developing a more accurate and generalizable danger assessment framework for video-based risk detection.
[AI-10] Planning-Aware Diffusion Networks for Enhanced Motion Forecasting in Autonomous Driving
Link: https://arxiv.org/abs/2410.19639
Authors: Liu Yunhao, Ding Hong, Zhang Ziming, Wang Huixin, Liu Jinzhao, Xi Suyang
Keywords-EN: significant advancements, fail to fully, fully capture, capture the complexity, interactions between dynamic
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by CoRL Workshop Leap 2024
Abstract:Autonomous driving technology has seen significant advancements, but existing models often fail to fully capture the complexity of multi-agent environments, where interactions between dynamic agents are critical. To address this, we propose the Planning-Integrated Forecasting Model (PIFM), a novel framework inspired by neural mechanisms governing decision-making and multi-agent coordination in the brain. PIFM leverages rich contextual information, integrating road structures, traffic rules, and the behavior of surrounding vehicles to improve both the accuracy and interpretability of predictions. By adopting a diffusion-based architecture, akin to neural diffusion processes involved in predicting and planning, PIFM is able to forecast future trajectories of all agents within a scenario. This architecture enhances model transparency, as it parallels the brain's method of dynamically adjusting predictions based on external stimuli and other agents' behaviors. Extensive experiments validate PIFM's capacity to provide interpretable, neuroscience-driven solutions for safer and more efficient autonomous driving systems, with an extremely low number of parameters.
[AI-11] Knowledge Graph Enhanced Language Agents for Recommendation
Link: https://arxiv.org/abs/2410.19627
Authors: Taicheng Guo, Chaochun Liu, Hai Wang, Varun Mannam, Fang Wang, Xin Chen, Xiangliang Zhang, Chandan K. Reddy
Keywords-EN: simulate human behavior, Language agents, Knowledge Graph Enhanced, Graph Enhanced Language, Enhanced Language Agents
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:
Abstract:Language agents have recently been used to simulate human behavior and user-item interactions for recommendation systems. However, current language agent simulations do not understand the relationships between users and items, leading to inaccurate user profiles and ineffective recommendations. In this work, we explore the utility of Knowledge Graphs (KGs), which contain extensive and reliable relationships between users and items, for recommendation. Our key insight is that the paths in a KG can capture complex relationships between users and items, eliciting the underlying reasons for user preferences and enriching user profiles. Leveraging this insight, we propose Knowledge Graph Enhanced Language Agents(KGLA), a framework that unifies language agents and KG for recommendation systems. In the simulated recommendation scenario, we position the user and item within the KG and integrate KG paths as natural language descriptions into the simulation. This allows language agents to interact with each other and discover sufficient rationale behind their interactions, making the simulation more accurate and aligned with real-world cases, thus improving recommendation performance. Our experimental results show that KGLA significantly improves recommendation performance (with a 33%-95% boost in NDCG@1 among three widely used benchmarks) compared to the previous best baseline method.
[AI-12] Shared Control with Black Box Agents using Oracle Queries
Link: https://arxiv.org/abs/2410.19612
Authors: Inbal Avraham, Reuth Mirsky
Keywords-EN: control problems involve, Shared control, Shared control problems, shared control policy, involve a robot
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Shared control problems involve a robot learning to collaborate with a human. When learning a shared control policy, short communication between the agents can often significantly reduce running times and improve the system’s accuracy. We extend the shared control problem to include the ability to directly query a cooperating agent. We consider two types of potential responses to a query, namely oracles: one that can provide the learner with the best action they should take, even when that action might be myopically wrong, and one with a bounded knowledge limited to its part of the system. Given this additional information channel, this work further presents three heuristics for choosing when to query: reinforcement learning-based, utility-based, and entropy-based. These heuristics aim to reduce a system’s overall learning cost. Empirical results on two environments show the benefits of querying to learn a better control policy and the tradeoffs between the proposed heuristics.
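Of the three proposed heuristics, the entropy-based one is the simplest to sketch: query the oracle only when the learner's action distribution is too uncertain. The softmax policy over Q-values and the threshold below are assumed toy choices, not the paper's exact formulation.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def should_query(q_values: np.ndarray, threshold: float = 0.8) -> bool:
    # Softmax over Q-values gives the learner's current action distribution.
    p = np.exp(q_values - q_values.max())
    p /= p.sum()
    return entropy(p) > threshold    # uncertain -> spend a query on the oracle

print(should_query(np.array([1.0, 0.9, 1.1])))   # near-uniform -> True (query)
print(should_query(np.array([5.0, 0.0, 0.0])))   # confident -> False (act alone)
```

The utility-based and RL-based heuristics replace this fixed threshold with, respectively, an estimated value of information and a learned querying policy, trading implementation simplicity for query efficiency.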
[AI-13] CoqPilot, a plugin for LLM-based generation of proofs
Link: https://arxiv.org/abs/2410.19605
Authors: Andrei Kozyrev, Gleb Solovev, Nikita Khramov, Anton Podkopaev
Keywords-EN: Code extension designed, extension designed, automate writing, Coq, Coq proof generation
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Published in the proceedings of the ASE'24 Tool Demonstrations Track
Abstract:We present CoqPilot, a VS Code extension designed to help automate writing of Coq proofs. The plugin collects the parts of proofs marked with the admit tactic in a Coq file, i.e., proof holes, and combines LLMs along with non-machine-learning methods to generate proof candidates for the holes. Then, CoqPilot checks if each proof candidate solves the given subgoal and, if successful, replaces the hole with it. The focus of CoqPilot is twofold. Firstly, we want to allow users to seamlessly combine multiple Coq generation approaches and provide a zero-setup experience for our tool. Secondly, we want to deliver a platform for LLM-based experiments on Coq proof generation. We developed a benchmarking system for Coq generation methods, available in the plugin, and conducted an experiment using it, showcasing the framework’s possibilities. Demo of CoqPilot is available at: this https URL. Code at: this https URL
[AI-14] Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
Link: https://arxiv.org/abs/2410.19560
Authors: Shentong Mo, Shengbang Tong
Keywords-EN: Joint-Embedding Predictive Architecture, innovative masking strategy, Exponential Moving Average, extracting visual features, visual representation learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments:
Abstract:In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the inefficacy of Exponential Moving Average (EMA) from I-JEPA in preventing entire collapse and the inadequacy of I-JEPA prediction in accurately learning the mean of patch representations. Addressing these challenges, this study introduces a novel framework, namely C-JEPA (Contrastive-JEPA), which integrates the Image-based Joint-Embedding Predictive Architecture with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration is designed to effectively learn the variance/covariance for preventing entire collapse and ensuring invariance in the mean of augmented views, thereby overcoming the identified limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA exhibits rapid and improved convergence in both linear probing and fine-tuning performance metrics.
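The VICReg regularizer that C-JEPA integrates decomposes into three terms: variance (keep each embedding dimension spread out, preventing collapse), invariance (match the two views), and covariance (decorrelate dimensions). The PyTorch sketch below implements the standard formulation; the loss weights are the commonly used defaults, stated here as assumptions rather than the paper's exact settings.

```python
import torch

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z1.shape
    inv = torch.nn.functional.mse_loss(z1, z2)           # invariance term
    var, cov = 0.0, 0.0
    for z in (z1, z2):
        std = torch.sqrt(z.var(dim=0) + eps)
        var = var + torch.relu(1.0 - std).mean()         # variance hinge term
        zc = z - z.mean(dim=0)
        c = (zc.T @ zc) / (n - 1)                        # covariance matrix
        off_diag = c - torch.diag(torch.diag(c))
        cov = cov + off_diag.pow(2).sum() / d            # covariance term
    return sim_w * inv + var_w * var + cov_w * cov

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)   # embeddings of two views
print(vicreg_loss(z1, z2))
```

The variance hinge is the piece that directly targets the entire-collapse failure mode the abstract attributes to EMA in I-JEPA: if any dimension's standard deviation falls below one, the loss pushes it back up.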
[AI-15] On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes NEURIPS2023
Link: https://arxiv.org/abs/2410.19553
Authors: Rajat Modi, Vibhav Vineet, Yogesh Singh Rawat
Keywords-EN: video action detection, action detection, paper explores, explores the impact, video action
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This paper was accepted to NeurIPS 2023 Dataset And Benchmark Track. It also showcases: Hinton's Islands of Agreement on realistic datasets which were previously hypothesized in his GLOM paper
Abstract:This paper explores the impact of occlusions in video action detection. We facilitate this study by introducing five new benchmark datasets, namely O-UCF and O-JHMDB consisting of synthetically controlled static/dynamic occlusions, OVIS-UCF and OVIS-JHMDB consisting of occlusions with realistic motions, and Real-OUCF for occlusions in real-world scenarios. We formally confirm an intuitive expectation: existing models suffer a lot as occlusion severity is increased and exhibit different behaviours when occluders are static vs. when they are moving. We discover several intriguing phenomena emerging in neural nets: 1) transformers can naturally outperform CNN models which might even have used occlusion as a form of data augmentation during training, 2) incorporating symbolic components like capsules into such backbones allows them to bind to occluders never even seen during training, and 3) islands of agreement can emerge in realistic images/videos without instance-level supervision, distillation or contrastive-based objectives (e.g. video-textual training). Such emergent properties allow us to derive simple yet effective training recipes which lead to robust occlusion models inductively satisfying the first two stages of the binding mechanism (grouping/segregation). Models leveraging these recipes outperform existing video action detectors under occlusion by 32.3% on O-UCF, 32.7% on O-JHMDB and 2.6% on Real-OUCF in terms of the vMAP metric. The code for this work has been released at this https URL.
[AI-16] DeMuVGN: Effective Software Defect Prediction Model by Learning Multi-view Software Dependency via Graph Neural Networks
Link: https://arxiv.org/abs/2410.19550
Authors: Yu Qiao, Lina Gong, Yu Zhao, Yongwei Wang, Mingqiang Wei
Keywords-EN: optimizing resource allocation, identify high-risk defect, aims to identify, optimizing resource, resource allocation
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Software defect prediction (SDP) aims to identify high-risk defect modules in software development, optimizing resource allocation. While previous studies show that dependency network metrics improve defect prediction, most methods focus on code-based dependency graphs, overlooking developer factors. Current metrics, based on handcrafted features like ego and global network metrics, fail to fully capture defect-related information. To address this, we propose DeMuVGN, a defect prediction model that learns multi-view software dependency via graph neural networks. We introduce a Multi-view Software Dependency Graph (MSDG) that integrates data, call, and developer dependencies. DeMuVGN also leverages the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance and enhance defect module identification. In a case study of eight open-source projects across 20 versions, DeMuVGN demonstrates significant improvements: i) models based on multi-view graphs improve F1 scores by 11.1% to 12.1% over single-view models; ii) DeMuVGN improves F1 scores by 17.4% to 45.8% in within-project contexts and by 17.9% to 41.0% in cross-project contexts. Additionally, DeMuVGN excels in software evolution, showing more improvement in later-stage software versions. Its strong performance across different projects highlights its generalizability. We recommend future research focus on multi-view dependency graphs for defect prediction in both mature and newly developed projects.
[AI-17] Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Link: https://arxiv.org/abs/2410.19546
Authors: Antonia Wüst, Tim Tobiasch, Lukas Helff, Devendra S. Dhami, Constantin A. Rothkopf, Kristian Kersting
Keywords-EN: newly developed Vision-Language, seemingly demonstrating advanced, developed Vision-Language Models, demonstrating advanced reasoning, advanced reasoning capabilities
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Recently, newly developed Vision-Language Models (VLMs), such as OpenAI’s GPT-4o, have emerged, seemingly demonstrating advanced reasoning capabilities across text and image modalities. Yet, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess the progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classical visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. While VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter, failing to understand and reason about visual concepts. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, even when asked to explicitly focus on and analyze these concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. These observations underscore the current limitations of VLMs, emphasize that a significant gap remains between human-like visual reasoning and machine cognition, and highlight the ongoing need for innovation in this area.
[AI-18] PMM-Net: Single-stage Multi-agent Trajectory Prediction with Patching-based Embedding and Explicit Modal Modulation
Link: https://arxiv.org/abs/2410.19544
Authors: Huajian Liu, Wei Dong, Kunpeng Fan, Chao Wang, Yongzhuo Gao
Keywords-EN: embodied intelligent applications, intelligent applications, pedestrians plays, plays a pivotal, embodied intelligent
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Analyzing and forecasting trajectories of agents like pedestrians plays a pivotal role for embodied intelligent applications. The inherent indeterminacy of human behavior and complex social interaction among a rich variety of agents make this task more challenging than common time-series forecasting. In this letter, we aim to explore a distinct formulation for multi-agent trajectory prediction framework. Specifically, we proposed a patching-based temporal feature extraction module and a graph-based social feature extraction module, enabling effective feature extraction and cross-scenario generalization. Moreover, we reassess the role of social interaction and present a novel method based on explicit modality modulation to integrate temporal and social features, thereby constructing an efficient single-stage inference pipeline. Results on public benchmark datasets demonstrate the superior performance of our model compared with the state-of-the-art methods. The code is available at: this http URL.
[AI-19] CloserMusicDB: A Modern Multipurpose Dataset of High Quality Music
Link: https://arxiv.org/abs/2410.19540
Authors: Aleksandra Piekarzewicz, Tomasz Sroka, Aleksander Tym, Mateusz Modrzejewski
Keywords-EN: full length studio, length studio quality, studio quality tracks, quality tracks annotated, introduce CloserMusicDB
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:
Abstract:In this paper, we introduce CloserMusicDB, a collection of full length studio quality tracks annotated by a team of human experts. We describe the selected qualities of our dataset, along with three example tasks possible to perform using this dataset: hook detection, contextual tagging and artist identification. We conduct baseline experiments and provide initial benchmarks for these tasks.
[AI-20] DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unsupervised Dimensionality Reduction
Link: https://arxiv.org/abs/2410.19504
Authors: Zelin Zang, Yuhao Wang, Jinlin Wu, Hong Liu, Yue Shen, Stan Z. Li, Zhen Lei
Keywords-EN: retaining essential information, Dimensionality reduction, including data engineering, simplifying complex datasets, engineering and visualization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 8 figures
Abstract:Dimensionality reduction (DR) plays a crucial role in various fields, including data engineering and visualization, by simplifying complex datasets while retaining essential information. However, the challenge of balancing DR accuracy and interpretability remains crucial, particularly for users dealing with high-dimensional data. Traditional DR methods often face a trade-off between precision and transparency, where optimizing for performance can lead to reduced interpretability, and vice versa. This limitation is especially prominent in real-world applications such as image, tabular, and text data analysis, where both accuracy and interpretability are critical. To address these challenges, this work introduces the MOE-based Hyperbolic Interpretable Deep Manifold Transformation (DMT-HI). The proposed approach combines hyperbolic embeddings, which effectively capture complex hierarchical structures, with Mixture of Experts (MOE) models, which dynamically allocate tasks based on input features. DMT-HI enhances DR accuracy by leveraging hyperbolic embeddings to represent the hierarchical nature of data, while also improving interpretability by explicitly linking input data, embedding outcomes, and key features through the MOE structure. Extensive experiments demonstrate that DMT-HI consistently achieves superior performance in both DR accuracy and model interpretability, making it a robust solution for complex data analysis. The code is available at this https URL.
[AI-21] Peter Parker or Spiderman? Disambiguating Multiple Class Labels NEURIPS2024
Link: https://arxiv.org/abs/2410.19479
Authors: Nuthan Mummani, Simran Ketha, Venkatakrishnan Ramaswamy
Keywords-EN: supervised classification setting, classification setting, make multiple predictions, supervised classification, deep networks typically
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to Neural Information Processing Systems (NeurIPS 2024). ATTRIB Workshop
Abstract:In the supervised classification setting, during inference, deep networks typically make multiple predictions. For a pair of such predictions (that are in the top-k predictions), two distinct possibilities might occur. On the one hand, each of the two predictions might be primarily driven by two distinct sets of entities in the input. On the other hand, it is possible that there is a single entity or set of entities that is driving the prediction for both the classes in question. This latter case, in effect, corresponds to the network making two separate guesses about the identity of a single entity type. Clearly, both the guesses cannot be true, i.e. both the labels cannot be present in the input. Current techniques in interpretability research do not readily disambiguate these two cases, since they typically consider input attributions for one class label at a time. Here, we present a framework and method to do so, leveraging modern segmentation and input attribution techniques. Notably, our framework also provides a simple counterfactual “proof” of each case, which can be verified for the input on the model (i.e. without running the method again). We demonstrate that the method performs well for a number of samples from the ImageNet validation set and on multiple models.
[AI-22] Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization
Link: https://arxiv.org/abs/2410.19471
Authors: Ryan Park, Darren J. Hsu, C. Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, Bruno Trentini
Keywords-EN: Inverse folding models, predicting amino acid, Inverse folding, desired reference structures, amino acid sequences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint. 10 pages plus appendices
Abstract:Inverse folding models play an important role in structure-based design by predicting amino acid sequences that fold into desired reference structures. Models like ProteinMPNN, a message-passing encoder-decoder model, are trained to reliably produce new sequences from a reference structure. However, when applied to peptides, these models are prone to generating repetitive sequences that do not fold into the reference structure. To address this, we fine-tune ProteinMPNN to produce diverse and structurally consistent peptide sequences via Direct Preference Optimization (DPO). We derive two enhancements to DPO: online diversity regularization and domain-specific priors. Additionally, we develop a new understanding on improving diversity in decoder models. When conditioned on OpenFold generated structures, our fine-tuned models achieve state-of-the-art structural similarity scores, improving base ProteinMPNN by at least 8%. Compared to standard DPO, our regularized method achieves up to 20% higher sequence diversity with no loss in structural similarity score.
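As background, the sketch below shows a DPO-style preference loss with a diversity penalty added; the paper's actual online diversity regularization and domain-specific priors are not reproduced here, and `div_weight` and the pairwise cosine-similarity term are illustrative assumptions.

```python
# Minimal sketch, assuming log-probs of chosen (w) and rejected (l) sequences
# under the policy and a frozen reference model, plus sequence embeddings.
import torch
import torch.nn.functional as F

def dpo_diversity_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       seq_embeddings, beta=0.1, div_weight=0.01):
    # Standard DPO: prefer the chosen sequence over the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(margin).mean()
    # Hypothetical diversity term: penalize high pairwise cosine similarity
    # among sampled sequences in the batch.
    sims = F.cosine_similarity(seq_embeddings.unsqueeze(0),
                               seq_embeddings.unsqueeze(1), dim=-1)
    off_diag = sims - torch.eye(seq_embeddings.shape[0])
    return dpo + div_weight * off_diag.clamp(min=0).mean()
```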
[AI-23] LOCAL: Learning with Orientation Matrix to Infer Causal Structure from Time Series Data
链接: https://arxiv.org/abs/2410.19464
作者: Yue Cheng,Jiajun Zhang,Weiwei Xing,Xiaoyu Guo,Xiaohui Gao
关键词-EN: underlying Directed Acyclic, Directed Acyclic Graph, Directed Acyclic, time series observational, complex nonlinear interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 10 pages, 7 figures
点击查看摘要
Abstract:Discovering the underlying Directed Acyclic Graph (DAG) from time series observational data is highly challenging due to the dynamic nature and complex nonlinear interactions between variables. Existing methods often struggle with inefficiency and the handling of high-dimensional data. To address this research gap, we propose LOCAL, a highly efficient, easy-to-implement, and constraint-free method for recovering dynamic causal structures. LOCAL is the first attempt to formulate a quasi-maximum likelihood-based score function for learning the dynamic DAG equivalent to the ground truth. On this basis, we propose two adaptive modules for enhancing the algebraic characterization of acyclicity with new capabilities: Asymptotic Causal Mask Learning (ACML) and Dynamic Graph Parameter Learning (DGPL). ACML generates causal masks using learnable priority vectors and the Gumbel-Sigmoid function, ensuring the creation of DAGs while optimizing computational efficiency. DGPL transforms causal learning into decomposed matrix products, capturing the dynamic causal structure of high-dimensional data and enhancing interpretability. Extensive experiments on synthetic and real-world datasets demonstrate that LOCAL significantly outperforms existing methods, and highlight LOCAL’s potential as a robust and efficient method for dynamic causal discovery. Our code will be available soon.
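To make the ACML idea concrete, here is a hedged sketch of producing a (near-)binary acyclic mask from a learnable priority vector with a Gumbel-Sigmoid relaxation; the exact parameterization in LOCAL may differ from this pairwise-difference construction.

```python
# Sketch only: a mask derived from a strict ordering of node priorities is
# acyclic in the hard (low-temperature) limit.
import torch

def gumbel_sigmoid(logits, tau=1.0):
    # Logistic noise = difference of two Gumbel samples.
    g1 = -torch.log(-torch.log(torch.rand_like(logits)))
    g2 = -torch.log(-torch.log(torch.rand_like(logits)))
    return torch.sigmoid((logits + g1 - g2) / tau)

d = 5
priority = torch.randn(d, requires_grad=True)             # learnable priorities
logits = priority.unsqueeze(0) - priority.unsqueeze(1)    # logits[i, j] = p[j] - p[i]
mask = gumbel_sigmoid(logits) * (1 - torch.eye(d))        # edge i -> j if p[i] < p[j]
```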
[AI-24] EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
链接: https://arxiv.org/abs/2410.19461
作者: Xuetian Chen,Hangcheng Li,Jiaqing Liang,Sihang Jiang,Deqing Yang
关键词-EN: graphical user interfaces, Autonomous agents operating, applications hold immense, hold immense practical, user interfaces
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the approaches using large vision-language models (LVLMs) are more intuitive and adaptable as they can visually perceive and directly interact with screens, making them indispensable in general scenarios without text metadata and tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interacting capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks demonstrate that the model trained with the dataset generated through EDGE exhibits superior webpage understanding capabilities, which can then be easily transferred to previously unseen desktop and mobile environments. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work. Our source code, the dataset and the model are available at this https URL.
[AI-25] Accelerating AI Performance using Anderson Extrapolation on GPUs NEURIPS2024
链接: https://arxiv.org/abs/2410.19460
作者: Saleem Abdul Fattah Ahmed Al Dajani,David E. Keyes
关键词-EN: leveraging Anderson extrapolation, mapping technique based, Anderson extrapolation, leveraging Anderson, mapping technique
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Numerical Analysis (math.NA)
*备注: 6 pages, 6 figures, 1 table, Accepted by NeurIPS 2024 Workshop MLNCP this https URL
点击查看摘要
Abstract:We present a novel approach for accelerating AI performance by leveraging Anderson extrapolation, a vector-to-vector mapping technique based on a window of historical iterations. By identifying the crossover point where a mixing penalty is incurred, the method focuses on reducing iterations to convergence, with fewer, more compute-intensive but generally cacheable iterations, balancing speed and memory usage with accuracy and algorithmic stability, respectively. We demonstrate significant improvements, in both training and inference, motivated by scalability and efficiency extensions to the realm of high-performance computing (HPC).
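For readers unfamiliar with the technique, this is textbook Anderson acceleration for a fixed-point iteration x = g(x), with a window of m historical iterates; it illustrates the mechanism the abstract refers to, not the authors' GPU implementation.

```python
import numpy as np

def anderson(g, x0, m=5, iters=50, tol=1e-10):
    xs, gs = [x0], [g(x0)]
    fs = [gs[0] - xs[0]]                      # residuals f_k = g(x_k) - x_k
    x = gs[0]                                 # first step: plain fixed point
    for _ in range(iters):
        xs.append(x); gs.append(g(x)); fs.append(gs[-1] - xs[-1])
        if np.linalg.norm(fs[-1]) < tol:
            break
        mk = min(m, len(fs) - 1)              # effective window size
        dF = np.stack([fs[-mk + j] - fs[-mk + j - 1] for j in range(mk)], axis=1)
        dG = np.stack([gs[-mk + j] - gs[-mk + j - 1] for j in range(mk)], axis=1)
        # Least-squares mixing weights over the history window.
        gamma, *_ = np.linalg.lstsq(dF, fs[-1], rcond=None)
        x = gs[-1] - dG @ gamma
    return x

print(anderson(np.cos, np.array([1.0])))      # fixed point of cos: ~0.739
```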
[AI-26] Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration
链接: https://arxiv.org/abs/2410.19450
作者: Hai Zhong,Xun Wang,Zhuoran Li,Longbo Huang
关键词-EN: Reinforcement Learning, Multi-Agent Reinforcement Learning, leveraging offline data, powerful paradigm, data for initialization
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm, leveraging offline data for initialization and online fine-tuning to enhance both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shifts during the transition from offline-to-online phases, and (ii) the difficulty of efficient exploration in the large joint state-action space. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring smoother transitions, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively utilizes the pre-trained offline policy for exploration, thereby significantly reducing the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
[AI-27] Gradient Descent Efficiency Index
链接: https://arxiv.org/abs/2410.19448
作者: Aviral Dhingra
关键词-EN: finding local minima, widely used iterative, finding local, local minima, multivariate functions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 12 Pages, 3 Figures
点击查看摘要
Abstract:Gradient descent is a widely used iterative algorithm for finding local minima in multivariate functions. However, the final iterations often either overshoot the minima or make minimal progress, making it challenging to determine an optimal stopping point. This study introduces a new efficiency metric, Ek, designed to quantify the effectiveness of each iteration. The proposed metric accounts for both the relative change in error and the stability of the loss function across iterations. This measure is particularly valuable in resource-constrained environments, where costs are closely tied to training time. Experimental validation across multiple datasets and models demonstrates that Ek provides valuable insights into the convergence behavior of gradient descent, complementing traditional performance metrics. The index has the potential to guide more informed decisions in the selection and tuning of optimization algorithms in machine learning applications and be used to compare the “effectiveness” of models relative to each other.
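The abstract does not state the formula for Ek, so the following is only a plausible stand-in that combines the two ingredients it names: the relative change in error and the stability of the loss across recent iterations.

```python
# Hypothetical formulation, not the paper's definition of Ek.
import numpy as np

def efficiency_index(losses, window=5):
    losses = np.asarray(losses, dtype=float)
    rel_improvement = (losses[-2] - losses[-1]) / max(losses[-2], 1e-12)
    stability = 1.0 / (1.0 + np.std(losses[-window:]))   # noisy loss -> low stability
    return rel_improvement * stability

history = [2.0, 1.2, 0.9, 0.85, 0.84, 0.839]
print(efficiency_index(history))   # near zero: a natural stopping signal
```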
[AI-28] Expose Before You Defend: Unifying and Enhancing Backdoor Defenses via Exposed Models
链接: https://arxiv.org/abs/2410.19427
作者: Yige Li,Hanxun Huang,Jiaming Zhang,Xingjun Ma,Yu-Gang Jiang
关键词-EN: deep neural networks, Backdoor, covertly implant triggers, attacks covertly implant, neural networks
类目: Artificial Intelligence (cs.AI)
*备注: 19 pages
点击查看摘要
Abstract:Backdoor attacks covertly implant triggers into deep neural networks (DNNs) by poisoning a small portion of the training data with pre-designed backdoor triggers. This vulnerability is exacerbated in the era of large models, where extensive (pre-)training on web-crawled datasets is susceptible to compromise. In this paper, we introduce a novel two-step defense framework named Expose Before You Defend (EBYD). EBYD unifies existing backdoor defense methods into a comprehensive defense system with enhanced performance. Specifically, EBYD first exposes the backdoor functionality in the backdoored model through a model preprocessing step called backdoor exposure, and then applies detection and removal methods to the exposed model to identify and eliminate the backdoor features. In the first step of backdoor exposure, we propose a novel technique called Clean Unlearning (CUL), which proactively unlearns clean features from the backdoored model to reveal the hidden backdoor features. We also explore various model editing/modification techniques for backdoor exposure, including fine-tuning, model sparsification, and weight perturbation. Using EBYD, we conduct extensive experiments on 10 image attacks and 6 text attacks across 2 vision datasets (CIFAR-10 and an ImageNet subset) and 4 language datasets (SST-2, IMDB, Twitter, and AG’s News). The results demonstrate the importance of backdoor exposure for backdoor defense, showing that the exposed models can significantly benefit a range of downstream defense tasks, including backdoor label detection, backdoor trigger recovery, backdoor model detection, and backdoor removal. We hope our work could inspire more research in developing advanced defense frameworks with exposed models. Our code is available at: this https URL.
[AI-29] Robust Time Series Causal Discovery for Agent-Based Model Validation
链接: https://arxiv.org/abs/2410.19412
作者: Gene Yu,Ce Guo,Wayne Luk
关键词-EN: ABM validation, causal discovery, ABM, causal discovery methods, causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Econometrics (econ.EM); Computation (stat.CO)
*备注:
点击查看摘要
Abstract:Agent-Based Model (ABM) validation is crucial as it helps ensure the reliability of simulations, and causal discovery has become a powerful tool in this context. However, current causal discovery methods often face accuracy and robustness challenges when applied to complex and noisy time series data, which is typical in ABM scenarios. This study addresses these issues by proposing a Robust Cross-Validation (RCV) approach to enhance causal structure learning for ABM validation. We develop RCV-VarLiNGAM and RCV-PCMCI, novel extensions of two prominent causal discovery algorithms. These aim to reduce the impact of noise better and give more reliable causal relation results, even with high-dimensional, time-dependent data. The proposed approach is then integrated into an enhanced ABM validation framework, which is designed to handle diverse data and model structures. The approach is evaluated using synthetic datasets and a complex simulated fMRI dataset. The results demonstrate greater reliability in causal structure identification. The study examines how various characteristics of datasets affect the performance of established causal discovery methods. These characteristics include linearity, noise distribution, stationarity, and causal structure density. This analysis is then extended to the RCV method to see how it compares in these different situations. This examination helps confirm whether the results are consistent with existing literature and also reveals the strengths and weaknesses of the novel approaches. By tackling key methodological challenges, the study aims to enhance ABM validation by presenting a more resilient evaluation framework. These improvements increase the reliability of model-driven decision-making processes in complex systems analysis.
[AI-30] Analysis of Financial Risk Behavior Prediction Using Deep Learning and Big Data Algorithms
链接: https://arxiv.org/abs/2410.19394
作者: Haowei Yang,Zhan Cheng,Zhaoyang Zhang,Yuanshuai Luo,Shuaishuai Huang,Ao Xiang
关键词-EN: intricate behavior patterns, financial markets continue, handle large datasets, methods increasingly struggle, risk behavior prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As the complexity and dynamism of financial markets continue to grow, traditional financial risk prediction methods increasingly struggle to handle large datasets and intricate behavior patterns. This paper explores the feasibility and effectiveness of using deep learning and big data algorithms for financial risk behavior prediction. First, the application and advantages of deep learning and big data algorithms in the financial field are analyzed. Then, a deep learning-based big data risk prediction framework is designed and experimentally validated on actual financial datasets. The experimental results show that this method significantly improves the accuracy of financial risk behavior prediction and provides valuable support for risk management in financial institutions. Challenges in the application of deep learning are also discussed, along with potential directions for future research.
[AI-31] Learning Neural Strategy-Proof Matching Mechanism from Examples
链接: https://arxiv.org/abs/2410.19384
作者: Ryota Maruo,Koh Takeuchi,Hisashi Kashima
关键词-EN: Designing effective two-sided, Designing effective, Toggle, effective two-sided matching, two-sided matching mechanisms
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Designing effective two-sided matching mechanisms is a major problem in mechanism design, and the goodness of matching cannot always be formulated. The existing work addresses this issue by searching over a parameterized family of mechanisms with certain properties by learning to fit a human-crafted dataset containing examples of preference profiles and matching results. However, this approach does not consider a strategy-proof mechanism, implicitly assumes the number of agents to be a constant, and does not consider the public contextual information of the agents. In this paper, we propose a new parametric family of strategy-proof matching mechanisms by extending the serial dictatorship (SD). We develop a novel attention-based neural network called NeuralSD, which can learn a strategy-proof mechanism from a human-crafted dataset containing public contextual information. NeuralSD is constructed by tensor operations that make SD differentiable and learns a parameterized mechanism by estimating an order of SD from the contextual information. We conducted experiments to learn a strategy-proof matching from matching examples with different numbers of agents. We demonstrated that our method shows the superiority of learning with context-awareness over a baseline in terms of regression performance and other metrics.
[AI-32] Multi-Agent Reinforcement Learning with Selective State-Space Models
链接: https://arxiv.org/abs/2410.19382
作者: Jemma Daniel,Ruan de Kock,Louay Ben Nessir,Sasha Abramowitz,Omayma Mahjoub,Wiem Khlifi,Claude Formanek,Arnu Pretorius
关键词-EN: Multi-Agent Reinforcement Learning, Reinforcement Learning, Multi-Agent Reinforcement, Transformer model, Multi-Agent Transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 17 pages, 7 figures
点击查看摘要
Abstract:The Transformer model has demonstrated success across a wide range of domains, including in Multi-Agent Reinforcement Learning (MARL) where the Multi-Agent Transformer (MAT) has emerged as a leading algorithm in the field. However, a significant drawback of Transformer models is their quadratic computational complexity relative to input size, making them computationally expensive when scaling to larger inputs. This limitation restricts MAT’s scalability in environments with many agents. Recently, State-Space Models (SSMs) have gained attention due to their computational efficiency, but their application in MARL remains unexplored. In this work, we investigate the use of Mamba, a recent SSM, in MARL and assess whether it can match the performance of MAT while providing significant improvements in efficiency. We introduce a modified version of MAT that incorporates standard and bi-directional Mamba blocks, as well as a novel “cross-attention” Mamba block. Extensive testing shows that our Multi-Agent Mamba (MAM) matches the performance of MAT across multiple standard multi-agent environments, while offering superior scalability to larger agent scenarios. This is significant for the MARL community, because it indicates that SSMs could replace Transformers without compromising performance, whilst also supporting more effective scaling to higher numbers of agents. Our project page is available at this https URL .
[AI-33] BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
链接: https://arxiv.org/abs/2410.19367
作者: Houming Wu,Ling Chen,Wenjie Yu
关键词-EN: efficient distributed training, increasingly urgent, efficient distributed, increasing scale, distributed training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 10 pages, 13 figures
点击查看摘要
Abstract:With the increasing scale of models, the need for efficient distributed training has become increasingly urgent. Recently, many synchronous pipeline parallelism approaches have been proposed to improve training throughput. However, these approaches still suffer from two major issues, i.e., pipeline bubbles caused by periodic flushing and extra communication due to the increasing number of pipeline stages. To this end, we propose BitPipe, a bidirectional interleaved pipeline parallelism for accelerating large models training. Specifically, a hybrid scheme of fusing interleaved pipelines with bidirectional pipelines is proposed to reduce the computational time of each single micro-batch and multiply the number of devices executing simultaneously. A V-shaped schedule with eager gradient synchronization is introduced to reduce and overlap the communication between devices. Experiments conducted on up to 32 GPUs show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches. The code of our implementation is available at this https URL.
[AI-34] Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization
链接: https://arxiv.org/abs/2410.19361
作者: Diana Pfau,Alexander Jung
关键词-EN: increasingly shape critical, shape critical decisions, systems increasingly shape, societal domains, increasingly shape
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:AI systems increasingly shape critical decisions across personal and societal domains. While empirical risk minimization (ERM) drives much of the AI success, it typically prioritizes accuracy over trustworthiness, often resulting in biases, opacity, and other adverse effects. This paper discusses how key requirements for trustworthy AI can be translated into design choices for the components of ERM. We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
[AI-35] LArctan-SKAN: Simple and Efficient Single-Parameterized Kolmogorov-Arnold Networks using Learnable Trigonometric Function
链接: https://arxiv.org/abs/2410.19360
作者: Zhijie Chen,Xinglin Zhang
关键词-EN: Single-Parameterized Kolmogorov-Arnold Networks, designing Single-Parameterized Kolmogorov-Arnold, Kolmogorov-Arnold Networks, designing Single-Parameterized, Single-Parameterized Kolmogorov-Arnold
类目: Artificial Intelligence (cs.AI)
*备注: 7 pages, 3 figures, experiment code is available at this https URL
点击查看摘要
Abstract:This paper proposes a novel approach for designing Single-Parameterized Kolmogorov-Arnold Networks (SKAN) by utilizing a Single-Parameterized Function (SFunc) constructed from trigonometric functions. Three new SKAN variants are developed: LSin-SKAN, LCos-SKAN, and LArctan-SKAN. Experimental validation on the MNIST dataset demonstrates that LArctan-SKAN excels in both accuracy and computational efficiency. Specifically, LArctan-SKAN significantly improves test set accuracy over existing models, outperforming all pure KAN variants compared, including FourierKAN, LSS-SKAN, and Spl-KAN. It also surpasses mixed MLP-based models such as MLP+rKAN and MLP+fKAN in accuracy. Furthermore, LArctan-SKAN exhibits remarkable computational efficiency, with a training speed increase of 535.01% and 49.55% compared to MLP+rKAN and MLP+fKAN, respectively. These results confirm the effectiveness and potential of SKANs constructed with trigonometric functions. The experiment code is available at this https URL .
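As a rough illustration, a single-parameterized arctan basis in the SKAN spirit puts one learnable parameter per edge inside the activation; the exact LArctan-SKAN parameterization may differ from this sketch.

```python
import torch
import torch.nn as nn

class LArctanLayer(nn.Module):
    """Kolmogorov-Arnold style layer: phi_ij(x_j) = arctan(w_ij * x_j)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x):
        # One learnable scale per edge, summed over inputs.
        return torch.atan(x.unsqueeze(1) * self.w).sum(dim=-1)

layer = LArctanLayer(784, 64)
out = layer(torch.randn(32, 784))   # -> (32, 64), e.g. flattened MNIST
```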
[AI-36] Interpreting Neural Networks through Mahalanobis Distance
链接: https://arxiv.org/abs/2410.19352
作者: Alan Oursland
关键词-EN: network linear layers, neural network interpretability, Mahalanobis distance, neural network, connects neural network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 11 pages, October 2024
点击查看摘要
Abstract:This paper introduces a theoretical framework that connects neural network linear layers with the Mahalanobis distance, offering a new perspective on neural network interpretability. While previous studies have explored activation functions primarily for performance optimization, our work interprets these functions through statistical distance measures, a less explored area in neural network research. By establishing this connection, we provide a foundation for developing more interpretable neural network models, which is crucial for applications requiring transparency. Although this work is theoretical and does not include empirical data, the proposed distance-based interpretation has the potential to enhance model robustness, improve generalization, and provide more intuitive explanations of neural network decisions.
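To see the connection being drawn, the snippet below verifies numerically that Mahalanobis distance is just a Euclidean norm after a linear (whitening) map, i.e., something a linear layer can compute; this checks the identity itself and is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # correlated features
mu, cov = data.mean(axis=0), np.cov(data, rowvar=False)

# With cov = L L^T, d(x)^2 = (x - mu)^T cov^{-1} (x - mu) = ||L^{-1}(x - mu)||^2,
# i.e., a fixed linear transform of the centered input.
L_inv = np.linalg.inv(np.linalg.cholesky(cov))
x = data[0]
d_linear = np.linalg.norm(L_inv @ (x - mu))
d_direct = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))
assert np.allclose(d_linear, d_direct)
```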
[AI-37] pEBR: A Probabilistic Approach to Embedding Based Retrieval
链接: https://arxiv.org/abs/2410.19349
作者: Han Zhang,Yunjing Jiang,Mingming Li,Haowei Yuan,Wen-Yun Yang
关键词-EN: Embedding retrieval aims, approximate nearest neighbor, shared semantic representation, semantic representation space, Embedding retrieval
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Embedding retrieval aims to learn a shared semantic representation space for both queries and items, thus enabling efficient and effective item retrieval using approximate nearest neighbor (ANN) algorithms. In current industrial practice, retrieval systems typically retrieve a fixed number of items for different queries, which actually leads to insufficient retrieval (low recall) for head queries and irrelevant retrieval (low precision) for tail queries. Largely due to the prevailing frequentist approach to loss function design, there is still no satisfactory solution that holistically addresses this challenge in industry. In this paper, we move away from the frequentist approach, and take a novel probabilistic approach to embedding-based retrieval (namely pEBR) by learning the item distribution for different queries, which enables a dynamic cosine similarity threshold calculated by the probabilistic cumulative distribution function (CDF) value. The experimental results show that our approach improves both the retrieval precision and recall significantly. Ablation studies also illustrate how the probabilistic approach is able to capture the differences between head and tail queries.
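A toy illustration of a per-query dynamic threshold obtained from an inverse CDF, in contrast to a fixed top-k cutoff; the Gaussian score model and the 95% mass level are assumptions here, whereas pEBR learns the item distribution per query.

```python
import numpy as np
from scipy.stats import norm

def dynamic_threshold(mu, sigma, mass=0.95):
    """Similarity cutoff keeping the top `mass` of a query's score distribution."""
    return norm.ppf(1.0 - mass, loc=mu, scale=sigma)

# A head query with a broad score distribution gets a permissive cutoff,
# while a tail query with a tight distribution gets a stricter one.
print(dynamic_threshold(mu=0.60, sigma=0.15))   # ~0.35
print(dynamic_threshold(mu=0.80, sigma=0.05))   # ~0.72
```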
[AI-38] A prescriptive theory for brain-like inference
链接: https://arxiv.org/abs/2410.19315
作者: Hadi Vafaii,Dekel Galor,Jacob L. Yates
关键词-EN: Evidence Lower Bound, Lower Bound, Evidence Lower, Variational Autoencoders, training deep generative
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:The Evidence Lower Bound (ELBO) is a widely used objective for training deep generative models, such as Variational Autoencoders (VAEs). In the neuroscience literature, an identical objective is known as the variational free energy, hinting at a potential unified framework for brain function and machine learning. Despite its utility in interpreting generative models, including diffusion models, ELBO maximization is often seen as too broad to offer prescriptive guidance for specific architectures in neuroscience or machine learning. In this work, we show that maximizing ELBO under Poisson assumptions for general sequence data leads to a spiking neural network that performs Bayesian posterior inference through its membrane potential dynamics. The resulting model, the iterative Poisson VAE (iP-VAE), has a closer connection to biological neurons than previous brain-inspired predictive coding models based on Gaussian assumptions. Compared to amortized and iterative VAEs, iP-VAE learns sparser representations and exhibits superior generalization to out-of-distribution samples. These findings suggest that optimizing ELBO, combined with Poisson assumptions, provides a solid foundation for developing prescriptive theories in NeuroAI.
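For reference, the standard ELBO the abstract builds on is shown below; iP-VAE replaces the usual Gaussian likelihood and posterior assumptions with Poisson ones.

```latex
\mathcal{L}_{\mathrm{ELBO}}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```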
[AI-39] COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
链接: https://arxiv.org/abs/2410.19313
作者: Haocheng Xi,Han Cai,Ligeng Zhu,Yao Lu,Kurt Keutzer,Jianfei Chen,Song Han
关键词-EN: improving training efficiency, Compressing Optimizer States, training, optimizer states, promising method
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages. 9 Figures. 8 Tables
点击查看摘要
Abstract:FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while leaving optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies. Experiments demonstrate that COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across various tasks, such as Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43x end-to-end training speedup compared to BF16, performing on par with or surpassing TransformerEngine’s speedup. COAT enables efficient full-parameter training of large models on fewer GPUs, and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. The code is available at this https URL.
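A minimal sketch of per-group quantization, the generic mechanism behind COAT's mixed-granularity activation scheme; int8 stands in for FP8 for ease of illustration, and the group size of 128 is an assumption.

```python
import torch

def quantize_per_group(t, group_size=128, qmax=127):
    flat = t.reshape(-1, group_size)
    scale = flat.abs().amax(dim=1, keepdim=True) / qmax     # one scale per group
    q = torch.clamp(torch.round(flat / scale), -qmax, qmax).to(torch.int8)
    return q, scale

x = torch.randn(4, 256)
q, scale = quantize_per_group(x)
x_hat = (q.float() * scale).reshape(x.shape)                # dequantize
print((x - x_hat).abs().max())                              # small per-group error
```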
[AI-40] Flow Generator Matching
链接: https://arxiv.org/abs/2410.19310
作者: Zemin Huang,Zhengyang Geng,Weijian Luo,Guo-jun Qi
关键词-EN: Intelligence Generated Content, Artificial Intelligence Generated, Generated Content, Artificial Intelligence, Intelligence Generated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:In the realm of Artificial Intelligence Generated Content (AIGC), flow-matching models have emerged as a powerhouse, achieving success due to their robust theoretical underpinnings and solid ability for large-scale generative modeling. These models have demonstrated state-of-the-art performance, but their brilliance comes at a cost. The process of sampling from these models is notoriously demanding on computational resources, as it necessitates the use of multi-step numerical ordinary differential equations (ODEs). Against this backdrop, this paper presents a novel solution with theoretical guarantees in the form of Flow Generator Matching (FGM), an innovative approach designed to accelerate the sampling of flow-matching models into a one-step generation, while maintaining the original performance. On the CIFAR10 unconditional generation benchmark, our one-step FGM model achieves a new record Fréchet Inception Distance (FID) score of 3.08 among few-step flow-matching-based models, outperforming original 50-step flow-matching models. Furthermore, we use the FGM to distill the Stable Diffusion 3, a leading text-to-image flow-matching model based on the MM-DiT architecture. The resulting MM-DiT-FGM one-step text-to-image model demonstrates outstanding industry-level performance. When evaluated on the GenEval benchmark, MM-DiT-FGM has delivered remarkable generating qualities, rivaling other multi-step models in light of the efficiency of a single generation step.
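For context, the basic (conditional) flow-matching objective that such models are trained with is sketched below; FGM's one-step distillation loss itself is not reproduced here.

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    """v_theta(x_t, t) regresses the straight-line velocity x1 - x0."""
    t = torch.rand(x0.shape[0], 1)            # random time per sample
    x_t = (1 - t) * x0 + t * x1               # linear interpolation path
    target = x1 - x0                          # constant velocity along the path
    return ((v_theta(x_t, t) - target) ** 2).mean()
```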
[AI-41] Semantics in Robotics: Environmental Data Can't Yield Conventions of Human Behaviour
链接: https://arxiv.org/abs/2410.19308
作者: Jamie Milton Freestone
关键词-EN: canonical definition, aid HRI, data, Abstract, word semantics
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The word semantics, in robotics and AI, has no canonical definition. It usually serves to denote additional data provided to autonomous agents to aid HRI. Most researchers seem, implicitly, to understand that such data cannot simply be extracted from environmental data. I try to make explicit why this is so and argue that so-called semantics are best understood as data comprised of conventions of human behaviour. This includes labels, most obviously, but also places, ontologies, and affordances. Object affordances are especially problematic because they require not only semantics that are not in the environmental data (conventions of object use) but also an understanding of physics and object combinations that would, if achieved, constitute artificial superintelligence.
[AI-42] TEARS: Textual Representations for Scrutable Recommendations
链接: https://arxiv.org/abs/2410.19302
作者: Emiliano Penaloza,Olivier Gouvert,Haolun Wu,Laurent Charlin
关键词-EN: Traditional recommender systems, modeling user-item interactions, Traditional recommender, recommender systems rely, rely on high-dimensional
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Traditional recommender systems rely on high-dimensional (latent) embeddings for modeling user-item interactions, often resulting in opaque representations that lack interpretability. Moreover, these systems offer limited control to users over their recommendations. Inspired by recent work, we introduce TExtuAl Representations for Scrutable recommendations (TEARS) to address these challenges. Instead of representing a user’s interests through a latent embedding, TEARS encodes them in natural text, providing transparency and allowing users to edit them. To do so, TEARS uses a modern LLM to generate user summaries based on user preferences. We find the summaries capture user preferences uniquely. Using these summaries, we take a hybrid approach where we use an optimal transport procedure to align the summaries’ representation with the learned representation of a standard VAE for collaborative filtering. We find this approach can surpass the performance of three popular VAE models while providing user-controllable recommendations. We also analyze the controllability of TEARS through three simulated user tasks to evaluate the effectiveness of a user editing its summary.
[AI-43] A Stock Price Prediction Approach Based on Time Series Decomposition and Multi-Scale CNN using OHLCT Images
链接: https://arxiv.org/abs/2410.19291
作者: Zhiyuan Pei,Jianqi Yan,Jin Yan,Bailing Yang,Ziyuan Li,Lin Zhang,Xin Liu,Yang Zhang
关键词-EN: including macroeconomic conditions, make price movements, price movements complex, Convolutional Neural Network, Stock price fluctuations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
*备注: 32 pages, 5 figures, 5 tables
点击查看摘要
Abstract:Stock price fluctuations are influenced by a variety of factors, including macroeconomic conditions, government policies, and market sentiment, which together make price movements complex and difficult to predict. Despite many studies aimed at enhancing stock price prediction models, challenges such as data noise, model overfitting, and lack of interpretability are still encountered. To address these issues and improve prediction accuracy, this paper proposes a novel method, named Sequence-based Multiscale Fusion Regression Convolutional Neural Network (SMSFR-CNN), for predicting stock price movements in the China A-share market. By utilizing CNN to learn sequential features and combining them with image features, we improve the accuracy of stock trend prediction on the A-share market stock dataset. This approach reduces the search space for image features, stabilizes, and accelerates the training process. Extensive comparative experiments on 4,454 A-share stocks show that the proposed model achieves 61.15% for positive predictive value and 63.37% for negative predictive value of the stock price trend over the next 5 days, resulting in a total profit of 165.09%.
[AI-44] Applying sparse autoencoders to unlearn knowledge in language models
链接: https://arxiv.org/abs/2410.19278
作者: Eoin Farrell,Yeu-Tong Lau,Arthur Conmy
关键词-EN: Mass Destruction Proxy, language models, Destruction Proxy dataset, sparse autoencoders, investigate whether sparse
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We investigate whether sparse autoencoders (SAEs) can be used to remove knowledge from language models. We use the biology subset of the Weapons of Mass Destruction Proxy dataset and test on the gemma-2b-it and gemma-2-2b-it language models. We demonstrate that individual interpretable biology-related SAE features can be used to unlearn biology-related knowledge with minimal side-effects. Our results suggest that negative scaling of feature activations is necessary and that zero ablating features is ineffective. We find that intervening using multiple SAE features simultaneously can unlearn multiple different topics, but with similar or larger unwanted side-effects than the existing Representation Misdirection for Unlearning technique. Current SAE quality or intervention techniques would need to improve to make SAE-based unlearning comparable to the existing fine-tuning based techniques.
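A schematic of the negative-scaling intervention the abstract describes: selected SAE feature activations are scaled by a negative factor before decoding. The encode/decode interface and the scale of -2.0 below are simplified assumptions, not the authors' exact setup.

```python
import torch

def intervene(sae_encode, sae_decode, hidden, feature_ids, scale=-2.0):
    acts = sae_encode(hidden)                      # (batch, n_features)
    acts[:, feature_ids] = scale * acts[:, feature_ids]
    return sae_decode(acts)                        # edited hidden state

# scale = 0.0 corresponds to zero-ablation, which the paper finds ineffective;
# negative scaling (scale < 0) was necessary to unlearn the targeted topic.
```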
[AI-45] Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management
链接: https://arxiv.org/abs/2410.19274
作者: Tuowei Wang,Ruwen Fan,Minxing Huang,Zixu Hao,Kun Li,Ting Cao,Youyou Lu,Yaoxue Zhang,Ju Ren
关键词-EN: Large Language Models, Large Language, achieved remarkable success, arduous challenge due, mobile devices remains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains an arduous challenge due to their extensive computational and memory demands. While lightweight LLMs have been developed to fit mobile environments, they suffer from degraded model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by selectively transferring only relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are critically limited by numerous I/O operations, particularly on smartphones with severe IOPS constraints. In this paper, we propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Ripple leverages the concept of Neuron Co-Activation, where neurons frequently activated together are linked to facilitate continuous read access and optimize data transfer efficiency. Our approach incorporates a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align well with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Ripple achieves up to 5.93x improvements in I/O latency compared to the state-of-the-art. As the first solution to optimize storage placement under sparsity, Ripple explores a new optimization space at the intersection of sparsity-driven algorithm and storage-level system co-design in LLM inference.
[AI-46] Autonomous Building Cyber-Physical Systems Using Decentralized Autonomous Organizations Digital Twins and Large Language Model
链接: https://arxiv.org/abs/2410.19262
作者: Reachsak Ly,Alireza Shojaei
关键词-EN: Current autonomous building, research primarily focuses, autonomous building, autonomous building research, building
类目: Artificial Intelligence (cs.AI)
*备注: 40 pages, 22 figures
点击查看摘要
Abstract:Current autonomous building research primarily focuses on energy efficiency and automation. While traditional artificial intelligence has advanced autonomous building research, it often relies on predefined rules and struggles to adapt to complex, evolving building operations. Moreover, the centralized organizational structures of facilities management hinder transparency in decision-making, limiting true building autonomy. Research on decentralized governance and adaptive building infrastructure, which could overcome these challenges, remains relatively unexplored. This paper addresses these limitations by introducing a novel Decentralized Autonomous Building Cyber-Physical System framework that integrates Decentralized Autonomous Organizations, Large Language Models, and digital twins to create a smart, self-managed, operational, and financially autonomous building infrastructure. This study develops a full-stack decentralized application to facilitate decentralized governance of building infrastructure. An LLM-based artificial intelligence assistant is developed to provide intuitive human-building interaction for blockchain and building operation management-related tasks and enable autonomous building operation. Six real-world scenarios were tested to evaluate the autonomous building system’s workability, including building revenue and expense management, AI-assisted facility control, and autonomous adjustment of building systems. Results indicate that the prototype successfully executes these operations, confirming the framework’s suitability for developing building infrastructure with decentralized governance and autonomous operation.
[AI-47] Non-rigid Relative Placement through 3D Dense Diffusion
链接: https://arxiv.org/abs/2410.19247
作者: Eric Cai,Octavian Donca,Ben Eisner,David Held
关键词-EN: placing a mug, mug rack, relative placement, mug, unseen task variations
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Conference on Robot Learning (CoRL), 2024
点击查看摘要
Abstract:The task of “relative placement” is to predict the placement of one object in relation to another, e.g. placing a mug onto a mug rack. Through explicit object-centric geometric reasoning, recent methods for relative placement have made tremendous progress towards data-efficient learning for robot manipulation while generalizing to unseen task variations. However, they have yet to represent deformable transformations, despite the ubiquity of non-rigid bodies in real world settings. As a first step towards bridging this gap, we propose “cross-displacement” - an extension of the principles of relative placement to geometric relationships between deformable objects - and present a novel vision-based method to learn cross-displacement through dense diffusion. To this end, we demonstrate our method’s ability to generalize to unseen object instances, out-of-distribution scene configurations, and multimodal goals on multiple highly deformable tasks (both in simulation and in the real world) beyond the scope of prior works. Supplementary information and videos can be found at our website: this https URL.
[AI-48] Designing LLM-Agents with Personalities: A Psychometric Approach
链接: https://arxiv.org/abs/2410.19238
作者: Muhua Huang,Xijuan Zhang,Christopher Soto,James Evans
关键词-EN: Large Language Models-Based, Language Models-Based Agents, Large Language, personalities to Large, Language Models-Based
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:This research introduces a novel methodology for assigning quantifiable, controllable and psychometrically validated personalities to Large Language Models-Based Agents (Agents) using the Big Five personality framework. It seeks to overcome the constraints of human subject studies, proposing Agents as an accessible tool for social science inquiry. Through a series of four studies, this research demonstrates the feasibility of assigning psychometrically valid personality traits to Agents, enabling them to replicate complex human-like behaviors. The first study establishes an understanding of personality constructs and personality tests within the semantic space of an LLM. Two subsequent studies – using empirical and simulated data – illustrate the process of creating Agents and validate the results by showing strong correspondence between human and Agent answers to personality tests. The final study further corroborates this correspondence by using Agents to replicate known human correlations between personality traits and decision-making behaviors in scenarios involving risk-taking and ethical dilemmas, thereby validating the effectiveness of the psychometric approach to design Agents and its applicability to social and behavioral research.
[AI-49] Learning Diffusion Policies from Demonstrations For Compliant Contact-rich Manipulation
链接: https://arxiv.org/abs/2410.19235
作者: Malek Aburub,Cristian C. Beltran-Hernandez,Tatsuya Kamijo,Masashi Hamaya
关键词-EN: achieving human-like dexterity, hold great promise, Robots hold great, remains challenging, human-like dexterity
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Robots hold great promise for performing repetitive or hazardous tasks, but achieving human-like dexterity, especially in contact-rich and dynamic environments, remains challenging. Rigid robots, which rely on position or velocity control, often struggle with maintaining stable contact and applying consistent force in force-intensive tasks. Learning from Demonstration has emerged as a solution, but tasks requiring intricate maneuvers, such as powder grinding, present unique difficulties. This paper introduces Diffusion Policies For Compliant Manipulation (DIPCOM), a novel diffusion-based framework designed for compliant control tasks. By leveraging generative diffusion models, we develop a policy that predicts Cartesian end-effector poses and adjusts arm stiffness to maintain the necessary force. Our approach enhances force control through multimodal distribution modeling, improves the integration of diffusion policies in compliance control, and extends our previous work by demonstrating its effectiveness in real-world tasks. We present a detailed comparison between our framework and existing methods, highlighting the advantages and best practices for deploying diffusion-based compliance control.
[AI-50] Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis
链接: https://arxiv.org/abs/2410.19225
作者: Weikai Li,Ding Wang,Zijian Ding,Atefeh Sohrabizadeh,Zongyue Qin,Jason Cong,Yizhou Sun
关键词-EN: Programmable Gate Array, Field Programmable Gate, designing Field Programmable, Gate Array, Field Programmable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:
点击查看摘要
Abstract:High-level synthesis (HLS) is a widely used tool in designing Field Programmable Gate Array (FPGA). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called a “kernel”) and several pragmas that instruct hardware synthesis, such as parallelization, pipeline, etc. While it is relatively easy for software developers to design the program, it heavily relies on hardware knowledge to design the pragmas, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model on new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE), that can be flexibly adapted to any GNN model. Different expert networks can learn to deal with different regions in the representation space, and they can utilize similar patterns between the old kernels and new kernels. In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.
[AI-51] Integrating Large Language Models with Internet of Things Applications
链接: https://arxiv.org/abs/2410.19223
作者: Mingyu Zong,Arvin Hekmati,Michael Guastalla,Yiyi Li,Bhaskar Krishnamachari
关键词-EN: Internet of Things, DDoS attack detection, Large Language Models, make Internet, GPT model
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper identifies and analyzes applications in which Large Language Models (LLMs) can make Internet of Things (IoT) networks more intelligent and responsive through three case studies from critical topics: DDoS attack detection, macroprogramming over IoT systems, and sensor data processing. Our results reveal that the GPT model under few-shot learning achieves 87.6% detection accuracy, whereas the fine-tuned GPT increases the value to 94.9%. Given a macroprogramming framework, the GPT model is capable of writing scripts using high-level functions from the framework to handle possible incidents. Moreover, the GPT model shows efficacy in processing a vast amount of sensor data by offering fast and high-quality responses, which comprise expected results and summarized insights. Overall, the model demonstrates its potential to power a natural language interface. We hope that researchers will find these case studies inspiring to develop further.
[AI-52] Robot Behavior Personalization from Sparse User Feedback
链接: https://arxiv.org/abs/2410.19219
作者: Maithili Patel,Sonia Chernova
关键词-EN: user, service robots, user feedback, task adaptation, Abstract Concepts
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As service robots become more general-purpose, they will need to adapt to their users’ preferences over a large set of all possible tasks that they can perform. This includes preferences regarding which actions the users prefer to delegate to robots as opposed to doing themselves. Existing personalization approaches require task-specific data for each user. To handle diversity across all household tasks and users, and nuances in user preferences across tasks, we propose to learn a task adaptation function independently, which can be used in tandem with any universal robot policy to customize robot behavior. We create Task Adaptation using Abstract Concepts (TAACo) framework. TAACo can learn to predict the user’s preferred manner of assistance with any given task, by mediating reasoning through a representation composed of abstract concepts built based on user feedback. TAACo can generalize to an open set of household tasks from small amount of user feedback and explain its inferences through intuitive concepts. We evaluate our model on a dataset we collected of 5 people’s preferences, and show that TAACo outperforms GPT-4 by 16% and a rule-based system by 54%, on prediction accuracy, with 40 samples of user feedback.
[AI-53] Taxonomy-guided Semantic Indexing for Academic Paper Search EMNLP’24
链接: https://arxiv.org/abs/2410.19218
作者: SeongKu Kang,Yunyi Zhang,Pengcheng Jiang,Dongha Lee,Jiawei Han,Hwanjo Yu
关键词-EN: efficient literature discovery, scientific advancement, paper search, Taxonomy-guided Semantic Indexing, essential task
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: EMNLP’24
点击查看摘要
Abstract:Academic paper search is an essential task for efficient literature discovery and scientific advancement. While dense retrieval has advanced various ad-hoc searches, it often struggles to match the underlying academic concepts between queries and documents, which is critical for paper search. To enable effective academic concept matching for paper search, we propose Taxonomy-guided Semantic Indexing (TaxoIndex) framework. TaxoIndex extracts key concepts from papers and organizes them as a semantic index guided by an academic taxonomy, and then leverages this index as foundational knowledge to identify academic concepts and link queries and documents. As a plug-and-play framework, TaxoIndex can be flexibly employed to enhance existing dense retrievers. Extensive experiments show that TaxoIndex brings significant improvements, even with highly limited training data, and greatly enhances interpretability.
[AI-54] No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models
链接: https://arxiv.org/abs/2410.19217
作者: Changlong Wu,Ananth Grama,Wojciech Szpankowski
关键词-EN: shown impressive capabilities, synthesizing high-quality outputs, shown impressive, impressive capabilities, capabilities in synthesizing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Generative models have shown impressive capabilities in synthesizing high-quality outputs across various domains. However, a persistent challenge is the occurrence of “hallucinations”, where the model produces outputs that are plausible but invalid. While empirical strategies have been explored to mitigate this issue, a rigorous theoretical understanding remains elusive. In this paper, we develop a theoretical framework to analyze the learnability of non-hallucinating generative models from a learning-theoretic perspective. Our results reveal that non-hallucinating learning is statistically impossible when relying solely on the training dataset, even for a hypothesis class of size two and when the entire training set is truthful. To overcome these limitations, we show that incorporating inductive biases aligned with the actual facts into the learning process is essential. We provide a systematic approach to achieve this by restricting the facts set to a concept class of finite VC-dimension and demonstrate its effectiveness under various learning paradigms. Although our findings are primarily conceptual, they represent a first step towards a principled approach to addressing hallucinations in learning generative models.
[AI-55] Equitable Federated Learning with Activation Clustering
链接: https://arxiv.org/abs/2410.19207
作者: Antesh Upadhyay,Abolfazl Hashemi
关键词-EN: prominent distributed learning, distributed learning paradigm, promotes data locality, Federated learning, distributed learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 28 pages
点击查看摘要
Abstract:Federated learning is a prominent distributed learning paradigm that incorporates collaboration among diverse clients, promotes data locality, and thus ensures privacy. These clients have their own technological, cultural, and other biases in the process of data generation. However, the present standard often ignores this bias/heterogeneity, perpetuating bias against certain groups rather than mitigating it. In response to this concern, we propose an equitable clustering-based framework where the clients are categorized/clustered based on how similar they are to each other. We propose a unique way to construct the similarity matrix that uses activation vectors. Furthermore, we propose a client weighting mechanism to ensure that each cluster receives equal importance, and establish an $O(1/\sqrt{K})$ rate of convergence to reach an $\epsilon$-stationary solution. We assess the effectiveness of our proposed strategy against common baselines, demonstrating its efficacy in terms of reducing the bias existing amongst various client clusters and consequently ameliorating algorithmic bias against specific groups.
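A minimal sketch of the clustering-and-weighting idea follows, under loudly labeled assumptions: random vectors stand in for the clients' activation vectors, off-the-shelf spectral clustering stands in for the paper's clustering step, and the equal-importance rule splits each cluster's unit share evenly among its members.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Hypothetical per-client activation vectors (e.g., mean activations on probe data).
activations = rng.normal(size=(8, 16))  # 8 clients, 16-dim activation summaries

# Cosine-similarity matrix between client activation vectors.
unit = activations / np.linalg.norm(activations, axis=1, keepdims=True)
sim = np.clip(unit @ unit.T, 0, None)   # keep affinities non-negative for clustering

labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0).fit_predict(sim)

# Equal importance per cluster: a client's weight is 1 / (num_clusters * cluster_size).
n_clusters = labels.max() + 1
cluster_sizes = np.bincount(labels)
client_weights = 1.0 / (n_clusters * cluster_sizes[labels])
print(labels, client_weights, client_weights.sum())  # weights sum to 1
```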
[AI-56] An Inverse Modeling Constrained Multi-Objective Evolutionary Algorithm Based on Decomposition
链接: https://arxiv.org/abs/2410.19203
作者: Lucas R. C. Farias,Aluizio F. R. Araújo
关键词-EN: constrained multi-objective evolutionary, based on decomposition, paper introduces, inverse modeling, evolutionary algorithm based
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, 1 algorithm, and 2 tables
点击查看摘要
Abstract:This paper introduces the inverse modeling constrained multi-objective evolutionary algorithm based on decomposition (IM-C-MOEA/D) for addressing constrained real-world optimization problems. Our research builds upon the advancements made in evolutionary computing-based inverse modeling, and it strategically bridges the gaps in applying inverse models based on decomposition to problem domains with constraints. The proposed approach is experimentally evaluated on diverse real-world problems (RWMOP1-35), showing superior performance to state-of-the-art constrained multi-objective evolutionary algorithms (CMOEAs). The experimental results highlight the robustness of the algorithm and its applicability in real-world constrained optimization scenarios.
[AI-57] MAP: Multi-Human-Value Alignment Palette
链接: https://arxiv.org/abs/2410.19198
作者: Xinran Wang,Qi Le,Ammar Ahmed,Enmao Diao,Yi Zhou,Nathalie Baracaldo,Jie Ding,Ali Anwar
关键词-EN: Ensuring that generative, essential but challenging, human, alignment, Ensuring
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Ensuring that generative AI systems align with human values is essential but challenging, especially when considering multiple human values and their potential trade-offs. Since human values can be personalized and dynamically change over time, the desirable levels of value alignment vary across different ethnic groups, industry sectors, and user cohorts. Within existing frameworks, it is hard to define human values and align AI systems accordingly across different directions simultaneously, such as harmlessness, helpfulness, and positiveness. To address this, we develop a novel, first-principle approach called Multi-Human-Value Alignment Palette (MAP), which navigates the alignment across multiple human values in a structured and reliable way. MAP formulates the alignment problem as an optimization task with user-defined constraints, which define human value targets. It can be efficiently solved via a primal-dual approach, which determines whether a user-defined alignment target is achievable and how to achieve it. We conduct a detailed theoretical analysis of MAP by quantifying the trade-offs between values, the sensitivity to constraints, the fundamental connection between multi-value alignment and sequential alignment, and proving that linear weighted rewards are sufficient for multi-value alignment. Extensive experiments demonstrate MAP’s ability to align multiple values in a principled manner while delivering strong empirical performance across various tasks.
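The constrained-alignment idea can be caricatured with an exponential-tilting dual ascent: dual variables rise on any value whose target is unmet, tilting a toy sample distribution until the expected value rewards hit the targets. The reward matrix, targets, step size, and the tilting form are all illustrative assumptions, not MAP's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(1000, 3))       # per-sample rewards for 3 values (toy stand-in)
targets = np.array([0.3, 0.2, 0.1])  # user-defined value targets

lam = np.zeros(3)                    # dual variables, one per value constraint
for _ in range(2000):
    w = np.exp(R @ lam)
    w /= w.sum()                     # tilted distribution over samples
    gap = targets - w @ R            # constraint violation under the current tilt
    lam = np.maximum(lam + 0.5 * gap, 0.0)  # projected dual ascent

print("achieved:", w @ R, "targets:", targets)
```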
[AI-58] Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts
链接: https://arxiv.org/abs/2410.19185
作者: Danyal Aftab,Steven Davy
关键词-EN: Large language models, Large language, demonstrate impressive proficiency, impressive proficiency, model
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models demonstrate impressive proficiency in language understanding and generation. Nonetheless, training these models from scratch, even for the least complex billion-parameter variant, demands significant computational resources, rendering it economically impractical for many organizations. With large language models functioning as general-purpose task solvers, this paper investigates their task-specific fine-tuning. We employ task-specific datasets and prompts to fine-tune two pruned LLaMA models having 5 billion and 4 billion parameters. This process utilizes the pre-trained weights and focuses on a subset of weights using the LoRA method. One challenge in fine-tuning the LLaMA model is crafting a precise prompt tailored to the specific task. To address this, we propose a novel approach to fine-tune the LLaMA model under two primary constraints: task specificity and prompt effectiveness. Our approach, Tailored-LLaMA, initially employs structural pruning to reduce the model sizes from 7B to 5B and 4B parameters. Subsequently, it applies a carefully designed prompt specific to the task and utilizes the LoRA method to accelerate the fine-tuning process. Moreover, fine-tuning a model pruned by 50% for less than one hour restores the mean accuracy of classification tasks to 95.68% at a 20% compression ratio and to 86.54% at a 50% compression ratio through few-shot learning with 50 shots. Our validation of Tailored-LLaMA on these two pruned variants demonstrates that even when compressed to 50%, the models maintain over 65% of the baseline model accuracy in few-shot classification and generation tasks. These findings highlight the efficacy of our tailored approach in maintaining high performance with significantly reduced model sizes.
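As a rough illustration of the LoRA stage, the sketch below wires a hypothetical pruned checkpoint into Hugging Face peft; the checkpoint path, target modules, LoRA hyperparameters, and the prompt template are placeholders rather than the paper's actual pruning recipe or prompt designs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint name for an already structurally-pruned LLaMA variant.
base_id = "path/to/pruned-llama-5b"

model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# LoRA over the attention projections only: trains a small subset of weights.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()

# An illustrative task-specific prompt template (not the paper's exact wording).
prompt = ("Classify the sentiment of the following review as positive or negative.\n"
          "Review: {text}\nSentiment:")
```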
[AI-59] Can Self Supervision Rejuvenate Similarity-Based Link Prediction?
链接: https://arxiv.org/abs/2410.19183
作者: Chenhan Zhang,Weiqi Wang,Zhiyi Tian,James Jianqiao Yu,Mohamed Ali Kaafar,An Liu,Shui Yu
关键词-EN: shown remarkable capabilities, learning-based link prediction, remarkable capabilities, similarity-based, recent advancements
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Although recent advancements in end-to-end learning-based link prediction (LP) methods have shown remarkable capabilities, the significance of traditional similarity-based LP methods persists in unsupervised scenarios where there are no known link labels. However, the selection of node features for similarity computation in similarity-based LP can be challenging. Less informative node features can result in suboptimal LP performance. To address these challenges, we integrate self-supervised graph learning techniques into similarity-based LP and propose a novel method: Self-Supervised Similarity-based LP (3SLP). 3SLP is suitable for the unsupervised condition of similarity-based LP without the assistance of known link labels. Specifically, 3SLP introduces a dual-view contrastive node representation learning (DCNRL) with crafted data augmentation and node representation learning. DCNRL is dedicated to developing more informative node representations, replacing the node attributes as inputs in the similarity-based LP backbone. Extensive experiments over benchmark datasets demonstrate the salient improvement of 3SLP, outperforming the baseline of traditional similarity-based LP by up to 21.2% (AUC).
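The dual-view contrastive step can be sketched as below, with simplifications flagged: a plain MLP stands in for a graph encoder, random feature masking stands in for the crafted augmentations, and an InfoNCE-style loss treats the same node across views as the positive pair; the learned embeddings then feed a cosine-similarity link scorer.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE between two views; the same node across views is the positive pair."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
x = torch.randn(64, 32)  # toy node features (a real setup would also use the graph)
encoder = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for _ in range(100):
    v1 = x * (torch.rand_like(x) > 0.2).float()  # view 1: random feature masking
    v2 = x * (torch.rand_like(x) > 0.2).float()  # view 2: independent masking
    loss = nt_xent(encoder(v1), encoder(v2))
    opt.zero_grad(); loss.backward(); opt.step()

# Learned embeddings replace raw attributes in the similarity-based LP backbone.
with torch.no_grad():
    z = F.normalize(encoder(x), dim=1)
link_scores = z @ z.t()  # cosine similarity as the link score
```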
[AI-60] PDL: A Declarative Prompt Programming Language
链接: https://arxiv.org/abs/2410.19135
作者: Mandana Vaziri,Louis Mandel,Claudio Spiess,Martin Hirzel
关键词-EN: Large language models, Large language, storm by making, making many previously, previously difficult
类目: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have taken the world by storm by making many previously difficult uses of AI feasible. LLMs are controlled via highly expressive textual prompts and return textual answers. Unfortunately, this unstructured text as input and output makes LLM-based applications brittle. This motivates the rise of prompting frameworks, which mediate between LLMs and the external world. However, existing prompting frameworks either have a high learning curve or take away control over the exact prompts from the developer. To overcome this dilemma, this paper introduces the Prompt Declaration Language (PDL). PDL is a simple declarative data-oriented language that puts prompts at the forefront, based on YAML. PDL works well with many LLM platforms and LLMs. It supports writing interactive applications that call LLMs and tools, and makes it easy to implement common use-cases such as chatbots, RAG, or agents. We hope PDL will make prompt programming simpler, less brittle, and more enjoyable.
[AI-61] Research on Key Technologies for Cross-Cloud Federated Training of Large Language Models
链接: https://arxiv.org/abs/2410.19130
作者: Haowei Yang,Mingxiu Sui,Shaobo Liu,Xinyue Qian,Zhaoyang Zhang,Bingying Liu
关键词-EN: demonstrated exceptional performance, language processing technology, natural language processing, large language models, Cross-cloud federated training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:With the rapid development of natural language processing technology, large language models have demonstrated exceptional performance in various application scenarios. However, training these models requires significant computational resources and data processing capabilities. Cross-cloud federated training offers a new approach to addressing the resource bottlenecks of a single cloud platform, allowing the computational resources of multiple clouds to collaboratively complete the training tasks of large models. This study analyzes the key technologies of cross-cloud federated training, including data partitioning and distribution, communication optimization, model aggregation algorithms, and the compatibility of heterogeneous cloud platforms. Additionally, the study examines data security and privacy protection strategies in cross-cloud training, particularly the application of data encryption and differential privacy techniques. Through experimental validation, the proposed technical framework demonstrates enhanced training efficiency, ensured data security, and reduced training costs, highlighting the broad application prospects of cross-cloud federated training.
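One concrete slice of the privacy discussion is easy to sketch: averaging norm-clipped client updates with Gaussian noise, in the spirit of differential privacy. The clip bound and noise scale below are illustrative and not calibrated to any formal privacy budget, and real cross-cloud training would add secure channels and proper accounting.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_aggregate(updates, clip=1.0, sigma=0.5):
    """Average norm-clipped client updates with Gaussian noise (DP-style sketch)."""
    clipped = []
    for u in updates:
        norm = np.linalg.norm(u)
        clipped.append(u * min(1.0, clip / norm))  # bound each client's influence
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(updates), size=mean.shape)
    return mean + noise

# Toy model updates from three "clouds".
updates = [rng.normal(size=10) for _ in range(3)]
print(dp_aggregate(updates))
```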
[AI-62] Bio2Token: All-atom tokenization of any biomolecular structure with Mamba
链接: https://arxiv.org/abs/2410.19110
作者: Andrew Liu,Axel Elaldi,Nathan Russell,Olivia Viessmann
关键词-EN: biomolecular design applications, design applications, high fidelity, fidelity is critical, critical for biomolecular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies below and around 1 Angstrom. We demonstrate that the Mamba state space model architecture employed is comparatively efficient, requiring a fraction of the training data, parameters and compute needed to reach competitive accuracies and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom language models in the future.
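The tokenization step of a quantized auto-encoder can be sketched as a vector-quantization bottleneck: each per-atom latent snaps to its nearest codebook vector, with a straight-through estimator keeping the encoder trainable. Codebook size, latent dimension, and the random latents are assumptions; the paper's Mamba-based encoder and decoder are not shown.

```python
import torch

torch.manual_seed(0)
codebook = torch.nn.Parameter(torch.randn(256, 8))  # 256 structure tokens, dim 8

def quantize(z):
    """Map each latent vector to its nearest codebook token (VQ bottleneck)."""
    d = torch.cdist(z, codebook)   # pairwise distances to all code vectors
    idx = d.argmin(dim=1)          # token ids, one per atom
    z_q = codebook[idx]
    # Straight-through estimator: gradients flow to the encoder through z.
    return z + (z_q - z).detach(), idx

z = torch.randn(1000, 8)           # toy per-atom latents from an encoder
z_q, tokens = quantize(z)
print(tokens[:10])
```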
[AI-63] VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
链接: https://arxiv.org/abs/2410.19100
作者: Lawrence Jang,Yinheng Li,Charles Ding,Justin Lin,Paul Pu Liang,Dan Zhao,Rogerio Bonatti,Kazuhito Koishida
关键词-EN: retention, factual retention, tasks, learn or extract, long-context
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Videos are often used to learn or extract the necessary information to complete tasks in ways different than what text and static imagery alone can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, the factual retention task evaluates whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease in WebArena tasks and a 10.3% decrease in VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
[AI-64] A Counterexample in Cross-Correlation Template Matching
链接: https://arxiv.org/abs/2410.19085
作者: Serap A. Savari
关键词-EN: impact is incomplete, quantization are standard, standard practices, practices in signal, theoretical understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Sampling and quantization are standard practices in signal and image processing, but a theoretical understanding of their impact is incomplete. We consider discrete image registration when the underlying function is a one-dimensional spatially-limited piecewise constant function. For ideal noiseless sampling the number of samples from each region of the support of the function generally depends on the placement of the sampling grid. Therefore, if the samples of the function are noisy, then image registration requires alignment and segmentation of the data sequences. One popular strategy for aligning images is selecting the maximum from cross-correlation template matching. To motivate more robust and accurate approaches which also address segmentation, we provide an example of a one-dimensional spatially-limited piecewise constant function for which the cross-correlation technique can perform poorly on noisy samples. While earlier approaches to improve the method involve normalization, our example suggests a novel strategy in our setting. Difference sequences, thresholding, and dynamic programming are well-known techniques in image processing. We prove that they are tools to correctly align and segment noisy data sequences under some conditions on the noise. We also address some of the potential difficulties that could arise in a more general case.
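A toy version of the setting makes both ingredients concrete: cross-correlation template matching over candidate lags, then a difference sequence with thresholding to propose change points. The signal, noise level, and threshold below are arbitrary assumptions, and the paper's counterexample and dynamic-programming analysis are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
template = np.concatenate([np.zeros(20), np.ones(15) * 2.0, np.zeros(25)])
signal = np.roll(template, 7) + rng.normal(0, 0.3, template.size)  # shifted + noisy

# Cross-correlation template matching: pick the lag with the maximum score.
lags = np.arange(-10, 11)
scores = [np.dot(np.roll(template, k), signal) for k in lags]
best_lag = lags[int(np.argmax(scores))]
print("estimated shift:", best_lag)

# Difference sequence + thresholding to locate piecewise-constant boundaries.
diff = np.abs(np.diff(signal))
boundaries = np.where(diff > 3 * diff.std())[0]
print("candidate change points:", boundaries)
```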
[AI-65] From a Tiny Slip to a Giant Leap: An LLM-Based Simulation for Fake News Evolution
链接: https://arxiv.org/abs/2410.19064
作者: Yuhan Liu,Zirui Song,Xiaoqing Zhang,Xiuying Chen,Rui Yan
关键词-EN: fake, misinformation online, research has increasingly, increasingly focused, focused on detecting
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the growing spread of misinformation online, research has increasingly focused on detecting and tracking fake news. However, an overlooked issue is that fake news does not naturally exist in social networks – it often originates from distorted facts or deliberate fabrication by malicious actors. Understanding how true news gradually evolves into fake news is critical for early detection and prevention, reducing its spread and impact. Hence, in this paper, we take the first step toward simulating and revealing this evolution, proposing a Fake News evolUtion Simulation framEwork (FUSE) based on large language models (LLMs). Specifically, we employ LLMs as agents to represent individuals in a simulated social network. We define four types of agents commonly observed in daily interactions: spreaders, who propagate information; commentators, who provide opinions and interpretations; verifiers, who check the accuracy of information; and bystanders, who passively observe without engaging. For simulated environments, we model various social network structures, such as high-clustering networks and scale-free networks, to mirror real-world network dynamics. Each day, the agents engage in belief exchanges, reflect on their thought processes, and reintroduce the news accordingly. Given the lack of prior work in this area, we developed a FUSE-EVAL evaluation framework to measure the deviation from true news during the fake news evolution process. The results show that FUSE successfully captures the underlying patterns of how true news transforms into fake news and accurately reproduces previously discovered instances of fake news, aligning closely with human evaluations. Moreover, our work suggests that combating fake news should not be delayed until it has fully evolved; instead, prevention in advance is key to achieving better outcomes.
[AI-66] ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
链接: https://arxiv.org/abs/2410.19056
作者: Xiaodong Yu,Ben Zhou,Hao Cheng,Dan Roth
关键词-EN: Existing math datasets, large language models, intermediate reasoning steps, reasoning steps derived, Existing math
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface the model's use of shortcuts and wrong reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we seek to use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. For those executable programs verified using the original input-output pairs, they are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
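The evaluation loop can be sketched with a hand-written program standing in for an extracted one: verify it on the original input-output pair, then sample alternative inputs and regenerate question text with new gold answers. The word problem, field names, and template are invented for illustration.

```python
import random

# A symbolic program standing in for one extracted from a word problem.
def program(apples_per_box: int, boxes: int, eaten: int) -> int:
    return apples_per_box * boxes - eaten

original_inputs = {"apples_per_box": 6, "boxes": 4, "eaten": 5}
assert program(**original_inputs) == 19  # verified on the original input-output pair

# Generate perturbed instances; the model must stay correct across all of them.
random.seed(0)
for _ in range(3):
    new = {"apples_per_box": random.randint(3, 9),
           "boxes": random.randint(2, 9),
           "eaten": random.randint(1, 5)}
    gold = program(**new)
    question = (f"Each box has {new['apples_per_box']} apples. There are "
                f"{new['boxes']} boxes and {new['eaten']} apples are eaten. "
                f"How many apples remain?")
    print(question, "->", gold)  # compare the LLM's answer against this gold value
```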
[AI-67] Large Language Models for Financial Aid in Financial Time-series Forecasting
链接: https://arxiv.org/abs/2410.19025
作者: Md Khairul Islam,Ayush Karmacharya,Timothy Sue,Judy Fox
关键词-EN: current research focuses, leveraging big data, big data analytics, financial aid, current research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: GitHub link this https URL
点击查看摘要
Abstract:Considering the difficulty of financial time series forecasting in financial aid, much of the current research focuses on leveraging big data analytics in financial services. One modern approach is to utilize “predictive analysis”, analogous to forecasting financial trends. However, much of the time series data in Financial Aid (FA) poses unique challenges due to limited historical datasets and high-dimensional financial information, which hinder the development of effective predictive models that balance accuracy with efficient runtime and memory usage. Pre-trained foundation models are employed to address these challenging tasks. We use state-of-the-art time series models including pre-trained LLMs (GPT-2 as the backbone), transformers, and linear models to demonstrate their ability to outperform traditional approaches, even with minimal (“few-shot”) or no fine-tuning (“zero-shot”). Our benchmark study, which includes financial aid alongside seven other time series tasks, shows the potential of using LLMs for scarce financial datasets.
[AI-68] Dual Space Training for GANs: A Pathway to Efficient and Creative Generative Models
链接: https://arxiv.org/abs/2410.19009
作者: Beka Modrekiladze
关键词-EN: Generative Adversarial Networks, requiring extensive computational, demonstrated remarkable advancements, extensive computational time, Adversarial Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 3 figures
点击查看摘要
Abstract:Generative Adversarial Networks (GANs) have demonstrated remarkable advancements in generative modeling; however, their training is often resource-intensive, requiring extensive computational time and hundreds of thousands of epochs. This paper proposes a novel optimization approach that transforms the training process by operating within a dual space of the initial data using invertible mappings, specifically autoencoders. By training GANs on the encoded representations in the dual space, which encapsulate the most salient features of the data, the generative process becomes significantly more efficient and potentially reveals underlying patterns beyond human recognition. This approach not only enhances training speed and resource usage but also explores the philosophical question of whether models can generate insights that transcend human intelligence while being limited by human-generated data.
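A compressed sketch of the two-stage recipe, with toy vectors in place of images and every network reduced to a single linear layer: first fit an autoencoder, then run ordinary adversarial training entirely inside the learned latent (dual) space, decoding generated latents back to data space at the end. All dimensions and training lengths are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.randn(512, 64)  # toy "images" as 64-d vectors

# Stage 1: the autoencoder defines the dual space.
enc, dec = nn.Linear(64, 16), nn.Linear(16, 64)
ae_opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
for _ in range(200):
    loss = ((dec(enc(data)) - data) ** 2).mean()
    ae_opt.zero_grad(); loss.backward(); ae_opt.step()

# Stage 2: adversarial training happens entirely in the 16-d latent space.
G, D = nn.Linear(8, 16), nn.Linear(16, 1)
g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
for _ in range(200):
    real = enc(data).detach()               # encoded real data
    fake = G(torch.randn(512, 8))           # generated latents
    d_loss = (bce(D(real), torch.ones(512, 1))
              + bce(D(fake.detach()), torch.zeros(512, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = bce(D(fake), torch.ones(512, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

samples = dec(G(torch.randn(4, 8)))         # decode latents back to data space
```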
[AI-69] Whither Bias Goes I Will Go: An Integrative Systematic Review of Algorithmic Bias Mitigation
链接: https://arxiv.org/abs/2410.19003
作者: Louis Hickman,Christopher Huynh,Jessica Gass,Brandon Booth,Jason Kuruzovich,Louis Tay
关键词-EN: automatically scored interviews, Machine learning, algorithmic bias mitigation, bias mitigation methods, algorithmic bias
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: forthcoming in Journal of Applied Psychology
点击查看摘要
Abstract:Machine learning (ML) models are increasingly used for personnel assessment and selection (e.g., resume screeners, automatically scored interviews). However, concerns have been raised throughout society that ML assessments may be biased and perpetuate or exacerbate inequality. Although organizational researchers have begun investigating ML assessments from traditional psychometric and legal perspectives, there is a need to understand, clarify, and integrate fairness operationalizations and algorithmic bias mitigation methods from the computer science, data science, and organizational research literatures. We present a four-stage model of developing ML assessments and applying bias mitigation methods, including 1) generating the training data, 2) training the model, 3) testing the model, and 4) deploying the model. When introducing the four-stage model, we describe potential sources of bias and unfairness at each stage. Then, we systematically review definitions and operationalizations of algorithmic bias, legal requirements governing personnel selection from the United States and Europe, and research on algorithmic bias mitigation across multiple domains and integrate these findings into our framework. Our review provides insights for both research and practice by elucidating possible mechanisms of algorithmic bias while identifying which bias mitigation methods are legal and effective. This integrative framework also reveals gaps in the knowledge of algorithmic bias mitigation that should be addressed by future collaborative research between organizational researchers, computer scientists, and data scientists. We provide recommendations for developing and deploying ML assessments, as well as recommendations for future research into algorithmic bias and fairness.
[AI-70] TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations
链接: https://arxiv.org/abs/2410.18991
作者: Nathalie Maria Kirch,Konstantin Hebenstreit,Matthias Samwald
关键词-EN: mass casualty incidents, tests LLMs’ ability, make ethical decisions, machine ethics, casualty incidents
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present the TRIAGE Benchmark, a novel machine ethics (ME) benchmark that tests LLMs’ ability to make ethical decisions during mass casualty incidents. It uses real-world ethical dilemmas with clear solutions designed by medical professionals, offering a more realistic alternative to annotation-based benchmarks. TRIAGE incorporates various prompting styles to evaluate model performance across different contexts. Most models consistently outperformed random guessing, suggesting LLMs may support decision-making in triage scenarios. Neutral or factual scenario formulations led to the best performance, unlike other ME benchmarks where ethical reminders improved outcomes. Adversarial prompts reduced performance but not to random guessing levels. Open-source models made more morally serious errors, and general capability overall predicted better performance.
[AI-71] Evolving Neural Networks Reveal Emergent Collective Behavior from Minimal Agent Interactions
链接: https://arxiv.org/abs/2410.19718
作者: Guilherme S. Y. Giardini,John F. Hardy II,Carlo R. da Cunha
关键词-EN: artificial intelligence, critical for advancing, robotics and artificial, behaviors, emergent behaviors
类目: Adaptation and Self-Organizing Systems (nlin.AO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 25 pages, 9 figures
点击查看摘要
Abstract:Understanding the mechanisms behind emergent behaviors in multi-agent systems is critical for advancing fields such as swarm robotics and artificial intelligence. In this study, we investigate how neural networks evolve to control agents’ behavior in a dynamic environment, focusing on the relationship between the network’s complexity and collective behavior patterns. By performing quantitative and qualitative analyses, we demonstrate that the degree of network non-linearity correlates with the complexity of emergent behaviors. Simpler behaviors, such as lane formation and laminar flow, are characterized by more linear network operations, while complex behaviors like swarming and flocking show highly non-linear neural processing. Moreover, specific environmental parameters, such as moderate noise, broader field of view, and lower agent density, promote the evolution of non-linear networks that drive richer, more intricate collective behaviors. These results highlight the importance of tuning evolutionary conditions to induce desired behaviors in multi-agent systems, offering new pathways for optimizing coordination in autonomous swarms. Our findings contribute to a deeper understanding of how neural mechanisms influence collective dynamics, with implications for the design of intelligent, self-organizing systems.
[AI-72] Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina
链接: https://arxiv.org/abs/2410.19599
作者: Yuan Gao,Dokyun Lee,Gordon Burtch,Sina Fazelpour
关键词-EN: Recent studies suggest, studies suggest large, Recent studies, exhibit human-like reasoning, suggest large language
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Almost all advanced approaches fail to replicate human behavior distributions across many models, except in one case involving fine-tuning using a substantial amount of human behavior data. Causes of failure are diverse, relating to input language, roles, and safeguarding. These results caution against using LLMs to study human behaviors or as human surrogates.
[AI-73] Brain-like Functional Organization within Large Language Models
链接: https://arxiv.org/abs/2410.19542
作者: H.Sun,L.Zhao,Z.Wu,X.Gao,Y.Hu,M.Zuo,W.Zhang,J.Han,T.Liu,X.Hu
关键词-EN: brain-like functional organization, human brain, brain-like functional, artificial neurons, functional organization
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The human brain has long inspired the pursuit of artificial intelligence (AI). Recently, neuroimaging studies provide compelling evidence of alignment between the computational representation of artificial neural networks (ANNs) and the neural responses of the human brain to stimuli, suggesting that ANNs may employ brain-like information processing strategies. While such alignment has been observed across sensory modalities–visual, auditory, and linguistic–much of the focus has been on the behaviors of artificial neurons (ANs) at the population level, leaving the functional organization of individual ANs that facilitates such brain-like processes largely unexplored. In this study, we bridge this gap by directly coupling sub-groups of artificial neurons with functional brain networks (FBNs), the foundational organizational structure of the human brain. Specifically, we extract representative patterns from temporal responses of ANs in large language models (LLMs), and use them as fixed regressors to construct voxel-wise encoding models to predict brain activity recorded by functional magnetic resonance imaging (fMRI). This framework links the AN sub-groups to FBNs, enabling the delineation of brain-like functional organization within LLMs. Our findings reveal that LLMs (BERT and Llama 1-3) exhibit brain-like functional architecture, with sub-groups of artificial neurons mirroring the organizational patterns of well-established FBNs. Notably, the brain-like functional organization of LLMs evolves with the increased sophistication and capability, achieving an improved balance between the diversity of computational behaviors and the consistency of functional specializations. This research represents the first exploration of brain-like functional organization within LLMs, offering novel insights to inform the development of artificial general intelligence (AGI) with human brain principles.
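The voxel-wise encoding step can be sketched on synthetic data: representative temporal patterns of artificial-neuron sub-groups act as fixed regressors in a ridge regression that predicts each voxel's time course, scored by held-out correlation. All sizes and the synthetic generative model are assumptions; no fMRI data or LLM activations appear here.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T, n_regressors, n_voxels = 200, 6, 500
X = rng.normal(size=(T, n_regressors))  # representative AN temporal patterns
B = rng.normal(size=(n_regressors, n_voxels))
Y = X @ B + rng.normal(0, 0.5, size=(T, n_voxels))  # synthetic fMRI responses

# Voxel-wise encoding model: predict each voxel's time course from AN regressors.
model = Ridge(alpha=1.0).fit(X[:150], Y[:150])
pred = model.predict(X[150:])
r = [np.corrcoef(pred[:, v], Y[150:, v])[0, 1] for v in range(n_voxels)]
print("mean held-out voxel correlation:", float(np.mean(r)))
```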
[AI-74] Unified Causality Analysis Based on the Degrees of Freedom
链接: https://arxiv.org/abs/2410.19469
作者: András Telcs,Marcell T. Kurbucz,Antal Jakovác
关键词-EN: Temporally evolving systems, Temporally evolving, typically modeled, dynamic equations, Temporally
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Mathematical Physics (math-ph)
*备注: 32 pages, 7 figures
点击查看摘要
Abstract:Temporally evolving systems are typically modeled by dynamic equations. A key challenge in accurate modeling is understanding the causal relationships between subsystems, as well as identifying the presence and influence of unobserved hidden drivers on the observed dynamics. This paper presents a unified method capable of identifying fundamental causal relationships between pairs of systems, whether deterministic or stochastic. Notably, the method also uncovers hidden common causes beyond the observed variables. By analyzing the degrees of freedom in the system, our approach provides a more comprehensive understanding of both causal influence and hidden confounders. This unified framework is validated through theoretical models and simulations, demonstrating its robustness and potential for broader application.
[AI-75] NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction NEURIPS2024
链接: https://arxiv.org/abs/2410.19452
作者: Zixuan Gong,Guangyin Bao,Qi Zhang,Zhongwei Wan,Duoqian Miao,Shoujin Wang,Lei Zhu,Changwei Wang,Rongtao Xu,Liang Hu,Ke Liu,Yu Zhang
关键词-EN: CLIP and Stable, advanced deep learning, non-invasion brain activity, achieves great success, Stable Diffusion
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 Oral
点击查看摘要
Abstract:Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, the research on fMRI-to-video reconstruction remains limited since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both high-level semantics and low-level perception flows, as perceived by the brain in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth high-fidelity video reconstruction of up to 6s at 8FPS, gaining significant improvements over state-of-the-art models in various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at this https URL.
[AI-76] CLAP. I. Resolving miscalibration for deep learning-based galaxy photometric redshift estimation
链接: https://arxiv.org/abs/2410.19390
作者: Qiufan Lin,Hengxin Ruan,Dominique Fouchez,Shupei Chen,Rui Li,Paulo Montero-Camacho,Nicola R. Napolitano,Yuan-Sen Ting,Wei Zhang
关键词-EN: spectroscopic measurement remains, redshift probability densities, well-calibrated photometric redshift, photometric redshift probability, photometric redshift
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
*备注: 22 + 6 pages, 9 + 5 figures
点击查看摘要
Abstract:Obtaining well-calibrated photometric redshift probability densities for galaxies without a spectroscopic measurement remains a challenge. Deep learning discriminative models, typically fed with multi-band galaxy images, can produce outputs that mimic probability densities and achieve state-of-the-art accuracy. However, such models may be affected by miscalibration that would result in discrepancies between the model outputs and the actual distributions of true redshifts. Our work develops a novel method called the Contrastive Learning and Adaptive KNN for Photometric Redshift (CLAP) that resolves this issue. It leverages supervised contrastive learning (SCL) and k-nearest neighbours (KNN) to construct and calibrate raw probability density estimates, and implements a refitting procedure to resume end-to-end discriminative models ready to produce final estimates for large-scale imaging data. The harmonic mean is adopted to combine an ensemble of estimates from multiple realisations for improving accuracy. Our experiments demonstrate that CLAP takes advantage of both deep learning and KNN, outperforming benchmark methods on the calibration of probability density estimates and retaining high accuracy and computational efficiency. With reference to CLAP, we point out that miscalibration is particularly sensitive to the method-induced excessive correlations among data instances in addition to the unaccounted-for epistemic uncertainties. Reducing the uncertainties may not guarantee the removal of miscalibration due to the presence of such excessive correlations, yet this is a problem for conventional deep learning methods rather than CLAP. These discussions underscore the robustness of CLAP for obtaining photometric redshift probability densities required by astrophysical and cosmological applications. This is the first paper in our series on CLAP.
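One ingredient that isolates cleanly is the harmonic-mean combination of an ensemble of probability density estimates. The sketch applies it to toy Gaussian densities on a redshift grid and renormalizes; the grid, widths, and ensemble are invented, and none of CLAP's contrastive or KNN machinery appears.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 2.0, 200)
dx = grid[1] - grid[0]

# Toy ensemble: Gaussian density estimates from five hypothetical realisations.
ensemble = [np.exp(-0.5 * ((grid - mu) / 0.1) ** 2)
            for mu in rng.normal(0.8, 0.05, 5)]
ensemble = [p / (p.sum() * dx) for p in ensemble]  # normalise each density

# Harmonic mean across the ensemble, renormalised to integrate to one.
hm = len(ensemble) / np.sum([1.0 / np.maximum(p, 1e-300) for p in ensemble], axis=0)
hm /= hm.sum() * dx
print("peak redshift:", grid[hm.argmax()])
```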
[AI-77] High Resolution Seismic Waveform Generation using Denoising Diffusion
链接: https://arxiv.org/abs/2410.19343
作者: Andreas Bergmeister,Kadek Hendrawan Palgunadi,Andrea Bosisio,Laura Ermert,Maria Koroni,Nathanaël Perraudin,Simon Dirmeier,Men-Andrin Meier
关键词-EN: earthquake-resistant infrastructure design, Accurate prediction, Ground Motion, seismic, infrastructure design
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate prediction and synthesis of seismic waveforms are crucial for seismic hazard assessment and earthquake-resistant infrastructure design. Existing prediction methods, such as Ground Motion Models and physics-based simulations, often fail to capture the full complexity of seismic wavefields, particularly at higher frequencies. This study introduces a novel, efficient, and scalable generative model for high-frequency seismic waveform generation. Our approach leverages a spectrogram representation of seismic waveform data, which is reduced to a lower-dimensional submanifold via an autoencoder. A state-of-the-art diffusion model is trained to generate this latent representation, conditioned on key input parameters: earthquake magnitude, recording distance, site conditions, and faulting type. The model generates waveforms with frequency content up to 50 Hz. Any scalar ground motion statistic, such as peak ground motion amplitudes and spectral accelerations, can be readily derived from the synthesized waveforms. We validate our model using commonly used seismological metrics, and performance metrics from image generation studies. Our results demonstrate that our openly available model can generate distributions of realistic high-frequency seismic waveforms across a wide range of input parameters, even in data-sparse regions. For the scalar ground motion statistics commonly used in seismic hazard and earthquake engineering studies, we show that the model accurately reproduces both the median trends of the real data and its variability. To evaluate and compare the growing number of this and similar ‘Generative Waveform Models’ (GWM), we argue that they should generally be openly available and that they should be included in community efforts for ground motion model evaluations.
[AI-78] ST-NeRP: Spatial-Temporal Neural Representation Learning with Prior Embedding for Patient-specific Imaging Study
链接: https://arxiv.org/abs/2410.19283
作者: Liang Qiu,Liyue Shen,Lianli Liu,Junyan Liu,Yizheng Chen,Lei Xing
关键词-EN: treatment responses, Implicit Neural Representation, monitor the disease, disease progression, progression and assess
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages with 10 figures and 6 tables
点击查看摘要
Abstract:During and after a course of therapy, imaging is routinely used to monitor the disease progression and assess the treatment responses. Despite its significance, reliably capturing and predicting the spatial-temporal anatomic changes from a sequence of patient-specific image series presents a considerable challenge. Thus, the development of a computational framework becomes highly desirable for a multitude of practical applications. In this context, we propose a strategy of Spatial-Temporal Neural Representation learning with Prior embedding (ST-NeRP) for patient-specific imaging study. Our strategy involves leveraging an Implicit Neural Representation (INR) network to encode the image at the reference time point into a prior embedding. Subsequently, a spatial-temporally continuous deformation function is learned through another INR network. This network is trained using the whole patient-specific image sequence, enabling the prediction of deformation fields at various target time points. The efficacy of the ST-NeRP model is demonstrated through its application to diverse sequential image series, including 4D CT and longitudinal CT datasets within thoracic and abdominal imaging. The proposed ST-NeRP model exhibits substantial potential in enabling the monitoring of anatomical changes within a patient throughout the therapeutic journey.
[AI-79] UbiHR: Resource-efficient Long-range Heart Rate Sensing on Ubiquitous Devices
链接: https://arxiv.org/abs/2410.19279
作者: Haoyu Bian,Bin Guo,Sicong Liu,Yasan Ding,Shanshan Gao,Zhiwen Yu
关键词-EN: Ubiquitous on-device heart, heart rate sensing, chronic patients, on-device heart rate, vital for high-stress
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Ubiquitous on-device heart rate sensing is vital for high-stress individuals and chronic patients. Non-contact sensing, compared to contact-based tools, allows for natural user monitoring, potentially enabling more accurate and holistic data collection. However, in open and uncontrolled mobile environments, user movement and lighting introduce noise. Existing methods, such as curve-based approaches or short-range deep learning recognition over adjacent frames, struggle to strike the optimal balance between real-time performance and accuracy, especially under limited device resources. In this paper, we present UbiHR, a ubiquitous device-based heart rate sensing system. Key to UbiHR is a real-time long-range spatio-temporal model enabling noise-independent heart rate recognition and display on commodity mobile devices, along with a set of mechanisms for prompt and energy-efficient sampling and preprocessing. Diverse experiments and user studies involving four devices, four tasks, and 80 participants demonstrate UbiHR's superior performance, enhancing accuracy by up to 74.2% and reducing latency by 51.2%.
[AI-80] Conditional diffusions for neural posterior estimation
链接: https://arxiv.org/abs/2410.19105
作者: Tianyu Chen,Vansh Bansal,James G. Scott
关键词-EN: Neural posterior estimation, shown great success, Bayesian inference, approach for Bayesian, Neural posterior
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Neural posterior estimation (NPE), a simulation-based computational approach for Bayesian inference, has shown great success in situations where posteriors are intractable or likelihood functions are treated as “black boxes.” Existing NPE methods typically rely on normalizing flows, which transform a base distribution into a complex posterior by composing many simple, invertible transformations. But flow-based models, while state of the art for NPE, are known to suffer from several limitations, including training instability and sharp trade-offs between representational power and computational cost. In this work, we demonstrate the effectiveness of conditional diffusions as an alternative to normalizing flows for NPE. Conditional diffusions address many of the challenges faced by flow-based methods. Our results show that, across a highly varied suite of benchmarking problems for NPE architectures, diffusions offer improved stability, superior accuracy, and faster training times, even with simpler, shallower models. These gains persist across a variety of different encoder or “summary network” architectures, as well as in situations where no summary network is required. The code will be publicly available at this https URL.
[AI-81] Teach Multimodal LLMs to Comprehend Electrocardiographic Images
链接: https://arxiv.org/abs/2410.19008
作者: Ruoqi Liu,Yuelin Bai,Xiang Yue,Ping Zhang
关键词-EN: essential non-invasive diagnostic, non-invasive diagnostic tool, assessing cardiac conditions, ECG image, ECG
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:The electrocardiogram (ECG) is an essential non-invasive diagnostic tool for assessing cardiac conditions. Existing automatic interpretation methods suffer from limited generalizability, focusing on a narrow range of cardiac conditions, and typically depend on raw physiological signals, which may not be readily available in resource-limited settings where only printed or digital ECG images are accessible. Recent advancements in multimodal large language models (MLLMs) present promising opportunities for addressing these challenges. However, the application of MLLMs to ECG image interpretation remains challenging due to the lack of instruction tuning datasets and well-established ECG image benchmarks for quantitative evaluation. To address these challenges, we introduce ECGInstruct, a comprehensive ECG image instruction tuning dataset of over one million samples, covering a wide range of ECG-related tasks from diverse data sources. Using ECGInstruct, we develop PULSE, an MLLM tailored for ECG image comprehension. In addition, we curate ECGBench, a new evaluation benchmark covering four key ECG image interpretation tasks across nine different datasets. Our experiments show that PULSE sets a new state-of-the-art, outperforming general MLLMs with an average accuracy improvement of 15% to 30%. This work highlights the potential of PULSE to enhance ECG interpretation in clinical practice.
[AI-82] rECGnition_v1.0: Arrhythmia detection using cardiologist-inspired multi-modal architecture incorporating demographic attributes in ECG
链接: https://arxiv.org/abs/2410.18985
作者: Shreya Srivastava,Durgesh Kumar,Jatin Bedi,Sandeep Seth,Deepak Sharma
关键词-EN: ECG manifested due, patient characteristics hinders, patient characteristics, ECG, Patient characteristic Encoding
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:A substantial amount of variability in ECG manifested due to patient characteristics hinders the adoption of automated analysis algorithms in clinical practice. None of the ECG annotators developed to date considers the characteristics of the patients in a multi-modal architecture. We employed the XGBoost model to analyze the UCI Arrhythmia dataset, linking patient characteristics to ECG morphological changes. The model accurately classified patient gender using discriminative ECG features with 87.75% confidence. We propose a novel multi-modal methodology for ECG analysis and arrhythmia classification that can help counter the variability in ECG related to patient-specific conditions. This deep learning algorithm, named rECGnition_v1.0 (robust ECG abnormality detection Version 1), fuses Beat Morphology with Patient Characteristics to create a discriminative feature map that understands the internal correlation between both modalities. A Squeeze and Excitation based Patient characteristic Encoding Network (SEPcEnet) has been introduced, considering the patient's demographics. The trained model outperformed various existing algorithms by achieving an overall F1-score of 0.986 for the ten-class arrhythmia classification on the MITDB, and achieved near-perfect prediction scores of ~0.99 for LBBB, RBBB, Premature ventricular contraction beat, Atrial premature beat, and Paced beat. Subsequently, the methodology was validated across INCARTDB, EDB and different class groups of MITDB using transfer learning. The generalizability test provided F1-scores of 0.980, 0.946, 0.977, and 0.980 for INCARTDB, EDB, MITDB AAMI, and MITDB Normal vs. Abnormal Classification, respectively. Therefore, with a more enhanced and comprehensive understanding of the patient being examined and their ECG for diverse CVD manifestations, the proposed rECGnition_v1.0 algorithm paves the way for its deployment in clinics.
计算机视觉
[CV-0] Model merging with SVD to tie the Knots
链接: https://arxiv.org/abs/2410.19735
作者: George Stoica,Pratik Ramesh,Boglarka Ecsedi,Leshem Choshen,Judy Hoffman
关键词-EN: Recent model merging, Recent model, distinct tasks, specializing in distinct, capable of solving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Recent model merging methods demonstrate that the parameters of fully-finetuned models specializing in distinct tasks can be combined into one model capable of solving all tasks without retraining. Yet, this success does not transfer well when merging LoRA finetuned models. We study this phenomenon and observe that the weights of LoRA finetuned models showcase a lower degree of alignment compared to their fully-finetuned counterparts. We hypothesize that improving this alignment is key to obtaining better LoRA model merges, and propose KnOTS to address this problem. KnOTS uses the SVD to jointly transform the weights of different LoRA models into an aligned space, where existing merging methods can be applied. In addition, we introduce a new benchmark that explicitly evaluates whether merged models are general models. Notably, KnOTS consistently improves LoRA merging by up to 4.3% across several vision and language benchmarks, including our new setting. We release our code at: this https URL.
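The alignment idea can be caricatured in a few lines: stack the LoRA task deltas for one layer, take a joint SVD to obtain a shared basis, express each delta in that basis, merge there (plain averaging below), and map back. Shapes, ranks, and the averaging merge are assumptions; KnOTS' actual transformation and the downstream merging method are richer than this.

```python
import torch

torch.manual_seed(0)
d, r = 32, 4
# Toy LoRA deltas (B @ A) for the same layer from two task-specific models.
deltas = [torch.randn(d, r) @ torch.randn(r, d) for _ in range(2)]

# Joint SVD over the concatenated deltas yields a shared, aligned basis U.
stacked = torch.cat(deltas, dim=1)            # (d, 2d)
U, S, Vt = torch.linalg.svd(stacked, full_matrices=False)

# Express each delta in the shared space, merge there, then map back.
coords = [U.t() @ dW for dW in deltas]        # aligned representations
merged = U @ torch.stack(coords).mean(dim=0)  # simple averaging as the merge step
print(merged.shape)
```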
[CV-1] Deep Learning for Classification of Inflammatory Bowel Disease Activity in Whole Slide Images of Colonic Histopathology
链接: https://arxiv.org/abs/2410.19690
作者: Amit Das,Tanmay Shukla,Naofumi Tomita,Ryland Richards,Laura Vidis,Bing Ren,Saeed Hassanpour
关键词-EN: Grading inflammatory bowel, inflammatory bowel disease, standardized histopathological scoring, histopathological scoring systems, scoring systems remains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Grading inflammatory bowel disease (IBD) activity using standardized histopathological scoring systems remains challenging due to resource constraints and inter-observer variability. In this study, we developed a deep learning model to classify activity grades in hematoxylin and eosin-stained whole slide images (WSIs) from patients with IBD, offering a robust approach for general pathologists. We utilized 2,077 WSIs from 636 patients treated at Dartmouth-Hitchcock Medical Center in 2018 and 2019, scanned at 40x magnification (0.25 micron/pixel). Board-certified gastrointestinal pathologists categorized the WSIs into four activity classes: inactive, mildly active, moderately active, and severely active. A transformer-based model was developed and validated using five-fold cross-validation to classify IBD activity. Using HoVerNet, we examined neutrophil distribution across activity grades. Attention maps from our model highlighted areas contributing to its prediction. The model classified IBD activity with weighted averages of 0.871 [95% Confidence Interval (CI): 0.860-0.883] for the area under the curve, 0.695 [95% CI: 0.674-0.715] for precision, 0.697 [95% CI: 0.678-0.716] for recall, and 0.695 [95% CI: 0.674-0.714] for F1-score. Neutrophil distribution was significantly different across activity classes. Qualitative evaluation of attention maps by a gastrointestinal pathologist suggested their potential for improved interpretability. Our model demonstrates robust diagnostic performance and could enhance consistency and efficiency in IBD activity assessment.
[CV-2] Inferring Neural Signed Distance Functions by Overfitting on Single Noisy Point Clouds through Finetuning Data-Driven based Priors
链接: https://arxiv.org/abs/2410.19680
作者: Chao Chen,Yu-Shen Liu,Zhizhong Han
关键词-EN: computer vision applications, vision applications, important to estimate, estimate an accurate, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurlPS 2024. Project page: this https URL
点击查看摘要
Abstract:It is important to estimate an accurate signed distance function (SDF) from a point cloud in many computer vision applications. The latest methods learn neural SDFs using either a data-driven or an overfitting-based strategy. However, these two kinds of methods suffer from either poor generalization or slow convergence, which limits their capability under challenging scenarios such as highly noisy point clouds. To resolve this issue, we propose a method that combines the strengths of both data-driven and overfitting-based methods for better generalization, faster inference, and higher accuracy in learning neural SDFs. We introduce a novel statistical reasoning algorithm in local regions which is able to fine-tune data-driven priors without signed distance supervision, clean point clouds, or point normals. This helps our method start with a good initialization and converge to a minimum in a much faster way. Our numerical and visual comparisons with the state-of-the-art methods show our superiority over these methods in surface reconstruction and point cloud denoising on widely used shape and scene benchmarks. The code is available at this https URL.
[CV-3] DiffGS: Functional Gaussian Splatting Diffusion NEURIPS2024
链接: https://arxiv.org/abs/2410.19657
作者: Junsheng Zhou,Weiqi Zhang,Yu-Shen Liu
关键词-EN: Gaussian Splatting, Gaussian Splatting remains, shown convincing performance, Gaussian Splatting functions, Gaussian
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024. Project page: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has shown convincing performance in rendering speed and fidelity, yet the generation of Gaussian Splatting remains a challenge due to its discreteness and unstructured nature. In this work, we propose DiffGS, a general Gaussian generator based on latent diffusion models. DiffGS is a powerful and efficient 3D generative model which is capable of generating Gaussian primitives at arbitrary numbers for high-fidelity rendering with rasterization. The key insight is to represent Gaussian Splatting in a disentangled manner via three novel functions to model Gaussian probabilities, colors and transforms. Through the novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with continuous Gaussian Splatting functions, where we then train a latent diffusion model with the target of generating these Gaussian Splatting functions both unconditionally and conditionally. Meanwhile, we introduce a discretization algorithm to extract Gaussians at arbitrary numbers from the generated functions via octree-guided sampling and optimization. We explore DiffGS for various tasks, including unconditional generation, conditional generation from text, image, and partial 3DGS, as well as Point-to-Gaussian generation. We believe that DiffGS provides a new direction for flexibly modeling and generating Gaussian Splatting.
[CV-4] Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models NEURIPS2024
链接: https://arxiv.org/abs/2410.19635
作者: Shenghao Fu,Junkai Yan,Qize Yang,Xihan Wei,Xiaohua Xie,Wei-Shi Zheng
关键词-EN: Recent vision foundation, extract universal representations, foundation models, Recent vision, show impressive abilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024
点击查看摘要
Abstract:Recent vision foundation models can extract universal representations and show impressive abilities in various tasks. However, their application on object detection is largely overlooked, especially without fine-tuning them. In this work, we show that frozen foundation models can be a versatile feature enhancer, even though they are not pre-trained for object detection. Specifically, we explore directly transferring the high-level image understanding of foundation models to detectors in the following two ways. First, the class token in foundation models provides an in-depth understanding of the complex scene, which facilitates decoding object queries in the detector’s decoder by providing a compact context. Additionally, the patch tokens in foundation models can enrich the features in the detector’s encoder by providing semantic details. Utilizing frozen foundation models as plug-and-play modules rather than the commonly used backbone can significantly enhance the detector’s performance while preventing the problems caused by the architecture discrepancy between the detector’s backbone and the foundation model. With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector’s backbone.
[CV-5] Multi-modal Motion Prediction using Temporal Ensembling with Learning-based Aggregation IROS2024
链接: https://arxiv.org/abs/2410.19606
作者: Kai-Yin Hong,Chieh-Chih Wang,Wen-Chieh Lin
关键词-EN: capturing multi-modal distributions, Recent years, Temporal Ensembling, multi-modal distributions, challenges remaining
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), accepted by IROS2024
点击查看摘要
Abstract:Recent years have seen a shift towards learning-based methods for trajectory prediction, with challenges remaining in addressing uncertainty and capturing multi-modal distributions. This paper introduces Temporal Ensembling with Learning-based Aggregation, a meta-algorithm designed to mitigate the issue of missing behaviors in trajectory prediction, which leads to inconsistent predictions across consecutive frames. Unlike conventional model ensembling, temporal ensembling leverages predictions from nearby frames to enhance spatial coverage and prediction diversity. By confirming predictions from multiple frames, temporal ensembling compensates for occasional errors in individual frame predictions. Furthermore, trajectory-level aggregation, often utilized in model ensembling, is insufficient for temporal ensembling due to a lack of consideration of traffic context and its tendency to assign candidate trajectories with incorrect driving behaviors to final predictions. We further emphasize the necessity of learning-based aggregation by utilizing mode queries within a DETR-like architecture for our temporal ensembling, leveraging the characteristics of predictions from nearby frames. Our method, validated on the Argoverse 2 dataset, shows notable improvements: a 4% reduction in minADE, a 5% decrease in minFDE, and a 1.16% reduction in the miss rate compared to the strongest baseline, QCNet, highlighting its efficacy and potential in autonomous driving.
[CV-6] Microplastic Identification Using AI-Driven Image Segmentation and GAN-Generated Ecological Context
链接: https://arxiv.org/abs/2410.19604
作者: Alex Dils,David Raymond,Jack Spottiswood,Samay Kodige,Dylan Karmin,Rikhil Kokal,Win Cowger,Chris Sadée
关键词-EN: Current methods, Generative Adversarial Network, Plastic Pollution Research, identification in water, water samples
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 6 pages one figure
点击查看摘要
Abstract:Current methods for microplastic identification in water samples are costly and require expert analysis. Here, we propose a deep learning segmentation model to automatically identify microplastics in microscopic images. We label images of microplastics from the Moore Institute for Plastic Pollution Research and employ a Generative Adversarial Network (GAN) to supplement and generate diverse training data. To verify the validity of the generated data, we conducted a reader study in which an expert was able to discern the generated microplastic from real microplastic at a rate of 68 percent. Our segmentation model trained on the combined data achieved an F1-Score of 0.91 on a diverse dataset, compared to 0.82 for the model trained without generated data. With these findings we aim to enhance the ability of both experts and citizens to detect microplastic across diverse ecological contexts, thereby improving the cost and accessibility of microplastic analysis.
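The data-supplementing step can be pictured with a short, assumption-laden sketch: real and GAN-generated (image, mask) pairs are simply concatenated into one training set. The random tensors below are stand-ins, not the Moore Institute data or the paper's GAN.

```python
# Sketch: mix real labeled pairs with GAN-generated pairs before training a
# segmentation model. Shapes and counts here are arbitrary stand-ins.
import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

real = TensorDataset(torch.rand(100, 1, 64, 64),
                     torch.randint(0, 2, (100, 1, 64, 64)).float())
gan = TensorDataset(torch.rand(300, 1, 64, 64),
                    torch.randint(0, 2, (300, 1, 64, 64)).float())

loader = DataLoader(ConcatDataset([real, gan]), batch_size=16, shuffle=True)
for images, masks in loader:
    pass  # train the segmentation model on the combined data here
```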
[CV-7] MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors
链接: https://arxiv.org/abs/2410.19590
作者: Fanqi Pu,Yifan Wang,Jiru Deng,Wenming Yang
关键词-EN: Perspective projection, extensively utilized, Perspective, object, geometric depth
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Perspective projection has been extensively utilized in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, due to depth errors originating from the object’s visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Direct prediction for the projected height unavoidably results in a loss of 2D priors, while multi-depth prediction with complex branches does not fully leverage geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also try to systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. Additionally, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder only dependent on visual features, providing 2D priors and initializing object queries without the disturbance of 3D detection. To further optimize and fine-tune input tokens of the transformer decoder, we also introduce a Region Segment Head (RSH) that generates enhanced features and segment embeddings. Our monocular method demonstrates state-of-the-art performance on the KITTI benchmark without extra data. Code is available at this https URL.
[CV-8] Diverse Sign Language Translation
链接: https://arxiv.org/abs/2410.19586
作者: Xin Shen,Lei Shen,Shaozu Yuan,Heming Du,Haiyang Sun,Xin Yu
关键词-EN: valid textual interpretations, sign language expression, single sign language, sign language, sign language translation
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Like spoken languages, a single sign language expression could correspond to multiple valid textual interpretations. Hence, learning a rigid one-to-one mapping for sign language translation (SLT) models might be inadequate, particularly in the case of limited data. In this work, we introduce a Diverse Sign Language Translation (DivSLT) task, aiming to generate diverse yet accurate translations for sign language videos. Firstly, we employ large language models (LLM) to generate multiple references for the widely-used CSL-Daily and PHOENIX14T SLT datasets. Here, native speakers are invited only to touch up inaccurate references, thus significantly improving the annotation efficiency. Secondly, we provide a benchmark model to spur research on this task. Specifically, we investigate multi-reference training strategies to enable our DivSLT model to achieve diverse translations. Then, to enhance translation accuracy, we employ the max-reward-driven reinforcement learning objective that maximizes the reward of the translated result. Additionally, we utilize multiple metrics to assess the accuracy, diversity, and semantic precision of the DivSLT task. Experimental results on the enriched datasets demonstrate that our DivSLT method achieves not only better translation performance but also diverse translation results.
[CV-9] FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation ECCV2024
链接: https://arxiv.org/abs/2410.19573
作者: Tianyu Zhang,Guocheng Qian,Jin Xie,Jian Yang
关键词-EN: Point cloud frame, cloud frame interpolation, accurate scene flow, Point cloud, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in ECCV 2024
点击查看摘要
Abstract:Point cloud frame interpolation is a challenging task that involves accurate scene flow estimation across frames and maintaining the geometry structure. Prevailing techniques often rely on pre-trained motion estimators or intensive testing-time optimization, resulting in compromised interpolation accuracy or prolonged inference. This work presents FastPCI that introduces Pyramid Convolution-Transformer architecture for point cloud frame interpolation. Our hybrid Convolution-Transformer improves the local and long-range feature learning, while the pyramid network offers multilevel features and reduces the computation. In addition, FastPCI proposes a unique Dual-Direction Motion-Structure block for more accurate scene flow estimation. Our design is motivated by two facts: (1) accurate scene flow preserves 3D structure, and (2) point cloud at the previous timestep should be reconstructable using reverse motion from future timestep. Extensive experiments show that FastPCI significantly outperforms the state-of-the-art PointINet and NeuralPCI with notable gains (e.g. 26.6% and 18.3% reduction in Chamfer Distance in KITTI), while being more than 10x and 600x faster, respectively. Code is available at this https URL
[CV-10] GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
链接: https://arxiv.org/abs/2410.19552
作者: Hosam Elgendy,Ahmed Sharshar,Ahmed Aboeitta,Yasser Ashraf,Mohsen Guizani
关键词-EN: Detecting temporal, urban planning, landscapes is critical, critical for applications, applications like environmental
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Detecting temporal changes in geographical landscapes is critical for applications like environmental monitoring and urban planning. While remote sensing data is abundant, existing vision-language models (VLMs) often fail to capture temporal dynamics effectively. This paper addresses these limitations by introducing an annotated dataset of video frame pairs to track evolving geographical patterns over time. Using fine-tuning techniques like Low-Rank Adaptation (LoRA), quantized LoRA (QLoRA), and model pruning on models such as Video-LLaVA and LLaVA-NeXT-Video, we significantly enhance VLM performance in processing remote sensing temporal changes. Results show significant improvements, with the best performance achieving a BERT score of 0.864 and ROUGE-1 score of 0.576, demonstrating superior accuracy in describing land-use transformations.
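For readers unfamiliar with LoRA, one of the techniques named above, a minimal sketch with the `peft` library follows; the small OPT base model and the `q_proj`/`v_proj` target modules are stand-ins, since the paper fine-tunes Video-LLaVA and LLaVA-NeXT-Video, whose module names differ.

```python
# Sketch of LoRA fine-tuning with the `peft` library. The base model is a
# small stand-in, not Video-LLaVA; target module names vary per architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only low-rank adapters are trainable
```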
[CV-11] Utilizing Image Transforms and Diffusion Models for Generative Modeling of Short and Long Time Series NEURIPS2024
链接: https://arxiv.org/abs/2410.19538
作者: Ilan Naiman,Nimrod Berman,Itai Pemper,Idan Arbiv,Gal Fadlon,Omri Azencot
关键词-EN: interest surrounding generative, surrounding generative modeling, time series data, surge in interest, interest surrounding
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024; The first two authors contributed equally
点击查看摘要
Abstract:Lately, there has been a surge in interest surrounding generative modeling of time series data. Most existing approaches are designed either to process short sequences or to handle long-range sequences. This dichotomy can be attributed to gradient issues with recurrent networks, computational costs associated with transformers, and limited expressiveness of state space models. Towards a unified generative model for varying-length time series, we propose in this work to transform sequences into images. By employing invertible transforms such as the delay embedding and the short-time Fourier transform, we unlock three main advantages: i) We can exploit advanced diffusion vision models; ii) We can remarkably process short- and long-range inputs within the same framework; and iii) We can harness recent and established tools proposed in the time series to image literature. We validate the effectiveness of our method through a comprehensive evaluation across multiple tasks, including unconditional generation, interpolation, and extrapolation. We show that our approach achieves consistently state-of-the-art results against strong baselines. In the unconditional generation tasks, we show remarkable mean improvements of 58.17% over previous diffusion models in the short discriminative score and 132.61% in the (ultra-)long classification scores. Code is at this https URL.
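The invertible sequence-to-image idea can be sketched with SciPy's short-time Fourier transform, one of the transforms named in the abstract; the signal, window length, and two-channel real/imaginary packing below are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: turn a 1D time series into a 2D "image" with the STFT, and invert
# it back with the ISTFT, illustrating why the transform loses no information.
import numpy as np
from scipy.signal import stft, istft

fs = 100.0
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(t.size)

f, tau, Z = stft(x, fs=fs, nperseg=64)   # complex spectrogram (freq x time)
img = np.stack([Z.real, Z.imag])         # 2-channel input for a vision model

_, x_rec = istft(Z, fs=fs, nperseg=64)   # invert back to a sequence
print(np.allclose(x, x_rec[: x.size], atol=1e-8))
```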
[CV-12] MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset
链接: https://arxiv.org/abs/2410.19488
作者: Xin Shen,Heming Du,Hongwei Sheng,Shuyun Wang,Hui Chen,Huiqiang Chen,Zhuojie Wu,Xiaobiao Du,Jiaying Ying,Ruihan Lu,Qingzheng Xu,Xin Yu
关键词-EN: Isolated Sign Language, Sign Language Recognition, identifying individual sign, Sign Language, individual sign language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Isolated Sign Language Recognition (ISLR) focuses on identifying individual sign language glosses. Considering the diversity of sign languages across geographical regions, developing region-specific ISLR datasets is crucial for supporting communication and research. Auslan, as a sign language specific to Australia, still lacks a dedicated large-scale word-level dataset for the ISLR task. To fill this gap, we curate the first large-scale Multi-view Multi-modal Word-Level Australian Sign Language recognition dataset, dubbed MM-WLAuslan. Compared to other publicly available datasets, MM-WLAuslan exhibits three significant advantages: (1) the largest amount of data, (2) the most extensive vocabulary, and (3) the most diverse multi-modal camera views. Specifically, we record 282K+ sign videos covering 3,215 commonly used Auslan glosses presented by 73 signers in a studio environment. Moreover, our filming system includes two different types of cameras, i.e., three Kinect-V2 cameras and a RealSense camera. We position cameras hemispherically around the front half of the model and simultaneously record videos using all four cameras. Furthermore, we benchmark results with state-of-the-art methods for various multi-modal ISLR settings on MM-WLAuslan, including multi-view, cross-camera, and cross-view. Experiment results indicate that MM-WLAuslan is a challenging ISLR dataset, and we hope this dataset will contribute to the development of Auslan and the advancement of sign languages worldwide. All datasets and benchmarks are available at MM-WLAuslan.
[CV-13] x-RAGE: eXtended Reality – Action Gesture Events Dataset
链接: https://arxiv.org/abs/2410.19486
作者: Vivek Parmar,Dwijay Bane,Syed Shakib Sarwar,Kleber Stangherlin,Barbara De Salvo,Manan Suri
关键词-EN: Metaverse and focus, based human-computer interaction, gained significance, recent years gesture, recent years
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:With the emergence of the Metaverse and the focus on wearable devices in recent years, gesture-based human-computer interaction has gained significance. To enable gesture recognition for VR/AR headsets and glasses, several datasets focusing on egocentric (i.e., first-person) views have emerged in recent years. However, standard frame-based vision suffers from limitations in data bandwidth requirements as well as the ability to capture fast motions. To overcome these limitations, bio-inspired approaches such as event-based cameras present an attractive alternative. In this work, we present the first event-camera based egocentric gesture dataset for enabling neuromorphic, low-power solutions for XR-centric gesture recognition. The dataset has been made available publicly at the following URL: this https URL.
[CV-14] Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization ECCV2024
链接: https://arxiv.org/abs/2410.19483
作者: Weihang Liu,Xue Xian Zheng,Jingyi Yu,Xin Lou
关键词-EN: Gaussian Splat, Neural Radiance Fields, popular radiance field, recent popular radiance, radiance field models
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: accepted by ECCV2024
点击查看摘要
Abstract:The recent popular radiance field models, exemplified by Neural Radiance Fields (NeRF), Instant-NGP and 3D Gaussian Splatting, are designed to represent 3D content by training a model for each individual scene. This unique characteristic of scene representation and per-scene training distinguishes radiance field models from other neural models, because complex scenes necessitate models with higher representational capacity and vice versa. In this paper, we propose content-aware radiance fields, aligning the model complexity with the scene intricacies through Adversarial Content-Aware Quantization (A-CAQ). Specifically, we make the bitwidth of parameters differentiable and trainable, tailored to the unique characteristics of specific scenes and requirements. The proposed framework has been assessed on Instant-NGP, a well-known NeRF variant, and evaluated using various datasets. Experimental results demonstrate a notable reduction in computational complexity, while preserving the requisite reconstruction and rendering quality, making it beneficial for practical deployment of radiance field models. Codes are available at this https URL.
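A minimal sketch of making bitwidth differentiable with a straight-through estimator is given below; the clamping range, scaling rule, and bit-cost term are assumptions, and the adversarial content-aware objective of A-CAQ is not reproduced.

```python
# Hedged sketch: learnable quantization bitwidth via a straight-through
# estimator, so gradients can trade distortion against bit cost.
import torch

def ste_round(x):
    return x + (torch.round(x) - x).detach()  # identity gradient through round

class LearnedBitwidthQuant(torch.nn.Module):
    def __init__(self, init_bits=8.0):
        super().__init__()
        self.bits = torch.nn.Parameter(torch.tensor(init_bits))

    def forward(self, w):
        levels = 2.0 ** self.bits.clamp(2.0, 16.0)  # differentiable in bits
        scale = w.detach().abs().max() + 1e-8
        return ste_round(w / scale * levels) / levels * scale

q = LearnedBitwidthQuant()
w = torch.randn(128)
# distortion + an illustrative bit-cost penalty pulling bits downward
loss = (q(w) - w).pow(2).mean() + 1e-3 * q.bits
loss.backward()
```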
[CV-15] Evaluation of strategies for efficient rate-distortion NeRF streaming
链接: https://arxiv.org/abs/2410.19459
作者: Pedro Martin,António Rodrigues,João Ascenso,Maria Paula Queluz
关键词-EN: Neural Radiance Fields, enabling highly realistic, detailed scene reconstructions, Radiance Fields, Neural Radiance
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Neural Radiance Fields (NeRF) have revolutionized the field of 3D visual representation by enabling highly realistic and detailed scene reconstructions from a sparse set of images. NeRF uses a volumetric functional representation that maps 3D points to their corresponding colors and opacities, allowing for photorealistic view synthesis from arbitrary viewpoints. Despite its advancements, the efficient streaming of NeRF content remains a significant challenge due to the large amount of data involved. This paper investigates the rate-distortion performance of two NeRF streaming strategies: pixel-based and neural network (NN) parameter-based streaming. While in the former, images are coded and then transmitted throughout the network, in the latter, the respective NeRF model parameters are coded and transmitted instead. This work also highlights the trade-offs in complexity and performance, demonstrating that the NN parameter-based strategy generally offers superior efficiency, making it suitable for one-to-many streaming scenarios.
[CV-16] Fusion-then-Distillation: Toward Cross-modal Positive Distillation for Domain Adaptive 3D Semantic Segmentation
链接: https://arxiv.org/abs/2410.19446
作者: Yao Wu,Mingwei Xing,Yachao Zhang,Yuan Xie,Yanyun Qu
关键词-EN: target-domain data, adapted to target-domain, source-domain data, cross-modal, data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In cross-modal unsupervised domain adaptation, a model trained on source-domain data (e.g., synthetic) is adapted to target-domain data (e.g., real-world) without access to target annotation. Previous methods seek to mutually mimic cross-modal outputs in each domain, which enforces a class probability distribution that is agreeable in different domains. However, they overlook the complementarity brought by the heterogeneous fusion in cross-modal learning. In light of this, we propose a novel fusion-then-distillation (FtD++) method to explore cross-modal positive distillation of the source and target domains for 3D semantic segmentation. FtD++ realizes distribution consistency between outputs not only for 2D images and 3D point clouds but also for source-domain and augment-domain. Specifically, our method contains three key ingredients. First, we present a model-agnostic feature fusion module to generate the cross-modal fusion representation for establishing a latent space. In this space, the two modalities are encouraged to achieve maximum correlation and complementarity. Second, the proposed cross-modal positive distillation preserves the complete information of multi-modal input and combines the semantic content of the source domain with the style of the target domain, thereby achieving domain-modality alignment. Finally, cross-modal debiased pseudo-labeling is devised to model the uncertainty of pseudo-labels via a self-training manner. Extensive experiments report state-of-the-art results on several domain adaptive scenarios under unsupervised and semi-supervised settings. Code is available at this https URL.
[CV-17] Balancing the Scales: Enhancing Fairness in Facial Expression Recognition with Latent Alignment
链接: https://arxiv.org/abs/2410.19444
作者: Syed Sameen Ahmad Rizvi,Aryan Seth,Pratik Narang
关键词-EN: Automatically recognizing emotional, Automatically recognizing, Facial Expression Recognition, facial expression, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Automatically recognizing emotional intent using facial expression has been a thoroughly investigated topic in the realm of computer vision. Facial Expression Recognition (FER), being a supervised learning task, relies heavily on substantially large data exemplifying various socio-cultural demographic attributes. Over the past decade, several real-world in-the-wild FER datasets have been proposed, collected through crowd-sourcing or web-scraping. However, most of these practically used datasets employ a manual annotation methodology for labeling emotional intent, which inherently propagates individual demographic biases. Moreover, these datasets also lack an equitable representation of various socio-cultural demographic groups, thereby inducing a class imbalance. Bias analysis and its mitigation have been investigated across multiple domains and problem settings; however, in the FER domain, this is a relatively less explored area. This work leverages representation learning based on latent spaces to mitigate bias in facial expression recognition systems, thereby enhancing a deep learning model’s fairness and overall accuracy.
[CV-18] Transductive Learning for Near-Duplicate Image Detection in Scanned Photo Collections ICDAR2023
链接: https://arxiv.org/abs/2410.19437
作者: Francesc Net,Marc Folia,Pep Casals,Lluis Gomez
关键词-EN: document management company, near-duplicate image detection, paper presents, presents a comparative, comparative study
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Published in ICDAR 2023
点击查看摘要
Abstract:This paper presents a comparative study of near-duplicate image detection techniques in a real-world use case scenario, where a document management company is commissioned to manually annotate a collection of scanned photographs. Detecting duplicate and near-duplicate photographs can reduce the time spent on manual annotation by archivists. This real use case differs from laboratory settings as the deployment dataset is available in advance, allowing the use of transductive learning. We propose a transductive learning approach that leverages state-of-the-art deep learning architectures such as convolutional neural networks (CNNs) and Vision Transformers (ViTs). Our approach involves pre-training a deep neural network on a large dataset and then fine-tuning the network on the unlabeled target collection with self-supervised learning. The results show that the proposed approach outperforms the baseline methods in the task of near-duplicate image detection on the UKBench and an in-house private dataset.
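The retrieval backbone of such a system can be sketched as follows: descriptors from a pre-trained network plus a cosine-similarity threshold. The ResNet-50 backbone and the 0.9 threshold are illustrative choices, and the paper's transductive self-supervised fine-tuning step is omitted.

```python
# Sketch: embed images with a pre-trained backbone and flag near-duplicate
# pairs by cosine similarity over the pooled features.
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # use pooled features as descriptors
backbone.eval()

imgs = torch.rand(10, 3, 224, 224)  # stand-in for the scanned collection
with torch.no_grad():
    feats = torch.nn.functional.normalize(backbone(imgs), dim=1)
sim = feats @ feats.T               # pairwise cosine similarities
pairs = (sim.triu(1) > 0.9).nonzero()   # candidate near-duplicate pairs
```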
[CV-19] Paint Bucket Colorization Using Anime Character Color Design Sheets
链接: https://arxiv.org/abs/2410.19424
作者: Yuekun Dai,Qinyue Li,Shangchen Zhou,Yihang Luo,Chongyi Li,Chen Change Loy
关键词-EN: digital artists manually, artists manually colorize, guided by RGB, paint bucket tool, color design sheets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Extension of arXiv:2403.18342 ; Project page at this https URL
点击查看摘要
Abstract:Line art colorization plays a crucial role in hand-drawn animation production, where digital artists manually colorize segments using a paint bucket tool, guided by RGB values from character color design sheets. This process, often called paint bucket colorization, involves two main tasks: keyframe colorization, where colors are applied according to the character’s color design sheet, and consecutive frame colorization, where these colors are replicated across adjacent frames. Current automated colorization methods primarily focus on reference-based and segment-matching approaches. However, reference-based methods often fail to accurately assign specific colors to each region, while matching-based methods are limited to consecutive frame colorization and struggle with issues like significant deformation and occlusion. In this work, we introduce inclusion matching, which allows the network to understand the inclusion relationships between segments, rather than relying solely on direct visual correspondences. By integrating this approach with segment parsing and color warping modules, our inclusion matching pipeline significantly improves performance in both keyframe colorization and consecutive frame colorization. To support our network’s training, we have developed a unique dataset named PaintBucket-Character, which includes rendered line arts alongside their colorized versions and shading annotations for various 3D characters. To replicate industry animation data formats, we also created color design sheets for each character, with semantic information for each color and standard pose reference images. Experiments highlight the superiority of our method, demonstrating accurate and consistent colorization across both our proposed benchmarks and hand-drawn animations.
[CV-20] Unified Cross-Modal Image Synthesis with Hierarchical Mixture of Product-of-Experts
链接: https://arxiv.org/abs/2410.19378
作者: Reuben Dorent,Nazim Haouchine,Alexandra Golby,Sarah Frisken,Tina Kapur,William Wells
关键词-EN: auto-encoders called MMHVAE, hierarchical variational auto-encoders, variational auto-encoders called, multimodal hierarchical variational, called MMHVAE
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Manuscript under review
点击查看摘要
Abstract:We propose a deep mixture of multimodal hierarchical variational auto-encoders called MMHVAE that synthesizes missing images from observed images in different modalities. MMHVAE’s design focuses on tackling four challenges: (i) creating a complex latent representation of multimodal data to generate high-resolution images; (ii) encouraging the variational distributions to estimate the missing information needed for cross-modal image synthesis; (iii) learning to fuse multimodal information in the context of missing data; (iv) leveraging dataset-level information to handle incomplete data sets at training time. Extensive experiments are performed on the challenging problem of pre-operative brain multi-parametric magnetic resonance and intra-operative ultrasound imaging.
[CV-21] Capsule Endoscopy Multi-classification via Gated Attention and Wavelet Transformations
链接: https://arxiv.org/abs/2410.19363
作者: Lakshmi Srinivas Panchananam,Praveen Kumar Chandaliya,Kishor Upla,Kiran Raja
关键词-EN: tract significantly influence, gastrointestinal tract significantly, tract significantly, significantly influence, influence the patient
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Capsule Vision 2024 Challenge
点击查看摘要
Abstract:Abnormalities in the gastrointestinal tract significantly influence the patient’s health and require a timely diagnosis for effective treatment. With this in mind, an effective automatic classification of these abnormalities from a video capsule endoscopy (VCE) frame is crucial for improving diagnostic workflows. This work presents the process of developing and evaluating a novel model designed to classify gastrointestinal anomalies from a VCE video frame. Integrating the Omni Dimensional Gated Attention (OGA) mechanism and wavelet transformation techniques into the model’s architecture allowed the model to focus on the most critical areas in the endoscopy images, reducing noise and irrelevant features. This is particularly advantageous in capsule endoscopy, where images often contain a high degree of variability in texture and color. Wavelet transformations helped by efficiently capturing spatial and frequency-domain information, improving feature extraction, especially for detecting subtle features in the VCE frames. Furthermore, the features extracted from the Stationary Wavelet Transform and Discrete Wavelet Transform are concatenated channel-wise to capture multiscale features, which are essential for detecting polyps, ulcerations, and bleeding. This approach improves classification accuracy on imbalanced capsule endoscopy datasets. The proposed model achieved training and validation accuracies of 92.76% and 91.19%, respectively, with training and validation losses of 0.2057 and 0.2700. It also achieved a Balanced Accuracy of 94.81%, AUC of 87.49%, F1-score of 91.11%, precision of 91.17%, recall of 91.19% and specificity of 98.44%. Additionally, the model’s performance is benchmarked against two base models, VGG16 and ResNet50, demonstrating its enhanced ability to identify and classify a range of gastrointestinal abnormalities accurately.
[CV-22] FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
链接: https://arxiv.org/abs/2410.19355
作者: Zhengyao Lv,Chenyang Si,Junhao Song,Zhenyu Yang,Yu Qiao,Ziwei Liu,Kwan-Yee K. Wong
关键词-EN: training-free strategy designed, video, video quality, high-quality generation, video diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache, which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperforms existing methods in both inference speed and video quality.
[CV-23] Context-Based Visual-Language Place Recognition
链接: https://arxiv.org/abs/2410.19341
作者: Soojin Woo,Seong-Woo Kim
关键词-EN: localization and SLAM, vision-based robot localization, Place Recognition, robot localization, Visual Place Recognition
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:In vision-based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision-based place recognition relies on low-level visual features. Despite significant progress in recent years, place recognition based on low-level visual features is challenging when there are changes in scene appearance. To address this, end-to-end training approaches have been proposed to overcome the limitations of hand-crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data to train models, presenting a significant limitation. Methods that leverage high-level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and does not require additional training. Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using a real-world public dataset. The experiments demonstrate that our method outperforms non-learned image representation techniques and off-the-shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo-soojin/context-based-vlpr.
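A toy sketch of the descriptor-and-match step follows, with random arrays standing in for the pixel-level embeddings of a language-driven segmentation model; the mean pooling here is an assumption, as the paper's aggregation may differ.

```python
# Sketch: pool pixel-level semantic embeddings into an image descriptor and
# retrieve the most similar database image by cosine similarity.
import numpy as np

def describe(pixel_embed):
    # pixel_embed: (H, W, C) embeddings -> L2-normalized mean vector
    v = pixel_embed.reshape(-1, pixel_embed.shape[-1]).mean(0)
    return v / (np.linalg.norm(v) + 1e-12)

db = [np.random.rand(60, 80, 512) for _ in range(50)]   # stand-in embeddings
query = np.random.rand(60, 80, 512)

db_desc = np.stack([describe(e) for e in db])
scores = db_desc @ describe(query)
print("best match:", int(scores.argmax()))
```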
[CV-24] DECADE: Towards Designing Efficient-yet-Accurate Distance Estimation Modules for Collision Avoidance in Mobile Advanced Driver Assistance Systems
链接: https://arxiv.org/abs/2410.19336
作者: Muhammad Zaeem Shahzad,Muhammad Abdullah Hanif,Muhammad Shafique
关键词-EN: Driver Assistance Systems, Advanced Driver Assistance, make Advanced Driver, Assistance Systems, Advanced Driver
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 17 figures, 4 tables
点击查看摘要
Abstract:The proliferation of smartphones and other mobile devices provides a unique opportunity to make Advanced Driver Assistance Systems (ADAS) accessible to everyone in the form of an application empowered by low-cost Machine/Deep Learning (ML/DL) models to enhance road safety. For the critical feature of Collision Avoidance in Mobile ADAS, lightweight Deep Neural Networks (DNN) for object detection exist, but conventional pixel-wise depth/distance estimation DNNs are vastly more computationally expensive making them unsuitable for a real-time application on resource-constrained devices. In this paper, we present a distance estimation model, DECADE, that processes each detector output instead of constructing pixel-wise depth/disparity maps. In it, we propose a pose estimation DNN to estimate allocentric orientation of detections to supplement the distance estimation DNN in its prediction of distance using bounding box features. We demonstrate that these modules can be attached to any detector to extend object detection with fast distance estimation. Evaluation of the proposed modules with attachment to and fine-tuning on the outputs of the YOLO object detector on the KITTI 3D Object Detection dataset achieves state-of-the-art performance with 1.38 meters in Mean Absolute Error and 7.3% in Mean Relative Error in the distance range of 0-150 meters. Our extensive evaluation scheme not only evaluates class-wise performance, but also evaluates range-wise accuracy especially in the critical range of 0-70m.
[CV-25] Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
链接: https://arxiv.org/abs/2410.19324
作者: Emiel Hoogeboom,Thomas Mensink,Jonathan Heek,Kay Lamerigts,Ruiqi Gao,Tim Salimans
关键词-EN: resolution image synthesis, diffusion models, popular choice, pixel-space diffusion models, high resolution
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can in fact be very competitive to latent approaches both in quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128 and ImageNet256. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at high resolution with fewer parameters, rather than using more parameters but at a lower resolution. When combining these three steps with recently proposed tricks like guidance intervals, we obtain a family of pixel-space diffusion models we call Simple Diffusion v2 (SiD2).
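Step 1 of the recipe can be pictured as a weighting over log-SNR. The sketch below assumes the common form w(λ) = σ(b − λ) after Kingma & Gao (2023), with an illustrative bias value rather than the paper's prescribed hyper-parameters.

```python
# Hedged sketch of a sigmoid loss weighting for diffusion training: the
# per-timestep epsilon-prediction loss is reweighted by sigmoid(bias - logSNR),
# down-weighting very high-SNR (easy) timesteps.
import torch

def sigmoid_weight(log_snr, bias=-3.0):
    # assumption: w(lambda) = sigmoid(bias - lambda); bias is illustrative
    return torch.sigmoid(bias - log_snr)

log_snr = torch.linspace(-10, 10, 5)
eps, eps_hat = torch.randn(5, 8), torch.randn(5, 8)
per_t = ((eps - eps_hat) ** 2).mean(dim=1)
loss = (sigmoid_weight(log_snr) * per_t).mean()
```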
[CV-26] Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks
链接: https://arxiv.org/abs/2410.19307
作者: Zhengyang Lu,Tianhao Guo,Feng Wang
关键词-EN: Classical Chinese poetry, Classical Chinese, computational translation, represent the epitome, relationship poses
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Classical Chinese poetry and painting represent the epitome of artistic expression, but the abstract and symbolic nature of their relationship poses a significant challenge for computational translation. Most existing methods rely on large-scale paired datasets, which are scarce in this domain. In this work, we propose a semi-supervised approach using cycle-consistent adversarial networks to leverage the limited paired data and large unpaired corpus of poems and paintings. The key insight is to learn bidirectional mappings that enforce semantic alignment between the visual and textual modalities. We introduce novel evaluation metrics to assess the quality, diversity, and consistency of the generated poems and paintings. Extensive experiments are conducted on a new Chinese Painting Description Dataset (CPDD). The proposed model outperforms previous methods, showing promise in capturing the symbolic essence of artistic expression. Codes are available online at this https URL.
[CV-27] Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting NEURIPS2024
链接: https://arxiv.org/abs/2410.19294
作者: Xingyu Zhu,Beier Zhu,Yi Tan,Shuo Wang,Yanbin Hao,Hanwang Zhang
关键词-EN: shown impressive generalization, impressive generalization capacities, text descriptions, Vision-language models, shown impressive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 Spotlight
点击查看摘要
Abstract:Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by their quality. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-Free prompt distribution learning and bias correction framework, dubbed as Frolic, which boosts zero-shot performance without the need for labeled data. Specifically, our Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. This fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also circumvents the necessity for hyper-parameter tuning. Extensive experimental results across 16 datasets demonstrate the efficacy of our approach, particularly outperforming the state-of-the-art by an average of 2.6% on 10 datasets with CLIP ViT-B/16 and achieving an average margin of 1.5% on ImageNet and its five distribution shifts with CLIP ViT-B/16. Codes are available in this https URL.
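The label-free bias-correction idea can be sketched in a few lines: estimate a class prior from the model's own predictions on unlabeled data and subtract its log from the logits. The prior estimate and the exact adjustment rule here are assumptions, not Frolic's precise procedure.

```python
# Hedged sketch of label-free logit adjustment for a zero-shot classifier:
# the marginal over classes, estimated from unlabeled predictions, serves as
# a prior that is divided out of the predictive distribution.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))        # stand-in zero-shot CLIP logits

probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
prior = probs.mean(axis=0)                  # estimated marginal over classes
adjusted = logits - np.log(prior + 1e-12)   # label-free logit adjustment
preds = adjusted.argmax(axis=1)
```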
[CV-28] VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs
链接: https://arxiv.org/abs/2410.19245
作者: Zixiao Zhao,Jing Sun,Zhiyuan Wei,Cheng-Hao Cai,Zhe Hou,Jin Song Dong
关键词-EN: demonstrated foundational generative, foundational generative capabilities, large language models, detailed task descriptions, automated programming
类目: oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:In the field of automated programming, large language models (LLMs) have demonstrated foundational generative capabilities when given detailed task descriptions. However, their current functionalities are primarily limited to function-level development, restricting their effectiveness in complex project environments and specific application scenarios, such as complicated image-processing tasks. This paper presents a multi-agent framework that utilises a hybrid set of LLMs, including GPT-4o and locally deployed open-source models, which collaboratively complete auto-programming tasks. Each agent plays a distinct role in the software development cycle, collectively forming a virtual organisation that works together to produce software products. By establishing a tree-structured thought distribution and development mechanism across project, module, and function levels, this framework offers a cost-effective and efficient solution for code generation. We evaluated our approach using benchmark datasets, and the experimental results demonstrate that VisionCoder significantly outperforms existing methods in image processing auto-programming tasks.
[CV-29] Prompting Continual Person Search ACM-MM2024
链接: https://arxiv.org/abs/2410.19239
作者: Pengcheng Zhang,Xiaohan Yu,Xiao Bai,Jin Zheng,Xin Ning
关键词-EN: person search, continual person search, person search techniques, person, search
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ACM MM 2024
点击查看摘要
Abstract:The development of person search techniques has been greatly promoted in recent years for its superior practicality and challenging goals. Despite their significant progress, existing person search models still lack the ability to continually learn from increasing real-world data and adaptively process input from different domains. To this end, this work introduces the continual person search task that sequentially learns on multiple domains and then performs person search on all seen domains. This requires balancing the stability and plasticity of the model to continually learn new knowledge without catastrophic forgetting. For this, we propose a Prompt-based Continual Person Search (PoPS) model in this paper. First, we design a compositional person search transformer to construct an effective pre-trained transformer without exhaustive pre-training from scratch on large-scale person search data. This serves as the foundation for prompt-based continual learning. On top of that, we design a domain incremental prompt pool with a diverse attribute matching module. For each domain, we independently learn a set of prompts to encode the domain-oriented knowledge. Meanwhile, we jointly learn a group of diverse attribute projections and prototype embeddings to capture discriminative domain attributes. By matching an input image with the learned attributes across domains, the learned prompts can be properly selected for model inference. Extensive experiments are conducted to validate the proposed method for continual person search. The source code is available at this https URL.
[CV-30] Prototypical Hash Encoding for On-the-Fly Fine-Grained Category Discovery NEURIPS2024
链接: https://arxiv.org/abs/2410.19213
作者: Haiyang Zheng,Nan Pu,Wenjing Li,Nicu Sebe,Zhun Zhong
关键词-EN: newly-coming stream data, Category Discovery, category knowledge contained, stream data, labeled data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:In this paper, we study a practical yet challenging task, On-the-fly Category Discovery (OCD), aiming to online discover the newly-coming stream data that belong to both known and unknown classes, by leveraging only known category knowledge contained in labeled data. Previous OCD methods employ the hash-based technique to represent old/new categories by hash codes for instance-wise inference. However, directly mapping features into low-dimensional hash space not only inevitably damages the ability to distinguish classes but also causes a “high sensitivity” issue, especially for fine-grained classes, leading to inferior performance. To address these issues, we propose a novel Prototypical Hash Encoding (PHE) framework consisting of Category-aware Prototype Generation (CPG) and Discriminative Category Encoding (DCE) to mitigate the sensitivity of hash code while preserving rich discriminative information contained in high-dimension feature space, in a two-stage projection fashion. CPG enables the model to fully capture the intra-category diversity by representing each category with multiple prototypes. DCE boosts the discrimination ability of hash code with the guidance of the generated category prototypes and the constraint of minimum separation distance. By jointly optimizing CPG and DCE, we demonstrate that these two components are mutually beneficial towards an effective OCD. Extensive experiments show the significant superiority of our PHE over previous methods, e.g., obtaining an improvement of +5.3% in ALL ACC averaged on all datasets. Moreover, due to the nature of the interpretable prototypes, we visually analyze the underlying mechanism of how PHE helps group certain samples into either known or unknown categories. Code is available at this https URL.
[CV-31] Classifying Bicycle Infrastructure Using On-Bike Street-Level Images ITSC2024
链接: https://arxiv.org/abs/2410.19194
作者: Kal Backman,Ben Beck,Dana Kulić
关键词-EN: cycling infrastructure, sustainable transportation, offers an attractive, attractive option, option for sustainable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures, presented at ITSC 2024
点击查看摘要
Abstract:While cycling offers an attractive option for sustainable transportation, many potential cyclists are discouraged from taking up cycling due to the lack of suitable and safe infrastructure. Efficiently mapping cycling infrastructure across entire cities is necessary to advance our understanding of how to provide connected networks of high-quality infrastructure. Therefore we propose a system capable of classifying available cycling infrastructure from on-bike smartphone camera data. The system receives an image sequence as input, temporally analyzing the sequence to account for sparsity of signage. The model outputs cycling infrastructure class labels defined by a hierarchical classification system. Data is collected by participant cyclists covering 7,006 km across the Greater Melbourne region and is automatically labeled via a GPS and OpenStreetMap database matching algorithm. The proposed model achieved an accuracy of 95.38%, an increase in performance of 7.55% compared to the non-temporal model. The model demonstrated robustness to an extreme absence of image features, losing only 6.6% in accuracy after 90% of images were replaced with blank images. This work is the first to classify cycling infrastructure using only street-level imagery collected from bike-mounted mobile phone cameras, while demonstrating robustness to feature sparsity via long temporal sequence analysis.
[CV-32] Review of wavelet-based unsupervised texture segmentation, advantage of adaptive wavelets
链接: https://arxiv.org/abs/2410.19191
作者: Yuan Huang,Valentin De Bortoli,Fugen Zhou,Jerome Gilles
关键词-EN: Wavelet-based segmentation approaches, Wavelet-based segmentation, approaches are widely, ability to characterize, texture segmentation purposes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Wavelet-based segmentation approaches are widely used for texture segmentation purposes because of their ability to characterize different textures. In this paper, we assess the influence of the chosen wavelet and propose to use the recently introduced empirical wavelets. We show that the adaptability of the empirical wavelet makes it possible to reach better results than classic wavelets. In order to focus only on the textural information, we also propose to perform a cartoon + texture decomposition step before applying the segmentation algorithm. The proposed method is tested on six classic benchmarks, based on several popular texture images.
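For context, a classic fixed-wavelet texture segmentation baseline of the kind reviewed here can be sketched with PyWavelets and k-means; the wavelet, smoothing window, and cluster count are arbitrary choices, and neither the empirical wavelets nor the cartoon + texture step is included.

```python
# Sketch of classic wavelet-based texture segmentation: per-pixel subband
# energies (stationary wavelet transform, so subbands keep the image size)
# clustered with k-means into texture regions.
import numpy as np
import pywt
from scipy.ndimage import uniform_filter
from sklearn.cluster import KMeans

img = np.random.rand(128, 128)                 # stand-in texture image
feats = []
for cA, (cH, cV, cD) in pywt.swt2(img, "db2", level=2):
    for band in (cH, cV, cD):
        feats.append(uniform_filter(np.abs(band), size=9))  # local energy
X = np.stack(feats, axis=-1).reshape(-1, len(feats))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X).reshape(img.shape)
```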
[CV-33] Noise Adaption Network for Morse Code Image Classification
链接: https://arxiv.org/abs/2410.19180
作者: Xiaxia Wang,XueSong Leng,Guoping Xu
关键词-EN: safeguarding communication con-tent, Morse code images, Morse code, escalating significance, security has underscored
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures
点击查看摘要
Abstract:The escalating significance of information security has underscored the pervasive role of encryption technology in safeguarding communication content. Morse code, a well-established and effective encryption method, has found widespread application in telegraph communication and various domains. However, the transmission of Morse code images faces challenges due to diverse noises and distortions, thereby hindering comprehensive classification outcomes. Existing methodologies predominantly concentrate on categorizing Morse code images affected by a single type of noise, neglecting the multitude of scenarios that noise pollution can generate. To overcome this limitation, we propose a novel two-stage approach, termed the Noise Adaptation Network (NANet), for Morse code image classification. Our method involves exclusive training on pristine images while adapting to noisy ones through the extraction of critical information unaffected by noise. In the initial stage, we introduce a U-shaped network structure designed to learn representative features and denoise images. Subsequently, the second stage employs a deep convolutional neural network for classification. By leveraging the denoising module from the first stage, our approach achieves enhanced accuracy and robustness in the subsequent classification phase. We conducted an evaluation of our approach on a diverse dataset, encompassing Gaussian, salt-and-pepper, and uniform noise variations. The results convincingly demonstrate the superiority of our methodology over existing approaches. The datasets are available on this https URL
[CV-34] DCT-HistoTransformer: Efficient Lightweight Vision Transformer with DCT Integration for histopathological image analysis
链接: https://arxiv.org/abs/2410.19166
作者: Mahtab Ranjbar(1),Mehdi Mohebbi(1),Mahdi Cherakhloo(2),Bijan Vosoughi Vahdat(2) ((1) Department of Mathematical and Computer Sciences, Kharazmi University, (2) Department of Medical Engineering, Electrical Engineering Department, Sharif University of Technology)
关键词-EN: advanced computer-aided diagnosis, advanced imaging techniques, deep learning methods, significantly advanced computer-aided, advanced imaging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, Accepted for 2024 9th International Iranian Conference on Biomedical Engineering (ICBME)
点击查看摘要
Abstract:In recent years, the integration of advanced imaging techniques and deep learning methods has significantly advanced computer-aided diagnosis (CAD) systems for breast cancer detection and classification. Transformers, which have shown great promise in computer vision, are now being applied to medical image analysis. However, their application to histopathological images presents challenges due to the need for extensive manual annotations of whole-slide images (WSIs), as these models require large amounts of data to work effectively, which is costly and time-consuming. Furthermore, the quadratic computational cost of Vision Transformers (ViTs) is particularly prohibitive for large, high-resolution histopathological images, especially on edge devices with limited computational resources. In this study, we introduce a novel lightweight breast cancer classification approach using transformers that operates effectively without large datasets. By incorporating parallel processing pathways for Discrete Cosine Transform (DCT) Attention and MobileConv, we convert image data from the spatial domain to the frequency domain to utilize benefits such as filtering out high frequencies in the image, which reduces computational cost. Our proposed model achieves an accuracy of 96.00% ± 0.48% for binary classification and 87.85% ± 0.93% for multiclass classification, which is comparable to state-of-the-art models while significantly reducing computational costs. This demonstrates the potential of our approach to improve breast cancer classification in histopathological images, offering a more efficient solution with reduced reliance on extensive annotated datasets.
[CV-35] HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision ECCV
链接: https://arxiv.org/abs/2410.19164
作者: Burak Ercan,Onur Eker,Aykut Erdem,Erkut Erdem
关键词-EN: environments pose significant, pose significant challenges, Low-light environments pose, environments pose, pose significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 4 figures. Has been accepted for publication at the European Conference on Computer Vision Workshops (ECCVW), Milano, 2024. The project page can be found at this https URL
点击查看摘要
Abstract:Low-light environments pose significant challenges for image enhancement methods. To address these challenges, in this work, we introduce the HUE dataset, a comprehensive collection of high-resolution event and frame sequences captured in diverse and challenging low-light conditions. Our dataset includes 106 sequences, encompassing indoor, cityscape, twilight, night, driving, and controlled scenarios, each carefully recorded to address various illumination levels and dynamic ranges. Utilizing a hybrid RGB and event camera setup, we collect a dataset that combines high-resolution event data with complementary frame data. We employ both qualitative and quantitative evaluations using no-reference metrics to assess state-of-the-art low-light enhancement and event-based image reconstruction methods. Additionally, we evaluate these methods on a downstream object detection task. Our findings reveal that while event-based methods perform well in specific metrics, they may produce false positives in practical applications. This dataset and our comprehensive analysis provide valuable insights for future research in low-light vision and hybrid camera systems.
[CV-36] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
链接: https://arxiv.org/abs/2410.19115
作者: Ruicheng Wang,Sicheng Xu,Cassie Dai,Jianfeng Xiang,Yu Deng,Xin Tong,Jiaolong Yang
关键词-EN: monocular open-domain images, present MoGe, open-domain images, geometry, local geometry
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL
点击查看摘要
Abstract:We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view. Code and models will be released on our project page.
[CV-37] BIFRÖST: 3D-Aware Image compositing with Language Instructions NEURIPS2024
链接: https://arxiv.org/abs/2410.19079
作者: Lingxiao Li,Kaixiong Gong,Weihong Li,Xili Dai,Tao Chen,Xiaojun Yuan,Xiangyu Yue
关键词-EN: paper introduces Bifröst, paper introduces, built upon diffusion, introduces Bifröst, perform instruction-based image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024, Code Available: this https URL . arXiv admin note: text overlap with arXiv:2307.09481 by other authors
点击查看摘要
Abstract:This paper introduces Bifröst, a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level and fall short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning an MLLM with a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. Then, the image-compositing model is uniquely designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that consider occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composed images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.
[CV-38] Generative Topology for Shape Synthesis
链接: https://arxiv.org/abs/2410.18987
作者: Ernst Röell,Bastian Rieck
关键词-EN: Euler Characteristic Transform, embedded simplicial complexes, Characteristic Transform, Euler Characteristic, topological characteristics
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The Euler Characteristic Transform (ECT) is a powerful invariant for assessing geometrical and topological characteristics of a large variety of objects, including graphs and embedded simplicial complexes. Although the ECT is invertible in theory, no explicit algorithm for general data sets exists. In this paper, we address this lack and demonstrate that it is possible to learn the inversion, permitting us to develop a novel framework for shape generation tasks on point clouds. Our model exhibits high quality in reconstruction and generation tasks, affords efficient latent-space interpolation, and is orders of magnitude faster than existing methods.
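For readers unfamiliar with the ECT, the toy sketch below computes the Euler characteristic curve of an embedded graph along one direction; stacking such curves over many directions discretizes the transform whose inversion the paper learns. The triangle example is purely illustrative.

```python
import numpy as np

def ect_curve(points, edges, direction, thresholds):
    """Euler characteristic curve of an embedded graph along one direction.

    For each threshold h, chi(h) = #vertices entered - #edges entered, where
    a simplex enters the sublevel set once all its vertices project below h.
    """
    heights = points @ direction                  # vertex filtration values
    edge_heights = heights[edges].max(axis=1)     # an edge enters at its max vertex
    chi = [(heights <= h).sum() - (edge_heights <= h).sum() for h in thresholds]
    return np.array(chi)

# Toy example: a triangle graph (a cycle, so chi = 0 once fully entered).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
edg = np.array([[0, 1], [1, 2], [0, 2]])
print(ect_curve(pts, edg, np.array([0.0, 1.0]), np.linspace(-0.5, 1.5, 5)))
```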
[CV-39] VehicleSDF: A 3D generative model for constrained engineering design via surrogate modeling NEURIPS2024
链接: https://arxiv.org/abs/2410.18986
作者: Hayata Morita,Kohei Shintani,Chenyang Yuan,Frank Permenter
关键词-EN: satisfying engineering constraints, engineering constraints, main challenge, challenge in mechanical, enforcing engineering constraints
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages, 14 figures, NeurIPS 2024 workshop
点击查看摘要
Abstract:A main challenge in mechanical design is to efficiently explore the design space while satisfying engineering constraints. This work explores the use of 3D generative models to explore the design space in the context of vehicle development, while estimating and enforcing engineering constraints. Specifically, we generate diverse 3D models of cars that meet a given set of geometric specifications, while also obtaining quick estimates of performance parameters such as aerodynamic drag. For this, we employ a data-driven approach (using the ShapeNet dataset) to train VehicleSDF, a DeepSDF-based model that represents potential designs in a latent space which can be decoded into a 3D model. We then train surrogate models to estimate engineering parameters from this latent space representation, enabling us to efficiently optimize latent vectors to match specifications. Our experiments show that we can generate diverse 3D models while matching the specified geometric parameters. Finally, we demonstrate that other performance parameters such as aerodynamic drag can be estimated in a differentiable pipeline.
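A hedged sketch of the latent-vector optimization loop this abstract describes: `geometry_head` and `drag_surrogate` below are hypothetical stand-ins (toy linear layers) for the trained decoder head and the aerodynamic surrogate, and the loss weighting is an assumption.

```python
import torch

# Hypothetical stand-ins for the trained networks.
geometry_head = torch.nn.Linear(64, 4)   # latent -> geometric specs (toy)
drag_surrogate = torch.nn.Linear(64, 1)  # latent -> drag estimate (toy)

target_specs = torch.tensor([4.5, 1.8, 1.4, 2.7])  # e.g. length/width/height/wheelbase

z = torch.zeros(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    spec_loss = torch.nn.functional.mse_loss(geometry_head(z), target_specs)
    drag_penalty = drag_surrogate(z).squeeze()  # differentiable drag estimate
    loss = spec_loss + 0.1 * drag_penalty       # weighting is an assumption
    loss.backward()
    opt.step()
print(geometry_head(z).detach())  # latent decoded to specs near the target
```

Because both the decoder and the surrogate are differentiable, the same gradient pass can trade off geometric fit against estimated drag, which is the point of the differentiable pipeline mentioned above.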
[CV-40] Very High-Resolution Bridge Deformation Monitoring Using UAV-based Photogrammetry
链接: https://arxiv.org/abs/2410.18984
作者: Mehdi Maboudi,Jan Backhaus,Yahya Ghassoun,Yogesh Khedar,Dirk Lowke,Inka Mai,Bjoern Riedel,Ulf Bestmann,Markus Gerke
关键词-EN: planned service life, efficient structural health, structural health monitoring, Accurate and efficient, vital task
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:
点击查看摘要
Abstract:Accurate and efficient structural health monitoring (SHM) of infrastructure objects such as bridges is a vital task, as many existing constructions have already reached or are approaching their planned service life. In this contribution, we address the question of the suitability of UAV-based monitoring for SHM, in particular focusing on the geometric deformation under load. Such an advanced technology is becoming increasingly popular due to its ability to decrease the cost and risk of tedious traditional inspection methods. To this end, we performed extensive tests employing a research reinforced concrete bridge that can be exposed to a predefined load via ground anchors. Very high-resolution image blocks have been captured before, during, and after the application of controlled loads. From those images, the motion of distinct points on the bridge has been monitored, and in addition, dense image point clouds were computed to evaluate the performance of surface-based data acquisition. Moreover, a geodetic control network in stable regions is used as control information for bundle adjustment. We applied different sensing technologies in order to be able to judge the image-based deformation results: displacement transducers, tachymetry, and laser profiling. As a platform for the photogrammetric measurements, a multi-rotor UAV DJI Matrice 600 Pro was employed, equipped with two RTK-GNSS receivers. The mounted camera was a PhaseOne iXM-100 (100 MP) with an 80 mm lens. With a flying height of 30 m above the terrain, this resulted in a GSD of 1.3 mm while a forward and sideward overlap of 80% was maintained. The comparison with reference data (displacement transducers) reveals a difference of less than 1 mm. We show that by employing the introduced UAV-based monitoring approach, a full area-wide quantification of deformation is possible, in contrast to classical point or profile measurements.
[CV-41] Toward Generalizable Multiple Sclerosis Lesion Segmentation Models
链接: https://arxiv.org/abs/2410.19623
作者: Liviu Badea,Maria Popa
关键词-EN: Automating Multiple Sclerosis, Automating Multiple, Multiple Sclerosis, monitoring disease progression, lesion segmentation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Automating Multiple Sclerosis (MS) lesion segmentation would be of great benefit in initial diagnosis as well as monitoring disease progression. Deep learning based segmentation models perform well in many domains, but the state-of-the-art in MS lesion segmentation is still suboptimal. Complementary to previous MS lesion segmentation challenges, which focused on optimizing the performance on a single evaluation dataset, this study aims to develop models that generalize across diverse evaluation datasets, mirroring real-world clinical scenarios that involve varied scanners, settings, and patient cohorts. To this end, we used all high-quality publicly-available MS lesion segmentation datasets, on which we systematically trained a state-of-the-art UNet++ architecture. The resulting models demonstrate consistent performance across the remaining test datasets (i.e., they generalize), with larger and more heterogeneous training datasets leading to better models. To the best of our knowledge, this represents the most comprehensive cross-dataset evaluation of MS lesion segmentation models to date using publicly available datasets. Additionally, explicitly enhancing dataset size by merging datasets improved model performance. Specifically, a model trained on the combined MSSEG2016-train, ISBI2015, and 3D-MR-MS datasets surpasses the winner of the MICCAI-2016 competition. Moreover, we demonstrate that the generalizability of our models also relies on our original use of quantile normalization on MRI intensities.
[CV-42] Prediction of microstructural representativity from a single image
链接: https://arxiv.org/abs/2410.19568
作者: Amir Dahari,Ronan Docherty,Steve Kench,Samuel J. Cooper
关键词-EN: phase fraction observed, single image, Integral Range, phase fraction, fraction observed
类目: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:In this study, we present a method for predicting the representativity of the phase fraction observed in a single image (2D or 3D) of a material. Traditional approaches often require large datasets and extensive statistical analysis to estimate the Integral Range, a key factor in determining the variance of microstructural properties. Our method leverages the Two-Point Correlation function to directly estimate the variance from a single image (2D or 3D), thereby enabling phase fraction prediction with associated confidence levels. We validate our approach using open-source datasets, demonstrating its efficacy across diverse microstructures. This technique significantly reduces the data requirements for representativity analysis, providing a practical tool for material scientists and engineers working with limited microstructural data. To make the method easily accessible, we have created a web application, available at this https URL, for quick, simple, and informative use of the method.
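The underlying statistical idea can be sketched as follows under simplifying stationarity and periodicity assumptions: compute the autocovariance of a binary micrograph via FFT and average it over all lags to approximate the variance of the measured phase fraction. The paper's actual estimator works through the Integral Range and provides calibrated confidence levels, which this toy version does not.

```python
import numpy as np

def phase_fraction_variance(img: np.ndarray):
    """Estimate the variance of a binary image's phase fraction from its
    two-point correlation, computed via FFT autocorrelation.

    var(mean) = (1/N^2) * sum_{i,j} C(x_i - x_j) = (1/N) * sum_h C(h)
    for a periodic stationary field with autocovariance C.
    """
    phi = img.mean()
    f = img - phi
    spec = np.abs(np.fft.fftn(f)) ** 2              # Wiener-Khinchin theorem
    autocov = np.fft.ifftn(spec).real / img.size    # biased autocovariance C(h)
    return phi, autocov.sum() / img.size            # (phase fraction, var of mean)

img = (np.random.rand(128, 128) > 0.6).astype(float)
print(phase_fraction_variance(img))
```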
[CV-43] Detection of Emerging Infectious Diseases in Lung CT based on Spatial Anomaly Patterns
链接: https://arxiv.org/abs/2410.19535
作者: Branko Mitic,Philipp Seeböck,Jennifer Straub,Helmut Prosch,Georg Langs
关键词-EN: treating patients effectively, Fast detection, spread and treating, Fast, patients effectively
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Fast detection of emerging diseases is important for containing their spread and treating patients effectively. Local anomalies are relevant, but often novel diseases involve familiar disease patterns in new spatial distributions. Therefore, established local anomaly detection approaches may fail to identify them as new. Here, we present a novel approach to detect the emergence of new disease phenotypes exhibiting distinct patterns of the spatial distribution of lesions. We first identify anomalies in lung CT data, and then compare their distribution in continually acquired new patient cohorts with that of a historic patient population observed over a long prior period. We evaluate how evidence accumulated over the stream of patients is able to detect the onset of an emerging disease. In a gram-matrix based representation derived from the intermediate layers of a three-dimensional convolutional neural network, newly emerging clusters indicate emerging diseases.
[CV-44] Conditional Hallucinations for Image Compression
链接: https://arxiv.org/abs/2410.19493
作者: Till Aczel,Roger Wattenhofer
关键词-EN: details or generating, information bottleneck, face the challenge, lossy image compression, samples due
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In lossy image compression, models face the challenge of either hallucinating details or generating out-of-distribution samples due to the information bottleneck. This implies that at times, introducing hallucinations is necessary to generate in-distribution samples. The optimal level of hallucination varies depending on image content, as humans are sensitive to small changes that alter the semantic meaning. We propose a novel compression method that dynamically balances the degree of hallucination based on content. We collect data and train a model to predict user preferences on hallucinations. By using this prediction to adjust the perceptual weight in the reconstruction loss, we develop a Conditionally Hallucinating compression model (ConHa) that outperforms state-of-the-art image compression methods. Code and images are available at this https URL.
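A sketch of how a content-dependent hallucination weight could enter a rate-distortion objective. Everything here is an assumption for illustration: `preference_net` stands in for the model trained on user preferences, any LPIPS-like metric can be passed as `perceptual_fn`, and the paper's actual loss and architecture are not reproduced.

```python
import torch

# Hypothetical stand-in for the model trained on user hallucination preferences.
preference_net = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 64 * 64, 1),
    torch.nn.Sigmoid(),
)

def conditional_loss(x, x_hat, bits, perceptual_fn, lam=0.01):
    """Rate-distortion loss with a content-dependent hallucination weight.

    w(x) in [0, 1] comes from the preference model: a high w tolerates
    plausible hallucinated detail (the perceptual term dominates), a low w
    pushes the codec toward conservative, pixel-faithful reconstructions.
    """
    w = preference_net(x).mean()               # per-batch hallucination weight
    mse = torch.mean((x - x_hat) ** 2)
    perceptual = perceptual_fn(x, x_hat).mean()
    return lam * bits + (1 - w) * mse + w * perceptual

# Toy usage with a stand-in perceptual metric.
x, x_hat = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
dummy_perceptual = lambda a, b: ((a - b) ** 2).mean(dim=(1, 2, 3))
print(conditional_loss(x, x_hat, torch.tensor(1000.0), dummy_perceptual))
```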
[CV-45] Integration of Communication and Computational Imaging
链接: https://arxiv.org/abs/2410.19415
作者: Zhenming Yu,Liming Cheng,Hongyu Huang,Wei Zhang,Liang Lin,Kun Xu
关键词-EN: computational imaging overcomes, computational imaging, time and distance, depth and breadth, human visual perception
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Communication enables the expansion of human visual perception beyond the limitations of time and distance, while computational imaging overcomes the constraints of depth and breadth. Although impressive achievements have been witnessed with the two types of technologies, the isolated information flow between the two domains is a bottleneck hindering their further progression. Herein, we propose a novel framework that integrates communication and computational imaging (ICCI) to break through the inherent isolation between communication and computational imaging for remote perception. By jointly considering the sensing and transmitting of remote visual information, the ICCI framework performs a full-link information transfer optimization, aiming to minimize information loss from the generation of the information source to the execution of the final vision tasks. We conduct numerical analysis and experiments to demonstrate the ICCI framework by integrating communication systems and snapshot compressive imaging systems. Compared with straightforward combination schemes, which sequentially execute sensing and transmitting, the ICCI scheme shows greater robustness against channel noise and impairments while achieving higher data compression. Moreover, 80 km, 27-band hyperspectral video perception at a rate of 30 fps is experimentally achieved. This new ICCI remote perception paradigm offers a high-efficiency solution for various real-time computer vision tasks.
[CV-46] Beyond Point Annotation: A Weakly Supervised Network Guided by Multi-Level Labels Generated from Four-Point Annotation for Thyroid Nodule Segmentation in Ultrasound Image
链接: https://arxiv.org/abs/2410.19332
作者: Jianning Chi,Zelan Li,Huixuan Wu,Wenjun Zhang,Ying Huang
关键词-EN: represent delicate feature, delicate feature differences, confused incorrect information, methods typically guided, diverse segmentation-related information
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:
点击查看摘要
Abstract:Weakly-supervised methods typically guide pixel-wise training by comparing predictions to single-level labels that carry diverse segmentation-related information at once, but they struggle to represent delicate feature differences between nodule and background regions and can be confused by incorrect information, resulting in underfitting or overfitting in the segmentation predictions. In this work, we propose a weakly-supervised network that generates multi-level labels from four-point annotation to refine diverse constraints for delicate nodule segmentation. The Distance-Similarity Fusion Prior, referring to the point annotations, filters out information irrelevant to nodules. The bounding box and pure foreground/background labels, generated from the point annotation, guarantee the rationality of the prediction in the arrangement of target localization and the spatial distribution of target/background regions, respectively. Our proposed network outperforms existing weakly-supervised methods on two public datasets with respect to accuracy and robustness, improving the applicability of deep-learning based segmentation in the clinical practice of thyroid nodule diagnosis.
[CV-47] A Flow-based Truncated Denoising Diffusion Model for Super-resolution Magnetic Resonance Spectroscopic Imaging
链接: https://arxiv.org/abs/2410.19288
作者: Siyuan Dong,Zhuotong Cai,Gilbert Hangel,Wolfgang Bogner,Georg Widhalm,Yaqing Huang,Qinghao Liang,Chenyu You,Chathura Kumaragamage,Robert K. Fulbright,Amit Mahajan,Amin Karbasi,John A. Onofrey,Robin A. de Graaf,James S. Duncan
关键词-EN: Magnetic Resonance Spectroscopic, Resonance Spectroscopic Imaging, non-invasive imaging technique, Magnetic Resonance, Resonance Spectroscopic
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by Medical Image Analysis (MedIA)
点击查看摘要
Abstract:Magnetic Resonance Spectroscopic Imaging (MRSI) is a non-invasive imaging technique for studying metabolism and has become a crucial tool for understanding neurological diseases, cancers and diabetes. High spatial resolution MRSI is needed to characterize lesions, but in practice MRSI is acquired at low resolution due to time and sensitivity restrictions caused by the low metabolite concentrations. Therefore, there is an imperative need for a post-processing approach to generate high-resolution MRSI from low-resolution data that can be acquired fast and with high sensitivity. Deep learning-based super-resolution methods provided promising results for improving the spatial resolution of MRSI, but they still have limited capability to generate accurate and high-quality images. Recently, diffusion models have demonstrated superior learning capability than other generative models in various tasks, but sampling from diffusion models requires iterating through a large number of diffusion steps, which is time-consuming. This work introduces a Flow-based Truncated Denoising Diffusion Model (FTDDM) for super-resolution MRSI, which shortens the diffusion process by truncating the diffusion chain, and the truncated steps are estimated using a normalizing flow-based network. The network is conditioned on upscaling factors to enable multi-scale super-resolution. To train and evaluate the deep learning models, we developed a 1H-MRSI dataset acquired from 25 high-grade glioma patients. We demonstrate that FTDDM outperforms existing generative models while speeding up the sampling process by over 9-fold compared to the baseline diffusion model. Neuroradiologists’ evaluations confirmed the clinical advantages of our method, which also supports uncertainty estimation and sharpness adjustment, extending its potential clinical applications.
[CV-48] The Empirical Watershed Wavelet
链接: https://arxiv.org/abs/2410.19187
作者: Basile Hurat,Zariluz Alvarado,Jerome Gilles
关键词-EN: multiresolution analysis tool, analysis tool based, adaptive multiresolution analysis, Fourier domain, multiresolution analysis
类目: pectral Theory (math.SP); Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
*备注:
点击查看摘要
Abstract:The empirical wavelet transform is an adaptive multiresolution analysis tool based on the idea of building filters on a data-driven partition of the Fourier domain. However, existing 2D extensions are constrained by the shape of the detected partitioning. In this paper, we provide theoretical results that permit us to build 2D empirical wavelet filters based on an arbitrary partitioning of the frequency domain. We also propose an algorithm to detect such a partitioning from an image spectrum by combining a scale-space representation to estimate the position of dominant harmonic modes and a watershed transform to find the boundaries of the different supports making up the expected partition. This whole process allows us to define the empirical watershed wavelet transform. We illustrate the effectiveness and the advantages of such an adaptive transform, first visually on toy images, and next on both unsupervised texture segmentation and image deconvolution applications.
[CV-49] CapsuleNet: A Deep Learning Model To Classify GI Diseases Using EfficientNet-b7
链接: https://arxiv.org/abs/2410.19151
作者: Aniket Das,Ayushman Singh,Nishant,Sharad Prakash
关键词-EN: global health concern, significant global health, Capsule Endoscopy, health concern, offering a non-invasive
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Capsule Vision 2024 Challenge
点击查看摘要
Abstract:Gastrointestinal (GI) diseases represent a significant global health concern, with Capsule Endoscopy (CE) offering a non-invasive method for diagnosis by capturing a large number of GI tract images. However, the sheer volume of video frames necessitates automated analysis to reduce the workload on doctors and increase the diagnostic accuracy. In this paper, we present CapsuleNet, a deep learning model developed for the Capsule Vision 2024 Challenge, aimed at classifying 10 distinct GI abnormalities. Using a highly imbalanced dataset, we implemented various data augmentation strategies, reducing the data imbalance to a manageable level. Our model leverages a pretrained EfficientNet-b7 backbone, tuned with additional layers for classification and optimized with PReLU activation functions. The model demonstrated superior performance on validation data, achieving a micro accuracy of 84.5% and outperforming the VGG16 baseline across most classes. Despite these advances, challenges remain in classifying certain abnormalities, such as Erythema. Our findings suggest that CNN-based models like CapsuleNet can provide an efficient solution for GI tract disease classification, particularly when inference time is a critical factor.
机器学习
[LG-0] Temporal Convolution-based Hybrid Model Approach with Representation Learning for Real-Time Acoustic Anomaly Detection ICML
链接: https://arxiv.org/abs/2410.19722
作者: Sahan Dissanayaka,Manjusri Wickramasinghe,Pasindu Marasinghe
关键词-EN: Machine Condition Monitoring, preserving Machine Condition, Condition Monitoring, industrial machinery components, Machine Condition
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 10 pages, 10 figures, ICMLC2024
点击查看摘要
Abstract:The early detection of potential failures in industrial machinery components is paramount for ensuring the reliability and safety of operations, thereby preserving Machine Condition Monitoring (MCM). This research addresses this imperative by introducing an innovative approach to Real-Time Acoustic Anomaly Detection. Our method combines semi-supervised temporal convolution with representation learning and a hybrid model strategy with Temporal Convolutional Networks (TCN) to handle various intricate anomaly patterns found in acoustic data effectively. The proposed model demonstrates superior performance compared to established research in the field, underscoring the effectiveness of this approach. Not only do we present quantitative evidence of its superiority, but we also employ visual representations, such as t-SNE plots, to further substantiate the model’s efficacy.
[LG-1] Water and Electricity Consumption Forecasting at an Educational Institution using Machine Learning models with Metaheuristic Optimization
链接: https://arxiv.org/abs/2410.19709
作者: Eduardo Luiz Alba,Matheus Henrique Dal Molin Ribeiro,Gilson Adamczuk,Flavio Trojan,Erick Oliveira Rodrigues
关键词-EN: Educational institutions, social development, institutions are essential, essential for economic, economic and social
类目: Machine Learning (cs.LG)
*备注: Conference: International Joint Conference on Industrial Engineering and Operations Management (IJCIEOM ). At: Salvador-BA, Brazil
点击查看摘要
Abstract:Educational institutions are essential for economic and social development. Budget cuts in Brazil in recent years have made it difficult to carry out their activities and projects. In the case of expenses with water and electricity, unexpected situations can occur, such as leaks and equipment failures, which make their management challenging. This study proposes a comparison between two machine learning models, Random Forest (RF) and Support Vector Regression (SVR), for water and electricity consumption forecasting at the Federal Institute of Paraná-Campus Palmas, with a 12-month forecasting horizon, as well as evaluating the influence of the application of climatic variables as exogenous features. The data were collected over the past five years, combining details pertaining to invoices with exogenous and endogenous variables. The two models had their hyperparameters optimized using the Genetic Algorithm (GA) to select the individuals with the best fitness to perform the forecasting with and without climatic variables. The absolute percentage errors and root mean squared error were used as performance measures to evaluate the forecasting accuracy. The results suggest that in forecasting water and electricity consumption over a 12-step horizon, the Random Forest model exhibited the best performance. The integration of climatic variables often led to diminished forecasting accuracy, resulting in higher errors. Both models still had certain difficulties in predicting water consumption, indicating that new studies with different models or variables are welcome.
[LG-2] Super Gradient Descent: Global Optimization requires Global Gradient
链接: https://arxiv.org/abs/2410.19706
作者: Seifeddine Achour
关键词-EN: directly impacts model, impacts model performance, function directly impacts, Super Gradient Descent, fundamental challenge
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Global minimization is a fundamental challenge in optimization, especially in machine learning, where finding the global minimum of a function directly impacts model performance and convergence. This report introduces a novel optimization method that we call Super Gradient Descent, designed specifically for one-dimensional functions and guaranteed to converge to the global minimum of any k-Lipschitz function defined on a closed interval [a, b]. Our approach addresses the limitations of traditional optimization algorithms, which often get trapped in local minima. In particular, we introduce the concept of a global gradient, which offers a robust solution for precise and well-guided global optimization. By focusing on the global minimization problem, this work bridges a critical gap in optimization theory, offering new insights and practical advances for various optimization problems, in particular machine learning problems such as line search.
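The abstract does not spell out the algorithm, so the sketch below shows the classical Piyavskii-Shubert scheme instead, which delivers the same class of guarantee (global convergence for k-Lipschitz functions on a closed interval) and conveys the intuition of exploiting a global, interval-wide bound rather than a purely local gradient.

```python
import math

def lipschitz_global_min(f, a, b, k, tol=1e-6, max_iter=10_000):
    """Global minimization of a k-Lipschitz f on [a, b].

    NOTE: this is the classical Piyavskii-Shubert scheme, shown only for
    intuition; the paper's 'global gradient' construction is not reproduced.
    """
    pts = [(a, f(a)), (b, f(b))]
    best_x, best_f = min(pts, key=lambda p: p[1])
    for _ in range(max_iter):
        pts.sort()
        # Lower bound on each subinterval from the Lipschitz cone intersection.
        cand = []
        for (x1, f1), (x2, f2) in zip(pts, pts[1:]):
            xm = 0.5 * (x1 + x2) + (f1 - f2) / (2 * k)
            lb = 0.5 * (f1 + f2) - 0.5 * k * (x2 - x1)
            cand.append((lb, xm))
        lb, xm = min(cand)
        if best_f - lb < tol:       # certified: no point can be much lower
            return best_x, best_f
        fm = f(xm)
        if fm < best_f:
            best_x, best_f = xm, fm
        pts.append((xm, fm))
    return best_x, best_f

# A wiggly function with many local minima; its derivative is bounded by ~4.
print(lipschitz_global_min(lambda x: math.sin(3 * x) + 0.1 * (x - 5) ** 2,
                           0.0, 10.0, k=5.0))
```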
[LG-3] Robust Thompson Sampling Algorithms Against Reward Poisoning Attacks
链接: https://arxiv.org/abs/2410.19705
作者: Yinglun Xu,Zhiwei Wang,Gagandeep Singh
关键词-EN: online sequential decision-making, Thompson sampling, rich real-world applications, sequential decision-making problems, current Thompson sampling
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Thompson sampling is one of the most popular learning algorithms for online sequential decision-making problems and has rich real-world applications. However, current Thompson sampling algorithms are limited by the assumption that the rewards received are uncorrupted, which may not be true in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, we want to make it robust against adversarial reward poisoning. The main challenge is that one can no longer compute the actual posteriors for the true reward, as the agent can only observe the rewards after corruption. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. We propose robust algorithms based on Thompson sampling for the popular stochastic and contextual linear bandit settings in both cases where the agent is aware or unaware of the budget of the attacker. We theoretically show that our algorithms guarantee near-optimal regret under any attack strategy.
[LG-4] Learning the Regularization Strength for Deep Fine-Tuning via a Data-Emphasized Variational Objective
链接: https://arxiv.org/abs/2410.19675
作者: Ethan Harvey,Mikhail Petrov,Michael C. Hughes
关键词-EN: popular transfer learning, select regularization hyperparameters, grid search, control over-fitting, transfer learning methods
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:A number of popular transfer learning methods rely on grid search to select regularization hyperparameters that control over-fitting. This grid search requirement has several key disadvantages: the search is computationally expensive, requires carving out a validation set that reduces the size of available data for model training, and requires practitioners to specify candidate values. In this paper, we propose an alternative to grid search: directly learning regularization hyperparameters on the full training set via model selection techniques based on the evidence lower bound (“ELBo”) objective from variational methods. For deep neural networks with millions of parameters, we specifically recommend a modified ELBo that upweights the influence of the data likelihood relative to the prior while remaining a valid bound on the evidence for Bayesian model selection. Our proposed technique overcomes all three disadvantages of grid search. We demonstrate effectiveness on image classification tasks on several datasets, yielding heldout accuracy comparable to existing approaches with far less compute time.
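A minimal sketch of an ELBo with the data likelihood upweighted relative to the prior, assuming a mean-field Gaussian posterior and an isotropic Gaussian prior; the value of the upweighting factor `kappa` and the exact parameterization are illustrative assumptions, not the paper's recipe.

```python
import torch

def data_emphasized_elbo(log_lik, mu, log_sigma, prior_sigma, kappa=10.0):
    """Modified ELBo that upweights the data likelihood relative to the prior.

    q(theta) = N(mu, diag(sigma^2)), p(theta) = N(0, prior_sigma^2 I).
    kappa = 1 recovers the standard ELBo; kappa > 1 emphasizes the data term
    (kappa here is an illustrative value, not the paper's setting).
    """
    sigma2 = torch.exp(2 * log_sigma)
    kl = 0.5 * torch.sum(
        sigma2 / prior_sigma**2
        + mu**2 / prior_sigma**2
        - 1.0
        - 2 * log_sigma
        + 2 * torch.log(torch.tensor(prior_sigma))
    )
    return kappa * log_lik - kl  # maximize over hyperparameters such as prior_sigma

# Toy usage: compare candidate prior scales on the full training set.
mu, log_sigma = torch.zeros(10), torch.zeros(10)
for s in (0.1, 1.0, 10.0):
    print(s, data_emphasized_elbo(torch.tensor(-120.0), mu, log_sigma, s).item())
```

Because this objective is evaluated on the full training set, it sidesteps both the validation split and the candidate grid that grid search requires.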
[LG-5] Spatial Shortcuts in Graph Neural Controlled Differential Equations NEURIPS2024
链接: https://arxiv.org/abs/2410.19673
作者: Michael Detzel,Gabriel Nobis,Jackie Ma,Wojciech Samek
关键词-EN: Controlled Differential Equation, Neural Controlled Differential, Differential Equation, Neural Controlled, Controlled Differential
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted as a workshop paper at the NeurIPS 2024 workshop on Data-driven and Differentiable Simulations, Surrogates, and Solvers (D3S3)
点击查看摘要
Abstract:We incorporate prior graph topology information into a Neural Controlled Differential Equation (NCDE) to predict the future states of a dynamical system defined on a graph. The informed NCDE infers the future dynamics at the vertices of simulated advection data on graph edges with a known causal graph, observed only at vertices during training. We investigate different positions in the model architecture to inform the NCDE with graph information and identify an outer position between hidden state and control as theoretically and empirically favorable. The resulting informed NCDE requires fewer parameters to reach a lower Mean Absolute Error (MAE) compared to previous methods that do not incorporate additional graph topology information.
[LG-6] MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services
链接: https://arxiv.org/abs/2410.19665
作者: Hongjia Wu,Hui Zeng,Zehui Xiong,Jiawen Kang,Zhiping Cai,Tse-Tin Chan,Dusit Niyato,Zhu Han
关键词-EN: Internet of Things, Updates of extensive, extensive Internet, vehicular metaverse, Updates
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:Updates of extensive Internet of Things (IoT) data are critical to the immersion of vehicular metaverse services. However, providing high-quality and sustainable data in unstable and resource-constrained vehicular networks remains a significant challenge. To address this problem, we put forth a novel immersion-aware model trading framework that incentivizes metaverse users (MUs) to contribute learning models trained by their latest local data for augmented reality (AR) services in the vehicular metaverse, while preserving their privacy through federated learning. To comprehensively evaluate the contribution of locally trained learning models provided by MUs to AR services, we design a new immersion metric that captures service immersion by considering the freshness and accuracy of learning models, as well as the amount and potential value of raw data used for training. We model the trading interactions between metaverse service providers (MSPs) and MUs as an equilibrium problem with equilibrium constraints (EPEC) to analyze and balance their costs and gains. Moreover, considering dynamic network conditions and privacy concerns, we formulate the reward decisions of MSPs as a multi-agent Markov decision process. Then, a fully distributed dynamic reward method based on deep reinforcement learning is presented, which operates without any private information about MUs and other MSPs. Experimental results demonstrate that the proposed framework can effectively provide higher-value models for object detection and classification in AR services on real AR-related vehicle datasets compared to benchmark schemes.
[LG-7] Conformal Prediction for Multimodal Regression
链接: https://arxiv.org/abs/2410.19653
作者: Alexis Bose,Jonathan Ethier,Paul Guinand
关键词-EN: paper introduces multimodal, multimodal conformal regression, paper introduces, introduces multimodal conformal, conformal regression
类目: Machine Learning (cs.LG)
*备注: 20 pages, 34 figures
点击查看摘要
Abstract:This paper introduces multimodal conformal regression. Traditionally confined to scenarios with solely numerical input features, conformal prediction is now extended to multimodal contexts through our methodology, which harnesses internal features from complex neural network architectures processing images and unstructured text. Our findings highlight the potential for internal neural network features, extracted from convergence points where multimodal information is combined, to be used by conformal prediction to construct prediction intervals (PIs). This capability paves new paths for deploying conformal prediction in domains abundant with multimodal data, enabling a broader range of problems to benefit from guaranteed distribution-free uncertainty quantification.
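For context, a standard split conformal regression interval looks like the sketch below; the paper's contribution lies in where the residuals come from (internal features of a multimodal network), while the conformal wrapper itself is model-agnostic.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction intervals with 1 - alpha marginal coverage.

    `residuals_cal` are |y - y_hat| on a held-out calibration set; here they
    would come from the multimodal model's predictions, but the wrapper does
    not care how image and text features were fused.
    """
    n = len(residuals_cal)
    # Finite-sample corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(residuals_cal, min(q_level, 1.0), method="higher")
    return y_pred_test - q, y_pred_test + q

rng = np.random.default_rng(0)
res_cal = np.abs(rng.normal(size=500))            # toy calibration residuals
lo, hi = split_conformal_interval(res_cal, y_pred_test=np.array([2.0, 3.5]))
print(lo, hi)
```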
[LG-8] Efficient Biological Data Acquisition through Inference Set Design
链接: https://arxiv.org/abs/2410.19631
作者: Ihor Neporozhnii,Julien Roy,Emmanuel Bengio,Jason Hartford
关键词-EN: highly automated high-throughput, automated high-throughput laboratories, drug discovery, effective drugs, highly automated
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In drug discovery, highly automated high-throughput laboratories are used to screen a large number of compounds in search of effective drugs. These experiments are expensive, so we might hope to reduce their cost by experimenting on a subset of the compounds, and predicting the outcomes of the remaining experiments. In this work, we model this scenario as a sequential subset selection problem: we aim to select the smallest set of candidates in order to achieve some desired level of accuracy for the system as a whole. Our key observation is that, if there is heterogeneity in the difficulty of the prediction problem across the input space, selectively obtaining the labels for the hardest examples in the acquisition pool will leave only the relatively easy examples to remain in the inference set, leading to better overall system performance. We call this mechanism inference set design, and propose the use of an uncertainty-based active learning solution to prune out these challenging examples. Our algorithm includes an explicit stopping criterion that stops running the experiments when it is sufficiently confident that the system has reached the target performance. Our empirical studies on image and molecular datasets, as well as a real-world large-scale biological assay, show that deploying active learning for inference set design leads to significant reduction in experimental cost while obtaining high system performance.
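A toy sketch of the inference set design loop: repeatedly send the most uncertain pool items to the (expensive) experiment, keep the easy ones in the inference set, and stop once estimated system accuracy reaches the target. The confidence and uncertainty inputs below are synthetic stand-ins, and the stopping rule is a simplification of the paper's criterion.

```python
import numpy as np

def inference_set_design(uncertainty, pred_correct_prob, target_acc=0.95,
                         batch=100):
    """Uncertainty-based selection with an explicit stopping criterion.

    Labeled items count as correct (their outcome is measured); the accuracy
    of the remaining inference set is estimated from model confidence.
    """
    n = len(uncertainty)
    labeled = np.zeros(n, dtype=bool)
    while True:
        est_acc = (labeled.sum() + pred_correct_prob[~labeled].sum()) / n
        if est_acc >= target_acc or labeled.all():
            return labeled, est_acc
        pool = np.flatnonzero(~labeled)
        worst = pool[np.argsort(-uncertainty[pool])[:batch]]  # hardest items
        labeled[worst] = True                                 # run experiments

rng = np.random.default_rng(1)
conf = rng.uniform(0.5, 1.0, size=10_000)   # model confidence per candidate
labeled, acc = inference_set_design(1 - conf, conf)
print(labeled.sum(), "experiments run; estimated accuracy", round(acc, 3))
```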
[LG-9] Analyzing Neural Network Robustness Using Graph Curvature
链接: https://arxiv.org/abs/2410.19607
作者: Shuhang Tan,Jayson Sia,Paul Bogdan,Radoslav Ivanov
关键词-EN: specifically graph curvature, graph theory analysis, robustness problem, neural Ricci curvature, paper presents
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper presents a new look at the neural network (NN) robustness problem, from the point of view of graph theory analysis, specifically graph curvature. Graph curvature (e.g., Ricci curvature) has been used to analyze system dynamics and identify bottlenecks in many domains, including road traffic analysis and internet routing. We define the notion of neural Ricci curvature and use it to identify bottleneck NN edges that are heavily used to "transport data" to the NN outputs. We provide an evaluation on MNIST that illustrates that such edges indeed occur more frequently for inputs where NNs are less robust. These results will serve as the basis for an alternative method of robust training, by minimizing the number of bottleneck edges.
[LG-10] Neuromorphic IoT Architecture for Efficient Water Management: A Smart Village Case Study
链接: https://arxiv.org/abs/2410.19562
作者: Mugdim Bublin,Heimo Hirner,Antoine-Martin Lanners,Radu Grosu
关键词-EN: offer high flexibility, IoT networks necessitates, minimal communication overhead, exponential growth, networks necessitates
类目: Machine Learning (cs.LG)
*备注: Submitted to Edge AI conference, Cagliari, Italy
点击查看摘要
Abstract:The exponential growth of IoT networks necessitates a paradigm shift towards architectures that offer high flexibility and learning capabilities while maintaining low energy consumption, minimal communication overhead, and low latency. Traditional IoT systems, particularly when integrated with machine learning approaches, often suffer from high communication overhead and significant energy consumption. This work addresses these challenges by proposing a neuromorphic architecture inspired by biological systems. To illustrate the practical application of our proposed architecture, we present a case study focusing on water management in the Carinthian community of Neuhaus. Preliminary results regarding water consumption prediction and anomaly detection in this community are presented. We also introduce a novel neuromorphic IoT architecture that integrates biological principles into the design of IoT systems. This architecture is specifically tailored for edge computing scenarios, where low power and high efficiency are crucial. Our approach leverages the inherent advantages of neuromorphic computing, such as asynchronous processing and event-driven communication, to create an IoT framework that is both energy-efficient and responsive. This case study demonstrates how the neuromorphic IoT architecture can be deployed in a real-world scenario, highlighting its benefits in terms of energy savings, reduced communication overhead, and improved system responsiveness.
[LG-11] FLiP: Privacy-Preserving Federated Learning based on the Principle of Least Privilege
链接: https://arxiv.org/abs/2410.19548
作者: ShiMao Xu,Xiaopeng Ke,Xing Su,Shucheng Li,Hao wu,Fengyuan Xu,Sheng Zhong
关键词-EN: Federated Learning, high accuracy, Federated, Learning, raw data
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated Learning (FL) allows users to share knowledge instead of raw data to train a model with high accuracy. Unfortunately, during the training, users lose control over the knowledge shared, which causes serious data privacy issues. We hold that users are only willing, and only need, to share the knowledge essential to the training task in order to obtain an FL model with high accuracy. However, existing efforts cannot help users minimize the shared knowledge according to the user intention in the FL training procedure. This work proposes FLiP, which aims to bring the principle of least privilege (PoLP) to FL training. The key design of FLiP is applying elaborate information reduction on the training data through a local-global dataset distillation design. We measure the privacy performance through attribute inference and membership inference attacks. Extensive experiments show that FLiP strikes a good balance between model accuracy and privacy protection.
[LG-12] AgentForge: A Flexible Low-Code Platform for Reinforcement Learning Agent Design
链接: https://arxiv.org/abs/2410.19528
作者: Francisco Erivaldo Fernandes Junior,Antti Oulasvirta
关键词-EN: memory modules work, involves identifying effective, Developing a reinforcement, agent internal architecture, covering the policy
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: This preprint was submitted to the 17th International Conference on Agents and Artificial Intelligence (ICAART’2025) and is currently under review
点击查看摘要
Abstract:Developing a reinforcement learning (RL) agent often involves identifying effective values for a large number of parameters, covering the policy, reward function, environment, and the agent’s internal architecture, such as parameters controlling how the peripheral vision and memory modules work. Critically, since these parameters are interrelated in complex ways, optimizing them can be viewed as a black box optimization problem, which is especially challenging for non-experts. Although existing optimization-as-a-service platforms (e.g., Vizier, Optuna) can handle such problems, they are impractical for RL systems, as users must manually map each parameter to different components, making the process cumbersome and error-prone. They also require deep understanding of the optimization process, limiting their application outside ML experts and restricting access for fields like cognitive science, which models human decision-making. To tackle these challenges, we present AgentForge, a flexible low-code framework to optimize any parameter set across an RL system. AgentForge allows the user to perform individual or joint optimization of parameter sets. An optimization problem can be defined in a few lines of code and handed to any of the interfaced optimizers. We evaluated its performance in a challenging vision-based RL problem. AgentForge enables practitioners to develop RL agents without requiring extensive coding or deep expertise in optimization.
[LG-13] Parametric Nonlinear Volterra Series via Machine Learning: Transonic Aerodynamics
链接: https://arxiv.org/abs/2410.19514
作者: Gabriele Immordino,Andrea Da Ronch,Marcello Righi
关键词-EN: capture aerodynamic responses, modeling unsteady transonic, unsteady transonic aerodynamics, second-order Volterra kernels, Volterra series
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This study introduces an approach for modeling unsteady transonic aerodynamics within a parametric space, using Volterra series to capture aerodynamic responses and machine learning to enable interpolation. The first- and second-order Volterra kernels are derived from indicial aerodynamic responses obtained through computational fluid dynamics, with the second-order kernel calculated as a correction to the dominant linear response. Machine learning algorithms, specifically artificial neural networks and Gaussian process regression, are used to interpolate kernel coefficients within a parameter space defined by Mach number and angle of attack. The methodology is applied to two- and three-dimensional test cases in the transonic regime. Results underscore the benefit of including the second-order kernel to address strong nonlinearity and demonstrate the effectiveness of neural networks. The approach achieves a level of accuracy that appears sufficient for use in conceptual design.
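For concreteness, the discrete second-order Volterra series the abstract builds on can be evaluated as below; the kernels here are arbitrary toy arrays, not the aerodynamically identified ones, and the interpolation over Mach number and angle of attack is omitted.

```python
import numpy as np

def volterra_response(u, h1, h2):
    """Second-order discrete Volterra series response.

    y[n] = sum_k h1[k] u[n-k] + sum_{k1,k2} h2[k1,k2] u[n-k1] u[n-k2],
    with the second-order kernel acting as a correction on top of the
    dominant linear (indicial) response.
    """
    n, m = len(u), len(h1)
    y = np.zeros(n)
    for t in range(n):
        for k1 in range(min(m, t + 1)):
            y[t] += h1[k1] * u[t - k1]
            for k2 in range(min(m, t + 1)):
                y[t] += h2[k1, k2] * u[t - k1] * u[t - k2]
    return y

u = np.sin(np.linspace(0, 4 * np.pi, 200))   # e.g. a pitching-motion input
h1 = np.exp(-np.arange(10) / 3.0)            # decaying linear kernel (toy)
h2 = 0.05 * np.outer(h1, h1)                 # weak quadratic correction (toy)
print(volterra_response(u, h1, h2)[:5])
```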
[LG-14] Marked Temporal Bayesian Flow Point Processes
链接: https://arxiv.org/abs/2410.19512
作者: Hui Chen,Xuhui Fan,Hengyu Liu,Longbing Cao
关键词-EN: Temporal Point Process, MTPP models, generative MTPP, generative MTPP models, continuous-valued occurrence timestamps
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Marked event data captures events by recording their continuous-valued occurrence timestamps along with their corresponding discrete-valued types. They have appeared in various real-world scenarios such as social media, financial transactions, and healthcare records, and have been effectively modeled through Marked Temporal Point Process (MTPP) models. Recently, generative MTPP models have seen rapid development due to their powerful generative capability and less restrictive functional forms. However, existing generative MTPP models are usually challenged in jointly modeling events' timestamps and types since: (1) mainstream methods design the generative mechanisms for timestamps only and do not include event types; (2) the complex interdependence between the timestamps and event types is overlooked. In this paper, we propose a novel generative MTPP model called BMTPP. Unlike existing generative MTPP models, BMTPP flexibly models marked temporal joint distributions using a parameter-based approach. Additionally, by adding joint noise to the marked temporal data space, BMTPP effectively captures and explicitly reveals the interdependence between timestamps and event types. Extensive experiments validate the superiority of our approach over other state-of-the-art models and its ability to effectively capture marked-temporal interdependence.
[LG-15] A neural network approach for solving the Monge-Ampère equation with transport boundary condition
链接: https://arxiv.org/abs/2410.19496
作者: Roel Hacking,Lisa Kusch,Koondanibha Mitra,Martijn Anthonissen,Wilbert IJzerman
关键词-EN: optical design applications, transport boundary condition, specifically targeted, design applications, paper introduces
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces a novel neural network-based approach to solving the Monge-Ampère equation with the transport boundary condition, specifically targeted towards optical design applications. We leverage multilayer perceptron networks to learn approximate solutions by minimizing a loss function that encompasses the equation’s residual, boundary conditions, and convexity constraints. Our main results demonstrate the efficacy of this method, optimized using L-BFGS, through a series of test cases encompassing symmetric and asymmetric circle-to-circle, square-to-circle, and circle-to-flower reflector mapping problems. Comparative analysis with a conventional least-squares finite-difference solver reveals the competitive, and often superior, performance of our neural network approach on the test cases examined here. A comprehensive hyperparameter study further illuminates the impact of factors such as sampling density, network architecture, and optimization algorithm. While promising, further investigation is needed to verify the method’s robustness for more complicated problems and to ensure consistent convergence. Nonetheless, the simplicity and adaptability of this neural network-based approach position it as a compelling alternative to specialized partial differential equation solvers.
[LG-16] TRADE: Transfer of Distributions between External Conditions with Normalizing Flows
链接: https://arxiv.org/abs/2410.19492
作者: Stefan Wahl,Armand Rousselot,Felix Draxler,Ullrich Köthe
关键词-EN: affect molecular configurations, temperature affect molecular, Modeling distributions, common scenario, scenario in diverse
类目: Machine Learning (cs.LG)
*备注: Preprint, under review
点击查看摘要
Abstract:Modeling distributions that depend on external control parameters is a common scenario in diverse applications like molecular simulations, where system properties like temperature affect molecular configurations. Despite the relevance of these applications, existing solutions are unsatisfactory as they require severely restricted model architectures or rely on backward training, which is prone to unstable training. We introduce TRADE, which overcomes these limitations by formulating the learning process as a boundary value problem. By initially training the model for a specific condition using either i.i.d. samples or backward KL training, we establish a boundary distribution. We then propagate this information across other conditions using the gradient of the unnormalized density with respect to the external parameter. This formulation, akin to the principles of physics-informed neural networks, allows us to efficiently learn parameter-dependent distributions without restrictive assumptions. Experimentally, we demonstrate that TRADE achieves excellent results in a wide range of applications, ranging from Bayesian inference and molecular simulations to physical lattice models.
[LG-17] Measuring memorization through probabilistic discoverable extraction
链接: https://arxiv.org/abs/2410.19482
作者: Jamie Hayes,Marika Swanberg,Harsh Chaudhari,Itay Yona,Ilia Shumailov
关键词-EN: Large language models, raising concerns due, Large language, discoverable extraction, raising concerns
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) are susceptible to memorizing training data, raising concerns due to the potential extraction of sensitive information. Current methods to measure memorization rates of LLMs, primarily discoverable extraction (Carlini et al., 2022), rely on single-sequence greedy sampling, potentially underestimating the true extent of memorization. This paper introduces a probabilistic relaxation of discoverable extraction that quantifies the probability of extracting a target sequence within a set of generated samples, considering various sampling schemes and multiple attempts. This approach addresses the limitations of reporting memorization rates through discoverable extraction by accounting for the probabilistic nature of LLMs and user interaction patterns. Our experiments demonstrate that this probabilistic measure can reveal cases of higher memorization rates compared to rates found through discoverable extraction. We further investigate the impact of different sampling schemes on extractability, providing a more comprehensive and realistic assessment of LLM memorization and its associated risks. Our contributions include a new probabilistic memorization definition, empirical evidence of its effectiveness, and a thorough evaluation across different models, sizes, sampling schemes, and training data repetitions.
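The core probabilistic relaxation can be illustrated as follows: estimate a per-sample match rate from independent generations and convert it into the probability of extraction within a budget of attempts, assuming independence between samples. The paper's exact estimator and sampling schemes may differ.

```python
import numpy as np

def extraction_probability(matches: np.ndarray, n_attempts: int) -> float:
    """Probability of extracting a target sequence within n_attempts samples.

    `matches` is a boolean array over independent generations (True if a
    generation reproduced the target continuation). Under independence, the
    chance that at least one of n_attempts samples succeeds is 1 - (1 - p)^n.
    """
    p_hat = matches.mean()
    return 1.0 - (1.0 - p_hat) ** n_attempts

# Greedy (single-sequence) decoding would flag this example as memorized only
# if the match rate were effectively 1; the probabilistic view instead reports
# a substantial extraction risk from a modest per-sample rate.
samples = np.random.rand(1000) < 0.02   # ~2% of sampled generations match
print(extraction_probability(samples, n_attempts=100))  # roughly 0.87 in expectation
```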
[LG-18] Computational Bottlenecks of Training Small-scale Large Language Models
链接: https://arxiv.org/abs/2410.19456
作者: Saleh Ashkboos,Iman Mirzadeh,Keivan Alizadeh,Mohammad Hossein Sekhavat,Moin Nabi,Mehrdad Farajtabar,Fartash Faghri
关键词-EN: Small-scale large Language, large language models, Small-scale large, gaining attention due, large language
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures
点击查看摘要
Abstract:While large language models (LLMs) dominate the AI landscape, Small-scale large Language Models (SLMs) are gaining attention due to cost and efficiency demands from consumers. However, there is limited research on the training behavior and computational requirements of SLMs. In this study, we explore the computational bottlenecks of training SLMs (up to 2B parameters) by examining the effects of various hyperparameters and configurations, including GPU type, batch size, model size, communication protocol, attention type, and the number of GPUs. We assess these factors on popular cloud services using metrics such as loss per dollar and tokens per second. Our findings aim to support the broader adoption and optimization of language model training for low-resource AI research institutes.
[LG-19] Generative Diffusion Models for Sequential Recommendations RECSYS RECSYS’24
链接: https://arxiv.org/abs/2410.19429
作者: Sharare Zolghadr,Ole Winther,Paul Jeha
关键词-EN: Generative Adversarial Networks, Generative Adversarial, Variational Autoencoders, Adversarial Networks, Generative models
类目: Machine Learning (cs.LG)
*备注: ROEGEN@RecSys’24: The 1st Workshop on Risks, Opportunities, and Evaluation of Generative Models in Recommender Systems, Co-located with ACM RecSys in Bari, Italy, October 2024
点击查看摘要
Abstract:Generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have shown promise in sequential recommendation tasks. However, they face challenges, including posterior collapse and limited representation capacity. The work by Li et al. (2023) introduces a novel approach that leverages diffusion models to address these challenges by representing item embeddings as distributions rather than fixed vectors. This approach allows for a more adaptive reflection of users’ diverse interests and various item aspects. During the diffusion phase, the model converts the target item embedding into a Gaussian distribution by adding noise, facilitating the representation of sequential item distributions and the injection of uncertainty. An Approximator then processes this noisy item representation to reconstruct the target item. In the reverse phase, the model utilizes users’ past interactions to reverse the noise and finalize the item prediction through a rounding operation. This research introduces enhancements to the DiffuRec architecture, particularly by adding offset noise in the diffusion process to improve robustness and incorporating a cross-attention mechanism in the Approximator to better capture relevant user-item interactions. These contributions led to the development of a new model, DiffuRecSys, which improves performance. Extensive experiments conducted on three public benchmark datasets demonstrate that these modifications enhance item representation, effectively capture diverse user preferences, and outperform existing baselines in sequential recommendation research.
[LG-20] Analyzing Generative Models by Manifold Entropic Metrics
链接: https://arxiv.org/abs/2410.19426
作者: Daniel Galperin,Ullrich Köthe
关键词-EN: high quality data, synthesize high quality, aid human understanding, Good generative models, utilize interpretable representations
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Good generative models should not only synthesize high quality data, but also utilize interpretable representations that aid human understanding of their behavior. However, it is difficult to measure objectively if and to what degree desirable properties of disentangled representations have been achieved. Inspired by the principle of independent mechanisms, we address this difficulty by introducing a novel set of tractable information-theoretic evaluation metrics. We demonstrate the usefulness of our metrics on illustrative toy examples and conduct an in-depth comparison of various normalizing flow architectures and β-VAEs on the EMNIST dataset. Our method allows us to sort latent features by importance and assess the amount of residual correlation among the resulting concepts. The most interesting finding of our experiments is a ranking of model architectures and training procedures in terms of their inductive bias to converge to aligned and disentangled representations during training.
[LG-21] An Auditing Test To Detect Behavioral Shift in Language Models
链接: https://arxiv.org/abs/2410.19406
作者: Leo Richter,Xuanli He,Pasquale Minervini,Matt J. Kusner
关键词-EN: comprehensive understanding, approach human-level performance, human-level performance, performance, model
类目: Machine Learning (cs.LG)
*备注: 25 pages, 12 figures
点击查看摘要
Abstract:As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model’s behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
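To illustrate the flavor of such an audit, the sketch below runs a batch two-sample permutation test on per-generation behavior scores (e.g., toxicity probabilities) with a tolerance parameter. The paper's actual test is sequential with anytime false-positive control, which this simplification does not provide.

```python
import numpy as np

def behavior_shift_test(scores_base, scores_new, eps=0.05, n_perm=10_000,
                        alpha=0.05, seed=0):
    """Permutation test for a behavioral shift beyond a tolerance eps.

    Compares the mean behavior score of a baseline model's generations with
    that of the model under audit; shifts smaller than eps are tolerated.
    """
    rng = np.random.default_rng(seed)
    obs = abs(scores_new.mean() - scores_base.mean())
    if obs <= eps:                       # within tolerance: no flag
        return False, obs
    pooled = np.concatenate([scores_base, scores_new])
    n = len(scores_base)
    null = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)
        null[i] = abs(pooled[:n].mean() - pooled[n:].mean())
    p_value = (null >= obs).mean()
    return p_value < alpha, obs

base = np.random.beta(2, 50, size=400)   # toy toxicity scores, baseline model
new = np.random.beta(2, 30, size=400)    # slightly more toxic model
print(behavior_shift_test(base, new))
```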
[LG-22] Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression NEURIPS2024
Link: https://arxiv.org/abs/2410.19400
Authors: Yixiu Mao, Cheems Wang, Chen Chen, Yun Qu, Xiangyang Ji
Keywords-EN: OOD state correction, offline reinforcement learning, OOD state, OOD, OOD state issue
Subjects: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2024
Abstract:In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a focus, but we argue that there exists an OOD state issue that also impairs performance yet has been underexplored. Such an issue describes the scenario when the agent encounters states out of the offline dataset during the test phase, leading to uncontrolled behavior and performance degradation. To this end, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, capable of correcting the agent from OOD states to high-value in-distribution states. Theoretical and empirical results show that SCAS also exhibits the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, benefiting from its OOD state correction feature, SCAS demonstrates enhanced robustness against environmental perturbations.
[LG-23] COMSPLIT: A Communication-Aware Split Learning Design for Heterogeneous IoT Platforms
Link: https://arxiv.org/abs/2410.19375
Authors: Vukan Ninkovic, Dejan Vukobratovic, Dragisa Miskovic, Marco Zennaro
Keywords-EN: Internet of Things, flexibly distribute computation, distribute computation load, enhance data privacy, algorithms in Internet
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted for publication in IEEE Internet of Things Journal
Abstract:The significance of distributed learning and inference algorithms in Internet of Things (IoT) networks is growing since they flexibly distribute computation load between IoT devices and the infrastructure, enhance data privacy, and minimize latency. However, a notable challenge stems from the influence of communication channel conditions on their performance. In this work, we introduce COMSPLIT: a novel communication-aware design for the split learning (SL) and inference paradigm tailored to processing time series data in IoT networks. COMSPLIT provides a versatile framework for deploying adaptable SL in IoT networks affected by diverse channel conditions. In conjunction with an early-exit strategy and support for IoT scenarios containing devices with heterogeneous computational capabilities, COMSPLIT represents a comprehensive design solution for communication-aware SL in IoT networks. Numerical results show superior performance of COMSPLIT compared to vanilla SL approaches (that assume an ideal communication channel), demonstrating its ability to offer both design simplicity and adaptability to different channel conditions.
[LG-24] Toward Finding Strong Pareto Optimal Policies in Multi-Agent Reinforcement Learning ACML2024
Link: https://arxiv.org/abs/2410.19372
Authors: Bang Giang Le, Viet Cuong Ta
Keywords-EN: multi-agent reinforcement learning, finding Pareto optimal, cooperative reward structures, multi-agent reinforcement, MGDA
Subjects: Machine Learning (cs.LG)
Comments: Submitted to ACML 2024 Special Issue Journal track
Abstract:In this work, we study the problem of finding Pareto optimal policies in multi-agent reinforcement learning problems with cooperative reward structures. We show that any algorithm where each agent only optimizes their reward is subject to suboptimal convergence. Therefore, to achieve Pareto optimality, agents have to act altruistically by considering the rewards of others. This observation bridges the multi-objective optimization framework and multi-agent reinforcement learning together. We first propose a framework for applying the Multiple Gradient Descent algorithm (MGDA) for learning in multi-agent settings. We further show that standard MGDA is subject to weak Pareto convergence, a problem that is often overlooked in other learning settings but is prevalent in multi-agent reinforcement learning. To mitigate this issue, we propose MGDA++, an improvement of the existing algorithm to handle the weakly optimal convergence of MGDA properly. Theoretically, we prove that MGDA++ converges to strong Pareto optimal solutions in convex, smooth bi-objective problems. We further demonstrate the superiority of our MGDA++ in cooperative settings in the Gridworld benchmark. The results highlight that our proposed method can converge efficiently and outperform the other methods in terms of the optimality of the convergent policies. The source code is available at this https URL.
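For reference, plain MGDA has a closed form in the two-objective case: the minimum-norm convex combination of the two gradients gives a common descent direction. The sketch below shows that baseline step only (not the MGDA++ modification, whose handling of weak Pareto convergence is the paper's contribution).

```python
import numpy as np

def min_norm_combination(g1, g2):
    """Find alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2.
    The resulting direction is a common descent direction for both
    objectives and vanishes only at Pareto stationary points."""
    diff = g1 - g2
    denom = np.dot(diff, diff)
    if denom == 0.0:                      # identical gradients
        return 1.0, g1
    alpha = np.clip(np.dot(g2 - g1, g2) / denom, 0.0, 1.0)
    return alpha, alpha * g1 + (1 - alpha) * g2

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
alpha, d = min_norm_combination(g1, g2)   # alpha = 0.5, d = [0.5, 0.5]
```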
[LG-25] Notes on the Mathematical Structure of GPT LLM Architectures
Link: https://arxiv.org/abs/2410.19370
Authors: Spencer Becker-Kahn
Keywords-EN: neural network architecture, mathematics underpinning, underpinning the neural, neural network, network architecture
Subjects: Machine Learning (cs.LG)
Comments: 10 pages
Abstract:An exposition of the mathematics underpinning the neural network architecture of a GPT-3-style LLM.
[LG-26] FeBiM: Efficient and Compact Bayesian Inference Engine Empowered with Ferroelectric In-Memory Computing
Link: https://arxiv.org/abs/2410.19356
Authors: Chao Li, Zhicheng Xu, Bo Wen, Ruibin Mao, Can Li, Thomas Kämpfe, Kai Ni, Xunzhao Yin
Keywords-EN: Bayesian inference, network-based machine learning, conventional neural network-based, Bayesian, limited training data
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 6 pages, 8 figures, to be published in the 61st DAC (Design Automation Conference) proceedings
Abstract:In scenarios with limited training data or where explainability is crucial, conventional neural network-based machine learning models often face challenges. In contrast, Bayesian inference-based algorithms excel in providing interpretable predictions and reliable uncertainty estimation in these scenarios. While many state-of-the-art in-memory computing (IMC) architectures leverage emerging non-volatile memory (NVM) technologies to offer unparalleled computing capacity and energy efficiency for neural network workloads, their application in Bayesian inference is limited. This is because the core operations in Bayesian inference differ significantly from the multiplication-accumulation (MAC) operations common in neural networks, rendering them generally unsuitable for direct implementation in most existing IMC designs. In this paper, we propose FeBiM, an efficient and compact Bayesian inference engine powered by multi-bit ferroelectric field-effect transistor (FeFET)-based IMC. FeBiM effectively encodes the trained probabilities of a Bayesian inference model within a compact FeFET-based crossbar. It maps quantized logarithmic probabilities to discrete FeFET states. As a result, the accumulated outputs of the crossbar naturally represent the posterior probabilities, i.e., the Bayesian inference model's output given a set of observations. This approach enables efficient in-memory Bayesian inference without the need for additional calculation circuitry. As the first FeFET-based in-memory Bayesian inference engine, FeBiM achieves an impressive storage density of 26.32 Mb/mm² and a computing efficiency of 581.40 TOPS/W in a representative Bayesian classification task. These results demonstrate a 10.7×/43.4× improvement in compactness/efficiency compared to the state-of-the-art hardware implementation of Bayesian inference.
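A software analogue of the readout described above: quantize per-class log-likelihoods to a few discrete levels (standing in for multi-bit FeFET conductance states) and classify by accumulating them, as a crossbar column would accumulate currents. A conceptual sketch only; the shapes, level count, and names are assumptions, not the paper's design.

```python
import numpy as np

def quantize_log_probs(log_p, n_states=8):
    # Snap log-probabilities onto a small set of discrete levels, standing
    # in for the multi-bit FeFET conductance states.
    levels = np.linspace(log_p.min(), log_p.max(), n_states)
    return levels[np.abs(log_p[..., None] - levels).argmin(-1)]

def crossbar_classify(x, q_log_lik, log_prior):
    # q_log_lik[c, f, v] = quantized log P(feature f takes value v | class c).
    # Summing the entries selected by the observation x mimics current
    # accumulation along a crossbar column; argmax gives the MAP class.
    scores = log_prior + q_log_lik[:, np.arange(len(x)), x].sum(axis=1)
    return scores.argmax()

rng = np.random.default_rng(0)
C, F, V = 3, 5, 4
log_lik = np.log(rng.dirichlet(np.ones(V), size=(C, F)))   # (C, F, V)
q = quantize_log_probs(log_lik)
x = rng.integers(0, V, size=F)                             # one observation
print(crossbar_classify(x, q, np.log(np.full(C, 1 / C))))
```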
[LG-27] Free-Rider and Conflict Aware Collaboration Formation for Cross-Silo Federated Learning
Link: https://arxiv.org/abs/2410.19321
Authors: Mengmeng Chen, Xiaohu Wu, Xiaoli Tang, Tiantian He, Yew-Soon Ong, Qiqi Liu, Qicheng Lao, Han Yu
Keywords-EN: machine learning paradigm, Federated learning, sharing private data, machine learning, learning paradigm
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Federated learning (FL) is a machine learning paradigm that allows multiple FL participants (FL-PTs) to collaborate on training models without sharing private data. Due to data heterogeneity, negative transfer may occur in the FL training process. This necessitates FL-PT selection based on their data complementarity. In cross-silo FL, organizations that engage in business activities are key sources of FL-PTs. The resulting FL ecosystem has two features: (i) self-interest, and (ii) competition among FL-PTs. This requires the desirable FL-PT selection strategy to simultaneously mitigate the problems of free riders and conflicts of interest among competitors. To this end, we propose an optimal FL collaboration formation strategy – FedEgoists – which ensures that: (1) a FL-PT can benefit from FL if and only if it benefits the FL ecosystem, and (2) a FL-PT will not contribute to its competitors or their supporters. It provides an efficient clustering solution to group FL-PTs into coalitions, ensuring that within each coalition, FL-PTs share the same interest. We theoretically prove that the FL-PT coalitions formed are optimal since no coalitions can collaborate together to improve the utility of any of their members. Extensive experiments on widely adopted benchmark datasets demonstrate the effectiveness of FedEgoists compared to nine state-of-the-art baseline methods, and its ability to establish efficient collaborative networks in cross-silo FL with FL-PTs that engage in business activities.
[LG-28] Golden Ratio-Based Sufficient Dimension Reduction
Link: https://arxiv.org/abs/2410.19300
Authors: Wenjing Yang, Yuhong Yang
Keywords-EN: high dimensional data, machine learning applications, learning applications deal, dimensional data, applications deal
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Many machine learning applications deal with high dimensional data. To make computations feasible and learning more efficient, it is often desirable to reduce the dimensionality of the input variables by finding linear combinations of the predictors that retain as much of the original information as possible about the relationship between the response and the original predictors. We propose a neural network based sufficient dimension reduction method that not only identifies the structural dimension effectively, but also estimates the central space well. It takes advantage of the approximation capabilities of neural networks for functions in Barron classes and leads to reduced computation cost compared to other dimension reduction methods in the literature. Additionally, the framework can be extended to practical dimension reduction, making the methodology more applicable in practical settings.
[LG-29] Coordinated Reply Attacks in Influence Operations: Characterization and Detection
Link: https://arxiv.org/abs/2410.19272
Authors: Manita Pote, Tuğrulcan Elmas, Alessandro Flammini, Filippo Menczer
Keywords-EN: harass targeted individuals, Coordinated reply attacks, online influence operations, observed in online, campaigns to support
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:Coordinated reply attacks are a tactic observed in online influence operations and other coordinated campaigns to support or harass targeted individuals, or influence them or their followers. Despite its potential to influence the public, past studies have yet to analyze or provide a methodology to detect this tactic. In this study, we characterize coordinated reply attacks in the context of influence operations on Twitter. Our analysis reveals that the primary targets of these attacks are influential people such as journalists, news media, state officials, and politicians. We propose two supervised machine-learning models, one to classify tweets to determine whether they are targeted by a reply attack, and one to classify accounts that reply to a targeted tweet to determine whether they are part of a coordinated attack. The classifiers achieve AUC scores of 0.88 and 0.97, respectively. These results indicate that accounts involved in reply attacks can be detected, and the targeted accounts themselves can serve as sensors for influence operation detection.
[LG-30] A Survey of Deep Graph Learning under Distribution Shifts: from Graph Out-of-Distribution Generalization to Adaptation
Link: https://arxiv.org/abs/2410.19265
Authors: Kexin Zhang, Shuhan Liu, Song Wang, Weili Shi, Chen Chen, Pan Li, Sheng Li, Jundong Li, Kaize Ding
Keywords-EN: graph machine learning, machine learning, graph OOD, graph OOD adaptation, graph machine
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 2 figures. arXiv admin note: text overlap with arXiv:2402.11153
Abstract:Distribution shifts on graphs – the discrepancies in data distribution between training and employing a graph machine learning model – are ubiquitous and often unavoidable in real-world scenarios. These shifts may severely deteriorate model performance, posing significant challenges for reliable graph machine learning. Consequently, there has been a surge in research on graph machine learning under distribution shifts, aiming to train models to achieve satisfactory performance on out-of-distribution (OOD) test data. In our survey, we provide an up-to-date and forward-looking review of deep graph learning under distribution shifts. Specifically, we cover three primary scenarios: graph OOD generalization, training-time graph OOD adaptation, and test-time graph OOD adaptation. We begin by formally formulating the problems and discussing various types of distribution shifts that can affect graph learning, such as covariate shifts and concept shifts. To provide a better understanding of the literature, we systematically categorize the existing models based on our proposed taxonomy and investigate the techniques adopted behind them. We also summarize commonly used datasets in this research area to facilitate further investigation. Finally, we point out promising research directions and the corresponding challenges to encourage further study in this vital domain. Additionally, we provide a continuously updated reading list at this https URL.
[LG-31] Spatioformer: A Geo-encoded Transformer for Large-Scale Plant Species Richness Prediction
Link: https://arxiv.org/abs/2410.19256
Authors: Yiqing Guo, Karel Mokany, Shaun R. Levick, Jinyan Yang, Peyman Moghadam
Keywords-EN: Earth observation data, geographically distant regions, Earth observation, predicting species richness, spectral measurements
Subjects: Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing
Abstract:Earth observation data have shown promise in predicting species richness of vascular plants (α-diversity), but extending this approach to large spatial scales is challenging because geographically distant regions may exhibit different compositions of plant species (β-diversity), resulting in a location-dependent relationship between richness and spectral measurements. In order to handle such geolocation dependency, we propose Spatioformer, where a novel geolocation encoder is coupled with the transformer model to encode geolocation context into remote sensing imagery. The Spatioformer model compares favourably to state-of-the-art models in richness predictions on a large-scale ground-truth richness dataset (HAVPlot) that consists of 68,170 in-situ richness samples covering diverse landscapes across Australia. The results demonstrate that geolocational information is advantageous in predicting species richness from satellite observations over large spatial scales. With Spatioformer, plant species richness maps over Australia are compiled from the Landsat archive for the years 2015 to 2023. The richness maps produced in this study reveal the spatiotemporal dynamics of plant species richness in Australia, providing supporting evidence to inform effective planning and policy development for plant diversity conservation. Regions of high richness prediction uncertainties are identified, highlighting the need for future in-situ surveys to be conducted in these areas to enhance the prediction accuracy.
[LG-32] CHESTNUT: A QoS Dataset for Mobile Edge Environments
Link: https://arxiv.org/abs/2410.19248
Authors: Guobing Zou, Fei Zhao, Shengxiang Hu
Keywords-EN: network services, Quality, Quality of Service, geographic location, QoS
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Quality of Service (QoS) is an important metric to measure the performance of network services. Nowadays, it is widely used in mobile edge environments to evaluate the quality of service when mobile devices request services from edge servers. QoS usually involves multiple dimensions, such as bandwidth, latency, jitter, and data packet loss rate. However, most existing QoS datasets, such as the common WS-Dream dataset, focus mainly on static QoS metrics of network services and ignore dynamic attributes such as time and geographic location. That is, they do not record the mobile device's location at the time of the service request or the chronological order in which requests were made. However, these dynamic attributes are crucial for understanding and predicting the actual performance of network services, as QoS performance typically fluctuates with time and geographic location. To this end, we propose a novel dataset that accurately records temporal and geographic location information on quality of service during the collection process, aiming to provide more accurate and reliable data to support future QoS prediction in mobile edge environments.
[LG-33] Enhancing Exchange Rate Forecasting with Explainable Deep Learning Models
Link: https://arxiv.org/abs/2410.19241
Authors: Shuchen Meng, Andi Chen, Chihang Wang, Mengyao Zheng, Fangyu Wu, Xupeng Chen, Haowei Ni, Panfeng Li
Keywords-EN: exchange rate, USD exchange rate, Accurate exchange rate, stability and international, critical focus
Subjects: Machine Learning (cs.LG)
Comments: Accepted by 2024 5th International Conference on Machine Learning and Computer Application
Abstract:Accurate exchange rate prediction is fundamental to financial stability and international trade, positioning it as a critical focus in economic and financial research. Traditional forecasting models often falter when addressing the inherent complexities and non-linearities of exchange rate data. This study explores the application of advanced deep learning models, including LSTM, CNN, and transformer-based architectures, to enhance the predictive accuracy of the RMB/USD exchange rate. Utilizing 40 features across 6 categories, the analysis identifies TSMixer as the most effective model for this task. A rigorous feature selection process emphasizes the inclusion of key economic indicators, such as China-U.S. trade volumes and exchange rates of other major currencies like the euro-RMB and yen-dollar pairs. The integration of grad-CAM visualization techniques further enhances model interpretability, allowing for clearer identification of the most influential features and bolstering the credibility of the predictions. These findings underscore the pivotal role of fundamental economic data in exchange rate forecasting and highlight the substantial potential of machine learning models to deliver more accurate and reliable predictions, thereby serving as a valuable tool for financial analysis and decision-making.
[LG-34] SHAP zero Explains All-order Feature Interactions in Black-box Genomic Models with Near-zero Query Cost
Link: https://arxiv.org/abs/2410.19236
Authors: Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Keywords-EN: machine learning, theoretical guarantees, rapid growth, popular method, Shapley
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Genomics (q-bio.GN); Computation (stat.CO)
Comments:
Abstract:With the rapid growth of black-box models in machine learning, Shapley values have emerged as a popular method for model explanations due to their theoretical guarantees. Shapley values locally explain a model to an input query using additive features. Yet, in genomics, extracting biological knowledge from black-box models hinges on explaining nonlinear feature interactions globally across hundreds to thousands of input query sequences. Herein, we develop SHAP zero, an algorithm that estimates all-order Shapley feature interactions with a near-zero cost per queried sequence after paying a one-time fee for model sketching. SHAP zero achieves this by establishing a surprisingly underexplored connection between the Shapley interactions and the Fourier transform of the model. Explaining two genomic models, one trained to predict guide RNA binding and the other to predict DNA repair outcomes, we demonstrate that SHAP zero achieves orders of magnitude reduction in amortized computational cost compared to state-of-the-art algorithms. SHAP zero reveals all microhomologous motifs that are predictive of DNA repair outcome, a finding previously inaccessible due to the combinatorial space of possible high-order feature interactions.
[LG-35] Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision
Link: https://arxiv.org/abs/2410.19222
Authors: Aayush Shah, Chakradhar Guntuboina, Amir Barati Farimani
Keywords-EN: demonstrated remarkable capabilities, natural language processing, traditional text generation, recent years, demonstrated remarkable
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Abstract:In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non-fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task-specific classifiers - PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of NLP-based approaches to pave the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: this https URL.
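A sketch of the described filtering pipeline (perplexity ranking followed by a pLDDT > 70 structure filter; the convex-hull step is omitted here). The two scorers below are random placeholders so the sketch runs end to end; in the paper they come from the language model's perplexity and from ESMFold.

```python
import random

def perplexity(seq):            # hypothetical stand-in for the LM's perplexity
    return random.uniform(1.0, 20.0)

def plddt(seq):                 # hypothetical stand-in for ESMFold's pLDDT
    return random.uniform(40.0, 95.0)

def filter_generated_peptides(seqs, keep_frac=0.5, plddt_min=70.0):
    """(1) Rank generated sequences by perplexity and keep the best fraction;
    (2) keep only sequences whose predicted-structure confidence exceeds 70,
    following the ordered-structure criterion described above."""
    ranked = sorted(seqs, key=perplexity)
    kept = ranked[: max(1, int(keep_frac * len(ranked)))]
    return [s for s in kept if plddt(s) > plddt_min]

random.seed(0)
candidates = ["MKTLLVAG", "GGGGSGGG", "ACDEFGHIK", "LLKKLLKK"]
print(filter_generated_peptides(candidates))
```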
[LG-36] Predicting Liquidity Coverage Ratio with Gated Recurrent Units: A Deep Learning Model for Risk Management
Link: https://arxiv.org/abs/2410.19211
Authors: Zhen Xu, Jingming Pan, Siyuan Han, Hongju Ouyang, Yuan Chen, Mohan Jiang
Keywords-EN: facing unprecedented challenges, global economic integration, liquidity risk, financial institutions, unprecedented challenges
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:With global economic integration and the increasing interconnection of financial markets, financial institutions are facing unprecedented challenges, especially liquidity risk. This paper proposes a liquidity coverage ratio (LCR) prediction model based on the gated recurrent unit (GRU) network to help financial institutions manage their liquidity risk more effectively. By utilizing the GRU network in deep learning technology, the model can automatically learn complex patterns from historical data and accurately predict LCR for a period of time in the future. The experimental results show that compared with traditional methods, the GRU model proposed in this study shows significant advantages in mean absolute error (MAE), proving its higher accuracy and robustness. This not only provides financial institutions with a more reliable liquidity risk management tool but also provides support for regulators to formulate more scientific and reasonable policies, which helps to improve the stability of the entire financial system.
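A generic GRU regressor of the kind described, in PyTorch. Layer sizes, window length, and the MAE objective shown here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    """Minimal GRU model for predicting a ratio such as the LCR from a
    window of historical indicators."""
    def __init__(self, n_features, hidden=64, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, time, n_features)
        out, _ = self.gru(x)
        return self.head(out[:, -1])   # predict from the last hidden state

model = GRURegressor(n_features=8)
x = torch.randn(32, 30, 8)             # 32 samples, 30 time steps each
loss = nn.L1Loss()(model(x), torch.randn(32, 1))  # MAE, as evaluated above
loss.backward()
```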
[LG-37] Binary Classification: Is Boosting stronger than Bagging?
Link: https://arxiv.org/abs/2410.19200
Authors: Dimitris Bertsimas, Vasiliki Stoumpou
Keywords-EN: Random Forests, Enhanced Random Forests, handling tabular datasets, Random Forests adopt, vanilla Random Forests
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Random Forests have been one of the most popular bagging methods in the past few decades, especially due to their success at handling tabular datasets. They have been extensively studied and compared to boosting models, like XGBoost, which are generally considered more performant. Random Forests adopt several simplistic assumptions, such as that all samples and all trees that form the forest are equally important for building the final model. We introduce Enhanced Random Forests, an extension of vanilla Random Forests with extra functionalities and adaptive sample and model weighting. We develop an iterative algorithm for adapting the training sample weights, by favoring the hardest examples, and an approach for finding personalized tree weighting schemes for each new sample. Our method significantly improves upon regular Random Forests across 15 different binary classification datasets and considerably outperforms other tree methods, including XGBoost, when run with default hyperparameters, which indicates the robustness of our approach across datasets, without the need for extensive hyperparameter tuning. Our tree-weighting methodology results in enhanced or comparable performance to the uniformly weighted ensemble, and is, more importantly, leveraged to define importance scores for trees based on their contributions to classifying each new sample. This enables us to only focus on a small number of trees as the main models that define the outcome of a new sample and, thus, to partially recover interpretability, which is critically missing from both bagging and boosting methods. In binary classification problems, the proposed extensions and the corresponding results suggest the equivalence of bagging and boosting methods in performance, and the edge of bagging in interpretability by leveraging a few learners of the ensemble, which is not an option in the less explainable boosting methods.
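A simplified reading of the adaptive sample-weighting idea using scikit-learn: refit the forest a few times, upweighting examples the previous fit misclassified. The authors' method also learns per-sample tree weights, which this sketch omits; the `rounds` and `boost` parameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def enhanced_rf(X, y, rounds=3, n_trees=100, boost=2.0, seed=0):
    """Iteratively refit a forest, favoring the hardest examples."""
    w = np.ones(len(y))
    for _ in range(rounds):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        rf.fit(X, y, sample_weight=w)
        w[rf.predict(X) != y] *= boost   # upweight misclassified samples
    return rf

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
print(enhanced_rf(X, y).score(X, y))
```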
[LG-38] TEAM: Topological Evolution-aware Framework for Traffic Forecasting (Extended Version) VLDB2025
Link: https://arxiv.org/abs/2410.19192
Authors: Duc Kieu, Tung Kieu, Peng Han, Bin Yang, Christian S. Jensen, Bac Le
Keywords-EN: people increasingly move, people increasingly, continue to grow, global trend, increasingly move
Subjects: Machine Learning (cs.LG)
Comments: 16 pages. An extended version of "TEAM: Topological Evolution-aware Framework for Traffic Forecasting" accepted at PVLDB 2025
Abstract:Due to the global trend towards urbanization, people increasingly move to and live in cities that then continue to grow. Traffic forecasting plays an important role in the intelligent transportation systems of cities as well as in spatio-temporal data mining. State-of-the-art forecasting is achieved by deep-learning approaches due to their ability to contend with complex spatio-temporal dynamics. However, existing methods assume that the input consists of fixed-topology road networks and static traffic time series. These assumptions fail to align with urbanization, where time series are collected continuously and road networks evolve over time. In such settings, deep-learning models require frequent re-initialization and re-training, imposing high computational costs. To enable much more efficient training without jeopardizing model accuracy, we propose the Topological Evolution-aware Framework (TEAM) for traffic forecasting that incorporates convolution and attention. This combination of mechanisms enables better adaptation to newly collected time series, while being able to maintain learned knowledge from old time series. TEAM features a continual learning module based on the Wasserstein metric that acts as a buffer that can identify the most stable and the most changing network nodes. Then, only data related to stable nodes is employed for re-training when consolidating a model. Further, only data of new nodes and their adjacent nodes as well as data pertaining to changing nodes are used to re-train the model. Empirical studies with two real-world traffic datasets offer evidence that TEAM is capable of much lower re-training costs than existing methods, without jeopardizing forecasting accuracy.
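A sketch of the buffer criterion: score each road-network node by the Wasserstein distance between its old and newly collected readings, then split nodes into most-stable and most-changing sets. The windowing and selection fraction below are assumptions, not TEAM's exact thresholds.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def split_stable_changing(old_series, new_series, frac=0.2):
    """Return (stable node ids, changing node ids) by distribution shift."""
    dists = np.array([wasserstein_distance(o, n)
                      for o, n in zip(old_series, new_series)])
    order = np.argsort(dists)
    k = max(1, int(frac * len(dists)))
    return order[:k], order[-k:]

rng = np.random.default_rng(0)
old = rng.normal(50, 5, size=(10, 288))          # 10 sensors, one day each
new = old + rng.normal(0, 1, size=old.shape)
new[3] += 20                                      # one sensor's regime shifts
stable, changing = split_stable_changing(old, new)
print(changing)                                   # node 3 should appear here
```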
[LG-39] Cascading Failure Prediction via Causal Inference
Link: https://arxiv.org/abs/2410.19179
Authors: Shiuli Subhra Ghosh, Anmol Dwivedi, Ali Tajer, Kyongmin Yeo, Wesley M. Gifford
Keywords-EN: interacting agents, framework, causal inference framework, power transmission networks, Causal inference
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:Causal inference provides an analytical framework to identify and quantify cause-and-effect relationships among a network of interacting agents. This paper offers a novel framework for analyzing cascading failures in power transmission networks. This framework generates a directed latent graph in which the nodes represent the transmission lines and the directed edges encode the cause-effect relationships. This graph has a structure distinct from the system's topology, reflecting the fact that both local and non-local interdependencies exist among transmission lines, beyond the purely local interdependencies that topological graphs can represent. This paper formalizes a causal inference framework for predicting how an emerging anomaly propagates throughout the system. Using this framework, two algorithms are designed, providing an analytical framework to identify the most likely and most costly cascading scenarios. The framework's effectiveness is evaluated compared to the pertinent literature on the IEEE 14-bus, 39-bus, and 118-bus systems.
[LG-40] Perturbation-based Graph Active Learning for Weakly-Supervised Belief Representation Learning
Link: https://arxiv.org/abs/2410.19176
Authors: Dachun Sun, Ruijie Wang, Jinning Li, Ruipeng Han, Xinyi Liu, You Lyu, Tarek Abdelzaher
Keywords-EN: belief representation learning, optimizing the allocation, representation learning, semi-supervised belief representation, social networks
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:This paper addresses the problem of optimizing the allocation of labeling resources for semi-supervised belief representation learning in social networks. The objective is to strategically identify valuable messages on social media graphs that are worth labeling within a constrained budget, ultimately maximizing the task's performance. Despite the progress in unsupervised or semi-supervised methods in advancing belief and ideology representation learning on social networks and the remarkable efficacy of graph learning techniques, the availability of high-quality curated labeled social data can greatly benefit and further improve performance. Consequently, allocating labeling efforts is a critical research problem in scenarios where labeling resources are limited. This paper proposes a graph data augmentation-inspired perturbation-based active learning strategy (PerbALGraph) that progressively selects messages for labeling according to an automatic estimator, obviating human guidance. This estimator is based on the principle that messages in the network that exhibit heightened sensitivity to structural features of the observational data indicate landmark quality that significantly influences semi-supervision processes. We design the estimator to be the prediction variance under a set of designed graph perturbations, which is model-agnostic and application-independent. Extensive experimental results demonstrate the effectiveness of the proposed strategy for belief representation learning tasks.
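The estimator is easy to sketch in a model-agnostic way: perturb the graph a few times and score each node by the variance of its predictions. The predictor below is a stand-in (one round of feature propagation), and the edge-dropping scheme is an assumption; the paper's perturbation set may differ.

```python
import numpy as np

def perturbation_variance_scores(A, X, predict, n_perturb=10, drop_p=0.1, seed=0):
    """Score each node by prediction variance under random edge-dropping
    perturbations; `predict(A, X)` is any function returning one score
    per node."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_perturb):
        mask = rng.random(A.shape) > drop_p        # randomly drop edges
        A_pert = A * mask * mask.T                  # keep perturbation symmetric
        preds.append(predict(A_pert, X))
    return np.var(np.stack(preds), axis=0)          # one score per node

predict = lambda A, X: (A @ X).mean(axis=1)         # stand-in predictor
rng = np.random.default_rng(1)
A = (rng.random((30, 30)) < 0.2).astype(float); A = np.maximum(A, A.T)
X = rng.normal(size=(30, 8))
scores = perturbation_variance_scores(A, X, predict)
to_label = np.argsort(scores)[::-1][:5]              # query these for labels
```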
[LG-41] Learning Coupled Subspaces for Multi-Condition Spike Data
Link: https://arxiv.org/abs/2410.19153
Authors: Yididiya Y. Nadew, Xuhui Fan, Christopher J. Quinn
Keywords-EN: researchers typically conduct, high-dimensional spike train, acquire neural responses, high-dimensional spike, typically conduct experiments
Subjects: Machine Learning (cs.LG)
Comments: 27 pages, 3 figures
Abstract:In neuroscience, researchers typically conduct experiments under multiple conditions to acquire neural responses in the form of high-dimensional spike train datasets. Analysing high-dimensional spike data is a challenging statistical problem. To this end, Gaussian process factor analysis (GPFA), a popular class of latent variable models, has been proposed. GPFA extracts smooth, low-dimensional latent trajectories underlying high-dimensional spike train datasets. However, such analyses are often done separately for each experimental condition, contrary to the nature of neural datasets, which contain recordings under multiple experimental conditions. Exploiting the parametric nature of these conditions, we propose a multi-condition GPFA model and inference procedure to learn the underlying latent structure in the corresponding datasets in a sample-efficient manner. In particular, we propose a non-parametric Bayesian approach to learn a smooth tuning function over the experiment condition space. Our approach not only boosts model accuracy and is faster, but also improves model interpretability compared to approaches that separately fit models for each experimental condition.
[LG-42] Structured Diffusion Models with Mixture of Gaussians as Prior Distribution
Link: https://arxiv.org/abs/2410.19149
Authors: Nanshan Jia, Tingyu Zhu, Haoyu Liu, Zeyu Zheng
Keywords-EN: standard Gaussian distribution, mixed Gaussian distribution, Gaussian distribution, standard Gaussian, mixed Gaussian
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We propose a class of structured diffusion models, in which the prior distribution is chosen as a mixture of Gaussians, rather than a standard Gaussian distribution. The specific mixed Gaussian distribution, as prior, can be chosen to incorporate certain structured information of the data. We develop a simple-to-implement training procedure that smoothly accommodates the use of mixed Gaussian as prior. Theory is provided to quantify the benefits of our proposed models, compared to the classical diffusion models. Numerical experiments with synthetic, image and operational data are conducted to show comparative advantages of our model. Our method is shown to be robust to mis-specifications and in particular suits situations where training resources are limited or faster training in real time is desired.
[LG-43] Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers
Link: https://arxiv.org/abs/2410.19139
Authors: Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou
Keywords-EN: fit training data, training data perfectly, over-parameterized neural networks, Benign overfitting refers, output layer
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 80 pages, 3 figures, 1 table
Abstract:Benign overfitting refers to how over-parameterized neural networks can fit training data perfectly and generalize well to unseen data. While this has been widely investigated theoretically, existing works are limited to two-layer networks with fixed output layers, where only the hidden weights are trained. We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers, which is closer to the practice. Our results show that the initialization scaling of the output layer is crucial to the training dynamics: large scales make the model training behave similarly to that with the fixed output, the hidden layer grows rapidly while the output layer remains largely unchanged; in contrast, small scales result in more complex layer interactions, the hidden layer initially grows to a specific ratio relative to the output layer, after which both layers jointly grow and maintain that ratio throughout training. Furthermore, in both settings, we provide nearly matching upper and lower bounds on the test errors, identifying the sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) in which the benign overfitting can be achieved or not. Numerical experiments back up the theoretical results.
[LG-44] Context-Aware Trajectory Anomaly Detection
Link: https://arxiv.org/abs/2410.19136
Authors: Haoji Hu, Jina Kim, Jinwei Zhou, Sofia Kirsanova, JangHyeon Lee, Yao-Yi Chiang
Keywords-EN: human mobility management, Trajectory anomaly detection, anomaly detection, mobility management, Trajectory
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Trajectory anomaly detection is crucial for effective decision-making in urban and human mobility management. Existing methods of trajectory anomaly detection generally focus on training a trajectory generative model and evaluating the likelihood of reconstructing a given trajectory. However, previous work often lacks important contextual information on the trajectory, such as the agent's information (e.g., agent ID) or geographic information (e.g., Points of Interest (POI)), which could help accurately capture anomalous behaviors. To fill this gap, we propose a context-aware anomaly detection approach that models contextual information related to trajectories. The proposed method is based on a trajectory reconstruction framework guided by contextual factors such as agent ID and contextual POI embedding. The injection of contextual information aims to improve the performance of anomaly detection. We conducted experiments in two cities and demonstrated that the proposed approach significantly outperformed existing methods by effectively modeling contextual information. Overall, this paper opens a new direction for advancing trajectory anomaly detection.
[LG-45] LanFL: Differentially Private Federated Learning with Large Language Models using Synthetic Samples
Link: https://arxiv.org/abs/2410.19114
Authors: Huiyu Wu, Diego Klabjan
Keywords-EN: single global model, privacy-preserving machine learning, Large Language Models, machine learning framework, powerful Large Language
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Federated Learning (FL) is a collaborative, privacy-preserving machine learning framework that enables multiple participants to train a single global model. However, the recent advent of powerful Large Language Models (LLMs) with tens to hundreds of billions of parameters makes the naive application of traditional FL methods to LLMs impractical due to high computational and communication costs. Furthermore, end users of LLMs often lack access to full architectures and weights of the models, making it impossible for participants to fine-tune these models directly. This paper introduces a novel FL scheme for LLMs, named LanFL, which is purely prompt-based and treats the underlying LLMs as black boxes. We have developed a differentially private synthetic sample generation mechanism to facilitate knowledge sharing among participants, along with a prompt optimization scheme that enables learning from synthetic samples. Our extensive experiments demonstrate that LanFL successfully facilitates learning among participants while preserving the privacy of local datasets across various tasks.
[LG-46] TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction
Link: https://arxiv.org/abs/2410.19103
Authors: Yuhang Li, Priyadarshini Panda
Keywords-EN: Large language models, natural language processing, revolutionized natural language, Large language, language processing
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) have revolutionized natural language processing, albeit at the cost of immense memory and computation requirements. Post-training quantization (PTQ) is becoming the de facto method to reduce the memory footprint and improve the inference throughput of LLMs. In this work, we aim to push the upper limit of LLM PTQ by optimizing the weight rounding parameters with the block reconstruction technique, a predominant method in previous vision models. We propose TesseraQ, a new state-of-the-art PTQ technique, to quantize the weights of LLMs to ultra-low bits. To effectively optimize the rounding in LLMs and stabilize the reconstruction process, we introduce progressive adaptive rounding. This approach iteratively transitions the soft rounding variables to hard variables during the reconstruction process. Additionally, we optimize the dequantization scale parameters to fully leverage the block reconstruction technique. We demonstrate that TesseraQ can be seamlessly integrated with existing scaling or clipping-based PTQ algorithms such as AWQ and OmniQuant, significantly enhancing their performance and establishing a new state-of-the-art. For instance, when compared to AWQ, TesseraQ improves the wikitext2 perplexity from 14.65 to 6.82 and average downstream accuracy from 50.52 to 59.27 with 2-bit weight-only quantization of LLaMA-2-7B. Across a range of quantization schemes, including W2A16, W3A16, W3A3, and W4A4, TesseraQ consistently exhibits superior performance.
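A toy sketch of the progressive soft-to-hard rounding idea: a learnable logit per weight decides whether to round up or down, and annealing a temperature pushes that soft decision toward a hard 0/1 one. The names, the reconstruction-style loss, and the annealing schedule are assumptions, not the released TesseraQ code.

```python
import torch

def soft_round(w, scale, v, tau):
    # Each weight is quantized to floor(w/s) plus a learnable probability
    # sigmoid(v/tau) of rounding up; tau -> 0 hardens the decision.
    return (torch.floor(w / scale) + torch.sigmoid(v / tau)) * scale

w = torch.randn(4)
scale = torch.tensor(0.1)
v = torch.zeros_like(w, requires_grad=True)   # learned rounding logits
for tau in [1.0, 0.3, 0.1, 0.03]:             # progressively harden
    w_q = soft_round(w, scale, v, tau)
    loss = (w_q - w).pow(2).sum()             # block-reconstruction style loss
    loss.backward()
    v.data -= 0.1 * v.grad
    v.grad = None
w_hard = (torch.floor(w / scale) + (v > 0).float()) * scale  # final hard rounding
```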
[LG-47] Provable Tempered Overfitting of Minimal Nets and Typical Nets
Link: https://arxiv.org/abs/2410.19092
Authors: Itamar Harel, William M. Hoza, Gal Vardi, Itay Evron, Nathan Srebro, Daniel Soudry
Keywords-EN: deep Neural Networks, noisy training set, Neural Networks, connected deep Neural, binary weights fitted
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 60 pages, 4 figures
Abstract:We study the overfitting behavior of fully connected deep Neural Networks (NNs) with binary weights fitted to perfectly classify a noisy training set. We consider interpolation using both the smallest NN (having the minimal number of weights) and a random interpolating NN. For both learning rules, we prove overfitting is tempered. Our analysis rests on a new bound on the size of a threshold circuit consistent with a partial function. To the best of our knowledge, ours are the first theoretical results on benign or tempered overfitting that: (1) apply to deep NNs, and (2) do not require a very high or very low input dimension.
[LG-48] FastSurvival: Hidden Computational Blessings in Training Cox Proportional Hazards Models NEURIPS2024
Link: https://arxiv.org/abs/2410.19081
Authors: Jiachang Liu, Rui Zhang, Cynthia Rudin
Keywords-EN: important research topic, Survival analysis, CPH model, important research, research topic
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted into NeurIPS 2024
Abstract:Survival analysis is an important research topic with applications in healthcare, business, and manufacturing. One essential tool in this area is the Cox proportional hazards (CPH) model, which is widely used for its interpretability, flexibility, and predictive performance. However, for modern data science challenges such as high dimensionality (both n and p) and high feature correlations, current algorithms to train the CPH model have drawbacks, preventing us from using the CPH model at its full potential. The root cause is that the current algorithms, based on the Newton method, have trouble converging due to vanishing second order derivatives when outside the local region of the minimizer. To circumvent this problem, we propose new optimization methods by constructing and minimizing surrogate functions that exploit hidden mathematical structures of the CPH model. Our new methods are easy to implement and ensure monotonic loss decrease and global convergence. Empirically, we verify the computational efficiency of our methods. As a direct application, we show how our optimization methods can be used to solve the cardinality-constrained CPH problem, producing very sparse high-quality models that were not previously practical to construct. We list several extensions that our breakthrough enables, including optimization opportunities, theoretical questions on CPH's mathematical structure, as well as other CPH-related applications.
[LG-49] Target Strangeness: A Novel Conformal Prediction Difficulty Estimator
Link: https://arxiv.org/abs/2410.19077
Authors: Alexis Bose, Jonathan Ethier, Paul Guinand
Keywords-EN: introduces Target Strangeness, paper introduces Target, normalizing prediction intervals, Target Strangeness, paper introduces
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 2 figures
Abstract:This paper introduces Target Strangeness, a novel difficulty estimator for conformal prediction (CP) that offers an alternative approach for normalizing prediction intervals (PIs). By assessing how atypical a prediction is within the context of its nearest neighbours’ target distribution, Target Strangeness can surpass the current state-of-the-art performance. This novel difficulty estimator is evaluated against others in the context of several conformal regression experiments.
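A minimal sketch of the idea as described: measure how atypical a prediction is relative to the target distribution of the query's k nearest training neighbours (here, a simple z-score), and use that as the normalizing difficulty for conformal intervals. The exact strangeness definition in the paper may differ; the usage recipe in the comments is an assumption.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def target_strangeness(X_train, y_train, X_query, y_pred, k=10):
    """How atypical is each prediction relative to the target distribution
    of the query's k nearest training neighbours?"""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_query)
    neigh_y = y_train[idx]                      # (n_query, k)
    mu, sd = neigh_y.mean(axis=1), neigh_y.std(axis=1) + 1e-8
    return np.abs(y_pred - mu) / sd

# Assumed usage for normalized split-conformal intervals:
#   q = quantile of |residual| / difficulty on a calibration set
#   interval = y_pred +/- q * target_strangeness(X_tr, y_tr, X_new, y_pred)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)); y = X.sum(axis=1) + rng.normal(size=200)
print(target_strangeness(X[:150], y[:150], X[150:155], np.zeros(5)))
```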
[LG-50] An Investigation on Machine Learning Predictive Accuracy Improvement and Uncertainty Reduction using VAE-based Data Augmentation
Link: https://arxiv.org/abs/2410.19063
Authors: Farah Alsafadi, Mahmoud Yaseen, Xu Wu
Keywords-EN: place multiple engineering, multiple engineering fields, large datasets place, Machine Learning, datasets place multiple
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The confluence of ultrafast computers with large memory, rapid progress in Machine Learning (ML) algorithms, and the availability of large datasets places multiple engineering fields at the threshold of dramatic progress. However, a unique challenge in nuclear engineering is data scarcity because experimentation on nuclear systems is usually more expensive and time-consuming than most other disciplines. One potential way to resolve the data scarcity issue is deep generative learning, which uses certain ML models to learn the underlying distribution of existing data and generate synthetic samples that resemble the real data. In this way, one can significantly expand the dataset to train more accurate predictive ML models. In this study, our objective is to evaluate the effectiveness of data augmentation using variational autoencoder (VAE)-based deep generative models. We investigated whether the data augmentation leads to improved accuracy in the predictions of a deep neural network (DNN) model trained using the augmented data. Additionally, the DNN prediction uncertainties are quantified using Bayesian Neural Networks (BNN) and conformal prediction (CP) to assess the impact on predictive uncertainty reduction. To test the proposed methodology, we used TRACE simulations of steady-state void fraction data based on the NUPEC Boiling Water Reactor Full-size Fine-mesh Bundle Test (BFBT) benchmark. We found that augmenting the training dataset using VAEs has improved the DNN model's predictive accuracy, improved the prediction confidence intervals, and reduced the prediction uncertainties.
[LG-51] Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms NEURIPS2024
Link: https://arxiv.org/abs/2410.19055
Authors: Felix Petersen, Christian Borgelt, Tobias Sutter, Hilde Kuehne, Oliver Deussen, Stefano Ermon
Keywords-EN: common problem, custom objectives, losses, Newton Losses, neural network
Subjects: Machine Learning (cs.LG)
Comments: Published at NeurIPS 2024
Abstract:When training neural networks with custom objectives, such as ranking losses and shortest-path losses, a common problem is that they are, per se, non-differentiable. A popular approach is to continuously relax the objectives to provide gradients, enabling learning. However, such differentiable relaxations are often non-convex and can exhibit vanishing and exploding gradients, making them (already in isolation) hard to optimize. Here, the loss function poses the bottleneck when training a deep neural network. We present Newton Losses, a method for improving the performance of existing hard-to-optimize losses by exploiting their second-order information via their empirical Fisher and Hessian matrices. Instead of training the neural network with second-order techniques, we only utilize the loss function's second-order information to replace it by a Newton Loss, while training the network with gradient descent. This makes our method computationally efficient. We apply Newton Losses to eight differentiable algorithms for sorting and shortest-paths, achieving significant improvements for less-optimized differentiable algorithms, and consistent improvements, even for well-optimized differentiable algorithms.
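One way to render the idea in code: replace a hard loss l(z) on the network output z with a quadratic pull toward the Newton step target computed from l's own gradient and Hessian. The sketch below assumes a small output dimension, autograd Hessians, and a damping term; it is an interpretation of the construction, not the authors' implementation.

```python
import torch
from torch.autograd.functional import hessian

def newton_loss(loss_fn, z, damping=1e-3):
    # Gradient and Hessian of the hard loss at the current output z0.
    z0 = z.detach().requires_grad_(True)
    g = torch.autograd.grad(loss_fn(z0), z0)[0]
    H = hessian(loss_fn, z0)
    # Newton step target in output space, damped for invertibility.
    z_star = z0 - torch.linalg.solve(H + damping * torch.eye(z0.numel()), g)
    # Quadratic surrogate: a plain MSE pull towards the Newton target.
    return 0.5 * (z - z_star.detach()).pow(2).sum()

# toy non-convex "hard" objective on a 3-dimensional output
f = lambda z: (z.sin() + 0.1 * z.pow(2)).sum()
z = torch.randn(3, requires_grad=True)
newton_loss(f, z).backward()   # z.grad now points toward the Newton target
```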
[LG-52] Mixture of Parrots: Experts improve memorization more than reasoning
Link: https://arxiv.org/abs/2410.19034
Authors: Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach
Keywords-EN: minimal computational overhead, number of experts, architecture enables, computational overhead, enables a significant
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
[LG-53] Heterogeneous Random Forest
Link: https://arxiv.org/abs/2410.19022
Authors: Ye-eun Kim, Seoung Yun Kim, Hyunjoong Kim
Keywords-EN: highly favored machine, favored machine learning, Random forest, machine learning approach, classification problems
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 6 figures
Abstract:Random forest (RF) stands out as a highly favored machine learning approach for classification problems. The effectiveness of RF hinges on two key factors: the accuracy of individual trees and the diversity among them. In this study, we introduce a novel approach called heterogeneous RF (HRF), designed to enhance tree diversity in a meaningful way. This diversification is achieved by deliberately introducing heterogeneity during the tree construction. Specifically, features used for splitting near the root node of previous trees are assigned lower weights when constructing the feature sub-space of the subsequent trees. As a result, dominant features in the prior trees are less likely to be employed in the next iteration, leading to a more diverse set of splitting features at the nodes. Through simulation studies, it was confirmed that the HRF method effectively mitigates the selection bias of trees within the ensemble, increases the diversity of the ensemble, and demonstrates superior performance on datasets with fewer noise features. To assess the comparative performance of HRF against other widely adopted ensemble methods, we conducted tests on 52 datasets, comprising both real-world and synthetic data. HRF consistently outperformed other ensemble methods in terms of accuracy across the majority of datasets.
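An approximate sketch of the diversification mechanism: after each tree is fit, downweight the features it split on near the root, so later trees sample their feature sub-space away from previously dominant features. Weighted sampling of one feature subset per tree stands in for HRF's per-node weighting, and `root_depth` and `decay` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def near_root_features(tree, max_depth):
    """Local feature indices split on within `max_depth` levels of the root."""
    t, used, stack = tree.tree_, set(), [(0, 0)]
    while stack:
        node, d = stack.pop()
        if t.children_left[node] != -1 and d < max_depth:   # internal node
            used.add(t.feature[node])
            stack.append((t.children_left[node], d + 1))
            stack.append((t.children_right[node], d + 1))
    return used

def heterogeneous_forest(X, y, n_trees=50, root_depth=2, decay=0.5, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    w = np.ones(p)                                   # per-feature sampling weights
    m = max(1, int(np.sqrt(p)))                      # feature sub-space size
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(p, size=m, replace=False, p=w / w.sum())
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
        for f_local in near_root_features(tree, root_depth):
            w[feats[f_local]] *= decay               # penalize dominant features
        forest.append((feats, tree))
    return forest

def forest_predict(forest, X):
    votes = np.stack([t.predict(X[:, f]) for f, t in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)    # binary majority vote

X, y = make_classification(n_samples=300, n_features=12, random_state=0)
forest = heterogeneous_forest(X, y)
print((forest_predict(forest, X) == y).mean())
```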
[LG-54] Privacy-Computation trade-offs in Private Repetition and Metaselection
Link: https://arxiv.org/abs/2410.19012
Authors: Kunal Talwar
Keywords-EN: Private Repetition algorithm, Private Repetition, differentially private algorithm, Repetition algorithm, Private
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:
Abstract:A Private Repetition algorithm takes as input a differentially private algorithm with constant success probability and boosts it to one that succeeds with high probability. These algorithms are closely related to private metaselection algorithms that compete with the best of many private algorithms, and private hyperparameter tuning algorithms that compete with the best hyperparameter settings for a private learning algorithm. Existing algorithms for these tasks pay either a large overhead in privacy cost, or a large overhead in computational cost. In this work, we show strong lower bounds for problems of this kind, showing in particular that for any algorithm that preserves the privacy cost up to a constant factor, the failure probability can only fall polynomially in the computational overhead. This is in stark contrast with the non-private setting, where the failure probability falls exponentially in the computational overhead. By carefully combining existing algorithms for metaselection, we prove computation-privacy tradeoffs that nearly match our lower bounds.
[LG-55] Make LLMs better zero-shot reasoners: Structure-oriented autonomous reasoning
Link: https://arxiv.org/abs/2410.19000
Authors: Pengfei He, Zitao Li, Yue Xing, Yaling Li, Jiliang Tang, Bolin Ding
Keywords-EN: Large Language Models, Large Language, offer significant advantages, significant advantages including, advantages including great
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Zero-shot reasoning methods with Large Language Models (LLMs) offer significant advantages including great generalization to novel tasks and reduced dependency on human-crafted examples. However, the current zero-shot methods still have limitations in complex tasks, e.g., answering questions that require multi-step reasoning. In this paper, we address this limitation by introducing a novel structure-oriented analysis method to help LLMs better understand the question and guide the problem-solving process of LLMs. We first demonstrate how the existing reasoning strategies, Chain-of-Thought and ReAct, can benefit from our structure-oriented analysis. In addition to empirical investigations, we leverage the probabilistic graphical model to theoretically explain why our structure-oriented analysis can improve the LLM reasoning process. To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA), that can better enforce the reasoning process following our structure-oriented analysis by refinement techniques and is equipped with external knowledge retrieval capability to reduce factual errors. Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods. Finally, the system not only improves reasoning accuracy in complex tasks but also demonstrates robustness against potential attacks that corrupt the reasoning process.
[LG-56] cymyc – Calabi-Yau Metrics, Yukawas and Curvature
Link: https://arxiv.org/abs/2410.19728
Authors: Per Berglund, Giorgi Butbaia, Tristan Hübsch, Vishnu Jejjala, Challenger Mishra, Damián Mayorga Peña, Justin Tan
Keywords-EN: high-performance Python library, string compactification manifolds, high-performance Python, Python library, model tensor fields
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Comments: 35 pages, 12 figures
Abstract:We introduce cymyc, a high-performance Python library for numerical investigation of the geometry of a large class of string compactification manifolds and their associated moduli spaces. We develop a well-defined geometric ansatz to numerically model tensor fields of arbitrary degree on a large class of Calabi-Yau manifolds. cymyc includes a machine learning component which incorporates this ansatz to model tensor fields of interest on these spaces by finding an approximate solution to the system of partial differential equations they should satisfy.
[LG-57] On the Benefits of Active Data Collection in Operator Learning
Link: https://arxiv.org/abs/2410.19725
Authors: Unique Subedi,Ambuj Tewari
Keywords-EN: mean-zero stochastic process, continuous covariance kernels, data collection strategies, active data collection, data collection
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:We investigate active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. Thus, with sufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast error convergence rates can be achieved. This contrasts with passive (i.i.d.) data collection strategies, where the convergence rate is never faster than $\sim n^{-1}$. In fact, for our setting, we establish a non-vanishing lower bound for any passive data collection strategy, regardless of the eigenvalue decay rate of the covariance kernel. Overall, our results show the benefit of active over passive data collection strategies in operator learning.
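To make the contrast concrete, here is a minimal NumPy sketch of an active strategy in the spirit of the abstract: query the operator along the top eigenfunctions of the input covariance kernel and reconstruct it on their span. The Brownian-motion kernel, the target operator, and all sizes are our own illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Illustrative sketch only: actively learn a linear operator T on [0, 1] by
# querying the top eigenfunctions of a Brownian-motion covariance kernel
# k(s, t) = min(s, t). The operator and all sizes are hypothetical choices.
rng = np.random.default_rng(0)
m = 200                                    # grid size for discretized functions
t = np.linspace(0, 1, m)

def phi(j):
    # Eigenfunctions of min(s, t): sqrt(2) * sin((j - 1/2) * pi * t)
    return np.sqrt(2.0) * np.sin((j - 0.5) * np.pi * t)

# Ground-truth operator for the demo: smoothing by a Gaussian kernel matrix
K = np.exp(-30.0 * (t[:, None] - t[None, :]) ** 2) / m
apply_T = lambda f: K @ f

# Active data collection: query the first n eigenfunctions, record noisy outputs
n = 20
inputs = np.stack([phi(j) for j in range(1, n + 1)])                   # (n, m)
outputs = np.stack([apply_T(f) + 1e-3 * rng.standard_normal(m) for f in inputs])

def estimate_T(f):
    # Expand f in the queried basis and recombine the recorded outputs
    coeffs = inputs @ f / m                # quadrature for <f, phi_j>
    return coeffs @ outputs

f_test = np.sin(2 * np.pi * t)
err = np.linalg.norm(estimate_T(f_test) - apply_T(f_test)) / np.linalg.norm(apply_T(f_test))
print(f"relative error with {n} active queries: {err:.2e}")
```

Faster eigenvalue decay means the first few queries carry almost all of the input process's energy, which is where the arbitrarily fast rates in the abstract come from.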
[LG-58] Multi-view biomedical foundation models for molecule-target and property prediction
Link: https://arxiv.org/abs/2410.19704
Authors: Parthasarathy Suryanarayanan,Yunguang Qiu,Shreyans Sethi,Diwakar Mahajan,Hongyang Li,Yuxin Yang,Elif Eyigoz,Aldo Guzman Saenz,Daniel E. Platt,Timothy H. Rumbell,Kenney Ng,Sanjoy Dey,Myson Burch,Bum Chul Kwon,Pablo Meyer,Feixiong Cheng,Jianying Hu,Joseph A. Morrone
Keywords-EN: accelerate drug discovery, bio-molecular space hold, space hold promise, drug discovery, applied to bio-molecular
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Comments: 34 pages including supplement. 9 figures, 4 tables
Click to view abstract
Abstract:Foundation models applied to bio-molecular space hold promise to accelerate drug discovery. Molecular representation is key to building such models. Previous works have typically focused on a single representation or view of the molecules. Here, we develop a multi-view foundation model approach, that integrates molecular views of graph, image and text. Single-view foundation models are each pre-trained on a dataset of up to 200M molecules and then aggregated into combined representations. Our multi-view model is validated on a diverse set of 18 tasks, encompassing ligand-protein binding, molecular solubility, metabolism and toxicity. We show that the multi-view models perform robustly and are able to balance the strengths and weaknesses of specific views. We then apply this model to screen compounds against a large (100 targets) set of G Protein-Coupled receptors (GPCRs). From this library of targets, we identify 33 that are related to Alzheimer’s disease. On this subset, we employ our model to identify strong binders, which are validated through structure-based modeling and identification of key binding motifs.
[LG-59] Electromechanical Dynamics of the Heart: A Study of Cardiac Hysteresis During Physical Stress Test
Link: https://arxiv.org/abs/2410.19667
Authors: Sajjad Karimi,Shirin Karimi,Amit J. Shah,Gari D. Clifford,Reza Sameni
Keywords-EN: Cardiovascular diseases, diagnosed using multiple, multiple modalities, modalities that assess, Cardiovascular
Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
*Comments:
Click to view abstract
Abstract:Cardiovascular diseases are best diagnosed using multiple modalities that assess both the heart's electrical and mechanical functions. While effective, imaging techniques like echocardiography and nuclear imaging are costly and not widely accessible. More affordable technologies, such as simultaneous electrocardiography (ECG) and phonocardiography (PCG), may provide valuable insights into electromechanical coupling and could be useful for prescreening in low-resource settings. Using physical stress test data from the EPHNOGRAM ECG-PCG dataset, collected from 23 healthy male subjects (age: 25.4±1.9 yrs), we investigated electromechanical intervals (RR, QT, systolic, and diastolic) and their interactions during exercise, along with hysteresis between cardiac electrical activity and mechanical responses. Time delay analysis revealed distinct temporal relationships between QT, systolic, and diastolic intervals, with RR as the primary driver. The diastolic interval showed near-synchrony with RR, while QT responded to RR interval changes with an average delay of 10.5 s, and the systolic interval responded more slowly, with an average delay of 28.3 s. We examined QT-RR, systolic-RR, and diastolic-RR hysteresis, finding narrower loops for diastolic-RR and wider loops for systolic-RR. Significant correlations (average: 0.75) were found between heart rate changes and hysteresis loop areas, suggesting the equivalent circular area diameter as a promising biomarker for cardiac function under exercise stress. Deep learning models, including Long Short-Term Memory and Convolutional Neural Networks, estimated the QT, systolic, and diastolic intervals from RR data, confirming the nonlinear relationship between RR and other intervals. Findings highlight a significant cardiac memory effect, linking ECG and PCG morphology and timing to heart rate history.
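The hysteresis-loop quantities in this abstract are easy to picture with a toy computation. The sketch below (entirely synthetic data, our own construction) traces a lagged QT-RR loop and computes its area with the shoelace formula, plus the equivalent circular area diameter the authors mention as a candidate biomarker.

```python
import numpy as np

# Synthetic illustration of a QT-RR hysteresis loop: QT lags RR during an
# exercise ramp, so the trajectory in the (RR, QT) plane encloses an area.
def loop_area(x, y):
    # Shoelace formula for the area enclosed by a closed polygonal curve
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

t = np.linspace(0, 2 * np.pi, 400)
rr = 0.8 + 0.2 * np.cos(t)                 # RR falls then recovers (seconds)
qt = 0.38 + 0.05 * np.cos(t - 0.4)         # QT lags RR, opening the loop (s)

area = loop_area(rr, qt)
print(f"QT-RR loop area: {area:.4f} s^2")
print(f"equivalent circular area diameter: {2 * np.sqrt(area / np.pi):.4f} s")
```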
[LG-60] Improving Stochastic Cubic Newton with Momentum
Link: https://arxiv.org/abs/2410.19644
Authors: El Mahdi Chayti,Nikita Doikov,Martin Jaggi
Keywords-EN: stochastic, solving general non-convex, general non-convex optimization, study stochastic second-order, stochastic second-order methods
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:We study stochastic second-order methods for solving general non-convex optimization problems. We propose using a special version of momentum to stabilize the stochastic gradient and Hessian estimates in Newton’s method. We show that momentum provably improves the variance of stochastic estimates and allows the method to converge for any noise level. Using the cubic regularization technique, we prove a global convergence rate for our method on general non-convex problems to a second-order stationary point, even when using only a single stochastic data sample per iteration. This starkly contrasts with all existing stochastic second-order methods for non-convex problems, which typically require large batches. Therefore, we are the first to demonstrate global convergence for batches of arbitrary size in the non-convex case for the Stochastic Cubic Newton. Additionally, we show improved speed on convex stochastic problems for our regularized Newton methods with momentum.
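As a rough illustration of the recipe, that is, momentum-averaged gradient and Hessian estimates fed into a cubic-regularized Newton step, here is a toy least-squares sketch. The problem, the momentum constant, and the use of a generic solver for the cubic subproblem are our own simplifications, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of momentum-stabilized stochastic cubic Newton (our reading of
# the idea). Note: one data sample per iteration, as in the abstract.
rng = np.random.default_rng(1)
d, M, alpha = 5, 10.0, 0.1                 # dim, cubic constant, momentum weight

A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]        # per-sample gradient
hess = lambda x, i: np.outer(A[i], A[i])            # per-sample Hessian

x = np.zeros(d)
g, H = grad(x, 0), hess(x, 0)
for _ in range(200):
    i = rng.integers(50)                            # single stochastic sample
    g = (1 - alpha) * g + alpha * grad(x, i)        # momentum on the gradient
    H = (1 - alpha) * H + alpha * hess(x, i)        # momentum on the Hessian
    cubic_model = lambda s: g @ s + 0.5 * s @ H @ s + (M / 6) * np.linalg.norm(s) ** 3
    x = x + minimize(cubic_model, np.zeros(d)).x    # cubic-regularized step
print("final mean squared residual:", np.mean((A @ x - b) ** 2))
```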
[LG-61] Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation
Link: https://arxiv.org/abs/2410.19595
Authors: Jakob Kienegger,Alina Mannanova,Timo Gerkmann
Keywords-EN: simultaneous speakers alongside, speakers alongside noise, robustness and flexibility, noise and reverberation, popular choice
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*Comments: ©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Click to view abstract
Abstract:Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying amount of simultaneous speakers alongside noise and reverberation. Time-frequency masks and relative directions of the speakers regarding a fixed spatial grid can be used to estimate the beamformer’s parameters. To some degree, speaker-independence is achieved by ensuring a greater amount of spatial partitions than speech sources. In this work, we analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves considerable performance in both tasks compared to baseline encodings optimized for either localization or mask estimation. In the same setup, we demonstrate superiority for joint estimation of both quantities. Conclusively, we propose a universal approach which can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant in performance-critical scenarios.
[LG-62] Considerations for Distribution Shift Robustness of Diagnostic Models in Healthcare
Link: https://arxiv.org/abs/2410.19575
Authors: Arno Blaas,Adam Goliński,Andrew Miller,Luca Zappella,Jörn-Henrik Jacobsen,Christina Heinze-Deml
Keywords-EN: distribution shifts, context of diagnostic, causally upstream, diagnostic models, prediction target
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:We consider robustness to distribution shifts in the context of diagnostic models in healthcare, where the prediction target Y, e.g., the presence of a disease, is causally upstream of the observations X, e.g., a biomarker. Distribution shifts may occur, for instance, when the training data is collected in a domain with patients having particular demographic characteristics while the model is deployed on patients from a different demographic group. In the domain of applied ML for health, it is common to predict Y from X without considering further information about the patient. However, beyond the direct influence of the disease Y on biomarker X, a predictive model may learn to exploit confounding dependencies (or shortcuts) between X and Y that are unstable under certain distribution shifts. In this work, we highlight a data generating mechanism common to healthcare settings and discuss how recent theoretical results from the causality literature can be applied to build robust predictive models. We theoretically show why ignoring covariates as well as common invariant learning approaches will in general not yield robust predictors in the studied setting, while including certain covariates into the prediction model will. In an extensive simulation study, we showcase the robustness (or lack thereof) of different predictors under various data generating processes. Lastly, we analyze the performance of the different approaches using the PTB-XL dataset, a public dataset of annotated ECG recordings.
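The anti-causal setup described here is simple to simulate. In the hypothetical sketch below (our construction, not the paper's simulation study), a covariate W drives both the disease Y and the biomarker X; shifting W's distribution at deployment hurts the X-only predictor, while the predictor that also conditions on W stays accurate, matching the abstract's claim that including certain covariates yields robustness.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy anti-causal simulation: W -> Y and (Y, W) -> X. The conditional
# P(Y | X, W) is invariant across domains, P(Y | X) is not.
def sample(n, mu_w, rng):
    W = rng.normal(mu_w, 1.0, n)
    Y = (rng.random(n) < 1 / (1 + np.exp(-W))).astype(int)   # W -> Y
    X = 2.0 * Y + W + rng.normal(0, 1.0, n)                  # Y -> X, W -> X
    return X, W, Y

rng = np.random.default_rng(0)
Xtr, Wtr, Ytr = sample(20000, mu_w=0.0, rng=rng)   # training domain
Xte, Wte, Yte = sample(20000, mu_w=2.0, rng=rng)   # shifted deployment domain

m_x  = LogisticRegression().fit(Xtr[:, None], Ytr)
m_xw = LogisticRegression().fit(np.c_[Xtr, Wtr], Ytr)
print("accuracy, X only :", m_x.score(Xte[:, None], Yte))
print("accuracy, X and W:", m_xw.score(np.c_[Xte, Wte], Yte))
```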
[LG-63] How Critical is Site-Specific RAN Optimization? 5G Open-RAN Uplink Air Interface Performance Test and Optimization from Macro-Cell CIR Data
Link: https://arxiv.org/abs/2410.19565
Authors: Johnathan Corgan,Nitin Nair,Rajib Bhattacharjea,Wan Liu,Serhat Tadik,Tom Tsou,Timothy J. O’Shea
Keywords-EN: air interface, air interface optimization, air interface testing, air interface performance, delay line
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: Appears in the proceedings of the First Workshop on Research and Innovation in Testing and Integration for Open Radio Access Networks (RitiRAN)
Click to view abstract
Abstract:In this paper, we consider the importance of channel measurement data from specific sites and its impact on air interface optimization and test. Currently, a range of statistical channel models including 3GPP 38.901 tapped delay line (TDL), clustered delay line (CDL), urban microcells (UMi) and urban macrocells (UMa) type channels are widely used for air interface performance testing and simulation. However, there remains a gap in the realism of these models for air interface testing and optimization when compared with real-world, measurement-based channels. To address this gap, we compare the performance impacts of training neural receivers with 1) statistical 3GPP TDL models, and 2) measured macro-cell channel impulse response (CIR) data. We leverage our OmniPHY-5G neural receiver for NR PUSCH uplink simulation, with a training procedure that uses statistical TDL channel models for pre-training and fine-tuning based on measured site-specific MIMO CIR data. The proposed fine-tuning method achieves a 10% block error rate (BLER) at a 1.85 dB lower signal-to-noise ratio (SNR) compared to pre-training only on simulated TDL channels, illustrating the rough magnitude of the gap that site-specific training can close and giving a first answer to the question “how much can fine-tuning the RAN for site-specific channels help?”
[LG-64] Learned Reference-based Diffusion Sampling for multi-modal distributions
Link: https://arxiv.org/abs/2410.19449
Authors: Maxence Noble,Louis Grenioux,Marylou Gabrié,Alain Oliviero Durmus
Keywords-EN: utilizing score-based diffusion, approaches utilizing score-based, past few years, unnormalized densities, utilizing score-based
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*Comments: Under review
Click to view abstract
Abstract:Over the past few years, several approaches utilizing score-based diffusion have been proposed to sample from probability distributions, that is, without having access to exact samples and relying solely on evaluations of unnormalized densities. The resulting samplers approximate the time-reversal of a noising diffusion process, bridging the target distribution to an easy-to-sample base distribution. In practice, the performance of these methods heavily depends on key hyperparameters that require ground truth samples to be accurately tuned. Our work aims to highlight and address this fundamental issue, focusing in particular on multi-modal distributions, which pose significant challenges for existing sampling methods. Building on existing approaches, we introduce Learned Reference-based Diffusion Sampler (LRDS), a methodology specifically designed to leverage prior knowledge on the location of the target modes in order to bypass the obstacle of hyperparameter tuning. LRDS proceeds in two steps by (i) learning a reference diffusion model on samples located in high-density space regions and tailored for multimodality, and (ii) using this reference model to foster the training of a diffusion-based sampler. We experimentally demonstrate that LRDS best exploits prior knowledge on the target distribution compared to competing algorithms on a variety of challenging distributions.
[LG-65] On the Application of Deep Learning for Precise Indoor Positioning in 6G
Link: https://arxiv.org/abs/2410.19436
Authors: Sai Prasanth Kotturi,Anil Kumar Yerrapragada,Sai Prasad,Radha Krishna Ganti
Keywords-EN: Line of Sight, Accurate localization, Channel Impulse Response, Signal Received Power, Transmit Receive Points
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 6 Pages, 6 Figures
Click to view abstract
Abstract:Accurate localization in indoor environments is a challenge due to the Non Line of Sight (NLoS) nature of the signaling. In this paper, we explore the use of AI/ML techniques for positioning accuracy enhancement in Indoor Factory (InF) scenarios. The proposed neural network, which we term LocNet, is trained on measurements such as Channel Impulse Response (CIR) and Reference Signal Received Power (RSRP) from multiple Transmit Receive Points (TRPs). Simulation results show that when using measurements from 18 TRPs, LocNet achieves a 9 cm positioning accuracy at the 90th percentile. Additionally, we demonstrate that the same model generalizes effectively even when measurements from some TRPs randomly become unavailable. Lastly, we provide insights on the robustness of the trained model to the errors in ground truth labels used for training.
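The paper does not spell out LocNet's architecture here, but the interface it describes, per-TRP CIR and RSRP measurements in, a 2-D position out, can be sketched as follows. Every layer size below is a hypothetical placeholder, not the paper's design.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a LocNet-style regressor: per-TRP CIR magnitudes
# plus RSRP are flattened and mapped to a 2-D position. All sizes assumed.
n_trp, cir_len = 18, 64       # 18 TRPs as in the abstract; CIR length assumed

class LocNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_trp * (cir_len + 1), 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 2),                 # (x, y) position estimate
        )

    def forward(self, cir, rsrp):
        # cir: (B, n_trp, cir_len), rsrp: (B, n_trp)
        z = torch.cat([cir.flatten(1), rsrp], dim=1)
        return self.net(z)

model = LocNet()
pos = model(torch.randn(4, n_trp, cir_len), torch.randn(4, n_trp))
print(pos.shape)  # torch.Size([4, 2])
```

Masking out randomly unavailable TRPs at training time (e.g., zeroing their features) is one simple way to obtain the robustness to missing measurements that the abstract reports.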
[LG-66] Noise-Aware Differentially Private Variational Inference
Link: https://arxiv.org/abs/2410.19371
Authors: Talal Alrawajfeh,Joonas Jälkö,Antti Honkela
Keywords-EN: robust privacy guarantees, Differential privacy, downstream applications, robust privacy, privacy guarantees
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Differential privacy (DP) provides robust privacy guarantees for statistical inference, but this can lead to unreliable results and biases in downstream applications. While several noise-aware approaches have been proposed which integrate DP perturbation into the inference, they are limited to specific types of simple probabilistic models. In this work, we propose a novel method for noise-aware approximate Bayesian inference based on stochastic gradient variational inference which can also be applied to high-dimensional and non-conjugate models. We also propose a more accurate evaluation method for noise-aware posteriors. Empirically, our inference method has similar performance to existing methods in the domain where they are applicable. Outside this domain, we obtain accurate coverages on high-dimensional Bayesian linear regression and well-calibrated predictive probabilities on Bayesian logistic regression with the UCI Adult dataset.
[LG-67] Double Difference Earthquake Location with Graph Neural Networks
Link: https://arxiv.org/abs/2410.19323
Authors: Ian W. McBrearty,Gregory C. Beroza
Keywords-EN: catalog development workflows, Graph Double Difference, Double difference earthquake, Double difference, earthquake catalog development
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Double difference earthquake relocation is an essential component of many earthquake catalog development workflows. This technique produces high-resolution relative relocations between events by minimizing differential measurements of the arrival times of waves from nearby sources, which highlights the resolution of faults and improves interpretation of seismic activity. The inverse problem is typically solved iteratively using conjugate-gradient minimization, however the cost scales significantly with the total number of sources and stations considered. Here we propose a Graph Neural Network (GNN) based earthquake double-difference relocation framework, Graph Double Difference (GraphDD), that is trained to minimize the double-difference residuals of a catalog to locate earthquakes. Through batching and sampling the method can scale to arbitrarily large catalogs. Our architecture uses one graph to represent the stations, a second graph to represent the sources, and creates the Cartesian product graph between the two graphs to capture the relationships between the stations and sources (e.g., the residuals and travel time partial derivatives). This key feature allows a natural architecture that can be used to minimize the double-difference residuals. We implement our model on several distinct test cases including seismicity from northern California, Turkiye, and northern Chile, which have highly variable data quality, and station and source distributions. We obtain high resolution relocations in these tests, and our model shows adaptability to variable types of loss functions and location objectives, including learning station corrections and mapping into the reference frame of a different catalog. Our results suggest that a GNN approach to double-difference relocation is a promising direction for scaling to very large catalogs and gaining new insights into the relocation problem.
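To ground the objective being minimized, here is a self-contained toy version of double-difference relocation with a constant-velocity travel-time model, optimized directly with autograd rather than the paper's GNN; all coordinates and the velocity are made up for the demo.

```python
import torch

# Toy double-difference relocation (our illustration; GraphDD itself uses a
# GNN and batching to scale). Travel time model: T(x, s) = ||x - s|| / v.
torch.manual_seed(0)
v = 5.0                                       # km/s, assumed constant velocity
stations = torch.rand(8, 2) * 50              # station coordinates (km)
true_src = torch.rand(6, 2) * 10 + 20         # true event locations (km)

tt = lambda x: torch.cdist(x, stations) / v   # (events, stations) travel times
obs = tt(true_src)                            # noise-free picks for the demo

x = (true_src + 3 * torch.randn_like(true_src)).requires_grad_()
opt = torch.optim.Adam([x], lr=0.05)
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]
for step in range(500):
    opt.zero_grad()
    pred = tt(x)
    # Double-difference residuals: differential times between event pairs
    loss = sum(((obs[i] - obs[j]) - (pred[i] - pred[j])).pow(2).sum()
               for i, j in pairs)
    loss.backward()
    opt.step()
print("mean location error (km):", (x - true_src).norm(dim=1).mean().item())
```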
[LG-68] Fully First-Order Methods for Decentralized Bilevel Optimization
Link: https://arxiv.org/abs/2410.19319
Authors: Xiaoyu Wang,Xuxing Chen,Shiqian Ma,Tong Zhang
Keywords-EN: stochastic bilevel optimization, decentralized stochastic bilevel, Decentralized Stochastic Gradient, Stochastic Gradient Descent, propose Decentralized Stochastic
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 46 pages
Click to view abstract
Abstract:This paper focuses on decentralized stochastic bilevel optimization (DSBO) where agents only communicate with their neighbors. We propose Decentralized Stochastic Gradient Descent and Ascent with Gradient Tracking (DSGDA-GT), a novel algorithm that only requires first-order oracles, which are much cheaper than the second-order oracles widely adopted in existing works. We further provide a finite-time convergence analysis showing that for $n$ agents collaboratively solving the DSBO problem, the sample complexity of finding an $\epsilon$-stationary point with our algorithm is $\mathcal{O}(n^{-1}\epsilon^{-7})$, which matches the currently best-known result for the single-agent counterpart with linear speedup. The numerical experiments demonstrate both the communication and training efficiency of our algorithm.
[LG-69] Reinforcement Learning the Chromatic Symmetric Function
Link: https://arxiv.org/abs/2410.19189
Authors: Gergely Bérczi,Jonas Klüver
Keywords-EN: chromatic symmetric function, conjectural counting formula, propose a conjectural, chromatic symmetric, symmetric function
Subjects: Combinatorics (math.CO); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:We propose a conjectural counting formula for the coefficients of the chromatic symmetric function of unit interval graphs using reinforcement learning. The formula counts specific disjoint cycle-tuples in the graphs, referred to as Eschers, which satisfy certain concatenation conditions. These conditions are identified by a reinforcement learning model and are independent of the particular unit interval graph, resulting in a universal counting expression.
[LG-70] Cross Spline Net and a Unified World
Link: https://arxiv.org/abs/2410.19154
Authors: Linwei Hu,Ye Jin Choi,Vijayan N. Nair
Keywords-EN: today machine learning, machine learning world, good model performance, fully connected neural, popular methods due
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Click to view abstract
Abstract:In today’s machine learning world for tabular data, XGBoost and fully connected neural networks (FCNN) are the two most popular methods due to their good model performance and convenience of use. However, they are highly complicated, hard to interpret, and prone to overfitting. In this paper, we propose a new modeling framework called cross spline net (CSN) that is based on a combination of spline transformation and cross-network (Wang et al. 2017, 2021). We show that CSN is as performant and convenient to use as these methods, while being less complicated, more interpretable, and more robust. Moreover, the CSN framework is flexible, as the spline layer can be configured differently to yield different models. With different choices of the spline layer, we can reproduce or approximate a set of non-neural network models, including linear and spline-based statistical models, trees, rule-fit, tree ensembles (gradient boosting trees, random forests), oblique trees/forests, multivariate adaptive regression splines (MARS), SVM with polynomial kernel, etc. Therefore, CSN provides a unified modeling framework that puts the above set of non-neural network models under the same neural network framework. By using the scalable and powerful gradient descent algorithms available in neural network libraries, CSN avoids some pitfalls (such as being ad-hoc, greedy, or non-scalable) of the case-specific optimization methods used in the above non-neural network models. We use a special type of CSN, TreeNet, to illustrate our point, and compare TreeNet with XGBoost and FCNN to show its benefits. We believe CSN will provide a flexible and convenient framework for practitioners to build performant, robust and more interpretable models.
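A minimal sketch of the CSN building blocks as we read them: a spline layer that expands each raw feature into truncated-ReLU basis functions, followed by DCN-style cross layers (Wang et al. 2017). Layer sizes and the knot grid are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Sketch of the CSN idea: per-feature spline expansion, then cross layers.
class ReLUSplineLayer(nn.Module):
    """Expands each feature into truncated-ReLU basis functions max(x - k, 0)."""
    def __init__(self, knots):
        super().__init__()
        self.register_buffer("knots", knots)             # (n_knots,)

    def forward(self, x):                                 # x: (B, n_features)
        z = torch.relu(x.unsqueeze(-1) - self.knots)      # (B, F, n_knots)
        return z.flatten(1)                               # (B, F * n_knots)

class CrossLayer(nn.Module):
    """DCN cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.lin(xl) + xl

n_feat, knots = 10, torch.linspace(-2, 2, 8)
spline = ReLUSplineLayer(knots)
dim = n_feat * len(knots)
cross1, cross2, head = CrossLayer(dim), CrossLayer(dim), nn.Linear(dim, 1)

x = torch.randn(32, n_feat)
z0 = spline(x)
out = head(cross2(z0, cross1(z0, z0)))
print(out.shape)   # torch.Size([32, 1])
```

Swapping the spline layer for, say, indicator-style bases is the kind of reconfiguration that lets the same skeleton mimic trees or MARS, as the abstract argues.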
[LG-71] Functional Brain Network Identification in Opioid Use Disorder Using Machine Learning Analysis of Resting-State fMRI BOLD Signals
Link: https://arxiv.org/abs/2410.19147
Authors: Ahmed Temtam,Megan A. Witherow,Liangsuo Ma,M. Shibly Sadique,F. Gerard Moeller,Khan M. Iftekharuddin
Keywords-EN: magnetic resonance imaging, improve patient outcomes, inform treatment strategies, Understanding the neurobiology, rs-fMRI BOLD features
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*Comments: 20 pages, 7 figures, 8 tables
Click to view abstract
Abstract:Understanding the neurobiology of opioid use disorder (OUD) using resting-state functional magnetic resonance imaging (rs-fMRI) may help inform treatment strategies to improve patient outcomes. Recent literature suggests temporal characteristics of rs-fMRI blood oxygenation level-dependent (BOLD) signals may offer complementary information to functional connectivity analysis. However, existing studies of OUD analyze BOLD signals using measures computed across all time points. This study, for the first time in the literature, employs data-driven machine learning (ML) modeling of rs-fMRI BOLD features representing multiple time points to identify region(s) of interest that differentiate OUD subjects from healthy controls (HC). Following the triple network model, we obtain rs-fMRI BOLD features from the default mode network (DMN), salience network (SN), and executive control network (ECN) for 31 OUD and 45 HC subjects. Then, we use the Boruta ML algorithm to identify statistically significant BOLD features that differentiate OUD from HC, identifying the DMN as the most salient functional network for OUD. Furthermore, we conduct brain activity mapping, showing heightened neural activity within the DMN for OUD. We perform 5-fold cross-validation classification (OUD vs. HC) experiments to study the discriminative power of functional network features with and without fusing demographic features. The DMN shows the most discriminative power, achieving mean AUC and F1 scores of 80.91% and 73.97%, respectively, when fusing BOLD and demographic features. Follow-up Boruta analysis using BOLD features extracted from the medial prefrontal cortex, posterior cingulate cortex, and left and right temporoparietal junctions reveals significant features for all four functional hubs within the DMN.
[LG-72] Maximum a Posteriori Inference for Factor Graphs via Benders Decomposition
Link: https://arxiv.org/abs/2410.19131
Authors: Harsh Vardhan Dubey,Ji Ah Lee,Patrick Flaherty
Keywords-EN: MAP, linear programming problem, MAP assignment, Bayesian statistical inference, MAP inference
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Many Bayesian statistical inference problems come down to computing a maximum a-posteriori (MAP) assignment of latent variables. Yet, standard methods for estimating the MAP assignment do not have a finite-time guarantee that the algorithm has converged to a fixed point. Previous research has found that MAP inference can be represented in dual form as a linear programming problem with a non-polynomial number of constraints. A Lagrangian relaxation of the dual yields a statistical inference algorithm as a linear programming problem. However, the decision as to which constraints to remove in the relaxation is often heuristic. We present a method for maximum a-posteriori inference in general Bayesian factor models that sequentially adds constraints to the fully relaxed dual problem using Benders’ decomposition. Our method enables the incorporation of expressive integer and logical constraints in clustering problems such as must-link, cannot-link, and a minimum number of whole samples allocated to each cluster. Using this approach, we derive MAP estimation algorithms for the Bayesian Gaussian mixture model and latent Dirichlet allocation. Empirical results show that our method produces a higher optimal posterior value compared to Gibbs sampling and variational Bayes methods for standard data sets and provides a certificate of convergence.
[LG-73] A spectral method for multi-view subspace learning using the product of projections
Link: https://arxiv.org/abs/2410.19125
Authors: Renat Sergazinov,Armeen Taeb,Irina Gaynanova
Keywords-EN: multimodal sensor data, Multi-view data, set of observations, complementary information, multimodal sensor
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
*Comments: 23 pages, 7 figures
Click to view abstract
Abstract:Multi-view data provides complementary information on the same set of observations, with multi-omics and multimodal sensor data being common examples. Analyzing such data typically requires distinguishing between shared (joint) and unique (individual) signal subspaces from noisy, high-dimensional measurements. Despite many proposed methods, the conditions for reliably identifying joint and individual subspaces remain unclear. We rigorously quantify these conditions, which depend on the ratio of the signal rank to the ambient dimension, principal angles between true subspaces, and noise levels. Our approach characterizes how spectrum perturbations of the product of projection matrices, derived from each view’s estimated subspaces, affect subspace separation. Using these insights, we provide an easy-to-use and scalable estimation algorithm. In particular, we employ rotational bootstrap and random matrix theory to partition the observed spectrum into joint, individual, and noise subspaces. Diagnostic plots visualize this partitioning, providing practical and interpretable insights into the estimation performance. In simulations, our method estimates joint and individual subspaces more accurately than existing approaches. Applications to multi-omics data from colorectal cancer patients and nutrigenomic study of mice demonstrate improved performance in downstream predictive tasks.
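The core computation, eigenvalues of the product of the two views' projection matrices, fits in a few lines. The sketch below (synthetic data, fixed ranks assumed known, and no rotational bootstrap) shows the spectrum separating joint directions, with eigenvalues near 1, from individual and noise directions.

```python
import numpy as np

# Sketch of the product-of-projections computation (our simplification of the
# paper's pipeline). Two views share a joint score subspace J.
rng = np.random.default_rng(0)
n, p, r_joint, r_indiv = 500, 100, 2, 3

J = rng.standard_normal((n, r_joint))               # shared scores across views
def make_view():
    I = rng.standard_normal((n, r_indiv))           # view-specific scores
    L = rng.standard_normal((r_joint + r_indiv, p))
    return np.c_[J, I] @ L + 0.5 * rng.standard_normal((n, p))

def proj(X, r):
    # Projection onto the top-r left singular subspace of the view
    U = np.linalg.svd(X, full_matrices=False)[0][:, :r]
    return U @ U.T

r = r_joint + r_indiv
P1, P2 = proj(make_view(), r), proj(make_view(), r)
evals = np.sort(np.linalg.eigvalsh(P1 @ P2 @ P1))[::-1]
print("top eigenvalues of P1 P2 P1:", np.round(evals[:6], 2))
# Eigenvalues near 1 flag joint directions; the paper's rotational bootstrap
# and random-matrix thresholds make this cut-off principled rather than eyeballed.
```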
[LG-74] Distributed Blind Source Separation based on FastICA
Link: https://arxiv.org/abs/2410.19112
Authors: Cem Ates Musluoglu,Alexander Bertrand
Keywords-EN: traditional signal processing, signal processing tasks, wireless sensor networks, centralized processing unit, processing unit
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments: 5 pages
Click to view abstract
Abstract:With the emergence of wireless sensor networks (WSNs), many traditional signal processing tasks are required to be computed in a distributed fashion, without transmissions of the raw data to a centralized processing unit, due to the limited energy and bandwidth resources available to the sensors. In this paper, we propose a distributed independent component analysis (ICA) algorithm, which aims at identifying the original signal sources based on observations of their mixtures measured at various sensor nodes. One of the most commonly used ICA algorithms is known as FastICA, which requires a spatial pre-whitening operation in the first step of the algorithm. Such a pre-whitening across all nodes of a WSN is impossible in a bandwidth-constrained distributed setting, as it requires correlating each channel with every other channel in the WSN. We show that an explicit network-wide pre-whitening step can be circumvented by leveraging the properties of the so-called Distributed Adaptive Signal Fusion (DASF) framework. Despite the lack of such a network-wide pre-whitening, we can still obtain the Q least Gaussian independent components of the centralized ICA solution, where Q scales linearly with the required communication load.
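For readers unfamiliar with the baseline, here is a compact centralized FastICA (the standard one-unit fixed-point iteration with a tanh nonlinearity), including the explicit pre-whitening step that the distributed DASF-based algorithm is designed to avoid. This is textbook material, not the paper's distributed code.

```python
import numpy as np

# Centralized FastICA baseline that the distributed algorithm matches.
rng = np.random.default_rng(0)
n = 10000
S = np.c_[np.sign(rng.standard_normal(n)), rng.uniform(-1, 1, n)].T  # 2 sources
X = np.array([[1.0, 0.5], [0.3, 1.0]]) @ S                            # mixtures

# Spatial pre-whitening -- exactly the step that is hard to do network-wide,
# since it needs every channel correlated with every other channel.
X = X - X.mean(axis=1, keepdims=True)
E, D, _ = np.linalg.svd(np.cov(X))
Z = np.diag(D ** -0.5) @ E.T @ X

w = rng.standard_normal(2)
for _ in range(100):
    # One-unit fixed point: w <- E[z g(w'z)] - E[g'(w'z)] w, with g = tanh
    w = (Z * np.tanh(w @ Z)).mean(axis=1) - (1 - np.tanh(w @ Z) ** 2).mean() * w
    w /= np.linalg.norm(w)

print("correlation of recovered component with each source:",
      np.round(np.corrcoef(w @ Z, S)[0, 1:], 2))
```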
[LG-75] Inherently Interpretable Tree Ensemble Learning
Link: https://arxiv.org/abs/2410.19098
Authors: Zebin Yang,Agus Sudjianto,Xiaoming Li,Aijun Zhang
Keywords-EN: gradient boosting machines, machine learning due, excellent predictive performance, boosting machines, random forests
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Tree ensemble models like random forests and gradient boosting machines are widely used in machine learning due to their excellent predictive performance. However, a high-performance ensemble consisting of a large number of decision trees lacks sufficient transparency and explainability. In this paper, we demonstrate that when shallow decision trees are used as base learners, the ensemble learning algorithms not only become inherently interpretable, admitting an equivalent representation as generalized additive models, but can also sometimes achieve better generalization performance. First, an interpretation algorithm is developed that converts the tree ensemble into the functional ANOVA representation with inherent interpretability. Second, two strategies are proposed to further enhance the model interpretability, i.e., by adding constraints in the model training stage and by post-hoc effect pruning. Experiments on simulations and real-world datasets show that our proposed methods offer a better trade-off between model interpretation and predictive performance, compared with counterpart benchmarks.
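The premise is easy to verify in the simplest case: boosting with depth-1 trees (stumps) is exactly an additive model, because each stump depends on a single feature. A small demo, entirely our own construction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Depth-1 boosting is a GAM: each stump splits on one feature, so the sum of
# stumps decomposes as f(x) = c + f_1(x_1) + f_2(x_2) + ...
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (2000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(2000)

gbm = GradientBoostingRegressor(max_depth=1, n_estimators=300).fit(X, y)

def component(j, grid):
    # Read off the additive component f_j by varying feature j, others fixed at 0
    base = np.zeros((len(grid), 3))
    base[:, j] = grid
    return gbm.predict(base)

grid = np.linspace(-2, 2, 5)
# Tracks sin(x_0) up to an additive constant (the other components at 0)
print("f_0 on grid:", np.round(component(0, grid), 2))
```

The paper's interpretation algorithm generalizes this idea: depth-k trees contribute interaction terms up to order k in the functional ANOVA expansion.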
[LG-76] A Generalized Framework for Multiscale State-Space Modeling with Nested Nonlinear Dynamics: An Application to Bayesian Learning under Switching Regimes
Link: https://arxiv.org/abs/2410.19074
Authors: Nayely Vélez-Cruz,Manfred D. Laubichler
Keywords-EN: incorporates nested nonlinear, multiscale state-space modeling, introduce a generalized, state-space modeling, modeling that incorporates
Subjects: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:
Click to view abstract
Abstract:In this work, we introduce a generalized framework for multiscale state-space modeling that incorporates nested nonlinear dynamics, with a specific focus on Bayesian learning under switching regimes. Our framework captures the complex interactions between fast and slow processes within systems, allowing for the analysis of how these dynamics influence each other across various temporal scales. We model these interactions through a hierarchical structure in which finer time-scale dynamics are nested within coarser ones, while facilitating feedback between the scales. To promote the practical application of our framework, we address the problem of identifying switching regimes and transient dynamics. In particular, we develop a Bayesian learning approach to estimate latent states and indicators corresponding to switching dynamics, enabling the model to adapt effectively to regime changes. We employ Sequential Monte Carlo, or particle filtering, for inference. We illustrate the utility of our framework through simulations. The results demonstrate that our Bayesian learning approach effectively tracks state transitions and achieves accurate identification of switching dynamics in multiscale systems.
[LG-77] Less Discriminatory Alternative and Interpretable XGBoost Framework for Binary Classification
Link: https://arxiv.org/abs/2410.19067
Authors: Andrew Pangia,Agus Sudjianto,Aijun Zhang,Taufiquar Khan
Keywords-EN: Financial Protection Bureau, complex machine learning, Consumer Financial Protection, machine learning, crucial concerns
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Fair lending practices and model interpretability are crucial concerns in the financial industry, especially given the increasing use of complex machine learning models. In response to the Consumer Financial Protection Bureau’s (CFPB) requirement to protect consumers against unlawful discrimination, we introduce LDA-XGB1, a novel less discriminatory alternative (LDA) machine learning model for fair and interpretable binary classification. LDA-XGB1 is developed through biobjective optimization that balances accuracy and fairness, with both objectives formulated using binning and information value. It leverages the predictive power and computational efficiency of XGBoost while ensuring inherent model interpretability, including the enforcement of monotonic constraints. We evaluate LDA-XGB1 on two datasets: SimuCredit, a simulated credit approval dataset, and COMPAS, a real-world recidivism prediction dataset. Our results demonstrate that LDA-XGB1 achieves an effective balance between predictive accuracy, fairness, and interpretability, often outperforming traditional fair lending models. This approach equips financial institutions with a powerful tool to meet regulatory requirements for fair lending while maintaining the advantages of advanced machine learning techniques.
[LG-78] DamFormer: Generalizing Morphologies in Dam Break Simulations Using Transformer Model
Link: https://arxiv.org/abs/2410.18998
Authors: Zhaoyang Mul,Aoming Liang,Mingming Ge,Dashuai Chen,Dixia Fan,Minyi Xu
Keywords-EN: dams breaking plays, tsunami disasters, dams breaking, breaking plays, plays a critical
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:The interaction of waves with structural barriers, as in dam breaks, plays a critical role in flood defense and tsunami disasters. In this work, we explore the dynamic changes in wave surfaces impacting various structural shapes, e.g., circle, triangle, and square, by using deep learning techniques. We introduce DamFormer, a novel transformer-based model designed to learn and simulate these complex interactions. The model was trained and tested on simulated data representing the three structural forms.
[LG-79] Deterministic Fokker-Planck Transport – With Applications to Sampling, Variational Inference, Kernel Mean Embeddings, Sequential Monte Carlo
Link: https://arxiv.org/abs/2410.18993
Authors: Ilja Klebanov
Keywords-EN: particle flow methods, probability flow ODE, Fokker-Planck equation, continuity equation, flow ODE offers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Methodology (stat.ME)
*Comments: 28 pages, 6 figures, 1 table
Click to view abstract
Abstract:The Fokker-Planck equation can be reformulated as a continuity equation, which naturally suggests using the associated velocity field in particle flow methods. While the resulting probability flow ODE offers appealing properties - such as defining a gradient flow of the Kullback-Leibler divergence between the current and target densities with respect to the 2-Wasserstein distance - it relies on evaluating the current probability density, which is intractable in most practical applications. By closely examining the drawbacks of approximating this density via kernel density estimation, we uncover opportunities to turn these limitations into advantages in contexts such as variational inference, kernel mean embeddings, and sequential Monte Carlo.
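For reference, the reformulation the abstract starts from is the following standard identity (written in our notation): the diffusion term of the Fokker-Planck equation can be absorbed into a transport term, at the price of a velocity field that depends on the unknown density.

```latex
% Standard derivation behind the reformulation mentioned above (our notation).
% Langevin Fokker--Planck equation for the density \rho_t:
%   \partial_t \rho_t = \nabla\!\cdot(\rho_t \nabla V) + \Delta \rho_t .
% Writing \Delta\rho_t = \nabla\!\cdot(\rho_t \nabla\log\rho_t) turns it into
% a continuity equation with an explicit velocity field:
\partial_t \rho_t + \nabla\!\cdot(\rho_t\, v_t) = 0,
\qquad
v_t(x) = -\nabla V(x) - \nabla \log \rho_t(x).
% Particles following the deterministic probability-flow ODE
%   \dot{x}_t = v_t(x_t)
% keep \rho_t as their marginal law. The catch is that v_t needs the current
% density \rho_t, which is exactly the intractable quantity discussed above.
```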
Information Retrieval
[IR-0] Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation
Link: https://arxiv.org/abs/2410.19276
Authors: Kangning Zhang,Jiarui Jin,Yingjie Qin,Ruilong Su,Jianghao Lin,Yong Yu,Weinan Zhang
Keywords-EN: Current multimodal recommendation, Current multimodal, multimodal, embeddings remains, ID-based Multimodal Recommender
Subjects: Information Retrieval (cs.IR)
*Comments: 11 pages, 6 figures
Click to view abstract
Abstract:Current multimodal recommendation models have extensively explored the effective utilization of multimodal information; however, their reliance on ID embeddings remains a performance bottleneck. Even with the assistance of multimodal information, optimizing ID embeddings remains challenging for ID-based multimodal recommenders when interaction data is sparse. Furthermore, the item-specific nature of ID embeddings hinders information exchange among related items, and the storage required for ID embeddings grows with the number of items. Based on these limitations, we propose an ID-free MultimOdal TOken Representation scheme named MOTOR that represents each item using learnable multimodal tokens and connects them through shared tokens. Specifically, we first employ product quantization to discretize each item’s multimodal features (e.g., images, text) into discrete token IDs. We then interpret the token embeddings corresponding to these token IDs as implicit item features, introducing a new Token Cross Network to capture the implicit interaction patterns among these tokens. The resulting representations can replace the original ID embeddings and transform the original ID-based multimodal recommender into an ID-free system, without introducing any additional loss design. MOTOR reduces the overall space requirements of these models, facilitating information interaction among related items, while also significantly enhancing the model’s recommendation capability. Extensive experiments on nine mainstream models demonstrate the significant performance improvement achieved by MOTOR, highlighting its effectiveness in enhancing multimodal recommendation systems.
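The tokenization step is concrete enough to sketch. Below, product quantization is approximated by independently k-means-clustering sub-vectors of each item's (here random) multimodal embedding; items that land in the same centroid share a token ID. Codebook sizes and dimensions are illustrative guesses, not MOTOR's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Product quantization of item embeddings into discrete token IDs (sketch).
rng = np.random.default_rng(0)
n_items, dim, n_sub, codebook = 1000, 64, 4, 32   # all sizes hypothetical
emb = rng.standard_normal((n_items, dim))         # stand-in for fused image+text features

sub_dim = dim // n_sub
token_ids = np.zeros((n_items, n_sub), dtype=int)
for s in range(n_sub):
    chunk = emb[:, s * sub_dim:(s + 1) * sub_dim]     # s-th sub-vector of every item
    km = KMeans(n_clusters=codebook, n_init=4, random_state=0).fit(chunk)
    token_ids[:, s] = km.labels_                      # shared centroid => shared token

print("item 0 is represented by tokens:", token_ids[0])
# Related items now overlap in token IDs, which is what lets a Token Cross
# Network exchange information between them without per-item ID embeddings.
```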
[IR-1] Sentiment-Driven Community Detection in a Network of Perfume Preferences
Link: https://arxiv.org/abs/2410.19177
Authors: Kamand Kalashi,Sajjad Saed,Babak Teimourpour
Keywords-EN: increasingly important, Community detection, user, perfumes, Network
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
*Comments:
Click to view abstract
Abstract:Network analysis is increasingly important across various fields, including the fragrance industry, where perfumes are represented as nodes and shared user preferences as edges in perfume networks. Community detection can uncover clusters of similar perfumes, providing insights into consumer preferences, enhancing recommendation systems, and informing targeted marketing strategies. This study aims to apply community detection techniques to group perfumes favored by users into relevant clusters for better recommendations. We constructed a bipartite network from user reviews on the Persian retail platform “Atrafshan,” with nodes representing users and perfumes, and edges formed by positive comments. This network was transformed into a Perfume Co-Preference Network, connecting perfumes liked by the same users. By applying community detection algorithms, we identified clusters based on shared preferences, enhancing our understanding of user sentiment in the fragrance market. To improve sentiment analysis, we integrated emojis and a user voting system for greater accuracy. Emojis, aligned with their Persian counterparts, captured the emotional tone of reviews, while user ratings for scent, longevity, and sillage refined sentiment classification. Edge weights were adjusted by combining adjacency values with user ratings in a 60:40 ratio, reflecting both connection strength and user preferences. These enhancements led to improved modularity of detected communities, resulting in more accurate perfume groupings. This research pioneers the use of community detection in perfume networks, offering new insights into consumer preferences. Our advancements in sentiment analysis and edge weight refinement provide actionable insights for optimizing product recommendations and marketing strategies in the fragrance industry.
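The pipeline described above maps directly onto standard graph tooling. A toy end-to-end version with NetworkX (made-up reviews, and without the paper's 60:40 rating-weighted edge adjustment) looks like this:

```python
import networkx as nx

# Toy version of the paper's pipeline: user-perfume bipartite graph from
# positive reviews -> weighted perfume projection -> community detection.
positive_reviews = [("u1", "pA"), ("u1", "pB"), ("u2", "pA"), ("u2", "pB"),
                    ("u3", "pC"), ("u3", "pD"), ("u4", "pC"), ("u4", "pD")]
B = nx.Graph(positive_reviews)
perfumes = {p for _, p in positive_reviews}

# Co-preference network: perfumes linked by the number of shared admirers.
# The paper additionally blends user ratings into these weights (60:40).
G = nx.bipartite.weighted_projected_graph(B, perfumes)
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print(communities)   # e.g. [{'pA', 'pB'}, {'pC', 'pD'}]
```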
Attachment Download
Click to download the full list of today's papers