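This post lists the latest papers retrieved from arXiv.org on 2024-11-05. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily paper data by scheduled email, leave your email address in the comments.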
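Overview (2024-11-05)
865 papers were updated today, including:
- Natural Language Processing: 110 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 266 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 206 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 352 papers (Machine Learning (cs.LG))
Natural Language Processing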
[NLP-0] Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages
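Quick read: This paper addresses the underperformance of multilingual large language models (LLMs) on non-Latin script languages. The key idea is to use phonemic transcriptions as a complementary signal that induces script-invariant representations. Combining phonemic signals with conventional orthographic scripts improves performance on both non-Latin and Latin script languages and is particularly effective at narrowing the gap between them. The paper also proposes a Mixed-ICL retrieval strategy that aggregates examples retrieved via phonemic and orthographic inputs; compared with randomized ICL retrieval, it improves Latin script languages by up to 12.6% and non-Latin script languages by up to 15.1%.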
Link: https://arxiv.org/abs/2411.02398
Authors: Hoang Nguyen,Khyati Mahajan,Vikas Yadav,Philip S. Yu,Masoud Hashemi,Rishabh Maheshwary
Keywords: contemporary LLM families, achieved remarkable benchmark, Multilingual LLMs, remarkable benchmark performance, LLM families
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Multilingual LLMs have achieved remarkable benchmark performance, but we find they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises from the fact that LLMs are pretrained with orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation leads to our significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
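The Mixed-ICL idea above (retrieve in-context examples through both the orthographic and the phonemic view, then aggregate) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Example` type, the token-overlap `jaccard` score, and the interleave-and-deduplicate aggregation are all assumptions made for the sketch, since the abstract does not specify the retriever or the aggregation rule.

```python
from typing import NamedTuple


class Example(NamedTuple):
    text: str      # orthographic form
    phonemes: str  # phonemic transcription, space-separated symbols
    label: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real retriever score."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_k(pool, query, view, k):
    """Rank the pool by similarity between the query and one view of each example."""
    return sorted(pool, key=lambda ex: jaccard(query, view(ex)), reverse=True)[:k]


def mixed_icl(pool, query_text, query_phonemes, k):
    """Interleave orthographic and phonemic retrievals, dropping duplicates (assumed rule)."""
    by_text = top_k(pool, query_text, lambda ex: ex.text, k)
    by_phon = top_k(pool, query_phonemes, lambda ex: ex.phonemes, k)
    mixed = []
    for pair in zip(by_text, by_phon):
        for ex in pair:
            if ex not in mixed:
                mixed.append(ex)
    return mixed[:k]


# Toy pool with placeholder tokens (not real transcriptions).
pool = [
    Example("a b c", "x y", "A"),
    Example("d e", "p q r", "B"),
    Example("f g", "m n", "C"),
]
demos = mixed_icl(pool, query_text="a b", query_phonemes="p q", k=2)
```

In this toy pool the orthographic view ranks example A first and the phonemic view ranks example B first, so the mixed set contains both. That mirrors the abstract's observation that the two views retrieve distinct examples.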
[NLP-1] Attacking Vision-Language Computer Agents via Pop-ups
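Quick read: This paper exposes an attack risk for autonomous agents built on large vision and language models (VLMs) that carry out everyday computer tasks. Using a set of carefully designed adversarial pop-ups, it shows that agents are easily distracted into clicking the pop-ups instead of performing the intended task. The core contribution is identifying and quantifying the attack: when the pop-ups are integrated into existing agent testing environments such as OSWorld and VisualWebArena, the attack success rate averages 86% and the task success rate drops by 47%. Basic defenses, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.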
Link: https://arxiv.org/abs/2411.02391
Authors: Yanzhe Zhang,Tao Yu,Diyi Yang
Keywords: Autonomous agents powered, operating desktop software, demonstrated significant potential, completing daily computer, daily computer tasks
Categories: Computation and Language (cs.CL)
Comments: 10 pages, preprint
Abstract:Autonomous agents powered by large vision and language models (VLM) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
[NLP-2] Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models
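Quick read: This paper addresses hallucinations, plausible-sounding but factually incorrect outputs, that large language models (LLMs) produce when generating scientific hypotheses. The key is KG-CoI (Knowledge Grounded Chain of Ideas), a system that integrates external structured knowledge from knowledge graphs (KGs), guides the LLM through a structured reasoning process, and organizes its output as a chain of ideas (CoI). KG-CoI also includes a KG-supported module for detecting hallucinations. Experiments on the authors' newly constructed hypothesis generation dataset show that KG-CoI improves the accuracy of LLM-generated hypotheses and reduces hallucination in their reasoning chains.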
Link: https://arxiv.org/abs/2411.02382
Authors: Guangzhi Xiong,Eric Xie,Amir Hassan Shariatmadari,Sikun Guo,Stefan Bekiranov,Aidong Zhang
Keywords: Large language models, natural language processing, demonstrated remarkable capabilities, Large language, complex problem-solving tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel research directions by analyzing existing knowledge. However, despite their potential, LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect. Such a problem presents significant challenges in scientific fields that demand rigorous accuracy and verifiability, potentially leading to erroneous or misleading conclusions. To overcome these challenges, we propose KG-CoI (Knowledge Grounded Chain of Ideas), a novel system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs (KGs). KG-CoI guides LLMs through a structured reasoning process, organizing their output as a chain of ideas (CoI), and includes a KG-supported module for the detection of hallucinations. With experiments on our newly constructed hypothesis generation dataset, we demonstrate that KG-CoI not only improves the accuracy of LLM-generated hypotheses but also reduces the hallucination in their reasoning chains, highlighting its effectiveness in advancing real-world scientific research.
[NLP-3] Can Large Language Models generalize analogy solving like people can?
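Quick read: This paper asks whether large language models (LLMs) can generalize analogy solving to new domains the way people do. The approach compares children, adults, and LLMs on letter-string analogies in three domains: the Latin alphabet, the Greek alphabet (near transfer), and a list of symbols (far transfer). Children and adults easily generalized to the unfamiliar domains, whereas the LLMs did not, which indicates that current LLMs still struggle with robust, human-like analogical transfer.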
Link: https://arxiv.org/abs/2411.02348
Authors: Claire E. Stevenson,Alexandra Pafford,Han L. J. van der Maas,Melanie Mitchell
Keywords: relational similarity, abstract rules, rules and relational, Abstract, transfer
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:When we solve an analogy we transfer information from a known context to a new one through abstract rules and relational similarity. In people, the ability to solve analogies such as “body : feet :: table : ?” emerges in childhood, and appears to transfer easily to other domains, such as the visual domain “( : ) :: : ?”. Recent research shows that large language models (LLMs) can solve various forms of analogies. However, can LLMs generalize analogy solving to new domains like people can? To investigate this, we had children, adults, and LLMs solve a series of letter-string analogies (e.g., a b : a c :: j k : ?) in the Latin alphabet, in a near transfer domain (Greek alphabet), and a far transfer domain (list of symbols). As expected, children and adults easily generalized their knowledge to unfamiliar domains, whereas LLMs did not. This key difference between human and AI performance is evidence that these LLMs still struggle with robust human-like analogical transfer.
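To make the transfer setup concrete, here is a small sketch of a rule-based solver for letter-string analogies of the form a b : a c :: j k : ?. It infers a position-wise shift within an ordered alphabet and reapplies it, so the same abstract rule carries over to any ordered symbol list. The solver, the truncated Greek alphabet, and the symbol ordering are illustrative assumptions; the paper's actual item sets are not given in the abstract.

```python
def solve_shift_analogy(alphabet, source, target, probe):
    """Solve 'source : target :: probe : ?' for position-wise shift rules.

    For each position, infer how many steps the symbol moved within the
    ordered alphabet, then apply the same shifts to the probe.
    No wraparound handling: a shift past the end raises IndexError.
    """
    index = {sym: i for i, sym in enumerate(alphabet)}
    shifts = [index[t] - index[s] for s, t in zip(source, target)]
    return [alphabet[index[p] + d] for p, d in zip(probe, shifts)]


latin = list("abcdefghijklmnopqrstuvwxyz")
greek = list("αβγδεζηθικλμ")               # first twelve letters only
symbols = ["*", "@", "%", "!", "#", "&"]   # hypothetical ordering for a far-transfer domain

latin_answer = solve_shift_analogy(latin, ["a", "b"], ["a", "c"], ["j", "k"])
greek_answer = solve_shift_analogy(greek, ["α", "β"], ["α", "γ"], ["ι", "κ"])
symbol_answer = solve_shift_analogy(symbols, ["*", "@"], ["*", "%"], ["!", "#"])
```

Because the rule is expressed over positions in an ordering rather than over specific letters, swapping the alphabet changes nothing in the solver. That is the kind of domain-independent generalization the study reports for children and adults but not for the tested LLMs.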
[NLP-4] Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
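Quick read: This paper addresses the poor performance of decoder-only Transformers on complex reasoning tasks, particularly arithmetic reasoning that requires multiple sequential operations. It identifies representation collapse in the model's intermediate layers as a key limiting factor and proposes Sequential Variance-Covariance Regularization (Seq-VCR), which increases the entropy of intermediate representations to prevent collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, the method significantly improves arithmetic reasoning. On 5×5 integer multiplication it reaches 99.5% exact-match accuracy, compared with 0% for models of the same size and 44% for GPT-4 with five-shot CoT prompting. It also achieves superior results on arithmetic expression and longest increasing subsequence (LIS) datasets, without requiring explicit CoT supervision.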
Link: https://arxiv.org/abs/2411.02344
Authors: Md Rifat Arefin,Gopeshh Subbaraj,Nicolas Gontier,Yann LeCun,Irina Rish,Ravid Shwartz-Ziv,Christopher Pal
Keywords: multiple sequential operations, Decoder-only Transformers, struggle with complex, requiring multiple sequential, sequential operations
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model’s intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves 99.5% exact match accuracy, outperforming models of the same size (which yield 0% accuracy) and GPT-4 with five-shot CoT prompting (44%). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.
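As a rough illustration of what a variance-covariance regularizer measures, the sketch below computes two penalty terms over a batch of representation vectors: a hinge on the per-dimension standard deviation (large when representations collapse to a point) and the squared off-diagonal covariance (large when dimensions are redundant). This follows the generic variance-covariance form; the exact Seq-VCR loss is not given in the abstract, so the function name, `gamma`, and `eps` are assumptions.

```python
import math


def variance_covariance_penalty(reps, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of vectors.

    reps: list of n >= 2 vectors, each of length d.
    variance_term: mean hinge loss max(0, gamma - std) over dimensions.
    covariance_term: sum of squared off-diagonal covariances, divided by d.
    """
    n, d = len(reps), len(reps[0])
    means = [sum(r[j] for r in reps) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in reps]
    # Unbiased covariance matrix of the d dimensions.
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
        for j in range(d)
    ]
    variance_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d
    covariance_term = sum(
        cov[j][k] ** 2 for j in range(d) for k in range(d) if j != k
    ) / d
    return variance_term, covariance_term


collapsed = [[0.5, 0.5]] * 4                      # every vector identical
spread = [[2, 0], [-2, 0], [0, 2], [0, -2]]       # varied, uncorrelated dimensions
redundant = [[1, 1], [-1, -1], [2, 2], [-2, -2]]  # two copies of one feature

collapsed_var, collapsed_cov = variance_covariance_penalty(collapsed)
spread_var, spread_cov = variance_covariance_penalty(spread)
redundant_var, redundant_cov = variance_covariance_penalty(redundant)
```

In training, terms like these would be added to the task loss for selected intermediate layers with some weighting; how Seq-VCR chooses layers and weights is left to the paper.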
[NLP-5] WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
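Quick read: Existing LLM web agents rely heavily on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. The paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework for training high-performance web agents with open LLMs. WebRL targets three challenges (scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning) with three components: 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts; 2) a robust outcome-supervised reward model (ORM); and 3) adaptive reinforcement learning strategies that ensure consistent improvement. On WebArena-Lite, WebRL raises the success rate of Llama-3.1-8B from 4.8% to 42.4% and of GLM-4-9B from 6.1% to 43%, surpassing GPT-4-Turbo (17.6%), GPT-4o (13.9%), and the previous state-of-the-art open-LLM web agent AutoWebGLM (18.2%).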
Link: https://arxiv.org/abs/2411.02337
Authors: Zehan Qi,Xiao Liu,Iat Long Iong,Hanyu Lai,Xueqiao Sun,Xinyue Yang,Jiadai Sun,Yu Yang,Shuntian Yao,Tianjie Zhang,Wei Xu,Jie Tang,Yuxiao Dong
Keywords: Large language models, shown remarkable potential, Large language, web agents, LLM web agents
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL’s effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
[NLP-6] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
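Quick read: This paper studies the quantitative relationship between activation sparsity in large language models (LLMs) and the factors that influence it. It proposes PPL-p% sparsity, a precise, performance-aware activation sparsity metric, and runs extensive experiments over activation function (SiLU vs. ReLU), amount of training data, architecture (width-depth ratio), and parameter scale. The findings: ReLU leverages additional training data to increase activation sparsity more effectively than SiLU; at a fixed parameter scale, a deeper architecture may be advantageous; and activation patterns are largely insensitive to parameter scale. These results inform efforts to make LLMs more efficient and interpretable.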
Link: https://arxiv.org/abs/2411.02335
Authors: Yuqi Luo,Chenyang Song,Xu Han,Yingfa Chen,Chaojun Xiao,Zhiyuan Liu,Maosong Sun
Keywords: large language models, substantial weakly-contributed elements, Activation sparsity, Activation, sparsity
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 23 pages, 13 figures, 6 tables
Abstract:Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
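The relation between activation ratio and sparsity ratio quoted above can be shown with a few lines of code. The sketch below counts activation outputs whose magnitude exceeds a cutoff and reports that fraction as the activation ratio, with the sparsity ratio as its complement. The fixed `threshold` argument is an assumption for illustration only; the abstract describes PPL-p% sparsity as performance-aware but does not spell out how the cutoff is chosen.

```python
def activation_ratio(activations, threshold):
    """Fraction of activation outputs whose magnitude exceeds the threshold.

    activations: list of rows (e.g. one row per token), each a list of floats.
    The sparsity ratio is 1 - activation_ratio.
    """
    total = 0
    kept = 0
    for row in activations:
        for value in row:
            total += 1
            if abs(value) > threshold:
                kept += 1
    return kept / total


# Toy activation outputs for two tokens (placeholder numbers).
toy = [
    [0.0, 0.5, -0.01, 2.0],
    [0.0, 0.0, 0.3, -0.02],
]
ratio = activation_ratio(toy, threshold=0.05)
sparsity = 1 - ratio
```

Here three of the eight values exceed the cutoff, so the activation ratio is 0.375 and the sparsity ratio is 0.625.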
[NLP-7] Evaluating Creative Short Story Generation in Humans and Large Language Models
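Quick read: This paper asks whether large language models (LLMs) show creativity in short story writing on par with average human writers. It uses a five-sentence creative story task, commonly employed in psychology to assess human creativity, and automatically evaluates model- and human-generated stories along several creativity dimensions, including novelty, surprise, and diversity. The finding is that although LLMs can generate stylistically complex stories, they tend to fall short of average human writers in creativity.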
Link: https://arxiv.org/abs/2411.02316
Authors: Mete Ismayilzada,Claire Stevenson,Lonneke van der Plas
Keywords: relying heavily, fundamental aspect, produce narratives, Storytelling, creativity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages
Abstract:Storytelling is a fundamental aspect of human communication, relying heavily on creativity to produce narratives that are novel, appropriate, and surprising. While large language models (LLMs) have recently demonstrated the ability to generate high-quality stories, their creative capabilities remain underexplored. Previous research has either focused on creativity tests requiring short responses or primarily compared model performance in story generation to that of professional writers. However, the question of whether LLMs exhibit creativity in writing short stories on par with the average human remains unanswered. In this work, we conduct a systematic analysis of creativity in short story generation across LLMs and everyday people. Using a five-sentence creative story task, commonly employed in psychology to assess human creativity, we automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, and diversity. Our findings reveal that while LLMs can generate stylistically complex stories, they tend to fall short in terms of creativity when compared to average human writers.
[NLP-8] MdEval: Massively Multilingual Code Debugging
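Quick read: This paper addresses the limits of existing programming benchmarks for assessing multilingual code debugging, which focus mainly on Python and offer little language diversity. It proposes the first massively multilingual debugging benchmark, with 3.6K test samples across 18 programming languages, covering automated program repair (APR), code review (CR), and bug identification (BI). It also introduces the debugging instruction corpus MDEVAL-INSTRUCT and trains a multilingual debugger, xDebugCoder, as a strong baseline for bugs across many languages (for example, "Missing Mut" in Rust and "Misused Macro Definition" in C). Experiments reveal a notable performance gap between open-source models and closed-source LLMs such as the GPT and Claude series, leaving substantial room for improvement in multilingual code debugging.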
Link: https://arxiv.org/abs/2411.02310
Authors: Shukai Liu,Linzheng Chai,Jian Yang,Jiajun Shi,He Zhu,Liran Wang,Ke Jin,Wei Zhang,Hualei Zhu,Shuyue Guo,Tao Sun,Jiaheng Liu,Yunlong Duan,Yu Hao,Liqun Yang,Guanglin Niu,Ge Zhang,Zhoujun Li
Keywords: made significant progress, buggy code snippet, Code large language, correct code based, made significant
Categories: Computation and Language (cs.CL)
Comments: 15 pages
Abstract:Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples of 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions (xDebugGen). Further, a multilingual debugger xDebugCoder trained on MDEVAL-INSTRUCT as a strong baseline specifically to handle the bugs of a wide range of programming languages (e.g. “Missing Mut” in language Rust and “Misused Macro Definition” in language C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.
[NLP-9] CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
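Quick read: This paper addresses the difficulty of deploying and evaluating AI agents in Customer Relationship Management (CRM) systems, given the lack of benchmarks that reflect real-world complexity. It introduces CRMArena, a benchmark that evaluates AI agents on realistic tasks grounded in professional work environments. CRMArena comprises nine customer service tasks across three personas (service agent, analyst, and manager) and includes 16 commonly used industrial objects plus latent variables to simulate realistic data distributions. State-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting and less than 55% even with function-calling abilities, which highlights the need for stronger function-calling and rule-following before deployment in real work environments. CRMArena is posed as an open challenge to the community: systems that reliably complete its tasks demonstrate direct business value in a popular work environment.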
Link: https://arxiv.org/abs/2411.02305
Authors: Kung-Hsiang Huang,Akshara Prabhakar,Sidharth Dhawan,Yixin Mao,Huan Wang,Silvio Savarese,Caiming Xiong,Philippe Laban,Chien-Sheng Wu
Keywords: Customer Relationship Management, Relationship Management, managing customer interactions, Customer Relationship, modern enterprises
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete tasks showcase direct business value in a popular work environment.
[NLP-10] The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units
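Quick read: This paper asks whether large language models (LLMs) contain language-selective units analogous to the human brain's language system, and whether those units play a causal role in language processing. Using the localization approach from neuroscience, it identifies language-selective units in LLMs and shows that ablating them, but not random units, causes drastic deficits on language tasks, establishing their causal role. It also tests whether the localization method extends to other cognitive domains: some LLMs contain specialized networks for reasoning and social capabilities, but models differ substantially. The findings provide functional and causal evidence for specialization in LLMs and highlight parallels with the brain's functional organization.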
Link: https://arxiv.org/abs/2411.02280
Authors: Badr AlKhamissi,Greta Tuckute,Antoine Bosselut,Martin Schrimpf
Keywords: exhibit remarkable capabilities, exhibit remarkable, linguistic in nature, language, Large language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint
Abstract:Large language models (LLMs) exhibit remarkable capabilities on not just language tasks, but also various tasks that are not linguistic in nature, such as logical reasoning and social inference. In the human brain, neuroscience has identified a core language system that selectively and causally supports language processing. We here ask whether similar specialization for language emerges in LLMs. We identify language-selective units within 18 popular LLMs, using the same localization approach that is used in neuroscience. We then establish the causal role of these units by demonstrating that ablating LLM language-selective units – but not random units – leads to drastic deficits in language tasks. Correspondingly, language-selective LLM units are more aligned to brain recordings from the human language system than random units. Finally, we investigate whether our localization method extends to other cognitive domains: while we find specialized networks in some LLMs for reasoning and social capabilities, there are substantial differences among models. These findings provide functional and causal evidence for specialization in large language models, and highlight parallels with the functional organization in the brain.
[NLP-11] Combining Induction and Transduction for Abstract Reasoning
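Quick read: When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples (induction) or to directly predict the new test output (transduction)? The paper studies this question on the ARC dataset, training neural models for both induction and transduction. Although the two kinds of model train on the same problems and share the same neural architecture, they solve very different problems, which shows that the two approaches are suited to different kinds of abstract reasoning task.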
Link: https://arxiv.org/abs/2411.02272
Authors: Wen-Ding Li,Keya Hu,Carter Larsen,Yuqing Wu,Simon Alford,Caleb Woo,Spencer M. Dunn,Hao Tang,Michelangelo Naim,Dat Nguyen,Wei-Long Zheng,Zenna Tavares,Yewen Pu,Kevin Ellis
Keywords: learning an input-output, input-output mapping, neural network, abstract reasoning tasks, directly predict
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs, e.g. using a neural network? We study this question on ARC, a highly diverse dataset of abstract reasoning tasks. We train neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). Our models are trained on synthetic data generated by prompting LLMs to produce Python code specifying a function to be inferred, plus a stochastic subroutine for generating inputs to that function. We find inductive and transductive models solve very different problems, despite training on the same problems, and despite sharing the same neural architecture.
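The distinction between induction and transduction can be illustrated with a toy sketch. Induction here searches a tiny, hypothetical space of candidate Python functions for one that explains every example, then applies it to the test input. Transduction here is a nearest-neighbour lookup standing in for a neural predictor that maps the test input straight to an output. Neither is the paper's model; the candidate set and the nearest-neighbour rule are assumptions chosen to keep the example self-contained.

```python
# Hypothetical hypothesis space of latent functions.
CANDIDATES = {
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "increment": lambda x: x + 1,
}


def induce(examples):
    """Induction: name of the first candidate consistent with every example, else None."""
    for name, fn in CANDIDATES.items():
        if all(fn(x) == y for x, y in examples):
            return name
    return None


def predict_by_induction(examples, test_input):
    """Apply the induced function to the test input; None if nothing fits."""
    name = induce(examples)
    return None if name is None else CANDIDATES[name](test_input)


def predict_by_transduction(examples, test_input):
    """Transduction stand-in: copy the output of the nearest training input."""
    _, nearest_output = min(examples, key=lambda ex: abs(ex[0] - test_input))
    return nearest_output


doubling = [(1, 2), (2, 4), (3, 6)]   # explained by "double"
constant = [(1, 5), (2, 5), (3, 5)]   # not explained by any candidate

induction_on_doubling = predict_by_induction(doubling, 10)
transduction_on_doubling = predict_by_transduction(doubling, 10)
induction_on_constant = predict_by_induction(constant, 2)
transduction_on_constant = predict_by_transduction(constant, 2)
```

On the doubling task induction extrapolates correctly to 20 while the lookup returns 6; on the constant task induction finds no explaining function while the lookup still answers 5. Even this toy pair succeeds on different problems, which is the flavour of the paper's finding about its neural models.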
[NLP-12] Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
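Quick read: This paper introduces Hunyuan-Large, a Transformer-based mixture-of-experts (MoE) model with 389 billion total parameters and 52 billion activated parameters that can handle up to 256K tokens. Key practices include: 1) large-scale synthetic data that is orders of magnitude larger than in previous literature; 2) a mixed expert routing strategy; 3) a key-value cache compression technique; and 4) an expert-specific learning rate strategy. The paper also investigates scaling laws and learning rate schedules for mixture-of-experts models, offering insights and guidance for future model development and optimization.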
Link: https://arxiv.org/abs/2411.02265
Authors: Xingwu Sun,Yanfeng Chen,Yiqing Huang,Ruobing Xie,Jiaqi Zhu,Kai Zhang,Shuaipeng Li,Zhen Yang,Jonny Han,Xiaobo Shu,Jiahao Bu,Zhongzhi Chen,Xuemeng Huang,Fengzong Lian,Saiyong Yang,Jianfeng Yan,Yuyuan Zeng,Xiaoqin Ren,Chao Yu,Lulu Wu,Yue Mao,Tao Yang,Suncong Zheng,Kan Wu,Dian Jiao,Jinbao Xue,Xipeng Zhang,Decheng Wu,Kai Liu,Dengpeng Wu,Guanghui Xu,Shaohua Chen,Shuang Chen,Xiao Feng,Yigeng Hong,Junqiang Zheng,Chengcheng Xu,Zongwei Li,Xiong Kuang,Jianglu Hu,Yiqi Chen,Yuchi Deng,Guiyang Li,Ao Liu,Chenchen Zhang,Shihui Hu,Zilong Zhao,Zifan Wu,Yao Ding,Weichao Wang,Han Liu,Roberts Wang,Hao Fei,Peijie She,Ze Zhao,Xun Cao,Hai Wang,Fusheng Xiang,Mengyuan Huang,Zhiyuan Xiong,Bin Hu,Xuebin Hou,Lei Jiang,Jiajia Wu,Yaping Deng,Yi Shen,Qian Wang,Weijie Liu,Jie Liu,Meng Chen,Liang Dong,Weiwen Jia,Hu Chen,Feifei Liu,Rui Yuan,Huilin Xu,Zhenxiang Yan,Tengfei Cao,Zhichao Hu,Xinhua Feng,Dong Du,Tinghao She,Yangyu Tao,Feng Zhang,Jianchen Zhu,Chengzhong Xu,Xirui Li,Chong Zha,Wen Ouyang,Yinben Xia,Xiang Li,Zekun He,Rongpeng Chen,Jiawei Song,Ruibin Chen,Fan Jiang,Chongqing Zhao,Bo Wang,Hao Gong,Rong Gan
Keywords: largest open-source Transformer-based, billion activation parameters, open-source Transformer-based mixture, open-source Transformer-based, billion parameters
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 Figures
Abstract:In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large’s superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practice of Hunyuan-Large include large-scale synthetic data that is orders larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidances for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.
摘要:本文介绍了一种名为 Hunyuan-Large 的模型,这是目前最大的开源基于 Transformer 的专家混合模型,总参数量达到 3890 亿,激活参数量为 520 亿,能够处理多达 256K 的 Token。我们对 Hunyuan-Large 在多种基准测试中的卓越表现进行了全面评估,包括语言理解和生成、逻辑推理、数学问题解决、编码、长上下文处理以及综合任务,其在这些测试中均优于 LLama3.1-70B,并且在与规模更大的 LLama3.1-405B 模型相比时,表现相当。Hunyuan-Large 的关键实践包括使用比以往文献中规模大几个数量级的合成数据、混合专家路由策略、键值缓存压缩技术以及专家特定的学习率策略。此外,我们还研究了专家混合模型的扩展规律和学习率调度,为未来的模型开发和优化提供了宝贵的见解和指导。Hunyuan-Large 的代码和检查点已发布,以促进未来的创新和应用。代码:此 https URL 模型:此 https URL 评论:17 页,4 图 主题:计算与语言(cs.CL);人工智能(cs.AI) 引用为:arXiv:2411.02265 [cs.CL] (或 arXiv:2411.02265v1 [cs.CL] 用于此版本) https://doi.org/10.48550/arXiv.2411.02265 了解更多 arXiv 发布的 DOI 通过 DataCite(待注册)
[NLP-13] Positive Experience Reflection for Agents in Interactive Text Environments NEURIPS2024
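[Quick read]: This paper addresses the problem that LLM-based agents using self-reflection in text-based games struggle once they are initially successful and become less effective with smaller LLMs. The key to the solution is SweetSour, a new method that incorporates positive experiences and managed memory to enrich the context available to the agent at decision time, improving its performance on tasks that demand complex reasoning and adaptability.
Link: https://arxiv.org/abs/2411.02223
Authors: Philip Lippmann,Matthijs T.J. Spaan,Jie Yang
Keywords: Intelligent agents designed, interactive environments face, environments face significant, face significant challenges, demands complex reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at NeurIPS 2024 Language Gamification workshop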
Abstract:Intelligent agents designed for interactive environments face significant challenges in text-based games, a domain that demands complex reasoning and adaptability. While agents based on large language models (LLMs) using self-reflection have shown promise, they struggle when initially successful and exhibit reduced effectiveness when using smaller LLMs. We introduce SweetSour, a novel approach that addresses these limitations in existing reflection methods by incorporating positive experiences and managed memory to enrich the context available to the agent at decision time. Our comprehensive analysis spans both closed- and open-source LLMs and demonstrates the effectiveness of SweetSour in improving agent performance, particularly in scenarios where previous approaches fall short.
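[NLP-14] The Role of DevOps in Enhancing Enterprise Software Delivery Success through R&D Efficiency and Source Code Management
[Quick read]: This paper examines how DevOps practices affect enterprise software delivery success, in particular how improving R&D efficiency and source code management (SCM) raises delivery success. The key finding is that DevOps significantly improves R&D productivity by fostering cross-functional collaboration, shortening development cycles, and improving software quality through effective SCM practices such as version control and continuous integration. SCM tools within DevOps also enable precise change tracking and reliable code maintenance, supporting faster, more robust software delivery. The paper also notes challenges that can hinder DevOps implementation, such as cultural resistance and tool integration issues.
Link: https://arxiv.org/abs/2411.02209
Authors: Jun Cui
Keywords: source code management, examines the impact, software delivery success, software delivery, SCM
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: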
Abstract:This study examines the impact of DevOps practices on enterprise software delivery success, focusing on enhancing R&D efficiency and source code management (SCM). Using a qualitative methodology, data were collected from case studies of large-scale enterprises implementing DevOps to explore how these practices streamline software development processes. Findings reveal that DevOps significantly improves R&D productivity by fostering cross-functional collaboration, reducing development cycle times, and enhancing software quality through effective SCM practices, such as version control and continuous integration. Additionally, SCM tools within DevOps enable precise change tracking and reliable code maintenance, further supporting faster, more robust software delivery. However, the study identifies challenges, including cultural resistance and tool integration issues, that can hinder DevOps implementation. This research contributes to the growing body of DevOps literature by highlighting the role of R&D efficiency and SCM as crucial factors for software delivery success. Future studies should investigate these factors across diverse industries to validate findings.
[NLP-15] Improving Steering Vectors by Targeting Sparse Autoencoder Features
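[Quick read]: This paper addresses the difficulty of anticipating the effects of steering vectors when they are used to control language models. The key to the solution is to use sparse autoencoders (SAEs) to measure the causal effect of a steering vector, and to build on that measurement an improved steering method, SAE-Targeted Steering (SAE-TS). SAE-TS finds steering vectors that target specific SAE features while minimizing unintended side effects, giving a better balance between steering effect and coherence across a range of tasks.
Link: https://arxiv.org/abs/2411.02193
Authors: Sviatoslav Chalnev,Matthew Siu,Arthur Conmy
Keywords: specific pre-defined properties, satisfy specific pre-defined, steering vectors, steering, model satisfy specific
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 main-text pages and 9 appendix pages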
Abstract:To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning, and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by almost all existing methods, such as CAA (Panickssery et al., 2024) or the direct use of SAE latents (Templeton et al., 2024). In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We use this method for measuring causal effects to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that overall, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
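The measurement idea in this entry, using SAE features to see what a steering vector actually changes, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the hand-written encoder weights, the function names `sae_encode` and `steering_effect`, and the two-dimensional activations are hypothetical. It is not the SAE-TS method itself, only the before/after comparison such a method could build on.

```python
def relu(values):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in values]


def sae_encode(x, w_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(W_enc @ x + b_enc)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    return relu(pre)


def steering_effect(activations, steer, w_enc, b_enc, scale=1.0):
    """Mean change in each SAE feature when scale * steer is added."""
    totals = [0.0] * len(w_enc)
    for x in activations:
        before = sae_encode(x, w_enc, b_enc)
        steered = [xi + scale * si for xi, si in zip(x, steer)]
        after = sae_encode(steered, w_enc, b_enc)
        for j in range(len(totals)):
            totals[j] += after[j] - before[j]
    return [t / len(activations) for t in totals]


# Hypothetical 3-feature SAE over 2-dimensional activations.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.0]
ACTS = [[1.0, 0.0], [0.0, 1.0]]

effect = steering_effect(ACTS, steer=[1.0, 0.0], w_enc=W_ENC, b_enc=B_ENC)
print(effect)
```

In this toy setup, steering along the first input direction raises feature 0 as intended but also raises feature 2, which is the kind of side effect a targeted method would try to minimize.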
[NLP-16] Grounding Emotional Descriptions to Electrovibration Haptic Signals
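[Quick read]: This paper addresses the problem of linking users' free-form language descriptions of haptic signals to the features of the signals themselves (i.e., language grounding). The key to the solution is a computational pipeline that uses natural language processing techniques (such as GPT-3.5 Turbo and word embedding methods) to extract sensory and emotional keywords from user descriptions and group them into semantic clusters (i.e., concepts), then links the keyword clusters to haptic signal features (e.g., pulse count) through correlation analysis. The pipeline demonstrates the viability of a computational approach to analyzing haptic experiences and lays the groundwork for a future predictive model of haptic experience.
Link: https://arxiv.org/abs/2411.02118
Authors: Guimin Hu,Zirui Zhao,Lukas Heilmann,Yasemin Vardar,Hasti Seifi
Keywords: Designing and displaying, sensory and emotional, attributes can improve, displaying haptic signals, Designing
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: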
Abstract:Designing and displaying haptic signals with sensory and emotional attributes can improve the user experience in various applications. Free-form user language provides rich sensory and emotional information for haptic design (e.g., "This signal feels smooth and exciting"), but little work exists on linking user descriptions to haptic signals (i.e., language grounding). To address this gap, we conducted a study where 12 users described the feel of 32 signals perceived on a surface haptics (i.e., electrovibration) display. We developed a computational pipeline using natural language processing (NLP) techniques, such as GPT-3.5 Turbo and word embedding methods, to extract sensory and emotional keywords and group them into semantic clusters (i.e., concepts). We linked the keyword clusters to haptic signal features (e.g., pulse count) using correlation analysis. The proposed pipeline demonstrates the viability of a computational approach to analyzing haptic experiences. We discuss our future plans for creating a predictive model of haptic experience.
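The last step of the pipeline, linking a keyword cluster to a signal feature by correlation analysis, might look like the sketch below. The abstract does not say which correlation coefficient is used, so Pearson's r is an assumption here, and the pulse counts and mention counts are made-up illustrative numbers, not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical data: one haptic feature per signal, and how often users
# described that signal with words from one keyword cluster.
pulse_count = [1, 2, 3, 4, 5]
cluster_mentions = [2, 4, 6, 8, 10]

r = pearson(pulse_count, cluster_mentions)
print(round(r, 3))
```

A coefficient near 1 or -1 would suggest that the concept tracks the feature; with only a handful of signals, such a value would still need a significance test before being read as a real link.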
[NLP-17] AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis
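[Quick read]: This paper addresses the problem of evaluating the importance of individual layers in large language models (LLMs), in particular understanding each layer's functional importance and contribution to model performance from the perspective of activation distributions. The key to the solution is the Activation Variance-Sparsity Score (AVSS), a new metric that combines normalized activation variance and sparsity to assess each layer's contribution. By identifying and removing approximately the lowest-scoring 25% of layers by AVSS, the authors retain over 90% of the original model's performance on tasks such as question answering, language modeling, and sentiment classification, indicating that these layers may be non-essential. The approach offers a systematic way to identify less critical layers and contributes to more efficient LLM architectures.
Link: https://arxiv.org/abs/2411.02117
Authors: Zichen Song,Yuxin Wu,Sitan Huang,Zhongfeng Kang
Keywords: area of research, optimization and interpretability, deep learning, active area, significant implications
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 1 figure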
Abstract:The evaluation of layer importance in deep learning has been an active area of research, with significant implications for model optimization and interpretability. Recently, large language models (LLMs) have gained prominence across various domains, yet limited studies have explored the functional importance and performance contributions of individual layers within LLMs, especially from the perspective of activation distribution. In this work, we propose the Activation Variance-Sparsity Score (AVSS), a novel metric combining normalized activation variance and sparsity to assess each layer’s contribution to model performance. By identifying and removing approximately the lowest 25% of layers based on AVSS, we achieve over 90% of original model performance across tasks such as question answering, language modeling, and sentiment classification, indicating that these layers may be non-essential. Our approach provides a systematic method for identifying less critical layers, contributing to efficient large language model architectures.
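A score-then-prune loop of the kind described above can be sketched as follows. The abstract does not give the AVSS formula, so the combination used here (a layer's share of total activation variance, multiplied by one minus its sparsity) is purely an assumption, as are the function names and the toy activations. Only the "remove the lowest 25%" step comes from the text.

```python
def variance(values):
    """Population variance of a list of activations."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def sparsity(values, tol=1e-6):
    """Fraction of activations that are (near) zero."""
    return sum(1 for v in values if abs(v) <= tol) / len(values)


def layer_scores(layer_activations):
    """Assumed score: normalized variance * (1 - sparsity). Not the paper's formula."""
    variances = [variance(a) for a in layer_activations]
    total = sum(variances) or 1.0
    return [(v / total) * (1.0 - sparsity(a))
            for v, a in zip(variances, layer_activations)]


def layers_to_prune(scores, fraction=0.25):
    """Indices of the lowest-scoring `fraction` of layers."""
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


# Hypothetical activations for a 4-layer model (one short vector per layer).
acts = [
    [0.0, 0.0, 0.0, 0.1],    # mostly zero, almost no variance
    [1.0, -1.0, 2.0, -2.0],
    [0.5, -0.5, 0.5, -0.5],
    [3.0, 0.0, -3.0, 1.0],
]
pruned = layers_to_prune(layer_scores(acts))
print(pruned)
```

With four layers and a 25% budget, exactly one layer is selected, and in this toy data it is the nearly silent first layer.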
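[NLP-18] Advancements and limitations of LLMs in replicating human color-word associations
[Quick read]: This paper evaluates how well large language models (LLMs) replicate human color-word associations. The key is a comparison of several generations of LLMs (from GPT-3 to GPT-4o) against human color-word association data collected from over 10,000 Japanese participants, covering 17 colors and words from eight categories. GPT-4o was the most accurate at predicting the top-voted word for each color and category, particularly with visual inputs rather than text-based color codes, yet its highest median performance was only about 50% (chance level is 10%), and performance varied significantly across word categories and colors, indicating that LLMs do not fully replicate human color-word associations. Color discrimination ability estimated from the association data nevertheless correlated highly with human color discrimination patterns, consistent with previous studies. The study highlights both the progress and the persistent limitations of LLMs, suggesting differences between humans and LLMs in the semantic memory structures that represent color-word associations.
Link: https://arxiv.org/abs/2411.02116
Authors: Makoto Fukushima,Shusuke Eshita,Hiroshige Fukuhara
Keywords: Large Language Models, Color-word associations, human color-word associations, Color-word associations play, design applications
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments: 20 pages, 7 figures, 3 tables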
Abstract:Color-word associations play a fundamental role in human cognition and design applications. Large Language Models (LLMs) have become widely available and demonstrated intelligent behaviors in various benchmarks with natural conversation skills. However, their ability to replicate human color-word associations remains understudied. We compared multiple generations of LLMs (from GPT-3 to GPT-4o) against human color-word associations using data collected from over 10,000 Japanese participants, involving 17 colors and words from eight categories in Japanese. Our findings reveal a clear progression in LLM performance across generations, with GPT-4o achieving the highest accuracy in predicting the best voted word for each color and category, particularly when using visual inputs rather than text-based color codes. However, the highest median performance was approximately 50% even for GPT-4o with visual inputs (chance level is 10%), and the performance levels varied significantly across word categories and colors, indicating a failure to fully replicate human color-word associations. On the other hand, color discrimination ability estimated from our color-word association data showed that LLMs demonstrated high correlation with human color discrimination patterns, similarly to previous studies. Our study highlights both the advancements in LLM capabilities and their persistent limitations, suggesting differences in semantic memory structures between humans and LLMs in representing color-word associations.
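[NLP-19] Regress, Don't Guess – A Regression-like Loss on Number Tokens for Language Models NEURIPS2024
[Quick read]: This paper addresses the poor performance of language models on tasks that involve reasoning over quantities, especially arithmetic. The key to the solution is two new number token losses that supplement the standard cross-entropy (CE) loss. The first is based on an L_p loss between the ground-truth token value and the weighted sum of the predicted class probabilities; the second minimizes the Wasserstein-1 distance between the predicted output probability distribution and the ground-truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training, significantly improving numerical accuracy.
Link: https://arxiv.org/abs/2411.02083
Authors: Jonas Zausinger,Lars Pennig,Kacper Chlodny,Vincent Limbach,Anna Ketteler,Thorben Prein,Vishwa Mohan Singh,Michael Morris Danziger,Jannis Born
Keywords: natural inductive bias, tasks involving reasoning, reasoning over quantities, exceptional capabilities, lack a natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments: 5-page version for NeurIPS 2024 (MathAI workshop)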
Abstract:While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the CE loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we present two versions of a number token loss. The first is based on an L_p loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
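The two losses described above can be written down for a single digit position as a minimal sketch. The assumptions are stated plainly: the vocabulary is restricted to the ten digit tokens, the ground truth is a single value (so the Wasserstein-1 distance to it reduces to the expected absolute difference), and the names `ntl_lp` and `ntl_wasserstein1` are illustrative rather than taken from the authors' code.

```python
def ntl_lp(probs, token_values, target, p=2):
    """L_p variant: |expected token value - target| ** p."""
    expected = sum(q * v for q, v in zip(probs, token_values))
    return abs(expected - target) ** p


def ntl_wasserstein1(probs, token_values, target):
    """Wasserstein-1 distance between the predicted distribution and a
    point mass at `target`, which equals E|value - target|."""
    return sum(q * abs(v - target) for q, v in zip(probs, token_values))


def one_hot(index, size=10):
    """Distribution that puts all probability mass on one digit token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


DIGITS = list(range(10))
TARGET = 5

near_miss = one_hot(4)   # predicts 4 when the answer is 5
far_miss = one_hot(9)    # predicts 9 when the answer is 5
split = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]  # half on 1, half on 9

print(ntl_lp(near_miss, DIGITS, TARGET), ntl_lp(far_miss, DIGITS, TARGET))
print(ntl_lp(split, DIGITS, TARGET), ntl_wasserstein1(split, DIGITS, TARGET))
```

Cross-entropy treats the two one-hot misses identically, while both number token losses penalize the far miss more. The `split` case shows one difference between the variants: its expected value equals the target, so the L_p variant is zero, whereas the Wasserstein-1 variant still penalizes mass placed far from the truth.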
[NLP-20] Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention EMNLP2024
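[Quick read]: This paper addresses the challenge of improving both the effectiveness and the efficiency of large language models (LLMs). The key to the solution is to apply a low-dimensional module only to the attention layers, a structure the authors call Low-dimensional Projected Attention (LPA), so that low-rank pre-training reduces parameters in a precisely targeted way and improves efficiency without sacrificing performance. Experiments show that the LPA model saves up to 12.4% in time while improving test perplexity (ppl) and downstream task performance by approximately 5% compared with the vanilla Transformer.
Link: https://arxiv.org/abs/2411.02063
Authors: Xingtai Lv,Ning Ding,Kaiyan Zhang,Ermo Hua,Ganqu Cui,Bowen Zhou
Keywords: challenging research goal, large language models, research goal, large language, critical yet challenging
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Main Conference)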
Abstract:Improving the effectiveness and efficiency of large language models (LLMs) simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally considered an efficient method that compromises performance, can be scalably effective when reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experimentation at parameter scales of 130M, 370M, and scaling up to 3B, we have validated the effectiveness and scalability of LPA. Our results show that the LPA model can save up to 12.4% in time while achieving an approximate 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.
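To make the parameter saving concrete, the sketch below compares a dense attention projection with a two-factor low-rank replacement. It rests on assumptions not stated in the abstract: that each of four attention projections of size d × d is replaced by a d × r matrix followed by an r × d matrix, and that d = 1024 and r = 128. The function names are illustrative only.

```python
def dense_params(d_model, n_proj=4):
    """Weights in n_proj dense d_model x d_model projections."""
    return n_proj * d_model * d_model


def low_rank_params(d_model, rank, n_proj=4):
    """Weights when each projection is factored as (d x r) @ (r x d)."""
    return n_proj * (d_model * rank + rank * d_model)


def project(x, a, b):
    """Apply the factored projection: (x @ A) @ B, using plain lists."""
    hidden = [sum(xi * a[i][j] for i, xi in enumerate(x))
              for j in range(len(a[0]))]
    return [sum(h * b[j][k] for j, h in enumerate(hidden))
            for k in range(len(b[0]))]


D_MODEL, RANK = 1024, 128   # hypothetical sizes
dense = dense_params(D_MODEL)
factored = low_rank_params(D_MODEL, RANK)
print(dense, factored, factored / dense)

# Tiny forward pass: 2-dim input, rank-1 bottleneck, 2-dim output.
out = project([1.0, 2.0], a=[[1.0], [1.0]], b=[[1.0, 0.0]])
print(out)
```

The factored form is smaller whenever r is less than d / 2, and the input passes through an r-dimensional bottleneck before being projected back up.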
[NLP-21] Explainable cognitive decline detection in free dialogues with a Machine Learning approach based on pre-trained Large Language Models
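[Quick read]: This paper addresses the high cost of frequent screening for cognitive and neurological impairments, and proposes using large language models to extract features from free dialogues to detect cognitive decline. The key to the solution is feature engineering via natural language processing techniques and prompt engineering to extract high-level, content-independent reasoning features (such as comprehension, decreased awareness, increased distraction, and memory problems), followed by feature analysis and selection to optimize performance. Combining feature extraction with ChatGPT and a specialized machine learning model detects cognitive decline in free-form dialogues with older adults, offering an inexpensive, non-invasive, and rapid means of detection.
Link: https://arxiv.org/abs/2411.02036
Authors: Francisco de Arriba-Pérez,Silvia García-Méndez,Javier Otero-Mosquera,Francisco J. González-Castaño
Keywords: diagnosed and treated, frequent screening, small proportion, proportion of affected, affected individuals
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: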
Abstract:Cognitive and neurological impairments are very common, but only a small proportion of affected individuals are diagnosed and treated, partly because of the high costs associated with frequent screening. Detecting pre-illness stages and analyzing the progression of neurological disorders through effective and efficient intelligent systems can be beneficial for timely diagnosis and early intervention. We propose using Large Language Models to extract features from free dialogues to detect cognitive decline. These features comprise high-level reasoning content-independent features (such as comprehension, decreased awareness, increased distraction, and memory problems). Our solution comprises (i) preprocessing, (ii) feature engineering via Natural Language Processing techniques and prompt engineering, (iii) feature analysis and selection to optimize performance, and (iv) classification, supported by automatic explainability. We also explore how to improve ChatGPT’s direct cognitive impairment prediction capabilities using the best features in our models. Evaluation metrics obtained endorse the effectiveness of a mixed approach combining feature extraction with ChatGPT and a specialized Machine Learning model to detect cognitive decline within free-form conversational dialogues with older adults. Ultimately, our work may facilitate the development of an inexpensive, non-invasive, and rapid means of detecting and explaining cognitive decline.
[NLP-22] Shortcut Learning in In-Context Learning: A Survey
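[Quick read]: This paper addresses shortcut learning in large language models (LLMs) on In-Context Learning (ICL) tasks, where models adopt simple, non-robust decision rules in practical tasks, which hurts their generalization and robustness. The key contribution is a detailed exploration of the types of shortcuts in ICL tasks, their causes, available benchmarks, and mitigation strategies. The paper summarizes the unresolved issues in existing research and attempts to outline directions for future research on shortcut learning.
Link: https://arxiv.org/abs/2411.02018
Authors: Rui Song,Yingji Li,Fausto Giunchiglia,Hao Xu
Keywords: non-robust decision rules, models employ simple, Shortcut learning refers, Shortcut learning, employ simple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures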
Abstract:Shortcut learning refers to the phenomenon where models employ simple, non-robust decision rules in practical tasks, which hinders their generalization and robustness. With the rapid development of large language models (LLMs) in recent years, an increasing number of studies have shown the impact of shortcut learning on LLMs. This paper provides a novel perspective to review relevant research on shortcut learning in In-Context Learning (ICL). It conducts a detailed exploration of the types of shortcuts in ICL tasks, their causes, available benchmarks, and strategies for mitigating shortcuts. Based on corresponding observations, it summarizes the unresolved issues in existing research and attempts to outline the future research landscape of shortcut learning.
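[NLP-23] Culinary Class Wars: Evaluating LLMs using ASH in Cuisine Transfer Task
[Quick read]: This paper addresses the limited culinary creativity of large language models (LLMs), especially when adapting recipes to specific cultural requirements. The key to the solution is the ASH (authenticity, sensitivity, harmony) benchmark, which evaluates LLMs' recipe generation abilities in the cuisine transfer task. By comparing LLM evaluations of culturally adapted recipes with human judgments, the study reveals the strengths and limitations of LLMs in understanding and applying cultural nuances in recipe creation.
Link: https://arxiv.org/abs/2411.01996
Authors: Hoonick Lee,Mogan Gim,Donghyeon Park,Donghee Choi,Jaewoo Kang
Keywords: Large Language Models, Language Models, Large Language, including culinary arts, advent of Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: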
Abstract:The advent of Large Language Models (LLMs) has shown promise in various creative domains, including culinary arts. However, many LLMs still struggle to deliver the desired level of culinary creativity, especially when tasked with adapting recipes to meet specific cultural requirements. This study focuses on cuisine transfer (applying elements of one cuisine to another) to assess LLMs’ culinary creativity. We employ a diverse set of LLMs to generate and evaluate culturally adapted recipes, comparing their evaluations against LLM and human judgments. We introduce the ASH (authenticity, sensitivity, harmony) benchmark to evaluate LLMs’ recipe generation abilities in the cuisine transfer task, assessing their cultural accuracy and creativity in the culinary domain. Our findings reveal crucial insights into both generative and evaluative capabilities of LLMs in the culinary domain, highlighting strengths and limitations in understanding and applying cultural nuances in recipe creation. The code and dataset used in this project will be openly available at this http URL.
[NLP-24] Can Language Models Learn to Skip Steps? NEURIPS2024
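[Quick read]: This paper asks how language models can acquire the step-skipping ability in reasoning that human experts show. The key to the solution is a controlled framework that iteratively refines models to generate shorter yet accurate reasoning paths. Specifically, after fine-tuning on expanded datasets that contain both complete and skipped reasoning sequences, the models solve tasks more efficiently without sacrificing accuracy and show comparable or even better generalization in out-of-domain scenarios. The work is a first exploration of human-like step-skipping ability in AI models and offers a fresh perspective on how such cognitive abilities can benefit them.
Link: https://arxiv.org/abs/2411.01855
Authors: Tengxiao Liu,Qipeng Guo,Xiangkun Hu,Cheng Jiayang,Yue Zhang,Xipeng Qiu,Zheng Zhang
Keywords: Trained on vast, language models demonstrate, models demonstrate emergent, demonstrate emergent human-like, vast corpora
Categories: Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024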
Abstract:Trained on vast corpora of human language, language models demonstrate emergent human-like reasoning abilities. Yet they are still far from true intelligence, which opens up intriguing opportunities to explore the parallels of humans and model behaviors. In this work, we study the ability to skip steps in reasoning - a hallmark of human expertise developed through practice. Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not inherently possess such motivations to minimize reasoning steps. To address this, we introduce a controlled framework that stimulates step-skipping behavior by iteratively refining models to generate shorter and accurate reasoning paths. Empirical results indicate that models can develop the step skipping ability under our guidance. Moreover, after fine-tuning on expanded datasets that include both complete and skipped reasoning sequences, the models can not only resolve tasks with increased efficiency without sacrificing accuracy, but also exhibit comparable and even enhanced generalization capabilities in out-of-domain scenarios. Our work presents the first exploration into human-like step-skipping ability and provides fresh perspectives on how such cognitive abilities can benefit AI models.
[NLP-25] Leveraging Label Semantics and Meta-Label Refinement for Multi-Label Question Classification
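[Quick read]: This paper addresses multi-label classification for annotating online educational resources, in particular the semantic overlap and imbalanced distribution of labels that impede effective personalized learning and resource recommendation. The key to the solution is RR2QC, a new retrieval-reranking method that uses label semantics and meta-label refinement to improve multi-label question classification. Specifically, RR2QC leverages semantic relationships within and across label groups to enhance pre-training in the multi-label context, and introduces a class center learning task that integrates label texts into downstream training so that questions align with label semantics and the most relevant label sequences are retrieved. It then decomposes labels into meta-labels and trains a meta-label classifier to rerank the retrieved label sequences, improving the understanding and prediction of long-tail labels. Finally, a Math LLM generates solutions for questions, from which latent information is extracted to further refine the model. Experiments show that RR2QC outperforms existing classification methods in Precision@k and F1 scores across multiple educational datasets.
Link: https://arxiv.org/abs/2411.01841
Authors: Shi Dong,Xiaobei Niu,Rui Zhong,Zhifeng Wang,Mingzhang Zuo
Keywords: rapidly advancing field, Accurate annotation, online education due, rapidly advancing, advancing field
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: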
Abstract:Accurate annotation of educational resources is critical in the rapidly advancing field of online education due to the complexity and volume of content. Existing classification methods face challenges with semantic overlap and distribution imbalance of labels in the multi-label context, which impedes effective personalized learning and resource recommendation. This paper introduces RR2QC, a novel Retrieval Reranking method To multi-label Question Classification by leveraging label semantics and meta-label refinement. Firstly, RR2QC leverages semantic relationships within and across label groups to enhance pre-training strategies in multi-label context. Next, a class center learning task is introduced, integrating label texts into downstream training to ensure questions consistently align with label semantics, retrieving the most relevant label sequences. Finally, this method decomposes labels into meta-labels and trains a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels frequently appearing in other labels. Additionally, a Math LLM is used to generate solutions for questions, extracting latent information to further refine the model’s insights. Experimental results demonstrate that RR2QC outperforms existing classification methods in Precision@k and F1 scores across multiple educational datasets, establishing it as a potent enhancement for online educational content utilization.
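The reranking step and the Precision@k metric mentioned above can be illustrated with a small sketch. Precision@k is the standard definition; everything else is an assumption for illustration: labels are split into meta-labels on "/", a label's rerank score is the mean of hypothetical meta-label probabilities, and the labels and numbers are invented. The abstract does not describe how RR2QC actually scores sequences.

```python
def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that are correct."""
    return sum(1 for label in ranked_labels[:k] if label in true_labels) / k


def rerank_by_meta_labels(retrieved, meta_probs, sep="/"):
    """Reorder labels by the mean probability of their meta-labels
    (assumed scoring rule, for illustration only)."""
    def score(label):
        parts = label.split(sep)
        return sum(meta_probs.get(part, 0.0) for part in parts) / len(parts)
    return sorted(retrieved, key=score, reverse=True)


# Hypothetical retrieval output and meta-label classifier probabilities.
retrieved = ["algebra/linear-equation", "geometry/triangle", "algebra/quadratic"]
meta_probs = {"algebra": 0.9, "geometry": 0.2, "linear-equation": 0.3,
              "quadratic": 0.8, "triangle": 0.1}
truth = {"algebra/quadratic", "algebra/linear-equation"}

before = precision_at_k(retrieved, truth, k=2)
reranked = rerank_by_meta_labels(retrieved, meta_probs)
after = precision_at_k(reranked, truth, k=2)
print(before, reranked, after)
```

Because the meta-label "algebra" is shared by several labels, evidence for it lifts a rarer label such as "algebra/quadratic", which is the intuition behind learning long-tail labels from frequently occurring meta-labels.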
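[NLP-26] TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition
[Quick read]: This paper addresses Discontinuous Named Entity Recognition (DNER), where an entity may be scattered across multiple non-adjacent tokens and traditional sequence labelling approaches are inadequate. The key to the solution is TriG-NER, a novel Triplet-Grid Framework that applies triplet loss at the token level, with similarity defined by word pairs that occur within the same entity, which improves entity boundary detection and reduces dependence on specific tagging schemes. Specifically, TriG-NER pulls similar word pairs together and pushes dissimilar ones apart, capturing complex entity structures within a flexible grid structure. The method yields significant improvements on three benchmark DNER datasets and adapts to various tagging schemes.
Link: https://arxiv.org/abs/2411.01839
Authors: Rina Carines Cabral,Soyeon Caren Han,Areej Alhassan,Riza Batista-Navarro,Goran Nenadic,Josiah Poon
Keywords: Named Entity Recognition, making traditional sequence, labelling approaches inadequate, traditional sequence labelling, sequence labelling approaches
Categories: Computation and Language (cs.CL)
Comments: Code will be made available upon publication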
Abstract:Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework’s effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
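The token-level triplet loss at the center of this entry has a standard form, sketched below with Euclidean distance and a margin of 1.0. The distance function, the margin, and the toy two-dimensional embeddings are assumptions; the abstract only states that positives are tokens from the same entity as the anchor.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical token embeddings. The anchor and positive belong to the
# same (possibly discontinuous) entity; the negatives do not.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
far_negative = [3.0, 4.0]     # already well separated
close_negative = [0.0, 0.5]   # closer to the anchor than the positive is

satisfied = triplet_loss(anchor, positive, far_negative)
violated = triplet_loss(anchor, positive, close_negative)
print(satisfied, violated)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so gradients come only from triplets that still violate the margin.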
[NLP-27] Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
【速读】: 该论文试图解决无文本口语语言模型(textless Spoken Language Models, SLMs)在语义连贯性和相关性方面落后于基于文本的大型语言模型(Large Language Models, LLMs)的问题。解决方案的关键是引入Align-SLM框架,该框架利用受基于AI反馈的强化学习(Reinforcement Learning with AI Feedback, RLAIF)启发的偏好优化技术来增强SLMs的语义理解能力。具体方法包括从给定提示生成多个语音延续,并使用语义度量来创建用于直接偏好优化(Direct Preference Optimization, DPO)的偏好数据。实验结果表明,该方法在大多数基准测试中实现了SLMs的最新性能,突显了偏好优化对于提升SLMs语义质量的重要性。
链接: https://arxiv.org/abs/2411.01834
作者: Guan-Ting Lin,Prashanth Gurunath Shivakumar,Aditya Gourav,Yile Gu,Ankur Gandhe,Hung-yi Lee,Ivan Bulyko
关键词-EN: Large Language Models, text-based Large Language, Spoken Language Models, Language Models, textless Spoken Language
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
摘要:尽管无文本的口语语言模型 (Spoken Language Models, SLMs) 在端到端的语音到语音建模方面展示了潜力,但在语义连贯性和相关性方面仍落后于基于文本的大语言模型 (Large Language Models, LLMs)。本研究引入了 Align-SLM 框架,该框架利用由 AI 反馈驱动的强化学习 (Reinforcement Learning with AI Feedback, RLAIF) 启发的偏好优化,以增强 SLMs 的语义理解能力。我们的方法从给定的提示生成多个语音延续,并使用语义度量来创建用于直接偏好优化 (Direct Preference Optimization, DPO) 的偏好数据。我们通过 ZeroSpeech 2021 基准测试对词汇和句法建模、StoryCloze 数据集的口语版本进行语义连贯性评估,以及其他语音生成度量(包括 GPT4-o 评分和人工评估)来评估该框架。实验结果表明,我们的方法在大多数基准测试中实现了 SLMs 的最新性能,突显了偏好优化对于提升 SLMs 语义质量的重要性。
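As a rough illustration of the pipeline above (score several continuations, build a preference pair, optimise with DPO), the sketch below picks the best and worst continuation by a semantic score and evaluates the standard DPO loss for one pair. The scores, log-probabilities, and the best-versus-worst selection rule are assumptions for illustration; the abstract does not specify the metric or the pairing criteria.

```python
import math

def pick_preference_pair(continuations):
    """continuations: list of (text, semantic_score). Returns (chosen, rejected)."""
    ranked = sorted(continuations, key=lambda item: item[1])
    return ranked[-1][0], ranked[0][0]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical continuations of one spoken prompt with hypothetical semantic scores.
candidates = [("coherent ending", 0.9), ("off-topic ending", 0.2), ("so-so ending", 0.5)]
chosen, rejected = pick_preference_pair(candidates)

neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy already prefers the chosen sample
```

The loss drops below its neutral value of log 2 once the policy assigns relatively more probability to the chosen continuation than the reference model does.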
[NLP-28] Towards Pedagogical LLMs with Supervised Fine Tuning for Computing Education
【速读】: 该论文试图解决大型语言模型(LLMs)在计算机教育中可能阻碍学习成果的问题,特别是通过改进LLMs的教学对齐性(pedagogical alignment)。解决方案的关键在于利用高质量的编程课程论坛数据集进行有监督的微调(supervised fine-tuning),以确保LLMs更好地符合教育原则,如建构主义(constructivism)。初步研究结果表明,这种微调方法有助于提升LLMs的教学对齐性,但仍需进一步深入评估。
链接: https://arxiv.org/abs/2411.01765
作者: Alexandra Vassar,Jake Renzella,Emily Ross,Andrew Taylor
关键词-EN: large language models, hinder learning outcomes, paper investigates supervised, investigates supervised fine-tuning, language models
类目: Computation and Language (cs.CL)
备注: 3 pages, 1 table, conference
点击查看摘要
Abstract:This paper investigates supervised fine-tuning of large language models (LLMs) to improve their pedagogical alignment in computing education, addressing concerns that LLMs may hinder learning outcomes. The project utilised a proprietary dataset of 2,500 high quality question/answer pairs from programming course forums, and explores two research questions: the suitability of university course forums in contributing to fine-tuning datasets, and how supervised fine-tuning can improve LLMs’ alignment with educational principles such as constructivism. Initial findings suggest benefits in pedagogical alignment of LLMs, with deeper evaluations required.
摘要:本文探讨了通过监督微调大语言模型 (LLMs) 来提升其在计算机教育中的教学对齐性,以解决 LLMs 可能阻碍学习成果的担忧。该项目利用了一个包含 2,500 对高质量问题/答案的专有数据集,这些数据来自编程课程论坛,并研究了两个研究问题:大学课程论坛在贡献微调数据集方面的适用性,以及监督微调如何提升 LLMs 与建构主义等教育原则的对齐性。初步研究结果表明,LLMs 在教学对齐性方面有所提升,但仍需进一步深入评估。
[NLP-29] RAGViz: Diagnose and Visualize Retrieval-Augmented Generation
【速读】: 该论文试图解决当前检索增强生成 (Retrieval-augmented Generation, RAG) 系统在可视化检索文档内容和模型对这些文档的关注度方面的不足。解决方案的关键是提出了 RAGViz,一个用于诊断 RAG 系统的工具,它通过可视化生成令牌在检索文档中的关注度来增强系统的透明性和可解释性。RAGViz 提供了两个主要功能:令牌和文档级别的关注度可视化,以及在添加或移除上下文文档时的生成结果比较。该工具通过结合自定义嵌入模型和 HuggingFace 支持的大型语言模型 (Large Language Model, LLM) 骨干,以及高效的近似最近邻 (Approximate Nearest Neighbor, ANN) 索引和内存高效的 LLM 推理工具,实现了在适度 GPU 节点上约 5 秒的中位查询时间。
链接: https://arxiv.org/abs/2411.01751
作者: Tevin Wang,Jingyuan He,Chenyan Xiong
关键词-EN: ground answer generation, Retrieval-augmented generation, large language models, combines knowledge, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) combines knowledge from domain-specific sources into large language models to ground answer generation. Current RAG systems lack customizable visibility on the context documents and the model’s attentiveness towards such documents. We propose RAGViz, a RAG diagnosis tool that visualizes the attentiveness of the generated tokens in retrieved documents. With a built-in user interface, retrieval index, and Large Language Model (LLM) backbone, RAGViz provides two main functionalities: (1) token and document-level attention visualization, and (2) generation comparison upon context document addition and removal. As an open-source toolkit, RAGViz can be easily hosted with a custom embedding model and HuggingFace-supported LLM backbone. Using a hybrid ANN (Approximate Nearest Neighbor) index, memory-efficient LLM inference tool, and custom context snippet method, RAGViz operates efficiently with a median query time of about 5 seconds on a moderate GPU node. Our code is available at this https URL. A demo video of RAGViz can be found at this https URL.
摘要:检索增强生成 (Retrieval-augmented Generation, RAG) 将领域特定资源的知识融入大语言模型中,以实现基于事实的答案生成。当前的 RAG 系统在上下文文档的可定制可见性以及模型对这些文档的关注度方面存在不足。我们提出了 RAGViz,一个用于诊断 RAG 系统的工具,该工具能够可视化生成 Token 在检索文档中的关注度。RAGViz 内置了用户界面、检索索引以及大语言模型 (Large Language Model, LLM) 骨干,提供两大主要功能:(1) Token 和文档级别的注意力可视化;(2) 在添加和移除上下文文档时的生成结果对比。作为一个开源工具包,RAGViz 可以轻松地与自定义嵌入模型和 HuggingFace 支持的 LLM 骨干一起部署。通过使用混合 ANN (Approximate Nearest Neighbor) 索引、内存高效的 LLM 推理工具以及自定义上下文片段方法,RAGViz 在适中的 GPU 节点上实现了约 5 秒的中位查询时间。我们的代码可在以下链接获取:https URL。RAGViz 的演示视频可在以下链接找到:https URL。
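Document-level attention of the kind described above can be obtained by aggregating token-level weights over each document's span. The sketch below assumes a single attention vector from one generated token to the context tokens and a hypothetical span layout; how RAGViz actually combines layers and heads is not stated in the abstract.

```python
def document_attention(token_attention, doc_spans):
    """Sum per-token attention weights inside each document's token span.

    token_attention: weight from one generated token to each context token.
    doc_spans: {doc_id: (start, end)} with end exclusive (hypothetical layout).
    """
    return {doc: sum(token_attention[start:end]) for doc, (start, end) in doc_spans.items()}

# Hypothetical attention weights over six context tokens from three documents.
attn = [0.05, 0.05, 0.40, 0.30, 0.10, 0.10]
spans = {"doc_a": (0, 2), "doc_b": (2, 4), "doc_c": (4, 6)}

scores = document_attention(attn, spans)
top_doc = max(scores, key=scores.get)
```

Repeating this for every generated token gives a token-by-document matrix that a user interface could render as a heat map.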
[NLP-30] DynaSaur: Large Language Agents Beyond Predefined Actions
【速读】: 该论文试图解决现有大型语言模型(LLM)代理系统在真实世界场景中面临的两个主要挑战:(1)固定预定义动作集限制了LLM代理的规划和执行能力;(2)在复杂环境中,枚举和实现所有可能动作所需的人力成本过高。解决方案的关键在于提出一个能够动态创建和组合动作的LLM代理框架,该框架允许代理在每一步通过生成和执行通用编程语言编写的程序来与环境交互,并且生成的动作可以累积以供未来重用。这一方法显著提高了系统的灵活性,并在GAIA基准测试中表现优于先前的方法,特别是在处理预定义动作集中不存在或现有动作因边缘情况失败的情况时。
链接: https://arxiv.org/abs/2411.01747
作者: Dang Nguyen,Viet Dac Lai,Seunghyun Yoon,Ryan A. Rossi,Handong Zhao,Ruiyi Zhang,Puneet Mathur,Nedim Lipka,Yu Wang,Trung Bui,Franck Dernoncourt,Tianyi Zhou
关键词-EN: systems typically select, LLM agent systems, agent systems typically, LLM agent, typically select actions
类目: Computation and Language (cs.CL)
备注: 15 pages, 8 figures
点击查看摘要
Abstract:Existing LLM agent systems typically select actions from a fixed and predefined set at every step. While this approach is effective in closed, narrowly-scoped environments, we argue that it presents two major challenges when deploying LLM agents in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) this approach requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or when existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found at this https URL.
摘要:现有的大语言模型 (LLM) 智能体系统通常在每一步从固定的预定义动作集中选择动作。尽管这种方法在封闭、狭义的环境中是有效的,但我们认为在实际场景中部署LLM智能体时,它面临两大挑战:(1) 从固定动作集中选择动作显著限制了LLM智能体的规划和执行能力;(2) 这种方法需要大量人力来枚举和实现所有可能的动作,这在具有大量潜在动作的复杂环境中变得不切实际。在本研究中,我们提出了一种LLM智能体框架,该框架支持在线动态创建和组合动作。在该框架中,智能体通过在每一步生成并执行用通用编程语言编写的程序与环境交互。此外,生成的动作会随着时间的推移积累,以便未来复用。我们在GAIA基准上的广泛实验表明,该框架提供了显著更高的灵活性,并优于以往的方法。值得注意的是,它允许LLM智能体在预定义动作集中不存在相关动作或现有动作因未预见的边缘情况失败的情况下恢复。在撰写本文时,我们在GAIA公共排行榜上位居榜首。我们的代码可以在以下链接找到:this https URL。
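The idea of generating actions as programs and accumulating them for reuse can be sketched as a small registry. The class name, method names, and the hard-coded "generated" source are all hypothetical stand-ins; a real agent would obtain the source from an LLM and run it in a sandbox.

```python
class ActionLibrary:
    """Accumulates actions (plain Python functions) created at run time."""

    def __init__(self):
        self.actions = {}

    def add_from_source(self, source):
        """Execute generated source and register every public callable it defines."""
        namespace = {}
        exec(source, namespace)  # a real system would sandbox this step
        new = {
            name: obj
            for name, obj in namespace.items()
            if callable(obj) and not name.startswith("_")
        }
        self.actions.update(new)
        return sorted(new)

    def call(self, name, *args):
        """Invoke a previously registered action by name."""
        return self.actions[name](*args)

library = ActionLibrary()
# Stand-in for code the LLM would write when no predefined action fits.
generated = "def word_count(text):\n    return len(text.split())\n"
added = library.add_from_source(generated)
result = library.call("word_count", "actions can be reused later")
```

Because the library persists across steps, an action written for one task is available to later tasks without being generated again.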
[NLP-31] Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models NEURIPS2024
【速读】: 该论文试图解决在从强大的基础模型(foundation model)恢复训练时,传统优化器如AdamW可能导致参数偏离预训练初始化,从而影响模型鲁棒性和泛化能力的问题。解决方案的关键在于提出了一种新的权重衰减技术——选择性投影衰减(Selective Projection Decay, SPD),该技术能够有选择地对某些层施加强惩罚,同时允许其他层自由变化。SPD通过分别扩展和收缩参数搜索空间,针对损失减少一致和不一致的层,从而在多个视觉和语言基准测试中,显著提升了Adam优化器的分布内泛化能力和分布外鲁棒性。
链接: https://arxiv.org/abs/2411.01713
作者: Junjiao Tian,Chengyue Huang,Zsolt Kira
关键词-EN: adaptive learning rate, escape local minima, Modern optimizers, learning rate, momentum and adaptive
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Neurips 2024
点击查看摘要
Abstract:Modern optimizers such as AdamW, equipped with momentum and adaptive learning rate, are designed to escape local minima and explore the vast parameter space. This exploration is beneficial for finding good loss basins when training from scratch. It is not necessarily ideal when resuming from a powerful foundation model because it can lead to large deviations from the pre-trained initialization and, consequently, worse robustness and generalization. At the same time, strong regularization on all parameters can lead to under-fitting. We hypothesize that selectively regularizing the parameter space is the key to fitting and retraining the pre-trained knowledge. This paper proposes a new weight decay technique, Selective Projection Decay (SPD), that selectively imposes a strong penalty on certain layers while allowing others to change freely. Intuitively, SPD expands and contracts the parameter search space for layers with consistent and inconsistent loss reduction, respectively. Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks. Code available at this https URL.
摘要:现代优化器如 AdamW,配备了动量和自适应学习率,旨在逃离局部最小值并探索广阔的参数空间。这种探索在从头开始训练时有利于找到良好的损失盆地。然而,在从强大的基础模型恢复训练时,这并不一定理想,因为它可能导致与预训练初始化的大幅偏离,从而导致鲁棒性和泛化性下降。同时,对所有参数进行强正则化可能导致欠拟合。我们假设,选择性地正则化参数空间是适应和重训练预训练知识的关键。本文提出了一种新的权重衰减技术,即选择性投影衰减(Selective Projection Decay, SPD),该技术选择性地对某些层施加强惩罚,同时允许其他层自由变化。直观地说,SPD分别扩展和收缩了具有一致和不一致损失减少的层的参数搜索空间。实验表明,当配备SPD时,Adam在多个流行的视觉和语言基准测试中持续提供更好的分布内泛化和分布外鲁棒性性能。代码可在 this https URL 获取。
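The general idea of penalising only some layers can be sketched as a decay step that pulls flagged layers back toward their pre-trained values and leaves the rest untouched. This is a simplified stand-in, not the paper's SPD update: the per-layer flag and the `strength` parameter are hypothetical, and the actual selection criterion (based on consistency of loss reduction) is not reproduced here.

```python
def selective_decay_step(weights, pretrained, regularize, strength=0.5):
    """Pull flagged layers toward their pre-trained values; copy the others as-is.

    weights / pretrained: {layer_name: [floats]}
    regularize: {layer_name: bool}, a hypothetical per-layer flag.
    """
    updated = {}
    for name, current in weights.items():
        if regularize.get(name, False):
            initial = pretrained[name]
            updated[name] = [w - strength * (w - w0) for w, w0 in zip(current, initial)]
        else:
            updated[name] = list(current)
    return updated

# Hypothetical two-layer model after a few fine-tuning steps.
weights = {"encoder": [2.0, 0.0], "head": [5.0]}
pretrained = {"encoder": [1.0, 0.0], "head": [0.0]}

new_weights = selective_decay_step(weights, pretrained, regularize={"encoder": True})
```

Unlike ordinary weight decay, which shrinks every parameter toward zero, this kind of penalty shrinks the deviation from the pre-trained initialisation, and only where it is switched on.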
[NLP-32] SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
【速读】: 该论文试图解决在语音到文本生成任务中,现有可解释AI技术未能充分考虑自回归模型特性以及无法提供细粒度、语音学上有意义的解释的问题。解决方案的关键在于提出了面向可解释语音到文本生成的频谱图扰动 (Spectrogram Perturbation for Explainable Speech-to-text Generation, SPES) 技术,这是一种适用于自回归模型序列生成任务的特征归因方法。SPES通过结合输入频谱图和先前生成的token,为每个预测token提供解释,从而在语音识别和翻译任务中生成忠实且对人类有意义的解释。
链接: https://arxiv.org/abs/2411.01710
作者: Dennis Fucci,Marco Gaido,Beatrice Savoldi,Matteo Negri,Mauro Cettolo,Luisa Bentivogli
关键词-EN: experienced significant growth, attribution methods emerging, significant growth, demand for interpretable, language technologies
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.
摘要:随着对可解释模型的需求增加,语言技术领域的可解释 AI (eXplainable AI) 研究经历了显著增长,其中特征归因方法成为这一进展的基石。尽管自然语言处理 (NLP) 领域的先前工作探索了分类任务和文本应用的此类方法,但在生成和语音交叉领域的可解释性研究仍显滞后,现有技术未能考虑到最先进模型的自回归特性,也无法提供细粒度、语音上有意义的解释。我们通过引入可解释语音到文本生成中的频谱扰动 (Spectrogram Perturbation for Explainable Speech-to-text Generation, SPES) 来填补这一空白,这是一种适用于自回归模型序列生成任务的特征归因技术。SPES 基于输入频谱和先前生成的 Token 为每个预测的 Token 提供解释。在语音识别和翻译上的广泛评估表明,SPES 生成的解释既忠实又符合人类的直觉。
[NLP-33] Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups EMNLP2024
【速读】: 该论文试图解决复杂词识别 (Complex Word Identification, CWI) 及其变体任务,如词汇复杂度预测 (Lexical Complexity Prediction, LCP) 和多词表达复杂度评估 (Complexity Evaluation of Multi-word Expressions, MWE)。解决方案的关键在于评估大型语言模型 (Large Language Models, LLMs) 在零样本 (zero-shot)、少样本 (few-shot) 和微调 (fine-tuning) 设置下的表现,并探讨元学习 (meta-learning) 与提示学习 (prompt learning) 的结合。研究结果表明,尽管 LLMs 在某些条件下表现不佳或仅与现有方法相当,但它们在处理这些任务时仍具有一定的潜力。
链接: https://arxiv.org/abs/2411.01706
作者: Răzvan-Alexandru Smădu,David-Gabriel Ion,Dumitru-Clementin Cercel,Florin Pop,Mihaela-Claudia Cercel
关键词-EN: Complex Word Identification, Complex Word, Word Identification, lexical simplification task, Natural Language Processing
类目: Computation and Language (cs.CL)
备注: 37 pages, 16 figures, Accepted by EMNLP 2024
点击查看摘要
Abstract:Complex Word Identification (CWI) is an essential step in the lexical simplification task and has recently become a task on its own. Some variations of this binary classification task have emerged, such as lexical complexity prediction (LCP) and complexity evaluation of multi-word expressions (MWE). Large language models (LLMs) recently became popular in the Natural Language Processing community because of their versatility and capability to solve unseen tasks in zero/few-shot settings. Our work investigates LLM usage, specifically open-source models such as Llama 2, Llama 3, and Vicuna v1.5, and closed-source, such as ChatGPT-3.5-turbo and GPT-4o, in the CWI, LCP, and MWE settings. We evaluate zero-shot, few-shot, and fine-tuning settings and show that LLMs struggle in certain conditions or achieve comparable results against existing methods. In addition, we provide some views on meta-learning combined with prompt learning. In the end, we conclude that the current state of LLMs cannot or barely outperform existing methods, which are usually much smaller.
摘要:复杂词汇识别 (Complex Word Identification, CWI) 是词汇简化任务中的关键步骤,近年来已成为一个独立的研究任务。该任务的一些变体也随之出现,如词汇复杂度预测 (Lexical Complexity Prediction, LCP) 和多词表达的复杂度评估 (Complexity Evaluation of Multi-word Expressions, MWE)。大语言模型 (Large Language Models, LLMs) 因其多功能性和在零样本/少样本设置下解决未见任务的能力,在自然语言处理 (Natural Language Processing, NLP) 社区中变得流行。我们的研究探讨了 LLM 在 CWI、LCP 和 MWE 设置中的应用,特别是开源模型如 Llama 2、Llama 3 和 Vicuna v1.5,以及闭源模型如 ChatGPT-3.5-turbo 和 GPT-4o。我们评估了零样本、少样本和微调设置,并展示了 LLM 在某些条件下表现不佳或与现有方法取得相当的结果。此外,我们还提供了一些关于元学习与提示学习结合的见解。最终,我们得出结论,当前的 LLM 状态无法或仅略微优于通常规模小得多的现有方法。
[NLP-34] Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors
【速读】: 该论文试图解决大语言模型(LLMs)在缺乏特定领域或最新知识时提供准确答案的局限性,并探讨了检索增强生成(RAG)系统中知识数据库的数据提取攻击问题。解决方案的关键在于揭示了通过简单微调(fine-tuning)可以显著降低此类攻击的成功率,使其在实际应用中变得不切实际。此外,论文提出了通过在微调阶段注入少量毒化数据(poisoned data)来创建后门(backdoor)的方法,使得攻击者能够利用特定触发词(triggers)操纵LLM泄露检索数据库中的文档。通过精心设计毒化数据,实现了对文档的逐字提取和改写提取,展示了RAG系统在供应链部署中的隐私风险。
链接: https://arxiv.org/abs/2411.01705
作者: Yuefeng Peng,Junda Wang,Hong Yu,Amir Houmansadr
关键词-EN: large language models, providing accurate answers, significant advancements, large language, language models
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite significant advancements, large language models (LLMs) still struggle with providing accurate answers when lacking domain-specific or up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge bases, but it also introduces new attack surfaces. In this paper, we investigate data extraction attacks targeting the knowledge databases of RAG systems. We demonstrate that previous attacks on RAG largely depend on the instruction-following capabilities of LLMs, and that simple fine-tuning can reduce the success rate of such attacks to nearly zero. This makes these attacks impractical since fine-tuning is a common practice when deploying LLMs in specific domains. To further reveal the vulnerability, we propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM. When this compromised LLM is integrated into a RAG system, attackers can exploit specific triggers in prompts to manipulate the LLM to leak documents from the retrieval database. By carefully designing the poisoned data, we achieve both verbatim and paraphrased document extraction. We show that with only 3% poisoned data, our method achieves an average success rate of 79.7% in verbatim extraction on Llama2-7B, with a ROUGE-L score of 64.21, and a 68.6% average success rate in paraphrased extraction, with an average ROUGE score of 52.6 across four datasets. These results underscore the privacy risks associated with the supply chain when deploying RAG systems.
摘要:尽管取得了显著进展,大语言模型(LLMs)在缺乏特定领域或最新知识时,仍难以提供准确的答案。检索增强生成(RAG)通过整合外部知识库来解决这一局限性,但同时也引入了新的攻击面。本文研究了针对RAG系统知识数据库的数据提取攻击。我们发现,先前对RAG的攻击很大程度上依赖于LLM的指令遵循能力,而简单的微调可以将此类攻击的成功率降低至接近零。这使得这些攻击在实际应用中变得不切实际,因为微调是部署LLM于特定领域时的常见做法。为进一步揭示其脆弱性,我们提出了一种后门攻击方法,即在微调阶段注入少量被污染的数据,以在LLM中创建一个后门。当这种被篡改的LLM集成到RAG系统中时,攻击者可以通过特定的提示触发器操纵LLM,使其泄露检索数据库中的文档。通过精心设计被污染的数据,我们实现了逐字和改写两种形式的文档提取。实验表明,仅使用3%的被污染数据,我们的方法在Llama2-7B模型上实现了79.7%的逐字提取平均成功率,ROUGE-L得分为64.21,以及68.6%的改写提取平均成功率,四个数据集上的平均ROUGE得分为52.6。这些结果突显了在部署RAG系统时供应链相关的隐私风险。
[NLP-35] UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在面对多模态越狱攻击 (multimodal jailbreak attacks) 时的脆弱性问题。解决方案的关键是提出了一种名为 UniGuard 的新型多模态安全防护机制,该机制联合考虑了单模态和跨模态的有害信号 (unimodal and cross-modal harmful signals)。UniGuard 通过在有毒语料库中训练,最小化生成有害响应的概率,并在推理过程中无缝应用于任何输入提示,且计算成本极低。实验结果表明,UniGuard 在多种模态和攻击策略上具有显著的通用性,并能有效应用于多个最先进的 MLLMs,包括 LLaVA、Gemini Pro、GPT-4、MiniGPT-4 和 InstructBLIP。
链接: https://arxiv.org/abs/2411.01703
作者: Sejoon Oh,Yiqiao Jin,Megha Sharma,Donghyun Kim,Eric Ma,Gaurav Verma,Srijan Kumar
关键词-EN: large language models, revolutionized vision-language understanding, adversaries meticulously craft, Multimodal large language, meticulously craft inputs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have revolutionized vision-language understanding but are vulnerable to multimodal jailbreak attacks, where adversaries meticulously craft inputs to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard is trained such that the likelihood of generating harmful responses in a toxic corpus is minimized, and can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities and attack strategies. It demonstrates impressive generalizability across multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4, MiniGPT-4, and InstructBLIP, thereby broadening the scope of our solution.
摘要:多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在视觉-语言理解领域带来了革命性的变化,但同时也容易受到多模态越狱攻击,即攻击者精心设计输入以引发有害或不适当的响应。我们提出了 UniGuard,这是一种新颖的多模态安全护栏,它同时考虑了单模态和跨模态的有害信号。UniGuard 的训练旨在最小化在有毒语料库中生成有害响应的可能性,并且可以在推理过程中无缝应用于任何输入提示,计算成本极低。广泛的实验表明,UniGuard 在多种模态和攻击策略中具有良好的通用性。它在多个最先进的 MLLMs 中展示了令人印象深刻的通用性,包括 LLaVA、Gemini Pro、GPT-4、MiniGPT-4 和 InstructBLIP,从而扩大了我们解决方案的应用范围。
[NLP-36] Unlocking the Theory Behind Scaling 1-Bit Neural Networks
【速读】: 该论文试图解决1-bit大型语言模型(LLMs)的性能与参数规模之间的Scaling Law问题。解决方案的关键在于通过严格的理论证明,揭示了在权重限制为 $\{-1, +1\}$ 的情况下,随着网络宽度的增加,1-bit模型的训练动态必然趋向于核行为,从而保证模型在宽度增加时损失收敛至任意小。此外,论文引入了泛化差异(generalization difference)的概念,证明了随着网络宽度的扩展,1-bit网络与其全精度对应网络之间的输出差异保持在一个可忽略的水平。这些理论突破为1-bit神经网络的扩展提供了坚实的理论基础,并暗示了int1精度在未来神经网络中的潜在标准地位。
链接: https://arxiv.org/abs/2411.01663
作者: Majid Daliri,Zhao Song,Chiwun Yang
关键词-EN: Large Language Models, Large Language, rivals traditional LLMs, Language Models, showcasing an impressive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023); Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a Scaling Law for 1-bit Neural Networks. In this paper, we present the first theoretical result that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference maintains a negligible level as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.
摘要:近年来,1-bit 大语言模型 (LLM) 崭露头角,展现出与传统 LLM 相媲美的效率与性能。Wang 等人 (2023) 和 Ma 等人 (2024) 的研究表明,随着参数数量的增加,这些 1-bit LLM 的性能逐步提升,暗示了 1-bit 神经网络可能存在一个扩展定律 (Scaling Law)。本文首次从理论上严格确立了这一扩展定律。我们证明了,尽管权重的取值被限制为 $\{-1, +1\}$,但随着网络宽度的增加,模型训练的动力学不可避免地与核行为对齐。这一理论突破保证了 1-bit 模型在宽度增加时损失可以任意小。此外,我们引入了泛化差异的概念,定义为 1-bit 网络与其全精度对应物输出之间的差距,并证明随着网络宽度的扩展,这一差异保持在一个可忽略的水平。基于 Kaplan 等人 (2020) 的工作,我们进一步探讨了训练损失如何作为模型大小、数据集大小和训练所用计算资源的幂律函数进行扩展。我们的研究结果突显了扩展 1-bit 神经网络的巨大潜力,暗示 int1 可能成为未来神经网络精度的标准。
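To make the two central notions concrete, the sketch below binarises a weight vector by sign and measures the output gap between the full-precision and 1-bit versions of a single linear unit. It is a toy illustration only: the paper's generalization difference is defined on network outputs and analysed as width grows, and the convention of sending zero to +1 as well as the example numbers are assumptions.

```python
def binarize(weights):
    """Map each weight to -1.0 or +1.0 by sign (zero goes to +1.0 by convention here)."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def dot(weights, inputs):
    """Output of a single linear unit without bias."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical full-precision weights and one input vector.
full = [0.8, -1.2, 0.1, -0.4]
x = [1.0, 2.0, 3.0, 4.0]

one_bit = binarize(full)
gap = abs(dot(full, x) - dot(one_bit, x))  # output gap for this single unit
```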
[NLP-37] Diagnosing Medical Datasets with Training Dynamics
【速读】: 该论文试图解决使用训练动态作为自动化替代人类标注来评估训练数据质量的问题。解决方案的关键在于采用Data Maps框架,该框架将数据点分类为易学、难学和模糊(easy-to-learn, hard-to-learn, and ambiguous)类别。通过在医疗问答领域的挑战性数据集上进行实验,论文验证了这些分类的可靠性,并发现该框架在处理医疗领域特有的复杂问题时表现不佳。
链接: https://arxiv.org/abs/2411.01653
作者: Laura Wenderoth
关键词-EN: Data Maps, Data Maps framework, study explores, explores the potential, automated alternative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL
点击查看摘要
Abstract:This study explores the potential of using training dynamics as an automated alternative to human annotation for evaluating the quality of training data. The framework used is Data Maps, which classifies data points into categories such as easy-to-learn, hard-to-learn, and ambiguous (Swayamdipta et al., 2020). Swayamdipta et al. (2020) highlight that difficult-to-learn examples often contain errors, and ambiguous cases significantly impact model training. To confirm the reliability of these findings, we replicated the experiments using a challenging dataset, with a focus on medical question answering. In addition to text comprehension, this field requires the acquisition of detailed medical knowledge, which further complicates the task. A comprehensive evaluation was conducted to assess the feasibility and transferability of the Data Maps framework to the medical domain. The evaluation indicates that the framework is unsuitable for addressing datasets’ unique challenges in answering medical questions.
摘要:本研究探讨了利用训练动态作为人工标注的自动化替代方案,以评估训练数据质量的潜力。所采用的框架是数据地图 (Data Maps),该框架将数据点分类为易于学习、难以学习和模糊不清等类别(Swayamdipta 等,2020)。Swayamdipta 等(2020)指出,难以学习的示例往往包含错误,而模糊不清的情况对模型训练有显著影响。为了验证这些发现的可靠性,我们使用了一个具有挑战性的数据集进行了实验复现,重点关注医疗问答领域。除了文本理解外,该领域还需要获取详细的医学知识,这进一步增加了任务的复杂性。我们进行了一项全面的评估,以评估数据地图框架在医疗领域的可行性和可迁移性。评估结果表明,该框架不适用于解决医疗问答数据集的独特挑战。
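The Data Maps statistics referenced above are computed per training example from the model's behaviour across epochs: confidence is the mean probability assigned to the gold label, variability is its standard deviation, and correctness is the fraction of epochs with a correct prediction. The sketch below follows those definitions; the region thresholds and the example probabilities are hypothetical and not taken from either paper.

```python
from statistics import mean, pstdev

def training_dynamics(gold_probs, correct_flags):
    """Per-example statistics over epochs.

    gold_probs: probability assigned to the gold label at each epoch.
    correct_flags: whether the prediction was correct at each epoch.
    """
    return {
        "confidence": mean(gold_probs),
        "variability": pstdev(gold_probs),
        "correctness": sum(correct_flags) / len(correct_flags),
    }

def region(stats, conf_hi=0.7, conf_lo=0.3, var_hi=0.2):
    """Assign a Data Maps region; the thresholds are illustrative assumptions."""
    if stats["variability"] >= var_hi:
        return "ambiguous"
    if stats["confidence"] >= conf_hi:
        return "easy-to-learn"
    if stats["confidence"] <= conf_lo:
        return "hard-to-learn"
    return "unassigned"

easy = training_dynamics([0.90, 0.95, 0.92], [True, True, True])
hard = training_dynamics([0.10, 0.05, 0.12], [False, False, False])
mixed = training_dynamics([0.10, 0.90, 0.20, 0.80], [False, True, False, True])
```

Examples in the hard-to-learn region are the ones Swayamdipta et al. (2020) associate with labelling errors, which is why the region assignment matters for data-quality diagnosis.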
[NLP-38] EcoAct: Economic Agent Determines When to Register What Action
【速读】: 该论文试图解决大型语言模型(LLMs)在执行任务时,由于不加选择地将所有候选工具信息整合到模型上下文中,导致上下文长度增加和计算效率低下的问题。解决方案的关键是引入了一种名为EcoAct的工具注册算法,该算法允许LLMs在需要时选择性地注册工具,从而优化上下文的使用。通过将工具注册过程集成到推理过程中,EcoAct在多步骤推理任务中减少了超过50%的计算成本,同时保持了性能。此外,EcoAct只需对提示进行少量修改即可应用于现有的和未来的LLM代理推理流程中。
链接: https://arxiv.org/abs/2411.01643
作者: Shaokun Zhang,Jieyu Zhang,Dujian Ding,Mirian Hipolito Garcia,Ankur Mallick,Daniel Madrigal,Menglin Xia,Victor Rühle,Qingyun Wu,Chi Wang
关键词-EN: Large Language Models, enabled Large Language, Language Models, Large Language, Recent advancements
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 10 figures
点击查看摘要
Abstract:Recent advancements have enabled Large Language Models (LLMs) to function as agents that can perform actions using external tools. This requires registering, i.e., integrating tool information into the LLM context prior to taking actions. Current methods indiscriminately incorporate all candidate tools into the agent’s context and retain them across multiple reasoning steps. This process remains opaque to LLM agents and is not integrated into their reasoning procedures, leading to inefficiencies due to increased context length from irrelevant tools. To address this, we introduce EcoAct, a tool using algorithm that allows LLMs to selectively register tools as needed, optimizing context use. By integrating the tool registration process into the reasoning procedure, EcoAct reduces computational costs by over 50% in multiple steps reasoning tasks while maintaining performance, as demonstrated through extensive experiments. Moreover, it can be plugged into any reasoning pipeline with only minor modifications to the prompt, making it applicable to LLM agents now and future.
摘要:近年来,大语言模型 (Large Language Models, LLMs) 的发展使其能够作为智能体 (AI Agent) 使用外部工具执行操作。这要求在采取行动之前,将工具信息注册(即整合)到 LLM 的上下文中。当前的方法不加区分地将所有候选工具纳入智能体的上下文,并在多个推理步骤中保持这些工具的存在。这一过程对 LLM 智能体来说是不透明的,并未融入其推理过程,导致由于无关工具增加了上下文长度而效率低下。为解决这一问题,我们提出了 EcoAct,这是一种工具使用算法,允许 LLM 按需选择性地注册工具,从而优化上下文使用。通过将工具注册过程整合到推理过程中,EcoAct 在多步骤推理任务中将计算成本降低了超过 50%,同时保持了性能,这一点通过广泛的实验得到了证明。此外,EcoAct 只需对提示进行轻微修改即可插入任何推理流程中,使其适用于当前和未来的 LLM 智能体。
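The context saving from registering tools on demand can be illustrated by comparing two ways of building the prompt context. The tool catalogue, the function names, and the choice to list only tool names before registration are assumptions made for this sketch; the abstract does not describe the exact prompt layout.

```python
# Hypothetical tool catalogue: name -> full description.
TOOLS = {
    "weather": "weather(city: str) -> str. Returns the current forecast for a city.",
    "calculator": "calculator(expr: str) -> float. Evaluates an arithmetic expression.",
    "search": "search(query: str) -> list[str]. Returns the top web results for a query.",
}

def context_register_all(tools):
    """Baseline: every tool description is placed in the context up front."""
    return "\n".join(tools.values())

def context_register_on_demand(tools, registered):
    """Only tool names are listed; a full description appears once a tool is registered."""
    names = "Available tools: " + ", ".join(sorted(tools))
    details = [tools[name] for name in registered]
    return "\n".join([names] + details)

baseline = context_register_all(TOOLS)
selective = context_register_on_demand(TOOLS, registered=["calculator"])
```

With a realistic catalogue of dozens of tools, the gap between the two contexts grows with every description that never needs to be registered.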
[NLP-39] Leveraging Microservices Architecture for Dynamic Pricing in the Travel Industry: Algorithms Scalability and Impact on Revenue and Customer Satisfaction
【速读】: 该论文试图解决旅游行业中动态定价系统的实时性、可扩展性和灵活性问题。解决方案的关键在于采用微服务架构 (microservices architecture),将需求预测、竞争对手定价策略和事件驱动定价等模块独立为微服务,从而提高系统的可扩展性并减少峰值负载时的资源消耗。此外,通过结合客户行为数据的直接捕获和机器学习算法的发展,进一步优化实时定价的准确性。研究结果表明,微服务架构在动态定价系统中是合理且高效的,能够帮助旅游行业实施基于证据和客户中心的定价策略,确保利润不受客户需求波动的影响。
链接: https://arxiv.org/abs/2411.01636
作者: Biman Barua,M. Shamim Kaiser
关键词-EN: pricing, investigates the implementation, microservices-oriented dynamic pricing, dynamic pricing system, dynamic pricing
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
备注: 19 pages, 18 figures
点击查看摘要
Abstract:This research investigates the implementation of a real-time, microservices-oriented dynamic pricing system for the travel sector. The system is designed to address factors such as demand, competitor pricing, and other external circumstances in real-time. Both controlled simulation and real-life application showed a respectable gain of 22% in revenue generation and a 17% improvement in pricing response time which concern the issues of scaling and flexibility of classical pricing mechanisms. Demand forecasting, competitor pricing strategies, and event-based pricing were implemented as separate microservices to enhance their scalability and reduce resource consumption by 30% during peak loads. Customers were also more content as depicted by a 15% increase in satisfaction score post-implementation given the appreciation of more appropriate pricing. This research enhances the existing literature with practical illustrations of the possible application of microservices technology in developing dynamic pricing solutions in a complex and data-driven context. There exist however areas for improvement for instance inter-service latency and the need for extensive real-time data pipelines. The present research goes on to suggest combining these with direct data capture from customer behavior at the same time as machine learning capacity developments in pricing algorithms to assist in more accurate real time pricing. It is determined that the use of microservices is a reasonable and efficient model for dynamic pricing, allowing the tourism sector to employ evidence-based and customer centric pricing techniques, which ensures that their profits are not jeopardized because of the need for customers.
摘要:本研究探讨了在旅游行业中实施面向微服务的实时动态定价系统的实现。该系统旨在实时应对需求、竞争对手定价及其他外部因素。通过控制模拟和实际应用,系统在收入生成方面实现了22%的增长,并在定价响应时间上提升了17%,解决了传统定价机制在扩展性和灵活性方面的问题。需求预测、竞争对手定价策略和基于事件的定价被实现为独立的微服务,以增强其可扩展性,并在高峰负载期间减少30%的资源消耗。客户满意度也显著提升,实施后满意度评分增加了15%,这表明客户对更合理的定价表示赞赏。本研究通过实际案例丰富了现有文献,展示了微服务技术在复杂数据驱动环境中开发动态定价解决方案的潜力。然而,仍存在改进空间,例如服务间延迟和需要广泛的实时数据管道。当前研究进一步建议将这些与直接从客户行为中捕获数据相结合,同时开发定价算法中的机器学习能力,以实现更准确的实时定价。研究表明,微服务的使用是动态定价的一种合理且高效的模式,使旅游行业能够采用基于证据和以客户为中心的定价技术,确保其利润不会因客户需求而受到威胁。
[NLP-40] Ontology Population using LLMs
【速读】: 该论文试图解决知识图谱(Knowledge Graphs, KGs)构建过程中从非结构化自然语言文本中提取数据的高成本和复杂性问题。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的强大自然语言理解和内容生成能力,通过提示工程(prompt engineering)和微调(fine-tuning)来提高数据提取的准确性。研究结果表明,在提供模块化本体(ontology)作为提示指导的情况下,LLMs能够提取出约90%的三元组(triples),接近人类水平的表现。
链接: https://arxiv.org/abs/2411.01612
作者: Sanaz Saki Norouzi,Adrita Barua,Antrea Christou,Nikita Gautam,Andrew Eells,Pascal Hitzler,Cogan Shimizu
关键词-EN: Knowledge graphs, increasingly utilized, natural language, natural language data, Large Language Models
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Knowledge graphs (KGs) are increasingly utilized for data integration, representation, and visualization. While KG population is critical, it is often costly, especially when data must be extracted from unstructured text in natural language, which presents challenges, such as ambiguity and complex interpretations. Large Language Models (LLMs) offer promising capabilities for such tasks, excelling in natural language understanding and content generation. However, their tendency to "hallucinate" can produce inaccurate outputs. Despite these limitations, LLMs offer rapid and scalable processing of natural language data, and with prompt engineering and fine-tuning, they can approximate human-level performance in extracting and structuring data for KGs. This study investigates LLM effectiveness for the KG population, focusing on the this http URL Hub Ontology. In this paper, we report that compared to the ground truth, LLMs can extract ~90% of triples, when provided a modular ontology as guidance in the prompts.
[NLP-41] Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM EMNLP2024
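[Quick read]: This paper addresses the fact that contrastive decoding (CD) can, in some cases, fail to output high-probability answers, and proposes a new unsupervised decoding method, Asymptotic Probability Decoding (APD). The key is that APD explicitly extrapolates the probability curves of language models of different sizes to infer the asymptotic probabilities of an infinitely large language model, which improves the factuality of generated text and the accuracy of commonsense question answering without adding inference cost beyond that of CD.
Link: https://arxiv.org/abs/2411.01610
Authors: Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung
Keywords: expert language model, Contrastive decoding, large expert language, improves the next-token, next-token distribution
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 Oral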
Abstract: Contrastive decoding (CD) (Li et al., 2023) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. Although CD is applied to various LMs and domains to enhance open-ended text generation, it is still unclear why CD often works well, when it could fail, and how we can make it better. To deepen our understanding of CD, we first theoretically prove that CD could be viewed as linearly extrapolating the next-token logits from a huge and hypothetical LM. We also highlight that the linear extrapolation could make CD unable to output the most obvious answers that have already been assigned high probabilities by the amateur LM. To overcome CD’s limitation, we propose a new unsupervised decoding method called Asymptotic Probability Decoding (APD). APD explicitly extrapolates the probability curves from the LMs of different sizes to infer the asymptotic probabilities from an infinitely large LM without inducing more inference costs than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling using APD significantly boosts factuality in comparison to the CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, in five commonsense QA datasets, APD is often significantly better than CD and achieves a similar effect of using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B in CommonsenseQA and LAMBADA.
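The limitation described in the abstract can be seen in a toy example. The sketch below implements the contrastive decoding score of Li et al. (2023): the expert log-probability minus the amateur log-probability, restricted to tokens whose expert probability is at least `alpha` times the expert's maximum. It does not implement APD. The logits are invented, and `alpha=0.1` is an assumed setting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """CD score per token: log p_expert - log p_amateur, or -inf when the
    token fails the plausibility constraint p_expert >= alpha * max(p_expert)."""
    p_exp = softmax(expert_logits)
    p_ama = softmax(amateur_logits)
    cutoff = alpha * max(p_exp)
    return [
        math.log(pe) - math.log(pa) if pe >= cutoff else float("-inf")
        for pe, pa in zip(p_exp, p_ama)
    ]


# Invented logits over a 3-token vocabulary. Token 0 is the "obvious" answer:
# both models rate it highest.
expert = [4.0, 3.5, 0.0]
amateur = [4.0, 1.0, 0.0]

scores = contrastive_scores(expert, amateur)
cd_choice = max(range(len(scores)), key=scores.__getitem__)
expert_choice = max(range(len(expert)), key=expert.__getitem__)
print("expert picks token", expert_choice, "but CD picks token", cd_choice)
```

Because the amateur already gives token 0 a high probability, subtracting its log-probability pushes that token's score below token 1, even though the expert prefers token 0. This is the failure mode that motivates extrapolating probabilities instead of logits.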
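[NLP-42] Are LLMs good pragmatic speakers?
[Quick read]: This paper asks whether large language models (LLMs) behave like speakers capable of pragmatic reasoning. The approach uses the Rational Speech Act (RSA) framework, which models pragmatic reasoning in human communication, and a reference game constructed from the TUNA corpus to compare the scores that a state-of-the-art LLM (Llama3-8B-Instruct) and the RSA model assign to candidate referential utterances. The study examines how different choices of alternative utterances and truth-conditional meaning functions affect the comparison, and finds that although the LLM's scores correlate positively with RSA's to some degree, there is not enough evidence to claim that the LLM behaves like a pragmatic speaker. This initial study paves the way for further work on other models and settings, including human-subject evaluation, to determine whether LLMs can behave, or be trained to behave, like pragmatic speakers.
Link: https://arxiv.org/abs/2411.01562
Authors: Mingyue Jian, Siddharth Narayanaswamy
Keywords: Large language models, include natural language, Rational Speech Act, Large language, natural language pragmatics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: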
Abstract:Large language models (LLMs) are trained on data assumed to include natural language pragmatics, but do they actually behave like pragmatic speakers? We attempt to answer this question using the Rational Speech Act (RSA) framework, which models pragmatic reasoning in human communication. Using the paradigm of a reference game constructed from the TUNA corpus, we score candidate referential utterances in both a state-of-the-art LLM (Llama3-8B-Instruct) and in the RSA model, comparing and contrasting these scores. Given that RSA requires defining alternative utterances and a truth-conditional meaning function, we explore such comparison for different choices of each of these requirements. We find that while scores from the LLM have some positive correlation with those from RSA, there isn’t sufficient evidence to claim that it behaves like a pragmatic speaker. This initial study paves way for further targeted efforts exploring different models and settings, including human-subject evaluation, to see if LLMs truly can, or be made to, behave like pragmatic speakers.
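The two ingredients the abstract says RSA requires, a set of alternative utterances and a truth-conditional meaning function, can be made concrete with a small sketch of the standard RSA recursion: a literal listener that spreads belief uniformly over referents an utterance is true of, and a pragmatic speaker that prefers utterances the listener resolves correctly. The three-object scene, the single-word utterances, the zero utterance cost, and `alpha=1.0` are assumptions for illustration, not the paper's TUNA setup.

```python
def literal_listener(utterance, referents, meaning):
    """L0(r | u): uniform over the referents for which the utterance is true."""
    weights = [1.0 if meaning(utterance, r) else 0.0 for r in referents]
    total = sum(weights)
    return [w / total if total else 0.0 for w in weights]


def pragmatic_speaker(target, utterances, referents, meaning, alpha=1.0):
    """S1(u | r) proportional to L0(r | u) ** alpha (zero utterance cost).
    Assumes at least one utterance is true of the target."""
    idx = referents.index(target)
    utilities = [
        literal_listener(u, referents, meaning)[idx] ** alpha for u in utterances
    ]
    total = sum(utilities)
    return {u: w / total for u, w in zip(utterances, utilities)}


# Hypothetical scene: each referent is a tuple of attribute words.
referents = [("blue", "chair"), ("blue", "desk"), ("red", "chair")]
utterances = ["blue", "chair", "desk"]


def meaning(utterance, referent):
    """Truth-conditional meaning: a word is true of a referent that has it."""
    return utterance in referent


speaker = pragmatic_speaker(("blue", "desk"), utterances, referents, meaning)
print(speaker)
```

For the blue desk, "desk" identifies the target uniquely while "blue" is shared with the blue chair, so the pragmatic speaker puts twice as much probability on "desk". Scores of this kind are what the study compares against the LLM's scores for the same candidate utterances.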
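[NLP-43] LLMs and the Madness of Crowds
[Quick read]: This paper studies the patterns of incorrect answers that large language models (LLMs) produce during evaluation. The approach analyzes the non-intuitive behavior of these errors, measures the similarity between LLMs, and builds a taxonomy based on error correlations. The study finds that incorrect answers are not randomly distributed but are systematically correlated across models, which gives new insight into the underlying structure of, and relationships among, LLMs.
Link: https://arxiv.org/abs/2411.01539
Authors: William F. Bradley
Keywords: large language models, incorrect answers produced, answers produced, produced by large, large language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures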
Abstract:We investigate the patterns of incorrect answers produced by large language models (LLMs) during evaluation. These errors exhibit highly non-intuitive behaviors unique to each model. By analyzing these patterns, we measure the similarities between LLMs and construct a taxonomy that categorizes them based on their error correlations. Our findings reveal that the incorrect responses are not randomly distributed but systematically correlated across models, providing new insights into the underlying structures and relationships among LLMs.
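One simple way to quantify whether two models' errors are correlated is to look only at the questions both models answer incorrectly and measure how often they choose the same wrong option. The sketch below assumes multiple-choice questions and uses invented answer sheets. The abstract does not specify the paper's similarity measure, so this is an illustration of the idea only.

```python
def shared_error_agreement(answers_a, answers_b, gold):
    """Among questions both models get wrong, the fraction where they pick
    the same wrong option. Returns None if there are no shared errors."""
    both_wrong = [
        (a, b)
        for a, b, g in zip(answers_a, answers_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return None
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Invented answers to six four-option questions.
gold = ["A", "B", "C", "D", "A", "B"]
model_x = ["A", "C", "D", "D", "B", "A"]
model_y = ["A", "C", "D", "A", "B", "C"]

agreement = shared_error_agreement(model_x, model_y, gold)
print(f"agreement on shared errors = {agreement:.2f}")
```

With four options there are three wrong choices per question, so two models that err independently and uniformly would agree about one third of the time. An agreement rate well above that would point to systematically correlated errors, and a matrix of such pairwise rates could feed a clustering step to build a taxonomy.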
[NLP-44] Enhancing LLM Evaluations: The Garbling Trick
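[Quick read]: This paper addresses the problem that, as large language models (LLMs) improve, traditional evaluation metrics saturate and no longer distinguish between models. The key is a general method that turns an existing LLM evaluation into a series of progressively harder tasks; these enhanced evaluations emphasize reasoning ability and reveal performance differences that are hard to see in the original evaluation. The paper demonstrates the method by creating a new multiple-choice test corpus, extending it into a family of evaluations, and comparing the reasoning abilities of several models, in particular OpenAI's o1-preview and Google's gemini-pro-1.5-002.
Link: https://arxiv.org/abs/2411.01533
Authors: William F. Bradley
Keywords: traditional evaluation metrics, evaluation metrics tend, increasingly powerful, tend to saturate, making it challenging
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures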
Abstract: As large language models (LLMs) become increasingly powerful, traditional evaluation metrics tend to saturate, making it challenging to distinguish between models based on their performance. We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks. These enhanced evaluations emphasize reasoning capabilities and can reveal relative performance differences that are not apparent in the original assessments. To demonstrate the effectiveness of our approach, we create a new multiple-choice test corpus, extend it into a family of evaluations, and assess a collection of LLMs. Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI’s o1-preview and Google’s gemini-pro-1.5-002.
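The abstract does not spell out the transformation, but the title suggests corrupting ("garbling") the evaluation text at increasing rates. The sketch below is one hypothetical way to build such a family of progressively harder variants: each letter is replaced by a random lowercase letter with probability `rate`. The function name, the character-level scheme, and the rates are assumptions, not the paper's procedure.

```python
import random
import string


def garble(text, rate, seed=0):
    """Replace each alphabetic character with a random lowercase letter with
    probability `rate`; other characters are kept. Seeded for repeatability."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


passage = "The capital of France is Paris."
# One evaluation variant per garbling rate; rate 0.0 is the original item.
variants = {rate: garble(passage, rate) for rate in (0.0, 0.3, 0.6)}
for rate, text in variants.items():
    print(rate, text)
```

Scoring each model on every variant gives an accuracy-versus-rate curve instead of a single number, which is one way saturated benchmarks could be made to separate models again.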
[NLP-45] DAG: Dictionary-Augmented Generation for Disambiguation of Sentences in Endangered Uralic Languages using ChatGPT
[Quick read]: This paper addresses lemma disambiguation in the endangered languages Erzya and Skolt Sami, in which ChatGPT is not proficient. The key is to augment the prompt with dictionary translations of the candidate lemmas into a majority language (Finnish in this paper). The approach reaches 50% accuracy for Skolt Sami and 41% accuracy for Erzya.
Link: https://arxiv.org/abs/2411.01531
Authors: Mika Hämäläinen
Keywords: endangered languages ChatGPT, Skolt Sami, disambiguate lemmas, endangered languages, languages ChatGPT
Categories: Computation and Language (cs.CL)
Comments: IWCLUL 2024
Abstract:We showcase that ChatGPT can be used to disambiguate lemmas in two endangered languages ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our prompt by providing dictionary translations of the candidate lemmas to a majority language - Finnish in our case. This dictionary augmented generation approach results in 50% accuracy for Skolt Sami and 41% accuracy for Erzya. On a closer inspection, many of the error types were of the kind even an untrained human annotator would make.
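The dictionary-augmented prompt can be pictured as a template that lists each candidate lemma together with its majority-language translations. The sketch below only assembles such a prompt string. The wording of the template, the function name `build_prompt`, and the placeholder lemmas and glosses are hypothetical, since the abstract does not give the actual prompt.

```python
def build_prompt(sentence, target_word, candidates):
    """Assemble a disambiguation prompt.

    `candidates` is a list of (lemma, translations) pairs, where
    `translations` holds the dictionary glosses in the majority language.
    """
    lines = [
        f"Sentence: {sentence}",
        f"Ambiguous word: {target_word}",
        "Candidate lemmas with Finnish dictionary translations:",
    ]
    for i, (lemma, translations) in enumerate(candidates, start=1):
        lines.append(f"{i}. {lemma}: {', '.join(translations)}")
    lines.append("Answer with the number of the lemma that fits the sentence.")
    return "\n".join(lines)


# Placeholder values, not real Erzya or Skolt Sami data.
prompt = build_prompt(
    sentence="<sentence in the endangered language>",
    target_word="<word form>",
    candidates=[
        ("LEMMA_A", ["gloss_a1", "gloss_a2"]),
        ("LEMMA_B", ["gloss_b1"]),
    ],
)
print(prompt)
```

The design choice is that the model never needs to know the endangered language's vocabulary directly: it reasons over glosses in a language it handles well and only has to pick among numbered options.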
[NLP-46] SinaTools: Open Source Toolkit for Arabic Natural Language Processing
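[Quick read]: This paper introduces SinaTools, an open-source Python package for Arabic natural language processing and understanding. SinaTools is a unified package that can be integrated into a system workflow and supports a range of NLP tasks, including flat and nested named entity recognition (NER), word sense disambiguation (WSD), semantic relatedness, synonym extraction and evaluation, lemmatization, part-of-speech tagging, and root tagging, together with helper utilities for corpus processing, text stripping, and diacritic-aware word matching. The reported benchmark results are flat NER (87.33%), nested NER (89.42%), WSD (82.63%), semantic relatedness (0.49 Spearman rank), lemmatization (90.5%), and POS tagging (97.5%), which the authors report as outperforming similar tools on these tasks.
Link: https://arxiv.org/abs/2411.01523
Authors: Tymaa Hammouda, Mustafa Jarrar, Mohammed Khalilia
Keywords: Arabic natural language, open-source Python package, Named Entity Recognition, natural language processing, Semantic Relatedness
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures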
Abstract:We introduce SinaTools, an open-source Python package for Arabic natural language processing and understanding. SinaTools is a unified package allowing people to integrate it into their system workflow, offering solutions for various tasks such as flat and nested Named Entity Recognition (NER), fully-flagged Word Sense Disambiguation (WSD), Semantic Relatedness, Synonymy Extractions and Evaluation, Lemmatization, Part-of-speech Tagging, Root Tagging, and additional helper utilities such as corpus processing, text stripping methods, and diacritic-aware word matching. This paper presents SinaTools and its benchmarking results, demonstrating that SinaTools outperforms all similar tools on the aforementioned tasks, such as Flat NER (87.33%), Nested NER (89.42%), WSD (82.63%), Semantic Relatedness (0.49 Spearman rank), Lemmatization (90.5%), POS tagging (97.5%), among others. SinaTools can be downloaded from (this https URL).
[NLP-47] Integration of Large Vision Language Models for Efficient Post-disaster Damage Assessment and Reporting
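[Quick read]: This paper addresses the delayed actions and the human and economic losses that human limitations cause in traditional natural disaster response. The key is DisasTeller, a framework powered by multiple large vision language models (LVLMs) that coordinates four specialized LVLM agents, with GPT-4 as the core model, to automate post-disaster management tasks such as on-site assessment, emergency alerts, resource allocation, and recovery planning. DisasTeller reduces human execution time, optimizes resource distribution, and simplifies access to disaster management processes for non-experts, with the potential to improve resilience and resource access in underdeveloped regions.
Link: https://arxiv.org/abs/2411.01511
Authors: Zhaohui Chen, Elyas Asadi Shamsabadi, Sheng Jiang, Luming Shen, Daniel Dias-da-Costa
Keywords: involves significant coordinated, significant coordinated teamwork, response involves significant, Large Vision Language, Vision Language Models
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures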
Abstract:Traditional natural disaster response involves significant coordinated teamwork where speed and efficiency are key. Nonetheless, human limitations can delay critical actions and inadvertently increase human and economic losses. Agentic Large Vision Language Models (LVLMs) offer a new avenue to address this challenge, with the potential for substantial socio-economic impact, particularly by improving resilience and resource access in underdeveloped regions. We introduce DisasTeller, the first multi-LVLM-powered framework designed to automate tasks in post-disaster management, including on-site assessment, emergency alerts, resource allocation, and recovery planning. By coordinating four specialised LVLM agents with GPT-4 as the core model, DisasTeller autonomously implements disaster response activities, reducing human execution time and optimising resource distribution. Our evaluations through both LVLMs and humans demonstrate DisasTeller’s effectiveness in streamlining disaster response. This framework not only supports expert teams but also simplifies access to disaster management processes for non-experts, bridging the gap between traditional response methods and LVLM-driven efficiency.
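[NLP-48] Sample-Efficient Alignment for LLMs
[Quick read]: This paper addresses how to align large language models (LLMs) with human preferences efficiently under a limited budget of online feedback. The key is to formulate LLM alignment as a contextual dueling bandit problem and, on that basis, to introduce a unified algorithm based on Thompson sampling; the practical agent that implements it is called SEA (Sample-Efficient Alignment). The algorithm gains its sample efficiency through online active exploration, and experiments across three model scales and three preference learning algorithms show that SEA is more sample-efficient than recent active exploration methods for LLMs.
Link: https://arxiv.org/abs/2411.01493
Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin
Keywords: aligning large language, budgeted online feedback, efficiently aligning large, large language models, aligning large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: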
Abstract:We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle’s preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.
[NLP-49] Domain-specific Guided Summarization for Mental Health Posts ACL
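[Quick read]: This paper addresses the need, in domain-specific settings and especially mental health, for abstractive summarization techniques that can handle specialized content and produce summaries that are domain-relevant and faithful to the source. The key is a guided summarizer with a dual encoder and an adapted decoder that uses novel domain-specific guidance signals, namely mental health terminology and contextually rich sentences from the source document, to align closely with the content and context of the guidance and so generate a domain-relevant summary. The paper also proposes a post-editing correction model that rectifies errors in the generated summary, improving its detailed consistency with the original content.
Link: https://arxiv.org/abs/2411.01485
Authors: Lu Qian, Yuqi Wang, Zimu Wang, Haiyang Zhang, Wei Wang, Ting Yu, Anh Nguyen
Keywords: abstractive summarization requires, summarization requires advanced, requires advanced techniques, advanced techniques adept, handling specialized content
Categories: Computation and Language (cs.CL)
Comments: Accepted at PACLIC 2024. Camera-ready version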
Abstract:In domain-specific contexts, particularly mental health, abstractive summarization requires advanced techniques adept at handling specialized content to generate domain-relevant and faithful summaries. In response to this, we introduce a guided summarizer equipped with a dual-encoder and an adapted decoder that utilizes novel domain-specific guidance signals, i.e., mental health terminologies and contextually rich sentences from the source document, to enhance its capacity to align closely with the content and context of guidance, thereby generating a domain-relevant summary. Additionally, we present a post-editing correction model to rectify errors in the generated summary, thus enhancing its consistency with the original content in detail. Evaluation on the MentSum dataset reveals that our model outperforms existing baseline models in terms of both ROUGE and FactCC scores. Although the experiments are specifically designed for mental health posts, the methodology we’ve developed offers broad applicability, highlighting its versatility and effectiveness in producing high-quality domain-specific summaries.
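[NLP-50] Teaching Models to Improve on Tape
[Quick read]: This paper addresses the poor performance of large language models (LLMs) when they are asked to generate content under specific constraints. The key is a reinforcement learning (RL) framework called CORGI (Controlled Generation with RL for Guided Interaction), which simulates interaction sessions and rewards the model according to its ability to satisfy the constraints. Trained on unlabeled data, CORGI consistently outperforms a baseline RL method that does not incorporate conversational feedback, and its interactive framework enables meta-learning, so the model generalizes better to guided interaction in new tasks.
Link: https://arxiv.org/abs/2411.01483
Authors: Liat Bezalel, Eyal Orgad, Amir Globerson
Keywords: Large Language Models, Large Language, Language Models, struggle when prompted, prompted to generate
Categories: Computation and Language (cs.CL)
Comments: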
Abstract: Large Language Models (LLMs) often struggle when prompted to generate content under specific constraints. However, in such cases it is often easy to check whether these constraints are satisfied or violated. Recent works have shown that LLMs can benefit from such "corrective feedback". Here we claim that this skill of LLMs can be significantly enhanced via training. We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks using unlabeled training data. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. Furthermore, CORGI’s interactive framework enables meta-learning, allowing the LLM to generalize better to guided interaction in new tasks. Our results clearly show that conversational optimization, when combined with reinforcement learning, significantly improves the effectiveness of LLMs in controlled generation contexts.
[NLP-51] DPCL-Diff: The Temporal Knowledge Graph Reasoning based on Graph Node Diffusion Model with Dual-Domain Periodic Contrastive Learning
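[Quick read]: This paper addresses the prediction of future events in temporal knowledge graph (TKG) reasoning, especially events with sparse historical interactions. The key is DPCL-Diff, a model that combines a graph node diffusion model (GNDiff) with dual-domain periodic contrastive learning (DPCL). GNDiff adds noise to sparsely related events to simulate new events and generates high-quality data, which strengthens the model's ability to reason about new events. DPCL maps periodic and non-periodic event entities to Poincaré and Euclidean spaces and uses the properties of those spaces to distinguish similar periodic events. Experiments show that DPCL-Diff significantly outperforms state-of-the-art TKG models in event prediction.
Link: https://arxiv.org/abs/2411.01477
Authors: Yukun Cao, Lisheng Wang, Luobing Huang
Keywords: Temporal knowledge graph, Temporal knowledge, infers future missing, future missing facts, essential and challenging
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures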
Abstract:Temporal knowledge graph (TKG) reasoning that infers future missing facts is an essential and challenging task. Predicting future events typically relies on closely related historical facts, yielding more accurate results for repetitive or periodic events. However, for future events with sparse historical interactions, the effectiveness of this method, which focuses on leveraging high-frequency historical information, diminishes. Recently, the capabilities of diffusion models in image generation have opened new opportunities for TKG reasoning. Therefore, we propose a graph node diffusion model with dual-domain periodic contrastive learning (DPCL-Diff). Graph node diffusion model (GNDiff) introduces noise into sparsely related events to simulate new events, generating high-quality data that better conforms to the actual distribution. This generative mechanism significantly enhances the model’s ability to reason about new events. Additionally, the dual-domain periodic contrastive learning (DPCL) maps periodic and non-periodic event entities to Poincaré and Euclidean spaces, leveraging their characteristics to distinguish similar periodic events effectively. Experimental results on four public datasets demonstrate that DPCL-Diff significantly outperforms state-of-the-art TKG models in event prediction, demonstrating our approach’s effectiveness. This study also investigates the combined effectiveness of GNDiff and DPCL in TKG tasks.
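The two embedding spaces named in the abstract measure distance differently, which is what makes a dual-domain mapping useful. The sketch below compares the ordinary Euclidean distance with the standard Poincaré-ball distance, d(u, v) = arcosh(1 + 2·|u − v|² / ((1 − |u|²)(1 − |v|²))). The points are arbitrary, and how DPCL assigns entities to each space is not shown here.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball model; both points must lie
    strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


origin, mid = (0.0, 0.0), (0.5, 0.0)
near_edge_a, near_edge_b = (0.90, 0.0), (0.95, 0.0)

print("origin to mid:", euclidean_distance(origin, mid), poincare_distance(origin, mid))
print("near the edge:", euclidean_distance(near_edge_a, near_edge_b),
      poincare_distance(near_edge_a, near_edge_b))
```

Two points only 0.05 apart in Euclidean terms are far apart hyperbolically once they sit near the boundary of the ball. That stretching gives hyperbolic space more room to separate entities that look nearly identical in a flat embedding.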
[NLP-52] MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
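[Quick read]: This paper addresses the limited semantic information per token that byte-level tokenization causes in multilingual machine translation, particularly when encoding rules differ across languages. The key is Adaptive MultiScale-Headed Attention (Ada-MSHA), which adaptively selects and mixes attention heads, treating them as contextualization experts, to make contextualization more flexible and effective. Experiments show that the method outperforms existing methods without extensive manual tuning of hyperparameters and surpasses subword-based models on the Ted-59 dataset with fewer parameters.
Link: https://arxiv.org/abs/2411.01474
Authors: Langlin Huang, Mengyu Bu, Yang Feng
Keywords: Byte-based machine translation, massively multilingual settings, machine translation systems, Byte-based machine, shown significant potential
Categories: Computation and Language (cs.CL)
Comments: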
Abstract:Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages, enabling broad language scalability. However, byte-level tokenization results in sequences that are hard to interpret due to limited semantic information per byte. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. Nevertheless, variations in encoding rules across languages necessitate an adaptive approach for effective contextualization. To this end, we propose Adaptive MultiScale-Headed Attention (Ada-MSHA), adaptively selecting and mixing attention heads, which are treated as contextualization experts. This enhances the flexibility of contextualization scales and improves the potential to discover a better strategy than previous methods. Experiment results show that our method outperforms existing methods without extensive manual adjustment of hyper-parameters and surpasses subword-based models with fewer parameters in Ted-59 dataset. Our code is available at this https URL.
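The "variations in encoding rules across languages" can be seen directly from UTF-8: a single character becomes one to four byte tokens depending on its script. The short sketch below shows this with Python's built-in encoder. The sample characters are arbitrary, and the helper name `byte_tokens` is only for illustration.

```python
def byte_tokens(text):
    """Byte-level tokenization: the UTF-8 byte values of the text."""
    return list(text.encode("utf-8"))


samples = {"Latin": "a", "Cyrillic": "я", "CJK": "中", "Emoji": "🙂"}
lengths = {script: len(byte_tokens(ch)) for script, ch in samples.items()}

for script, ch in samples.items():
    print(f"{script:8s} {ch!r} -> {byte_tokens(ch)}")
```

Since one character spans a different number of byte tokens in each script, a fixed local context window covers different amounts of text per language. That mismatch is the motivation the abstract gives for choosing contextualization scales adaptively.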
[NLP-53] Classifier-guided Gradient Modulation for Enhanced Multimodal Learning NEURIPS2024
[Quick read]: This paper addresses the tendency of multimodal models to rely on a single modality: during training the model favors the modality from which it learns fastest, so the other modalities are underused. The key is Classifier-Guided Gradient Modulation (CGGM), which considers both the magnitude and the direction of the gradients to balance multimodal learning. In experiments on four multimodal datasets (UPMC-Food 101, CMU-MOSI, IEMOCAP, and BraTS 2021) covering classification, regression, and segmentation tasks, CGGM consistently outperforms the baselines and other state-of-the-art methods.
Link: https://arxiv.org/abs/2411.01409
Authors: Zirun Guo, Tao Jin, Jingyuan Chen, Zhou Zhao
Keywords: recent years, developed very fast, fast in recent, multimodal training process, balance multimodal learning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024
Abstract:Multimodal learning has developed very fast in recent years. However, during the multimodal training process, the model tends to rely on only one modality based on which it could learn faster, thus leading to inadequate use of other modalities. Existing methods to balance the training process always have some limitations on the loss functions, optimizers and the number of modalities and only consider modulating the magnitude of the gradients while ignoring the directions of the gradients. To solve these problems, in this paper, we present a novel method to balance multimodal learning with Classifier-Guided Gradient Modulation (CGGM), considering both the magnitude and directions of the gradients. We conduct extensive experiments on four multimodal datasets: UPMC-Food 101, CMU-MOSI, IEMOCAP and BraTS 2021, covering classification, regression and segmentation tasks. The results show that CGGM outperforms all the baselines and other state-of-the-art methods consistently, demonstrating its effectiveness and versatility. Our code is available at this https URL.
[NLP-54] Artificial Intelligence Driven Course Generation: A Case Study Using ChatGPT
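[Quick read]: This paper examines how generative AI, specifically ChatGPT, can be used to create educational content efficiently and with high quality. The approach is iterative: ChatGPT is prompted for translation, content expansion, practical examples, assignments, supplementary materials, and LaTeX formatting, and each part is verified immediately after generation to ensure accuracy. Post-generation analysis with Detectia and Turnitin is used to check originality. The resulting course was reviewed and approved by experts and university committees.
Link: https://arxiv.org/abs/2411.01369
Authors: Djaber Rouabhia
Keywords: explores Artificial Intelligence, study explores Artificial, Artificial Intelligence, explores Artificial, creating educational content
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 16 pages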
Abstract:This study explores Artificial Intelligence use, specifically ChatGPT, in creating educational content. The study aims to elaborate on using ChatGPT to create course materials. The main objective is to assess the efficiency, quality, and impact of AI-driven course generation, and to create a Multimedia Databases course as a case study. The study highlights the potential of AI to revolutionize educational content creation, making it more accessible, personalized, and efficient. The course content was generated in less than one day through iterative methods, using prompts for translation, content expansion, practical examples, assignments, supplementary materials, and LaTeX formatting. Each part was verified immediately after generation to ensure accuracy. Post-generation analysis with Detectia and Turnitin showed similarity rates of 8.7% and 13%, indicating high originality. Experts and university committees reviewed and approved the course, with English university teachers praising its language quality. ChatGPT also created a well-structured and diversified exam for the module. Key findings reveal significant time efficiency, comprehensive content coverage, and high flexibility. The study underscores AI’s transformative potential in education, addressing challenges related to data privacy, technology dependence, content accuracy, and algorithmic biases. The conclusions emphasize the need for collaboration between educators, policymakers, and technology developers to harness AI’s benefits in education fully.
[NLP-55] Online and Offline Evaluations of Collaborative Filtering and Content Based Recommender Systems
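[Quick read]: This paper addresses how to evaluate recommender systems in practice, in particular how offline and online evaluation can identify the best-performing recommendation algorithm for a given dataset and system scale. The approach applies several algorithms (content-based, collaborative filtering, trend-based, and hybrid methods) and evaluates them on a large-scale deployment covering about 70 Iranian websites and handling roughly 300 requests per second. Evaluation includes manual evaluation, offline tests (accuracy and ranking metrics such as hit-rate@k and nDCG), and online tests (click-through rate, CTR); the paper also analyzes and proposes methods to address cold-start and popularity bias.
Link: https://arxiv.org/abs/2411.01354
Authors: Ali Elahi, Armin Zirak
Keywords: discover relevant items, efficiently discover relevant, users efficiently discover, relevant items, applications designed
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 9 figures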
Abstract:Recommender systems are widely used AI applications designed to help users efficiently discover relevant items. The effectiveness of such systems is tied to the satisfaction of both users and providers. However, user satisfaction is complex and cannot be easily framed mathematically using information retrieval and accuracy metrics. While many studies evaluate accuracy through offline tests, a growing number of researchers argue that online evaluation methods such as A/B testing are better suited for this purpose. We have employed a variety of algorithms on different types of datasets divergent in size and subject, producing recommendations in various platforms, including media streaming services, digital publishing websites, e-commerce systems, and news broadcasting networks. Notably, our target websites and datasets are in Persian (Farsi) language. This study provides a comparative analysis of a large-scale recommender system that has been operating for the past year across about 70 websites in Iran, processing roughly 300 requests per second collectively. The system employs user-based and item-based recommendations using content-based, collaborative filtering, trend-based methods, and hybrid approaches. Through both offline and online evaluations, we aim to identify where these algorithms perform most efficiently and determine the best method for our specific needs, considering the dataset and system scale. Our methods of evaluation include manual evaluation, offline tests including accuracy and ranking metrics like hit-rate@k and nDCG, and online tests consisting of click-through rate (CTR). Additionally we analyzed and proposed methods to address cold-start and popularity bias. 
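The offline ranking metrics named in the abstract, hit-rate@k and nDCG, can be illustrated with a minimal sketch. It assumes binary relevance (an item is either relevant or not); the function names and toy data are hypothetical and are not taken from the paper.

```python
import math


def hit_rate_at_k(ranked_items, relevant_items, k):
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), and so on.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k of them) first.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (assumption): one recommendation list and the items the user clicked.
ranked = ["a", "b", "c", "d"]
relevant = {"c", "x"}
print(hit_rate_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

In practice these per-user values would be averaged over all test users; an online metric such as CTR has to be measured on live traffic instead.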
[NLP-56] AMREx: AMR for Explainable Fact Verification EMNLP
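[Quick read]: This paper addresses the spread of misinformation driven by the large volume of information circulating on social media, and proposes AMREx, a fact verification and explanation system based on Abstract Meaning Representation (AMR). The key to the solution is using the AMR evaluation metric Smatch to measure meaning containment and textual similarity, which yields partially explainable verification results. AMREx clarifies its veracity predictions through an explainable AMR node mapping. It is evaluated on the FEVER and AVeriTeC datasets and surpasses the AVeriTeC baseline accuracy, showing its effectiveness for real-world claim verification. In addition, AMREx output can be used to prompt large language models (LLMs) to generate natural-language explanations, reducing the likelihood of hallucinations.
Link: https://arxiv.org/abs/2411.01343
Authors: Chathuri Jayaweera, Sangpil Youm, Bonnie Dorr
Keywords: social media networks, automatic fact verification, fact verification, Abstract Meaning Representation, spread of misinformation
Categories: Computation and Language (cs.CL)
Comments: This study implements, evaluates, and analyzes an Abstract Meaning Representation (AMR) based partially explainable system for fact verification/veracity classification. Accepted by EMNLP Workshop on Fact Extraction and VERification (FEVER) 2024, 11 pages, 7 figures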
Abstract:With the advent of social media networks and the vast amount of information circulating through them, automatic fact verification is an essential component to prevent the spread of misinformation. It is even more useful to have fact verification systems that provide explanations along with their classifications to ensure accurate predictions. To address both of these requirements, we implement AMREx, an Abstract Meaning Representation (AMR)-based veracity prediction and explanation system for fact verification using a combination of Smatch, an AMR evaluation metric to measure meaning containment and textual similarity, and demonstrate its effectiveness in producing partially explainable justifications using two community standard fact verification datasets, FEVER and AVeriTeC. AMREx surpasses the AVeriTeC baseline accuracy showing the effectiveness of our approach for real-world claim verification. It follows an interpretable pipeline and returns an explainable AMR node mapping to clarify the system’s veracity predictions when applicable. We further demonstrate that AMREx output can be used to prompt LLMs to generate natural-language explanations using the AMR mappings as a guide to lessen the probability of hallucinations.
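The idea of measuring meaning containment with a Smatch-style score can be sketched as a precision/recall/F1 computation over AMR triples. This is a simplified illustration, not the AMREx pipeline: real Smatch searches for the best variable alignment between the two graphs, whereas this sketch assumes the variables are already aligned. The function name and the toy triples are hypothetical.

```python
def triple_overlap_scores(claim_triples, evidence_triples):
    """Precision, recall, and F1 of claim triples against evidence triples.

    Assumption: variables in both graphs are already aligned, so triples
    can be compared by plain set intersection.
    """
    claim, evidence = set(claim_triples), set(evidence_triples)
    matched = len(claim & evidence)
    # Precision: share of the claim's meaning that the evidence contains.
    precision = matched / len(claim) if claim else 0.0
    recall = matched / len(evidence) if evidence else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Toy AMR-like triples (relation, source, target); purely illustrative.
claim = [
    ("instance", "b", "bear-02"),
    ("instance", "p", "person"),
    ("ARG1", "b", "p"),
]
evidence = claim + [
    ("instance", "c", "city"),
    ("location", "b", "c"),
]
precision, recall, f1 = triple_overlap_scores(claim, evidence)
print(precision, recall, f1)
```

Here every claim triple appears in the evidence, so precision is high even though recall is lower; the matched triples are also what a node mapping could surface as an explanation.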
[NLP-57] Can Multimodal Large Language Model Think Analogically?
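[Quick read]: This paper investigates the multimodal analogical reasoning capability of Multimodal Large Language Models (MLLMs). The key is to explore two facets: MLLM as an explainer and MLLM as a predictor. For the explainer, the paper proposes a unified prompt template and a method that uses the MLLM's comprehension capabilities to augment existing models; for the predictor, it tests whether an MLLM can directly solve multimodal analogical reasoning problems. Experiments show that the approach outperforms existing methods on popular datasets, providing preliminary evidence for the analogical reasoning capability of MLLMs.
Link: https://arxiv.org/abs/2411.01307
Authors: Diandian Guo, Cong Cao, Fangfang Yuan, Dakui Wang, Wei Ma, Yanbing Liu, Jianhui Fu
Keywords: MLLM, Large Language Model, multimodal analogical reasoning, Analogical reasoning, Multimodal Large Language
Categories: Computation and Language (cs.CL)
Comments: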
Abstract:Analogical reasoning, particularly in multimodal contexts, is the foundation of human perception and creativity. Multimodal Large Language Model (MLLM) has recently sparked considerable discussion due to its emergent capabilities. In this paper, we delve into the multimodal analogical reasoning capability of MLLM. Specifically, we explore two facets: MLLM as an explainer and MLLM as a predictor. In MLLM as an explainer, we primarily focus on whether MLLM can deeply comprehend multimodal analogical reasoning problems. We propose a unified prompt template and a method for harnessing the comprehension capabilities of MLLM to augment existing models. In MLLM as a predictor, we aim to determine whether MLLM can directly solve multimodal analogical reasoning problems. The experiments show that our approach outperforms existing methods on popular datasets, providing preliminary evidence for the analogical reasoning capability of MLLM.
[NLP-58] Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models
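[Quick read]: This paper addresses the limitation that current LLM evaluation methods depend on predefined reference outputs, which hinders flexible adaptation of benchmarks to rapidly evolving LLM capabilities and requires periodic benchmark updates. The key to the solution is Varco Arena, a reference-free benchmarking approach that directly compares LLM outputs across a diverse set of prompts and determines model rankings through a single-elimination tournament structure. It has two key advantages: (1) direct comparison, unmediated by reference text, orders competing models more effectively and yields more reliable rankings; (2) the reference-free approach makes it easier to update benchmark prompts, since no high-quality reference outputs are needed. Empirical results and simulation experiments show that Varco Arena aligns better, as measured by Spearman correlation, with the current Elo model for benchmarking LLMs than the current practice of using reference outputs as comparison anchors.
Link: https://arxiv.org/abs/2411.01281
Authors: Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim
Keywords: Large Language Models, Large Language, robust evaluation methodologies, advancement of Large, Varco Arena
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages for main body, 13 pages in total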
Abstract:The rapid advancement of Large Language Models (LLMs) necessitates robust evaluation methodologies. Current benchmarking approaches often rely on comparing model outputs against predefined prompts and reference outputs. Relying on predefined reference outputs hinders flexible adaptation of benchmarks to the rapidly evolving capabilities of LLMs. This limitation necessitates periodic efforts to prepare new benchmarks. To keep pace with rapidly evolving LLM capabilities, we propose a more flexible benchmarking approach. Our method, Varco Arena, provides reference-free benchmarking of LLMs in tournament style. Varco Arena directly compares LLM outputs across a diverse set of prompts, determining model rankings through a single-elimination tournament structure. This direct pairwise comparison offers two key advantages: (1) Direct comparison, unmediated by reference text, more effectively orders competing LLMs, resulting in more reliable rankings, and (2) reference-free approach to benchmarking adds flexibility in updating benchmark prompts by eliminating the need for quality references. Our empirical results, supported by simulation experiments, demonstrate that the Varco Arena tournament approach aligns better with the current Elo model for benchmarking LLMs. The alignment is measured in terms of Spearman correlation, showing improvement over current practice of benchmarking that use reference outputs as comparison anchors.
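The tournament idea can be sketched as follows: for each prompt, models are paired in a single-elimination bracket and a judge picks the better output in each match. This is a minimal sketch under stated assumptions, not the authors' implementation: the `judge` callable stands in for an LLM judge, ranking by total wins is a simplification chosen for illustration (the abstract does not specify the aggregation), and all names are hypothetical.

```python
import random


def run_tournament(models, prompt, judge, rng):
    """Play one single-elimination bracket; return a list of (winner, loser)."""
    bracket = list(models)
    rng.shuffle(bracket)
    matches = []
    while len(bracket) > 1:
        next_round = []
        # With an odd number of participants, one gets a bye into the next round.
        if len(bracket) % 2 == 1:
            next_round.append(bracket.pop())
        for i in range(0, len(bracket), 2):
            a, b = bracket[i], bracket[i + 1]
            winner = judge(prompt, a, b)
            loser = b if winner == a else a
            matches.append((winner, loser))
            next_round.append(winner)
        bracket = next_round
    return matches


def rank_models(models, prompts, judge, seed=0):
    """Rank models by total match wins over one tournament per prompt."""
    rng = random.Random(seed)
    wins = {m: 0 for m in models}
    for prompt in prompts:
        for winner, _ in run_tournament(models, prompt, judge, rng):
            wins[winner] += 1
    return sorted(models, key=lambda m: wins[m], reverse=True)


# Toy judge (assumption): picks the model with the higher hidden skill score.
toy_skill = {"A": 3, "B": 2, "C": 1, "D": 0}


def toy_judge(prompt, a, b):
    return a if toy_skill[a] > toy_skill[b] else b


print(rank_models(["A", "B", "C", "D"], ["p1", "p2", "p3"], toy_judge))
```

A bracket over n models needs only n - 1 judge calls per prompt, which is the main cost advantage over comparing every pair.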
[NLP-59] NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom
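[Quick read]: This paper addresses the challenges of administering the Cloze test at scale, in particular how to automate the assessment of semantic similarity in students' answers. The key is using word embedding (WE) models from Natural Language Processing (NLP), specifically models for Brazilian Portuguese (PT-BR), to measure the semantic similarity between expected and student-provided answers. Comparing GloVe with other models, the study finds that GloVe correlates most strongly with human judges' evaluations, supporting the potential of word embedding models to make large-scale Cloze test assessment more efficient.
Link: https://arxiv.org/abs/2411.01280
Authors: Túlio Sousa de Gois, Flávia Oliveira Freitas, Julian Tejada, Raquel Meister Ko. Freitag
Keywords: assessing text comprehension, Natural Language Processing, text comprehension proficiency, utilizing Natural Language, examines the applicability
Categories: Computation and Language (cs.CL)
Comments: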
Abstract:This study examines the applicability of the Cloze test, a widely used tool for assessing text comprehension proficiency, while highlighting its challenges in large-scale implementation. To address these limitations, an automated correction approach was proposed, utilizing Natural Language Processing (NLP) techniques, particularly word embeddings (WE) models, to assess semantic similarity between expected and provided answers. Using data from Cloze tests administered to students in Brazil, WE models for Brazilian Portuguese (PT-BR) were employed to measure the semantic similarity of the responses. The results were validated through an experimental setup involving twelve judges who classified the students’ answers. A comparative analysis between the WE models’ scores and the judges’ evaluations revealed that GloVe was the most effective model, demonstrating the highest correlation with the judges’ assessments. This study underscores the utility of WE models in evaluating semantic similarity and their potential to enhance large-scale Cloze test assessments. Furthermore, it contributes to educational assessment methodologies by offering a more efficient approach to evaluating reading proficiency.
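Scoring a filled gap by embedding similarity can be sketched with cosine similarity between the expected word and the student's answer. This is an illustrative sketch only: the three-dimensional vectors below are made up, whereas the study uses pretrained PT-BR word embedding models such as GloVe, and the scoring rule for exact matches and unknown words is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_gap(expected, answer, embeddings):
    """Score one gap: 1.0 for an exact match, otherwise embedding similarity.

    Assumption: words missing from the embedding table score 0.0.
    """
    if expected == answer:
        return 1.0
    if expected not in embeddings or answer not in embeddings:
        return 0.0
    return cosine_similarity(embeddings[expected], embeddings[answer])


# Made-up toy vectors; real vectors would come from a pretrained model.
toy_embeddings = {
    "casa": [0.9, 0.1, 0.0],
    "lar": [0.8, 0.2, 0.1],
    "carro": [0.0, 0.1, 0.9],
}
print(score_gap("casa", "lar", toy_embeddings))
print(score_gap("casa", "carro", toy_embeddings))
```

A threshold on this score, calibrated against human judges, would then decide whether a non-identical answer counts as acceptable.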
[NLP-60] An Innovative CGL-MHA Model for Sarcasm Sentiment Recognition Using the MindSpore Framework
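[Quick read]: This paper addresses automated sentiment analysis of sarcastic expressions on social media, in particular detecting sarcasm in user-generated content. The key is a hybrid model integrating Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Multi-Head Attention. The CNN captures local n-gram features, the GRU and LSTM layers model sequential dependencies and contextual information, and Multi-Head Attention sharpens the model's focus on relevant parts of the input, improving interpretability. Experiments show that the model outperforms traditional models on two sarcasm detection datasets (Headlines and Riloff).
Link: https://arxiv.org/abs/2411.01264
Authors: Zhenkai Qin, Qining Luo, Xunyi Nong
Keywords: automated sentiment analysis, introduces significant challenges, media introduces significant, Gated Recurrent Units, Convolutional Neural Networks
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: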
Abstract:The pervasive use of the Internet and social media introduces significant challenges to automated sentiment analysis, particularly for sarcastic expressions in user-generated content. Sarcasm conveys negative emotions through ostensibly positive or exaggerated language, complicating its detection within natural language processing tasks. To address this, we propose an innovative sarcasm detection model integrating Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Multi-Head Attention mechanisms. The CNN component captures local n-gram features, while GRU and LSTM layers model sequential dependencies and contextual information. Multi-Head Attention enhances the model’s focus on relevant parts of the input, improving interpretability. Experiments on two sarcasm detection datasets, Headlines and Riloff, demonstrate that the model achieves an accuracy of 81.20% and an F1 score of 80.77% on Headlines, and an accuracy of 79.72% with an F1 score of 61.39% on Riloff, outperforming traditional models. These results validate the effectiveness of our hybrid approach for sarcasm detection in social media texts.
[NLP-61] Diversidade linguística e inclusão digital: desafios para uma ia brasileira
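[Quick read]: This paper examines the threat that generative AI poses to linguistic diversity. The key is examining the variety selection bias imposed by technological applications, and the vicious circle in which a variety becomes dominant and standardized because it has the linguistic documentation needed to train large language models, which in turn further entrenches its dominance. Drawing on the contributions of sociolinguistics, the paper argues for attention to and protection of non-dominant varieties to prevent further loss of linguistic diversity.
Link: https://arxiv.org/abs/2411.01259
Authors: Raquel Meister Ko Freitag
Keywords: generative AIs, coming under threat, human attribute, advance of generative, Linguistic diversity
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: In Portuguese. Paper accepted to LAAI-Ethics 2024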
Abstract:Linguistic diversity is a human attribute which, with the advance of generative AIs, is coming under threat. This paper, based on the contributions of sociolinguistics, examines the consequences of the variety selection bias imposed by technological applications and the vicious circle of preserving a variety that becomes dominant and standardized because it has linguistic documentation to feed the large language models for machine learning.
[NLP-62] PMoL: Parameter Efficient MoE for Preference Mixing of LLM Alignment
[Quick read]: This paper addresses the drop in preference alignment of large language models (LLMs) that occurs when Reinforcement Learning from Human Feedback (RLHF) has to handle multiple competing preferences. The key is Preference Mixture of LoRAs (PMoL), an architecture that combines Mixture of Experts (MoE) with Low Rank Adaptor (LoRA) and can adapt to any number of preferences to mix. An expert group soft loss gives the MoE the ability to mix preferences, and PMoL achieves better preference alignment than baseline methods at lower training cost.
Link: https://arxiv.org/abs/2411.01245
Authors: Dongxu Liu, Bing Xu, Yinzhuo Chen, Bufan Xu, Wenpeng Lu, Muyun Yang, Tiejun Zhao
Keywords: Reinforcement Learning, Human Feedback, large language models, Learning from Human, large language
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement Learning from Human Feedback (RLHF) has been proven to be an effective method for preference alignment of large language models (LLMs) and is widely used in the post-training process of LLMs. However, RLHF struggles with handling multiple competing preferences. This leads to a decrease in the alignment of LLMs with human preferences. To address this issue, we propose Preference Mixture of LoRAs (PMoL) from the perspective of model architecture, which can adapt to any number of preferences to mix. PMoL combines Mixture of Experts (MoE) and Low Rank Adaptor (LoRA). This architecture is innovatively applied to the research of preference alignment and has achieved significant performance improvement. The expert group soft loss is used to enable MoE with the ability to mix preferences. Through comprehensive evaluation by the reward model and GPT-4o, the experiment results show that PMoL has superior preference mixing capabilities compared to baseline methods. PMoL achieves better preference alignment with lower training costs.
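Combining Mixture of Experts with LoRA can be sketched generically as a frozen base layer plus a gated sum of low-rank updates: y = W0 x + sum_e g_e * B_e (A_e x), with gates g = softmax(W_r x). This is a generic MoE-of-LoRA forward pass written in plain Python for illustration; it is not the PMoL architecture itself, it omits the expert group soft loss, and all names and toy weights are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_loras(x, base_weight, experts, router_weight):
    """Frozen base output plus a router-weighted sum of low-rank updates.

    Each expert is a pair (A, B): A has shape (r x d_in), B has shape (d_out x r).
    """
    gates = softmax(matvec(router_weight, x))
    y = matvec(base_weight, x)
    for gate, (a, b) in zip(gates, experts):
        update = matvec(b, matvec(a, x))  # rank-r update B (A x)
        y = [yi + gate * ui for yi, ui in zip(y, update)]
    return y, gates


# Toy setup (assumption): identity base layer and two rank-1 experts.
x = [1.0, 2.0]
base = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
]
router = [[0.0, 0.0], [0.0, 0.0]]  # zero router weights give uniform gates
y, gates = mixture_of_loras(x, base, experts, router)
print(y, gates)
```

Only the small A, B, and router matrices would be trained in such a setup, which is where the parameter efficiency comes from.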
[NLP-63] B4: A Black-Box Scrubbing Attack on LLM Watermarks
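[Quick read]: This paper addresses the robustness of watermarking for LLM-generated content against black-box attacks. The key is a black-box watermark scrubbing attack called $\mathcal{B}^4$. Specifically, the paper formulates the scrubbing attack as a constrained optimization problem whose objectives are captured by two distributions, a Watermark Distribution and a Fidelity Distribution. The problem is solved approximately using two proxy distributions, and experiments across 12 different settings show that $\mathcal{B}^4$ outperforms other baselines.
Link: https://arxiv.org/abs/2411.01222
Authors: Baizhou Huang, Xiao Pu, Xiaojun Wan
Keywords: embedding imperceptible patterns, LLM-generated content detection, imperceptible patterns, prominent technique, technique for LLM-generated
Categories: Computation and Language (cs.CL)
Comments: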
Abstract:Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite supreme performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitate knowledge about hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we here propose $\mathcal{B}^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $\mathcal{B}^4$ compared with other baselines.
[NLP-64] One Arrow Many Targets: Probing LLMs for Multi-Attribute Controllable Text Summarization
[Quick read]: This paper addresses the research gap in Multi-Attribute Controllable Summarization (MACS), in particular how to combine multiple controllable attributes effectively when using large language models (LLMs) for controllable summarization. The key is fine-tuning with low-rank adapters and experimentally evaluating different adapter fine-tuning strategies. The paper also proposes a novel hierarchical adapter fusion technique that integrates learnings from two distinct controllable attributes, with the aim of improving multi-attribute controllable summarization.
Link: https://arxiv.org/abs/2411.01213
Authors: Tathagato Roy, Rahul Mishra
Keywords: natural language processing, Text summarization, controllable summarization, controllable summarization tailored, MACS task
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Text summarization is a well-established task within the natural language processing (NLP) community. However, the focus on controllable summarization tailored to user requirements has gained traction only recently. While several efforts explore controllability in text summarization, the investigation of Multi-Attribute Controllable Summarization (MACS) remains limited. This work addresses this gap by examining the MACS task through the lens of large language models (LLMs), using various learning paradigms, particularly low-rank adapters. We experiment with different popular adapter fine-tuning strategies to assess the effectiveness of the resulting models in retaining cues and patterns associated with multiple controllable attributes. Additionally, we propose and evaluate a novel hierarchical adapter fusion technique to integrate learnings from two distinct controllable attributes. Subsequently, we present our findings, discuss the challenges encountered, and suggest potential avenues for advancing the MACS task.
[NLP-65] PRIMO: Progressive Induction for Multi-hop Open Rule Generation COLING2024
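[Quick read]: This paper addresses the fact that existing open rule generation methods ignore multi-hop scenarios, which leads to logical inconsistencies between premise and hypothesis atoms and to semantic duplication among generated rule atoms. The key is PRIMO, a progressive multi-stage open rule generation method. PRIMO introduces ontology information during the rule generation stage to reduce ambiguity and improve rule accuracy. Its multi-stage structure consists of generation, extraction, and ranking modules that exploit the latent knowledge in the language model. Reinforcement learning from human feedback further optimizes the model and improves its understanding of commonsense knowledge. Experiments show that, compared to baseline models, PRIMO significantly improves rule quality and diversity while reducing the repetition rate of rule atoms.
Link: https://arxiv.org/abs/2411.01205
Authors: Jianyu Liu, Sheng Bi, Guilin Qi
Keywords: Open rule refer, Open rule, open rule generation, real world, rule
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: COLING 2024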
Abstract:Open rules refer to implications from premise atoms to hypothesis atoms, which capture various relations between instances in the real world. Injecting open rule knowledge into the machine helps to improve the performance of downstream tasks such as dialogue and relation extraction. Existing approaches focus on single-hop open rule generation, ignoring multi-hop scenarios, leading to logical inconsistencies between premise and hypothesis atoms, as well as semantic duplication of generated rule atoms. To address these issues, we propose a progressive multi-stage open rule generation method called PRIMO. We introduce ontology information during the rule generation stage to reduce ambiguity and improve rule accuracy. PRIMO constructs a multi-stage structure consisting of generation, extraction, and ranking modules to fully leverage the latent knowledge within the language model across multiple dimensions. Furthermore, we employ reinforcement learning from human feedback to further optimize the model, enhancing its understanding of commonsense knowledge. Experiments show that compared to baseline models, PRIMO significantly improves rule quality and diversity while reducing the repetition rate of rule atoms.
[NLP-66] Transfer Learning for Finetuning Large Language Models NEURIPS2024
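[Quick read]: This paper addresses the complex choices practitioners face when efficiently finetuning large language models for specific tasks. The key is using transfer learning to carry knowledge about configurations from related finetuning tasks over to a new task, reducing that complexity. Specifically, the authors meta-learn performance and cost surrogate models for grey-box meta-optimization from a new meta-dataset, and propose relying on transfer learning alone: instead of task-specific Bayesian optimization, knowledge transferred from related tasks takes priority over task-specific feedback. Experiments on eight synthetic question-answer datasets and a meta-dataset of 1,800 finetuning runs show that this approach outperforms zero-shot, default finetuning, and meta-optimization baselines.
Link: https://arxiv.org/abs/2411.01195
Authors: Tobias Strangmann, Lennart Purucker, Jörg K.H. Franke, Ivo Rapant, Fabio Ferreira, Frank Hutter
Keywords: large language models, large language, language models, language models expands, increasingly crucial
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2024 Workshop on Adaptive Foundation Models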
Abstract:As the landscape of large language models expands, efficiently finetuning for specific tasks becomes increasingly crucial. At the same time, the landscape of parameter-efficient finetuning methods rapidly expands. Consequently, practitioners face a multitude of complex choices when searching for an optimal finetuning pipeline for large language models. To reduce the complexity for practitioners, we investigate transfer learning for finetuning large language models and aim to transfer knowledge about configurations from related finetuning tasks to a new task. In this work, we transfer learn finetuning by meta-learning performance and cost surrogate models for grey-box meta-optimization from a new meta-dataset. Counter-intuitively, we propose to rely only on transfer learning for new datasets. Thus, we do not use task-specific Bayesian optimization but prioritize knowledge transferred from related tasks over task-specific feedback. We evaluate our method on eight synthetic question-answer datasets and a meta-dataset consisting of 1,800 runs of finetuning Microsoft’s Phi-3. Our transfer learning is superior to zero-shot, default finetuning, and meta-optimization baselines. Our results demonstrate the transferability of finetuning to adapt large language models more effectively.
[NLP-67] Swan and ArabicMTEB: Dialect-Aware Arabic-Centric Cross-Lingual and Cross-Cultural Embedding Models and Benchmarks
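[Quick read]: This paper addresses the performance of Arabic embedding models across languages, dialects, domains, and cultural contexts. The key is the Swan model family: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral. The proposed ArabicMTEB benchmark suite evaluates cross-lingual, multi-dialectal, multi-domain, and multi-cultural performance. Swan-Large outperforms Multilingual-E5-large on most Arabic tasks, while Swan-Small consistently surpasses Multilingual-E5 base; the models show dialectal and cultural awareness while offering significant monetary efficiency.
Link: https://arxiv.org/abs/2411.01192
Authors: Gagan Bhatia, El Moatez Billah Nagoudi, Abdellah El Mekki, Fakhraddin Alwajih, Muhammad Abdul-Mageed
Keywords: addressing both small-scale, large-scale use cases, Arabic, small-scale and large-scale, embedding models centred
Categories: Computation and Language (cs.CL)
Comments: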
Abstract:We introduce Swan, a family of embedding models centred around the Arabic language, addressing both small-scale and large-scale use cases. Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model. To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance, covering eight diverse tasks and spanning 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while the Swan-Small consistently surpasses Multilingual-E5 base. Our extensive evaluations demonstrate that Swan models are both dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing. Our models and benchmark will be made publicly accessible for research.
[NLP-68] CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research
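[Quick read]: This paper addresses the lack of command-line datasets in cybersecurity caused by privacy and regulation concerns. The key is CyPHER, the first dataset of similar command lines, with 28,520 similar command-line pairs generated by large language models (LLMs) for training and 2,807 pairs sourced from authentic command-line data for testing. The paper also proposes CmdCaliper, a command-line embedding model that computes semantic similarity between command lines; evaluations show that even the smallest CmdCaliper (30 million parameters) outperforms state-of-the-art sentence embedding models on tasks such as malicious command-line detection and similar command-line retrieval. The study explores the feasibility of LLM-based data generation in cybersecurity and releases the dataset, model weights, and all program code.
Link: https://arxiv.org/abs/2411.01176
Authors: Sian-Yao Huang, Cheng-Lin Yang, Che-Yu Lin, Chun-Ying Huang
Keywords: comprehensive datasets due, research addresses command-line, regulation concerns, research addresses, field obstructed
Categories: Computation and Language (cs.CL)
Comments: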
Abstract:This research addresses command-line embedding in cybersecurity, a field obstructed by the lack of comprehensive datasets due to privacy and regulation concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set, comprising 28,520 similar command-line pairs, is generated using a set of large language models (LLMs). Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, enabling the computation of semantic similarity between command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval). Our study explores the feasibility of data generation using LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, embedding models’ weights and all program codes to the public. This advancement paves the way for more effective command-line embedding for future researchers.
主题:计算与语言 (cs.CL)
引用方式:arXiv:2411.01176 [cs.CL](或 arXiv:2411.01176v1 [cs.CL] 用于此版本)
https://doi.org/10.48550/arXiv.2411.01176
通过 DataCite 发布的 arXiv DOI(待注册)
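[NLP-69] Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
TL;DR: This paper addresses the weak performance of large language models (LLMs) on non-English languages, particularly those that are low-resource for English-centric LLMs. The key to the solution is a new method called Dictionary Insertion Prompting (DIP). Given a non-English prompt, DIP looks up a dictionary and inserts the English counterparts of the words into the prompt, which improves translation into English and strengthens the model's English reasoning steps, ultimately yielding markedly better task performance on non-English languages. The method shows clear gains on math and commonsense reasoning tasks and is simple and computationally lightweight.
Link: https://arxiv.org/abs/2411.01141
Authors: Hongyuan Lu, Zixuan Li, Wai Lam
Keywords: primarily studies English-centric, Large Language Models, studies English-centric models, data for Large, present impressive performance
Categories: Computation and Language (cs.CL)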
Abstract: As current training data for Large Language Models (LLMs) are dominated by English corpora, they are English-centric and present impressive performance on English reasoning tasks. (Footnote: This paper primarily studies English-centric models, but our method could be universal by using the centric language in the dictionary for non-English-centric LLMs.) Yet, they usually suffer from lower performance in other languages. There are about 7,000 languages in the world, and many are low-resourced on English-centric LLMs. For the sake of people who primarily speak these languages, it is especially urgent to enable our LLMs in those languages. Model training is usually effective, but computationally expensive and requires experienced NLP practitioners. This paper presents a novel and simple yet effective method called Dictionary Insertion Prompting (DIP). When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the prompt for LLMs. It then enables better translation into English and better English model thinking steps, which leads to obviously better results. We experiment with about 200 languages from FLORES-200. Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from 4 existing English reasoning benchmarks such as GSM8K and AQuA. Despite its simplicity and light computational cost, we surprisingly found DIP effective on math and commonsense reasoning tasks on multiple open-source and closed-source LLMs. (Footnote: Our dictionaries, code, and synthetic benchmarks will be open-sourced to facilitate future research.)
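The dictionary lookup and insertion step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy Spanish-to-English dictionary, the function name `insert_dictionary_hints`, and the parenthesized insertion format are all assumptions made for this example, since the abstract does not specify how the English counterparts are formatted in the prompt.

```python
import re

# Hypothetical toy dictionary (Spanish -> English), for illustration only.
TOY_DICTIONARY = {"manzanas": "apples", "tiene": "has", "cuántas": "how many"}


def insert_dictionary_hints(prompt, dictionary):
    """Insert each known word's English counterpart, in parentheses, after the word.

    The parenthesized format is an assumption of this sketch.
    """
    def annotate(match):
        word = match.group(0)
        english = dictionary.get(word.lower())
        return f"{word} ({english})" if english else word

    # \w+ matches Unicode word characters in Python 3, so accented letters are kept.
    return re.sub(r"\w+", annotate, prompt)


augmented = insert_dictionary_hints("Ana tiene 3 manzanas.", TOY_DICTIONARY)
print(augmented)  # Ana tiene (has) 3 manzanas (apples).
```

The augmented prompt would then be passed to the LLM in place of the original one; words missing from the dictionary are left unchanged.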
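[NLP-70] Do LLMs Know to Respect Copyright Notice? EMNLP2024
TL;DR: This paper asks whether large language models (LLMs) respect copyright information in user input that contains copyrighted material, and behave accordingly. The approach is a series of experiments that evaluate the extent to which language models may infringe copyright when processing such input. The paper stresses the need for further research and for ensuring that LLMs comply with copyright regulations when handling user input, to prevent unauthorized use or reproduction of protected content. It also releases a benchmark dataset for evaluating infringement behavior by LLMs and highlights the need for future alignment work.
Link: https://arxiv.org/abs/2411.01136
Authors: Jialiang Xu, Shenglan Li, Zhaozhuo Xu, Denghui Zhang
Keywords: Prior study shows, Prior study, Prior, LLMs, LLMs respect copyright
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024 main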
Abstract:Prior study shows that LLMs sometimes generate content that violates copyright. In this paper, we study another important yet underexplored problem, i.e., will LLMs respect copyright information in user input, and behave accordingly? The research problem is critical, as a negative answer would imply that LLMs will become the primary facilitator and accelerator of copyright infringement behavior. We conducted a series of experiments using a diverse set of language models, user prompts, and copyrighted materials, including books, news articles, API documentation, and movie scripts. Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights when processing user input containing protected material. This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations when handling user input to prevent unauthorized use or reproduction of protected content. We also release a benchmark dataset serving as a test bed for evaluating infringement behaviors by LLMs and stress the need for future alignment.
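[NLP-71] Infant Agent: A Tool-Integrated Logic-Driven Agent with Cost-Effective API Usage
TL;DR: This paper targets two main limitations of large language models (LLMs): autonomously solving real-world engineering problems and reasoning through complex logic problems. The key to the solution is Infant Agent, which integrates task-aware functions, operators, a hierarchical management system, and a memory retrieval mechanism. Together these components let LLMs sustain extended reasoning, handle complex multi-step tasks efficiently, and significantly reduce API costs. Experiments show that Infant Agent substantially raises GPT-4o's accuracy on the SWE-bench-lite dataset and in the AIME-2024 mathematics competition.
Link: https://arxiv.org/abs/2411.01114
Authors: Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, Qiuwu Chen
Keywords: real world engineering, world engineering problem, Infant Agent, primary limitations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)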
Abstract: Despite the impressive capabilities of large language models (LLMs), they currently exhibit two primary limitations. I: They struggle to autonomously solve real-world engineering problems. II: They remain challenged in reasoning through complex logic problems. To address these challenges, we developed the Infant Agent, integrating task-aware functions, operators, a hierarchical management system, and a memory retrieval mechanism. Together, these components enable large language models to sustain extended reasoning processes and handle complex, multi-step tasks efficiently, all while significantly reducing API costs. Using the Infant Agent, GPT-4o's accuracy on the SWE-bench-lite dataset rises from 0.33% to 30%, and in the AIME-2024 mathematics competition, it increases GPT-4o's accuracy from 13.3% to 37%.
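[NLP-72] How Effective Is Self-Consistency for Long-Context Problems?
TL;DR: This paper asks whether self-consistency (SC) remains effective when large language models (LLMs) handle long-context problems. The study examines how SC behaves in long-context scenarios, in particular its effect on position bias. It finds that although SC works well on short-context tasks, on long-context tasks it not only fails to mitigate position bias but can also degrade model performance. The effectiveness of SC varies with context length and model size but is largely unaffected by prompt format or task type. These results expose limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches to position bias.
Link: https://arxiv.org/abs/2411.01101
Authors: Adam Byerly, Daniel Khashabi
Keywords: involving short content, domains involving short, large language models, short content, demonstrated to enhance
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures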
Abstract:Self-consistency (SC) has been demonstrated to enhance the performance of large language models (LLMs) across various tasks and domains involving short content. However, does this evidence support its effectiveness for long-context problems? This study examines the role of SC in long-context scenarios, where LLMs often struggle with position bias, hindering their ability to utilize information effectively from all parts of their long input context. We examine a range of design parameters, including different models, context lengths, prompt formats, and types of datasets and tasks. Our findings demonstrate that SC, while effective for short-context problems, fundamentally fails for long-context tasks – not only does it fail to mitigate position bias, but it can also actively degrade performance. We observe that the effectiveness of SC varies with context length and model size but remains mainly unaffected by prompt format or task type. These results provide valuable insight into the limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches to address position bias in these models.
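For readers unfamiliar with the technique under study, self-consistency is commonly implemented as majority voting over several sampled answers. The sketch below shows only that voting step under stated assumptions: `sample_answer` stands in for a stochastic LLM call and is replaced here by a canned stub, and the function names are hypothetical. It does not reproduce the paper's long-context experimental setup.

```python
from collections import Counter


def self_consistent_answer(sample_answer, question, num_samples=5):
    """Sample several answers and return the most frequent one plus all samples."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers


# Stub sampler standing in for a stochastic LLM call (assumption of this sketch).
_canned = iter(["42", "41", "42", "42", "40"])


def stub_sampler(question):
    return next(_canned)


answer, samples = self_consistent_answer(stub_sampler, "What is 6 * 7?")
print(answer)  # 42
```

Majority voting can only help when the samples err in different ways. If a long context pushes every sample toward the same position-biased answer, the vote simply returns that answer, which fits the paper's finding that SC does not mitigate position bias.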
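[NLP-73] TabVer: Tabular Fact Verification with Natural Logic TACL
TL;DR: This paper addresses fact verification over tabular evidence, in particular the difficulty symbolic reasoning models have with tables that are not well-formed. The key to the solution is a set-theoretic interpretation of numerals and arithmetic functions that is incorporated into natural logic inference, so that arithmetic expressions can be integrated into deterministic proofs. Concretely, a large language model generates questions about salient parts of the claim, and the questions are answered by executing appropriate functions on the table, which produces the arithmetic expressions. In a few-shot setting on FEVEROUS the method reaches an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models, and it remains competitive on TabFact without further training, with an accuracy lead of 0.5 points.
Link: https://arxiv.org/abs/2411.01093
Authors: Rami Aly, Andreas Vlachos
Keywords: providing greater verifiability, Fact verification, form is constructed, LISP-style program, providing greater
Categories: Computation and Language (cs.CL)
Comments: Accepted to TACL. This is a slightly extended version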
Abstract:Fact verification on tabular evidence incentivises the use of symbolic reasoning models where a logical form is constructed (e.g. a LISP-style program), providing greater verifiability than fully neural approaches. However, these systems typically rely on well-formed tables, restricting their use in many scenarios. An emerging symbolic reasoning paradigm for textual evidence focuses on natural logic inference, which constructs proofs by modelling set-theoretic relations between a claim and its evidence in natural language. This approach provides flexibility and transparency but is less compatible with tabular evidence since the relations do not extend to arithmetic functions. We propose a set-theoretic interpretation of numerals and arithmetic functions in the context of natural logic, enabling the integration of arithmetic expressions in deterministic proofs. We leverage large language models to generate arithmetic expressions by generating questions about salient parts of a claim which are answered by executing appropriate functions on tables. In a few-shot setting on FEVEROUS, we achieve an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models by 3.4 points. When evaluated on TabFact without any further training, our method remains competitive with an accuracy lead of 0.5 points.
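[NLP-74] Plentiful Jailbreaks with String Compositions NEURIPS
TL;DR: This paper studies the vulnerability of large language models (LLMs) to encoding-based adversarial attacks. The key contribution is a framework of invertible string transformations that unifies multiple encoding attacks (such as leetspeak, rotary ciphers, Base64, and ASCII) and supports end-to-end encoding and decoding of arbitrary string compositions. The paper further devises an automated best-of-n attack that samples from a large number of string compositions to raise attack success rates. Experiments show that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
Link: https://arxiv.org/abs/2411.01084
Authors: Brian R.Y. Huang
Keywords: jailbreaking methods, slew of adversarial, Large language models, Large language, language models
Categories: Computation and Language (cs.CL)
Comments: NeurIPS SoLaR Workshop 2024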
Abstract: Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
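The notion of an invertible string composition, a sequence of transformations that can be encoded and then decoded end to end, can be illustrated with a small sketch. The three transformations, the `TRANSFORMS` registry, and the function names below are assumptions chosen for this example and are not taken from the paper. The sketch shows only the round-trip property and does not cover the attack procedure.

```python
import base64
import codecs

# Each transformation is an (encode, decode) pair of str -> str functions.
# The registry and its contents are illustrative assumptions.
TRANSFORMS = {
    "rot13": (
        lambda s: codecs.encode(s, "rot_13"),
        lambda s: codecs.decode(s, "rot_13"),
    ),
    "base64": (
        lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
        lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
    ),
    "reverse": (lambda s: s[::-1], lambda s: s[::-1]),
}


def encode(text, composition):
    """Apply the named transformations in order."""
    for name in composition:
        text = TRANSFORMS[name][0](text)
    return text


def decode(text, composition):
    """Undo the composition by applying the inverses in reverse order."""
    for name in reversed(composition):
        text = TRANSFORMS[name][1](text)
    return text


composition = ["rot13", "base64", "reverse"]
round_trip = decode(encode("hello world", composition), composition)
print(round_trip)  # hello world
```

Because every step has an exact inverse, any ordering of the registered transformations can be decoded programmatically, which is what makes the space of compositions combinatorially large.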
[NLP-75] Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection
TL;DR: This paper studies how Judge LLMs, which are used to evaluate the harmfulness of generated content as a guard against jailbreaking, can be affected by token segmentation bias and thereby fail to detect harmful content. The paper identifies this bias and exploits it with the Emoji Attack, which inserts emojis inside tokens to increase the embedding-space difference between sub-tokens and the original tokens, so that harmful content is incorrectly labeled "safe". It also proposes a defense that uses prompts to filter out unusual characters, but this defense can still be bypassed by mixing emojis with other characters.
Link: https://arxiv.org/abs/2411.01077
Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Keywords: Large Language Models, Large Language, Language Models, Judge LLMs, Emoji Attack
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract:Jailbreaking attacks show how Large Language Models (LLMs) can be tricked into generating harmful outputs using malicious prompts. To prevent these attacks, other LLMs are often used as judges to evaluate the harmfulness of the generated content. However, relying on LLMs as judges can introduce biases into the detection process, which in turn compromises the effectiveness of the evaluation. In this paper, we show that Judge LLMs, like other LLMs, are also affected by token segmentation bias. This bias occurs when tokens are split into smaller sub-tokens, altering their embeddings. This makes it harder for the model to detect harmful content. Specifically, this bias can cause sub-tokens to differ significantly from the original token in the embedding space, leading to incorrect “safe” predictions for harmful content. To exploit this bias in Judge LLMs, we introduce the Emoji Attack – a method that places emojis within tokens to increase the embedding differences between sub-tokens and their originals. These emojis create new tokens that further distort the token embeddings, exacerbating the bias. To counter the Emoji Attack, we design prompts that help LLMs filter out unusual characters. However, this defense can still be bypassed by using a mix of emojis and other characters. The Emoji Attack can also be combined with existing jailbreaking prompts using few-shot learning, which enables LLMs to generate harmful responses with emojis. These responses are often mistakenly labeled as “safe” by Judge LLMs, allowing the attack to slip through. Our experiments with six state-of-the-art Judge LLMs show that the Emoji Attack allows 25% of harmful responses to bypass detection by Llama Guard and Llama Guard 2, and up to 75% by ShieldLM. These results highlight the need for stronger Judge LLMs to address this vulnerability.
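[NLP-76] Privacy Risks of Speculative Decoding in Large Language Models
TL;DR: This paper reveals privacy risks of speculative decoding in large language models (LLMs). By observing the pattern of correctly and incorrectly speculated tokens, a malicious adversary who monitors token generation times and packet sizes can infer private user inputs with high accuracy (over 90%) and can even leak confidential intellectual property used to design these techniques. The proposed mitigations include aggregating tokens across multiple iterations and padding packets with additional bytes to prevent privacy and confidentiality breaches.
Link: https://arxiv.org/abs/2411.01076
Authors: Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar
Keywords: large language models, speculatively predicting multiple, Speculative decoding, language models, widely deployed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)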
Abstract:Speculative decoding in large language models (LLMs) accelerates token generation by speculatively predicting multiple tokens cheaply and verifying them in parallel, and has been widely deployed. In this paper, we provide the first study demonstrating the privacy risks of speculative decoding. We observe that input-dependent patterns of correct and incorrect predictions can be leaked out to an adversary monitoring token generation times and packet sizes, leading to privacy breaches. By observing the pattern of correctly and incorrectly speculated tokens, we show that a malicious adversary can fingerprint queries and learn private user inputs with more than 90% accuracy across three different speculative decoding techniques - BiLD (almost 100% accuracy), LADE (up to 92% accuracy), and REST (up to 95% accuracy). We show that an adversary can also leak out confidential intellectual property used to design these techniques, such as data from data-stores used for prediction (in REST) at a rate of more than 25 tokens per second, or even hyper-parameters used for prediction (in LADE). We also discuss mitigation strategies, such as aggregating tokens across multiple iterations and padding packets with additional bytes, to avoid such privacy or confidentiality breaches.
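The packet-padding mitigation mentioned at the end of the abstract can be sketched as follows. The abstract does not describe a concrete padding scheme, so the 4-byte length prefix, the 64-byte bucket size, and the function names are assumptions of this example. The idea is that responses carrying different numbers of speculated tokens leave the server with the same size.

```python
def pad_to_bucket(payload: bytes, bucket_size: int = 64) -> bytes:
    """Length-prefix the payload, then zero-pad to a multiple of bucket_size."""
    framed = len(payload).to_bytes(4, "big") + payload
    padding = (-len(framed)) % bucket_size
    return framed + b"\x00" * padding


def unpad(packet: bytes) -> bytes:
    """Recover the original payload from a padded packet."""
    length = int.from_bytes(packet[:4], "big")
    return packet[4:4 + length]


one_token = pad_to_bucket(b"Hi")
three_tokens = pad_to_bucket(b"Hi there, friend")
print(len(one_token), len(three_tokens))  # 64 64
```

Padding hides only packet size. Timing differences between iterations would still need a separate measure, such as the token aggregation across iterations that the abstract also mentions.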
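[NLP-77] Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities NEURIPS2024
TL;DR: This paper addresses the failure of contrastive learning on multimodal data (such as images, text, and audio) to capture joint information between modalities. The key to the solution is Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. By deriving a lower bound on total correlation, Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. Experiments show that Symile outperforms pairwise CLIP on cross-modal classification and retrieval, even when some modalities are missing from the data.
Link: https://arxiv.org/abs/2411.01053
Authors: Adriel Saporta, Aahlad Puli, Mark Goldstein, Rajesh Ranganath
Keywords: leverage naturally paired, naturally paired data-for, captions-to learn general, learn general representations, text captions-to learn
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: NeurIPS 2024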
Abstract: Contrastive learning methods, such as CLIP, leverage naturally paired data (for example, images and their corresponding text captions) to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments, including on an original multilingual dataset of 33M image, text, and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available.
[NLP-78] Towards Robust Text Classification: Mitigating Spurious Correlations with Causal Learning
Link: https://arxiv.org/abs/2411.01045
Authors: Yuqing Zhou, Ziwei Zhu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
[NLP-79] Enhancing Question Answering Precision with Optimized Vector Retrieval and Instructions
TL;DR: This paper asks how to improve question-answering (QA) performance in information retrieval and language model applications that rely on pre-trained large neural networks. The key is to combine optimized vector retrieval with instruction methods, using a pipeline of document embedding, vector retrieval, and context construction. Concretely, the paper proposes a retrieval-augmented approach and shows experimentally that, among text segmentation techniques, a small chunk size of 100 with no overlap gives the best QA performance and outperforms sentence-based semantic segmentation.
Link: https://arxiv.org/abs/2411.01039
Authors: Lixiao Yang, Mengyang Xu, Weimao Ke
Keywords: pre-trained large neural, large neural networks, application of Information, Information Retrieval, important application
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, 4 tables
Abstract:Question-answering (QA) is an important application of Information Retrieval (IR) and language models, and the latest trend is toward pre-trained large neural networks with embedding parameters. Augmenting QA performances with these LLMs requires intensive computational resources for fine-tuning. We propose an innovative approach to improve QA task performances by integrating optimized vector retrievals and instruction methodologies. Based on retrieval augmentation, the process involves document embedding, vector retrieval, and context construction for optimal QA results. We experiment with different combinations of text segmentation techniques and similarity functions, and analyze their impacts on QA performances. Results show that the model with a small chunk size of 100 without any overlap of the chunks achieves the best result and outperforms the models based on semantic segmentation using sentences. We discuss related QA examples and offer insight into how model performances are improved within the two-stage framework.
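The best-performing segmentation setting reported in the abstract, a chunk size of 100 with no overlap, can be sketched as below. The abstract does not state the unit of the chunk size, so measuring it in whitespace-separated words is an assumption of this example, as is the function name `chunk_text`.

```python
def chunk_text(text, chunk_size=100, overlap=0):
    """Split text into consecutive chunks of chunk_size words.

    Counting whitespace-separated words is an assumption of this sketch;
    the abstract does not specify the unit.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


document = " ".join(f"w{i}" for i in range(250))
chunks = chunk_text(document)
print([len(chunk.split()) for chunk in chunks])  # [100, 100, 50]
```

Each chunk would then be embedded and stored for vector retrieval. With `overlap=0` no word appears in two chunks, which corresponds to the no-overlap configuration the abstract reports as best.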
[NLP-80] Provable Length Generalization in Sequence Prediction via Spectral Filtering
Link: https://arxiv.org/abs/2411.01035
Authors: Annie Marsden, Evan Dogariu, Naman Agarwal, Xinyi Chen, Daniel Suo, Elad Hazan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 34 pages, 9 figures
[NLP-81] Birdie: Advancing State Space Models with Reward-Driven Objectives and Curricula EMNLP2024
Link: https://arxiv.org/abs/2411.01030
Authors: Sam Blouir, Jimmy Smith, Antonios Anastasopoulos, Amarda Shehu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Main Conference)
[NLP-82] Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output EMNLP2024
Link: https://arxiv.org/abs/2411.01022
Authors: Hithesh Sankararaman, Mohammed Nasheed Yasin, Tanner Sorensen, Alessandro Di Bari, Andreas Stolcke
Categories: Computation and Language (cs.CL)
Comments: To appear in Proceedings of EMNLP 2024 Industry Track
[NLP-83] Identifying Implicit Social Biases in Vision-Language Models
Link: https://arxiv.org/abs/2411.00997
Authors: Kimia Hamidieh, Haoran Zhang, Walter Gerych, Thomas Hartvigsen, Marzyeh Ghassemi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
[NLP-84] FedDTPT: Federated Discrete and Transferable Prompt Tuning for Black-Box Large Language Models
Link: https://arxiv.org/abs/2411.00985
Authors: Jiaqi Wu, Simin Chen, Yuzhe Yang, Yijiang Li, Shiyue Hou, Rui Jing, Zehua Wang, Wei Chen, Zijian Tian
Categories: Computation and Language (cs.CL)
[NLP-85] Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO
Link: https://arxiv.org/abs/2411.00980
Authors: Macarious Hui, Jinda Zhang, Aanchan Mohan
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[NLP-86] Generic Embedding-Based Lexicons for Transparent and Reproducible Text Scoring
Link: https://arxiv.org/abs/2411.00964
Authors: Catherine Moez
Categories: Computation and Language (cs.CL)
Comments: Preprint
[NLP-87] Text2Freq: Learning Series Patterns from Text via Frequency Domain NEURIPS2024
Link: https://arxiv.org/abs/2411.00929
Authors: Ming-Chih Lo, Ching Chang, Wen-Chih Peng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 3 figures; accepted by the NeurIPS 2024 Workshop: Time Series in the Age of Large Models
[NLP-88] ReSpAct: Harmonizing Reasoning Speaking and Acting Towards Building Large Language Model-Based Conversational AI Agents
Link: https://arxiv.org/abs/2411.00927
Authors: Vardhan Dongre, Xiaocheng Yang, Emre Can Acikgoz, Suvodip Dey, Gokhan Tur, Dilek Hakkani-Tür
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 9 Figures, 22 Tables
[NLP-89] LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Link: https://arxiv.org/abs/2411.00918
Authors: Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 9 figures
[NLP-90] Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback
TL;DR: This paper addresses the weak performance of large language models in Traditional Chinese Medicine (TCM), which stems mainly from a lack of domain expertise and of high-quality data. The key to the solution is a framework that improves a large model with only a small amount of data. First, supervised fine-tuning on medical case data gives the model an initial ability to handle TCM tasks; then reinforcement learning from AI feedback (RLAIF) further optimizes the model to align it with preference data. Experiments show that even with little data the method achieves a significant improvement on a representative TCM task.
Link: https://arxiv.org/abs/2411.00897
Authors: Song Yu, Xiaofei Xu, Fangfei Xu, Li Li
Keywords: Traditional Chinese Medicine, Chinese Medicine, Traditional Chinese, remains limited due, large language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures
Abstract:Although large language models perform well in understanding and responding to user intent, their performance in specialized domains such as Traditional Chinese Medicine (TCM) remains limited due to lack of expertise. In addition, high-quality data related to TCM is scarce and difficult to obtain, making large language models ineffective in handling TCM tasks. In this work, we propose a framework to improve the performance of large language models for TCM tasks using only a small amount of data. First, we use medical case data for supervised fine-tuning of the large model, making it initially capable of performing TCM tasks. Subsequently, we further optimize the model’s performance using reinforcement learning from AI feedback (RLAIF) to align it with the preference data. The ablation study also demonstrated the performance gain is attributed to both supervised fine-tuning and the direct policy optimization. The experimental results show that the model trained with a small amount of data achieves a significant performance improvement on a representative TCM task.
[NLP-91] Rethinking Scale: The Efficacy of Fine-Tuned Open-Source LLM s in Large-Scale Reproducible Social Science Research
链接: https://arxiv.org/abs/2411.00890
作者: Marcello Carammia,Stefano Maria Iacus,Giuseppe Porro
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
[NLP-92] Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models
链接: https://arxiv.org/abs/2411.00878
作者: Phil Wee,Riyadh Baghdadi
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures
[NLP-93] DemoCraft: Using In-Context Learning to Improve Code Generation in Large Language Models
链接: https://arxiv.org/abs/2411.00865
作者: Nirmal Joshua Kapu,Mihit Sreejith
关键词-EN:
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-94] Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
链接: https://arxiv.org/abs/2411.00863
作者: Chenyang An,Shima Imani,Feng Yao,Chengyu Dong,Ali Abbasi,Harsh Shrivastava,Samuel Buss,Jingbo Shang,Gayathri Mahalingam,Pramod Sharma,Maurice Diesendruck
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
[NLP-95] Survey of Cultural Awareness in Language Models: Text and Beyond
链接: https://arxiv.org/abs/2411.00860
作者: Siddhesh Pawar,Junyeong Park,Jiho Jin,Arnav Arora,Junho Myung,Srishti Yadav,Faiz Ghifari Haznitrama,Inhwa Song,Alice Oh,Isabelle Augenstein
关键词-EN:
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
[NLP-96] Vision-Language Models Can Self-Improve Reasoning via Reflection
链接: https://arxiv.org/abs/2411.00855
作者: Kanzhi Cheng,Yantao Li,Fangzhi Xu,Jianbing Zhang,Hao Zhou,Yang Liu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
[NLP-97] Accelerated AI Inference via Dynamic Execution Methods
链接: https://arxiv.org/abs/2411.00853
作者: Haim Barad,Jascha Achterberg,Tien Pei Chou,Jean Yu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-98] GWQ: Gradient-Aware Weight Quantization for Large Language Models
链接: https://arxiv.org/abs/2411.00850
作者: Yihua Shao,Siyu Liang,Xiaolin Lin,Zijian Ling,Zixian Zhu,Minxi Yan,Haiyang Liu,Siyu Chen,Ziyang Yan,Yilan Meng,Chenyu Zhang,Haotong Qin,Michele Magno,Yang Yang,Zhen Lei,Yan Wang,Jingcai Guo,Ling Shao,Hao Tang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-99] The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation
链接: https://arxiv.org/abs/2411.00843
作者: Reza Moravej,Saurabh Bodhe,Zhanguang Zhang,Didier Chetelat,Dimitrios Tsaras,Yingxue Zhang,Hui-Ling Zhen,Jianye Hao,Mingxuan Yuan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注:
[NLP-100] A Theoretical Perspective for Speculative Decoding Algorithm NEURIPS2024
链接: https://arxiv.org/abs/2411.00841
作者: Ming Yin,Minshuo Chen,Kaixuan Huang,Mengdi Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: NeurIPS 2024
[NLP-101] DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
链接: https://arxiv.org/abs/2411.00836
作者: Chengke Zou,Xingang Guo,Rui Yang,Junyu Zhang,Bin Hu,Huan Zhang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 39 pages, 10 figures
[NLP-102] Mobility-LLM: Learning Visiting Intentions and Travel Preferences from Human Mobility Data with Large Language Models NEURIPS2024
链接: https://arxiv.org/abs/2411.00823
作者: Letian Gong,Yan Lin,Xinyue Zhang,Yiwen Lu,Xuedi Han,Yichen Liu,Shengnan Guo,Youfang Lin,Huaiyu Wan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted by NeurIPS2024
[NLP-103] AutoGLM: Autonomous Foundation Agents for GUIs
链接: https://arxiv.org/abs/2411.00820
作者: Xiao Liu,Bo Qin,Dongzhu Liang,Guang Dong,Hanyu Lai,Hanchen Zhang,Hanlin Zhao,Iat Long Iong,Jiadai Sun,Jiaqi Wang,Junjie Gao,Junjun Shan,Kangning Liu,Shudan Zhang,Shuntian Yao,Siyi Cheng,Wentao Yao,Wenyi Zhao,Xinghan Liu,Xinyi Liu,Xinying Chen,Xinyue Yang,Yang Yang,Yifan Xu,Yu Yang,Yujia Wang,Yulin Xu,Zehan Qi,Yuxiao Dong,Jie Tang
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
[NLP-104] CycleResearcher: Improving Automated Research via Automated Review
链接: https://arxiv.org/abs/2411.00816
作者: Yixuan Weng,Minjun Zhu,Guangsheng Bao,Hongbo Zhang,Jindong Wang,Yue Zhang,Linyi Yang
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
[NLP-105] Personality Analysis from Online Short Video Platforms with Multi-domain Adaptation
链接: https://arxiv.org/abs/2411.00813
作者: Sixu An,Xiangguo Sun,Yicong Li,Yu Yang,Guandong Xu
关键词-EN:
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Audio and Speech Processing (eess.AS)
备注:
[NLP-106] Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment
链接: https://arxiv.org/abs/2411.00809
作者: Yanshi Li,Shaopan Xiong,Gengru Chen,Xiaoyang Li,Yijia Luo,Xingyao Zhang,Yanhui Huang,Xingyuan Bu,Yingshui Tan,Chun Yuan,Jiamang Wang,Wenbo Su,Bo Zheng
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
[NLP-107] KeyInst: Keyword Instruction for Improving SQL Formulation in Text-to-SQL
链接: https://arxiv.org/abs/2411.00788
作者: Xiping Liu,Zhao Tan
关键词-EN:
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
[NLP-108] FIRE: Fact-checking with Iterative Retrieval and Verification
链接: https://arxiv.org/abs/2411.00784
作者: Zhuohan Xie,Rui Xing,Yuxia Wang,Jiahui Geng,Hasan Iqbal,Dhruv Sahnan,Iryna Gurevych,Preslav Nakov
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 4 figures, 8 tables
[NLP-109] Hazards in Daily Life? Enabling Robots to Proactively Detect and Resolve Anomalies
链接: https://arxiv.org/abs/2411.00781
作者: Zirui Song,Guangxian Ouyang,Meng Fang,Hongbin Na,Zijing Shi,Zhenhao Chen,Yujie Fu,Zeyu Zhang,Shiyu Jiang,Miao Fang,Ling Chen,Xiuying Chen
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: In processing
人工智能
[AI-0] Adaptive Length Image Tokenization via Recurrent Allocation
链接: https://arxiv.org/abs/2411.02393
作者: Shivam Duggal,Phillip Isola,Antonio Torralba,William T. Freeman
关键词-EN: Current vision systems, vision systems typically, systems typically assign, typically assign fixed-length, Current vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Code at: this https URL
Abstract:Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.
[AI-1] How Far is Video Generation from World Model: A Physical Law Perspective
链接: https://arxiv.org/abs/2411.02385
作者: Bingyi Kang,Yang Yue,Rui Lu,Zhijie Lin,Yang Zhao,Kaixin Wang,Gao Huang,Jiashi Feng
关键词-EN: OpenAI Sora highlights, video generation models, video generation, developing world models, highlights the potential
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:OpenAI’s Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit “case-based” generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora’s broader success. See our project page at this https URL
[AI-2] Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI
链接: https://arxiv.org/abs/2411.02381
作者: Ramneet Kaur,Colin Samplawski,Adam D. Cobb,Anirban Roy,Brian Matejek,Manoj Acharya,Daniel Elenius,Alexander M. Berenbeim,John A. Pavlik,Nathaniel D. Bastian,Susmit Jha
关键词-EN: Chinese Restaurant Process, Large Language Models, Restaurant Process, Chinese Restaurant, Large Language
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In this paper, we present a dynamic semantic clustering approach inspired by the Chinese Restaurant Process, aimed at addressing uncertainty in the inference of Large Language Models (LLMs). We quantify uncertainty of an LLM on a given query by calculating entropy of the generated semantic clusters. Further, we propose leveraging the (negative) likelihood of these clusters as the (non)conformity score within Conformal Prediction framework, allowing the model to predict a set of responses instead of a single output, thereby accounting for uncertainty in its predictions. We demonstrate the effectiveness of our uncertainty quantification (UQ) technique on two well known question answering benchmarks, COQA and TriviaQA, utilizing two LLMs, Llama2 and Mistral. Our approach achieves SOTA performance in UQ, as assessed by metrics such as AUROC, AUARC, and AURAC. The proposed conformal predictor is also shown to produce smaller prediction sets while maintaining the same probabilistic guarantee of including the correct response, in comparison to existing SOTA conformal prediction baseline.
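As a rough illustration of the entropy-based uncertainty score mentioned in the abstract above, the sketch below computes Shannon entropy over the semantic cluster labels of several sampled responses. The cluster labels are hypothetical inputs; the paper's Chinese Restaurant Process inspired clustering step is not reproduced here, and the function name is an assumption made for this example.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over cluster labels."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # Sum -p * log(p) over the observed clusters.
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster labels for five sampled answers to one query.
confident = [0, 0, 0, 0, 0]   # all answers mean the same thing
uncertain = [0, 1, 2, 3, 4]   # every answer means something different

low = cluster_entropy(confident)
high = cluster_entropy(uncertain)
```

Under this reading, a query whose sampled answers all fall into one cluster gets zero entropy, while answers spread evenly over five clusters reach the maximum of log(5).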
[AI-3] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution NEURIPS2024
链接: https://arxiv.org/abs/2411.02359
作者: Yang Yue,Yulin Wang,Bingyi Kang,Yizeng Han,Shenzhi Wang,Shiji Song,Jiashi Feng,Gao Huang
关键词-EN: demonstrated remarkable comprehension, visual data, demonstrated remarkable, remarkable comprehension, comprehension and reasoning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 6 figures, NeurIPS 2024
Abstract:MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at this https URL.
[AI-4] “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization
链接: https://arxiv.org/abs/2411.02355
作者: Eldar Kurtic,Alexandre Marques,Shubhra Pandit,Mark Kurtz,Dan Alistarh
关键词-EN: significant uncertainty remains, large language model, significant uncertainty, popularity of large, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the “best” format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous “continuous batching” deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
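To make the INT8 format named in the abstract above more concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization. It is a generic textbook scheme chosen for illustration, not the tuned procedure used in the paper, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [2.54, -1.0, 0.0]          # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

In this scheme the round-trip error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) can hurt accuracy for the whole tensor.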
[AI-5] Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking
链接: https://arxiv.org/abs/2411.02345
作者: Shahab Kavousinejad
关键词-EN: targeted drug delivery, neurological disorders, promising development, crossing the blood-brain, targeted payload delivery
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph); Other Quantitative Biology (q-bio.OT)
备注: The source code for this simulation is available on GitHub: this https URL
Abstract:Nanorobots are a promising development in targeted drug delivery and the treatment of neurological disorders, with potential for crossing the blood-brain barrier (BBB). These small devices leverage advancements in nanotechnology and bioengineering for precise navigation and targeted payload delivery, particularly for conditions like brain tumors, Alzheimer’s disease, and Parkinson’s disease. Recent progress in artificial intelligence (AI) and machine learning (ML) has improved the navigation and effectiveness of nanorobots, allowing them to detect and interact with cancer cells through biomarker analysis. This study presents a new reinforcement learning (RL) framework for optimizing nanorobot navigation in complex biological environments, focusing on cancer cell detection by analyzing the concentration gradients of surrounding biomarkers. We utilize a computer simulation model to explore the behavior of nanorobots in a three-dimensional space with cancer cells and biological barriers. The proposed method uses Q-learning to refine movement strategies based on real-time biomarker concentration data, enabling nanorobots to autonomously navigate to cancerous tissues for targeted drug delivery. This research lays the groundwork for future laboratory experiments and clinical applications, with implications for personalized medicine and less invasive cancer treatments. The integration of intelligent nanorobots could revolutionize therapeutic strategies, reducing side effects and enhancing treatment effectiveness for cancer patients. Further research will investigate the practical deployment of these technologies in medical settings, aiming to unlock the full potential of nanorobotics in healthcare.
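The abstract above says the method uses Q-learning to refine movement strategies. The sketch below shows one standard tabular Q-learning update. The state encoding (a 3D grid position), the six-move action set, the reward value, and the hyperparameters are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical setup: positions on a 3D grid and six axis-aligned moves.
q_table = defaultdict(float)
moves = ["+x", "-x", "+y", "-y", "+z", "-z"]

# Assume the reward is the rise in biomarker concentration after the move.
new_value = q_update(q_table, state=(0, 0, 0), action="+x", reward=1.0,
                     next_state=(1, 0, 0), actions=moves)
```

Starting from an all-zero table, a single rewarded move raises that state-action value by alpha times the reward; repeated updates propagate value back along paths toward higher concentrations.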
[AI-6] Disrupting Test Development with AI Assistants
链接: https://arxiv.org/abs/2411.02328
作者: Vijay Joshi,Iver Band
关键词-EN: large language models, Generative AI-assisted coding, Recent advancements, GitHub Copilot, significantly transformed software
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models, including GPT-4 and its variants, and Generative AI-assisted coding tools like GitHub Copilot, ChatGPT, and Tabnine, have significantly transformed software development. This paper analyzes how these innovations impact productivity and software test development metrics. These tools enable developers to generate complete software programs with minimal human intervention before deployment. However, thorough review and testing by developers are still crucial. Utilizing the Test Pyramid concept, which categorizes tests into unit, integration, and end-to-end tests, we evaluate three popular AI coding assistants by generating and comparing unit tests for opensource modules. Our findings show that AI-generated tests are of equivalent quality to original tests, highlighting differences in usage and results among the tools. This research enhances the understanding and capabilities of AI-assistant tools in automated testing.
[AI-7] GenXD: Generating Any 3D and 4D Scenes
链接: https://arxiv.org/abs/2411.02319
作者: Yuyang Zhao,Chung-Ching Lin,Kevin Lin,Zhiwen Yan,Linjie Li,Zhengyuan Yang,Jianfeng Wang,Gim Hee Lee,Lijuan Wang
关键词-EN: Recent developments, remarkably successful, Recent, visual generation, generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD’s effectiveness and versatility compared to previous methods in 3D and 4D generation.
[AI-8] Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
链接: https://arxiv.org/abs/2411.02318
作者: Marilyn Rego,Wen Fan,Xin Hu,Sanya Dod,Zhaorui Ni,Danning Xie,Jenna DiVincenzo,Lin Tan
关键词-EN: demands significant human, significant human labor, enhancing software quality, GPT models, labor and resources
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注:
Abstract:Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can perform specification generation for specifications based in an ownership logic, such as separation logic. To address this gap, this paper explores the effectiveness of large language models (LLMs), specifically OpenAI’s GPT models, in generating fully correct specifications based on separation logic for static verification of human-written programs in VeriFast. Our first experiment employed traditional prompt engineering and the second used Chain-of-Thought (CoT) Prompting to identify and address common errors generated across the GPT models. The results indicate that GPT models can successfully generate specifications for verifying heap manipulating code with VeriFast. Furthermore, while CoT prompting significantly reduces syntax errors generated by the GPT models, it does not greatly improve verification error rates compared to prompt engineering.
[AI-9] Defining and Evaluating Physical Safety for Large Language Models
链接: https://arxiv.org/abs/2411.02317
作者: Yung-Chen Tang,Pin-Yu Chen,Tsung-Yi Ho
关键词-EN: Large Language Models, Large Language, applications remain unexplored, real-world applications remain, control robotic systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at this http URL.
[AI-10] Grid-Based Projection of Spatial Data into Knowledge Graphs
链接: https://arxiv.org/abs/2411.02309
作者: Amin Anjomshoaa,Hannah Schuster,Axel Polleres
关键词-EN: Spatial Knowledge Graphs, Knowledge Graphs, experiencing growing adoption, Graphs, Spatial Knowledge
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:The Spatial Knowledge Graphs (SKG) are experiencing growing adoption as a means to model real-world entities, proving especially invaluable in domains like crisis management and urban planning. Considering that RDF specifications offer limited support for effectively managing spatial information, it’s common practice to include text-based serializations of geometrical features, such as polygons and lines, as string literals in knowledge graphs. Consequently, Spatial Knowledge Graphs (SKGs) often rely on geo-enabled RDF Stores capable of parsing, interpreting, and indexing such serializations. In this paper, we leverage grid cells as the foundational element of SKGs and demonstrate how efficiently the spatial characteristics of real-world entities and their attributes can be encoded within knowledge graphs. Furthermore, we introduce a novel methodology for representing street networks in knowledge graphs, diverging from the conventional practice of individually capturing each street segment. Instead, our approach is based on tessellating the street network using grid cells and creating a simplified representation that could be utilized for various routing and navigation tasks, solely relying on RDF specifications.
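The grid-cell idea in the abstract above can be sketched as follows: each coordinate is mapped to a discrete cell identifier, and entities are linked to cells with plain triples instead of geometry literals. The cell naming scheme, the cell resolution, and the `locatedInCell` predicate are hypothetical choices for this example, not the vocabulary proposed in the paper.

```python
import math


def cell_id(lon, lat, cells_per_degree=4):
    """Map a WGS84 coordinate to the identifier of the grid cell containing it."""
    col = math.floor((lon + 180.0) * cells_per_degree)
    row = math.floor((lat + 90.0) * cells_per_degree)
    return f"cell_{row}_{col}"


def to_triples(entities):
    """Emit (subject, predicate, object) triples linking each entity to its cell."""
    return [(name, "locatedInCell", cell_id(lon, lat)) for name, lon, lat in entities]


# Hypothetical entities with (name, longitude, latitude).
triples = to_triples([
    ("hospital_A", 16.37, 48.21),
    ("school_B", 16.40, 48.22),
])
```

Because both example entities fall into the same cell, a query for "entities near hospital_A" reduces to matching a shared cell identifier, which needs no geo-enabled store.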
[AI-11] Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
链接: https://arxiv.org/abs/2411.02306
作者: Marcus Williams,Micah Carroll,Adhyyan Narang,Constantin Weisser,Brendan Murphy,Anca Dragan
关键词-EN: widely deployed, paid annotators, increasing interest, interest in directly, directly optimizing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of “feedback gaming” such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources – such as user feedback – as a target for RL.
[AI-12] Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation
链接: https://arxiv.org/abs/2411.02293
作者: Xianghui Yang,Huiwen Shi,Bowen Zhang,Fan Yang,Jiacheng Wang,Hongxu Zhao,Xinhai Liu,Xinzhou Wang,Qingxiang Lin,Jiaao Yu,Lifu Wang,Zhuo Chen,Sicong Liu,Yuhong Liu,Yong Yang,Di Wang,Jie Jiang,Chunchao Guo
关键词-EN: improved artists’ workflows, greatly improved artists’, artists’ workflows, poor generalization, greatly improved
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While 3D generative models have greatly improved artists’ workflows, the existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address this issue, we propose a two-stage approach named Hunyuan3D-1.0 including a lite version and a standard version, that both support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the tasks from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset given the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle noises and in-consistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Extensive experimental results demonstrate the effectiveness of Hunyuan3D-1.0 in generating high-quality 3D assets. Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to support both text- and image-conditioned 3D generation. Our standard version has 10× more parameters than our lite and other existing model. Our Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
[AI-13] ControlSynth Neural ODEs: Modeling Dynamical Systems with Guaranteed Convergence
链接: https://arxiv.org/abs/2411.02292
作者: Wenjie Mei,Dongzhe Zheng,Shihua Li
关键词-EN: continuous-time neural networks, time intervals, process data, limitation of time, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Neural ODEs (NODEs) are continuous-time neural networks (NNs) that can process data without the limitation of time intervals. They have advantages in learning and understanding the evolution of complex real dynamics. Many previous works have focused on NODEs in concise forms, while numerous physical systems taking straightforward forms, in fact, belong to their more complex quasi-classes, thus appealing to a class of general NODEs with high scalability and flexibility to model those systems. This, however, may result in intricate nonlinear properties. In this paper, we introduce ControlSynth Neural ODEs (CSODEs). We show that despite their highly nonlinear nature, convergence can be guaranteed via tractable linear inequalities. In the composition of CSODEs, we introduce an extra control term for learning the potential simultaneous capture of dynamics at different scales, which could be particularly useful for partial differential equation-formulated systems. Finally, we compare several representative NNs with CSODEs on important physical dynamics under the inductive biases of CSODEs, and illustrate that CSODEs have better learning and predictive abilities in these settings.
[AI-14] Federated GNNs for EEG-Based Stroke Assessment
链接: https://arxiv.org/abs/2411.02286
作者: Andrea Protani,Lorenzo Giusti,Albert Sund Aillet,Simona Sacco,Paolo Manganotti,Lucio Marinelli,Diogo Reis Santos,Pierpaolo Brutti,Pietro Caliandro,Luigi Serio
关键词-EN: clinical decision-making processes, personalized treatment plans, offering enhanced diagnostic, supporting clinical decision-making, enhanced diagnostic capabilities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 13 pages, 5 figures, Proceedings of the II edition of the Workshop on Unifying Representations in Neural Models (UniReps 2024)
Abstract:Machine learning (ML) has the potential to become an essential tool in supporting clinical decision-making processes, offering enhanced diagnostic capabilities and personalized treatment plans. However, outsourcing medical records to train ML models using patient data raises legal, privacy, and security concerns. Federated learning has emerged as a promising paradigm for collaborative ML, meeting healthcare institutions’ requirements for robust models without sharing sensitive data and compromising patient privacy. This study proposes a novel method that combines federated learning (FL) and Graph Neural Networks (GNNs) to predict stroke severity using electroencephalography (EEG) signals across multiple medical institutions. Our approach enables multiple hospitals to jointly train a shared GNN model on their local EEG data without exchanging patient information. Specifically, we address a regression problem by predicting the National Institutes of Health Stroke Scale (NIHSS), a key indicator of stroke severity. The proposed model leverages a masked self-attention mechanism to capture salient brain connectivity patterns and employs EdgeSHAP to provide post-hoc explanations of the neurological states after a stroke. We evaluated our method on EEG recordings from four institutions, achieving a mean absolute error (MAE) of 3.23 in predicting NIHSS, close to the average error made by human experts (MAE ≈ 3.0). This demonstrates the method’s effectiveness in providing accurate and explainable predictions while maintaining data privacy.
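The abstract above describes hospitals jointly training a shared model without exchanging patient data, evaluated by mean absolute error. The sketch below illustrates one common way to do this, sample-weighted parameter averaging (FedAvg). The abstract does not name the aggregation rule, so FedAvg is an assumption here, and the parameter vectors, dataset sizes, and scores are made-up toy values.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def mean_absolute_error(y_true, y_pred):
    """MAE, the metric reported for NIHSS prediction in the abstract."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: two hospitals with 1 and 3 local recordings respectively.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])

# Toy NIHSS scores: true values versus model predictions.
mae = mean_absolute_error([10, 4], [7, 5])
```

Only the parameter vectors and dataset sizes cross institutional boundaries in this scheme; the recordings themselves stay local.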
[AI-15] Breaking the Reclustering Barrier in Centroid-based Deep Clustering
Link: https://arxiv.org/abs/2411.02275
Authors: Lukas Miklautz, Timo Klein, Kevin Sidak, Collin Leiber, Thomas Lang, Andrii Shkabrii, Sebastian Tschiatschek, Claudia Plant
Keywords: Performance quickly saturates, rapid early gains, reclustering barrier, work investigates, investigates an important
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:This work investigates an important phenomenon in centroid-based deep clustering (DC) algorithms: Performance quickly saturates after a period of rapid early gains. Practitioners commonly address early saturation with periodic reclustering, which we demonstrate to be insufficient to address performance plateaus. We call this phenomenon the “reclustering barrier” and empirically show when the reclustering barrier occurs, what its underlying mechanisms are, and how it is possible to Break the Reclustering Barrier with our algorithm BRB. BRB avoids early over-commitment to initial clusterings and enables continuous adaptation to reinitialized clustering targets while remaining conceptually simple. Applying our algorithm to widely-used centroid-based DC algorithms, we show that (1) BRB consistently improves performance across a wide range of clustering benchmarks, (2) BRB enables training from scratch, and (3) BRB performs competitively against state-of-the-art DC algorithms when combined with a contrastive loss. We release our code and pre-trained models at this https URL .
[AI-16] On the Utilization of Unique Node Identifiers in Graph Neural Networks
Link: https://arxiv.org/abs/2411.02271
Authors: Maya Bechler-Speicher, Moshe Eliasof, Carola-Bibiane Schönlieb, Ran Gilad-Bachrach, Amir Globerson
Keywords: Graph neural networks, Graph neural, representational limitations due, inherent representational limitations, message-passing structure
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Click to view abstract
Abstract:Graph neural networks have inherent representational limitations due to their message-passing structure. Recent work has suggested that these limitations can be overcome by using unique node identifiers (UIDs). Here we argue that despite the advantages of UIDs, one of their disadvantages is that they lose the desirable property of permutation-equivariance. We thus propose to focus on UID models that are permutation-equivariant, and present theoretical arguments for their advantages. Motivated by this, we propose a method to regularize UID models towards permutation equivariance, via a contrastive loss. We empirically demonstrate that our approach improves generalization and extrapolation abilities while providing faster training convergence. On the recent BREC expressiveness benchmark, our proposed method achieves state-of-the-art performance compared to other random-based approaches.
[AI-17] The Enhancement of Software Delivery Performance through Enterprise DevSecOps and Generative Artificial Intelligence in Chinese Technology Firms
Link: https://arxiv.org/abs/2411.02255
Authors: Jun Cui
Keywords: Generative Artificial Intelligence, Artificial Intelligence, Generative Artificial, software delivery performance, DevSecOps and Generative
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:This study investigates the impact of integrating DevSecOps and Generative Artificial Intelligence (GAI) on software delivery performance within technology firms. Utilizing a qualitative research methodology, the research involved semi-structured interviews with industry practitioners and analysis of case studies from organizations that have successfully implemented these methodologies. The findings reveal significant enhancements in research and development (R&D) efficiency, improved source code management, and heightened software quality and security. The integration of GAI facilitated automation of coding tasks and predictive analytics, while DevSecOps ensured that security measures were embedded throughout the development lifecycle. Despite the promising results, the study identifies gaps related to the generalizability of the findings due to the limited sample size and the qualitative nature of the research. This paper contributes valuable insights into the practical implementation of DevSecOps and GAI, highlighting their potential to transform software delivery processes in technology firms. Future research directions include quantitative assessments of the impact on specific business outcomes and comparative studies across different industries.
[AI-18] Detect an Object At Once without Fine-tuning
Link: https://arxiv.org/abs/2411.02181
Authors: Junyu Hao, Jianheng Liu, Yongjia Zhao, Zuofan Chen, Qi Sun, Jinlong Chen, Jianguo Wei, Minghao Yang
Keywords: Region Alignment Network, Similarity Density Map, previously unseen object, Deep Siamese Network, instantly recognize
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:When presented with one or a few photos of a previously unseen object, humans can instantly recognize it in different scenes. Although the human brain mechanism behind this phenomenon is still not fully understood, this work introduces a novel technical realization of this task. It consists of two phases: (1) generating a Similarity Density Map (SDM) by convolving the scene image with the given object image patch(es) so that the highlighted areas in the SDM indicate the possible locations; (2) obtaining the object-occupied areas in the scene through a Region Alignment Network (RAN). The RAN is constructed on a Deep Siamese Network (DSN) backbone, and unlike traditional DSNs, it aims to obtain accurate object regions by regressing the location and area differences between the ground truths and the predicted ones indicated by the highlighted areas in the SDM. By pre-learning from labels annotated in traditional datasets, the SDM-RAN can detect previously unknown objects without fine-tuning. Experiments were conducted on the MS COCO and PASCAL VOC datasets. The results indicate that the proposed method outperforms state-of-the-art methods on the same task.
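Phase (1) above, sliding an object patch over the scene to obtain a similarity map, can be illustrated with a small sketch. This is a simplification under stated assumptions: it correlates raw pixel values held in nested lists, whereas the abstract does not say what representation the authors convolve; the helper names `similarity_density_map` and `peak_location` are hypothetical.

```python
def similarity_density_map(scene, patch):
    """Slide `patch` over `scene` and record the raw correlation at each offset.

    Both arguments are 2D lists of numbers (grayscale intensities here).
    The result has shape (H - h + 1) x (W - w + 1).
    """
    patch_h, patch_w = len(patch), len(patch[0])
    scene_h, scene_w = len(scene), len(scene[0])
    sdm = []
    for top in range(scene_h - patch_h + 1):
        row = []
        for left in range(scene_w - patch_w + 1):
            score = sum(
                scene[top + i][left + j] * patch[i][j]
                for i in range(patch_h)
                for j in range(patch_w)
            )
            row.append(score)
        sdm.append(row)
    return sdm


def peak_location(sdm):
    """Return the (row, col) offset with the highest similarity score."""
    return max(
        ((r, c) for r in range(len(sdm)) for c in range(len(sdm[0]))),
        key=lambda rc: sdm[rc[0]][rc[1]],
    )


patch = [[1, 2], [3, 4]]
scene = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 2, 0],
    [0, 0, 3, 4, 0],
    [0, 0, 0, 0, 0],
]
sdm = similarity_density_map(scene, patch)
best = peak_location(sdm)  # offset where the patch was embedded
```

Raw correlation tends to favor bright regions regardless of shape, so a practical system would likely normalize the scores or correlate learned features instead; the second phase (the Region Alignment Network) is not sketched here.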
[AI-19] Behavioral Sequence Modeling with Ensemble Learning
Link: https://arxiv.org/abs/2411.02174
Authors: Maxime Kawawa-Beaudan, Srijan Sood, Soham Palande, Ganapathy Mani, Tucker Balch, Manuela Veloso
Keywords: Hidden Markov Models, emphasizing that sequential, sequential context, context often outweighs, aggregate features
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:We investigate the use of sequence analysis for behavior modeling, emphasizing that sequential context often outweighs the value of aggregate features in understanding human behavior. We discuss framing common problems in fields like healthcare, finance, and e-commerce as sequence modeling tasks, and address challenges related to constructing coherent sequences from fragmented data and disentangling complex behavior patterns. We present a framework for sequence modeling using Ensembles of Hidden Markov Models, which are lightweight, interpretable, and efficient. Our ensemble-based scoring method enables robust comparison across sequences of different lengths and enhances performance in scenarios with imbalanced or scarce data. The framework scales in real-world scenarios, is compatible with downstream feature-based modeling, and is applicable in both supervised and unsupervised learning settings. We demonstrate the effectiveness of our method with results on a longitudinal human behavior dataset.
[AI-20] Do graph neural network states contain graph properties?
Link: https://arxiv.org/abs/2411.02168
Authors: Tom Pelletreau-Duris, Ruud van Bakel, Michael Cochez
Keywords: requires increasingly large, Graph Neural Networks, learning models achieve, large model sizes, increasingly large model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 22 figures, conference
Click to view abstract
Abstract:Graph learning models achieve state-of-the-art performance on many tasks, but this often requires increasingly large model sizes. Accordingly, the complexity of their representations increases. Explainability techniques (XAI) have made remarkable progress in the interpretability of ML models. However, the non-relational nature of Graph Neural Networks (GNNs) makes it difficult to reuse already existing XAI methods. While other works have focused on instance-based explanation methods for GNNs, very few have investigated model-based methods and, to our knowledge, none have tried to probe the embedding of the GNNs for well-known structural graph properties. In this paper we present a model-agnostic explainability pipeline for GNNs employing diagnostic classifiers. This pipeline aims to probe and interpret the learned representations in GNNs across various architectures and datasets, refining our understanding and trust in these models.
[AI-21] Learning Multiple Initial Solutions to Optimization Problems
Link: https://arxiv.org/abs/2411.02158
Authors: Elad Sharony, Heng Yang, Tong Che, Marco Pavone, Shie Mannor, Peter Karkus
Keywords: Sequentially solving similar, strict runtime constraints, Sequentially solving, solving similar optimization, similar optimization problems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Under Review
Click to view abstract
Abstract:Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict multiple diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required. The code is available at this https URL .
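The two strategies in the abstract above can be illustrated on a toy problem. The sketch below is not the paper's method: the candidate initial solutions are hard-coded instead of being predicted by a learned model, plain gradient descent stands in for the local optimizer, and the selection function simply picks the candidate with the lowest objective value. All names are hypothetical.

```python
def objective(x):
    """Toy non-convex objective: global minimum near -1.30, local minimum near 1.13."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    return 4 * x**3 - 6 * x + 1


def local_optimize(x0, lr=0.01, steps=500):
    """Plain gradient descent; the result depends on the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x


def single_optimizer(candidates):
    """Strategy (i): select the most promising start, then run one optimizer."""
    best_init = min(candidates, key=objective)  # stand-in selection function
    return local_optimize(best_init)


def multiple_optimizers(candidates):
    """Strategy (ii): run one optimizer per start and keep the best final solution."""
    finals = [local_optimize(x0) for x0 in candidates]
    return min(finals, key=objective)


candidates = [2.0, 0.5, -1.5]  # stand-ins for predicted initial solutions
poor_start = local_optimize(2.0)            # gets stuck in the local minimum
selected = single_optimizer(candidates)     # reaches the global minimum
best_of_all = multiple_optimizers(candidates)
```

Strategy (ii) costs one optimizer run per candidate but does not rely on the selection function being right; strategy (i) is cheaper but only as good as the selector.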
[AI-22] Training Compute-Optimal Protein Language Models NEURIPS2024
Link: https://arxiv.org/abs/2411.02142
Authors: Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song
Keywords: explore optimally training, Causal Language Model, Masked Language Model, protein language models, Language Model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: NeurIPS 2024 (Spotlight); Code: this https URL . Additional resources are available here
Click to view abstract
Abstract:We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model (MLM) when repeating the commonly used Uniref database. To address this, we included metagenomic protein sequences in the training set to increase the diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observe a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within less or equivalent pre-training compute budgets.
[AI-23] Generating the Traces You Need: A Conditional Generative Model for Process Mining Data
Link: https://arxiv.org/abs/2411.02131
Authors: Riccardo Graziosi, Massimiliano Ronzani, Andrei Buliga, Chiara Di Francescomarino, Francesco Folino, Chiara Ghidini, Francesca Meneghello, Luigi Pontieri
Keywords: Process Mining community, recent years, Mining community, Process Mining, data
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6th International Conference on Process Mining (ICPM) 2024 Copenhagen, Denmark 14-18 October 2024
Click to view abstract
Abstract:In recent years, trace generation has emerged as a significant challenge within the Process Mining community. Deep Learning (DL) models have demonstrated accuracy in reproducing the features of the selected processes. However, current DL generative models are limited in their ability to adapt the learned distributions to generate data samples based on specific conditions or attributes. This limitation is particularly significant because the ability to control the type of generated data can be beneficial in various contexts, enabling a focus on specific behaviours, exploration of infrequent patterns, or simulation of alternative ‘what-if’ scenarios. In this work, we address this challenge by introducing a conditional model for process data generation based on a conditional variational autoencoder (CVAE). Conditional models offer control over the generation process by tuning input conditional variables, enabling more targeted and controlled data generation. Unlike other domains, CVAE for process mining faces specific challenges due to the multiperspective nature of the data and the need to adhere to control-flow rules while ensuring data variability. Specifically, we focus on generating process executions conditioned on control flow and temporal features of the trace, allowing us to produce traces for specific, identified sub-processes. The generated traces are then evaluated using common metrics for generative model assessment, along with additional metrics to evaluate the quality of the conditional generation.
[AI-24] Unsupervised detection of semantic correlations in big data
Link: https://arxiv.org/abs/2411.02126
Authors: Santiago Acevedo, Alex Rodriguez, Alessandro Laio
Keywords: large feature vectors, extremely large feature, information is stored, stored in extremely, extremely large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:
Click to view abstract
Abstract:In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach identifying phase transitions in model magnetic systems and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.
[AI-25] Revisiting K-mer Profile for Effective and Scalable Genome Representation Learning NEURIPS2024
Link: https://arxiv.org/abs/2411.02125
Authors: Abdulkadir Celikkanat, Andres R. Masegosa, Thomas D. Nielsen
Keywords: Obtaining effective representations, Obtaining effective, DNA sequences, DNA fragments, performing metagenomic binning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Genomics (q-bio.GN)
Comments: Accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Click to view abstract
Abstract:Obtaining effective representations of DNA sequences is crucial for genome analysis. Metagenomic binning, for instance, relies on genome representations to cluster complex mixtures of DNA fragments from biological samples with the aim of determining their microbial compositions. In this paper, we revisit k-mer-based representations of genomes and provide a theoretical analysis of their use in representation learning. Based on the analysis, we propose a lightweight and scalable model for performing metagenomic binning at the genome read level, relying only on the k-mer compositions of the DNA fragments. We compare the model to recent genome foundation models and demonstrate that while the models are comparable in performance, the proposed model is significantly more effective in terms of scalability, a crucial aspect for performing metagenomic binning of real-world datasets.
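For readers unfamiliar with k-mer compositions, the sketch below shows one common way to turn a DNA fragment into a fixed-length k-mer frequency vector. It is a generic illustration, not the model proposed in the paper; the function name `kmer_profile` and the choice to skip windows containing non-ACGT symbols are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=2):
    """Return the normalized k-mer frequency vector of a DNA string.

    The vector has 4**k entries ordered lexicographically over "ACGT".
    Windows containing other symbols (for example N) are ignored.
    """
    sequence = sequence.upper()
    vocabulary = ["".join(letters) for letters in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


profile = kmer_profile("ACGT", k=2)  # windows: AC, CG, GT
```

Because the vector length depends only on k and not on the fragment length, reads of different sizes can be compared or clustered directly, which is what makes such profiles attractive for binning.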
[AI-26] Adaptive Sparse Allocation with Mutual Choice Feature Choice Sparse Autoencoders
Link: https://arxiv.org/abs/2411.02124
Authors: Kola Ayonrinde
Keywords: Choice SAEs, Mutual Choice SAEs, Feature Choice SAEs, Choice SAEs solve, SAEs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages (18 w/ appendices), 7 figures. Preprint
Click to view abstract
Abstract:Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most k features. In TopK SAEs, the k active features per token constraint is the same across tokens, despite some tokens being more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which each allow for a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most m tokens. Mutual Choice SAEs solve the unrestricted allocation problem where the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, aux_zipf_loss, which generalises the aux_k_loss to mitigate dead and underutilised features. Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels as a result of the inherent adaptive computation. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.
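The three allocation rules described in the abstract above differ only in which constraint limits the set of active token-feature matches. The sketch below illustrates that difference on a tiny score matrix; it is a reading of the abstract, not the author's implementation, and the function names and the use of raw scores as the ranking criterion are assumptions.

```python
def topk_mask(scores, k):
    """TopK: every token (row) keeps its k highest-scoring features."""
    mask = []
    for row in scores:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask


def feature_choice_mask(scores, m):
    """Feature Choice: every feature (column) keeps its m highest-scoring tokens."""
    n_tokens, n_features = len(scores), len(scores[0])
    mask = [[0] * n_features for _ in range(n_tokens)]
    for j in range(n_features):
        best = sorted(range(n_tokens), key=lambda i: scores[i][j], reverse=True)[:m]
        for i in best:
            mask[i][j] = 1
    return mask


def mutual_choice_mask(scores, budget):
    """Mutual Choice: keep the `budget` highest-scoring matches anywhere in the matrix."""
    n_tokens, n_features = len(scores), len(scores[0])
    cells = sorted(
        ((i, j) for i in range(n_tokens) for j in range(n_features)),
        key=lambda cell: scores[cell[0]][cell[1]],
        reverse=True,
    )[:budget]
    mask = [[0] * n_features for _ in range(n_tokens)]
    for i, j in cells:
        mask[i][j] = 1
    return mask


# Rows are tokens, columns are features (hypothetical pre-activation scores).
scores = [
    [0.9, 0.1, 0.8],
    [0.2, 0.3, 0.1],
]
per_token = topk_mask(scores, k=1)
per_feature = feature_choice_mask(scores, m=1)
free_budget = mutual_choice_mask(scores, budget=2)
```

In this toy case the first token receives two active features under Feature Choice and Mutual Choice but only one under TopK, which is the variable per-token allocation the abstract refers to.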
[AI-27] Bridge-IF: Learning Inverse Protein Folding with Markov Bridges NEURIPS2024
Link: https://arxiv.org/abs/2411.02120
Authors: Yiheng Zhu, Jialu Wu, Qiuyi Li, Jiahuan Yan, Mingze Yin, Wei Wu, Mingyang Li, Jieping Ye, Zheng Wang, Jian Wu
Keywords: desired backbone structures, Markov bridge, Inverse protein folding, fundamental task, computational protein design
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: NeurIPS 2024
Click to view abstract
Abstract:Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at this https URL.
[AI-28] Differentially Private Integrated Decision Gradients (IDG-DP) for Radar-based Human Activity Recognition
Link: https://arxiv.org/abs/2411.02099
Authors: Idris Zakariyya, Linda Tran, Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni
Keywords: motion analysis offers, analysis offers significant, Human motion analysis, offers significant potential, detection of diseases
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Human motion analysis offers significant potential for healthcare monitoring and early detection of diseases. The advent of radar-based sensing systems has captured the spotlight for they are able to operate without physical contact and they can integrate with pre-existing Wi-Fi networks. They are also seen as less privacy-invasive compared to camera-based systems. However, recent research has shown high accuracy in recognizing subjects or gender from radar gait patterns, raising privacy concerns. This study addresses these issues by investigating privacy vulnerabilities in radar-based Human Activity Recognition (HAR) systems and proposing a novel method for privacy preservation using Differential Privacy (DP) driven by attributions derived with Integrated Decision Gradient (IDG) algorithm. We investigate Black-box Membership Inference Attack (MIA) Models in HAR settings across various levels of attacker-accessible information. We extensively evaluated the effectiveness of the proposed IDG-DP method by designing a CNN-based HAR model and rigorously assessing its resilience against MIAs. Experimental results demonstrate the potential of IDG-DP in mitigating privacy attacks while maintaining utility across all settings, particularly excelling against label-only and shadow model black-box MIA attacks. This work represents a crucial step towards balancing the need for effective radar-based HAR with robust privacy protection in healthcare environments.
[AI-29] Alignment-Based Adversarial Training (ABAT) for Improving the Robustness and Accuracy of EEG-Based BCIs
Link: https://arxiv.org/abs/2411.02094
Authors: Xiaoqing Chen, Ziwei Wang, Dongrui Wu
Keywords: based brain-computer interfaces, achieved great success, Machine learning, success in electroencephalogram, based brain-computer
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Machine learning has achieved great success in electroencephalogram (EEG) based brain-computer interfaces (BCIs). Most existing BCI studies focused on improving the decoding accuracy, with only a few considering the adversarial security. Although many adversarial defense approaches have been proposed in other application domains such as computer vision, previous research showed that their direct extensions to BCIs degrade the classification accuracy on benign samples. This phenomenon greatly affects the applicability of adversarial defense approaches to EEG-based BCIs. To mitigate this problem, we propose alignment-based adversarial training (ABAT), which performs EEG data alignment before adversarial training. Data alignment aligns EEG trials from different domains to reduce their distribution discrepancies, and adversarial training further robustifies the classification boundary. The integration of data alignment and adversarial training can make the trained EEG classifiers simultaneously more accurate and more robust. Experiments on five EEG datasets from two different BCI paradigms (motor imagery classification, and event related potential recognition), three convolutional neural network classifiers (EEGNet, ShallowCNN and DeepCNN) and three different experimental settings (offline within-subject cross-block/-session classification, online cross-session classification, and pre-trained classifiers) demonstrated its effectiveness. It is very intriguing that adversarial attacks, which are usually used to damage BCI systems, can be used in ABAT to simultaneously improve the model accuracy and robustness.
[AI-30] Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism
Link: https://arxiv.org/abs/2411.02086
Authors: Fan Wu, Muhammad Bilal, Haolong Xiang, Heng Wang, Jinjun Yu, Xiaolong Xu
Keywords: Railway Turnout Machines, railway transportation infrastructure, Turnout Machines, Railway Turnout, railway transportation
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
Comments:
Click to view abstract
Abstract:Railway Turnout Machines (RTMs) are mission-critical components of the railway transportation infrastructure, responsible for directing trains onto desired tracks. For safety assurance applications, especially in early-warning scenarios, RTM faults are expected to be detected as early as possible on a continuous 7x24 basis. However, limited emphasis has been placed on distributed model inference frameworks that can meet the inference latency and reliability requirements of such mission critical fault diagnosis systems. In this paper, an edge-cloud collaborative early-warning system is proposed to enable real-time and downtime-tolerant fault diagnosis of RTMs, providing a new paradigm for the deployment of models in safety-critical scenarios. Firstly, a modular fault diagnosis model is designed specifically for distributed deployment, which utilizes a hierarchical architecture consisting of the prior knowledge module, subordinate classifiers, and a fusion layer for enhanced accuracy and parallelism. Then, a cloud-edge collaborative framework leveraging pipeline parallelism, namely CEC-PA, is developed to minimize the overhead resulting from distributed task execution and context exchange by strategically partitioning and offloading model components across cloud and edge. Additionally, an election consensus mechanism is implemented within CEC-PA to ensure system robustness during coordinator node downtime. Comparative experiments and ablation studies are conducted to validate the effectiveness of the proposed distributed fault diagnosis approach. Our ensemble-based fault diagnosis model achieves a remarkable 97.4% accuracy on a real-world dataset collected by Nanjing Metro in Jiangsu Province, China. Meanwhile, CEC-PA demonstrates superior recovery proficiency during node disruptions and speed-up ranging from 1.98x to 7.93x in total inference time compared to its counterparts.
[AI-31] Collaborative Cognitive Diagnosis with Disentangled Representation Learning for Learner Modeling NEURIPS2024
Link: https://arxiv.org/abs/2411.02066
Authors: Weibo Gao, Qi Liu, Linan Yue, Fangzhou Yao, Hao Wang, Yin Gu, Zheng Zhang
Keywords: display comparable observable, comparable observable problem-solving, Learners sharing similar, collaborative, observable problem-solving performances
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS2024
Click to view abstract
Abstract:Learners sharing similar implicit cognitive states often display comparable observable problem-solving performances. Leveraging collaborative connections among such similar learners proves valuable in comprehending human learning. Motivated by the success of collaborative modeling in various domains, such as recommender systems, we aim to investigate how collaborative signals among learners contribute to the diagnosis of human cognitive states (i.e., knowledge proficiency) in the context of intelligent education. The primary challenges lie in identifying implicit collaborative connections and disentangling the entangled cognitive factors of learners for improved explainability and controllability in learner Cognitive Diagnosis (CD). However, there has been no work on CD capable of simultaneously modeling collaborative and disentangled cognitive states. To address this gap, we present Coral, a Collaborative cognitive diagnosis model with disentangled representation learning. Specifically, Coral first introduces a disentangled state encoder to achieve the initial disentanglement of learners’ states. Subsequently, a meticulously designed collaborative representation learning procedure captures collaborative signals. It dynamically constructs a collaborative graph of learners by iteratively searching for optimal neighbors in a context-aware manner. Using the constructed graph, collaborative information is extracted through node representation learning. Finally, a decoding process aligns the initial cognitive states and collaborative states, achieving co-disentanglement with practice performance reconstructions. Extensive experiments demonstrate the superior performance of Coral, showcasing significant improvements over state-of-the-art methods across several real-world datasets. Our code is available at this https URL.
[AI-32] TableGPT2: A Large Multimodal Model with Tabular Data Integration
Link: https://arxiv.org/abs/2411.02059
Authors: Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, Liyao Li, Pengzuo Wu, Qi Zhang, Qingyi Huang, Saisai Yang, Tao Zhang, Wentao Ye, Wufang Zhu, Xiaomeng Hu, Xijun Gu, Xinjie Sun, Xiang Li, Yuhang Yang, Zhiqing Xiao
Keywords: Qwen has reshaped, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Click to view abstract
Abstract:The emergence of models like GPTs, Claude, LLaMA, and Qwen has reshaped AI applications, presenting vast new opportunities across industries. Yet, the integration of tabular data remains notably underdeveloped, despite its foundational role in numerous real-world domains. This gap is critical for three main reasons. First, database or data warehouse data integration is essential for advanced applications; second, the vast and largely untapped resource of tabular data offers immense potential for analysis; and third, the business intelligence domain specifically demands adaptable, precise solutions that many current LLMs may struggle to provide. In response, we introduce TableGPT2, a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-output tuples, a scale of table-related data unprecedented in prior research. This extensive training enables TableGPT2 to excel in table-centric tasks while maintaining strong general language and coding abilities. One of TableGPT2’s key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information. This encoder strengthens the model’s ability to handle ambiguous queries, missing column names, and irregular tables commonly encountered in real-world applications. Similar to visual language models, this pioneering approach integrates with the decoder to form a robust large multimodal model. We believe the results are compelling: over 23 benchmarking metrics, TableGPT2 achieves an average performance improvement of 35.20% in the 7B model and 49.32% in the 72B model over prior benchmark-neutral LLMs, with robust general-purpose capabilities intact. 
[AI-33] Enhancing ID-based Recommendation with Large Language Models
Link: https://arxiv.org/abs/2411.02041
Authors: Lei Chen,Chen Gao,Xiaoyi Du,Hengliang Luo,Depeng Jin,Yong Li,Meng Wang
Keywords-EN: Large Language Models, Large Language, Language Models, recently garnered significant, garnered significant attention
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have recently garnered significant attention in various domains, including recommendation systems. Recent research leverages the capabilities of LLMs to improve the performance and user modeling aspects of recommender systems. These studies primarily focus on utilizing LLMs to interpret textual data in recommendation tasks. However, it’s worth noting that in ID-based recommendations, textual data is absent, and only ID data is available. The untapped potential of LLMs for ID data within the ID-based recommendation paradigm remains relatively unexplored. To this end, we introduce a pioneering approach called “LLM for ID-based Recommendation” (LLM4IDRec). This innovative approach integrates the capabilities of LLMs while exclusively relying on ID data, thus diverging from the previous reliance on textual data. The basic idea of LLM4IDRec is that by employing LLM to augment ID data, if augmented ID data can improve recommendation performance, it demonstrates the ability of LLM to interpret ID data effectively, exploring an innovative way for the integration of LLM in ID-based recommendation. We evaluate the effectiveness of our LLM4IDRec approach using three widely-used datasets. Our results demonstrate a notable improvement in recommendation performance, with our approach consistently outperforming existing methods in ID-based recommendation by solely augmenting input data.
[AI-34] SibylSat: Using SAT as an Oracle to Perform a Greedy Search on TOHTN Planning
Link: https://arxiv.org/abs/2411.02035
Authors: Gaspard Quenard(Marvin),Damier Pellier(Marvin),Humbert Fiorino(Marvin)
Keywords-EN: solve totally-ordered HTN, efficiently solve totally-ordered, totally-ordered HTN problems, paper presents SibylSat, SAT-based method designed
Categories: Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:This paper presents SibylSat, a novel SAT-based method designed to efficiently solve totally-ordered HTN problems (TOHTN). In contrast to prevailing SAT-based HTN planners that employ a breadth-first search strategy, SibylSat adopts a greedy search approach, enabling it to identify promising decompositions for expansion. The selection process is facilitated by a heuristic derived from solving a relaxed problem, which is also expressed as a SAT problem. Our experimental evaluations demonstrate that SibylSat outperforms existing SAT-based TOHTN approaches in terms of both runtime and plan quality on most of the IPC benchmarks, while also solving a larger number of problems.
[AI-35] CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching
Link: https://arxiv.org/abs/2411.02026
Authors: Yu Pan,Yuguang Yang,Jixun Yao,Jianhao Ye,Hongbin Zhou,Lei Ma,Jianjun Zhao
Keywords-EN: previously unseen target, Zero-shot voice conversion, unseen target speaker, voice conversion, aims to transform
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Work in progress; 5 pages;
Click to view abstract
Abstract: Zero-shot voice conversion (VC) aims to transform the timbre of a source speaker into any previously unseen target speaker, while preserving the original linguistic content. Despite notable progress, attaining a degree of speaker similarity and naturalness on par with ground truth recordings continues to pose a great challenge. In this paper, we propose CTEFM-VC, a zero-shot VC framework that leverages Content-aware Timbre Ensemble modeling and Flow Matching. Specifically, CTEFM-VC disentangles utterances into linguistic content and timbre representations, subsequently utilizing a conditional flow matching model and a vocoder to reconstruct the mel-spectrogram and waveform. To enhance its timbre modeling capability and the naturalness of generated speech, we propose a context-aware timbre ensemble modeling approach that adaptively integrates diverse speaker verification embeddings and enables the joint utilization of linguistic and timbre features through a cross-attention module. Experiments show that our CTEFM-VC system surpasses state-of-the-art VC methods in both speaker similarity and naturalness by at least 18.5% and 7.0%, respectively.
[AI-36] Foundations and Recent Trends in Multimodal Mobile Agents: A Survey
Link: https://arxiv.org/abs/2411.02006
Authors: Biao Wu,Yanda Li,Meng Fang,Zirui Song,Zhiwei Zhang,Yunchao Wei,Ling Chen
Keywords-EN: essential for automating, complex and dynamic, dynamic mobile environments, Mobile, mobile agent technologies
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 1 figure
Click to view abstract
Abstract: Mobile agents are essential for automating tasks in complex and dynamic mobile environments. As foundation models evolve, the demands for agents that can adapt in real-time and process multimodal data have grown. This survey provides a comprehensive review of mobile agent technologies, focusing on recent advancements that enhance real-time adaptability and multimodal interaction. Recent evaluation benchmarks have been developed to better capture the static and interactive environments of mobile tasks, offering more accurate assessments of agents’ performance. We then categorize these advancements into two main approaches: prompt-based methods, which utilize large language models (LLMs) for instruction-based task execution, and training-based methods, which fine-tune multimodal models for mobile-specific applications. Additionally, we explore complementary technologies that augment agent performance. By discussing key challenges and outlining future research directions, this survey offers valuable insights for advancing mobile agent technologies. A comprehensive resource list is available at this https URL
[AI-37] Against Multifaceted Graph Heterogeneity via Asymmetric Federated Prompt Learning
Link: https://arxiv.org/abs/2411.02003
Authors: Zhuoning Guo,Ruiqian Han,Hao Liu
Keywords-EN: privately optimize graph, optimize graph models, Graph, Graph Prompt Learning, Federated Graph Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)
Comments:
Click to view abstract
Abstract: Federated Graph Learning (FGL) aims to collaboratively and privately optimize graph models on divergent data for different tasks. A critical challenge in FGL is to enable effective yet efficient federated optimization against multifaceted graph heterogeneity to enhance mutual performance. However, existing FGL works primarily address graph data heterogeneity and are incapable of handling graph task heterogeneity. To address the challenge, we propose a Federated Graph Prompt Learning (FedGPL) framework to efficiently enable prompt-based asymmetric graph knowledge transfer between multifaceted heterogeneous federated participants. Generally, we establish a split federated framework to preserve universal and domain-specific graph knowledge, respectively. Moreover, we develop two algorithms to eliminate task and data heterogeneity for advanced federated knowledge preservation. First, a Hierarchical Directed Transfer Aggregator (HiDTA) delivers cross-task beneficial knowledge that is hierarchically distilled according to the directional transferability. Second, a Virtual Prompt Graph (VPG) adaptively generates graph structures to enhance data utility by distinguishing dominant subgraphs and neutralizing redundant ones. We conduct theoretical analyses and extensive experiments to demonstrate the accuracy and efficiency of FedGPL against multifaceted graph heterogeneity compared to state-of-the-art baselines on large-scale federated graph datasets.
[AI-38] Understanding Variational Autoencoders with Intrinsic Dimension and Information Imbalance NEURIPS2024
Link: https://arxiv.org/abs/2411.01978
Authors: Charles Camboulin,Diego Doimo,Aldo Glielmo
Keywords-EN: Variational Autoencoders, Intrinsic Dimension, Information Imbalance, representations of Variational, work presents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 4 pages, 3 figures, accepted at the Unifying Representations in Neural Models (UniReps) workshop of NeurIPS 2024 ( this https URL )
Click to view abstract
Abstract:This work presents an analysis of the hidden representations of Variational Autoencoders (VAEs) using the Intrinsic Dimension (ID) and the Information Imbalance (II). We show that VAEs undergo a transition in behaviour once the bottleneck size is larger than the ID of the data, manifesting in a double hunchback ID profile and a qualitative shift in information processing as captured by the II. Our results also highlight two distinct training phases for architectures with sufficiently large bottleneck sizes, consisting of a rapid fit and a slower generalisation, as assessed by a differentiated behaviour of ID, II, and KL loss. These insights demonstrate that II and ID could be valuable tools for aiding architecture search, for diagnosing underfitting in VAEs, and, more broadly, they contribute to advancing a unified understanding of deep generative models through geometric analysis.
[AI-39] Active Gaze Behavior Boosts Self-Supervised Object Learning
Link: https://arxiv.org/abs/2411.01969
Authors: Zhengyang Yu,Arthur Aubret,Marcel C. Raabe,Jane Yang,Chen Yu,Jochen Triesch
Keywords-EN: Due to significant, visual, Due, significant variations, object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures
Click to view abstract
Abstract: Due to significant variations in the projection of the same object from different viewpoints, machine learning algorithms struggle to recognize the same object across various perspectives. In contrast, toddlers quickly learn to recognize objects from different viewpoints with almost no supervision. Recent works argue that toddlers develop this ability by mapping close-in-time visual inputs to similar representations while interacting with objects. High acuity vision is only available in the central visual field, which may explain why toddlers (much like adults) constantly move their gaze around during such interactions. It is unclear whether/how much toddlers curate their visual experience through these eye movements to support learning object representations. In this work, we explore whether a bio-inspired visual learning model can harness toddlers’ gaze behavior during a play session to develop view-invariant object recognition. Exploiting head-mounted eye tracking during dyadic play, we simulate toddlers’ central visual field experience by cropping image regions centered on the gaze location. This visual stream feeds a time-based self-supervised learning algorithm. Our experiments demonstrate that toddlers’ gaze strategy supports the learning of invariant object representations. Our analysis also reveals that the limited size of the central visual field where acuity is high is crucial for this. We further find that toddlers’ visual experience elicits more robust representations compared to adults’, mostly because toddlers look at objects they hold themselves for longer bouts. Overall, our work reveals how toddlers’ gaze behavior supports self-supervised learning of view-invariant object recognition.
[AI-40] V-CAS: A Realtime Vehicle Anti Collision System Using Vision Transformer on Multi-Camera Streams ICMLA
Link: https://arxiv.org/abs/2411.01963
Authors: Muhammad Waqas Ashraf,Ali Hassan,Imad Ali Shah
Keywords-EN: enhance vehicle safety, enhance vehicle, Vehicle Collision Avoidance, real-time Vehicle Collision, designed to enhance
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at ICMLA 2024
Click to view abstract
Abstract:This paper introduces a real-time Vehicle Collision Avoidance System (V-CAS) designed to enhance vehicle safety through adaptive braking based on environmental perception. V-CAS leverages the advanced vision-based transformer model RT-DETR, DeepSORT tracking, speed estimation, brake light detection, and an adaptive braking mechanism. It computes a composite collision risk score based on vehicles’ relative accelerations, distances, and detected braking actions, using brake light signals and trajectory data from multiple camera streams to improve scene perception. Implemented on the Jetson Orin Nano, V-CAS enables real-time collision risk assessment and proactive mitigation through adaptive braking. A comprehensive training process was conducted on various datasets for comparative analysis, followed by fine-tuning the selected object detection model using transfer learning. The system’s effectiveness was rigorously evaluated on the Car Crash Dataset (CCD) from YouTube and through real-time experiments, achieving over 98% accuracy with an average proactive alert time of 1.13 seconds. Results indicate significant improvements in object detection and tracking, enhancing collision avoidance compared to traditional single-camera methods. This research demonstrates the potential of low-cost, multi-camera embedded vision transformer systems to advance automotive safety through enhanced environmental perception and proactive collision avoidance mechanisms.
[AI-41] Evaluating the quality of published medical research with ChatGPT
Link: https://arxiv.org/abs/2411.01952
Authors: Mike Thelwall,Xiaorui Jiang,Peter A. Bath
Keywords-EN: Clinical Medicine, Clinical Medicine correlated, time-consuming but important, REF scores, REF
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract:Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions. Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine. This article investigates this anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK’s Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226 (due to the departmental averaging involved). At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, their mean ChatGPT score correlated strongly with their REF score (r=0.495) but negatively with their citation rate (r=-0.148). Journal and departmental anomalies in these results point to ChatGPT being ineffective at assessing the quality of research in prestigious medical journals or research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT’s ability to assess research quality overall for Clinical Medicine, so now there is evidence of its ability in all academic fields.
[AI-42] HACD: Harnessing Attribute Semantics and Mesoscopic Structure for Community Detection
Link: https://arxiv.org/abs/2411.01947
Authors: Anran Zhang,Xingfen Wang,Yuhan Zhao
Keywords-EN: closely connected subgraphs, uncovering closely connected, attributed community detection, Community detection, Community detection plays
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Community detection plays a pivotal role in uncovering closely connected subgraphs, aiding various real-world applications such as recommendation systems and anomaly detection. With the surge of rich information available for entities in real-world networks, the community detection problem in attributed networks has attracted widespread attention. While previous research has effectively leveraged network topology and attribute information for attributed community detection, these methods overlook two critical issues: (i) the semantic similarity between node attributes within the community, and (ii) the inherent mesoscopic structure, which differs from the pairwise connections of the micro-structure. To address these limitations, we propose HACD, a novel attributed community detection model based on heterogeneous graph attention networks. HACD treats node attributes as another type of node, constructs attributed networks into heterogeneous graph structures and employs attribute-level attention mechanisms to capture semantic similarity. Furthermore, HACD introduces a community membership function to explore mesoscopic community structures, enhancing the robustness of detected communities. Extensive experiments demonstrate the effectiveness and efficiency of HACD, outperforming state-of-the-art methods in attributed community detection tasks. Our code is publicly available at this https URL.
[AI-43] Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis
Link: https://arxiv.org/abs/2411.01929
Authors: Mohammad Zbeeb,Mohammad Ghorayeb,Mariam Salman
Keywords-EN: Malicious Network Traffic, Artificial Intelligence, research often aims, aims to develop, generalize reliably
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 7 figures, 3 tables, 1 algorithm. code @ this https URL
Click to view abstract
Abstract: Artificial Intelligence (AI) research often aims to develop models that generalize reliably across complex datasets, yet this remains challenging in fields where data is scarce, intricate, or inaccessible. This paper introduces a novel approach leveraging three generative models of varying complexity to synthesize one of the most demanding structured datasets: Malicious Network Traffic. Our approach transforms numerical data into text, reframing data generation as a language modeling task, which enhances data regularization and significantly improves generalization and the quality of the synthetic data. Extensive statistical analyses demonstrate that our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data. Additionally, we conduct a comprehensive study on synthetic data applications, effectiveness, and evaluation strategies, offering valuable insights into its role across various domains. Our code and pre-trained models are openly accessible at this https URL, enabling further exploration and application of our methodology. Index Terms: Data synthesis, machine learning, traffic generation, privacy-preserving data, generative models.
[AI-44] Fairness-Utilization Trade-off in Wireless Networks with Explainable Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2411.01924
Authors: Masoud Shokrnezhad,Hamidreza Mazandarani,Tarik Taleb
Keywords-EN: Deep Neural Networks, wireless networks brings, effective distribution, significant advancements, Deep Neural
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: a conference paper, accepted for publication at IEEE VCC 2024
Click to view abstract
Abstract: The effective distribution of user transmit powers is essential for the significant advancements that the emergence of 6G wireless networks brings. In recent studies, Deep Neural Networks (DNNs) have been employed to address this challenge. However, these methods frequently encounter issues regarding fairness and computational inefficiency when making decisions, rendering them unsuitable for future dynamic services that depend heavily on the participation of each individual user. To address this gap, this paper focuses on the challenge of transmit power allocation in wireless networks, aiming to optimize $\alpha$-fairness to balance network utilization and user equity. We introduce a novel approach utilizing Kolmogorov-Arnold Networks (KANs), a class of machine learning models that offer low inference costs compared to traditional DNNs through superior explainability. The study provides a comprehensive problem formulation, establishing the NP-hardness of the power allocation problem. Then, two algorithms are proposed for dataset generation and decentralized KAN training, offering a flexible framework for achieving various fairness objectives in dynamic 6G environments. Extensive numerical simulations demonstrate the effectiveness of our approach in terms of fairness and inference cost. The results underscore the potential of KANs to overcome the limitations of existing DNN-based methods, particularly in scenarios that demand rapid adaptation and fairness.
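For readers unfamiliar with the α-fairness objective named in the abstract, the sketch below uses the standard textbook α-fair utility: each user's rate x contributes x^(1-α)/(1-α) when α ≠ 1 and log(x) when α = 1. This is an assumption about the general definition, not the paper's exact formulation; the function name `alpha_fair_utility` and the example rate vectors are hypothetical.

```python
import math


def alpha_fair_utility(rates, alpha):
    """Sum of textbook alpha-fair utilities over per-user rates.

    Hypothetical helper for illustration only; it is not taken from the paper.
    alpha = 0 reduces to total throughput, alpha = 1 to proportional fairness.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if any(r <= 0 for r in rates):
        raise ValueError("all rates must be strictly positive")
    if math.isclose(alpha, 1.0):
        # Limit case of the general formula as alpha approaches 1.
        return sum(math.log(r) for r in rates)
    return sum(r ** (1.0 - alpha) / (1.0 - alpha) for r in rates)


if __name__ == "__main__":
    unequal = [1.0, 3.0]  # same total rate, split unevenly
    equal = [2.0, 2.0]    # same total rate, split evenly
    for a in (0.0, 1.0, 2.0):
        print(a, alpha_fair_utility(unequal, a), alpha_fair_utility(equal, a))
```

With α = 0 both allocations score the same because only the total rate counts, while larger α values favor the even split. That is the sense in which α trades network utilization against user equity.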
[AI-45] Best-Arm Identification in Unimodal Bandits
Link: https://arxiv.org/abs/2411.01898
Authors: Riccardo Poiani,Marc Jourdan,Emilie Kaufmann,Rémy Degenne
Keywords-EN: fixed-confidence best-arm identification, best-arm identification problem, study the fixed-confidence, fixed-confidence best-arm, best-arm identification
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract: We study the fixed-confidence best-arm identification problem in unimodal bandits, in which the means of the arms increase with the index of the arm up to their maximum, then decrease. We derive two lower bounds on the stopping time of any algorithm. The instance-dependent lower bound suggests that due to the unimodal structure, only three arms contribute to the leading confidence-dependent cost. However, a worst-case lower bound shows that a linear dependence on the number of arms is unavoidable in the confidence-independent cost. We propose modifications of Track-and-Stop and a Top Two algorithm that leverage the unimodal structure. Both versions of Track-and-Stop are asymptotically optimal for one-parameter exponential families. The Top Two algorithm is asymptotically near-optimal for Gaussian distributions and we prove a non-asymptotic guarantee matching the worst-case lower bound. The algorithms can be implemented efficiently and we demonstrate their competitive empirical performance.
[AI-46] LE-PDE: Mamba for accelerating PDEs Simulations
Link: https://arxiv.org/abs/2411.01897
Authors: Aoming Liang,Zhaoyang Mu,Qi liu,Ruipeng Li,Mingming Ge,Dixia Fan
Keywords-EN: Partial Differential Equations, Partial Differential, Differential Equations, Equations are foundational, weather forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: submitted in 08/15/2025 for artificial intelligence meeting
Click to view abstract
Abstract:Partial Differential Equations are foundational in modeling science and natural systems such as fluid dynamics and weather forecasting. The Latent Evolution of PDEs method is designed to address the computational intensity of classical and deep learning-based PDE solvers by proposing a scalable and efficient alternative. To enhance the efficiency and accuracy of LE-PDE, we incorporate the Mamba model, an advanced machine learning model known for its predictive efficiency and robustness in handling complex dynamic systems with a progressive learning strategy. The LE-PDE was tested on several benchmark problems. The method demonstrated a marked reduction in computational time compared to traditional solvers and standalone deep learning models while maintaining high accuracy in predicting system behavior over time. Our method doubles the inference speed compared to the LE-PDE while retaining the same level of parameter efficiency, making it well-suited for scenarios requiring long-term predictions.
[AI-47] LiDAttack: Robust Black-box Attack on LiDAR-based Object Detection
Link: https://arxiv.org/abs/2411.01889
Authors: Jinyin Chen,Danxin Liao,Sheng Xiang,Haibin Zheng
Keywords-EN: carefully crafted adversarial, DNN is vulnerable, extensively studied, crafted adversarial, vulnerable to carefully
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Click to view abstract
Abstract: Since DNNs are vulnerable to carefully crafted adversarial examples, adversarial attacks on LiDAR sensors have been extensively studied. We introduce a robust black-box attack dubbed LiDAttack. It utilizes a genetic algorithm with a simulated annealing strategy to strictly limit the location and number of perturbation points, achieving a stealthy and effective attack. It also simulates scanning deviations, allowing it to adapt to dynamic changes in real-world scenarios. Extensive experiments are conducted on 3 datasets (i.e., KITTI, nuScenes, and self-constructed data) with 3 dominant object detection models (i.e., PointRCNN, PointPillar, and PV-RCNN++). The results reveal the efficiency of LiDAttack when targeting a wide range of object detection models, with an attack success rate (ASR) of up to 90%.
[AI-48] Mining and Transferring Feature-Geometry Coherence for Unsupervised Point Cloud Registration NEURIPS2024
Link: https://arxiv.org/abs/2411.01870
Authors: Kezheng Xiong,Haoen Xiang,Qingshan Xu,Chenglu Wen,Siqi Shen,Jonathan Li,Cheng Wang
Keywords-EN: Point cloud registration, achieved remarkable success, outdoor point cloud, Point cloud, cloud registration methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS2024
Click to view abstract
Abstract:Point cloud registration, a fundamental task in 3D vision, has achieved remarkable success with learning-based methods in outdoor environments. Unsupervised outdoor point cloud registration methods have recently emerged to circumvent the need for costly pose annotations. However, they fail to establish reliable optimization objectives for unsupervised training, either relying on overly strong geometric assumptions, or suffering from poor-quality pseudo-labels due to inadequate integration of low-level geometric and high-level contextual information. We have observed that in the feature space, latent new inlier correspondences tend to cluster around respective positive anchors that summarize features of existing inliers. Motivated by this observation, we propose a novel unsupervised registration method termed INTEGER to incorporate high-level contextual information for reliable pseudo-label mining. Specifically, we propose the Feature-Geometry Coherence Mining module to dynamically adapt the teacher for each mini-batch of data during training and discover reliable pseudo-labels by considering both high-level feature representations and low-level geometric cues. Furthermore, we propose Anchor-Based Contrastive Learning to facilitate contrastive learning with anchors for a robust feature space. Lastly, we introduce a Mixed-Density Student to learn density-invariant features, addressing challenges related to density variation and low overlap in the outdoor scenario. Extensive experiments on KITTI and nuScenes datasets demonstrate that our INTEGER achieves competitive performance in terms of accuracy and generalizability.
[AI-49] Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales
Link: https://arxiv.org/abs/2411.01866
Authors: Resul Dagdanov,Milan Andrejevic,Dikai Liu,Chin-Teng Lin
Keywords-EN: human trust, trust, adjust their behavior, behavior based, based on perceived
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures, 1 table. This work has been submitted to the IEEE for possible publication
Click to view abstract
Abstract:When interacting with each other, humans adjust their behavior based on perceived trust. However, to achieve similar adaptability, robots must accurately estimate human trust at sufficiently granular timescales during the human-robot collaboration task. A beta reputation is a popular way to formalize a mathematical estimation of human trust. However, it relies on binary performance, which updates trust estimations only after each task concludes. Additionally, manually crafting a reward function is the usual method of building a performance indicator, which is labor-intensive and time-consuming. These limitations prevent efficiently capturing continuous changes in trust at more granular timescales throughout the collaboration task. Therefore, this paper presents a new framework for the estimation of human trust using a beta reputation at fine-grained timescales. To achieve granularity in beta reputation, we utilize continuous reward values to update trust estimations at each timestep of a task. We construct a continuous reward function using maximum entropy optimization to eliminate the need for the laborious specification of a performance indicator. The proposed framework improves trust estimations by increasing accuracy, eliminating the need for manually crafting a reward function, and advancing toward developing more intelligent robots. The source code is publicly available. this https URL
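The idea of updating a beta reputation at every timestep from a continuous reward can be sketched as follows. This is a minimal illustration that assumes rewards are already normalized to [0, 1] and uses a common soft-evidence update (the reward is added to the positive count and its complement to the negative count). The class `BetaTrust`, the helper `trust_trajectory`, and this update rule are assumptions for illustration, not the authors' implementation.

```python
class BetaTrust:
    """Trust estimate taken as the mean of a Beta(alpha, beta) distribution.

    Hypothetical sketch: the soft-evidence update below is an assumption,
    not the update rule from the paper.
    """

    def __init__(self, alpha: float = 1.0, beta: float = 1.0) -> None:
        if alpha <= 0 or beta <= 0:
            raise ValueError("alpha and beta must be positive")
        self.alpha = alpha
        self.beta = beta

    @property
    def mean(self) -> float:
        # Expected value of Beta(alpha, beta), used here as the trust estimate.
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: float) -> float:
        """Fold one per-timestep reward in [0, 1] into the estimate."""
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        self.alpha += reward          # fractional positive evidence
        self.beta += 1.0 - reward     # fractional negative evidence
        return self.mean


def trust_trajectory(rewards):
    """Return the trust estimate after each timestep of a task."""
    estimator = BetaTrust()
    return [estimator.update(r) for r in rewards]


if __name__ == "__main__":
    print(trust_trajectory([1.0, 0.5, 0.2, 0.9]))
```

A binary performance signal is the special case where every reward is exactly 0 or 1 and arrives once per task. Feeding a reward at each timestep instead lets the estimate move during the task.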
[AI-50] Silver medal Solution for Image Matching Challenge 2024
Link: https://arxiv.org/abs/2411.01851
Authors: Yian Wang
Keywords: computer vision challenges, solve fundamental computer, fundamental computer vision, diverse image sets, Image Matching Challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Image Matching Challenge 2024 is a competition focused on building 3D maps from diverse image sets, requiring participants to solve fundamental computer vision challenges in image matching across varying angles, lighting, and seasonal changes. This project develops a Pipeline method that combines multiple advanced techniques: using pre-trained EfficientNet-B7 for initial feature extraction and cosine distance-based image pair filtering, employing both KeyNetAffNetHardNet and SuperPoint for keypoint feature extraction, utilizing AdaLAM and SuperGlue for keypoint matching, and finally applying Pycolmap for 3D spatial analysis. The methodology achieved an excellent score of 0.167 on the private leaderboard, with experimental results demonstrating that the combination of KeyNetAffNetHardNet and SuperPoint provides significant advantages in keypoint detection and matching, particularly when dealing with challenging variations in surface texture and environmental conditions that typically degrade traditional algorithm performance.
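The first pipeline stage described above, filtering image pairs by cosine distance between global descriptors, can be sketched as follows. The function names and the `max_distance` threshold are hypothetical, and the vectors stand in for EfficientNet-B7 features, which are not computed here.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def candidate_pairs(embeddings, max_distance=0.3):
    """Keep only image pairs whose global descriptors are close enough to be worth matching.

    embeddings: dict mapping image name -> descriptor vector (assumed precomputed).
    """
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine_distance(embeddings[a], embeddings[b]) <= max_distance
    ]
```

Only the surviving pairs would then be passed to the more expensive keypoint extraction and matching stages.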
[AI-51] DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship
Link: https://arxiv.org/abs/2411.01844
Authors: Yaqiong Li, Peng Zhang, Hansu Gu, Tun Lu, Siyuan Qiao, Yubo Shu, Yiyang Shao, Ning Gu
Keywords: supporting toxicity censorship, tools supporting toxicity, social posts, automated approaches, toxicity censorship
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people’s diverse needs in toxicity censorship and then build a ChatGPT-based censorship tool named DeMod accordingly. DeMod is equipped with the features of explainable Detection and personalized Modification, providing fine-grained detection results, detailed explanations, and personalized modification suggestions. We also implemented the tool and recruited 35 Weibo users for evaluation. The results suggest DeMod’s multiple strengths like the richness of functionality, the accuracy of censorship, and ease of use. Based on the findings, we further propose several insights into the design of content censorship systems.
[AI-52] DiffuMask-Editor: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability
Link: https://arxiv.org/abs/2411.01819
Authors: Bo Gao, Fangxu Xing, Daniel Tang
Keywords: Semantic segmentation models, manually annotated data, Semantic segmentation, Stable Diffusion, inefficient to acquire
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures
Abstract:Semantic segmentation models, like mask2former, often demand a substantial amount of manually annotated data, which is time-consuming and inefficient to acquire. Leveraging state-of-the-art text-to-image models like Midjourney and Stable Diffusion has emerged as an effective strategy for automatically generating synthetic data instead of human annotations. However, prior approaches have been constrained to synthesizing single-instance images due to the instability inherent in generating multiple instances with Stable Diffusion. To expand the domains and diversity of synthetic datasets, this paper introduces a novel paradigm named DiffuMask-Editor, which combines the Diffusion Model for Segmentation with Image Editing. By integrating multiple objects into images using Text2Image models, our method facilitates the creation of more realistic datasets that closely resemble open-world settings while simultaneously generating accurate masks. Our approach significantly reduces the laborious effort associated with manual annotation while ensuring precise mask generation. Experimental results demonstrate that synthetic data generated by DiffuMask-Editor enable segmentation methods to achieve superior performance compared to real data. Particularly in zero-shot backgrounds, DiffuMask-Editor achieves new state-of-the-art results on Unseen classes of VOC 2012. The code and models will be publicly available soon.
[AI-53] So You Think You Can Scale Up Autonomous Robot Data Collection?
Link: https://arxiv.org/abs/2411.01813
Authors: Suvir Mirchandani, Suneel Belkhale, Joey Hejna, Evelyn Choi, Md Sazzad Islam, Dorsa Sadigh
Keywords: autonomous data collection, autonomous, data collection, skills autonomously, autonomous data
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 21 pages, 25 figures. Conference on Robot Learning (CoRL) 2024
Abstract:A long-standing goal in robot learning is to develop methods for robots to acquire new skills autonomously. While reinforcement learning (RL) comes with the promise of enabling autonomous data collection, it remains challenging to scale in the real-world partly due to the significant effort required for environment design and instrumentation, including the need for designing reset functions or accurate success detectors. On the other hand, imitation learning (IL) methods require little to no environment design effort, but instead require significant human supervision in the form of collected demonstrations. To address these shortcomings, recent works in autonomous IL start with an initial seed dataset of human demonstrations that an autonomous policy can bootstrap from. While autonomous IL approaches come with the promise of addressing the challenges of autonomous RL as well as pure IL strategies, in this work, we posit that such techniques do not deliver on this promise and are still unable to scale up autonomous data collection in the real world. Through a series of real-world experiments, we demonstrate that these approaches, when scaled up to realistic settings, face much of the same scaling challenges as prior attempts in RL in terms of environment design. Further, we perform a rigorous study of autonomous IL methods across different data scales and 7 simulation and real-world tasks, and demonstrate that while autonomous data collection can modestly improve performance, simply collecting more human data often provides significantly more improvement. Our work suggests a negative result: that scaling up autonomous data collection for learning robot policies for real-world tasks is more challenging and impractical than what is suggested in prior work. We hope these insights about the core challenges of scaling up data collection help inform future efforts in autonomous learning.
[AI-54] Can Language Models Enable In-Context Database?
Link: https://arxiv.org/abs/2411.01807
Authors: Yu Pan, Hongfeng Yu, Tianjiao Zhao, Jianxin Sun
Keywords: few-shot learners capable, Large language models, including comprehension, question answering, arithmetic calculations
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are emerging as few-shot learners capable of handling a variety of tasks, including comprehension, planning, reasoning, question answering, arithmetic calculations, and more. At the core of these capabilities is LLMs’ proficiency in representing and understanding structural or semi-structural data, such as tables and graphs. Numerous studies have demonstrated that reasoning on tabular data or graphs is not only feasible for LLMs but also gives a promising research direction which treats these data as in-context data. The lightweight and human readable characteristics of in-context database can potentially make it an alternative for the traditional database in typical RAG (Retrieval Augmented Generation) settings. However, almost all current work focuses on static in-context data, which does not allow dynamic update. In this paper, to enable dynamic database update, delta encoding of database is proposed. We explore how data stored in traditional RDBMS can be encoded as in-context text and evaluate LLMs’ proficiency for CRUD (Create, Read, Update and Delete) operations on in-context databases. A benchmark named InConDB is presented and extensive experiments are conducted to show the performance of different language models in enabling in-context database by varying the database encoding method, prompting method, operation type and input data distribution, revealing both the proficiency and limitations.
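To make the notion of a delta-encoded in-context database concrete, here is a minimal sketch: a table snapshot is serialized as text, and later changes are appended as delta lines rather than by re-encoding the whole table. The text format, the function names, and the `id` key convention are all assumptions for illustration; the abstract does not describe InConDB's actual encoding. `apply_deltas` is a plain-Python reference replay that could serve as ground truth when checking a model's answers.

```python
def encode_table(name, columns, rows):
    """Serialize a table snapshot as pipe-separated text (hypothetical format)."""
    lines = [f"TABLE {name}", " | ".join(columns)]
    lines += [" | ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


def encode_delta(op, key, values=None):
    """Serialize one change as a single text line appended after the snapshot."""
    if op == "DELETE":
        return f"DELTA DELETE id={key}"
    assignments = ", ".join(f"{col}={val}" for col, val in values.items())
    return f"DELTA {op} id={key} SET {assignments}"


def apply_deltas(rows_by_id, deltas):
    """Replay (op, key, values) deltas on a copy of the table to get the expected state."""
    state = {key: dict(row) for key, row in rows_by_id.items()}
    for op, key, values in deltas:
        if op == "INSERT":
            state[key] = dict(values)
        elif op == "UPDATE":
            state[key].update(values)
        elif op == "DELETE":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


def build_context(snapshot_text, delta_lines):
    """Concatenate the static snapshot and the delta log into one prompt context."""
    return "\n".join([snapshot_text, *delta_lines])
```

The design point is that an update costs one appended line of context, while the reader (here, the language model) has to resolve the snapshot plus the log into the current state.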
[AI-55] Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge NEURIPS2024
Link: https://arxiv.org/abs/2411.01796
Authors: Weihua Du, Qiushi Lyu, Jiaming Shan, Zhenting Qi, Hongxin Zhang, Sunli Chen, Andi Peng, Tianmin Shu, Kwonjoon Lee, Behzad Dariush, Chuang Gan
Keywords: Constrained Human-AI Cooperation, introduce Constrained Human-AI, Constrained Human-AI, inclusive embodied social, Human-AI Cooperation
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: NeurIPS 2024 Dataset and Benchmark Track. Project at this URL: this https URL
Abstract:We introduce Constrained Human-AI Cooperation (CHAIC), an inclusive embodied social intelligence challenge designed to test social perception and cooperation in embodied agents. In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints – e.g., unable to reach high places or confined to a wheelchair – in performing common household or outdoor tasks as efficiently as possible. To achieve this, a successful helper must: (1) infer the human’s intents and constraints by following the human and observing their behaviors (social perception), and (2) make a cooperative plan tailored to the human partner to solve the task as quickly as possible, working together as a team (cooperative planning). To benchmark this challenge, we create four new agents with real physical constraints and eight long-horizon tasks featuring both indoor and outdoor scenes with various constraints, emergency events, and potential risks. We benchmark planning- and learning-based baselines on the challenge and introduce a new method that leverages large language models and behavior modeling. Empirical evaluations demonstrate the effectiveness of our benchmark in enabling systematic assessment of key aspects of machine social intelligence. Our benchmark and code are publicly available at this URL: this https URL.
[AI-56] Thinking Forward and Backward: Effective Backward Planning with Large Language Models
Link: https://arxiv.org/abs/2411.01790
Authors: Allen Z. Ren, Brian Ichter, Anirudha Majumdar
Keywords: Large language models, Large language, exhibited remarkable reasoning, planning, language models
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review
Abstract:Large language models (LLMs) have exhibited remarkable reasoning and planning capabilities. Most prior work in this area has used LLMs to reason through steps from an initial to a goal state or criterion, thereby effectively reasoning in a forward direction. Nonetheless, many planning problems exhibit an inherent asymmetry such that planning backward from the goal is significantly easier – for example, if there are bottlenecks close to the goal. We take inspiration from this observation and demonstrate that this bias holds for LLM planning as well: planning performance in one direction correlates with the planning complexity of the problem in that direction. However, our experiments also reveal systematic biases which lead to poor planning in the backward direction. With this knowledge, we propose a backward planning algorithm for LLMs that first flips the problem and then plans forward in the flipped problem. This helps avoid the backward bias, generate more diverse candidate plans, and exploit asymmetries between the forward and backward directions in planning problems – we find that combining planning in both directions with self-verification improves the overall planning success rates by 4-24% in three planning domains.
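The "flip the problem, then plan forward" idea can be shown on a toy state graph. In this sketch a breadth-first search stands in for the LLM planner (an assumption; the paper uses LLM-generated plans), and all function names are hypothetical. Flipping reverses every transition and swaps the start and goal, so a forward plan in the flipped problem is a backward plan in the original one; a simple validity check plays the role of verification.

```python
from collections import deque


def bfs_plan(edges, start, goal):
    """Forward planner stand-in: shortest state sequence from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in edges.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def flip_problem(edges, start, goal):
    """Reverse every transition and swap start and goal."""
    flipped = {}
    for src, dsts in edges.items():
        for dst in dsts:
            flipped.setdefault(dst, []).append(src)
    return flipped, goal, start


def plan_via_flip(edges, start, goal):
    """Plan forward in the flipped problem, then reverse the plan for the original one."""
    flipped, f_start, f_goal = flip_problem(edges, start, goal)
    path = bfs_plan(flipped, f_start, f_goal)
    return None if path is None else path[::-1]


def is_valid_plan(edges, plan, start, goal):
    """Check that the plan starts and ends correctly and uses only existing transitions."""
    if not plan or plan[0] != start or plan[-1] != goal:
        return False
    return all(b in edges.get(a, ()) for a, b in zip(plan, plan[1:]))
```

When a bottleneck sits near the goal, the flipped search expands few states early on, which mirrors the asymmetry the abstract describes.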
[AI-57] Transferable Sequential Recommendation via Vector Quantized Meta Learning
Link: https://arxiv.org/abs/2411.01785
Authors: Zhenrui Yue, Huimin Zeng, Yang Zhang, Julian McAuley, Dong Wang
Keywords: achieves significant progress, systems remains challenging, remains challenging due, large-scale recommender systems, recommender systems remains
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to BigData 2024
Abstract:While sequential recommendation achieves significant progress on capturing user-item transition patterns, transferring such large-scale recommender systems remains challenging due to the disjoint user and item groups across domains. In this paper, we propose a vector quantized meta learning for transferable sequential recommenders (MetaRec). Without requiring additional modalities or shared information across domains, our approach leverages user-item interactions from multiple source domains to improve the target domain performance. To solve the input heterogeneity issue, we adopt vector quantization that maps item embeddings from heterogeneous input spaces to a shared feature space. Moreover, our meta transfer paradigm exploits limited target data to guide the transfer of source domain knowledge to the target domain (i.e., learn to transfer). In addition, MetaRec adaptively transfers from multiple source tasks by rescaling meta gradients based on the source-target domain similarity, enabling selective learning to improve recommendation performance. To validate the effectiveness of our approach, we perform extensive experiments on benchmark datasets, where MetaRec consistently outperforms baseline methods by a considerable margin.
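The vector quantization step mentioned above, mapping item embeddings from different domains onto a shared set of codewords, can be sketched as a nearest-neighbor lookup. The codebook here is fixed and hand-written, and the embeddings are assumed to already share a dimension; in MetaRec the codebook would be learned, and those details are not given in the abstract.

```python
def quantize(embedding, codebook):
    """Return (index, codeword) of the codebook entry nearest in squared Euclidean distance."""
    def squared_distance(index):
        return sum((e - c) ** 2 for e, c in zip(embedding, codebook[index]))

    best = min(range(len(codebook)), key=squared_distance)
    return best, codebook[best]


def to_shared_codes(item_embeddings, codebook):
    """Map each item (from any domain) to the index of its codeword in the shared codebook."""
    return {item: quantize(vec, codebook)[0] for item, vec in item_embeddings.items()}
```

Items from disjoint catalogs that land on the same codeword then share a representation, which is what makes transfer across domains possible without overlapping users or items.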
[AI-58] Context Parallelism for Scalable Million-Token Inference
Link: https://arxiv.org/abs/2411.01783
Authors: Amy (Jie) Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang
Keywords: language model inference, present context parallelism, achieves near-linear scaling, large language model, long-context large language
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. Particularly, our method achieves 1M context prefill with Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants: pass-KV and pass-Q to cover a wide range of use cases with the state-of-the-art performance: full prefill, persistent KV prefill and decode. Benchmarks on H100 GPU hosts inter-connected with RDMA and TCP both show similar scalability for long-context prefill, demonstrating that our method scales well using common commercial data center with medium-to-low inter-host bandwidth.
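Why ring attention can be "lossless exact" is easiest to see in a single-process simulation. In the sketch below each simulated rank keeps its query shard while key/value shards rotate around the ring (the pass-KV pattern); partial results are merged with a running-max softmax so the final output matches full attention. This is an illustrative simplification with assumed names: it is non-causal, omits the usual 1/sqrt(d) scaling, and models no real communication or load balancing.

```python
import math


def full_attention(q, k, v):
    """Reference: plain softmax attention over lists of vectors (non-causal, unscaled)."""
    out = []
    dim = len(v[0])
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total for d in range(dim)])
    return out


def ring_attention_pass_kv(q_shards, k_shards, v_shards):
    """Simulate pass-KV: queries stay put, KV shards visit every rank over n ring steps."""
    n = len(q_shards)
    dim = len(v_shards[0][0])
    # Per-query running state: (max score so far, softmax denominator, weighted value sum).
    state = [[(-math.inf, 0.0, [0.0] * dim) for _ in shard] for shard in q_shards]
    for step in range(n):
        for rank in range(n):
            src = (rank - step) % n  # index of the KV shard this rank holds at this step
            for i, qi in enumerate(q_shards[rank]):
                top, denom, acc = state[rank][i]
                for kj, vj in zip(k_shards[src], v_shards[src]):
                    score = sum(a * b for a, b in zip(qi, kj))
                    new_top = max(top, score)
                    rescale = math.exp(top - new_top)  # exp(-inf) == 0.0 on the first key
                    weight = math.exp(score - new_top)
                    denom = denom * rescale + weight
                    acc = [a * rescale + weight * x for a, x in zip(acc, vj)]
                    top = new_top
                state[rank][i] = (top, denom, acc)
    return [[[a / denom for a in acc] for (_, denom, acc) in shard] for shard in state]
```

Because the rescaling keeps the accumulated numerator and denominator consistent, the order in which KV shards arrive does not change the result beyond floating-point rounding.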
[AI-59] Eurekaverse: Environment Curriculum Generation via Large Language Models
Link: https://arxiv.org/abs/2411.01775
Authors: William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Dinesh Jayaraman, Yecheng Jason Ma
Keywords: Recent work, work has demonstrated, promising strategy, strategy for teaching, wide range
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Conference on Robot Learning (CoRL), 2024. Project website and code: this https URL
Abstract:Recent work has demonstrated that a promising strategy for teaching robots a wide range of complex skills is by training them on a curriculum of progressively more challenging environments. However, developing an effective curriculum of environment distributions currently requires significant expertise, which must be repeated for every new domain. Our key insight is that environments are often naturally represented as code. Thus, we probe whether effective environment curriculum design can be achieved and automated via code generation by large language models (LLM). In this paper, we introduce Eurekaverse, an unsupervised environment design algorithm that uses LLMs to sample progressively more challenging, diverse, and learnable environments for skill training. We validate Eurekaverse’s effectiveness in the domain of quadrupedal parkour learning, in which a quadruped robot must traverse through a variety of obstacle courses. The automatic curriculum designed by Eurekaverse enables gradual learning of complex parkour skills in simulation and can successfully transfer to the real-world, outperforming manual training courses designed by humans.
[AI-60] Mitigating Spurious Correlations via Disagreement Probability
Link: https://arxiv.org/abs/2411.01757
Authors: Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee
Keywords: data groups lacking, groups lacking spurious, empirical risk minimization, spurious correlations, bias labels
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Models trained with empirical risk minimization (ERM) are prone to be biased towards spurious correlations between target labels and bias attributes, which leads to poor performance on data groups lacking spurious correlations. It is particularly challenging to address this problem when access to bias labels is not permitted. To mitigate the effect of spurious correlations without bias labels, we first introduce a novel training objective designed to robustly enhance model performance across all data samples, irrespective of the presence of spurious correlations. From this objective, we then derive a debiasing method, Disagreement Probability based Resampling for debiasing (DPR), which does not require bias labels. DPR leverages the disagreement between the target label and the prediction of a biased model to identify bias-conflicting samples (those without spurious correlations) and upsamples them according to the disagreement probability. Empirical evaluations on multiple benchmarks demonstrate that DPR achieves state-of-the-art performance over existing baselines that do not use bias labels. Furthermore, we provide a theoretical analysis that details how DPR reduces dependency on spurious correlations.
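One plausible reading of the resampling step is sketched below: the disagreement probability of a sample is taken as one minus the probability a biased model assigns to the true label, and samples are drawn in proportion to it. The exact weighting used by DPR is not given in the abstract, so the proportional rule, the `eps` smoothing term, and the function names are assumptions.

```python
import random


def disagreement_probabilities(biased_probs, labels):
    """1 - p_biased(y_i | x_i): how likely the biased model is to disagree with the label."""
    return [1.0 - probs[label] for probs, label in zip(biased_probs, labels)]


def sampling_weights(disagreement, eps=1e-8):
    """Normalize disagreement probabilities into a sampling distribution."""
    total = sum(d + eps for d in disagreement)
    return [(d + eps) / total for d in disagreement]


def resample_indices(weights, k, seed=0):
    """Draw k training indices with replacement, favoring bias-conflicting samples."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)
```

Samples the biased model gets wrong are the ones least explained by the spurious attribute, so they are drawn more often when training the debiased model.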
[AI-61] xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Link: https://arxiv.org/abs/2411.01738
Authors: Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, Jiannan Wang
Keywords: Diffusion models, Diffusion Transformers, generating high-quality images, images and videos, Parallel
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion models are pivotal for generating high-quality images and videos. Inspired by the success of OpenAI’s Sora, the backbone of diffusion models is evolving from U-Net to Transformer, known as Diffusion Transformers (DiTs). However, generating high-quality content necessitates longer sequence lengths, exponentially increasing the computation required for the attention mechanism, and escalating DiTs inference latency. Parallel inference is essential for real-time DiTs deployments, but relying on a single parallel method is impractical due to poor scalability at large scales. This paper introduces xDiT, a comprehensive parallel inference engine for DiTs. After thoroughly investigating existing DiTs parallel approaches, xDiT chooses Sequence Parallel (SP) and PipeFusion, a novel Patch-level Pipeline Parallel method, as intra-image parallel strategies, alongside CFG parallel for inter-image parallelism. xDiT can flexibly combine these parallel approaches in a hybrid manner, offering a robust and scalable solution. Experimental results on two 8xL40 GPUs (PCIe) nodes interconnected by Ethernet and an 8xA100 (NVLink) node showcase xDiT’s exceptional scalability across five state-of-the-art DiTs. Notably, we are the first to demonstrate DiTs scalability on Ethernet-connected GPU clusters. xDiT is available at this https URL.
[AI-62] Large-Scale Multi-Robot Coverage Path Planning on Grids with Path Deconfliction
Link: https://arxiv.org/abs/2411.01707
Authors: Jingtao Tang, Zining Mao, Hang Ma
Keywords: Coverage Path Planning, Spanning Tree Coverage, Path Planning, compute coverage trees, study Multi-Robot Coverage
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Submitted to T-RO
Abstract:We study Multi-Robot Coverage Path Planning (MCPP) on a 4-neighbor 2D grid G, which aims to compute paths for multiple robots to cover all cells of G. Traditional approaches are limited as they first compute coverage trees on a quadrant coarsened grid H and then employ the Spanning Tree Coverage (STC) paradigm to generate paths on G, making them inapplicable to grids with partially obstructed 2x2 blocks. To address this limitation, we reformulate the problem directly on G, revolutionizing grid-based MCPP solving and establishing new NP-hardness results. We introduce Extended-STC (ESTC), a novel paradigm that extends STC to ensure complete coverage with bounded suboptimality, even when H includes partially obstructed blocks. Furthermore, we present LS-MCPP, a new algorithmic framework that integrates ESTC with three novel types of neighborhood operators within a local search strategy to optimize coverage paths directly on G. Unlike prior grid-based MCPP work, our approach also incorporates a versatile post-processing procedure that applies Multi-Agent Path Finding (MAPF) techniques to MCPP for the first time, enabling a fusion of these two important fields in multi-robot coordination. This procedure effectively resolves inter-robot conflicts and accommodates turning costs by solving a MAPF variant, making our MCPP solutions more practical for real-world applications. Extensive experiments demonstrate that our approach significantly improves solution quality and efficiency, managing up to 100 robots on grids as large as 256x256 within minutes of runtime. Validation with physical robots confirms the feasibility of our solutions under real-world conditions.
[AI-63] Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations
Link: https://arxiv.org/abs/2411.01661
Authors: Quoc-Huy Trinh, Minh-Van Nguyen, Trong-Hieu Nguyen Mau, Khoa Tran, Thanh Do
Keywords: human entertainment, cherished forms, forms of human, Singing, beautiful song requires
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and aligns well with the song instruments and genre. With advancements in deep learning, previous research has focused on generating suitable accompaniments but often lacks precise alignment with the desired instrumentation and genre. To address this, we propose a straightforward method that enables control over the accompaniment through text prompts, allowing the generation of music that complements the vocals and aligns with the song instrumental and genre requirements. Through extensive experiments, we successfully generate 10-second accompaniments using vocal input and text control.
[AI-64] Optimizing Gastrointestinal Diagnostics: A CNN-Based Model for VCE Image Classification
Link: https://arxiv.org/abs/2411.01652
Authors: Vaneeta Ahlawat, Rohit Sharma, Urush
Keywords: video capsule endoscopy, high-tech video capsule, Capsule Vision Challenge, recent years, diagnosis of gastrointestinal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures
Abstract:In recent years, the diagnosis of gastrointestinal (GI) diseases has advanced greatly with the advent of high-tech video capsule endoscopy (VCE) technology, which allows for non-invasive observation of the digestive system. The MisaHub Capsule Vision Challenge encourages the development of vendor-independent artificial intelligence models that can autonomously classify GI anomalies from VCE images. This paper presents a CNN architecture designed specifically for multiclass classification of ten gut pathologies, including angioectasia, bleeding, erosion, erythema, foreign bodies, lymphangiectasia, polyps, ulcers, and worms, as well as their normal state.
[AI-65] Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation
Link: https://arxiv.org/abs/2411.01647
Authors: Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu, Zhenwei Zhang
Keywords: surgical planning, healthcare industry, education and training, Simulation Video Generator, profound impact
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Medical video generation models are expected to have a profound impact on the healthcare industry, including but not limited to medical education and training, surgical planning, and simulation. Current video diffusion models typically build on image diffusion architecture by incorporating temporal operations (such as 3D convolution and temporal attention). Although this approach is effective, its oversimplification limits spatio-temporal performance and consumes substantial computational resources. To counter this, we propose Medical Simulation Video Generator (MedSora), which incorporates three key elements: i) a video diffusion framework integrates the advantages of attention and Mamba, balancing low computational load with high-quality video generation, ii) an optical flow representation alignment method that implicitly enhances attention to inter-frame pixels, and iii) a video variational autoencoder (VAE) with frequency compensation addresses the information loss of medical features that occurs when transforming pixel space into latent features and then back to pixel frames. Extensive experiments and applications demonstrate that MedSora exhibits superior visual quality in generating medical videos, outperforming the most advanced baseline methods. Further results and code are available at this https URL
[AI-66] Enriching Tabular Data with Contextual LLM Embeddings: A Comprehensive Ablation Study for Ensemble Classifiers
Link: https://arxiv.org/abs/2411.01645
Authors: Gjergji Kasneci, Enkelejda Kasneci
Keywords: data classification tasks, optimizing machine learning, classification tasks, engineering is crucial, crucial for optimizing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Feature engineering is crucial for optimizing machine learning model performance, particularly in tabular data classification tasks. Leveraging advancements in natural language processing, this study presents a systematic approach to enrich tabular datasets with features derived from large language model embeddings. Through a comprehensive ablation study on diverse datasets, we assess the impact of RoBERTa and GPT-2 embeddings on ensemble classifiers, including Random Forest, XGBoost, and CatBoost. Results indicate that integrating embeddings with traditional numerical and categorical features often enhances predictive performance, especially on datasets with class imbalance or limited features and samples, such as UCI Adult, Heart Disease, Titanic, and Pima Indian Diabetes, with improvements particularly notable in XGBoost and CatBoost classifiers. Additionally, feature importance analysis reveals that LLM-derived features frequently rank among the most impactful for the predictions. This study provides a structured approach to embedding-based feature enrichment and illustrates its benefits in ensemble learning for tabular data.
[AI-67] Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework
Link: https://arxiv.org/abs/2411.01639
Authors: Neel P. Bhatt, Yunhao Yang, Rohan Siva, Daniel Milan, Ufuk Topcu, Zhangyang Wang
Keywords: Multimodal foundation models, Multimodal foundation, generate actionable plans, foundation models offer, processing sensory inputs
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Fine-tuned models, code, and datasets are available at this https URL
Abstract:Multimodal foundation models offer a promising framework for robotic perception and planning by processing sensory inputs to generate actionable plans. However, addressing uncertainty in both perception (sensory interpretation) and decision-making (plan generation) remains a critical challenge for ensuring task reliability. We present a comprehensive framework to disentangle, quantify, and mitigate these two forms of uncertainty. We first introduce a framework for uncertainty disentanglement, isolating perception uncertainty arising from limitations in visual understanding and decision uncertainty relating to the robustness of generated plans. To quantify each type of uncertainty, we propose methods tailored to the unique properties of perception and decision-making: we use conformal prediction to calibrate perception uncertainty and introduce Formal-Methods-Driven Prediction (FMDP) to quantify decision uncertainty, leveraging formal verification techniques for theoretical guarantees. Building on this quantification, we implement two targeted intervention mechanisms: an active sensing process that dynamically re-observes high-uncertainty scenes to enhance visual input quality and an automated refinement procedure that fine-tunes the model on high-certainty data, improving its capability to meet task specifications. Empirical validation in real-world and simulated robotic tasks demonstrates that our uncertainty disentanglement framework reduces variability by up to 40% and enhances task success rates by 5% compared to baselines. These improvements are attributed to the combined effect of both interventions and highlight the importance of uncertainty disentanglement which facilitates targeted interventions that enhance the robustness and reliability of autonomous systems. 
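The conformal prediction step used for perception uncertainty can be illustrated with standard split conformal classification. The sketch below uses the common nonconformity score 1 - p(true class) and the usual finite-sample quantile; whether the paper uses this particular score is an assumption, and the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every class must be included
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, threshold):
    """All classes whose nonconformity score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(class_probs) if 1.0 - p <= threshold]
```

Under exchangeability of calibration and test data, sets built this way contain the true class with probability at least 1 - alpha; a large set is itself a signal of high perception uncertainty, which is the kind of case an active re-observation step could target.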
[AI-68] FilterNet: Harnessing Frequency Filters for Time Series Forecasting NEURIPS2024
Link: https://arxiv.org/abs/2411.01623
Authors: Kun Yi, Jingru Fei, Qi Zhang, Hui He, Shufeng Hao, Defu Lian, Wei Fan
Keywords: Transformer-based models, time series, time series forecasting, series forecasting, series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Accepted by NeurIPS 2024
Abstract:While numerous forecasters have been proposed using different network architectures, the Transformer-based models have state-of-the-art performance in time series forecasting. However, forecasters based on Transformers are still suffering from vulnerability to high-frequency signals, efficiency in computation, and bottleneck in full-spectrum utilization, which essentially are the cornerstones for accurately predicting time series with thousands of points. In this paper, we explore a novel perspective of enlightening signal processing for deep time series forecasting. Inspired by the filtering process, we introduce one simple yet effective network, namely FilterNet, built upon our proposed learnable frequency filters to extract key informative temporal patterns by selectively passing or attenuating certain components of time series signals. Concretely, we propose two kinds of learnable filters in the FilterNet: (i) Plain shaping filter, that adopts a universal frequency kernel for signal filtering and temporal modeling; (ii) Contextual shaping filter, that utilizes filtered frequencies examined in terms of its compatibility with input signals for dependency learning. Equipped with the two filters, FilterNet can approximately surrogate the linear and attention mappings widely adopted in time series literature, while enjoying superb abilities in handling high-frequency noises and utilizing the whole frequency spectrum that is beneficial for forecasting. Finally, we conduct extensive experiments on eight time series forecasting benchmarks, and experimental results have demonstrated our superior performance in terms of both effectiveness and efficiency compared with state-of-the-art methods. Code is available at this repository: this https URL.
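The basic operation behind a frequency filter, transforming to the frequency domain, multiplying each bin by a kernel weight, and transforming back, can be sketched without any dependencies. The kernel here is fixed by hand, whereas FilterNet learns it; the naive O(n^2) DFT and the function names are choices made for illustration only.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_filter(x, kernel):
    """Scale each frequency bin by its kernel weight and return the real part of the result.

    For a real-valued output the kernel should be symmetric: kernel[k] == kernel[n - k].
    """
    filtered = [xk * hk for xk, hk in zip(dft(x), kernel)]
    return [value.real for value in idft(filtered)]
```

A kernel of ones on the low bins and zeros elsewhere acts as a low-pass filter; letting the weights be trainable parameters gives the "selectively passing or attenuating" behavior the abstract describes.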
[AI-69] VQ-Map: Birds-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization
Link: https://arxiv.org/abs/2411.01618
Authors: Yiwei Zhang, Jin Gao, Fudong Ge, Guan Luo, Bing Li, Zhaoxiang Zhang, Haibin Ling, Weiming Hu
Keywords: map layout estimation, BEV semantic maps, BEV map layout, BEV, layout estimation requires
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: Bird’s-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car to make the results coherent and realistic. Due to the challenges posed by occlusion, unfavourable imaging conditions and low resolution, generating the BEV semantic maps corresponding to corrupted or invalid areas in the perspective view (PV) is appealing very recently. The question is how to align the PV features with the generative models to facilitate the map estimation. In this paper, we propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space. Thanks to the obtained BEV tokens accompanied with a codebook embedding encapsulating the semantics for different BEV elements in the groundtruth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens from the discrete representation learning based on a specialized token decoder module, and finally generate high-quality BEV maps with the BEV codebook embedding serving as a bridge between PV and BEV. We evaluate the BEV map layout estimation performance of our model, termed VQ-Map, on both the nuScenes and Argoverse benchmarks, achieving 62.2/47.6 mean IoU for surround-view/monocular evaluation on nuScenes, as well as 73.4 IoU for monocular evaluation on Argoverse, which all set a new record for this map layout estimation task. The code and models are available on this https URL.
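Editor's sketch: the "tokenized discrete space" in VQ-VAE-style models comes from replacing each continuous feature vector with the index of its nearest codebook entry. The snippet below shows only that generic lookup step, under the assumption of a tiny hand-written 2-D codebook; it is not the paper's token decoder, and the names `quantize`, `codebook`, and `features` are illustrative.

```python
def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))
        for vec in features
    ]


# Hypothetical 2-D codebook with three entries (a real codebook would be learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(features, codebook)           # discrete token indices
reconstructed = [codebook[i] for i in tokens]   # embeddings looked up from tokens
```

Here `tokens` is `[0, 1, 2]`: each feature snaps to the closest entry, and downstream modules can work with the indices or with the looked-up embeddings.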
[AI-70] Stochastic Communication Avoidance for Recommendation Systems
Link: https://arxiv.org/abs/2411.01611
Authors: Lutfi Eren Erdogan, Vijay Anand Raghava Kanakagiri, Kurt Keutzer, Zhen Dong
Keywords: network based recommendation, neural network based, based recommendation systems, large embedding tables, network based
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:One of the major bottlenecks for efficient deployment of neural network based recommendation systems is the memory footprint of their embedding tables. Although many neural network based recommendation systems could benefit from the faster on-chip memory access and increased computational power of hardware accelerators, the large embedding tables in these models often cannot fit on the constrained memory of accelerators. Despite the pervasiveness of these models, prior methods in memory optimization and parallelism fail to address the memory and communication costs of large embedding tables on accelerators. As a result, the majority of models are trained on CPUs, while current implementations of accelerators are hindered by issues such as bottlenecks in inter-device communication and main memory lookups. In this paper, we propose a theoretical framework that analyses the communication costs of arbitrary distributed systems that use lookup tables. We use this framework to propose algorithms that maximize throughput subject to memory, computation, and communication constraints. Furthermore, we demonstrate that our method achieves strong theoretical performance across dataset distributions and memory constraints, applicable to a wide range of use cases from mobile federated learning to warehouse-scale computation. We implement our framework and algorithms in PyTorch and achieve up to 6x increases in training throughput on GPU systems over baselines, on the Criteo Terabytes dataset.
[AI-71] GITSR: Graph Interaction Transformer-based Scene Representation for Multi Vehicle Collaborative Decision-making
Link: https://arxiv.org/abs/2411.01608
Authors: Xingyu Hu, Lijun Zhang, Dejian Meng, Ye Han, Lisha Yuan
Keywords: Transformer-based Scene Representation, Interaction Transformer-based Scene, Graph Interaction Transformer-based, intelligent transportation system, Scene Representation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Abstract:In this study, we propose GITSR, an effective framework for Graph Interaction Transformer-based Scene Representation for multi-vehicle collaborative decision-making in intelligent transportation system. In the context of mixed traffic where Connected Automated Vehicles (CAVs) and Human Driving Vehicles (HDVs) coexist, in order to enhance the understanding of the environment by CAVs to improve decision-making capabilities, this framework focuses on efficient scene representation and the modeling of spatial interaction behaviors of traffic states. We first extract features of the driving environment based on the background of intelligent networking. Subsequently, the local scene representation, which is based on the agent-centric and dynamic occupation grid, is calculated by the Transformer module. Besides, feasible region of the map is captured through the multi-head attention mechanism to reduce the collision of vehicles. Notably, spatial interaction behaviors, based on motion information, are modeled as graph structures and extracted via Graph Neural Network (GNN). Ultimately, the collaborative decision-making among multiple vehicles is formulated as a Markov Decision Process (MDP), with driving actions output by Reinforcement Learning (RL) algorithms. Our algorithmic validation is executed within the extremely challenging scenario of highway off-ramp task, thereby substantiating the superiority of agent-centric approach to scene representation. Simulation results demonstrate that the GITSR method can not only effectively capture scene representation but also extract spatial interaction data, outperforming the baseline method across various comparative metrics.
[AI-72] Large Language Model Supply Chain: Open Problems From the Security Perspective
Link: https://arxiv.org/abs/2411.01604
Authors: Qiang Hu, Xiaofei Xie, Sen Chen, Lei Ma
Keywords: Large Language Model, Large Language, software development paradigm, gained huge attention, LLM
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Large Language Model (LLM) is changing the software development paradigm and has gained huge attention from both academia and industry. Researchers and developers collaboratively explore how to leverage the powerful problem-solving ability of LLMs for specific domain tasks. Due to the wide usage of LLM-based applications, e.g., ChatGPT, multiple works have been proposed to ensure the security of LLM systems. However, a comprehensive understanding of the entire processes of LLM system construction (the LLM supply chain) is crucial but relevant works are limited. More importantly, the security issues hidden in the LLM SC which could highly impact the reliable usage of LLMs are lack of exploration. Existing works mainly focus on assuring the quality of LLM from the model level, security assurance for the entire LLM SC is ignored. In this work, we take the first step to discuss the potential security risks in each component as well as the integration between components of LLM SC. We summarize 12 security-related risks and provide promising guidance to help build safer LLM systems. We hope our work can facilitate the evolution of artificial general intelligence with secure LLM ecosystems.
[AI-73] DreamPolish: Domain Score Distillation With Progressive Geometry Generation
Link: https://arxiv.org/abs/2411.01602
Authors: Yean Cheng, Ziqi Cai, Ming Ding, Wendi Zheng, Shiyu Huang, Yuxiao Dong, Jie Tang, Boxin Shi
Keywords: producing refined geometry, excels in producing, producing refined, refined geometry, introduce DreamPolish
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract: We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying field-of-views. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key topic of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.
[AI-74] OSAD: Open-Set Aircraft Detection in SAR Images
Link: https://arxiv.org/abs/2411.01597
Authors: Xiayang Xiao, Zhuoxuan Li, Haipeng Wang
Keywords: Current mainstream SAR, Current mainstream, mainstream SAR image, unknown objects, SAR image object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 11 figures. This work was submitted to the IEEE for possible publication in March 2024
Abstract:Current mainstream SAR image object detection methods still lack robustness when dealing with unknown objects in open environments. Open-set detection aims to enable detectors trained on a closed set to detect all known objects and identify unknown objects in open-set environments. The key challenges are how to improve the generalization to potential unknown objects and reduce the empirical classification risk of known categories under strong supervision. To address these challenges, a novel open-set aircraft detector for SAR images is proposed, named Open-Set Aircraft Detection (OSAD), which is equipped with three dedicated components: global context modeling (GCM), location quality-driven pseudo labeling generation (LPG), and prototype contrastive learning (PCL). GCM effectively enhances the network’s representation of objects by attention maps which is formed through the capture of long sequential positional relationships. LPG leverages clues about object positions and shapes to optimize localization quality, avoiding overfitting to known category information and enhancing generalization to potential unknown objects. PCL employs prototype-based contrastive encoding loss to promote instance-level intra-class compactness and inter-class variance, aiming to minimize the overlap between known and unknown distributions and reduce the empirical classification risk of known categories. Extensive experiments have demonstrated that the proposed method can effectively detect unknown objects and exhibit competitive performance without compromising closed-set performance. The highest absolute gain which ranges from 0 to 18.36% can be achieved on the average precision of unknown objects.
[AI-75] RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering
Link: https://arxiv.org/abs/2411.01595
Authors: Hui Lin, Danfeng Hong, Shuhang Ge, Chuyao Luo, Kai Jiang, Hao Jin, Congcong Wen
Keywords: Sensing Image Captioning, Remote Sensing Image, Image Captioning, Remote Sensing, remote sensing domain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Remote Sensing Image Captioning (RSIC) presents unique challenges and plays a critical role in applications. Traditional RSIC methods often struggle to produce rich and diverse descriptions. Recently, with advancements in VLMs, efforts have emerged to integrate these models into the remote sensing domain and to introduce descriptive datasets specifically designed to enhance VLM training. This paper proposes RS-MoE, a first Mixture of Expert based VLM specifically customized for remote sensing domain. Unlike traditional MoE models, the core of RS-MoE is the MoE Block, which incorporates a novel Instruction Router and multiple lightweight Large Language Models (LLMs) as expert models. The Instruction Router is designed to generate specific prompts tailored for each corresponding LLM, guiding them to focus on distinct aspects of the RSIC task. This design not only allows each expert LLM to concentrate on a specific subset of the task, thereby enhancing the specificity and accuracy of the generated captions, but also improves the scalability of the model by facilitating parallel processing of sub-tasks. Additionally, we present a two-stage training strategy for tuning our RS-MoE model to prevent performance degradation due to sparsity. We fine-tuned our model on the RSICap dataset using our proposed training strategy. Experimental results on the RSICap dataset, along with evaluations on other traditional datasets where no additional fine-tuning was applied, demonstrate that our model achieves state-of-the-art performance in generating precise and contextually relevant captions. Notably, our RS-MoE-1B variant achieves performance comparable to 13B VLMs, demonstrating the efficiency of our model design. Moreover, our model demonstrates promising generalization capabilities by consistently achieving state-of-the-art performance on the Remote Sensing Visual Question Answering (RSVQA) task.
[AI-76] Trustworthy Federated Learning: Privacy, Security, and Beyond
Link: https://arxiv.org/abs/2411.01583
Authors: Chunlu Chen, Ji Liu, Haowen Tan, Xingjian Li, Kevin I-Kai Wang, Peng Li, Kouichi Sakurai, Dejing Dou
Keywords: Artificial Intelligence, safeguard data privacy, recent years, years have witnessed, witnessed the advancement
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 32 pages, to appear in KAIS
Abstract:While recent years have witnessed the advancement in big data and Artificial Intelligence (AI), it is of much importance to safeguard data privacy and security. As an innovative approach, Federated Learning (FL) addresses these concerns by facilitating collaborative model training across distributed data sources without transferring raw data. However, the challenges of robust security and privacy across decentralized networks catch significant attention in dealing with the distributed data in FL. In this paper, we conduct an extensive survey of the security and privacy issues prevalent in FL, underscoring the vulnerability of communication links and the potential for cyber threats. We delve into various defensive strategies to mitigate these risks, explore the applications of FL across different sectors, and propose research directions. We identify the intricate security challenges that arise within the FL frameworks, aiming to contribute to the development of secure and efficient FL systems.
[AI-77] Flexible Coded Distributed Convolution Computing for Enhanced Fault Tolerance and Numerical Stability in Distributed CNNs
Link: https://arxiv.org/abs/2411.01579
Authors: Shuo Tan, Rui Liu, XianLei Long, Kai Wan, Linqi Song, Yong Li
Keywords: Deploying Convolutional Neural, Convolutional Neural Networks, Deploying Convolutional, Neural Networks, Convolutional Neural
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures
Abstract:Deploying Convolutional Neural Networks (CNNs) on resource-constrained devices necessitates efficient management of computational resources, often via distributed systems susceptible to latency from straggler nodes. This paper introduces the Flexible Coded Distributed Convolution Computing (FCDCC) framework to enhance fault tolerance and numerical stability in distributed CNNs. We extend Coded Distributed Computing (CDC) with Circulant and Rotation Matrix Embedding (CRME) which was originally proposed for matrix multiplication to high-dimensional tensor convolution. For the proposed scheme, referred to as Numerically Stable Coded Tensor Convolution (NSCTC) scheme, we also propose two new coded partitioning schemes: Adaptive-Padding Coded Partitioning (APCP) for input tensor and Kernel-Channel Coded Partitioning (KCCP) for filter tensor. These strategies enable linear decomposition of tensor convolutions and encoding them into CDC sub-tasks, combining model parallelism with coded redundancy for robust and efficient execution. Theoretical analysis identifies an optimal trade-off between communication and storage costs. Empirical results validate the framework’s effectiveness in computational efficiency, fault tolerance, and scalability across various CNN architectures.
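Editor's sketch: coded distributed computing tolerates stragglers by exploiting linearity. Because convolution is linear in its input, a worker that convolves the sum of two input parts returns the sum of their individual results, so any two of three workers suffice. The toy below uses a simple sum-parity code on 1-D signals. It is not the CRME-based NSCTC scheme from the paper, and all names (`convolve`, `encode`, `decode`) are assumptions for illustration.

```python
def convolve(signal, kernel):
    """Full 1-D discrete convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def encode(part_a, part_b):
    """Create three worker inputs: the two parts plus their elementwise sum."""
    parity = [a + b for a, b in zip(part_a, part_b)]
    return [part_a, part_b, parity]


def decode(results):
    """Recover conv(part_a) and conv(part_b) from any two worker results.

    `results` maps worker index (0, 1, or 2) to that worker's output.
    """
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        return results[0], results[1]
    if 0 in results:
        # conv(b) = conv(a + b) - conv(a), by linearity
        return results[0], [p - a for p, a in zip(results[2], results[0])]
    # conv(a) = conv(a + b) - conv(b), by linearity
    return [p - b for p, b in zip(results[2], results[1])], results[1]


kernel = [1.0, -1.0]
part_a, part_b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
inputs = encode(part_a, part_b)

# Pretend worker 0 is a straggler: only workers 1 and 2 report back.
results = {i: convolve(inputs[i], kernel) for i in (1, 2)}
conv_a, conv_b = decode(results)
```

The recovered `conv_a` equals what worker 0 would have produced. Subtraction-based parity like this is only a minimal instance of coded redundancy and does not address the numerical-stability concerns the paper targets.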
[AI-78] DELE: Deductive $\mathcal{EL}^{++}$ Embeddings for Knowledge Base Completion
Link: https://arxiv.org/abs/2411.01574
Authors: Olga Mashkova, Fernando Zhapa-Camacho, Robert Hoehndorf
Keywords: embeddings map classes, mathbb, map classes, similarity between entities, Description Logic
Categories: Artificial Intelligence (cs.AI)
Comments: Extended version of the paper “Enhancing Geometric Ontology Embeddings for EL++ with Negative Sampling and Deductive Closure Filtering” presented at NeSy 2024 conference
Abstract: Ontology embeddings map classes, relations, and individuals in ontologies into $\mathbb{R}^n$, and within $\mathbb{R}^n$ similarity between entities can be computed or new axioms inferred. For ontologies in the Description Logic $\mathcal{EL}^{++}$, several embedding methods have been developed that explicitly generate models of an ontology. However, these methods suffer from some limitations; they do not distinguish between statements that are unprovable and provably false, and therefore they may use entailed statements as negatives. Furthermore, they do not utilize the deductive closure of an ontology to identify statements that are inferred but not asserted. We evaluated a set of embedding methods for $\mathcal{EL}^{++}$ ontologies, incorporating several modifications that aim to make use of the ontology deductive closure. In particular, we designed novel negative losses that account both for the deductive closure and different types of negatives and formulated evaluation methods for knowledge base completion. We demonstrate that our embedding methods improve over the baseline ontology embedding in the task of knowledge base or ontology completion.
[AI-79] Learning to Construct Implicit Communication Channel
Link: https://arxiv.org/abs/2411.01553
Authors: Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang
Keywords: collaborative multi-agent systems, multi-agent systems, essential component, component in collaborative, collaborative multi-agent
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures
Abstract:Effective communication is an essential component in collaborative multi-agent systems. Situations where explicit messaging is not feasible have been common in human society throughout history, which motivate the study of implicit communication. Previous works on learning implicit communication mostly rely on theory of mind (ToM), where agents infer the mental states and intentions of others by interpreting their actions. However, ToM-based methods become less effective in making accurate inferences in complex tasks. In this work, we propose the Implicit Channel Protocol (ICP) framework, which allows agents to construct implicit communication channels similar to the explicit ones. ICP leverages a subset of actions, denoted as the scouting actions, and a mapping between information and these scouting actions that encodes and decodes the messages. We propose training algorithms for agents to message and act, including learning with a randomly initialized information map and with a delayed information map. The efficacy of ICP has been tested on the tasks of Guessing Number, Revealing Goals, and Hanabi, where ICP significantly outperforms baseline methods through more efficient information transmission.
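Editor's sketch: the core of an implicit channel is a shared mapping between pieces of information and a reserved subset of ordinary actions, so that taking an action doubles as sending a message. The toy below uses a fixed, hand-written mapping. The paper learns its mapping during training, and the action and message names here are hypothetical.

```python
def build_channel(messages, scouting_actions):
    """Pair each message with one scouting action; return (encoder, decoder) dicts."""
    if len(messages) > len(scouting_actions):
        raise ValueError("not enough scouting actions to encode every message")
    encoder = dict(zip(messages, scouting_actions))
    decoder = {action: message for message, action in encoder.items()}
    return encoder, decoder


# Hypothetical action and message sets for illustration only.
scouting_actions = ["discard_slot_0", "discard_slot_1", "discard_slot_2"]
messages = ["goal_A", "goal_B", "goal_C"]

encoder, decoder = build_channel(messages, scouting_actions)

# Sender wants to convey "goal_B", so it plays the paired scouting action.
sent_action = encoder["goal_B"]

# Receiver observes the action and decodes it; regular actions carry no message.
received = decoder.get(sent_action)
ignored = decoder.get("play_card")
```

Both agents must hold the same mapping for decoding to work, which is why the abstract discusses how the information map is initialized and when it is shared during training.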
[AI-80] Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction NEURIPS2024
Link: https://arxiv.org/abs/2411.01535
Authors: Haotong Du, Quanming Yao, Juzheng Zhang, Yang Liu, Zhen Wang
Keywords: predicting drug-drug interactions, effective and interpretable, interpretable in predicting, predicting drug-drug, essential for medical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: Accepted by NeurIPS 2024
Abstract:Subgraph-based methods have proven to be effective and interpretable in predicting drug-drug interactions (DDIs), which are essential for medical practice and drug development. Subgraph selection and encoding are critical stages in these methods, yet customizing these components remains underexplored due to the high cost of manual adjustments. In this study, inspired by the success of neural architecture search (NAS), we propose a method to search for data-specific components within subgraph-based frameworks. Specifically, we introduce extensive subgraph selection and encoding spaces that account for the diverse contexts of drug interactions in DDI prediction. To address the challenge of large search spaces and high sampling costs, we design a relaxation mechanism that uses an approximation strategy to efficiently explore optimal subgraph configurations. This approach allows for robust exploration of the search space. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, with the discovered subgraphs and encoding functions highlighting the model’s adaptability.
[AI-81] Diversity Progress for Goal Selection in Discriminability-Motivated RL NEURIPS2024
Link: https://arxiv.org/abs/2411.01521
Authors: Erik M. Lintunen, Nadia M. Ady, Christian Guckelsberger
Keywords: Non-uniform goal selection, uniform-random selection, Non-uniform goal, potential to improve, improve the reinforcement
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages including appendices, full-track paper at the Intrinsically Motivated Open-ended Learning workshop at NeurIPS 2024
Abstract:Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: “Diversity Progress” (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent’s certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution – a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
[AI-82] FaceDig: Automated tool for placing landmarks on facial portraits for geometric morphometrics users
Link: https://arxiv.org/abs/2411.01508
Authors: Karel Kleisner, Jaroslav Trnka, Petr Turecek
Keywords: in-depth morphological analysis, enabling the quantification, biological shapes, morphological analysis, digitization is essential
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 2 figures
Abstract:Landmark digitization is essential in geometric morphometrics, enabling the quantification of biological shapes, such as facial structures, for in-depth morphological analysis. Traditional landmarking, which identifies specific anatomical points, can be complemented by semilandmarks when precise locations are challenging to define. However, manual placement of numerous landmarks is time-consuming and prone to human error, leading to inconsistencies across studies. To address this, we introduce FaceDig, an AI-powered tool designed to automate landmark placement with human-level precision, focusing on anatomically sound facial points. FaceDig is open-source and integrates seamlessly with analytical platforms like R and Python. It was trained using one of the largest and most ethnically diverse face datasets, applying a landmark configuration optimized for 2D enface photographs. Our results demonstrate that FaceDig provides reliable landmark coordinates, comparable to those placed manually by experts. The tool’s output is compatible with the widely-used TpsDig2 software, facilitating adoption and ensuring consistency across studies. Users are advised to work with standardized facial images and visually inspect the results for potential corrections. Despite the growing preference for 3D morphometrics, 2D facial photographs remain valuable due to their cultural and practical significance. Future enhancements to FaceDig will include support for profile views, further expanding its utility. By offering a standardized approach to landmark placement, FaceDig promotes reproducibility in facial morphology research and provides a robust alternative to existing 2D tools.
[AI-83] Capsule Vision Challenge 2024: Multi-Class Abnormality Classification for Video Capsule Endoscopy
Link: https://arxiv.org/abs/2411.01479
Authors: Aakarsh Bansal, Bhuvanesh Singla, Raajan Rajesh Wankhade, Nagamma Patil
Keywords: video capsule endoscopy, capsule endoscopy, study presents, classifying abnormalities, abnormalities in video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:This study presents an approach to developing a model for classifying abnormalities in video capsule endoscopy (VCE) frames. Given the challenges of data imbalance, we implemented a tiered augmentation strategy using the albumentations library to enhance minority class representation. Additionally, we addressed learning complexities by progressively structuring training tasks, allowing the model to differentiate between normal and abnormal cases and then gradually adding more specific classes based on data availability. Our pipeline, developed in PyTorch, employs a flexible architecture enabling seamless adjustments to classification complexity. We tested our approach using ResNet50 and a custom ViT-CNN hybrid model, with training conducted on the Kaggle platform. This work demonstrates a scalable approach to abnormality classification in VCE.
[AI-84] Adaptive Domain Learning for Cross-domain Image Denoising NEURIPS2024
Link: https://arxiv.org/abs/2411.01472
Authors: Zian Qian, Chenyang Qi, Ka Lung Law, Hao Fu, Chenyang Lei, Qifeng Chen
Keywords: noise patterns, image denoising, domain, denoising model trained, sensor
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures, accepted by NeurIPS 2024
Abstract:Different camera sensors have different noise patterns, and thus an image denoising model trained on one sensor often does not generalize well to a different sensor. One plausible solution is to collect a large dataset for each sensor for training or fine-tuning, which is inevitably time-consuming. To address this cross-domain challenge, we present a novel adaptive domain learning (ADL) scheme for cross-domain RAW image denoising by utilizing existing data from different sensors (source domain) plus a small amount of data from the new sensor (target domain). The ADL training scheme automatically removes the data in the source domain that are harmful to fine-tuning a model for the target domain (some data are harmful as adding them during training lowers the performance due to domain gaps). Also, we introduce a modulation module to adopt sensor-specific information (sensor type and ISO) to understand input data for image denoising. We conduct extensive experiments on public datasets with various smartphone and DSLR cameras, which show our proposed model outperforms prior work on cross-domain image denoising, given a small amount of image data from the target domain sensor.
[AI-85] Two-Timescale Model Caching and Resource Allocation for Edge-Enabled AI-Generated Content Services
Link: https://arxiv.org/abs/2411.01458
Authors: Zhang Liu, Hongyang Du, Xiangwang Hou, Lianfen Huang, Seyyedali Hosseinalipour, Dusit Niyato, Khaled Ben Letaief
Keywords: personalized AI-generated content, AIGC service provisioning, edge-enabled AIGC service, transformative technology, enabling customized
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 14 pages, 8 figures, 39 references
Abstract:Generative AI (GenAI) has emerged as a transformative technology, enabling customized and personalized AI-generated content (AIGC) services. In this paper, we address challenges of edge-enabled AIGC service provisioning, which remain underexplored in the literature. These services require executing GenAI models with billions of parameters, posing significant obstacles to resource-limited wireless edge. We subsequently introduce the formulation of joint model caching and resource allocation for AIGC services to balance a trade-off between AIGC quality and latency metrics. We obtain mathematical relationships of these metrics with the computational resources required by GenAI models via experimentation. Afterward, we decompose the formulation into a model caching subproblem on a long-timescale and a resource allocation subproblem on a short-timescale. Since the variables to be solved are discrete and continuous, respectively, we leverage a double deep Q-network (DDQN) algorithm to solve the former subproblem and propose a diffusion-based deep deterministic policy gradient (D3PG) algorithm to solve the latter. The proposed D3PG algorithm makes an innovative use of diffusion models as the actor network to determine optimal resource allocation decisions. Consequently, we integrate these two learning methods within the overarching two-timescale deep reinforcement learning (T2DRL) algorithm, the performance of which is studied through comparative numerical simulations.
[AI-86] Denoising Fisher Training For Neural Implicit Samplers
Link: https://arxiv.org/abs/2411.01453
Authors: Weijian Luo, Wei Deng
Keywords: un-normalized target distributions, machine learning, un-normalized target, pivotal in scientific, scientific computing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
Comments:
Abstract:Efficient sampling from un-normalized target distributions is pivotal in scientific computing and machine learning. While neural samplers have demonstrated potential with a special emphasis on sampling efficiency, existing neural implicit samplers still have issues such as poor mode covering behavior, unstable training dynamics, and sub-optimal performances. To tackle these issues, in this paper, we introduce Denoising Fisher Training (DFT), a novel training approach for neural implicit samplers with theoretical guarantees. We frame the training problem as an objective of minimizing the Fisher divergence by deriving a tractable yet equivalent loss function, which marks a unique theoretical contribution to assessing the intractable Fisher divergences. DFT is empirically validated across diverse sampling benchmarks, including two-dimensional synthetic distribution, Bayesian logistic regression, and high-dimensional energy-based models (EBMs). Notably, in experiments with high-dimensional EBMs, our best one-step DFT neural sampler achieves results on par with MCMC methods with up to 200 sampling steps, leading to a substantially greater efficiency over 100 times higher. This result not only demonstrates the superior performance of DFT in handling complex high-dimensional sampling but also sheds light on efficient sampling methodologies across broader applications.
[AI-87] Online Relational Inference for Evolving Multi-agent Interacting Systems NEURIPS2024
Link: https://arxiv.org/abs/2411.01442
Authors: Beomseok Kang,Priyabrata Saha,Sudarshan Sharma,Biswadeep Chakraborty,Saibal Mukhopadhyay
Keywords: efficiently identify hidden, identify hidden interaction, evolving multi-agent interacting, multi-agent interacting systems, ORI employs online
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at NeurIPS 2024
Abstract:We introduce a novel framework, Online Relational Inference (ORI), designed to efficiently identify hidden interaction graphs in evolving multi-agent interacting systems using streaming data. Unlike traditional offline methods that rely on a fixed training set, ORI employs online backpropagation, updating the model with each new data point, thereby allowing it to adapt to changing environments in real-time. A key innovation is the use of an adjacency matrix as a trainable parameter, optimized through a new adaptive learning rate technique called AdaRelation, which adjusts based on the historical sensitivity of the decoder to changes in the interaction graph. Additionally, a data augmentation method named Trajectory Mirror (TM) is introduced to improve generalization by exposing the model to varied trajectory patterns. Experimental results on both synthetic datasets and real-world data (CMU MoCap for human motion) demonstrate that ORI significantly improves the accuracy and adaptability of relational inference in dynamic settings compared to existing methods. This approach is model-agnostic, enabling seamless integration with various neural relational inference (NRI) architectures, and offers a robust solution for real-time applications in complex, evolving systems.
[AI-88] SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
Link: https://arxiv.org/abs/2411.01438
Authors: Ziming Mao,Tian Xia,Zhanghao Wu,Wei-Lin Chiang,Tyler Griggs,Romil Bhardwaj,Zongheng Yang,Scott Shenker,Ion Stoica
Keywords: Recent years, years have witnessed, witnessed an explosive, explosive growth, replicas
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it timely and challenging to lower service costs and guarantee service quality. While spot instances have long been offered with a large discount, spot preemptions have discouraged users from using them to host model replicas when serving AI models. To address this, we introduce SkyServe, a system that efficiently serves AI models over a mixture of spot and on-demand replicas across regions and clouds. SkyServe intelligently spreads spot replicas across different failure domains (e.g., regions or clouds) to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We compare SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by up to 44% while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by up to 2.6x, 3.1x, 2.7x compared to other research and production systems.
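The three ideas named in the abstract (spreading spot replicas across failure domains, overprovisioning, and falling back to on-demand) can be illustrated with a toy placement planner. This is only a sketch under assumptions: the function `place_replicas`, its `overprovision` parameter, and the round-robin rule are hypothetical and are not taken from SkyServe itself.

```python
from collections import Counter

def place_replicas(target, spot_zones, overprovision=1):
    """Toy replica plan (hypothetical, not SkyServe's actual policy).

    target: number of replicas needed to serve the load.
    spot_zones: failure domains (regions or clouds) that currently have spot capacity.
    overprovision: extra spot replicas launched as a buffer against preemptions.
    """
    if not spot_zones:
        # No spot capacity anywhere: fall back entirely to on-demand replicas.
        return {"spot": {}, "on_demand": target}
    wanted = target + overprovision
    # Round-robin placement spreads replicas evenly across failure domains,
    # so a preemption wave in one zone removes only a fraction of them.
    placement = Counter(spot_zones[i % len(spot_zones)] for i in range(wanted))
    return {"spot": dict(placement), "on_demand": 0}

plan = place_replicas(4, ["aws-us-east", "gcp-us-west", "azure-eu"], overprovision=2)
print(plan)
```

A real system would also need to react to preemptions over time and mix spot with on-demand replicas; the sketch only shows a single planning step.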
[AI-89] Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
Link: https://arxiv.org/abs/2411.01431
Authors: Xiangzhong Luo,Di Liu,Hao Kong,Shuo Huai,Hui Chen,Guochu Xiong,Weichen Liu
Keywords: embedded computing systems, embedded computing, computing systems, recently achieved impressive, achieved impressive success
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ACM Transactions on Embedded Computing Systems (TECS) 2024
Abstract:Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems.
[AI-90] Learning Hidden Subgoals under Temporal Ordering Constraints in Reinforcement Learning
Link: https://arxiv.org/abs/2411.01425
Authors: Duo Xu,Faramarz Fekri
Keywords: multiple key steps, key steps, success of completing, determined by multiple, fixed time order
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:In real-world applications, the success of completing a task is often determined by multiple key steps which are distant in time steps and have to be achieved in a fixed time order. For example, the key steps listed in a cooking recipe should be achieved one by one in the right time order. These key steps can be regarded as subgoals of the task, and their time orderings are described as temporal ordering constraints. However, in many real-world problems, subgoals or key states are often hidden in the state space and their temporal ordering constraints are also unknown, which makes it challenging for previous RL algorithms to solve this kind of task. In order to address this issue, in this work we propose a novel RL algorithm for learning hidden subgoals under temporal ordering constraints (LSTOC). We propose a new contrastive learning objective which can effectively learn hidden subgoals (key states) and their temporal orderings at the same time, based on first-occupancy representation and temporal geometric sampling. In addition, we propose a sample-efficient learning strategy to discover subgoals one by one following their temporal order constraints by building a subgoal tree to represent discovered subgoals and their temporal ordering relationships. Specifically, this tree can be used to improve the sample efficiency of trajectory collection, speed up task solving, and generalize to unseen tasks. The LSTOC framework is evaluated on several environments with image-based observations, showing its significant improvement over baseline methods.
[AI-91] PSformer: Parameter-efficient Transformer with Segment Attention for Time Series Forecasting
Link: https://arxiv.org/abs/2411.01419
Authors: Yanlong Wang,Jian Xu,Fei Ma,Shao-Lun Huang,Danny Dongning Sun,Xiao-Ping Zhang
Keywords: Time series forecasting, Time series, time series segment, series forecasting remains, series forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages
Abstract:Time series forecasting remains a critical challenge across various domains, often complicated by high-dimensional data and long-term dependencies. This paper presents a novel transformer architecture for time series forecasting, incorporating two key innovations: parameter sharing (PS) and Spatial-Temporal Segment Attention (SegAtt). We also define the time series segment as the concatenation of sequence patches from the same positions across different variables. The proposed model, PSformer, reduces the number of training parameters through the parameter sharing mechanism, thereby improving model efficiency and scalability. The introduction of SegAtt could enhance the capability of capturing local spatio-temporal dependencies by computing attention over the segments, and improve global representation by integrating information across segments. The combination of parameter sharing and SegAtt significantly improves the forecasting performance. Extensive experiments on benchmark datasets demonstrate that PSformer outperforms popular baselines and other transformer-based approaches in terms of accuracy and scalability, establishing itself as an accurate and scalable tool for time series forecasting.
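The segment definition in this abstract (the concatenation of sequence patches from the same positions across different variables) can be made concrete with a small array example. This is a minimal sketch under assumptions: the function name `make_segments`, the non-overlapping patching, and the `(num_vars, seq_len)` input layout are illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def make_segments(x, patch_len):
    """Build segments from a multivariate series (illustrative sketch).

    x: array of shape (num_vars, seq_len).
    Returns an array of shape (num_patches, num_vars * patch_len), where row i
    concatenates the i-th patch of every variable.
    """
    num_vars, seq_len = x.shape
    if seq_len % patch_len != 0:
        raise ValueError("seq_len must be divisible by patch_len")
    num_patches = seq_len // patch_len
    # Split each variable into non-overlapping patches.
    patches = x.reshape(num_vars, num_patches, patch_len)
    # Put the patch index first, then flatten variables and in-patch positions.
    return patches.transpose(1, 0, 2).reshape(num_patches, num_vars * patch_len)

x = np.arange(12).reshape(2, 6)   # two variables, six time steps
print(make_segments(x, patch_len=3))
```

Attention computed over the rows of this matrix would then mix information across both variables and time positions within each segment.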
[AI-92] BF-IMNA: A Bit Fluid In-Memory Neural Architecture for Neural Network Acceleration
Link: https://arxiv.org/abs/2411.01417
Authors: Mariam Rakka,Rachid Karami,Ahmed M. Eltawil,Mohammed E. Fouda,Fadi Kurdahi
Keywords: works Neural Networks, quantization works Neural, Neural Networks, works Neural, Mixed-precision quantization works
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixed-precision quantization works for Neural Networks (NNs) are gaining traction for their efficient realization on the hardware leading to higher throughput and lower energy. In-Memory Computing (IMC) accelerator architectures are offered as alternatives to traditional architectures relying on a data-centric computational paradigm, diminishing the memory wall problem, and scoring high throughput and energy efficiency. These accelerators can support static fixed-precision but are not flexible to support mixed-precision NNs. In this paper, we present BF-IMNA, a bit fluid IMC accelerator for end-to-end Convolutional NN (CNN) inference that is capable of static and dynamic mixed-precision without any hardware reconfiguration overhead at run-time. At the heart of BF-IMNA are Associative Processors (APs), which are bit-serial word-parallel Single Instruction, Multiple Data (SIMD)-like engines. We report the performance of end-to-end inference of ImageNet on AlexNet, VGG16, and ResNet50 on BF-IMNA for different technologies (eNVM and NVM), mixed-precision configurations, and supply voltages. To demonstrate bit fluidity, we implement HAWQ-V3’s per-layer mixed-precision configurations for ResNet18 on BF-IMNA using different latency budgets, and results reveal a trade-off between accuracy and Energy-Delay Product (EDP): On one hand, mixed-precision with a high latency constraint achieves the closest accuracy to fixed-precision INT8 and reports a high (worse) EDP compared to fixed-precision INT4. On the other hand, with a low latency constraint, BF-IMNA reports the closest EDP to fixed-precision INT4, with a higher degradation in accuracy compared to fixed-precision INT8. We also show that BF-IMNA with fixed-precision configuration still delivers performance that is comparable to current state-of-the-art accelerators: BF-IMNA achieves 20% higher energy efficiency and 2% higher throughput.
[AI-93] A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why?
Link: https://arxiv.org/abs/2411.01414
Authors: QiHong Chen,Jiawei Li,Jiecheng Deng,Jiachen Yu,Justin Tian Jin Chen,Iftekhar Ahmed
Keywords: Large Language Models, Large Language, Recent advancements, Language Models, advancements in Large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advancements in Large Language Models (LLMs) have led to their widespread application in automated code generation. However, these models can still generate defective code that deviates from the specification. Previous research has mainly focused on the mistakes in LLM-generated standalone functions, overlooking real-world software development situations where the successful generation of the code requires software contexts such as external dependencies. In this paper, we considered both of these code generation situations and identified a range of non-syntactic mistakes arising from LLMs’ misunderstandings of coding question specifications. Seven categories of non-syntactic mistakes were identified through extensive manual analyses, four of which were missed by previous works. To better understand these mistakes, we proposed six reasons behind these mistakes from various perspectives. Moreover, we explored the effectiveness of LLMs in detecting mistakes and their reasons. Our evaluation demonstrated that GPT-4 with the ReAct prompting technique can achieve an F1 score of up to 0.65 when identifying reasons for LLM’s mistakes, such as misleading function signatures. We believe that these findings offer valuable insights into enhancing the quality of LLM-generated code.
[AI-94] PageRank Bandits for Link Prediction NEURIPS2024
Link: https://arxiv.org/abs/2411.01410
Authors: Yikun Ban,Jiaru Zou,Zihao Li,Yunzhe Qi,Dongqi Fu,Jian Kang,Hanghang Tong,Jingrui He
Keywords: knowledge graph completion, Graph Neural Networks, Link prediction, graph completion, broad applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted to NeurIPS 2024
Abstract:Link prediction is a critical problem in graph learning with broad applications such as recommender systems and knowledge graph completion. Numerous research efforts have been directed at solving this problem, including approaches based on similarity metrics and Graph Neural Networks (GNN). However, most existing solutions are still rooted in conventional supervised learning, which makes it challenging to adapt over time to changing customer interests and to address the inherent dilemma of exploitation versus exploration in link prediction. To tackle these challenges, this paper reformulates link prediction as a sequential decision-making process, where each link prediction interaction occurs sequentially. We propose a novel fusion algorithm, PRB (PageRank Bandits), which is the first to combine contextual bandits with PageRank for collaborative exploitation and exploration. We also introduce a new reward formulation and provide a theoretical performance guarantee for PRB. Finally, we extensively evaluate PRB in both online and offline settings, comparing it with bandit-based and graph-based methods. The empirical success of PRB demonstrates the value of the proposed fusion approach. Our code is released at this https URL.
[AI-95] HeightMapNet: Explicit Height Modeling for End-to-End HD Map Learning WACV2025
Link: https://arxiv.org/abs/2411.01408
Authors: Wenzhao Qiu,Shanmin Pang,Hao zhang,Jianwu Fang,Jianru Xue
Keywords: Recent advances, advances in high-definition, map construction, cost-effectiveness in deployment, construction from surround-view
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to WACV 2025
Abstract:Recent advances in high-definition (HD) map construction from surround-view images have highlighted their cost-effectiveness in deployment. However, prevailing techniques often fall short in accurately extracting and utilizing road features, as well as in the implementation of view transformation. In response, we introduce HeightMapNet, a novel framework that establishes a dynamic relationship between image features and road surface height distributions. By integrating height priors, our approach refines the accuracy of Bird’s-Eye-View (BEV) features beyond conventional methods. HeightMapNet also introduces a foreground-background separation network that sharply distinguishes between critical road elements and extraneous background components, enabling precise focus on detailed road micro-features. Additionally, our method leverages multi-scale features within the BEV space, optimally utilizing spatial geometric information to boost model performance. HeightMapNet has shown exceptional results on the challenging nuScenes and Argoverse 2 datasets, outperforming several widely recognized approaches. The code will be available at this https URL.
[AI-96] Exploring the Edges of Latent State Clusters for Goal-Conditioned Reinforcement Learning NEURIPS2024
Link: https://arxiv.org/abs/2411.01396
Authors: Yuanlin Duan,Guofeng Cui,He Zhu
Keywords: goal-conditioned reinforcement learning, Exploring unknown environments, unsupervised goal-conditioned reinforcement, Exploring unknown, unknown environments efficiently
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: NeurIPS2024 Poster
Abstract:Exploring unknown environments efficiently is a fundamental challenge in unsupervised goal-conditioned reinforcement learning. While selecting exploratory goals at the frontier of previously explored states is an effective strategy, the policy during training may still have limited capability of reaching rare goals on the frontier, resulting in reduced exploratory behavior. We propose “Cluster Edge Exploration” (CE^2), a new goal-directed exploration algorithm that, when choosing goals in sparsely explored areas of the state space, gives priority to goal states that remain accessible to the agent. The key idea is clustering to group states that are easily reachable from one another by the current policy under training in a latent space and traversing to states holding significant exploration potential on the boundary of these clusters before doing exploratory behavior. In challenging robotics environments including navigating a maze with a multi-legged ant robot, manipulating objects with a robot arm on a cluttered tabletop, and rotating objects in the palm of an anthropomorphic robotic hand, CE^2 demonstrates superior efficiency in exploration compared to baseline methods and ablations.
[AI-97] Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization
Link: https://arxiv.org/abs/2411.01373
Authors: Sohrab Namazi Nia,Frank Y. Shih
Keywords: Adaptive Histogram Equalization, accurate diagnosis heavily, diagnosis heavily relies, Limited Adaptive Histogram, Histogram Equalization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In medical imaging, accurate diagnosis heavily relies on effective image enhancement techniques, particularly for X-ray images. Existing methods often suffer from various challenges such as sacrificing global image characteristics over local image characteristics or vice versa. In this paper, we present a novel approach, called G-CLAHE (Global-Contrast Limited Adaptive Histogram Equalization), which is well suited to medical imaging with a focus on X-rays. This method adapts Global Histogram Equalization (GHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE), taking the advantages of both and avoiding their weaknesses in order to preserve local and global characteristics. Experimental results show that it can significantly improve on current state-of-the-art algorithms, effectively addressing their limitations and enhancing the contrast and quality of X-ray images for diagnostic accuracy.
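For readers unfamiliar with the global half of the method, the following is a minimal sketch of plain Global Histogram Equalization (GHE) for an 8-bit grayscale image. It is the textbook technique, not the G-CLAHE algorithm from the paper, whose details the abstract does not give; the function name `global_hist_equalize` is an illustrative choice.

```python
import numpy as np

def global_hist_equalize(img):
    """Textbook global histogram equalization for a 2-D uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)   # pixel count per intensity
    cdf = hist.cumsum()                               # cumulative distribution
    cdf_min = cdf[cdf > 0][0]                         # CDF at the darkest present level
    total = img.size
    if total == cdf_min:
        return img.copy()                             # constant image: nothing to stretch
    # Map each intensity so that the output CDF is approximately linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]

img = np.array([[50, 50], [100, 150]], dtype=np.uint8)
print(global_hist_equalize(img))
```

CLAHE applies the same idea per tile with a clipped histogram, which preserves local detail at the cost of global consistency; that trade-off is what the abstract says G-CLAHE aims to balance.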
[AI-98] Guided Synthesis of Labeled Brain MRI Data Using Latent Diffusion Models for Segmentation of Enlarged Ventricles
Link: https://arxiv.org/abs/2411.01351
Authors: Tim Ruschke,Jonathan Frederik Carlsen,Adam Espe Hansen,Ulrich Lindberg,Amalie Monberg Hindsholm,Martin Norgaard,Claes Nøhr Ladefoged
Keywords: Deep learning models, medical contexts face, contexts face challenges, Deep learning, data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning models in medical contexts face challenges like data scarcity, inhomogeneity, and privacy concerns. This study focuses on improving ventricular segmentation in brain MRI images using synthetic data. We employed two latent diffusion models (LDMs): a mask generator trained using 10,000 masks, and a corresponding SPADE image generator optimized using 6,881 scans to create an MRI conditioned on a 3D brain mask. Conditioning the mask generator on ventricular volume in combination with classifier-free guidance enabled the control of the ventricular volume distribution of the generated synthetic images. Next, the performance of the synthetic data was tested using three nnU-Net segmentation models trained on real, augmented, and entirely synthetic data, respectively. The resulting models were tested on a completely independent hold-out dataset of patients with enlarged ventricles, with manual delineation of the ventricles used as ground truth. The model trained on real data showed a mean absolute error (MAE) of 9.09 ± 12.18 mL in predicted ventricular volume, while the models trained on synthetic and augmented data showed MAEs of 7.52 ± 4.81 mL and 6.23 ± 4.33 mL, respectively. Both the synthetic and augmented models also outperformed the state-of-the-art model SynthSeg, which, due to limited performance in cases of large ventricular volumes, showed an MAE of 7.73 ± 12.12 mL with a standard deviation a factor of 3 higher. The model trained on augmented data showed the highest Dice score of 0.892 ± 0.05, slightly outperforming SynthSeg and on par with the model trained on real data. The synthetic model performed similarly to SynthSeg. In summary, we provide evidence that guided synthesis of labeled brain MRI data using LDMs improves the segmentation of enlarged ventricles and outperforms existing state-of-the-art segmentation models.
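The two metrics reported in this abstract, the Dice score and the mean absolute error (MAE) of predicted volume, follow standard definitions. A minimal sketch is shown below; the function names are illustrative, and the convention of returning 1.0 when both masks are empty is an assumption, not something stated in the paper.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def volume_mae(pred_ml, true_ml):
    """Mean absolute error between predicted and reference volumes (mL)."""
    diff = np.asarray(pred_ml, dtype=float) - np.asarray(true_ml, dtype=float)
    return float(np.mean(np.abs(diff)))

print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
print(volume_mae([100.0, 120.0], [110.0, 115.0]))
```

Note that the two metrics capture different things: a segmentation can have a small volume error while overlapping poorly with the reference mask, which is why the abstract reports both.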
[AI-99] Can Humans Oversee Agents to Prevent Privacy Leakage? A Study on Privacy Awareness Preferences and Trust in Language Model Agents
Link: https://arxiv.org/abs/2411.01344
Authors: Zhiping Zhang,Bingcan Guo,Tianshi Li
Keywords: Language model, boost productivity, act on users’, users’ behalf, behalf for personal
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Language model (LM) agents that act on users’ behalf for personal tasks can boost productivity, but are also susceptible to unintended privacy leakage risks. We present the first study on people’s capacity to oversee the privacy implications of the LM agents. By conducting a task-based survey (N=300), we investigate how people react to and assess the response generated by LM agents for asynchronous interpersonal communication tasks, compared with a response they wrote. We found that people may favor the agent response with more privacy leakage over the response they drafted or consider both good, leading to an increased harmful disclosure from 15.7% to 55.0%. We further uncovered distinct patterns of privacy behaviors, attitudes, and preferences, and the nuanced interactions between privacy considerations and other factors. Our findings shed light on designing agentic systems that enable privacy-preserving interactions and achieve bidirectional alignment on privacy preferences to help users calibrate trust.
[AI-100] Adaptive World Models: Learning Behaviors by Latent Imagination Under Non-Stationarity NEURIPS2024
Link: https://arxiv.org/abs/2411.01342
Authors: Emiliyan Gospodinov,Vaisakh Shaj,Philipp Becker,Stefan Geyer,Gerhard Neumann
Keywords: Developing foundational world, key research direction, Developing foundational, foundational world models, embodied intelligence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2024 Workshop Adaptive Foundation Models
Abstract:Developing foundational world models is a key research direction for embodied intelligence, with the ability to adapt to non-stationary environments being a crucial criterion. In this work, we introduce a new formalism, Hidden Parameter-POMDP, designed for control with adaptive world models. We demonstrate that this approach enables learning robust behaviors across a variety of non-stationary RL benchmarks. Additionally, this formalism effectively learns task abstractions in an unsupervised manner, resulting in structured, task-aware latent spaces.
[AI-101] A Mechanistic Explanatory Strategy for XAI
Link: https://arxiv.org/abs/2411.01332
Authors: Marcin Rabiza
Keywords: solid conceptual foundations, broader scientific discourse, scholars note, note a persistent, persistent lack
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Forthcoming in Müller, V. C., Dewey, A. R., Dung, L., Löhr, G. (Eds.), Philosophy of Artificial Intelligence: The State of the Art, Synthese Library, Berlin: Springer Nature. Please cite the published version
Abstract:Despite significant advancements in XAI, scholars note a persistent lack of solid conceptual foundations and integration with broader scientific discourse on explanation. In response, emerging XAI research draws on explanatory strategies from various sciences and philosophy of science literature to fill these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent advancements in AI explainability within a broader philosophical context. According to the mechanistic approach, the explanation of opaque AI systems involves identifying mechanisms that drive decision-making. For deep neural networks, this means discerning functionally relevant components – such as neurons, layers, circuits, or activation patterns – and understanding their roles through decomposition, localization, and recomposition. Proof-of-principle case studies from image recognition and language modeling align these theoretical approaches with the latest research from AI labs like OpenAI and Anthropic. This research suggests that a systematic approach to studying model organization can reveal elements that simpler (or “more modest”) explainability techniques might miss, fostering more thoroughly explainable AI. The paper concludes with a discussion on the epistemic relevance of the mechanistic approach positioned in the context of selected philosophical debates on XAI.
[AI-102] Visual Fourier Prompt Tuning NEURIPS
Link: https://arxiv.org/abs/2411.01327
Authors: Runjia Zeng,Cheng Han,Qifan Wang,Chunshu Wu,Tong Geng,Lifu Huang,Ying Nian Wu,Dongfang Liu
Keywords: continuing to grow, increasingly parameter-intensive, Transformer-based models continuing, vision Transformer-based models, scale of vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Conference on Neural Information Processing Systems (NeurIPS) 2024
Abstract:With the scale of vision Transformer-based models continuing to grow, finetuning these large-scale pretrained models for new tasks has become increasingly parameter-intensive. Visual prompt tuning is introduced as a parameter-efficient finetuning (PEFT) method in response to this trend. Despite its successes, a notable research challenge persists within almost all PEFT approaches: significant performance degradation is observed when there is a substantial disparity between the datasets applied in pretraining and finetuning phases. To address this challenge, we draw inspiration from human visual cognition, and propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models. Our approach innovatively incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information. Apart from its inherent simplicity and intuitiveness, VFPT exhibits superior performance across all datasets, offering a general solution to dataset challenges, irrespective of data disparities. Empirical results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks, with low parameter usage (e.g., 0.57% of model parameters on VTAB-1k) and notable performance enhancements (e.g., 73.20% of mean accuracy on VTAB-1k). Our code is available at this https URL.
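One way to picture "incorporating the Fast Fourier Transform into prompt embeddings" while keeping both spatial and frequency information is sketched below. This is an assumption-heavy illustration: the abstract does not say how the transform is applied, so the 2D FFT over the (prompt, dimension) axes, the use of the real part only, and the `fourier_fraction` parameter are all hypothetical choices.

```python
import numpy as np

def fourier_prompts(prompts, fourier_fraction=0.5):
    """Replace a fraction of prompt embeddings with their frequency-domain view.

    prompts: array of shape (num_prompts, dim) holding learnable prompt vectors.
    fourier_fraction: share of prompts to transform (hypothetical parameter).
    """
    num_prompts = prompts.shape[0]
    k = int(round(num_prompts * fourier_fraction))
    if k > 0:
        # 2D FFT over the prompt and embedding axes; keep the real part so the
        # result can be fed to a real-valued transformer layer.
        freq = np.fft.fft2(prompts[:k]).real
    else:
        freq = prompts[:0]
    # Frequency-domain prompts are concatenated with the untouched spatial ones.
    return np.concatenate([freq, prompts[k:]], axis=0)

prompts = np.ones((4, 8))
print(fourier_prompts(prompts).shape)
```

The output has the same shape as the input, so under these assumptions such prompts could be prepended to the patch tokens exactly like ordinary visual prompts.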
[AI-103] FEED: Fairness-Enhanced Meta-Learning for Domain Generalization
Link: https://arxiv.org/abs/2411.01316
Authors: Kai Jiang,Chen Zhao,Haoliang Wang,Feng Chen
Keywords: domain generalization, generalization, Generalizing, domain, challenging problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: IEEE International Conference on Big Data 2024
Abstract:Generalizing to out-of-distribution data while being aware of model fairness is a significant and challenging problem in meta-learning. The goal of this problem is to find a set of fairness-aware invariant parameters of a classifier that is trained using data drawn from a family of related training domains with distribution shift on non-sensitive features as well as different levels of dependence between model predictions and sensitive features, so that the classifier can achieve good generalization performance on unknown but distinct test domains. To tackle this challenge, existing state-of-the-art methods either address the domain generalization problem but completely ignore learning with fairness or solely specify shifted domains with various fairness levels. This paper introduces an approach to fairness-aware meta-learning that significantly enhances domain generalization capabilities. Our framework, Fairness-Enhanced Meta-Learning for Domain Generalization (FEED), disentangles latent data representations into content, style, and sensitive vectors. This disentanglement facilitates the robust generalization of machine learning models across diverse domains while adhering to fairness constraints. Unlike traditional methods that focus primarily on domain invariance or sensitivity to shifts, our model integrates a fairness-aware invariance criterion directly into the meta-learning process. This integration ensures that the learned parameters uphold fairness consistently, even when domain characteristics vary widely. We validate our approach through extensive experiments across multiple benchmarks, demonstrating not only superior performance in maintaining high accuracy and fairness but also significant improvements over existing state-of-the-art methods in domain generalization tasks.
[AI-104] False Data Injection Attack Detection in Edge-based Smart Metering Networks with Federated Learning
Link: https://arxiv.org/abs/2411.01313
Authors: Md Raihan Uddin,Ratun Rahman,Dinh C. Nguyen
Keywords: FDI attack detection, FDI attack, false data injection, attack detection, FDI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Smart metering networks are increasingly susceptible to cyber threats, where false data injection (FDI) appears as a critical attack. Data-driven machine learning (ML) methods have shown immense benefits in detecting FDI attacks via data learning and prediction abilities. Prior work has mostly focused on centralized learning and deploying FDI attack detection models at the control center, which requires data collection from local utilities like meters and transformers. However, this data sharing may raise privacy concerns due to the potential disclosure of household information like energy usage patterns. This paper proposes a new privacy-preserving FDI attack detection method by developing an efficient federated learning (FL) framework in the smart meter network with edge computing. Distributed edge servers located at the network edge run an ML-based FDI attack detection model and share the trained model with the grid operator, aiming to build a strong FDI attack detection model without data sharing. Simulation results demonstrate the efficiency of our proposed FL method over the conventional method without collaboration.
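The aggregation step implied by "share the trained model with the grid operator" can be sketched with federated averaging (FedAvg), the most common FL aggregation rule. Whether the paper uses FedAvg specifically is an assumption; the abstract only says that models, not data, are shared. The function name and the flat weight lists are illustrative.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style sketch).

    client_weights: one flat list of parameters per edge server.
    client_sizes: number of local training samples at each edge server.
    Only parameters leave the edge; raw meter readings are never sent.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two edge servers with 1 and 3 local samples, respectively.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_model)  # [2.5, 3.5]
```

In a full training loop the operator would send the averaged parameters back to the edge servers and repeat for several rounds.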
[AI-105] From Federated Learning to Quantum Federated Learning for Space-Air-Ground Integrated Networks
Link: https://arxiv.org/abs/2411.01312
Authors: Vu Khanh Quy,Nguyen Minh Quy,Tran Thi Hoai,Shaba Shaon,Md Raihan Uddin,Tien Nguyen,Dinh C. Nguyen,Aryan Kaushik,Periklis Chatzimisios
Keywords: connections that cover, seamless and data-based, data-based connections, expected to provide, provide seamless
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:6G wireless networks are expected to provide seamless and data-based connections that cover space-air-ground and underwater networks. As a core partition of future 6G networks, Space-Air-Ground Integrated Networks (SAGIN) have been envisioned to provide countless real-time intelligent applications. To realize this, promoting AI techniques into SAGIN is an inevitable trend. Due to the distributed and heterogeneous architecture of SAGIN, federated learning (FL) and then quantum FL are emerging AI model training techniques for enabling future privacy-enhanced and computation-efficient SAGINs. In this work, we explore the vision of using FL/QFL in SAGINs. We present a few representative applications enabled by the integration of FL and QFL in SAGINs. A case study of QFL over UAV networks is also given, showing the merit of quantum-enabled training approach over the conventional FL benchmark. Research challenges along with standardization for QFL adoption in future SAGINs are also highlighted.
[AI-106] Marginal Causal Flows for Validation and Inference NEURIPS2024
Link: https://arxiv.org/abs/2411.01295
Authors: Daniel de Vassimon Manela,Laura Battaglia,Robin J. Evans
Keywords: remains challenging due, Investigating the marginal, complex data remains, data remains challenging, reproduce intricate real-world
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 23 pages, 10 figures, Accepted as a Poster at NeurIPS 2024
Abstract:Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets.
[AI-107] Causal reasoning in difference graphs
Link: https://arxiv.org/abs/2411.01292
Authors: Charles K. Assaad
Keywords: understanding causal mechanisms, designing effective public, essential for designing, designing effective, public health interventions
Categories: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:
Abstract:In epidemiology, understanding causal mechanisms across different populations is essential for designing effective public health interventions. Recently, difference graphs have been introduced as a tool to visually represent causal variations between two distinct populations. While there has been progress in inferring these graphs from data through causal discovery methods, there remains a gap in systematically leveraging their potential to enhance causal reasoning. This paper addresses that gap by establishing conditions for identifying causal changes and effects using difference graphs and observational data. It specifically focuses on identifying total causal changes and total effects in a nonparametric framework, as well as direct causal changes and direct effects in a linear context. In doing so, it provides a novel approach to causal reasoning that holds potential for various public health applications.
[AI-108] Improving Energy Efficiency in Manufacturing: A Novel Expert System Shell
Link: https://arxiv.org/abs/2411.01272
Authors: Borys Ioshchikhes,Michael Frank,Tresa Maria Joseph,Matthias Weigold
Keywords: global climate targets, automatically identifying energy, Expert systems, Expert, climate targets
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 6 pages, 3 figures, preprint for conference contribution
Abstract:Expert systems are effective tools for automatically identifying energy efficiency potentials in manufacturing, thereby contributing significantly to global climate targets. These systems analyze energy data, pinpoint inefficiencies, and recommend optimizations to reduce energy consumption. Beyond systematic approaches for developing expert systems, there is a pressing need for simple and rapid software implementation solutions. Expert system shells, which facilitate the swift development and deployment of expert systems, are crucial tools in this process. They provide a template that simplifies the creation and integration of expert systems into existing manufacturing processes. This paper provides a comprehensive comparison of existing expert system shells regarding their suitability for improving energy efficiency, highlighting significant gaps and limitations. To address these deficiencies, we introduce a novel expert system shell, implemented in Jupyter Notebook, that provides a flexible and easily integrable solution for expert system development.
[AI-109] Interacting Large Language Model Agents. Interpretable Models and Social Learning
Link: https://arxiv.org/abs/2411.01271
Authors: Adit Jain,Vikram Krishnamurthy
Keywords: statistical signal processing, interacting large language, large language model, paper develops theory, LLMAs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments:
Abstract:This paper develops theory and algorithms for interacting large language model agents (LLMAs) using methods from statistical signal processing and microeconomics. While both fields are mature, their application to decision-making by interacting LLMAs remains unexplored. Motivated by Bayesian sentiment analysis on online platforms, we construct interpretable models and stochastic control algorithms that enable LLMAs to interact and perform Bayesian inference. Because interacting LLMAs learn from prior decisions and external inputs, they exhibit bias and herding behavior. Thus, developing interpretable models and stochastic control algorithms is essential to understand and mitigate these behaviors. This paper has three main results. First, we show using Bayesian revealed preferences from microeconomics that an individual LLMA satisfies the sufficient conditions for rationally inattentive (bounded rationality) utility maximization and, given an observation, the LLMA chooses an action that maximizes a regularized utility. Second, we utilize Bayesian social learning to construct interpretable models for LLMAs that interact sequentially with each other and the environment while performing Bayesian inference. Our models capture the herding behavior exhibited by interacting LLMAs. Third, we propose a stochastic control framework to delay herding and improve state estimation accuracy under two settings: (a) centrally controlled LLMAs and (b) autonomous LLMAs with incentives. Throughout the paper, we demonstrate the efficacy of our methods on real datasets for hate speech classification and product quality assessment, using open-source models like Mistral and closed-source models like ChatGPT. The main takeaway of this paper, based on substantial empirical analysis and mathematical formalism, is that LLMAs act as rationally bounded Bayesian agents that exhibit social learning when interacting.
[AI-110] Optimizing Federated Learning by Entropy-Based Client Selection
Link: https://arxiv.org/abs/2411.01240
Authors: Andreas Lutz,Gabriele Steidl,Karsten Müller,Wojciech Samek
Keywords: including natural language, natural language processing, emerging field revolutionizing, computer vision, revolutionizing various industries
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Deep learning is an emerging field revolutionizing various industries, including natural language processing, computer vision, and many more. These domains typically require an extensive amount of data for optimal performance, potentially utilizing huge centralized data repositories. However, such centralization could raise privacy issues concerning the storage of sensitive data. To address this issue, federated learning was developed. It is a new distributed learning technique that enables collaborative training of a deep learning model on decentralized devices, referred to as clients, without compromising their data privacy. Traditional federated learning methods often suffer from severe performance degradation when the data distribution among clients differs significantly. This becomes especially problematic in the case of label distribution skew, where the distribution of labels varies across clients. To address this, a novel method called FedEntOpt is proposed. FedEntOpt is designed to mitigate performance issues caused by label distribution skew by maximizing the entropy of the global label distribution of the selected client subset in each federated learning round. This ensures that the aggregated model parameters from the clients have been exposed to data from all available labels, which improves the accuracy of the global model. Extensive experiments on several benchmark datasets show that the proposed method outperforms several state-of-the-art algorithms by up to 6% in classification accuracy, demonstrating robust and superior performance, particularly under low participation rates. In addition, it offers the flexibility to be combined with them, enhancing their performance by over 40%.
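The selection idea in this abstract (pick clients so that their pooled label distribution has maximal entropy) can be illustrated with a small greedy sketch. This is not the authors' implementation: the names `label_entropy` and `select_clients`, the one-client-at-a-time greedy rule, and the alphabetical tie-breaking are assumptions made for illustration, and the sketch assumes the server can see per-client label counts.

```python
import math
from collections import Counter


def label_entropy(counts):
    """Shannon entropy (in nats) of the label distribution given by `counts`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def select_clients(client_label_counts, k):
    """Greedily pick up to k clients whose pooled labels have high entropy.

    `client_label_counts` maps a client id to a Counter of its label counts
    (hypothetical input format). Ties are broken alphabetically by client id.
    """
    selected = []
    pooled = Counter()
    remaining = set(client_label_counts)
    while remaining and len(selected) < k:
        # Choose the client that maximizes the entropy of the pooled labels.
        best = max(
            sorted(remaining),
            key=lambda cid: label_entropy(pooled + client_label_counts[cid]),
        )
        selected.append(best)
        pooled = pooled + client_label_counts[best]
        remaining.remove(best)
    return selected


# Toy example: clients "a" and "b" hold only label 0, client "c" only label 1.
clients = {
    "a": Counter({0: 10}),
    "b": Counter({0: 10}),
    "c": Counter({1: 10}),
}
print(select_clients(clients, k=2))  # ['a', 'c']
```

In the toy example the second pick is "c" rather than "b", because adding a client with the missing label raises the pooled entropy from 0 to ln 2, which matches the abstract's goal of covering all available labels in each round.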
[AI-111] AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?
Link: https://arxiv.org/abs/2411.01236
Authors: Benlong Wu,Guoqiang Chen,Kejiang Chen,Xiuwei Shang,Jiapeng Han,Yanru He,Weiming Zhang,Nenghai Yu
Keywords: ensure Web security, Penetration testing, prevent data leakage, ensure Web, automated penetration testing
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 22 pages, 6 figures
Abstract:Penetration testing is essential to ensure Web security, which can detect and fix vulnerabilities in advance, and prevent data leakage and serious consequences. The powerful inference capabilities of large language models (LLMs) have made significant progress in various fields, and the development potential of LLM-based agents can revolutionize the cybersecurity penetration testing industry. In this work, we establish a comprehensive end-to-end penetration testing benchmark using a real-world penetration testing environment to explore the capabilities of LLM-based agents in this domain. Our results reveal that the agents are familiar with the framework of penetration testing tasks, but they still face limitations in generating accurate commands and executing complete processes. Accordingly, we summarize the current challenges, including the difficulty of maintaining the entire message history and the tendency for the agent to become stuck. Based on the above insights, we propose a Penetration testing State Machine (PSM) that utilizes the Finite State Machine (FSM) methodology to address these limitations. Then, we introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs, which utilizes the inherent inference ability of LLM and the constraint framework of state machines. Our evaluation results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model and improves the task completion rate from 22% to 41% on the benchmark target. Compared with the baseline framework and manual work, AutoPT also reduces time and economic costs further. Hence, our AutoPT has facilitated the development of automated penetration testing and significantly impacted both academia and industry. 
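To make the state-machine idea concrete, the following is a minimal sketch of an agent loop constrained by a finite state machine. It is not the paper's PSM: the state names, the outcome labels, the transition table, and the `agent` callable (which stands in for an LLM call) are all hypothetical. The sketch only shows how a fixed transition table and a step budget can limit what the agent does next and keep it from looping indefinitely, while the agent sees the current state instead of the whole message history.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical stages, chosen for illustration only.
    SCAN = auto()
    EXPLOIT = auto()
    VERIFY = auto()
    DONE = auto()
    FAILED = auto()


# Allowed transitions: (current state, outcome label) -> next state.
TRANSITIONS = {
    State.SCAN: {"found": State.EXPLOIT, "nothing": State.FAILED},
    State.EXPLOIT: {"ok": State.VERIFY, "error": State.SCAN},
    State.VERIFY: {"confirmed": State.DONE, "rejected": State.SCAN},
}


def run(agent, max_steps=10):
    """Drive `agent` through the state machine and return the visited states.

    `agent` is any callable that maps the current State to an outcome label.
    Unknown outcome labels lead to FAILED; `max_steps` bounds the loop.
    """
    state = State.SCAN
    trace = [state]
    for _ in range(max_steps):
        if state in (State.DONE, State.FAILED):
            break
        outcome = agent(state)
        state = TRANSITIONS[state].get(outcome, State.FAILED)
        trace.append(state)
    return trace


# A scripted stand-in for the LLM-driven agent.
SCRIPT = {State.SCAN: "found", State.EXPLOIT: "ok", State.VERIFY: "confirmed"}
print([s.name for s in run(lambda s: SCRIPT[s])])
```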
[AI-112] The Interaction Layer: An Exploration for Co-Designing User-LLM Interactions in Parental Wellbeing Support Systems
Link: https://arxiv.org/abs/2411.01228
Authors: Sruthi Viswanathan,Seray Ibrahim,Ravi Shankar,Reuben Binns,Max Van Kleek,Petr Slovak
Keywords: limited personal time, Parenting brings emotional, balancing work, personal time, brings emotional
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Parenting brings emotional and physical challenges, from balancing work, childcare, and finances to coping with exhaustion and limited personal time. Yet, one in three parents never seek support. AI systems potentially offer stigma-free, accessible, and affordable solutions. Yet, user adoption often fails due to issues with explainability and reliability. To see if these issues could be solved using a co-design approach, we developed and tested NurtureBot, a wellbeing support assistant for new parents. 32 parents co-designed the system through the Asynchronous Remote Communities method, identifying the key challenge as achieving a “successful chat.” As part of co-design, parents role-played as NurtureBot, rewriting its dialogues to improve user understanding, control, and […]. The refined prototype, evaluated by 32 initial and 46 new parents, showed improved user experience and usability, with a final CUQ score of 91.3/100, demonstrating successful interaction patterns. Our process revealed useful interaction design lessons for effective AI parenting support.
[AI-113] Infinite-Resolution Integral Noise Warping for Diffusion Models
Link: https://arxiv.org/abs/2411.01212
Authors: Yitong Deng,Winnie Lin,Lingxiao Li,Dmitriy Smirnov,Ryan Burgert,Ning Yu,Vincent Dedun,Mohammad H. Taghavi
Keywords: Adapting pretrained image-based, modeling research direction, pretrained image-based diffusion, image-based diffusion models, generate temporally consistent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:Adapting pretrained image-based diffusion models to generate temporally consistent videos has become an impactful generative modeling research direction. Training-free noise-space manipulation has proven to be an effective technique, where the challenge is to preserve the Gaussian white noise distribution while adding in temporal consistency. Recently, Chang et al. (2024) formulated this problem using an integral noise representation with distribution-preserving guarantees, and proposed an upsampling-based algorithm to compute it. However, while their mathematical formulation is advantageous, the algorithm incurs a high computational cost. Through analyzing the limiting-case behavior of their algorithm as the upsampling resolution goes to infinity, we develop an alternative algorithm that, by gathering increments of multiple Brownian bridges, achieves their infinite-resolution accuracy while simultaneously reducing the computational cost by orders of magnitude. We prove and experimentally validate our theoretical claims, and demonstrate our method’s effectiveness in real-world applications. We further show that our method readily extends to the 3-dimensional space.
[AI-114] Class-specific feature selection for classification explainability
Link: https://arxiv.org/abs/2411.01204
Authors: Jesus S. Aguilar-Ruiz
Keywords: Selection techniques aim, subset selection techniques, Feature Selection techniques, Feature Selection, Selection techniques
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Feature Selection techniques aim at finding a relevant subset of features that perform equally or better than the original set of features at explaining the behavior of data. Typically, features are extracted from feature ranking or subset selection techniques, and the performance is measured by classification or regression tasks. However, while selected features may not have equal importance for the task, they do have equal importance for each class. This work first introduces a comprehensive review of the concept of class-specific, with a focus on feature selection and classification. The fundamental idea of the class-specific concept resides in the understanding that the significance of each feature can vary from one class to another. This contrasts with the traditional class-independent approach, which evaluates the importance of attributes collectively for all classes. For example, in tumor prediction scenarios, each type of tumor may be associated with a distinct subset of relevant features. These features possess significant discriminatory power, enabling the differentiation of one tumor type from others. This class-specific perspective offers a more effective approach to classification tasks by recognizing and leveraging the unique characteristics of each class. Secondly, classification schemes from one-versus-all and one-versus-each strategies are described, and a novel deep one-versus-each strategy is introduced, which offers advantages from the point of view of explainability (feature selection) and decomposability (classification). Thirdly, a novel class-specific relevance matrix is presented, from which some more sophisticated classification schemes can be derived, such as the three-layer class-specific scheme. The potential for further advancements is wide and will open new horizons for exploring novel research directions in multiclass hyperdimensional contexts.
[AI-115] XNB: Explainable Class-Specific Naive-Bayes Classifier
Link: https://arxiv.org/abs/2411.01203
Authors: Jesus S. Aguilar-Ruiz,Cayetano Romero,Andrea Cicconardi
Keywords: today data-intensive landscape, improve model accuracy, data-intensive landscape, increasingly common, reducing the number
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:In today’s data-intensive landscape, where high-dimensional datasets are increasingly common, reducing the number of input features is essential to prevent overfitting and improve model accuracy. Despite numerous efforts to tackle dimensionality reduction, most approaches apply a universal set of features across all classes, potentially missing the unique characteristics of individual classes. This paper presents the Explainable Class-Specific Naive Bayes (XNB) classifier, which introduces two critical innovations: 1) the use of Kernel Density Estimation to calculate posterior probabilities, allowing for a more accurate and flexible estimation process, and 2) the selection of class-specific feature subsets, ensuring that only the most relevant variables for each class are utilized. Extensive empirical analysis on high-dimensional genomic datasets shows that XNB matches the classification performance of traditional Naive Bayes while drastically improving model interpretability. By isolating the most relevant features for each class, XNB not only reduces the feature set to a minimal, distinct subset for each class but also provides deeper insights into how the model makes predictions. This approach offers significant advantages in fields where both precision and explainability are critical.
[AI-116] GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation NEURIPS2024
Link: https://arxiv.org/abs/2411.01200
Authors: Haoran Lu,Ruihai Wu,Yitong Li,Sijie Li,Ziyu Zhu,Chuanruo Ning,Yan Shen,Longzan Luo,Yuanpei Chen,Hao Dong
Keywords: Manipulating garments, home-assistant robots, fabrics has long, critical endeavor, development of home-assistant
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: NeurIPS 2024
Abstract:Manipulating garments and fabrics has long been a critical endeavor in the development of home-assistant robots. However, due to complex dynamics and topological structures, garment manipulations pose significant challenges. Recent successes in reinforcement learning and vision-based methods offer promising avenues for learning garment manipulation. Nevertheless, these approaches are severely constrained by current benchmarks, which offer limited diversity of tasks and unrealistic simulation behavior. Therefore, we present GarmentLab, a content-rich benchmark and realistic simulation designed for deformable object and garment manipulation. Our benchmark encompasses a diverse range of garment types, robotic systems and manipulators. The abundant tasks in the benchmark further explore the interactions between garments, deformable objects, rigid bodies, fluids, and the human body. Moreover, by incorporating multiple simulation methods such as FEM and PBD, along with our proposed sim-to-real algorithms and real-world benchmark, we aim to significantly narrow the sim-to-real gap. We evaluate state-of-the-art vision methods, reinforcement learning, and imitation learning approaches on these tasks, highlighting the challenges faced by current algorithms, notably their limited generalization capabilities. Our proposed open-source environments and comprehensive analysis show a promising boost to future research in garment manipulation by unlocking the full potential of these methods. We guarantee that we will open-source our code as soon as possible. You can watch the videos in supplementary files to learn more about the details of our work. Our project page is available at: this https URL
[AI-117] Learning Rules Explaining Interactive Theorem Proving Tactic Prediction
Link: https://arxiv.org/abs/2411.01188
Authors: Liao Zhang,David M. Cerna,Cezary Kaliszyk
Keywords: interactive theorem provers, curve remains steep, learning curve remains, Formally verifying, interactive theorem
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 5 figures
Abstract:Formally verifying the correctness of mathematical proofs is more accessible than ever; however, the learning curve remains steep for many of the state-of-the-art interactive theorem provers (ITP). Deriving the most appropriate subsequent proof step, and reasoning about it, given the multitude of possibilities, remains a daunting task for novice users. To improve the situation, several investigations have developed machine learning based guidance for tactic selection. Such approaches struggle to learn non-trivial relationships between the chosen tactic and the structure of the proof state and represent them as symbolic expressions. To address these issues, we (i) represent the problem as an Inductive Logic Programming (ILP) task, (ii) use the ILP representation to enrich the feature space by encoding additional, computationally expensive properties as background knowledge predicates, (iii) use this enriched feature space to learn rules explaining when a tactic is applicable to a given proof state, and (iv) use the learned rules to filter the output of an existing tactic selection approach and empirically show improvement over the non-filtering approaches.
[AI-118] Guiding Multi-agent Multi-task Reinforcement Learning by a Hierarchical Framework with Logical Reward Shaping
Link: https://arxiv.org/abs/2411.01184
Authors: Chanjuan Liu,Jinmiao Cong,Bingcai Chen,Yaochu Jin,Enqiang Zhu
Keywords: Multi-agent hierarchical reinforcement, solve intelligent decision, intelligent decision problems, hierarchical reinforcement learning, current MAHRL algorithms
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:Multi-agent hierarchical reinforcement learning (MAHRL) has been studied as an effective means to solve intelligent decision problems in complex and large-scale environments. However, most current MAHRL algorithms follow the traditional way of using reward functions in reinforcement learning, which limits their use to a single task. This study aims to design a multi-agent cooperative algorithm with logic reward shaping (LRS), which uses a more flexible way of setting the rewards, allowing for the effective completion of multi-tasks. LRS uses Linear Temporal Logic (LTL) to express the internal logic relation of subtasks within a complex task. Then, it evaluates whether the subformulae of the LTL expressions are satisfied based on a designed reward structure. This helps agents to learn to effectively complete tasks by adhering to the LTL expressions, thus enhancing the interpretability and credibility of their decisions. To enhance coordination and cooperation among multiple agents, a value iteration technique is designed to evaluate the actions taken by each agent. Based on this evaluation, a reward function is shaped for coordination, which enables each agent to evaluate its status and complete the remaining subtasks through experiential learning. Experiments have been conducted on various types of tasks in the Minecraft-like environment. The results demonstrate that the proposed algorithm can improve the performance of multi-agents when learning to complete multi-tasks.
[AI-119] Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models NEURIPS2024
Link: https://arxiv.org/abs/2411.01179
Authors: Wonguk Cho,Seokeon Choi,Debasmit Das,Matthias Reisser,Taesup Kim,Sungrack Yun,Fatih Porikli
Keywords: generate custom images, Recent advancements, textual prompts, generate custom, custom images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: NeurIPS 2024
Abstract:Recent advancements in text-to-image diffusion models have enabled the personalization of these models to generate custom images from textual prompts. This paper presents an efficient LoRA-based personalization approach for on-device subject-driven generation, where pre-trained diffusion models are fine-tuned with user-specific data on resource-constrained devices. Our method, termed Hollowed Net, enhances memory efficiency during fine-tuning by modifying the architecture of a diffusion U-Net to temporarily remove a fraction of its deep layers, creating a hollowed structure. This approach directly addresses on-device memory constraints and substantially reduces GPU memory requirements for training, in contrast to previous methods that primarily focus on minimizing training steps and reducing the number of parameters to update. Additionally, the personalized Hollowed Net can be transferred back into the original U-Net, enabling inference without additional memory overhead. Quantitative and qualitative analyses demonstrate that our approach not only reduces training memory to levels as low as those required for inference but also maintains or improves personalization performance compared to existing methods.
[AI-120] Reasoning Limitations of Multimodal Large Language Models. A case study of Bongard Problems
Link: https://arxiv.org/abs/2411.01173
Authors: Mikołaj Małkiński,Szymon Pawlonka,Jacek Mańdziuk
Keywords: Abstract visual reasoning, Abstract visual, discover common concepts, common concepts underlying, encompasses a suite
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Abstract visual reasoning (AVR) encompasses a suite of tasks whose solving requires the ability to discover common concepts underlying the set of pictures through an analogy-making process, similarly to human IQ tests. Bongard Problems (BPs), proposed in 1968, constitute a fundamental challenge in this domain mainly due to their requirement to combine visual reasoning and verbal description. This work poses a question whether multimodal large language models (MLLMs) inherently designed to combine vision and language are capable of tackling BPs. To this end, we propose a set of diverse MLLM-suited strategies to tackle BPs and examine four popular proprietary MLLMs: GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3.5 Sonnet, and four open models: InternVL2-8B, LLaVa-1.6 Mistral-7B, Phi-3.5-Vision, and Pixtral 12B. The above MLLMs are compared on three BP datasets: a set of original BP instances relying on synthetic, geometry-based images and two recent datasets based on real-world images, i.e., Bongard-HOI and Bongard-OpenWorld. The experiments reveal significant limitations of MLLMs in solving BPs. In particular, the models struggle to solve the classical set of synthetic BPs, despite their visual simplicity. Though their performance ameliorates on real-world concepts expressed in Bongard-HOI and Bongard-OpenWorld, the models still have difficulty in utilizing new information to improve their predictions, as well as utilizing a dialog context window effectively. To capture the reasons of performance discrepancy between synthetic and real-world AVR domains, we propose Bongard-RWR, a new BP dataset consisting of real-world images that translates concepts from hand-crafted synthetic BPs to real-world concepts. The MLLMs’ results on Bongard-RWR suggest that their poor performance on classical BPs is not due to domain specificity but rather reflects their general AVR limitations.
[AI-121] Covariance-based Space Regularization for Few-shot Class Incremental Learning WACV2025
Link: https://arxiv.org/abs/2411.01172
Authors: Yijie Hu,Guanyu Yang,Zhaorui Tan,Xiaowei Huang,Kaizhu Huang,Qiu-Feng Wang
Keywords: Class Incremental Learning, Incremental Learning, previously learned base, incremental sessions, learned base classes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WACV2025, 10 pages, 5 figures
Abstract:Few-shot Class Incremental Learning (FSCIL) presents a challenging yet realistic scenario, which requires the model to continually learn new classes with limited labeled data (i.e., incremental sessions) while retaining knowledge of previously learned base classes (i.e., base sessions). Due to the limited data in incremental sessions, models are prone to overfitting new classes and suffering catastrophic forgetting of base classes. To tackle these issues, recent advancements resort to prototype-based approaches to constrain the base class distribution and learn discriminative representations of new classes. Despite the progress, the limited data issue still induces ill-divided feature space, leading the model to confuse the new class with old classes or fail to facilitate good separation among new classes. In this paper, we aim to mitigate these issues by directly constraining the span of each class distribution from a covariance perspective. In detail, we propose a simple yet effective covariance constraint loss to force the model to learn each class distribution with the same covariance matrix. In addition, we propose a perturbation approach to perturb the few-shot training samples in the feature space, which encourages the samples to be away from the weighted distribution of other classes. Regarding perturbed samples as new class data, the classifier is forced to establish explicit boundaries between each new class and the existing ones. Our approach is easy to integrate into existing FSCIL approaches to boost performance. Experiments on three benchmarks validate the effectiveness of our approach, achieving a new state-of-the-art performance of FSCIL.
[AI-122] Fast and Memory-Efficient Video Diffusion Using Streamlined Inference NEURIPS2024
链接: https://arxiv.org/abs/2411.01171
作者: Zheng Zhan,Yushu Wu,Yifan Gong,Zichong Meng,Zhenglun Kong,Changdi Yang,Geng Yuan,Pu Zhao,Wei Niu,Yanzhi Wang
关键词-EN: artificial intelligence-generated content, video diffusion models, diffusion models, significantly advanced development, intelligence-generated content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024
Abstract:The rapid progress in artificial intelligence-generated content (AIGC), especially with diffusion models, has significantly advanced development of high-quality video generation. However, current video diffusion models exhibit demanding computational requirements and high peak memory usage, especially for generating longer and higher-resolution videos. These limitations greatly hinder the practical application of video diffusion models on standard hardware platforms. To tackle this issue, we present a novel, training-free framework named Streamlined Inference, which leverages the temporal and spatial properties of video diffusion models. Our approach integrates three core components: Feature Slicer, Operator Grouping, and Step Rehash. Specifically, Feature Slicer effectively partitions input features into sub-features and Operator Grouping processes each sub-feature with a group of consecutive operators, resulting in significant memory reduction without sacrificing the quality or speed. Step Rehash further exploits the similarity between adjacent steps in diffusion, and accelerates inference through skipping unnecessary steps. Extensive experiments demonstrate that our approach significantly reduces peak memory and computational overhead, making it feasible to generate high-quality videos on a single consumer GPU (e.g., reducing peak memory of AnimateDiff from 42GB to 11GB, featuring faster inference on 2080Ti).
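The memory saving attributed to Feature Slicer and Operator Grouping can be pictured with a toy example: instead of running each operator over the full feature list, the features are split into slices and each slice passes through the whole operator group before the next slice starts. This is only a sketch under a strong assumption, namely that every operator acts on each element independently; real layers such as attention or normalization mix elements and need more care. The function names and the "peak" counter are hypothetical.

```python
def run_unsliced(features, operators):
    """Apply each operator to the full feature list; track the largest intermediate."""
    x = list(features)
    peak = len(x)
    for op in operators:
        x = [op(v) for v in x]
        peak = max(peak, len(x))
    return x, peak


def run_sliced(features, operators, num_slices):
    """Run the whole operator group on one slice at a time (assumes element-wise ops)."""
    if not features:
        return [], 0
    size = -(-len(features) // num_slices)  # ceiling division
    out, peak = [], 0
    for start in range(0, len(features), size):
        chunk = features[start:start + size]
        for op in operators:
            chunk = [op(v) for v in chunk]
            peak = max(peak, len(chunk))
        out.extend(chunk)
    return out, peak


if __name__ == "__main__":
    feats = [float(i) for i in range(8)]
    ops = [lambda v: v * 2, lambda v: v + 1, lambda v: v * v]
    full, peak_full = run_unsliced(feats, ops)
    sliced, peak_sliced = run_sliced(feats, ops, num_slices=4)
    print(full == sliced, peak_full, peak_sliced)
```

Both paths produce the same output, but the sliced path never holds an intermediate larger than one slice.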
[AI-123] Bi-Level Graph Structure Learning for Next POI Recommendation
链接: https://arxiv.org/abs/2411.01169
作者: Liang Wang,Shu Wu,Qiang Liu,Yanqiao Zhu,Xiang Tao,Mengdi Zhang,Liang Wang
关键词-EN: sequential check-in history, POI, aims to predict, predict a user, user next destination
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Accepted by IEEE Transactions on Knowledge and Data Engineering
Abstract:Next point-of-interest (POI) recommendation aims to predict a user’s next destination based on sequential check-in history and a set of POI candidates. Graph neural networks (GNNs) have demonstrated a remarkable capability in this endeavor by exploiting the extensive global collaborative signals present among POIs. However, most of the existing graph-based approaches construct graph structures based on pre-defined heuristics, failing to consider inherent hierarchical structures of POI features such as geographical locations and visiting peaks, or suffering from noisy and incomplete structures in graphs. To address the aforementioned issues, this paper presents a novel Bi-level Graph Structure Learning (BiGSL) for next POI recommendation. BiGSL first learns a hierarchical graph structure to capture the fine-to-coarse connectivity between POIs and prototypes, and then uses a pairwise learning module to dynamically infer relationships between POI pairs and prototype pairs. Based on the learned bi-level graphs, our model then employs a multi-relational graph network that considers both POI- and prototype-level neighbors, resulting in improved POI representations. Our bi-level structure learning scheme is more robust to data noise and incompleteness, and improves the exploration ability for recommendation by alleviating sparsity issues. Experimental results on three real-world datasets demonstrate the superiority of our model over existing state-of-the-art methods, with a significant improvement in recommendation accuracy and exploration performance.
[AI-124] Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization
链接: https://arxiv.org/abs/2411.01168
作者: Shengchao Hu,Wanru Zhao,Weixiong Lin,Li Shen,Ya Zhang,Dacheng Tao
关键词-EN: Offline reinforcement learning, harness previous experiences, methods harness previous, Offline reinforcement, pre-trained large-scale models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages
Abstract:Offline reinforcement learning (RL) methods harness previous experiences to derive an optimal policy, forming the foundation for pre-trained large-scale models (PLMs). When encountering tasks not seen before, PLMs often utilize several expert trajectories as prompts to expedite their adaptation to new requirements. Though a range of prompt-tuning methods have been proposed to enhance the quality of prompts, these methods often face optimization restrictions due to prompt initialization, which can significantly constrain the exploration domain and potentially lead to suboptimal solutions. To eliminate the reliance on the initial prompt, we shift our perspective towards the generative model, framing the prompt-tuning process as a form of conditional generative modeling, where prompts are generated from random noise. Our innovation, the Prompt Diffuser, leverages a conditional diffusion model to produce prompts of exceptional quality. Central to our framework is the approach to trajectory reconstruction and the meticulous integration of downstream task guidance during the training phase. Further experimental results underscore the potency of the Prompt Diffuser as a robust and effective tool for the prompt-tuning process, demonstrating strong performance in the meta-RL tasks.
[AI-125] Role Play: Learning Adaptive Role-Specific Strategies in Multi-Agent Interactions
链接: https://arxiv.org/abs/2411.01166
作者: Weifan Long,Wen Wen,Peng Zhai,Lihua Zhang
关键词-EN: Zero-shot coordination problem, attracted increasing attention, Zero-shot coordination, multi-agent reinforcement learning, increasing attention
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
Abstract:The zero-shot coordination problem in multi-agent reinforcement learning (MARL), which requires agents to adapt to unseen agents, has attracted increasing attention. Traditional approaches often rely on the Self-Play (SP) framework to generate a diverse set of policies in a policy pool, which serves to improve the generalization capability of the final agent. However, these frameworks may struggle to capture the full spectrum of potential strategies, especially in real-world scenarios that demand agents balance cooperation with competition. In such settings, agents need strategies that can adapt to varying and often conflicting goals. Drawing inspiration from Social Value Orientation (SVO), where individuals maintain stable value orientations during interactions with others, we propose a novel framework called Role Play (RP). RP employs role embeddings to transform the challenge of policy diversity into a more manageable diversity of roles. It trains a common policy with role embedding observations and employs a role predictor to estimate the joint role embeddings of other agents, helping the learning agent adapt to its assigned role. We theoretically prove that an approximate optimal policy can be achieved by optimizing the expected cumulative reward relative to an approximate role-based policy. Experimental results in both cooperative (Overcooked) and mixed-motive games (Harvest, CleanUp) reveal that RP consistently outperforms strong baselines when interacting with unseen agents, highlighting its robustness and adaptability in complex environments.
[AI-126] Supervised Score-Based Modeling by Gradient Boosting
链接: https://arxiv.org/abs/2411.01159
作者: Changyuan Zhao,Hongyang Du,Guangyuan Liu,Dusit Niyato
关键词-EN: Score-based generative models, gradient boosting algorithm, Score-based generative, Supervised Score-based Model, gradient boosting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 1 figure, 4 tables
Abstract:Score-based generative models can effectively learn the distribution of data by estimating the gradient of the distribution. Due to the multi-step denoising characteristic, researchers have recently considered combining score-based generative models with the gradient boosting algorithm, a multi-step supervised learning algorithm, to solve supervised learning tasks. However, existing generative model algorithms are often limited by the stochastic nature of the models and the long inference time, impacting prediction performances. Therefore, we propose a Supervised Score-based Model (SSM), which can be viewed as a gradient boosting algorithm combining score matching. We provide a theoretical analysis of learning and sampling for SSM to balance inference time and prediction accuracy. Via the ablation experiment in selected examples, we demonstrate the outstanding performances of the proposed techniques. Additionally, we compare our model with other probabilistic models, including Natural Gradient Boosting (NGboost), Classification and Regression Diffusion Models (CARD), Diffusion Boosted Trees (DBT), and Bayesian neural network-based models. The experimental results show that our model outperforms existing models in both accuracy and inference time.
[AI-127] Pin-Tuning: Parameter-Efficient In-Context Tuning for Few-Shot Molecular Property Prediction NEURIPS2024
链接: https://arxiv.org/abs/2411.01158
作者: Liang Wang,Qiang Liu,Shaozhen Liu,Xin Sun,Shu Wu,Liang Wang
关键词-EN: Molecular property prediction, property prediction, Molecular property, material science, real-world scenarios
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
*备注: Accepted by NeurIPS 2024
Abstract:Molecular property prediction (MPP) is integral to drug discovery and material science, but often faces the challenge of data scarcity in real-world scenarios. Addressing this, few-shot molecular property prediction (FSMPP) has been developed. Unlike other few-shot tasks, FSMPP typically employs a pre-trained molecular encoder and a context-aware classifier, benefiting from molecular pre-training and molecular context information. Despite these advancements, existing methods struggle with the ineffective fine-tuning of pre-trained encoders. We attribute this issue to the imbalance between the abundance of tunable parameters and the scarcity of labeled molecules, and the lack of contextual perceptiveness in the encoders. To overcome this hurdle, we propose a parameter-efficient in-context tuning method, named Pin-Tuning. Specifically, we propose a lightweight adapter for pre-trained message passing layers (MP-Adapter) and Bayesian weight consolidation for pre-trained atom/bond embedding layers (Emb-BWC), to achieve parameter-efficient tuning while preventing over-fitting and catastrophic forgetting. Additionally, we enhance the MP-Adapters with contextual perceptiveness. This innovation allows for in-context tuning of the pre-trained encoder, thereby improving its adaptability for specific FSMPP tasks. When evaluated on public datasets, our method demonstrates superior tuning with fewer trainable parameters, improving few-shot predictive performance.
[AI-128] Designing a Robust Radiology Report Generation System
链接: https://arxiv.org/abs/2411.01153
作者: Sonit Singh
关键词-EN: visual language navigation, visual question answering, natural language processing, Recent advances, radiology report generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 21 pages, 2 figures
Abstract:Recent advances in deep learning have enabled researchers to explore tasks at the intersection of computer vision and natural language processing, such as image captioning, visual question answering, visual dialogue, and visual language navigation. Taking inspiration from image captioning, the task of radiology report generation aims at automatically generating radiology reports by having a comprehensive understanding of medical images. However, automatically generating radiology reports from medical images is a challenging task due to the complexity, diversity, and nature of medical images. In this paper, we outline the design of a robust radiology report generation system by integrating different modules and highlighting best practices drawing upon lessons from our past work and also from relevant studies in the literature. We also discuss the impact of integrating different components to form a single integrated system. We believe that these best practices, when implemented, could improve automatic radiology report generation, augment radiologists in decision making, and expedite diagnostic workflow, in turn improve healthcare and save human lives.
[AI-129] Task-Aware Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning ICML
链接: https://arxiv.org/abs/2411.01146
作者: Ziqing Fan,Shengchao Hu,Yuhang Zhou,Li Shen,Ya Zhang,Yanfeng Wang,Dacheng Tao
关键词-EN: online environmental interaction, multi-task reinforcement learning, offline multi-task reinforcement, reinforcement learning, environmental interaction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extension of corresponding ICML edition arXiv:2405.18080 . arXiv admin note: substantial text overlap with arXiv:2405.18080
Abstract:The purpose of offline multi-task reinforcement learning (MTRL) is to develop a unified policy applicable to diverse tasks without the need for online environmental interaction. Recent advancements approach this through sequence modeling, leveraging the Transformer architecture’s scalability and the benefits of parameter sharing to exploit task similarities. However, variations in task content and complexity pose significant challenges in policy formulation, necessitating judicious parameter sharing and management of conflicting gradients for optimal policy performance. Furthermore, identifying the optimal parameter subspace for each task often necessitates prior knowledge of the task identifier during inference, limiting applicability in real-world scenarios with variable task content and unknown current tasks. In this work, we introduce the Harmony Multi-Task Decision Transformer (HarmoDT), a novel solution designed to identify an optimal harmony subspace of parameters for each task. We formulate this as a bi-level optimization problem within a meta-learning framework, where the upper level learns masks to define the harmony subspace, while the inner level focuses on updating parameters to improve the overall performance of the unified policy. To eliminate the need for task identifiers, we further design a group-wise variant (G-HarmoDT) that clusters tasks into coherent groups based on gradient information, and utilizes a gating network to determine task identifiers during inference. Empirical evaluations across various benchmarks highlight the superiority of our approach, demonstrating its effectiveness in the multi-task context with specific improvements of 8% gain in task-provided settings, 5% in task-agnostic settings, and 10% in unseen settings.
[AI-130] NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
链接: https://arxiv.org/abs/2411.01142
作者: Xuanlin Jiang,Yang Zhou,Shiyi Cao,Ion Stoica,Minlan Yu
关键词-EN: Online LLM inference, LLM inference powers, LLM inference, Modern LLM inference, Online LLM
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
Abstract:Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on A10G GPU.
[AI-131] Privacy-Preserving Federated Learning with Differentially Private Hyperdimensional Computing
链接: https://arxiv.org/abs/2411.01140
作者: Fardin Jalil Piran,Zhiling Chen,Mohsen Imani,Farhad Imani
关键词-EN: Internet of Things, trains Machine Learning, exchange in Internet, trains Machine, efficient data exchange
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 28 Pages, 10 Figures
Abstract:Federated Learning (FL) is essential for efficient data exchange in Internet of Things (IoT) environments, as it trains Machine Learning (ML) models locally and shares only model updates. However, FL is vulnerable to privacy threats like model inversion and membership inference attacks, which can expose sensitive training data. To address these privacy concerns, Differential Privacy (DP) mechanisms are often applied. Yet, adding DP noise to black-box ML models degrades performance, especially in dynamic IoT systems where continuous, lifelong FL learning accumulates excessive noise over time. To mitigate this issue, we introduce Federated HyperDimensional computing with Privacy-preserving (FedHDPrivacy), an eXplainable Artificial Intelligence (XAI) framework that combines the neuro-symbolic paradigm with DP. FedHDPrivacy carefully manages the balance between privacy and performance by theoretically tracking cumulative noise from previous rounds and adding only the necessary incremental noise to meet privacy requirements. In a real-world case study involving in-process monitoring of manufacturing machining operations, FedHDPrivacy demonstrates robust performance, outperforming standard FL frameworks, including Federated Averaging (FedAvg), Federated Stochastic Gradient Descent (FedSGD), Federated Proximal (FedProx), Federated Normalized Averaging (FedNova), and Federated Adam (FedAdam), by up to 38%. FedHDPrivacy also shows potential for future enhancements, such as multimodal data fusion.
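The idea of tracking cumulative noise and adding only the increment can be sketched with simple bookkeeping. The example assumes independent zero-mean Gaussian noise, so that variances add across rounds, and it assumes each round comes with a required total noise variance; both are illustrative assumptions, and the function name and inputs are hypothetical rather than taken from FedHDPrivacy.

```python
def incremental_noise_schedule(required_variances):
    """Return the extra noise variance to add in each round.

    Assumption: noise from earlier rounds stays in the model and independent
    Gaussian variances add, so only the shortfall needs to be injected.
    """
    accumulated = 0.0
    added = []
    for required in required_variances:
        extra = max(0.0, required - accumulated)  # never remove noise
        added.append(extra)
        accumulated += extra
    return added


if __name__ == "__main__":
    # Hypothetical per-round requirements on total noise variance.
    print(incremental_noise_schedule([1.0, 1.5, 1.5, 3.0]))
```

Under these assumptions a round whose requirement is already met by earlier noise adds nothing, which is how excess noise accumulation would be avoided over many rounds.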
[AI-132] Data movement limits to frontier model training
链接: https://arxiv.org/abs/2411.01137
作者: Ege Erdil,David Schneider-Joseph
关键词-EN: sparse training runs, training runs, present a theoretical, dense and sparse, training runs exceeding
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
Abstract:We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about 10^28 FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about 10^31 FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
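The figures in this abstract imply a growth rate that can be checked with a few lines of arithmetic. The sketch assumes constant exponential growth in training compute, which is a simplification made here and not a claim from the paper; only the 10^28 and 10^31 FLOP thresholds, the two-orders-of-magnitude gap, and the three-year horizon come from the abstract.

```python
import math

# Figures quoted in the abstract.
bottleneck_flop = 1e28       # utilization starts to drop beyond this
infeasible_flop = 1e31       # infeasible even at low utilization
gap_orders = 2               # bottleneck sits two orders above the largest run to date
years_to_bottleneck = 3

# Derived quantities, assuming constant exponential growth (an assumption).
largest_run_flop = bottleneck_flop / 10 ** gap_orders
annual_growth = 10 ** (gap_orders / years_to_bottleneck)
extra_orders = math.log10(infeasible_flop / bottleneck_flop)
years_to_infeasible = years_to_bottleneck + extra_orders * years_to_bottleneck / gap_orders

if __name__ == "__main__":
    print(f"implied largest run to date: {largest_run_flop:.1e} FLOP")
    print(f"implied growth per year: {annual_growth:.2f}x")
    print(f"years until the infeasible threshold: {years_to_infeasible:.1f}")
```

Closing a gap of two orders of magnitude in three years corresponds to compute growing by roughly 4.6x per year; at the same rate, the 10^31 FLOP threshold would be reached in about seven and a half years.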
[AI-133] Rule Based Rewards for Language Model Safety NEURIPS2024
链接: https://arxiv.org/abs/2411.01111
作者: Tong Mu,Alec Helyar,Johannes Heidecke,Joshua Achiam,Andrea Vallone,Ian Kivlichan,Molly Lin,Alex Beutel,John Schulman,Lilian Weng
关键词-EN: Reinforcement learning based, Reinforcement learning, learning based fine-tuning, large language models, fine-tuning of large
类目: Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024
Abstract:Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with an LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as reward directly in RL training, resulting in greater control, accuracy and ease of updating. We show that RBRs are an effective training method, achieving an F1 score of 97.1, compared to a human-feedback baseline of 91.7, resulting in much higher safety-behavior accuracy through better balancing usefulness and safety.
[AI-134] Effective ML Model Versioning in Edge Networks
链接: https://arxiv.org/abs/2411.01078
作者: Fin Gentzen,Mounir Bensalem,Admela Jukan
关键词-EN: essential version updates, Machine learning, data and software, feasible for integration, regularly updated
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: This paper is uploaded here for research community, thus it is for non-commercial purposes
Abstract:Machine learning (ML) models, data and software need to be regularly updated whenever essential version updates are released and feasible for integration. This is a basic but most challenging requirement to satisfy in the edge, due to the various system constraints and the major impact that an update can have on robustness and stability. In this paper, we formulate for the first time the ML model versioning optimization problem, and propose effective solutions, including the automation with reinforcement learning (RL) based algorithm. Without loss of generality, we choose the edge network environment due to the known constraints in performance, response time, security, and reliability. The performance study shows that ML model version updates can be fully and effectively automated with reinforcement learning method as compared to other approaches. We show that with a carefully chosen range of traffic load values, the proper versioning can improve the security, reliability and ML model accuracy, while assuring a comparably lower response time.
[AI-135] AttackQA: Development and Adoption of a Dataset for Assisting Cybersecurity Operations using Fine-tuned and Open-Source LLMs
链接: https://arxiv.org/abs/2411.01073
作者: Varun Badrinath Krishna
关键词-EN: shown improved performance, Retrieval-augmented generation, large language models, user queries, specialized domain datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
Abstract:Retrieval-augmented generation (RAG) on specialized domain datasets has shown improved performance when large language models (LLMs) are fine-tuned for generating responses to user queries. In this study, we develop a cybersecurity question-answering (Q&A) dataset, called AttackQA, and employ it to build a RAG-based Q&A system designed for analysts in security operations centers. The dataset comprises 25,335 Q&A pairs, accompanied by rationales to facilitate fine-tuning and evaluation. 80% of the dataset was generated with the help of a lightweight open-source LLM (Llama 3 8B), which produced over 1100 tokens per second with full 16-bit precision on SambaNova Systems' SN40L specialized hardware. To ensure dataset quality, we fine-tuned Llama 3 70B to detect and reject low-quality Q&A pairs. In using the dataset for RAG, we demonstrate that fine-tuning open-source embeddings and LLMs can yield superior accuracy compared to OpenAI’s state-of-the-art proprietary embedding and LLM (GPT-4o). Furthermore, we use Llama 3.1 405B as a judge to evaluate answer correctness, enabling the creation of a fully open-source, high-speed RAG and evaluation pipeline with a benchmark for model accuracy.
[AI-136] InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation
链接: https://arxiv.org/abs/2411.01063
作者: Marcos Macedo,Yuan Tian,Pengyu Nie,Filipe R. Cogo,Bram Adams
关键词-EN: Code translation aims, Code translation, aims to convert, convert a program, Code
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
Abstract:Code translation aims to convert a program from one programming language (PL) to another. This long-standing software engineering task is crucial for modernizing legacy systems, ensuring cross-platform compatibility, enhancing performance, and more. However, automating this process remains challenging due to many syntactic and semantic differences between PLs. Recent studies show that even advanced techniques such as large language models (LLMs), especially open-source LLMs, still struggle with the task. Currently, code LLMs are trained with source code from multiple programming languages, thus presenting multilingual capabilities. In this paper, we investigate whether such multilingual capabilities can be harnessed to enhance code translation. To achieve this goal, we introduce InterTrans, an LLM-based automated code translation approach that, in contrast to existing approaches, leverages intermediate translations across PLs to bridge the syntactic and semantic gaps between source and target PLs. InterTrans contains two stages. It first utilizes a novel Tree of Code Translation (ToCT) algorithm to plan transitive intermediate translation sequences between a given source and target PL, then validates them in a specific order. We evaluate InterTrans with three open LLMs on three benchmarks (i.e., CodeNet, HumanEval-X, and TransCoder) involving six PLs. Results show an absolute improvement of 18.3% to 43.3% in Computation Accuracy (CA) for InterTrans over Direct Translation with 10 attempts. The best-performing variant of InterTrans (with Magicoder LLM) achieved an average CA of 87.3%-95.4% on three benchmarks.
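To make the planning stage more concrete, the sketch below enumerates candidate translation sequences from a source language to a target language through a bounded number of intermediate languages, shortest sequences first. It is a plain enumeration written for illustration: the function name, the depth limit parameter, and the shortest-first ordering are assumptions, and the abstract does not describe how ToCT actually orders, prunes, or validates its sequences.

```python
from itertools import permutations


def plan_translation_paths(source, target, languages, max_intermediates):
    """List translation sequences source -> ... -> target.

    Hypothetical planner: tries every ordered choice of up to
    max_intermediates intermediate languages, shortest paths first.
    """
    pool = [lang for lang in languages if lang not in (source, target)]
    paths = []
    for k in range(max_intermediates + 1):
        for mids in permutations(pool, k):
            paths.append((source, *mids, target))
    return paths


if __name__ == "__main__":
    for path in plan_translation_paths(
        "Python", "Java", ["Python", "Java", "Go", "Rust"], max_intermediates=2
    ):
        print(" -> ".join(path))
```

With two extra languages and a depth limit of two, this yields the direct translation plus four transitive sequences; each sequence would then be attempted and checked until one produces a translation that passes validation.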
[AI-137] Combining Physics-based and Data-driven Modeling for Building Energy Systems
链接: https://arxiv.org/abs/2411.01055
作者: Leandro Von Krannichfeldt,Kristina Orehounig,Olga Fink
关键词-EN:
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-138] BACSA: A Bias-Aware Client Selection Algorithm for Privacy-Preserving Federated Learning in Wireless Healthcare Networks
链接: https://arxiv.org/abs/2411.01050
作者: Sushilkumar Yadav,Irem Bor-Yaliniz
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
[AI-139] Exploratory Models of Human-AI Teams: Leveraging Human Digital Twins to Investigate Trust Development
链接: https://arxiv.org/abs/2411.01049
作者: Daniel Nguyen,Myke C. Cohen,Hsien-Te Kao,Grant Engberson,Louis Penafiel,Spencer Lynch,Svitlana Volkova
关键词-EN: measuring HAT effectiveness, HAT, modeling HAT behaviors, continues to grow, continue to develop
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: in review; submitted to Interaction Studies
Abstract:As human-agent teaming (HAT) research continues to grow, computational methods for modeling HAT behaviors and measuring HAT effectiveness also continue to develop. One rising method involves the use of human digital twins (HDT) to approximate human behaviors and socio-emotional-cognitive reactions to AI-driven agent team members. In this paper, we address three research questions relating to the use of digital twins for modeling trust in HATs. First, to address the question of how we can appropriately model and operationalize HAT trust through HDT HAT experiments, we conducted causal analytics of team communication data to understand the impact of empathy, socio-cognitive, and emotional constructs on trust formation. Additionally, we reflect on the current state of the HAT trust science to discuss characteristics of HAT trust that must be replicable by a HDT such as individual differences in trust tendencies, emergent trust patterns, and appropriate measurement of these characteristics over time. Second, to address the question of how valid measures of HDT trust are for approximating human trust in HATs, we discuss the properties of HDT trust: self-report measures, interaction-based measures, and compliance type behavioral measures. Additionally, we share results of preliminary simulations comparing different LLM models for generating HDT communications and analyze their ability to replicate human-like trust dynamics. Third, to address how HAT experimental manipulations will extend to human digital twin studies, we share experimental design focusing on propensity to trust for HDTs vs. transparency and competency-based trust for AI agents.
[AI-140] Introduction to AI Safety, Ethics, and Society
链接: https://arxiv.org/abs/2411.01042
作者: Dan Hendrycks
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 603 pages
[AI-141] Semi-Strongly solved: a New Definition Leading Computer to Perfect Gameplay
链接: https://arxiv.org/abs/2411.01029
作者: Hiroki Takizawa
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:
[AI-142] Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs
链接: https://arxiv.org/abs/2411.01023
作者: Gerard Pons,Besim Bilalli,Anna Queralt
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: Pre-print submitted to Knowledge-Based Systems
[AI-143] MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
链接: https://arxiv.org/abs/2411.01016
作者: Cheng Yang,Yang Sui,Jinqi Xiao,Lingyi Huang,Yu Gong,Yuanlin Duan,Wenqi Jia,Miao Yin,Yu Cheng,Bo Yuan
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-144] A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data
链接: https://arxiv.org/abs/2411.01013
作者: Ismail Hakki Karaman,Gulser Koksal,Levent Eriskin,Salih Salihoglu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-145] Taking AI Welfare Seriously
链接: https://arxiv.org/abs/2411.00986
作者: Robert Long,Jeff Sebo,Patrick Butlin,Kathleen Finlinson,Kyle Fish,Jacqueline Harding,Jacob Pfau,Toni Sims,Jonathan Birch,David Chalmers
关键词-EN: realistic possibility, systems, future, welfare, issue
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood, i.e. of AI systems with their own interests and moral significance, is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern. To be clear, our argument in this report is not that AI systems definitely are, or will be, conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.
[AI-146] Improving How Agents Cooperate: Attention Schemas in Artificial Neural Networks
链接: https://arxiv.org/abs/2411.00983
作者: Kathryn T. Farrell,Kirsten Ziman,Michael S. A. Graziano
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-147] Incremental IVF Index Maintenance for Streaming Vector Search
链接: https://arxiv.org/abs/2411.00970
作者: Jason Mohoney,Anil Pacaci,Shihabur Rahman Chowdhury,Umar Farooq Minhas,Jeffery Pound,Cedric Renggli,Nima Reyhani,Ihab F. Ilyas,Theodoros Rekatsinas,Shivaram Venkataraman
关键词-EN:
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 14 figures
[AI-148] Scalable AI Framework for Defect Detection in Metal Additive Manufacturing
链接: https://arxiv.org/abs/2411.00960
作者: Duy Nhat Phan,Sushant Jha,James P. Mavo,Erin L. Lanigan,Linh Nguyen,Lokendra Poudel,Rahul Bhowmik
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 29 pages
[AI-149] From Fake Perfects to Conversational Imperfects: Exploring Image-Generative AI as a Boundary Object for Participatory Design of Public Spaces
链接: https://arxiv.org/abs/2411.00949
作者: Jose A. Guridi,Angel Hsing-Chi Hwang,Duarte Santo,Maria Goula,Cristobal Cheyre,Lee Humphreys,Marco Rangel
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Forthcoming in the Proceedings of the 2025 Conference on Computer Supported Cooperative Work and Social Computing (CSCW)
[AI-150] Generative Memesis: AI Mediates Political Memes in the 2024 USA Presidential Election
链接: https://arxiv.org/abs/2411.00934
作者: Ho-Chun Herbert Chang,Benjamin Shaman,Yung-chun Chen,Mingyue Zha,Sean Noh,Chiyu Wei,Tracy Weener,Maya Magee
关键词-EN:
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
[AI-151] LLMs: A Game-Changer for Software Engineers?
链接: https://arxiv.org/abs/2411.00932
作者: Md Asraful Haque
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 20 pages, 7 figures, 3 tables
[AI-152] Comparative Evaluation of Applicability Domain Definition Methods for Regression Models
链接: https://arxiv.org/abs/2411.00920
作者: Shakir Khurshid,Bharath Kumar Loganathan,Matthieu Duvinage
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-153] Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering
链接: https://arxiv.org/abs/2411.00916
作者: Mehdi Hosseini Chagahi,Saeed Mohammadi Dashtaki,Niloufar Delfan,Nadia Mohammadi,Alireza Samari,Behzad Moshiri,Md. Jalil Piran,U. Rajendra Acharya,Oliver Faust
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-154] V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
链接: https://arxiv.org/abs/2411.00915
作者: Liang Mi,Weijun Wang,Wenming Tu,Qingfeng He,Rui Kong,Xinyu Fang,Yazhu Dong,Yikang Zhang,Yunchun Li,Meng Li,Haipeng Dai,Guihai Chen,Yunxin Liu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-155] AAD-LLM: Adaptive Anomaly Detection Using Large Language Models
链接: https://arxiv.org/abs/2411.00914
作者: Alicia Russell-Gilbert,Alexander Sommers,Andrew Thompson,Logan Cummins,Sudip Mittal,Shahram Rahimi,Maria Seale,Joseph Jaboure,Thomas Arnold,Joshua Church
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:
[AI-156] Ratio law: mathematical descriptions for a universal relationship between AI performance and input samples
链接: https://arxiv.org/abs/2411.00913
作者: Boming Kang,Qinghua Cui
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 5 figures, 4 Tables
[AI-157] On the Impact of White-box Deployment Strategies for Edge AI on Latency and Model Performance
链接: https://arxiv.org/abs/2411.00907
作者: Jaskirat Singh,Bram Adams,Ahmed E. Hassan
关键词-EN:
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:
[AI-158] Similarity and Dissimilarity Guided Co-association Matrix Construction for Ensemble Clustering
链接: https://arxiv.org/abs/2411.00904
作者: Xu Zhang,Yuheng Jia,Mofei Song,Ran Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-159] Differentiable architecture search with multi-dimensional attention for spiking neural networks
链接: https://arxiv.org/abs/2411.00902
作者: Yilei Man,Linhai Xie,Shushan Qiao,Yumei Zhou,Delong Shang
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-160] Certified Robustness for Deep Equilibrium Models via Serialized Random Smoothing NEURIPS2024
链接: https://arxiv.org/abs/2411.00899
作者: Weizhi Gao,Zhichao Hou,Han Xu,Xiaorui Liu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 25 figures, NeurIPS 2024 accepted
[AI-161] Replace-then-Perturb: Targeted Adversarial Attacks With Visual Reasoning for Vision-Language Models
链接: https://arxiv.org/abs/2411.00898
作者: Jonggyu Jang,Hyeonsu Lyu,Jungyeon Koh,Hyun Jong Yang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 13 pages, 5 figures
[AI-162] Measuring Responsibility in Multi-Agent Systems
链接: https://arxiv.org/abs/2411.00887
作者: Chunyan Mu,Nir Oren
关键词-EN:
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
[AI-163] Revolutionizing Personalized Cancer Vaccines with NEO: Novel Epitope Optimization Using an Aggregated Feed Forward and Recurrent Neural Network with LSTM Architecture
链接: https://arxiv.org/abs/2411.00885
作者: Nishanth Basava
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:
[AI-164] Resilience to the Flowing Unknown: an Open Set Recognition Framework for Data Streams
链接: https://arxiv.org/abs/2411.00876
作者: Marcos Barcina-Blanco,Jesus L. Lobo,Pablo Garcia-Bringas,Javier Del Ser
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 3 figures; an updated version of this article is published in LNAI, volume 14857, as part of the conference proceedings of HAIS 2024
[AI-165] VecCity: A Taxonomy-guided Library for Map Entity Representation Learning
链接: https://arxiv.org/abs/2411.00874
作者: Wentao Zhang,Jingyuan Wang,Yifan Yang,Leong Hou U
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: under review
[AI-166] CleaR: Towards Robust and Generalized Parameter-Efficient Fine-Tuning for Noisy Label Learning ACL2024
链接: https://arxiv.org/abs/2411.00873
作者: Yeachan Kim,Junho Kim,SangKeun Lee
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at ACL 2024 Main Conference
[AI-167] LLaMo: Large Language Model-based Molecular Graph Assistant NEURIPS2024
链接: https://arxiv.org/abs/2411.00871
作者: Jinyoung Park,Minseong Bae,Dohwan Ko,Hyunwoo J. Kim
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
*备注: NeurIPS 2024
[AI-168] Emory Knee Radiograph (MRKR) Dataset
链接: https://arxiv.org/abs/2411.00866
作者: Brandon Price,Jason Adleberg,Kaesha Thomas,Zach Zaiman,Aawez Mansuri,Beatrice Brown-Mulry,Chima Okecheukwu,Judy Gichoya,Hari Trivedi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages
[AI-169] Advancing Crime Linkage Analysis with Machine Learning: A Comprehensive Review and Framework for Data-Driven Approaches
链接: https://arxiv.org/abs/2411.00864
作者: Vinicius Lima,Umit Karabiyik
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
[AI-170] A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage
链接: https://arxiv.org/abs/2411.00862
作者: Levi Harris
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-171] Profiling AI Models: Towards Efficient Computation Offloading in Heterogeneous Edge AI Systems
链接: https://arxiv.org/abs/2411.00859
作者: Juan Marcelo Parra-Ullauri,Oscar Dilley,Hari Madhukumar,Dimitra Simeonidou
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
*备注:
[AI-172] DiabML: AI-assisted diabetes diagnosis method with meta-heuristic-based feature selection
链接: https://arxiv.org/abs/2411.00858
作者: Vahideh Hayyolalam,Öznur Özkasap
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Proceedings of 14th Turkish Congress of Medical Informatics 16 (18), 19-30; this https URL
[AI-173] AI in Investment Analysis: LLMs for Equity Stock Ratings
链接: https://arxiv.org/abs/2411.00856
作者: Kassiani Papasotiriou,Srijan Sood,Shayleen Reynolds,Tucker Balch
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: 9 pages, 5 figures, ICAIF24: 5th ACM International Conference on AI in Finance
[AI-174] FPE-LLM: Highly Intelligent Time-Series Forecasting and Language Interaction LLM in Energy Systems
链接: https://arxiv.org/abs/2411.00852
作者: Zihang Qiu,Chaojie Li,Zhongyang Wang,Huadong Mo,Renyou Xie,Guo Chen,Zhaoyang Dong
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-175] Evaluating Evidential Reliability In Pattern Recognition Based On Intuitionistic Fuzzy Sets
链接: https://arxiv.org/abs/2411.00848
作者: Juntao Xu,Tianxiang Zhan,Yong Deng
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 35 pages
[AI-176] End-to-end Graph Learning Approach for Cognitive Diagnosis of Student Tutorial
链接: https://arxiv.org/abs/2411.00845
作者: Fulai Yang,Di Wu,Yi He,Li Tao,Xin Luo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
[AI-177] Extralonger: Toward a Unified Perspective of Spatial-Temporal Factors for Extra-Long-Term Traffic Forecasting NEURIPS2024
链接: https://arxiv.org/abs/2411.00844
作者: Zhiwei Zhang,Shaojun E,Fandong Meng,Jie Zhou,Wenjuan Han
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS2024 workshop
[AI-178] Peri-AIIMS: Perioperative Artificial Intelligence Driven Integrated Modeling of Surgeries using Anesthetic Physical and Cognitive Statuses for Predicting Hospital Outcomes
链接: https://arxiv.org/abs/2411.00840
作者: Sabyasachi Bandyopadhyay,Jiaqing Zhang,Ronald L. Ison,David J. Libon,Patrick Tighe,Catherine Price,Parisa Rashidi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
[AI-179] CausAdv: A Causal-based Framework for Detecting Adversarial Examples
链接: https://arxiv.org/abs/2411.00839
作者: Hichem Debbi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
[AI-180] Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment
链接: https://arxiv.org/abs/2411.00838
作者: Jiaqi Wu,Simin Chen,Zehua Wang,Wei Chen,Zijian Tian,F. Richard Yu,Victor C. M. Leung
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-181] Longitudinal Mammogram Exam-based Breast Cancer Diagnosis Models: Vulnerability to Adversarial Attacks
链接: https://arxiv.org/abs/2411.00837
作者: Zhengbo Zhou,Degan Hao,Dooman Arefan,Margarita Zuley,Jules Sumkin,Shandong Wu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-182] Saliency-Based diversity and fairness Metric and FaceKeepOriginalAugment: A Novel Approach for Enhancing Fairness and Diversity
链接: https://arxiv.org/abs/2411.00831
作者: Teerath Kumar,Alessandra Mileo,Malika Bendechache
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Paper is under review in the Image and Vision Computing journal special issue: Advancing Transparency and Privacy: Explainable AI and Synthetic Data in Biometrics and Computer Vision
[AI-183] IDEATOR: Jailbreaking VLMs Using VLMs
链接: https://arxiv.org/abs/2411.00827
作者: Ruofan Wang,Bo Wang,Xingjun Ma,Yu-Gang Jiang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-184] Uncertainty Quantification via Hölder Divergence for Multi-View Representation Learning
链接: https://arxiv.org/abs/2411.00826
作者: an Zhang,Ming Li,Chun Li,Zhaoxia Liu,Ye Zhang,Fei Richard Yu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: NA
[AI-185] EEG-based Multimodal Representation Learning for Emotion Recognition
链接: https://arxiv.org/abs/2411.00822
作者: Kang Yin,Hye-Bin Shin,Dan Li,Seong-Whan Lee
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
[AI-186] On the Black-box Explainability of Object Detection Models for Safe and Trustworthy Industrial Applications
链接: https://arxiv.org/abs/2411.00818
作者: Alain Andres,Aitor Martinez-Seras,Ibai Laña,Javier Del Ser
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:
[AI-187] An Improved Chicken Swarm Optimization Algorithm for Handwritten Document Image Enhancement
链接: https://arxiv.org/abs/2411.00802
作者: Stanley Mugisha,Lynn tar Gutu,P Nagabhushan
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, conference
[AI-188] Sentiment Analysis Based on RoBERTa for Amazon Review: An Empirical Study on Decision Making
链接: https://arxiv.org/abs/2411.00796
作者: Xinli Guo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: Master’s thesis
[AI-189] IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI
链接: https://arxiv.org/abs/2411.00785
作者: Xiaoyu Chen,Junliang Guo,Tianyu He,Chuheng Zhang,Pushi Zhang,Derek Cathera Yang,Li Zhao,Jiang Bian
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
[AI-190] From chalkboards to chatbots: SELAR assists teachers in embracing AI in the curriculum
链接: https://arxiv.org/abs/2411.00783
作者: Hani Alers,Aleksandra Malinowska,Mathis Mourey,Jasper Waaijer
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 19 pages, 2 figures
[AI-191] TradExpert: Revolutionizing Trading with Mixture of Expert LLMs
链接: https://arxiv.org/abs/2411.00782
作者: Qianggang Ding,Haochen Shi,Bang Liu
关键词-EN:
类目: Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST)
*备注:
[AI-192] Neural Collaborative Filtering to Detect Anomalies in Human Semantic Trajectories
链接: https://arxiv.org/abs/2409.18427
作者: Yueyang Liu,Lance Kennedy,Hossein Amiri,Andreas Züfle
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: Accepted for publication in the 1st ACM SIGSPATIAL International Workshop on Geospatial Anomaly Detection (GeoAnomalies’24)
[AI-193] Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis on the role of model complexity
链接: https://arxiv.org/abs/2411.02184
作者: Mouïn Ben Ammar,David Brellmann,Arturo Mendoza,Antoine Manzanera,Gianni Franchi
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
[AI-194] MBDRes-U-Net: Multi-Scale Lightweight Brain Tumor Segmentation Network
链接: https://arxiv.org/abs/2411.01896
作者: Longfeng Shen,Yanqi Hou,Jiacong Chen,Liangjin Diao,Yaxi Duan
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Brain tumor segmentation, lightweight model, Brain Tumor Segmentation (BraTS) Challenge, group convolution
[AI-195] Symmetry Adapted Residual Neural Network Diabatization: Conical Intersections in Aniline Photodissociation
链接: https://arxiv.org/abs/2411.01702
作者: Yifan Shen,David Yarkony
关键词-EN:
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
*备注:
[AI-196] Counterfactual explainability of black-box prediction models
链接: https://arxiv.org/abs/2411.01625
作者: Zijun Gao,Qingyuan Zhao
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 3 figures
[AI-197] Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design
链接: https://arxiv.org/abs/2411.01423
作者: Onur Boyar,Hiroyuki Hanada,Ichiro Takeuchi
关键词-EN:
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 10 figures, 4 tables
[AI-198] Pre-trained Molecular Language Models with Random Functional Group Masking
链接: https://arxiv.org/abs/2411.01401
作者: Tianhao Peng,Yuchen Li,Xuhong Li,Jiang Bian,Zeke Xie,Ning Sui,Shahid Mumtaz,Yanwu Xu,Linghe Kong,Haoyi Xiong
关键词-EN:
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: Under review
[AI-199] Scaling Laws with Hidden Structure
链接: https://arxiv.org/abs/2411.01375
作者: Charles Arnald,Clement Berenfeld,Simon Rosenberg,Vivien Cabannes
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-200] Spatial Transformers for Radio Map Estimation
链接: https://arxiv.org/abs/2411.01211
作者: Pham Q. Viet,Daniel Romero
关键词-EN:
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
[AI-201] LEARNER: Learning Granular Labels from Coarse Labels using Contrastive Learning
链接: https://arxiv.org/abs/2411.01144
作者: Gautam Gare,Jana Armouti,Nikhil Madaan,Rohan Panda,Tom Fox,Laura Hutchins,Amita Krishnan,Ricardo Rodriguez,Bennett DeBoisblanc,Deva Ramanan,John Galeotti
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Under review at ISBI 2025 conference
[AI-202] Artificial Intelligence for Microbiology and Microbiome Research
链接: https://arxiv.org/abs/2411.01098
作者: Xu-Wen Wang,Tong Wang,Yang-Yu Liu
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
[AI-203] Practical hybrid PQC-QKD protocols with enhanced security and performance
链接: https://arxiv.org/abs/2411.01086
作者: Pei Zeng,Debayan Bandyopadhyay,José A. Méndez Méndez,Nolan Bitner,Alexander Kolar,Michael T. Solomon,Filip Rozpedek,Tian Zhong,F. Joseph Heremans,David D. Awschalom,Liang Jiang,Junyu Liu
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 6 pages, 3 figures
[AI-204] Towards efficient and secure quantum-classical communication networks
链接: https://arxiv.org/abs/2411.01081
作者: Pei Zeng,Debayan Bandyopadhyay,Jose A. Mendez,Nolan Bitner,Alexander Kolar,Michael T. Solomon,F. Joseph Heremans,David D. Awschalom,Liang Jiang,Junyu Liu
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 4 pages, a blueprint paper, Submission for IEEE 2024 IEEE Workshop on Quantum IntelLigence, Learning Security (QUILLS), this https URL
[AI-205] Evaluation Metric for Quality Control and Generative Models in Histopathology Images
链接: https://arxiv.org/abs/2411.01034
作者: Pranav Jeevan,Neeraj Nixon,Abhijeet Patil,Amit Sethi
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 7 pages, 5 figures
[AI-206] Internship Report: Benchmark of Deep Learning-based Imaging PPG in Automotive Domain
链接: https://arxiv.org/abs/2411.00919
作者: Yuqi Tu,Shakith Fernando,Mark van Gastel
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Internship Report
[AI-207] Deep Learning Predicts Mammographic Breast Density in Clinical Breast Ultrasound Images
链接: https://arxiv.org/abs/2411.00891
作者: Arianna Bunnell,Thomas Wolfgruber,Brandon Quon,Kailee Hung,Brenda Hernandez,Peter Sadowski,John A. Shepherd
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-208] Enhancing Brain Tumor Classification Using TrAdaBoost and Multi-Classifier Deep Learning Approaches
链接: https://arxiv.org/abs/2411.00875
作者: Mahin Mohammadi,Saman Jamshidi
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-209] Advanced Hybrid Deep Learning Model for Enhanced Classification of Osteosarcoma Histopathology Images
链接: https://arxiv.org/abs/2411.00832
作者: Arezoo Borji,Gernot Kronreif,Bernhard Angermayr,Sepideh Hatamikia
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:
[AI-210] Unsupervised Training of a Dynamic Context-Aware Deep Denoising Framework for Low-Dose Fluoroscopic Imaging
链接: https://arxiv.org/abs/2411.00830
作者: Sun-Young Jeon,Sen Wang,Adam S. Wang,Garry E. Gold,Jang-Hwan Choi
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 10 figures
计算机视觉
[CV-0] Adaptive Caching for Faster Video Generation with Diffusion Transformers
链接: https://arxiv.org/abs/2411.02397
作者: Kumara Kahatapitiya,Haozhe Liu,Sen He,Ding Liu,Menglin Jia,Michael S. Ryoo,Tian Xie
关键词-EN: Generating temporally-consistent high-fidelity, longer temporal spans, Generating temporally-consistent, temporally-consistent high-fidelity videos, computationally expensive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project-page is available at this https URL
Abstract:Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) – despite making significant headway in this context – have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that “not all videos are created equal”: meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines.
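The caching idea in this abstract can be illustrated with a toy loop. The sketch below is not the AdaCache implementation: the scalar "block", the relative-change metric, and the threshold table are all hypothetical choices made for illustration. It only shows the general pattern of reusing a cached result for more steps when consecutive computed results change little.

```python
def skip_length(change, thresholds=((0.05, 4), (0.15, 2))):
    """Map a measured change to a number of steps that reuse the cache.

    The threshold table is a made-up example, not taken from the paper.
    """
    for limit, steps in thresholds:
        if change < limit:
            return steps
    return 0


def run_with_adaptive_cache(expensive_block, inputs):
    """Run expensive_block over inputs, reusing cached outputs when stable."""
    cached = None      # last computed output
    previous = None    # output of the previous *computed* step
    reuse_left = 0     # how many upcoming steps may reuse the cache
    outputs, computed = [], 0
    for x in inputs:
        if cached is not None and reuse_left > 0:
            outputs.append(cached)   # cache hit: skip the computation
            reuse_left -= 1
            continue
        current = expensive_block(x)
        computed += 1
        if previous is not None:
            # Relative change between consecutive computed outputs.
            change = abs(current - previous) / (abs(previous) + 1e-8)
            reuse_left = skip_length(change)
        previous = current
        cached = current
        outputs.append(current)
    return outputs, computed


# Toy usage: a scalar identity "block" over eight denoising steps.
toy_inputs = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0]
toy_outputs, toy_computed = run_with_adaptive_cache(lambda x: x, toy_inputs)
```

In this toy run the block is evaluated on only some of the eight steps. A real video DiT would cache tensors per layer, and a motion-aware variant would presumably adjust the thresholds according to motion content; both are outside this sketch.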
[CV-1] Training-free Regional Prompting for Diffusion Transformers
链接: https://arxiv.org/abs/2411.02395
作者: Anthony Chen,Jianjin Xu,Wenzhao Zheng,Gaole Dai,Yida Wang,Renrui Zhang,Haofan Wang,Shanghang Zhang
关键词-EN: demonstrated excellent capabilities, demonstrated excellent, excellent capabilities, recent Diffusion Transformer, Diffusion Transformer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code is available at this https URL
Abstract:Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. Although many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which gives DiT fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at this https URL.
[CV-2] AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
链接: https://arxiv.org/abs/2411.02394
作者: Hao-Yu Hsu,Zhi-Hao Lin,Albert Zhai,Hongchi Xia,Shenlong Wang
关键词-EN: Modern visual effects, Modern visual, software has made, skilled artists, imagery of virtually
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL
Abstract:Modern visual effects (VFX) software has made it possible for skilled artists to create imagery of virtually anything. However, the creation process remains laborious, complex, and largely inaccessible to everyday users. In this work, we present AutoVFX, a framework that automatically creates realistic and dynamic VFX videos from a single video and natural language instructions. By carefully integrating neural scene modeling, LLM-based code generation, and physical simulation, AutoVFX is able to provide physically-grounded, photorealistic editing effects that can be controlled directly using natural language instructions. We conduct extensive experiments to validate AutoVFX’s efficacy across a diverse spectrum of videos and instructions. Quantitative and qualitative results suggest that AutoVFX outperforms all competing methods by a large margin in generative quality, instruction alignment, editing versatility, and physical plausibility.
[CV-3] Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis
链接: https://arxiv.org/abs/2411.02372
作者: Neel Dey,Benjamin Billot,Hallee E. Wong,Clinton J. Wang,Mengwei Ren,P. Ellen Grant,Adrian V. Dalca,Polina Golland
关键词-EN: Current volumetric biomedical, Current volumetric, volumetric biomedical foundation, foundation models struggle, anatomical regions
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Code and model weights available at this https URL
Abstract:Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network’s features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
[CV-4] Machine learning identification of maternal inflammatory response and histologic chorioamnionitis from placental membrane whole slide images
链接: https://arxiv.org/abs/2411.02354
作者: Abhishek Sharma,Ramin Nateghi,Marina Ayad,Lee A.D. Cooper,Jeffery A. Goldstein
关键词-EN: Maternal Inflammatory Response, infection through pregnancy, forms a critical, critical barrier, barrier to infection
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:
Abstract:The placenta forms a critical barrier to infection through pregnancy, labor, and delivery. Inflammatory processes in the placenta have short-term and long-term consequences for offspring health. Digital pathology and machine learning can play an important role in understanding placental inflammation, but there have been very few investigations into methods for predicting and understanding Maternal Inflammatory Response (MIR). This work investigates the potential of using machine learning to understand MIR based on whole slide images (WSI) and establishes early benchmarks. To that end, we use a Multiple Instance Learning (MIL) framework with 3 feature extractors, ImageNet-based EfficientNet-v2s and 2 histopathology foundation models, UNI and Phikon, to investigate the predictability of MIR stage from histopathology WSIs. We also interpret the predictions of these models using their learned attention maps, and we use the MIL framework to predict white blood cell count (WBC) and maximum fever temperature (T_max). Attention-based MIL models are able to classify MIR with a balanced accuracy of up to 88.5% and a Cohen's kappa (κ) of up to 0.772. Furthermore, we found that the pathology foundation models (UNI and Phikon) both achieve higher balanced accuracy and κ than the ImageNet-based feature extractor (EfficientNet-v2s). For WBC and T_max prediction, we found mild correlation between actual values and those predicted from histopathology WSIs. We further investigated model failure cases and found them to be either edge cases prone to interobserver variability, examples of pathologist's overreach, or mislabeled due to processing errors.
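The two classification metrics quoted in this abstract, balanced accuracy and Cohen's kappa, have standard definitions that are easy to compute by hand. The sketch below uses made-up labels (not data from the paper) and assumes the usual definitions: balanced accuracy as the mean per-class recall, and kappa as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical three-class labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

print(balanced_accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Balanced accuracy weights every class equally, which matters when some classes are rare, and kappa discounts the agreement a classifier would reach by chance given the label frequencies.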
[CV-5] Physically Based Neural Bidirectional Reflectance Distribution Function
Link: https://arxiv.org/abs/2411.02347
Authors: Chenliang Zhou,Alejandro Sztrajman,Gilles Rainer,Fangcheng Zhong,Fazilet Gokbudak,Zhilin Guo,Weihao Xia,Rafal Mantiuk,Cengiz Oztireli
Keywords: reflectance distribution function, bidirectional reflectance distribution, material appearance based, based neural bidirectional, physically based neural
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:We introduce the physically based neural bidirectional reflectance distribution function (PBNBRDF), a novel, continuous representation for material appearance based on neural fields. Our model accurately reconstructs real-world materials while uniquely enforcing physical properties for realistic BRDFs, specifically Helmholtz reciprocity via reparametrization and energy passivity via efficient analytical integration. We conduct a systematic analysis demonstrating the benefits of adhering to these physical laws on the visual quality of reconstructed materials. Additionally, we enhance the color accuracy of neural BRDFs by introducing chromaticity enforcement supervising the norms of RGB channels. Through both qualitative and quantitative experiments on multiple databases of measured real-world BRDFs, we show that adhering to these physical constraints enables neural fields to more faithfully and stably represent the original data and achieve higher rendering quality.
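Helmholtz reciprocity means that a BRDF returns the same value when the incoming and outgoing directions are swapped. The abstract does not describe the authors' reparametrization, so the sketch below uses a different, generic technique, averaging over the two argument orders, purely to illustrate what the constraint requires. `raw_model` is a hypothetical stand-in for an unconstrained network.

```python
import math

def raw_model(w_in, w_out):
    """Hypothetical unconstrained BRDF model; deliberately not reciprocal."""
    return (math.exp(-3.0 * w_in[0] ** 2) * (0.2 + 0.8 * w_out[2])
            + 0.1 * w_in[1] * w_out[0])

def reciprocal_brdf(w_in, w_out):
    """Symmetrized model: swapping the two directions cannot change the value."""
    return 0.5 * (raw_model(w_in, w_out) + raw_model(w_out, w_in))

# Two unit direction vectors in the upper hemisphere (illustrative values).
a = (0.6, 0.0, 0.8)
b = (0.0, 0.6, 0.8)
asymmetric_gap = abs(raw_model(a, b) - raw_model(b, a))
symmetric_gap = abs(reciprocal_brdf(a, b) - reciprocal_brdf(b, a))
```

Averaging costs two model evaluations per query; a reparametrization that feeds the model swap-invariant inputs, as the abstract suggests, would avoid that overhead.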
[CV-6] MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Link: https://arxiv.org/abs/2411.02336
Authors: Wei Cheng,Juncheng Mu,Xianfang Zeng,Xin Chen,Anqi Pang,Chi Zhang,Zhibin Wang,Bin Fu,Gang Yu,Ziwei Liu,Liang Pan
Keywords: asset production workflow, asset production, production workflow, crucial step, enhances the visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.
[CV-7] Diffusion-based Generative Multicasting with Intent-aware Semantic Decomposition
Link: https://arxiv.org/abs/2411.02334
Authors: Xinkai Liu,Mahdi Boloursaz Mashhadi,Li Qiao,Yi Ma,Rahim Tafazolli,Mehdi Bennis
Keywords: recently shown great, shown great success, Generative diffusion models, pre-trained diffusion models, synthesizing multimedia signals
Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Signal Processing (eess.SP)
Abstract:Generative diffusion models (GDMs) have recently shown great success in synthesizing multimedia signals with high perceptual quality enabling highly efficient semantic communications in future wireless networks. In this paper, we develop an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models. In the proposed framework, the transmitter decomposes the source signal to multiple semantic classes based on the multi-user intent, i.e. each user is assumed to be interested in details of only a subset of the semantic classes. The transmitter then sends to each user only its intended classes, and multicasts a highly compressed semantic map to all users over shared wireless resources that allows them to locally synthesize the other classes, i.e. non-intended classes, utilizing pre-trained diffusion models. The signal retrieved at each user is thereby partially reconstructed and partially synthesized utilizing the received semantic map. This improves utilization of the wireless resources, with better preserving privacy of the non-intended classes. We design a communication/computation-aware scheme for per-class adaptation of the communication parameters, such as the transmission power and compression rate to minimize the total latency of retrieving signals at multiple receivers, tailored to the prevailing channel conditions as well as the users reconstruction/synthesis distortion/perception requirements. The simulation results demonstrate significantly reduced per-user latency compared with non-generative and intent-unaware multicasting benchmarks while maintaining high perceptual quality of the signals retrieved at the users.
[CV-8] PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Link: https://arxiv.org/abs/2411.02327
Authors: Ruyang Liu,Haoran Tang,Haibo Liu,Yixiao Ge,Ying Shan,Chen Li,Jiankun Yang
Keywords: video-based large language, large language models, past year, year has witnessed, witnessed the significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:The past year has witnessed the significant advancement of video-based large language models. However, the challenge of developing a unified model for both short and long video understanding remains unresolved. Most existing video LLMs cannot handle hour-long videos, while methods custom for long videos tend to be ineffective for shorter videos and images. In this paper, we identify the key issue as the redundant content in videos. To address this, we propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation. Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short. Specifically, PPLLaVA consists of three core components: the CLIP-based visual-prompt alignment that extracts visual information relevant to the user’s instructions, the prompt-guided pooling that compresses the visual sequence to arbitrary scales using convolution-style pooling, and the clip context extension designed for lengthy prompt common in visual dialogue. Moreover, our codebase also integrates the most advanced video Direct Preference Optimization (DPO) and visual interleave training. Extensive experiments have validated the performance of our model. With superior throughput and only 1024 visual context, PPLLaVA achieves better results on image benchmarks as a video LLM, while achieving state-of-the-art performance across various video benchmarks, excelling in tasks ranging from caption generation to multiple-choice questions, and handling video lengths from seconds to hours. Codes have been available at this https URL.
[CV-9] Grouped Discrete Representation for Object-Centric Learning
Link: https://arxiv.org/abs/2411.02299
Authors: Rongzhen Zhao,Vivienne Wang,Juho Kannala,Joni Pajarinen
Keywords: Object-Centric Learning, Variational Autoencoder, images or videos, videos by simply, simply reconstructing
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:Object-Centric Learning (OCL) can discover objects in images or videos by simply reconstructing the input. For better object discovery, representative OCL methods reconstruct the input as its Variational Autoencoder (VAE) intermediate representation, which suppresses pixel noises and promotes object separability by discretizing continuous super-pixels with template features. However, treating features as units overlooks their composing attributes, thus impeding model generalization; indexing features with scalar numbers loses attribute-level similarities and differences, thus hindering model convergence. We propose Grouped Discrete Representation (GDR) for OCL. We decompose features into combinatorial attributes via organized channel grouping, and compose these attributes into discrete representation via tuple indexes. Experiments show that our GDR improves both Transformer- and Diffusion-based OCL methods consistently on various datasets. Visualizations show that our GDR captures better object separability.
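The contrast between scalar indexes and tuple indexes can be illustrated with a toy quantizer. This is a minimal sketch in plain Python, assuming equal-size channel groups and a nearest-neighbor lookup in a small per-group codebook; the function names, codebooks, and feature values are hypothetical and not taken from the paper.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def grouped_index(feature, codebooks):
    """Split the channels into equal-size groups and quantize each group separately."""
    group_size = len(feature) // len(codebooks)
    indexes = []
    for g, codebook in enumerate(codebooks):
        chunk = feature[g * group_size:(g + 1) * group_size]
        indexes.append(nearest(codebook, chunk))
    return tuple(indexes)

# One small attribute codebook per channel group (illustrative values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],              # attribute group 0
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],  # attribute group 1
]
index_a = grouped_index([0.9, 1.1, 0.1, 0.9], codebooks)
index_b = grouped_index([1.0, 0.8, 0.9, 0.2], codebooks)
```

Here the two features map to (1, 0) and (1, 1). They share the first attribute index, a similarity that a single scalar code per feature would not expose.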
[CV-10] Conformal-in-the-Loop for Learning with Imbalanced Noisy Data
Link: https://arxiv.org/abs/2411.02281
Authors: John Brandon Graham-Knight,Jamil Fayyad,Nourhan Bayasi,Patricia Lasserre,Homayoun Najjaran
Keywords: research assumes well-labeled, real world conditions, machine learning research, learning research assumes, rarely reflects real
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review
Abstract:Class imbalance and label noise are pervasive in large-scale datasets, yet much of machine learning research assumes well-labeled, balanced data, which rarely reflects real world conditions. Existing approaches typically address either label noise or class imbalance in isolation, leading to suboptimal results when both issues coexist. In this work, we propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach. CitL evaluates sample uncertainty to adjust weights and prune unreliable examples, enhancing model resilience and accuracy with minimal computational cost. Our extensive experiments include a detailed analysis showing how CitL effectively emphasizes impactful data in noisy, imbalanced datasets. Our results show that CitL consistently boosts model performance, achieving up to a 6.1% increase in classification accuracy and a 5.0 mIoU improvement in segmentation. Our code is publicly available: CitL.
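Conformal prediction turns a classifier's probabilities into a set of plausible labels, and the size of that set is a natural uncertainty signal. The sketch below shows standard split conformal prediction in plain Python. How CitL maps this uncertainty to sample weights or pruning decisions is not stated in the abstract, so that step is left out; all names and numbers here are illustrative.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a threshold on the nonconformity score 1 - p(true class)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the quantile
    if rank > n:
        return 1.0  # too few calibration points: every class gets included
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration data: class probabilities and true labels.
cal_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.3, 0.1, 0.6], [0.3, 0.5, 0.2]]
cal_labels = [0, 1, 2, 0]
q = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident_set = prediction_set([0.8, 0.15, 0.05], q)
ambiguous_set = prediction_set([0.45, 0.4, 0.15], q)
```

The confident example yields a single-label set, while the ambiguous one yields two labels. A training loop could treat larger sets as a sign that an example is uncertain or possibly mislabeled.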
[CV-11] Unified Speech Recognition: A Single Model for Auditory Visual and Audiovisual Inputs NEURIPS2024
Link: https://arxiv.org/abs/2411.02256
Authors: Alexandros Haliassos,Rodrigo Mira,Honglie Chen,Zoe Landgraf,Stavros Petridis,Maja Pantic
Keywords: audiovisual speech recognition, Research in auditory, speech recognition, conducted independently, audiovisual speech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024. Code: this https URL
Abstract:Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at this https URL.
[CV-12] 3D Audio-Visual Segmentation NEURIPS2024
Link: https://arxiv.org/abs/2411.02236
Authors: Artem Sokolov,Swapnil Bhosale,Xiatian Zhu
Keywords: Audio-Visual Segmentation, longstanding objective, applications in robotics, sounding objects, Recognizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at the NeurIPS 2024 Workshop on Audio Imagination
Abstract:Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: this https URL
[CV-13] FewViewGS: Gaussian Splatting with Few View Matching and Multi-stage Training NEURIPS2024
Link: https://arxiv.org/abs/2411.02229
Authors: Ruihong Yin,Vladimir Yugay,Yue Li,Sezer Karaoglu,Theo Gevers
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, Gaussian Splatting, introduction of Neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS2024
Abstract:The field of novel view synthesis from images has seen rapid advancements with the introduction of Neural Radiance Fields (NeRF) and more recently with 3D Gaussian Splatting. Gaussian Splatting became widely adopted due to its efficiency and ability to render novel views accurately. While Gaussian Splatting performs well when a sufficient amount of training images are available, its unstructured explicit representation tends to overfit in scenarios with sparse input images, resulting in poor rendering performance. To address this, we present a 3D Gaussian-based novel view synthesis method using sparse input images that can accurately render the scene from the viewpoints not covered by the training images. We propose a multi-stage training scheme with matching-based consistency constraints imposed on the novel views without relying on pre-trained depth estimation or diffusion models. This is achieved by using the matches of the available training images to supervise the generation of the novel views sampled between the training frames with color, geometry, and semantic losses. In addition, we introduce a locality preserving regularization for 3D Gaussians which removes rendering artifacts by preserving the local color structure of the scene. Evaluation on synthetic and real-world datasets demonstrates competitive or superior performance of our method in few-shot novel view synthesis compared to existing state-of-the-art methods.
[CV-14] SIRA: Scalable Inter-frame Relation and Association for Radar Perception CVPR2024
Link: https://arxiv.org/abs/2411.02220
Authors: Ryoma Yataka,Pu Perry Wang,Petros Boufounos,Ryuhei Takahashi
Keywords: Conventional radar feature, Conventional radar, extraction faces limitations, faces limitations due, low spatial resolution
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 25 pages, Accepted to CVPR2024
Abstract:Conventional radar feature extraction faces limitations due to low spatial resolution, noise, multipath reflection, the presence of ghost targets, and motion blur. Such limitations can be exacerbated by nonlinear object motion, particularly from an ego-centric viewpoint. It becomes evident that to address these challenges, the key lies in exploiting temporal feature relation over an extended horizon and enforcing spatial motion consistency for effective association. To this end, this paper proposes SIRA (Scalable Inter-frame Relation and Association) with two designs. First, inspired by Swin Transformer, we introduce extended temporal relation, generalizing the existing temporal relation layer from two consecutive frames to multiple inter-frames with temporally regrouped window attention for scalability. Second, we propose motion consistency track with the concept of a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association. Our approach achieves 58.11 mAP@0.5 for oriented object detection and 47.79 MOTA for multiple object tracking on the Radiate dataset, surpassing previous state-of-the-art by a margin of +4.11 mAP@0.5 and +9.94 MOTA, respectively.
[CV-15] One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering
Link: https://arxiv.org/abs/2411.02210
Authors: Deepayan Das,Davide Talon,Massimiliano Mancini,Yiming Wang,Elisa Ricci
Keywords: Visual Question Answering, web-scale multimodal datasets, shown significant promise, leveraging web-scale multimodal, promise in Visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks. As an effective remedy to mitigate catastrophic forgetting, rehearsal strategy uses the data of past tasks upon learning new task. However, such strategy incurs the need of storing past data, which might not be feasible due to hardware constraints or privacy concerns. In this work, we propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models, to produce pseudo-rehearsal data for addressing continual VQA. Our proposal, named as GaB, generates pseudo-rehearsal data by posing previous task questions on new task data. Yet, despite being effective, the distribution of generated questions skews towards the most frequently posed questions due to the limited and task-specific training data. To mitigate this issue, we introduce a pseudo-rehearsal balancing module that aligns the generated data towards the ground-truth data distribution using either the question meta-statistics or an unsupervised clustering method. We evaluate our proposed method on two recent benchmarks, i.e., VQACL-VQAv2 and CLOVE-function benchmarks. GaB outperforms all the data-free baselines with substantial improvement in maintaining VQA performance across evolving tasks, while being on-par with methods with access to the past data.
[CV-16] Digi2Real: Bridging the Realism Gap in Synthetic Data Face Recognition via Foundation Models
Link: https://arxiv.org/abs/2411.02188
Authors: Anjith George,Sebastien Marcel
Keywords: neural network architectures, face recognition, past few years, network architectures, advancement in neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:The accuracy of face recognition systems has improved significantly in the past few years, thanks to the large amount of data collected and the advancement in neural network architectures. However, these large-scale datasets are often collected without explicit consent, raising ethical and privacy concerns. To address this, there have been proposals to use synthetic datasets for training face recognition models. Yet, such models still rely on real data to train the generative models and generally exhibit inferior performance compared to those trained on real datasets. One of these datasets, DigiFace, uses a graphics pipeline to generate different identities and different intra-class variations without using real data in training the models. However, the performance of this approach is poor on face recognition benchmarks, possibly due to the lack of realism in the images generated from the graphics pipeline. In this work, we introduce a novel framework for realism transfer aimed at enhancing the realism of synthetically generated face images. Our method leverages the large-scale face foundation model, and we adapt the pipeline for realism enhancement. By integrating the controllable aspects of the graphics pipeline with our realism enhancement technique, we generate a large amount of realistic variations-combining the advantages of both approaches. Our empirical evaluations demonstrate that models trained using our enhanced dataset significantly improve the performance of face recognition systems over the baseline. The source code and datasets will be made available publicly.
[CV-17] CleAR: Robust Context-Guided Generative Lighting Estimation for Mobile Augmented Reality
Link: https://arxiv.org/abs/2411.02179
Authors: Yiqin Zhao,Mallesham Dasari,Tian Guo
Keywords: mobile augmented reality, lighting estimation, lighting, creating immersive user, immersive user experiences
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Abstract:High-quality environment lighting is the foundation of creating immersive user experiences in mobile augmented reality (AR) applications. However, achieving visually coherent environment lighting estimation for Mobile AR is challenging due to several key limitations associated with AR device sensing capabilities, including limitations in device camera FoV and pixel dynamic ranges. Recent advancements in generative AI, which can generate high-quality images from different types of prompts, including texts and images, present a potential solution for high-quality lighting estimation. Still, to effectively use generative image diffusion models, we must address their key limitations of generation hallucination and slow inference process. To do so, in this work, we design and implement a generative lighting estimation system called CleAR that can produce high-quality and diverse environment maps in the format of 360° images. Specifically, we design a two-step generation pipeline guided by AR environment context data to ensure the results follow physical environment visual context and color appearances. To improve the estimation robustness under different lighting conditions, we design a real-time refinement component to adjust lighting estimation results on AR devices. To train and test our generative models, we curate a large-scale environment lighting estimation dataset with diverse lighting conditions. Through quantitative evaluation and user study, we show that CleAR outperforms state-of-the-art lighting estimation methods on both estimation accuracy and robustness. Moreover, CleAR supports real-time refinement of lighting estimation results, ensuring robust and timely environment lighting updates for AR applications. Our end-to-end generative estimation takes as fast as 3.2 seconds, outperforming state-of-the-art methods by 110x.
[CV-18] SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models NEURIPS2024
Link: https://arxiv.org/abs/2411.02175
Authors: Linglan Zhao,Xuerui Zhang,Ke Yan,Shouhong Ding,Weiran Huang
Keywords: Continual learning aims, Continual learning, resisting forgetting previous, aims to incrementally, incrementally acquire
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:Continual learning aims to incrementally acquire new concepts in data streams while resisting forgetting previous knowledge. With the rise of powerful pre-trained models (PTMs), there is a growing interest in training incremental learning systems using these foundation models, rather than learning from scratch. Existing works often view PTMs as a strong initial point and directly apply parameter-efficient tuning (PET) in the first session for adapting to downstream tasks. In the following sessions, most methods freeze model parameters for tackling forgetting issues. However, applying PET directly to downstream data cannot fully explore the inherent knowledge in PTMs. Additionally, freezing the parameters in incremental sessions hinders models’ plasticity to novel concepts not covered in the first session. To solve the above issues, we propose a Slow And Fast parameter-Efficient tuning (SAFE) framework. In particular, to inherit general knowledge from foundation models, we include a transfer loss function by measuring the correlation between the PTM and the PET-applied model. After calibrating in the first session, the slow efficient tuning parameters can capture more informative features, improving generalization to incoming classes. Moreover, to further incorporate novel concepts, we strike a balance between stability and plasticity by fixing slow efficient tuning parameters and continuously updating the fast ones. Specifically, a cross-classification loss with feature alignment is proposed to circumvent catastrophic forgetting. During inference, we introduce an entropy-based aggregation strategy to dynamically utilize the complementarity in the slow and fast learners. Extensive experiments on seven benchmark datasets verify the effectiveness of our method by significantly surpassing the state-of-the-art.
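One plausible reading of entropy-based aggregation is that, for each input, the learner with the more confident (lower-entropy) prediction gets the larger say. The sketch below implements that reading in plain Python. The exact weighting rule is an assumption, since the abstract does not give the formula, and the logits are made-up values.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def aggregate(slow_logits, fast_logits):
    """Mix two predictions, weighting each by a softmax over negative entropy."""
    p_slow = softmax(slow_logits)
    p_fast = softmax(fast_logits)
    # Assumed rule: lower entropy (more confidence) gives a larger weight.
    w_slow, w_fast = softmax([-entropy(p_slow), -entropy(p_fast)])
    return [w_slow * a + w_fast * b for a, b in zip(p_slow, p_fast)]

# The slow learner is confident about class 0; the fast learner is nearly uniform.
mixed = aggregate([5.0, 0.0, 0.0], [0.1, 0.3, 0.0])
predicted = max(range(len(mixed)), key=lambda k: mixed[k])
```

Because the weights are computed per input, the mix can lean on the slow learner for classes it knows well and on the fast learner for newer classes, without a fixed global ratio.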
[CV-19] Improving Domain Generalization in Self-supervised Monocular Depth Estimation via Stabilized Adversarial Training
Link: https://arxiv.org/abs/2411.02149
Authors: Yuanqi Yao,Gang Wu,Kui Jiang,Siao Liu,Jian Kuai,Xianming Liu,Junjun Jiang
Keywords: Monocular Depth Estimation, self-supervised Monocular Depth, great generalization remains, self-supervised MDE methods, remains significantly challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Learning a self-supervised Monocular Depth Estimation (MDE) model with great generalization remains significantly challenging. Despite the success of adversarial augmentation in the supervised learning generalization, naively incorporating it into self-supervised MDE models potentially causes over-regularization, suffering from severe performance degradation. In this paper, we conduct qualitative analysis and illuminate the main causes: (i) inherent sensitivity in the UNet-alike depth network and (ii) dual optimization conflict caused by over-regularization. To tackle these issues, we propose a general adversarial training framework, named Stabilized Conflict-optimization Adversarial Training (SCAT), integrating adversarial data augmentation into self-supervised MDE methods to achieve a balance between stability and generalization. Specifically, we devise an effective scaling depth network that tunes the coefficients of long skip connection and effectively stabilizes the training process. Then, we propose a conflict gradient surgery strategy, which progressively integrates the adversarial gradient and optimizes the model toward a conflict-free direction. Extensive experiments on five benchmarks demonstrate that SCAT can achieve state-of-the-art performance and significantly improve the generalization capability of existing self-supervised MDE methods.
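The abstract does not spell out the conflict gradient surgery rule. A common way to remove a conflict between two gradients is a PCGrad-style projection, which the sketch below uses as a stand-in: when the adversarial gradient points against the clean gradient, its conflicting component is projected out before the two are combined. The `weight` parameter, which could be ramped up over training to integrate the adversarial gradient progressively, is an assumption.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_conflict(g_adv, g_clean):
    """Project g_adv onto the plane orthogonal to g_clean if the two conflict."""
    d = dot(g_adv, g_clean)
    if d >= 0:
        return list(g_adv)  # no conflict: keep the adversarial gradient as is
    scale = d / dot(g_clean, g_clean)
    return [a - scale * c for a, c in zip(g_adv, g_clean)]

def combined_gradient(g_clean, g_adv, weight):
    """Clean gradient plus a weighted, conflict-free adversarial gradient."""
    g_adv_free = remove_conflict(g_adv, g_clean)
    return [c + weight * a for c, a in zip(g_clean, g_adv_free)]

# Toy two-parameter example where the gradients conflict (negative dot product).
g_clean = [1.0, 0.0]
g_adv = [-1.0, 1.0]
projected = remove_conflict(g_adv, g_clean)
update = combined_gradient(g_clean, g_adv, weight=0.5)
```

After projection the adversarial gradient is orthogonal to the clean one, so adding it no longer undoes progress on the clean objective.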
[CV-20] Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery
Link: https://arxiv.org/abs/2411.02136
Authors: Robert Fonod,Haechan Cho,Hwasoo Yeo,Nikolas Geroliminis
Keywords: addressing key challenges, high-altitude drone footage, extracting georeferenced vehicle, Songdo International Business, International Business District
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:This paper presents a framework for extracting georeferenced vehicle trajectories from high-altitude drone footage, addressing key challenges in urban traffic monitoring and limitations of traditional ground-based systems. We employ state-of-the-art computer vision and deep learning to create an end-to-end pipeline that enhances vehicle detection, tracking, and trajectory stabilization. Conducted in the Songdo International Business District, South Korea, the study used a multi-drone experiment over 20 intersections, capturing approximately 12TB of 4K video data over four days. We developed a novel track stabilization method that uses detected vehicle bounding boxes as exclusion masks during image registration, which, combined with advanced georeferencing techniques, accurately transforms vehicle coordinates into real-world geographical data. Additionally, our framework includes robust vehicle dimension estimation and detailed road segmentation for in-depth traffic analysis. The framework produced two high-quality datasets: the Songdo Traffic dataset, comprising nearly 1 million unique vehicle trajectories, and the Songdo Vision dataset, containing over 5,000 human-annotated frames with about 300,000 vehicle instances in four classes. Comparisons between drone-derived data and high-precision sensor data from an instrumented probe vehicle highlight the accuracy and consistency of our framework’s extraction in dense urban settings. By publicly releasing these datasets and the pipeline source code, this work sets new benchmarks for data quality, reproducibility, and scalability in traffic research. Results demonstrate the potential of integrating drone technology with advanced computer vision for precise, cost-effective urban traffic monitoring, providing valuable resources for the research community to develop intelligent transportation systems and improve traffic management strategies.
[CV-21] Multi-modal biometric authentication: Leveraging shared layer architectures for enhanced security
Link: https://arxiv.org/abs/2411.02112
Authors: Vatchala S,Yogesh C,Yeshwanth Govindarajan,Krithik Raja M,Vishal Pramav Amirtha Ganesan,Aashish Vinod A,Dharun Ramesh
Keywords: enhance security measures, Convolutional Neural Networks, Recurrent Neural Networks, Neural Networks, integrates facial
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract:In this study, we introduce a novel multi-modal biometric authentication system that integrates facial, vocal, and signature data to enhance security measures. Utilizing a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), our model architecture uniquely incorporates dual shared layers alongside modality-specific enhancements for comprehensive feature extraction. The system undergoes rigorous training with a joint loss function, optimizing for accuracy across diverse biometric inputs. Feature-level fusion via Principal Component Analysis (PCA) and classification through Gradient Boosting Machines (GBM) further refine the authentication process. Our approach demonstrates significant improvements in authentication accuracy and robustness, paving the way for advanced secure identity verification solutions.
[CV-22] Deep Learning on 3D Semantic Segmentation: A Detailed Review
Link: https://arxiv.org/abs/2411.02104
Authors: Thodoris Betsas, Andreas Georgopoulos, Anastasios Doulamis, Pierre Grussenmeyer
Keywords (EN): Semantic Segmentation, deep learning methods, learning methods, deep learning, taxonomy scheme
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this paper, an exhaustive review and comprehensive analysis of recent and former deep learning methods in 3D Semantic Segmentation (3DSS) is presented. In the related literature, the taxonomy scheme used for the classification of the 3DSS deep learning methods is ambiguous. Based on the taxonomy schemes of 9 existing review papers, a new taxonomy scheme of the 3DSS deep learning methods is proposed, aiming to standardize it and improve the comparability and clarity across related studies. Furthermore, an extensive overview of the available 3DSS indoor and outdoor datasets is provided along with their links. The core part of the review is the detailed presentation of recent and former 3DSS deep learning methods and their classification using the proposed taxonomy scheme along with their GitHub repositories. Additionally, a brief but informative analysis of the evaluation metrics and loss functions used in 3DSS is included. Finally, a fruitful discussion of the examined 3DSS methods and datasets is presented to foster new research directions and applications in the field of 3DSS. As a supplement to this review, a GitHub repository is provided (this https URL Detailed-Review) that includes a quick classification of over 400 3DSS methods using the proposed taxonomy scheme.
[CV-23] The evolution of volumetric video: A survey of smart transcoding and compression approaches
Link: https://arxiv.org/abs/2411.02095
Authors: Preetish Kakkar, Hariharan Ragothaman
Keywords (EN): enabling immersive experiences, revolutionary technology poised, Volumetric video, data-intensive volumetric video, volumetric video streams
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Volumetric video, the capture and display of three-dimensional (3D) imagery, has emerged as a revolutionary technology poised to transform the media landscape, enabling immersive experiences that transcend the limitations of traditional 2D video. One of the key challenges in this domain is the efficient delivery of these high-bandwidth, data-intensive volumetric video streams, which requires innovative transcoding and compression techniques. This research paper explores the state-of-the-art in volumetric video compression and delivery, with a focus on the potential of AI-driven solutions to address the unique challenges posed by this emerging medium.
[CV-24] GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery
Link: https://arxiv.org/abs/2411.02074
Authors: Bhupendra Solanki, Ashwin Nair, Mainak Singha, Souradeep Mukhopadhyay, Ankit Jha, Biplab Banerjee
Keywords (EN): Generalized Category Discovery, Generalized Category, Category Discovery, aims to cluster, categories using labeled
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ACM ICVGIP 2024
Abstract:Generalized Category Discovery (GCD) aims to cluster unlabeled images into known and novel categories using labeled images from known classes. To address the challenge of transferring features from known to unknown classes while mitigating model bias, we introduce GraphVL, a novel approach for vision-language modeling in GCD, leveraging CLIP. Our method integrates a graph convolutional network (GCN) with CLIP’s text encoder to preserve class neighborhood structure. We also employ a lightweight visual projector for image data, ensuring discriminative features through margin-based contrastive losses for image-text mapping. This neighborhood preservation criterion effectively regulates the semantic space, making it less sensitive to known classes. Additionally, we learn textual prompts from known classes and align them to create a more contextually meaningful semantic feature space for the GCN layer using a contextual similarity loss. Finally, we represent unlabeled samples based on their semantic distance to class prompts from the GCN, enabling semi-supervised clustering for class discovery and minimizing errors. Our experiments on seven benchmark datasets consistently demonstrate the superiority of GraphVL when integrated with the CLIP backbone.
[CV-25] Model Integrity when Unlearning with T2I Diffusion Models
Link: https://arxiv.org/abs/2411.02068
Authors: Andrea Schioppa, Emiel Hoogeboom, Jonathan Heek
Keywords (EN): widespread public accessibility, approximate Machine Unlearning, public accessibility, widespread public, Diffusion Models
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The rapid advancement of text-to-image Diffusion Models has led to their widespread public accessibility. However, these models, trained on large internet datasets, can sometimes generate undesirable outputs. To mitigate this, approximate Machine Unlearning algorithms have been proposed to modify model weights to reduce the generation of specific types of images, characterized by samples from a "forget distribution", while preserving the model's ability to generate other images, characterized by samples from a "retain distribution". While these methods aim to minimize the influence of training data in the forget distribution without extensive additional computation, we point out that they can compromise the model's integrity by inadvertently affecting generation for images in the retain distribution. Recognizing the limitations of FID and CLIPScore in capturing these effects, we introduce a novel retention metric that directly assesses the perceptual difference between outputs generated by the original and the unlearned models. We then propose unlearning algorithms that demonstrate superior effectiveness in preserving model integrity compared to existing baselines. Given their straightforward implementation, these algorithms serve as valuable benchmarks for future advancements in approximate Machine Unlearning for Diffusion Models.
[CV-26] AM Flow: Adapters for Temporal Processing in Action Recognition
Link: https://arxiv.org/abs/2411.02065
Authors: Tanay Agrawal, Abid Ali, Antitza Dantcheva, Francois Bremond
Keywords (EN): recently gained generalisability, Deep learning models, Deep learning, generalisability and robustness, recently gained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning models, in particular image models, have recently gained generalisability and robustness. In this work, we propose to exploit such advances in the realm of video classification. Video foundation models suffer from the requirement of extensive pretraining and a large training time. Towards mitigating such limitations, we propose "Attention Map (AM) Flow" for image models, a method for identifying pixels relevant to motion in each input video frame. In this context, we propose two methods to compute AM flow, depending on camera motion. AM flow allows the separation of spatial and temporal processing, while providing improved results over combined spatio-temporal processing (as in video models). Adapters, one of the popular techniques in parameter efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full-finetuning. We extend adapters to "temporal processing adapters" by incorporating a temporal processing unit into the adapters. Our work achieves faster convergence, therefore reducing the number of epochs needed for training. Moreover, we endow an image model with the ability to achieve state-of-the-art results on popular action recognition datasets. This reduces training time and simplifies pretraining. We present experiments on Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showcasing state-of-the-art or comparable results.
[CV-27] Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation
Link: https://arxiv.org/abs/2411.02057
Authors: Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Shaofeng Zhang, Yi Yu, Wenxian Yu, Junchi Yan
Keywords (EN): earth observation applications, aerial object detection, aerial object, object detection, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In recent years, aerial object detection has been increasingly pivotal in various earth observation applications. However, current algorithms are limited to detecting a set of pre-defined object categories, demand sufficient annotated training samples, and fail to detect novel object categories. In this paper, we put forth a novel formulation of the aerial object detection problem, namely open-vocabulary aerial object detection (OVAD), which can detect objects beyond training categories without the cost of collecting new labeled data. We propose CastDet, a CLIP-activated student-teacher detection framework that serves as the first OVAD detector specifically designed for the challenging aerial scenario, where objects often exhibit weak appearance features and arbitrary orientations. Our framework integrates a robust localization teacher along with several box selection strategies to generate high-quality proposals for novel objects. Additionally, the RemoteCLIP model is adopted as an omniscient teacher, which provides rich knowledge to enhance classification capabilities for novel categories. A dynamic label queue is devised to maintain high-quality pseudo-labels during training. By doing so, the proposed CastDet boosts not only novel object proposals but also classification. Furthermore, we extend our approach from horizontal OVAD to oriented OVAD with tailored algorithm designs to effectively manage bounding box representation and pseudo-label generation. Extensive experiments for both tasks on multiple existing aerial object detection datasets demonstrate the effectiveness of our approach. The code is available at this https URL.
[CV-28] Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Link: https://arxiv.org/abs/2411.02038
Authors: Yongxin Zhu, Bocheng Li, Yifei Xin, Linli Xu
Keywords (EN): converting continuous representations, unsupervised representation learning, Vector Quantization, latent generative models, representation collapse
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of latent space at the expense of model capacity, which does not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose SimVQ, a novel method which reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the entire linear space spanned by the codebook, rather than merely updating the code vector selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data with different model architectures. Our code is available at this https URL.
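The reparameterization idea in this abstract can be illustrated with a toy example. This is a rough sketch, not the authors' implementation: it assumes the effective codebook is the product of a basis matrix and a square matrix `W`, that only `W` is updated, and that the loss is the squared distance between the encoder output `z` and its nearest code. All names (`effective_codebook`, `step_on_w`, the toy values) are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def effective_codebook(basis: Matrix, W: Matrix) -> Matrix:
    """Codes are the basis rows pushed through the shared linear layer W."""
    return matmul(basis, W)


def nearest_code(z: Vector, codebook: Matrix) -> int:
    """Index of the code closest to z in squared Euclidean distance."""
    def sq_dist(code: Vector) -> float:
        return sum((zi - ci) ** 2 for zi, ci in zip(z, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def step_on_w(z: Vector, basis: Matrix, W: Matrix, lr: float) -> Matrix:
    """One gradient step on W for the loss ||basis[k] @ W - z||^2, k = selected code."""
    codebook = effective_codebook(basis, W)
    k = nearest_code(z, codebook)
    residual = [ci - zi for ci, zi in zip(codebook[k], z)]
    dim = len(W)
    # dL/dW[i][j] = 2 * basis[k][i] * residual[j]
    return [
        [W[i][j] - lr * 2.0 * basis[k][i] * residual[j] for j in range(dim)]
        for i in range(dim)
    ]


# Toy setup (assumed values): 4 codes in 2 dimensions, W starts as the identity.
basis = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0], [-1.0, 0.5]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.2]

W1 = step_on_w(z, basis, W0, lr=0.1)
before = effective_codebook(basis, W0)
after = effective_codebook(basis, W1)
moved = sum(1 for old, new in zip(before, after) if old != new)
print("codes moved by one update:", moved, "of", len(basis))
```

Only one code is selected by the nearest-neighbor search, yet the update to the shared matrix shifts every code whose basis row is not orthogonal to the selected one. A vanilla VQ update would move the selected code only.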
[CV-29] Tree level change detection over Ahmedabad city using very high resolution satellite images and Deep Learning
Link: https://arxiv.org/abs/2411.02009
Authors: Jai G Singla, Gautam Jaiswal
Keywords (EN): Indian urban region, high resolution satellite, resolution satellite datasets, Indian urban, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this study, 0.5 m high-resolution satellite datasets over an Indian urban region were used to demonstrate the applicability of deep learning models over Ahmedabad, India. Here, a YOLOv7 instance segmentation model was trained on a well-curated tree canopy dataset (6500 images) in order to carry out the change detection. During training, bounding box regression loss, mask regression loss, and mean average precision (mAP) were used to evaluate the model, and the stochastic gradient descent algorithm was used to optimize it. After 500 epochs, mAP values of 0.715 and 0.699 were obtained for individual tree detection and tree canopy mask segmentation, respectively. By further tuning the hyperparameters of the model, a maximum tree detection accuracy of 80% with a false segmentation rate of 2% on the data was obtained.
[CV-30] QCS: Feature Refining from Quadruplet Cross Similarity for Facial Expression Recognition
Link: https://arxiv.org/abs/2411.01988
Authors: Chengpeng Wang, Li Chen, Lili Wang, Zhaofan Li, Xuebin Lv
Keywords (EN): facial expression recognition, numerous feature types, mine effective features, facial expression, facial expression datasets
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:On facial expression datasets with complex and numerous feature types, where the significance and dominance of labeled features are difficult to predict, facial expression recognition (FER) encounters the challenges of inter-class similarity and intra-class variances, making it difficult to mine effective features. We aim to solely leverage the feature similarity among facial samples to address this. We introduce the Cross Similarity Attention (CSA), an input-output position-sensitive attention mechanism that harnesses feature similarity across different images to compute the corresponding global spatial attention. Based on this, we propose a four-branch circular framework, called Quadruplet Cross Similarity (QCS), to extract discriminative features from the same class and eliminate redundant ones from different classes synchronously to refine cleaner features. The symmetry of the network ensures balanced and stable training and reduces the amount of CSA interaction matrix. Contrastive residual distillation is utilized to transfer the information learned in the cross module back to the base network. The cross-attention module exists during training, and only one base branch is retained during inference. Our proposed QCS model outperforms state-of-the-art methods on several popular FER datasets, without requiring additional landmark information or other extra training data. The code is available at this https URL.
[CV-31] Typicalness-Aware Learning for Failure Detection NEURIPS2024
Link: https://arxiv.org/abs/2411.01981
Authors: Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qingdong He, Xiaoling Wang, Jingyong Su
Keywords (EN): Deep neural networks, high confidence scores, Deep neural, neural networks, confidence scores
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024
Abstract:Deep neural networks (DNNs) often suffer from the overconfidence issue, where incorrect predictions are made with high confidence scores, hindering the applications in critical systems. In this paper, we propose a novel approach called Typicalness-Aware Learning (TAL) to address this issue and improve failure detection performance. We observe that, with the cross-entropy loss, model predictions are optimized to align with the corresponding labels via increasing logit magnitude or refining logit direction. However, regarding atypical samples, the image content and their labels may exhibit disparities. This discrepancy can lead to overfitting on atypical samples, ultimately resulting in the overconfidence issue that we aim to address. To tackle the problem, we have devised a metric that quantifies the typicalness of each sample, enabling the dynamic adjustment of the logit magnitude during the training process. By allowing atypical samples to be adequately fitted while preserving reliable logit direction, the problem of overconfidence can be mitigated. TAL has been extensively evaluated on benchmark datasets, and the results demonstrate its superiority over existing failure detection methods. Specifically, TAL achieves a more than 5% improvement on CIFAR100 in terms of the Area Under the Risk-Coverage Curve (AURC) compared to the state-of-the-art. Code is available at this https URL.
[CV-32] SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities
Link: https://arxiv.org/abs/2411.01975
Authors: Ehsan Faghihi, Mohammedreza Zarenejad, Ali-Asghar Beheshti Shirazi
Keywords (EN): meaning and critical, analyzing the subtle, subtle details, fundamental yet challenging, challenging task
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Capturing a video’s meaning and critical concepts by analyzing the subtle details is a fundamental yet challenging task in video captioning. Identifying the dominant emotional tone in a video significantly enhances the perception of its context. Despite a strong emphasis on video captioning, existing models often fail to adequately address emotional themes, resulting in suboptimal captioning results. To address these limitations, this paper proposes a novel Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities (SPECTRUM) framework to empower the generation of emotionally and semantically credible captions. Leveraging our pioneering structure, SPECTRUM discerns multimodal semantics and emotional themes using Visual Text Attribute Investigation (VTAI) and determines the orientation of descriptive captions through a Holistic Concept-Oriented Theme (HCOT), expressing emotionally-informed and field-acquainted references. They exploit video-to-text retrieval capabilities and the multifaceted nature of video content to estimate the emotional probabilities of candidate captions. Then, the dominant theme of the video is determined by appropriately weighting embedded attribute vectors and applying coarse- and fine-grained emotional concepts, which define the video’s contextual alignment. Furthermore, using two loss functions, SPECTRUM is optimized to integrate emotional information and minimize prediction errors. Extensive experiments on the EmVidCap, MSVD, and MSRVTT video captioning datasets demonstrate that our model significantly surpasses state-of-the-art methods. Quantitative and qualitative evaluations highlight the model’s ability to accurately capture and convey video emotions and multimodal attributes.
[CV-33] UnSegMedGAT: Unsupervised Medical Image Segmentation using Graph Attention Networks Clustering
Link: https://arxiv.org/abs/2411.01966
Authors: A. Mudit Adityaja, Saurabh J. Shigwan, Nitin Kumar
Keywords (EN): supervised classification drives, data-intensive nature, classification drives, drives the interest, Graph Attention Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:The data-intensive nature of supervised classification drives the interest of researchers towards unsupervised approaches, especially for problems such as medical image segmentation, where labeled data is scarce. Building on the recent advancements of Vision transformers (ViT) in computer vision, we propose an unsupervised segmentation framework using a pre-trained Dino-ViT. In the proposed method, we leverage the inherent graph structure within the image to realize a significant performance gain for segmentation in medical images. For this, we introduce a modularity-based loss function coupled with a Graph Attention Network (GAT) to effectively capture the inherent graph topology within the image. Our method achieves state-of-the-art performance, even significantly surpassing or matching that of existing (semi)supervised techniques such as MedSAM, a Segment Anything Model for medical images. We demonstrate this using two challenging medical image datasets, ISIC-2018 and CVC-ColonDB. This work underscores the potential of unsupervised approaches in advancing medical image analysis in scenarios where labeled data is scarce. The GitHub repository of the code is available at [this https URL].
[CV-34] Deep Learning for Leopard Individual Identification: An Adaptive Angular Margin Approach
Link: https://arxiv.org/abs/2411.01962
Authors: David Colomer Matachana
Keywords (EN): camera trap images, Accurate identification, camera trap, trap images, monitoring and ecological
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate identification of individual leopards across camera trap images is critical for population monitoring and ecological studies. This paper introduces a deep learning framework to distinguish between individual leopards based on their unique spot patterns. This approach employs a novel adaptive angular margin method in the form of a modified CosFace architecture. In addition, I propose a preprocessing pipeline that combines RGB channels with an edge detection channel to underscore the critical features learned by the model. This approach significantly outperforms the Triplet Network baseline, achieving a Dynamic Top-5 Average Precision of 0.8814 and a Top-5 Rank Match Detection of 0.9533, demonstrating its potential for open-set learning in wildlife identification. While not surpassing the performance of the SIFT-based Hotspotter algorithm, this method represents a substantial advancement in applying deep learning to patterned wildlife identification. This research contributes to the field of computer vision and provides a valuable tool for biologists aiming to study and protect leopard populations. It also serves as a stepping stone for applying the power of deep learning in Capture-Recapture studies for other patterned species.
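For readers unfamiliar with CosFace, the sketch below shows the standard large-margin cosine loss for a single sample: the cosine between the embedding and the true class weight is reduced by a margin before a scaled softmax. The abstract does not describe how the paper adapts the margin, so the margin here is a plain argument; the function names and default values are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosface_loss(
    embedding: Sequence[float],
    class_weights: Sequence[Sequence[float]],
    label: int,
    scale: float = 30.0,   # assumed value, not taken from the paper
    margin: float = 0.35,  # assumed value; an adaptive scheme would vary this
) -> float:
    """Cross-entropy over scaled cosines, with the margin subtracted for the true class."""
    cosines = [cosine(embedding, w) for w in class_weights]
    logits = [
        scale * (c - margin) if j == label else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[label]


emb = [1.0, 0.2]
weights = [[1.0, 0.0], [0.0, 1.0]]
plain = cosface_loss(emb, weights, label=0, scale=10.0, margin=0.0)
with_margin = cosface_loss(emb, weights, label=0, scale=10.0, margin=0.35)
print(plain, with_margin)
```

A larger margin raises the loss for the same embedding, which pushes training toward tighter clusters per individual. An adaptive variant could pass a different `margin` per sample or per class.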
[CV-35] Learning Where to Edit Vision Transformers
Link: https://arxiv.org/abs/2411.01948
Authors: Yunqiao Yang, Long-Kai Huang, Shengzhuang Chen, Kede Ma, Ying Wei
Keywords (EN): minimize unintended effects, data-efficiently correct predictive, correct predictive errors, aims to data-efficiently, data-efficiently correct
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Model editing aims to data-efficiently correct predictive errors of large pre-trained models while ensuring generalization to neighboring failures and locality to minimize unintended effects on unrelated examples. While significant progress has been made in editing Transformer-based large language models, effective strategies for editing vision Transformers (ViTs) in computer vision remain largely untapped. In this paper, we take initial steps towards correcting predictive errors of ViTs, particularly those arising from subpopulation shifts. Taking a locate-then-edit approach, we first address the where-to-edit challenge by meta-learning a hypernetwork on CutMix-augmented data generated for editing reliability. This trained hypernetwork produces generalizable binary masks that identify a sparse subset of structured model parameters, responsive to real-world failure samples. Afterward, we solve the how-to-edit problem by simply fine-tuning the identified parameters using a variant of gradient descent to achieve successful edits. To validate our method, we construct an editing benchmark that introduces subpopulation shifts towards natural underrepresented images and AI-generated images, thereby revealing the limitations of pre-trained ViTs for object recognition. Our approach not only achieves superior performance on the proposed benchmark but also allows for adjustable trade-offs between generalization and locality. Our code is available at this https URL.
[CV-36] Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models
Link: https://arxiv.org/abs/2411.01925
Authors: Sharat Agarwal
Keywords (EN): exhibit typical arrangements, typical arrangements governed, real world, rarely occur, independent utility
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICVGIP, Young Researchers Symposium
Abstract:Objects in the real world rarely occur in isolation and exhibit typical arrangements governed by their independent utility and their expected interaction with humans and other objects in the context. For example, a chair is expected near a table, and a computer is expected on top. Humans use this spatial context and relative placement as an important cue for visual recognition in case of ambiguities. Similar to humans, DNNs exploit contextual information from data to learn representations. Our research focuses on harnessing the contextual aspects of visual data to optimize data annotation and enhance the training of deep networks. Our contributions can be summarized as follows: (1) We introduce the notion of contextual diversity for active learning (CDAL) and show its applicability in three different visual tasks: semantic segmentation, object detection, and image classification; (2) We propose a data repair algorithm to curate contextually fair data to reduce model bias, enabling the model to detect objects out of their obvious context; (3) We propose class-based annotation, where contextually relevant classes are selected that are complementary for model training under domain shift. Understanding the importance of well-curated data, we also emphasize the necessity of involving humans in the loop to achieve accurate annotations and to develop novel interaction strategies that allow humans to serve as fact-checkers. In line with this, we are working on developing an image retrieval system for wildlife camera trap images and a reliable warning system for poor-quality rural roads. For large-scale annotation, we are employing a strategic combination of human expertise and zero-shot models, while also integrating human input at various stages for continuous feedback.
[CV-37] Real-Time Polygonal Semantic Mapping for Humanoid Robot Stair Climbing
Link: https://arxiv.org/abs/2411.01919
Authors: Teng Bin, Jianming Yao, Tin Lun Lam, Tianwei Zhang
Keywords (EN): navigating complex terrains, semantic mapping tailored, robots navigating complex, planar semantic mapping, mapping tailored
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by The 2024 IEEE-RAS International Conference on Humanoid Robots. The code: this https URL
Abstract:We present a novel algorithm for real-time planar semantic mapping tailored for humanoid robots navigating complex terrains such as staircases. Our method is adaptable to any odometry input and leverages GPU-accelerated processes for planar extraction, enabling the rapid generation of globally consistent semantic maps. We utilize an anisotropic diffusion filter on depth images to effectively minimize noise from gradient jumps while preserving essential edge details, enhancing normal vector images’ accuracy and smoothness. Both the anisotropic diffusion and the RANSAC-based plane extraction processes are optimized for parallel processing on GPUs, significantly enhancing computational efficiency. Our approach achieves real-time performance, processing single frames at rates exceeding 30 Hz, which facilitates detailed plane extraction and map management swiftly and efficiently. Extensive testing underscores the algorithm’s capabilities in real-time scenarios and demonstrates its practical application in humanoid robot gait planning, significantly improving its ability to navigate dynamic environments.
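The edge-preserving smoothing mentioned in this abstract can be sketched with the classical Perona-Malik form of anisotropic diffusion. This is an assumption: the abstract names the filter family but not the exact variant, and the paper runs it on a GPU, whereas this is a slow pure-Python illustration. The parameter names and values (`kappa`, `step`) are hypothetical.

```python
import math
from typing import List

Image = List[List[float]]


def anisotropic_diffusion(
    depth: Image,
    iterations: int = 5,
    kappa: float = 1.0,  # edge threshold: differences far above kappa barely diffuse
    step: float = 0.2,   # explicit scheme, kept at or below 0.25 for stability
) -> Image:
    """Perona-Malik diffusion on a 2D grid using the four direct neighbours."""
    rows, cols = len(depth), len(depth[0])
    current = [row[:] for row in depth]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                total = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        diff = current[nr][nc] - current[r][c]
                        # Conduction coefficient shrinks towards 0 across strong edges.
                        total += math.exp(-((diff / kappa) ** 2)) * diff
                updated[r][c] = current[r][c] + step * total
        current = updated
    return current


# A depth step (like a stair edge) with one noisy pixel on the near side.
noisy = [
    [0.0, 0.0, 10.0, 10.0],
    [0.0, 0.1, 10.0, 10.0],
    [0.0, 0.0, 10.0, 10.0],
]
smoothed = anisotropic_diffusion(noisy, iterations=1)
print(smoothed[1][1], smoothed[1][2])
```

The small bump of 0.1 is flattened towards its neighbours, while the jump from 0 to 10 is left almost untouched because the conduction coefficient across it is close to zero.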
[CV-38] Masked Autoencoders are Parameter-Efficient Federated Continual Learners
Link: https://arxiv.org/abs/2411.01916
Authors: Yuchen He, Xiangfeng Wang
Keywords (EN): clients’ local models, multiple clients’ local, maintaining data privacy, specific distributed learning, distributed learning paradigm
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Federated learning is a specific distributed learning paradigm in which a central server aggregates updates from multiple clients’ local models, thereby enabling the server to learn without requiring clients to upload their private data, maintaining data privacy. While existing federated learning methods are primarily designed for static data, real-world applications often require clients to learn new categories over time. This challenge necessitates the integration of continual learning techniques, resulting in federated continual learning (FCL). Although advanced prompt-based continual learning methods leverage pre-trained transformers to mitigate catastrophic forgetting, they do not adequately address the non-IID challenges in federated learning. To address both catastrophic forgetting and non-IID issues, we propose to use masked autoencoders (MAEs) as parameter-efficient federated continual learners, called pMAE. pMAE learns reconstructive prompt on the client side through image reconstruction using MAEs. On the server side, it reconstructs the uploaded restore information to capture the data distribution across previous tasks and different clients, using these reconstructed images to finetune discriminative prompt and classifier parameters designed for classification, thereby alleviating catastrophic forgetting and non-IID challenges on a global scale. Experimental results demonstrate that pMAE achieves performance comparable to existing prompt-based methods and can enhance their effectiveness, particularly when using self-supervised pre-trained transformers as the backbone. Code is available at: this https URL.
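Masked autoencoders learn by hiding a random subset of image patches and reconstructing them from the visible ones. The sketch below shows only that patch-selection step; the abstract does not state the masking ratio or patch count used by pMAE, so the values here (196 patches, ratio 0.75) and the helper name `random_patch_mask` are assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple


def random_patch_mask(
    num_patches: int,
    mask_ratio: float = 0.75,  # assumed ratio, not taken from the abstract
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) by uniform random sampling."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Example: a 14 x 14 grid of patches, as for a 224 x 224 image with 16 x 16 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

The encoder would see only the `visible` patches, and the decoder would be trained to reconstruct the pixels of the `masked` ones.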
[CV-39] FPPL: An Efficient and Non-IID Robust Federated Continual Learning Framework
Link: https://arxiv.org/abs/2411.01904
Authors: Yuchen He, Chuyun Shen, Xiangfeng Wang, Bo Jin
Keywords (EN): Federated continual learning, federated learning setting, decentralized federated learning, classical continual learning, sequential data stream
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Federated continual learning (FCL) aims to learn from sequential data stream in the decentralized federated learning setting, while simultaneously mitigating the catastrophic forgetting issue in classical continual learning. Existing FCL methods usually employ typical rehearsal mechanisms, which could result in privacy violations or additional onerous storage and computational burdens. In this work, an efficient and non-IID robust federated continual learning framework, called Federated Prototype-Augmented Prompt Learning (FPPL), is proposed. The FPPL can collaboratively learn lightweight prompts augmented by prototypes without rehearsal. On the client side, a fusion function is employed to fully leverage the knowledge contained in task-specific prompts for alleviating catastrophic forgetting. Additionally, global prototypes aggregated from the server are used to obtain unified representation through contrastive learning, mitigating the impact of non-IID-derived data heterogeneity. On the server side, locally uploaded prototypes are utilized to perform debiasing on the classifier, further alleviating the performance degradation caused by both non-IID and catastrophic forgetting. Empirical evaluations demonstrate the effectiveness of FPPL, achieving notable performance with an efficient design while remaining robust to diverse non-IID degrees. Code is available at: this https URL.
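The prototype exchange described above can be pictured as follows: each client computes a mean feature vector per class and uploads it, and the server combines them into global prototypes. The abstract does not specify the aggregation rule, so this sketch assumes a sample-count weighted average; `local_prototypes` and `aggregate_prototypes` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Prototype = Tuple[List[float], int]  # (class mean, number of samples behind it)


def local_prototypes(
    features: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, Prototype]:
    """Client side: per-class mean of feature vectors, plus the sample count."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {y: ([s / counts[y] for s in sums[y]], counts[y]) for y in sums}


def aggregate_prototypes(
    client_protos: Sequence[Dict[int, Prototype]]
) -> Dict[int, List[float]]:
    """Server side: sample-count weighted average of client prototypes (assumed rule)."""
    weighted: Dict[int, List[float]] = {}
    totals: Dict[int, int] = defaultdict(int)
    for protos in client_protos:
        for label, (mean, count) in protos.items():
            if label not in weighted:
                weighted[label] = [0.0] * len(mean)
            weighted[label] = [w + count * m for w, m in zip(weighted[label], mean)]
            totals[label] += count
    return {y: [w / totals[y] for w in weighted[y]] for y in weighted}


# Two clients with non-IID data: client B has a class that client A never sees.
client_a = local_prototypes([[1.0, 1.0], [3.0, 3.0]], [0, 0])
client_b = local_prototypes([[5.0, 5.0], [0.0, 2.0]], [0, 1])
global_protos = aggregate_prototypes([client_a, client_b])
print(global_protos)
```

Only the prototypes and counts leave the client, not the raw samples, which is the privacy motivation for avoiding rehearsal buffers.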
[CV-40] A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding
Link: https://arxiv.org/abs/2411.01893
Authors: Yitong Dong,Yijin Li,Zhaoyang Huang,Weikang Bian,Jingbo Liu,Hujun Bao,Zhaopeng Cui,Hongsheng Li,Guofeng Zhang
Keywords: depth range prior, prior-free MVS methods, recent prior-free MVS, range prior, depth range
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:In this paper, we propose a novel multi-view stereo (MVS) framework that eliminates the depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method simultaneously considers all the source images. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embedding to encapsulate information such as multi-view camera poses, providing implicit geometric constraints for multi-view disparity feature fusion dominated by attention. Additionally, we construct corresponding hidden states for each source image due to significant differences in the observation quality of the same pixel in the reference frame across multiple source frames. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image and dynamically update hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and the Tanks and Temples benchmark demonstrate the effectiveness of our method. The code is available at our project page: this https URL.
[CV-41] GVKF: Gaussian Voxel Kernel Functions for Highly Efficient Surface Reconstruction in Open Scenes NEURIPS2024
Link: https://arxiv.org/abs/2411.01853
Authors: Gaochao Song,Chong Cheng,Hao Wang
Keywords: Neural Radiance Fields, Existing Neural Radiance, paper we present, method for efficient, Radiance Fields
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: NeurIPS 2024
Abstract:In this paper we present a novel method for efficient and effective 3D surface reconstruction in open scenes. Existing Neural Radiance Fields (NeRF) based works typically require extensive training and rendering time due to the adopted implicit representations. In contrast, 3D Gaussian splatting (3DGS) uses an explicit and discrete representation, so the reconstructed surface is built from a huge number of Gaussian primitives, which leads to excessive memory consumption and rough surface details in sparse Gaussian areas. To address these issues, we propose Gaussian Voxel Kernel Functions (GVKF), which establish a continuous scene representation based on discrete 3DGS through kernel regression. The GVKF integrates fast 3DGS rasterization and highly effective scene implicit representations, achieving high-fidelity open scene surface reconstruction. Experiments on challenging scene datasets demonstrate the efficiency and effectiveness of our proposed GVKF, which features high reconstruction quality, real-time rendering speed, and significant savings in storage and training memory consumption.
[CV-42] KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension NEURIPS2024
Link: https://arxiv.org/abs/2411.01846
Authors: Jie Yang,Wang Zeng,Sheng Jin,Lumin Xu,Wentao Liu,Chen Qian,Ruimao Zhang
Keywords: Large Language Models, Multimodal Large Language, Large Language, Recent advancements, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: NeurIPS 2024
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM’s superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.
[CV-43] OwMatch: Conditional Self-Labeling with Consistency for Open-World Semi-Supervised Learning NEURIPS2024
Link: https://arxiv.org/abs/2411.01833
Authors: Shengjie Niu,Lifan Lin,Jian Huang,Chao Wang
Keywords: offers a robust, harnessing the potential, potential of unannotated, Semi-supervised learning, SSL
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*Comments: NeurIPS 2024 camera-ready (10 pages, 4 figures) with the appendices (10 pages, 7 figures)
Abstract:Semi-supervised learning (SSL) offers a robust framework for harnessing the potential of unannotated data. Traditionally, SSL mandates that all classes possess labeled instances. However, the emergence of open-world SSL (OwSSL) introduces a more practical challenge, wherein unlabeled data may encompass samples from unseen classes. This scenario leads to misclassification of unseen classes as known ones, consequently undermining classification accuracy. To overcome this challenge, this study revisits two methodologies from self-supervised and semi-supervised learning, self-labeling and consistency, tailoring them to address the OwSSL problem. Specifically, we propose an effective framework called OwMatch, combining conditional self-labeling and open-world hierarchical thresholding. Theoretically, we analyze the estimation of class distribution on unlabeled data through rigorous statistical analysis, thus demonstrating that OwMatch can ensure the unbiasedness of the self-label assignment estimator with reliability. Comprehensive empirical analyses demonstrate that our method yields substantial performance enhancements across both known and unknown classes in comparison to previous studies. Code is available at this https URL.
[CV-44] Distribution alignment based transfer fusion frameworks on quantum devices for seeking quantum advantages
Link: https://arxiv.org/abs/2411.01822
Authors: Xi He,Feiyu Du,Xiaohan Yu,Yang Zhao,Tao Lei
Keywords: quantum machine learning, machine learning, specifically an urgent, urgent challenge, quantum
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:The scarcity of labelled data is a particularly urgent challenge in the field of quantum machine learning (QML). Two transfer fusion frameworks are proposed in this paper to predict the labels of target-domain data by aligning its distribution to a different but related labelled source domain on quantum devices. The frameworks fuse the quantum data from two different but related domains through a quantum information infusion channel. The prediction tasks in the target domain can be achieved with quantum advantages by post-processing quantum measurement results. One framework, the quantum basic linear algebra subroutines (QBLAS) based implementation, can theoretically achieve the procedure of transfer fusion with quadratic speedup on a universal quantum computer. The other framework, a hardware-scalable architecture, is implemented on noisy intermediate-scale quantum (NISQ) devices through a variational hybrid quantum-classical procedure. Numerical experiments on the synthetic and handwritten digits datasets demonstrate that the variational transfer fusion (TF) framework can reach the performance of state-of-the-art (SOTA) quantum domain adaptation (DA) methods.
[CV-45] Bootstrapping Top-down Information for Self-modulating Slot Attention NEURIPS2024
Link: https://arxiv.org/abs/2411.01801
Authors: Dongwon Kim,Seoyeon Kim,Suha Kwak
Keywords: Object-centric learning, effective visual reasoning, aims to learn, manual supervision, facilitating efficient
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted to NeurIPS 2024
Abstract:Object-centric learning (OCL) aims to learn representations of individual objects within visual scenes without manual supervision, facilitating efficient and effective visual reasoning. Traditional OCL methods primarily employ bottom-up approaches that aggregate homogeneous visual features to represent objects. However, in complex visual environments, these methods often fall short due to the heterogeneous nature of visual features within an object. To address this, we propose a novel OCL framework incorporating a top-down pathway. This pathway first bootstraps the semantics of individual objects and then modulates the model to prioritize features relevant to these semantics. By dynamically modulating the model based on its own output, our top-down pathway enhances the representational quality of objects. Our framework achieves state-of-the-art performance across multiple synthetic and real-world object-discovery benchmarks.
[CV-46] Expanding Sparse Tuning for Low Memory Usage NEURIPS2024
Link: https://arxiv.org/abs/2411.01800
Authors: Shufan Shen,Junshu Sun,Xiangyang Ji,Qingming Huang,Shuhui Wang
Keywords: Parameter-efficient fine-tuning, low memory usage, sparse tuning, memory usage, downstream tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted by NeurIPS 2024
Abstract:Parameter-efficient fine-tuning (PEFT) is an effective method for adapting pre-trained vision models to downstream tasks by tuning a small subset of parameters. Among PEFT methods, sparse tuning achieves superior performance by only adjusting the weights most relevant to downstream tasks, rather than densely tuning the whole weight matrix. However, this performance improvement has been accompanied by increases in memory usage, which stem from two factors: the storage of the whole weight matrix as learnable parameters in the optimizer, and the additional storage of tunable weight indexes. In this paper, we propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage. To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices, avoiding the costly storage of the whole original matrix. A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes. To maintain the effectiveness of sparse tuning with low-rank matrices, we extend the low-rank decomposition by applying nonlinear kernel functions to the whole-matrix merging. Consequently, we gain an increase in the rank of the merged matrix, enhancing the ability of SNELL to adapt the pre-trained models to downstream tasks. Extensive experiments on multiple downstream tasks show that SNELL achieves state-of-the-art performance with low memory usage, extending sparse-tuning PEFT to large-scale models. Code is available at this https URL.
[CV-47] AIWR: Aerial Image Water Resource Dataset for Segmentation Analysis
Link: https://arxiv.org/abs/2411.01797
Authors: Sangdaow Noppitaka,Emmanuel Okafor,Olarik Surinta
Keywords: Effective water resource, sandy soils poses, soils poses significant, limited water retention, northeastern Thailand
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 12 pages, 8 figures
Abstract:Effective water resource management is crucial in agricultural regions like northeastern Thailand, where limited water retention in sandy soils poses significant challenges. In response to this issue, the Aerial Image Water Resource (AIWR) dataset was developed, comprising 800 aerial images focused on natural and artificial water bodies in this region. The dataset was created using Bing Maps and follows the standards of the Fundamental Geographic Data Set (FGDS). It includes ground truth annotations validated by experts in remote sensing, making it an invaluable resource for researchers in geoinformatics, computer vision, and artificial intelligence. The AIWR dataset presents considerable challenges, such as segmentation due to variations in the size, color, shape, and similarity of water bodies, which often resemble other land use categories.
[CV-48] Non rigid geometric distortions correction – Application to atmospheric turbulence stabilization
Link: https://arxiv.org/abs/2411.01788
Authors: Yu Mao,Jerome Gilles
Keywords: approach is presented, presented to recover, degraded by atmospheric, atmospheric turbulence, image degraded
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:A novel approach is presented to recover an image degraded by atmospheric turbulence. Given a sequence of frames affected by turbulence, we construct a variational model to characterize the static image. The optimization problem is solved by Bregman Iteration and the operator splitting method. Our algorithm is simple, efficient, and can be easily generalized for different scenarios.
[CV-49] MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation
Link: https://arxiv.org/abs/2411.01781
Authors: Duc Dang Trung Tran,Byeongkeun Kang,Yeejin Lee
Keywords: transformer-based techniques incorporating, techniques incorporating superpoints, transformer-based techniques, techniques incorporating, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 14 pages, 9 figures, 7 tables, conference
Abstract:Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.
[CV-50] Learning predictable and robust neural representations by straightening image sequences NEURIPS2024
Link: https://arxiv.org/abs/2411.01777
Authors: Xueyan Niu,Cristina Savin,Eero P. Simoncelli
Keywords: living organisms, fundamental capability, learning sensory representations, Prediction, sensory representations
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at NeurIPS 2024
Abstract:Prediction is a fundamental capability of all living organisms, and has been proposed as an objective for learning sensory representations. Recent work demonstrates that in primate visual systems, prediction is facilitated by neural representations that follow straighter temporal trajectories than their initial photoreceptor encoding, which allows for prediction by linear extrapolation. Inspired by these experimental findings, we develop a self-supervised learning (SSL) objective that explicitly quantifies and promotes straightening. We demonstrate the power of this objective in training deep feedforward neural networks on smoothly-rendered synthetic image sequences that mimic commonly-occurring properties of natural videos. The learned model contains neural embeddings that are predictive, but also factorize the geometric, photometric, and semantic attributes of objects. The representations also prove more robust to noise and adversarial attacks compared to previous SSL methods that optimize for invariance to random augmentations. Moreover, these beneficial properties can be transferred to other training procedures by using the straightening objective as a regularizer, suggesting a broader utility for straightening as a principle for robust unsupervised learning.
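To make "straightening" concrete, one common way to quantify it is the angle between consecutive displacement vectors along a trajectory of representations: a perfectly straight trajectory has zero angle everywhere, so the next point can be predicted by linear extrapolation. The sketch below uses that discrete curvature measure as an assumption; the abstract does not state the exact objective the authors optimize, and `mean_curvature` is a hypothetical helper name.

```python
import math


def mean_curvature(trajectory):
    """Mean angle (radians) between consecutive displacement vectors.

    Assumes at least three points and no two identical consecutive points.
    """
    diffs = [[b - a for a, b in zip(p, q)] for p, q in zip(trajectory, trajectory[1:])]
    angles = []
    for u, v in zip(diffs, diffs[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.acos(cos))
    return sum(angles) / len(angles)


# A straight trajectory versus one that turns 90 degrees at every step.
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]
straight_curvature = mean_curvature(straight)
zigzag_curvature = mean_curvature(zigzag)
```

A training objective in this spirit would penalize the curvature of embedded frame sequences, so that lower values correspond to more linearly predictable representations.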
[CV-51] ARN-LSTM: A Multi-Stream Attention-Based Model for Action Recognition with Temporal Dynamics
Link: https://arxiv.org/abs/2411.01769
Authors: Chuanchuan Wang,Ahmad Sufril Azlan Mohamed,Xiao Yang,Xiang Li
Keywords: paper presents ARN-LSTM, simultaneously capturing spatial, paper presents, designed to address, address the challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:This paper presents ARN-LSTM, a novel multi-stream action recognition model designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences. Traditional methods often focus solely on spatial or temporal features, limiting their ability to comprehend complex human activities fully. Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture. Specifically, it comprises a joint stream for extracting skeleton features, a temporal stream for capturing dynamic temporal features, and an ARN-LSTM block that utilizes Time-Distributed Long Short-Term Memory (TD-LSTM) layers followed by an Attention Relation Network (ARN) to model temporal relations. The outputs from these streams are fused in a fully connected layer to provide the final action prediction. Evaluations on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate the effectiveness of our model, particularly in group activity recognition.
[CV-52] Automatic Structured Pruning for Efficient Architecture in Federated Learning
Link: https://arxiv.org/abs/2411.01759
Authors: Thai Vu Nguyen,Long Bao Le,Anderson Avila
Keywords: Federated Learning, limited computational resources, training is conducted, typically with limited, storage capacity
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:In Federated Learning (FL), training is conducted on client devices, typically with limited computational resources and storage capacity. To address these constraints, we propose an automatic pruning scheme tailored for FL systems. Our solution improves computation efficiency on client devices, while minimizing communication costs. One of the challenges of tuning pruning hyper-parameters in FL systems is the restricted access to local data. Thus, we introduce an automatic pruning paradigm that dynamically determines pruning boundaries. Additionally, we use a structured pruning algorithm optimized for mobile devices that lack hardware support for sparse computations. Experimental results demonstrate the effectiveness of our approach, achieving accuracy comparable to existing methods. Our method notably reduces the number of parameters by 89% and FLOPs by 90%, with minimal impact on the accuracy of the FEMNIST and CelebFaces datasets. Furthermore, our pruning method decreases communication overhead by up to 5x and halves inference time when deployed on Android devices.
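Structured pruning removes whole filters or channels rather than individual weights, so the result is a smaller dense layer that needs no sparse-computation support. The sketch below shows a generic magnitude-based variant that ranks filters by L1 norm. This is a common baseline used here purely as an assumption; it is not the paper's automatic boundary selection, and `prune_filters` and `keep_ratio` are hypothetical names.

```python
def prune_filters(filters, keep_ratio):
    """Keep the filters with the largest L1 norm; return their indices in original order."""
    norms = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


# Four toy filters, each flattened to a short weight list.
filters = [[0.1, -0.1], [2.0, 1.0], [0.0, 0.05], [-1.5, 0.5]]
kept = prune_filters(filters, keep_ratio=0.5)
pruned_layer = [filters[i] for i in kept]  # a smaller, still dense layer
```

In a federated setting a fixed `keep_ratio` would have to be tuned without access to client data, which is the difficulty the abstract says its automatic scheme is meant to address.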
[CV-53] ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
Link: https://arxiv.org/abs/2411.01756
Authors: Yiming Sun,Fan Yu,Shaoxiang Chen,Yu Zhang,Junwei Huang,Chenhui Li,Yang Li,Changbo Wang
Keywords: initial bounding box, video sequence based, object tracking aims, Visual object tracking, targeted object
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language (VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate descriptions of the target with tracking feedback. To further utilize semantic information produced by MLLM, a simple yet effective VL tracking framework is proposed and can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves a performance comparable to existing methods.
[CV-54] Multi-task Geometric Estimation of Depth and Surface Normal from Monocular 360° Images
Link: https://arxiv.org/abs/2411.01749
Authors: Kun Huang,Fang-Lue Zhang,Fangfang Zhang,Yu-Kun Lai,Paul Rosin,Neil A. Dodgson
Keywords: analysis in panoramic, surface normal, surface normal estimation, surface, MTL
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 18 pages, this paper is accepted by Computational Visual Media Journal (CVMJ) but not published yet
Abstract:Geometric estimation is required for scene understanding and analysis in panoramic 360° images. Current methods usually predict a single feature, such as depth or surface normal. These methods can lack robustness, especially when dealing with intricate textures or complex object surfaces. We introduce a novel multi-task learning (MTL) network that simultaneously estimates depth and surface normals from 360° images. Our first innovation is our MTL architecture, which enhances predictions for both tasks by integrating geometric information from depth and surface normal estimation, enabling a deeper understanding of 3D scene structure. Another innovation is our fusion module, which bridges the two tasks, allowing the network to learn shared representations that improve accuracy and robustness. Experimental results demonstrate that our MTL architecture significantly outperforms state-of-the-art methods in both depth and surface normal estimation, showing superior performance in complex and diverse scenes. Our model’s effectiveness and generalizability, particularly in handling intricate surface textures, establish it as a new benchmark in 360° image geometric estimation. The code and model are available at this https URL.
[CV-55] Rotation Perturbation Robustness in Point Cloud Analysis: A Perspective of Manifold Distillation
Link: https://arxiv.org/abs/2411.01748
Authors: Xinyu Xu,Huazhen Liu,Feiming Wei,Huilin Xiong,Wenxian Yu,Tao Zhang
Keywords: point cloud learning, Point cloud, sampling of Riemannian, rotation perturbation, Riemannian manifold
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 13 pages, 8 figures, submitted to TCSVT
Abstract:A point cloud is often regarded as a discrete sampling of a Riemannian manifold and plays a pivotal role in 3D image interpretation. In particular, rotation perturbation, an unexpected small change in rotation caused by various factors (such as equipment offset, system instability, and measurement errors), can easily lead to inferior results in point cloud learning tasks. However, classical point cloud learning methods are sensitive to rotation perturbation, and existing rotation-robust networks still have much room for improvement in terms of performance and noise tolerance. Given this, the paper remodels the point cloud from the perspective of manifolds and designs a manifold distillation method to achieve robustness to rotation perturbation without any coordinate transformation. In brief, during the training phase, we introduce a teacher network to learn the rotation robustness information and transfer this information to the student network through online distillation. In the inference phase, the student network directly utilizes the original 3D coordinate information to achieve robustness to rotation perturbation. Experiments carried out on four different datasets verify the effectiveness of our method. On average, on the ModelNet40 and ScanObjectNN classification datasets with random rotation perturbations, our classification accuracy improves by 4.92% and 4.41%, respectively, compared to popular rotation-robust networks; on the ShapeNet and S3DIS segmentation datasets, the mIoU improvements over rotation-robust networks are 7.36% and 4.82%, respectively. In addition, the experimental results show that the proposed algorithm performs very well in resisting noise and outliers.
[CV-56] Learning from Convolution-based Unlearnable Datasets
Link: https://arxiv.org/abs/2411.01742
Authors: Dohyun Kim,Pedro Sandoval-Segura
Keywords: leading to increased, data, construction of large, deep learning, learning has raised
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:The construction of large datasets for deep learning has raised concerns regarding unauthorized use of online data, leading to increased interest in protecting data from third-parties who want to use it for training. The Convolution-based Unlearnable DAtaset (CUDA) method aims to make data unlearnable by applying class-wise blurs to every image in the dataset so that neural networks learn relations between blur kernels and labels, as opposed to informative features for classifying clean data. In this work, we evaluate whether CUDA data remains unlearnable after image sharpening and frequency filtering, finding that this combination of simple transforms improves the utility of CUDA data for training. In particular, we observe a substantial increase in test accuracy over adversarial training for models trained with CUDA unlearnable data from CIFAR-10, CIFAR-100, and ImageNet-100. In training models to high accuracy using unlearnable data, we underscore the need for ongoing refinement in data poisoning techniques to ensure data privacy. Our method opens new avenues for enhancing the robustness of unlearnable datasets by highlighting that simple methods such as sharpening and frequency filtering are capable of breaking convolution-based unlearnable datasets.
[CV-57] Not Just Object But State: Compositional Incremental Learning without Forgetting
Link: https://arxiv.org/abs/2411.01739
Authors: Yanyi Zhang,Binglin Qiu,Qi Jia,Yu Liu,Ran He
Keywords: excessively prioritize coarse, prioritize coarse classes, learners excessively prioritize, incremental learners excessively, color and material
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Most incremental learners excessively prioritize coarse classes of objects while neglecting various kinds of states (e.g. color and material) attached to the objects. As a result, they are limited in their ability to reason about the fine-grained compositionality of state-object pairs. To remedy this limitation, we propose a novel task called Compositional Incremental Learning (composition-IL), enabling the model to recognize state-object compositions as a whole in an incremental learning fashion. Given the lack of suitable benchmarks, we re-organize two existing datasets and tailor them for composition-IL. Then, we propose a prompt-based Composition Incremental Learner (CompILer) to overcome the ambiguous composition boundary problem, which is a major challenge for composition-IL. Specifically, we exploit multi-pool prompt learning, which is regularized by inter-pool prompt discrepancy and intra-pool prompt diversity. In addition, we devise object-injected state prompting by using object prompts to guide the selection of state prompts. Furthermore, we fuse the selected prompts by a generalized-mean strategy to eliminate irrelevant information learned in the prompts. Extensive experiments on two datasets exhibit state-of-the-art performance achieved by CompILer.
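For readers unfamiliar with the generalized (power) mean mentioned for prompt fusion: for values x_1..x_n and exponent p, it is ((x_1^p + ... + x_n^p) / n)^(1/p). With p = 1 it is the arithmetic mean, and larger p weights larger entries more heavily. The sketch below applies it element-wise to non-negative vectors; how CompILer applies it to prompts and which exponent it uses are not stated in the abstract, so both are assumptions here, and `generalized_mean` is a hypothetical helper.

```python
def generalized_mean(vectors, p):
    """Element-wise power mean of equal-length, non-negative vectors (p must be non-zero)."""
    n = len(vectors)
    dim = len(vectors[0])
    return [(sum(v[i] ** p for v in vectors) / n) ** (1.0 / p) for i in range(dim)]


# Two toy "prompts" fused with different exponents.
prompts = [[1.0, 4.0], [3.0, 4.0]]
arithmetic = generalized_mean(prompts, 1)  # plain average
sharper = generalized_mean(prompts, 3)     # leans toward the larger entries
```

The exponent acts as a dial between averaging and max-like selection, which is one plausible way such a fusion could suppress weak, irrelevant components.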
[CV-58] Next Best View For Point-Cloud Model Acquisition: Bayesian Approximation and Uncertainty Analysis
Link: https://arxiv.org/abs/2411.01734
Authors: Madalena Caldeira,Plinio Moreno
Keywords: problem widely studied, View problem, vision problem widely, problem widely, computer vision problem
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:The Next Best View problem is a computer vision problem widely studied in robotics. To solve it, several methodologies have been proposed over the years. Some, more recently, propose the use of deep learning models. Predictions obtained with the help of deep learning models naturally have some uncertainty associated with them. Despite this, the standard models do not allow for their quantification. However, Bayesian estimation theory contributed to the demonstration that dropout layers make it possible to estimate prediction uncertainty in neural networks. This work adapts the point-net-based neural network for Next-Best-View (PC-NBV). It incorporates dropout layers into the model’s architecture, thus allowing the computation of the uncertainty estimate associated with its predictions. The aim of the work is to improve the network’s accuracy in correctly predicting the next best viewpoint, proposing a way to make the 3D reconstruction process more efficient. Two uncertainty measurements capable of reflecting the prediction’s error and accuracy, respectively, were obtained. These enabled the reduction of the model’s error and the increase in its accuracy from 30% to 80% by identifying and disregarding predictions with high values of uncertainty. Another method that directly uses these uncertainty metrics to improve the final prediction was also proposed. However, it showed only marginal improvements.
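The dropout-based uncertainty idea can be illustrated with a toy model: keep dropout active at prediction time, run several stochastic forward passes, and treat the spread of the outputs as an uncertainty estimate. The sketch below uses a one-layer linear model in plain Python as a stand-in. It is not PC-NBV, and the function names, dropout rate, and number of passes are all assumptions chosen for illustration.

```python
import random
import statistics


def predict_with_dropout(weights, x, drop_prob, rng):
    """One stochastic forward pass of a linear model with dropout on the inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for w, xi in zip(weights, x):
        if rng.random() < keep:
            total += w * xi / keep  # inverted dropout keeps the expected output unchanged
    return total


def mc_dropout(weights, x, drop_prob=0.5, passes=200, seed=0):
    """Return (mean prediction, standard deviation) over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [predict_with_dropout(weights, x, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.stdev(samples)


weights, x = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0]
mean, spread = mc_dropout(weights, x)                        # dropout on: non-zero spread
exact_mean, exact_spread = mc_dropout(weights, x, drop_prob=0.0)  # dropout off: deterministic
```

Predictions whose spread exceeds a chosen threshold could then be discarded, which mirrors the filtering step the abstract describes; the threshold itself would have to be tuned on validation data.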
[CV-59] A Probabilistic Formulation of LiDAR Mapping with Neural Radiance Fields
Link: https://arxiv.org/abs/2411.01725
Authors: Matthew McDermott,Jason Rife
Keywords: Neural Radiance Field, Radiance Field, Neural Radiance, paper we reexamine, reexamine the process
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments:
Abstract:In this paper we reexamine the process through which a Neural Radiance Field (NeRF) can be trained to produce novel LiDAR views of a scene. Unlike image applications where camera pixels integrate light over time, LiDAR pulses arrive at specific times. As such, multiple LiDAR returns are possible for any given detector and the classification of these returns is inherently probabilistic. Applying a traditional NeRF training routine can result in the network learning phantom surfaces in free space between conflicting range measurements, similar to how floater aberrations may be produced by an image model. We show that by formulating loss as an integral of probability (rather than as an integral of optical density) the network can learn multiple peaks for a given ray, allowing the sampling of first, nth, or strongest returns from a single output channel. Code is available at this https URL
[CV-60] ROAD-Waymo: Action Awareness at Scale for Autonomous Driving
Link: https://arxiv.org/abs/2411.01683
Authors: Salman Khan,Izzeddin Teeti,Reza Javanmard Alitappeh,Mihaela C. Stoian,Eleonora Giunchiglia,Gurkirt Singh,Andrew Bradley,Fabio Cuzzolin
Keywords: Autonomous Vehicle, perception systems require, perception systems, systems require, road users
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:
Abstract:Autonomous Vehicle (AV) perception systems require more than simply seeing, via e.g., object detection or scene segmentation. They need a holistic understanding of what is happening within the scene for safe interaction with other road users. Few datasets exist for the purpose of developing and training algorithms to comprehend the actions of other road users. This paper presents ROAD-Waymo, an extensive dataset for the development and benchmarking of techniques for agent, action, location and event detection in road scenes, provided as a layer upon the (US) Waymo Open dataset. Considerably larger and more challenging than any existing dataset (and encompassing multiple cities), it comes with 198k annotated video frames, 54k agent tubes, 3.9M bounding boxes and a total of 12.4M labels. The integrity of the dataset has been confirmed and enhanced via a novel annotation pipeline designed for automatically identifying violations of requirements specifically designed for this dataset. As ROAD-Waymo is compatible with the original (UK) ROAD dataset, it provides the opportunity to tackle domain adaptation between real-world road scenarios in different countries within a novel benchmark: ROAD++.
[CV-61] Degradation-Aware Residual-Conditioned Optimal Transport for Unified Image Restoration
Link: https://arxiv.org/abs/2411.01656
Authors: Xiaole Tang,Xiang Gu,Xiaoyi He,Xin Hu,Jian Sun
Keywords: promising low-level vision, low-level vision task, Transport, image restoration, Optimal Transport
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:All-in-one image restoration has emerged as a practical and promising low-level vision task for real-world applications. In this context, the key issue lies in how to deal with different types of degraded images simultaneously. In this work, we present a Degradation-Aware Residual-Conditioned Optimal Transport (DA-RCOT) approach that models (all-in-one) image restoration as an optimal transport (OT) problem for unpaired and paired settings, introducing the transport residual as a degradation-specific cue for both the transport cost and the transport map. Specifically, we formalize image restoration with a residual-guided OT objective by exploiting the degradation-specific patterns of the Fourier residual in the transport cost. More crucially, we design the transport map for restoration as a two-pass DA-RCOT map, in which the transport residual is computed in the first pass and then encoded as multi-scale residual embeddings to condition the second-pass restoration. This conditioning process injects intrinsic degradation knowledge (e.g., degradation type and level) and structural information from the multi-scale residual embeddings into the OT map, which thereby can dynamically adjust its behaviors for all-in-one restoration. Extensive experiments across five degradations demonstrate the favorable performance of DA-RCOT as compared to state-of-the-art methods, in terms of distortion measures, perceptual quality, and image structure preservation. Notably, DA-RCOT delivers superior adaptability to real-world scenarios even with multiple degradations and shows distinctive robustness to both degradation levels and the number of degradations.
[CV-62] PreCM: The Padding-based Rotation Equivariant Convolution Mode for Semantic Segmentation
Link: https://arxiv.org/abs/2411.01624
Authors: Xinyu Xu,Huazhen Liu,Huilin Xiong,Wenxian Yu,Tao Zhang
Keywords: semantic segmentation networks, Semantic segmentation, deep semantic segmentation, orientation information, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 14 figures, submitted to TIP
Abstract:Semantic segmentation is an important branch of image processing and computer vision. With the popularity of deep learning, various deep semantic segmentation networks have been proposed for pixel-level classification and segmentation tasks. However, imaging angles are often arbitrary in the real world, as with water body images in remote sensing and capillary and polyp images in the medical field, and we usually cannot obtain prior orientation information to guide these networks to extract more effective features. Additionally, learning the features of objects with multiple orientations is also challenging, as most CNN-based semantic segmentation networks lack the rotation equivariance needed to resist disturbance from orientation information. To address these issues, we first establish a universal convolution-group framework to more fully utilize the orientation information and make the networks rotation equivariant. Then, we mathematically construct the padding-based rotation equivariant convolution mode (PreCM), which can be used not only for multi-scale images and convolution kernels, but also as a replacement component for multiple kinds of convolution, such as dilated convolution, transposed convolution, and variable-stride convolution. To verify that rotation equivariance is realized, a new evaluation metric named rotation difference (RD) is proposed. Experiments on the datasets Satellite Images of Water Bodies, DRIVE and Floodnet show that PreCM-based networks achieve better segmentation performance than the original and data augmentation-based networks. In terms of the average RD value, the former is 0% and the latter two are 7.0503% and 3.2606%, respectively. Last but not least, PreCM also effectively enhances the robustness of networks to rotation perturbations.
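Rotation equivariance means that rotating the input and then applying the operator gives the same result as applying the operator and then rotating the output. The sketch below checks this property for a single 90-degree rotation and a zero-padded 3x3 filter. The `equivariance_gap` measure is an illustrative assumption: it is not the paper's RD metric, whose definition is not given in the abstract, and the plain filter here is not PreCM.

```python
def rot90(grid):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def conv_same(grid, kernel):
    """3x3 cross-correlation with zero padding (output has the input's size)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = acc
    return out


def equivariance_gap(grid, kernel):
    """Mean absolute difference between f(rot(x)) and rot(f(x))."""
    a = conv_same(rot90(grid), kernel)
    b = rot90(conv_same(grid, kernel))
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # unchanged by rotation
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # direction-sensitive

print(equivariance_gap(image, box))          # rotation-symmetric kernel: no gap
print(equivariance_gap(image, sobel_x) > 0)  # oriented kernel breaks equivariance
```

A kernel that is itself symmetric under rotation passes the check trivially; ordinary learned kernels generally do not, which is the gap that rotation-equivariant designs aim to close.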
[CV-63] ANNE: Adaptive Nearest Neighbors and Eigenvector-based Sample Selection for Robust Learning with Noisy Labels
Link: https://arxiv.org/abs/2411.01613
Authors: Filipe R. Cordeiro,Gustavo Carneiro
Keywords: sample selection, sample selection procedure, loss-based sampling, Adaptive KNN, Adaptive Nearest Neighbors
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Pattern Recognition
Abstract:An important stage of most state-of-the-art (SOTA) noisy-label learning methods consists of a sample selection procedure that classifies samples from the noisy-label training set into noisy-label or clean-label subsets. Sample selection typically follows one of two approaches: loss-based sampling, where high-loss samples are considered to have noisy labels, or feature-based sampling, where samples from the same class tend to cluster together in the feature space and noisy-label samples are identified as anomalies within those clusters. Empirically, loss-based sampling is robust to a wide range of noise rates, while feature-based sampling tends to work effectively in particular scenarios, e.g., the filtering of noisy instances via their eigenvectors (FINE) sampling exhibits greater robustness in scenarios with low noise rates, and the K nearest neighbor (KNN) sampling better mitigates high noise-rate problems. This paper introduces the Adaptive Nearest Neighbors and Eigenvector-based (ANNE) sample selection methodology, a novel approach that integrates loss-based sampling with the feature-based sampling methods FINE and Adaptive KNN to optimize performance across a wide range of noise rate scenarios. ANNE achieves this integration by first partitioning the training set into high-loss and low-loss sub-groups using loss-based sampling. Subsequently, within the low-loss subset, sample selection is performed using FINE, while the high-loss subset employs Adaptive KNN for effective sample selection. We integrate ANNE into the SOTA noisy-label learning method SSR+, and test it on CIFAR-10/-100 (with symmetric, asymmetric and instance-dependent noise), Webvision and ANIMAL-10, where our method shows better accuracy than the SOTA in most experiments, with a competitive training time.
[CV-64] High-Fidelity Virtual Try-on with Large-Scale Unpaired Learning
Link: https://arxiv.org/abs/2411.01593
Authors: Han Yang,Yanlong Zang,Ziwei Liu
Keywords: downstream e-commerce applications, Boosted Virtual Try-on, Virtual try-on, transfers a target, reference person
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report
Abstract:Virtual try-on (VTON) transfers a target clothing image to a reference person, where clothing fidelity is a key requirement for downstream e-commerce applications. However, existing VTON methods still fall short in high-fidelity try-on due to the conflict between the high diversity of dressing styles (e.g., clothes occluded by pants or distorted by posture) and the limited paired data for training. In this work, we propose a novel framework, Boosted Virtual Try-on (BVTON), to leverage large-scale unpaired learning for high-fidelity try-on. Our key insight is that pseudo try-on pairs can be reliably constructed from vastly available fashion images. Specifically, 1) we first propose a compositional canonicalizing flow that maps on-model clothes into pseudo in-shop clothes, dubbed canonical proxy. Each clothing part (sleeves, torso) is reversely deformed into an in-shop-like shape to compositionally construct the canonical proxy. 2) Next, we design a layered mask generation module that generates accurate semantic layout by training on canonical proxy. We replace the in-shop clothes used in conventional pipelines with the derived canonical proxy to boost the training process. 3) Finally, we propose an unpaired try-on synthesizer by constructing pseudo training pairs with randomly misaligned on-model clothes, where intricate skin texture and clothes boundaries can be generated. Extensive experiments on high-resolution (1024×768) datasets demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. Notably, BVTON shows great generalizability and scalability to various dressing styles and data sources.
[CV-65] One for All: Multi-Domain Joint Training for Point Cloud Based 3D Object Detection NEURIPS2024
Link: https://arxiv.org/abs/2411.01584
Authors: Zhenyu Wang,Yali Li,Hengshuang Zhao,Shengjin Wang
Keywords: current trend, trend in computer, computer vision, universal model, universal model inevitably
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024
Abstract:The current trend in computer vision is to utilize one universal model to address all various tasks. Achieving such a universal model inevitably requires incorporating multi-domain data for joint training to learn across multiple problem scenarios. In point cloud based 3D object detection, however, such multi-domain joint training is highly challenging, because large domain gaps among point clouds from different datasets lead to the severe domain-interference problem. In this paper, we propose OneDet3D, a universal one-for-all model that addresses 3D detection across different domains, including diverse indoor and outdoor scenes, within the same framework and only one set of parameters. We propose the domain-aware partitioning in scatter and context, guided by a routing mechanism, to address the data interference issue, and further incorporate the text modality for a language-guided classification to unify the multi-dataset label spaces and mitigate the category interference issue. The fully sparse structure and anchor-free head further accommodate point clouds with significant scale disparities. Extensive experiments demonstrate the strong universal ability of OneDet3D to utilize only one trained model for addressing almost all 3D object detection tasks.
[CV-66] Conditional Controllable Image Fusion NEURIPS2024
Link: https://arxiv.org/abs/2411.01573
Authors: Bing Cao,Xingxin Xu,Pengfei Zhu,Qilong Wang,Qinghua Hu
Keywords: integrate complementary information, input images acquired, multiple input images, aims to integrate, integrate complementary
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted by NeurIPS 2024
Abstract:Image fusion aims to integrate complementary information from multiple input images acquired through various sources to synthesize a new fused image. Existing methods usually employ distinct constraint designs tailored to specific scenes, forming fixed fusion paradigms. However, this data-driven fusion approach is challenging to deploy in varying scenarios, especially in rapidly changing environments. To address this issue, we propose a conditional controllable fusion (CCF) framework for general image fusion tasks without specific training. Due to the dynamic differences of different samples, our CCF employs specific fusion constraints for each individual in practice. Given the powerful generative capabilities of the denoising diffusion model, we first inject the specific constraints into the pre-trained DDPM as adaptive fusion conditions. The appropriate conditions are dynamically selected to ensure the fusion process remains responsive to the specific requirements in each reverse diffusion stage. Thus, CCF enables conditionally calibrating the fused images step by step. Extensive experiments validate our effectiveness in general fusion tasks across diverse scenarios against the competing methods without additional training.
[CV-67] ParseCaps: An Interpretable Parsing Capsule Network for Medical Image Diagnosis
Link: https://arxiv.org/abs/2411.01564
Authors: Xinyu Geng,Jiaming Wang,Jun Xu
Keywords: Deep learning, medical image classification, learning has excelled, excelled in medical, medical image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages
Abstract:Deep learning has excelled in medical image classification, but its clinical application is limited by poor interpretability. Capsule networks, known for encoding hierarchical relationships and spatial features, show potential in addressing this issue. Nevertheless, traditional capsule networks often underperform due to their shallow structures, and deeper variants lack hierarchical architectures, thereby compromising interpretability. This paper introduces a novel capsule network, ParseCaps, which utilizes the sparse axial attention routing and parse convolutional capsule layer to form a parse-tree-like structure, enhancing both depth and interpretability. Firstly, sparse axial attention routing optimizes connections between child and parent capsules, as well as emphasizes the weight distribution across instantiation parameters of parent capsules. Secondly, the parse convolutional capsule layer generates capsule predictions aligning with the parse tree. Finally, based on the loss design that is effective whether concept ground truth exists or not, ParseCaps advances interpretability by associating each dimension of the global capsule with a comprehensible concept, thereby facilitating clinician trust and understanding of the model’s classification results. Experimental results on CE-MRI, PH², and Derm7pt datasets show that ParseCaps not only outperforms other capsule network variants in classification accuracy, redundancy reduction and robustness, but also provides interpretable explanations, regardless of the availability of concept labels.
[CV-68] Decoupling Dark Knowledge via Block-wise Logit Distillation for Feature-level Alignment
Link: https://arxiv.org/abs/2411.01547
Authors: Chengting Yu,Fengzhao Zhang,Ruizhe Chen,Zuozhu Liu,Shurun Tan,Er-Ping Li,Aili Wang
Keywords: transfers dark knowledge, larger teacher network, teacher network guiding, smaller student network, dark knowledge
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract:Knowledge Distillation (KD), a learning paradigm in which a larger teacher network guides a smaller student network, transfers dark knowledge from the teacher to the student via logits or intermediate features, with the aim of producing a well-performing lightweight model. Notably, many subsequent feature-based KD methods outperformed the earliest logit-based KD method and iteratively generated numerous state-of-the-art distillation methods. Nevertheless, recent work has uncovered the potential of the logit-based method, bringing the simple KD form based on logits back into the limelight. Features or logits? Each implements KD from an entirely distinct perspective; therefore, choosing between logits and features is not straightforward. This paper provides a unified perspective of feature alignment in order to obtain a better comprehension of their fundamental distinction. Inheriting the design philosophy and insights of feature-based and logit-based methods, we introduce a block-wise logit distillation framework to apply implicit logit-based feature alignment by gradually replacing teacher’s blocks as intermediate stepping-stone models to bridge the gap between the student and the teacher. Our method obtains comparable or superior results to state-of-the-art distillation methods. This paper demonstrates the great potential of combining logit and features, and we hope it will inspire future research to revisit KD from a higher vantage point.
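For readers unfamiliar with the logit-based baseline this abstract refers to, the following minimal sketch shows the classic form: a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. It is the standard logit-distillation loss, not the block-wise framework proposed in the paper, and the temperature value and toy logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [5.0, 2.0, 0.5]
print(kd_loss(teacher, teacher))              # a student that matches the teacher
print(kd_loss([0.5, 2.0, 5.0], teacher) > 0)  # a mismatched student is penalized
```

A higher temperature flattens both distributions, so the relative scores of non-target classes (the "dark knowledge") contribute more to the loss.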
[CV-69] Towards Small Object Editing: A Benchmark Dataset and A Training-Free Approach
Link: https://arxiv.org/abs/2411.01545
Authors: Qihe Pan,Zhen Zhao,Zicheng Wang,Sifan Long,Yiming Wu,Wei Ji,Haoran Liang,Ronghua Liang
Keywords: large-scale diffusion-based generative, Stable Diffusion, diffusion-based generative models, small object generation, small object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 8 figures, Accepted by ACMMM 2024
Abstract:A plethora of text-guided image editing methods has recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models, especially Stable Diffusion. Despite the success of diffusion models in producing high-quality images, their application to small object generation has been limited due to difficulties in aligning cross-modal attention maps between text and these objects. Our approach offers a training-free method that significantly mitigates this alignment issue with local and global attention guidance, enhancing the model’s ability to accurately render small objects in accordance with textual descriptions. We detail the methodology in our approach, emphasizing its divergence from traditional generation techniques and highlighting its advantages. More importantly, we also provide SOEBench (Small Object Editing), a standardized benchmark for quantitatively evaluating text-based small object generation, collected from MSCOCO and OpenImage. Preliminary results demonstrate the effectiveness of our method, showing marked improvements in the fidelity and accuracy of small object generation compared to existing models. This advancement not only contributes to the field of AI and computer vision but also opens up new possibilities for applications in various industries where precise image generation is critical. We will release our dataset on our project page: this https URL.
[CV-70] FactorizePhys: Matrix Factorization for Multidimensional Attention in Remote Physiological Sensing NEURIPS
Link: https://arxiv.org/abs/2411.01542
Authors: Jitesh Joshi,Sos S. Agaian,Youngjun Cho
Keywords: Remote photoplethysmography, enables non-invasive extraction, transforming spatial-temporal data, time series signals, enables non-invasive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS, 2024
Abstract:Remote photoplethysmography (rPPG) enables non-invasive extraction of blood volume pulse signals through imaging, transforming spatial-temporal data into time series signals. Advances in end-to-end rPPG approaches have focused on this transformation where attention mechanisms are crucial for feature extraction. However, existing methods compute attention disjointly across spatial, temporal, and channel dimensions. Here, we propose the Factorized Self-Attention Module (FSAM), which jointly computes multidimensional attention from voxel embeddings using nonnegative matrix factorization. To demonstrate FSAM’s effectiveness, we developed FactorizePhys, an end-to-end 3D-CNN architecture for estimating blood volume pulse signals from raw video frames. Our approach adeptly factorizes voxel embeddings to achieve comprehensive spatial, temporal, and channel attention, enhancing performance of generic signal extraction tasks. Furthermore, we deploy FSAM within an existing 2D-CNN-based rPPG architecture to illustrate its versatility. FSAM and FactorizePhys are thoroughly evaluated against state-of-the-art rPPG methods, each representing different types of architecture and attention mechanism. We perform ablation studies to investigate the architectural decisions and hyperparameters of FSAM. Experiments on four publicly available datasets and intuitive visualization of learned spatial-temporal features substantiate the effectiveness of FSAM and enhanced cross-dataset generalization in estimating rPPG signals, suggesting its broader potential as a multidimensional attention mechanism. The code is accessible at this https URL.
[CV-71] InstantGeoAvatar: Effective Geometry and Appearance Modeling of Animatable Avatars from Monocular Video ACCV2024
Link: https://arxiv.org/abs/2411.01512
Authors: Alvaro Budria,Adrian Lopez-Rodriguez,Oscar Lorente,Francesc Moreno-Noguer
Keywords: animatable implicit human, implicit human avatars, video of detailed, method for efficient, efficient and effective
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted as poster to Asian Conference on Computer Vision (ACCV 2024)
Abstract:We present InstantGeoAvatar, a method for efficient and effective learning from monocular video of detailed 3D geometry and appearance of animatable implicit human avatars. Our key observation is that the optimization of a hash grid encoding to represent a signed distance function (SDF) of the human subject is fraught with instabilities and bad local minima. We thus propose a principled geometry-aware SDF regularization scheme that seamlessly fits into the volume rendering pipeline and adds negligible computational overhead. Our regularization scheme significantly outperforms previous approaches for training SDFs on hash grids. We obtain competitive results in geometry reconstruction and novel view synthesis in as little as five minutes of training time, a significant reduction from the several hours required by previous work. InstantGeoAvatar represents a significant leap forward towards achieving interactive reconstruction of virtual avatars.
[CV-72] Object segmentation from common fate: Motion energy processing enables human-like zero-shot generalization to random dot stimuli NEURIPS2024
Link: https://arxiv.org/abs/2411.01505
Authors: Matthias Tangemann,Matthias Kümmerer,Matthias Bethge
Keywords: optical flow models, segmenting moving objects, common fate, motion energy model, optical flow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2024
Abstract:Humans excel at detecting and segmenting moving objects according to the Gestalt principle of “common fate”. Remarkably, previous works have shown that human perception generalizes this principle in a zero-shot fashion to unseen textures or random dots. In this work, we seek to better understand the computational basis for this capability by evaluating a broad range of optical flow models and a neuroscience inspired motion energy model for zero-shot figure-ground segmentation of random dot stimuli. Specifically, we use the extensively validated motion energy model proposed by Simoncelli and Heeger in 1998 which is fitted to neural recordings in cortex area MT. We find that a cross section of 40 deep optical flow models trained on different datasets struggle to estimate motion patterns in random dot videos, resulting in poor figure-ground segmentation performance. Conversely, the neuroscience-inspired model significantly outperforms all optical flow models on this task. For a direct comparison to human perception, we conduct a psychophysical study using a shape identification task as a proxy to measure human segmentation performance. All state-of-the-art optical flow models fall short of human performance, but only the motion energy model matches human capability. This neuroscience-inspired model successfully addresses the lack of human-like zero-shot generalization to random dot stimuli in current computer vision models, and thus establishes a compelling link between the Gestalt psychology of human object perception and cortical motion processing in the brain. Code, models and datasets are available at this https URL
[CV-73] Polar R-CNN: End-to-End Lane Detection with Fewer Anchors
Link: https://arxiv.org/abs/2411.01499
Authors: Shengqi Wang,Junmin Liu,Xiangyong Cao,Zengjie Song,Kai Sun
Keywords: complicating detection efforts, autonomous driving, critical and challenging, challenging task, task in autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and subsequently refine the location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large number of dense anchors. Furthermore, the use of Non-Maximum Suppression (NMS) to eliminate redundant predictions complicates real-world deployment and may underperform in complex scenarios. In this paper, we propose Polar R-CNN, an end-to-end anchor-based method for lane detection. By incorporating both local and global polar coordinate systems, Polar R-CNN facilitates flexible anchor proposals and significantly reduces the number of anchors required without compromising performance. Additionally, we introduce a triplet head with a heuristic structure that supports an NMS-free paradigm, enhancing deployment efficiency and performance in scenarios with dense lanes. Our method achieves competitive results on five popular lane detection benchmarks: Tusimple, CULane, LLAMAS, CurveLanes, and DL-Rail, while maintaining a lightweight design and straightforward structure. Our source code is available at this https URL.
[CV-74] Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation ECCV2024
Link: https://arxiv.org/abs/2411.01494
Authors: Seongsu Ha,Chaeyun Kim,Donghwa Kim,Junho Lee,Sangho Lee,Joonseok Lee
Keywords: Referring Image Segmentation, Image Segmentation, textual query, comprehensive task, object referred
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024. Project page: this https URL
Abstract:Referring Image Segmentation (RIS) is a comprehensive task to segment an object referred to by a textual query from an image. Naturally, the level of difficulty in this task is affected by the existence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We posit that the bottleneck exists in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, making it neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities and to concretely understand the whole expression to locate the right target better. Our approach shows consistent improvements on various datasets and models, verified by extensive experiments.
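The mosaic step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: images are equal-sized 2D lists, the similarity scores are supplied by the caller (in the paper they come from a pretrained alignment model such as CLIP, which is not invoked here), and the helper names `pick_negatives` and `make_mosaic` as well as the similarity band are hypothetical.

```python
import random


def pick_negatives(similarities, low, high, k=3):
    """Indices of up to k candidates whose similarity lies in [low, high].

    The band is a stand-in for "neither too ambiguous nor too trivial";
    the most similar in-band candidates are preferred.
    """
    in_band = [i for i, s in enumerate(similarities) if low <= s <= high]
    in_band.sort(key=lambda i: similarities[i], reverse=True)
    return in_band[:k]


def make_mosaic(target, negatives, rng=None):
    """Tile one target image and three negatives into a 2x2 mosaic.

    Returns the mosaic and the quadrant of the target
    (0=top-left, 1=top-right, 2=bottom-left, 3=bottom-right),
    so that the segmentation mask can be shifted to match.
    """
    if len(negatives) != 3:
        raise ValueError("expected exactly three negative images")
    rng = rng or random.Random()
    tiles = list(negatives)
    slot = rng.randrange(4)
    tiles.insert(slot, target)
    top = [a + b for a, b in zip(tiles[0], tiles[1])]
    bottom = [a + b for a, b in zip(tiles[2], tiles[3])]
    return top + bottom, slot


target = [[1, 1], [1, 1]]
candidates = [[[v, v], [v, v]] for v in (2, 3, 4, 5, 6)]
scores = [0.9, 0.5, 0.4, 0.1, 0.45]  # assumed query-to-image similarities
chosen = pick_negatives(scores, low=0.3, high=0.6)
mosaic, slot = make_mosaic(target, [candidates[i] for i in chosen],
                           rng=random.Random(0))
print(len(mosaic), len(mosaic[0]), slot)
```

Placing the target in a random quadrant keeps the model from learning a positional shortcut; the ground-truth mask has to be embedded in the same quadrant of an otherwise empty mosaic-sized mask.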
[CV-75] EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark
Link: https://arxiv.org/abs/2411.01492
Authors: Ming Li,Jike Zhong,Tianle Chen,Yuxiang Lai,Konstantinos Psounis
Keywords: demonstrated promising skills, Recent studies, large language models, domains including science, large language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint
Abstract:Recent studies on large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains including science and mathematics. However, their capability in more challenging and real-world related scenarios like engineering has not been systematically studied. To bridge this gap, we propose EEE-Bench, a multimodal benchmark aimed at assessing LMMs’ capabilities in solving practical engineering tasks, using electrical and electronics engineering (EEE) as the testbed. Our benchmark consists of 2860 carefully curated problems spanning 10 essential subdomains such as analog circuits, control systems, etc. Compared to benchmarks in other domains, engineering problems are intrinsically 1) more visually complex and versatile and 2) less deterministic in solutions. Successful solutions to these problems often demand more-than-usual rigorous integration of visual and textual information as models need to understand intricate images like abstract circuits and system diagrams while taking professional instructions, making them excellent candidates for LMM evaluations. Alongside EEE-Bench, we provide extensive quantitative evaluations and fine-grained analysis of 17 widely-used open and closed-sourced LLMs and LMMs. Our results demonstrate notable deficiencies of current foundation models in EEE, with an average performance ranging from 19.48% to 46.78%. Finally, we reveal and explore a critical shortcoming in LMMs which we term laziness: the tendency to take shortcuts by relying on the text while overlooking the visual context when reasoning for technical image problems. In summary, we believe EEE-Bench not only reveals some noteworthy limitations of LMMs but also provides a valuable resource for advancing research on their application in practical engineering tasks, driving future improvements in their capability to handle complex, real-world scenarios.
[CV-76] Efficient Medical Image Retrieval Using DenseNet and FAISS for BIRADS Classification
Link: https://arxiv.org/abs/2411.01473
Authors: MD Shaikh Rahman,Feiroz Humayara,Syed Maudud E Rabbi,Muhammad Mahbubur Rashid
Keywords: images, medical, medical field, retrieval, database
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 5 figures
Abstract:The datasets used in today's research are especially vast in the medical field. Different types of medical images, such as X-rays, MRI, and CT scans, take up large amounts of space. This volume of data makes accessing and retrieving specific images challenging because of the size of the database. As the database continues to grow, an efficient image retrieval system is essential to save time and resources. In this paper, we propose an approach to medical image retrieval that uses DenseNet for feature extraction and FAISS for similarity search. DenseNet is well-suited for feature extraction in complex medical images and FAISS enables efficient handling of high-dimensional data in large-scale datasets. Unlike existing methods focused solely on classification accuracy, our method prioritizes both retrieval speed and diagnostic relevance, addressing a critical gap in real-time case comparison for radiologists. We applied the classification of breast cancer images using the BIRADS system. We utilized DenseNet’s powerful feature representation and FAISS's efficient indexing capabilities to achieve high precision and recall in retrieving relevant images for diagnosis. We experimented on a dataset of 2006 images from the Categorized Digital Database for Low Energy and Subtracted Contrast Enhanced Spectral Mammography (CDD-CESM) images available on The Cancer Imaging Archive (TCIA). Our method outperforms conventional retrieval techniques, achieving a precision of 80% at k=5 for BIRADS classification. The dataset includes annotated CESM images and medical reports, providing a comprehensive foundation for our research.
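The retrieval pipeline and the precision-at-k figure quoted above can be illustrated with a small stand-in. The sketch below uses brute-force cosine similarity in pure Python in place of a FAISS index, and hand-written 2D vectors in place of DenseNet features; the feature values, labels, and function names are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k):
    """Indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(index)),
                    key=lambda i: cosine(index[i], query),
                    reverse=True)
    return ranked[:k]


def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved items sharing the query's label."""
    top = retrieved_labels[:k]
    return sum(1 for label in top if label == query_label) / k


# Toy "features" and BIRADS-like category labels (assumed values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
labels = [2, 2, 4, 4, 2]

hits = search(features, query=[1.0, 0.05], k=3)
print(hits)
print(precision_at_k([labels[i] for i in hits], query_label=2, k=3))
```

Brute-force search scans every stored vector per query; an approximate or quantized index trades a little accuracy for sublinear lookup, which is where the speed advantage described in the abstract would come from.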
[CV-77] Exploring PCA-based feature representations of image pixels via CNN to enhance food image segmentation
Link: https://arxiv.org/abs/2411.01469
Authors: Ying Dai
Keywords: open vocabulary recognition, pixel-level feature representations, feature representations, crucial step, open vocabulary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Abstract:For open vocabulary recognition of ingredients in food images, segmenting the ingredients is a crucial step. This paper proposes a novel approach that explores PCA-based feature representations of image pixels using a convolutional neural network (CNN) to enhance segmentation. An internal clustering metric based on the silhouette score is defined to evaluate the clustering quality of various pixel-level feature representations generated by different feature maps derived from various CNN backbones. Using this metric, the paper explores optimal feature representation selection and suitable clustering methods for ingredient segmentation. Additionally, it is found that principal component (PC) maps derived from concatenations of backbone feature maps improve the clustering quality of pixel-level feature representations, resulting in stable segmentation outcomes. Notably, the number of selected eigenvalues can be used as the number of clusters to achieve good segmentation results. The proposed method performs well on the ingredient-labeled dataset FoodSeg103, achieving a mean Intersection over Union (mIoU) score of 0.5423. Importantly, the proposed method is unsupervised, and pixel-level feature representations from backbones are not fine-tuned on specific datasets. This demonstrates the flexibility, generalizability, and interpretability of the proposed method, while reducing the need for extensive labeled datasets.
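Since the clustering metric above builds on the silhouette score, a compact reference implementation may help. The sketch computes the standard mean silhouette coefficient with Euclidean distance; it is not the paper's customized metric, and the toy 2D points stand in for pixel-level feature vectors.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(i, own)   # cohesion: distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across it
print(good > bad)
```

Scores near 1 indicate compact, well-separated clusters, while negative scores indicate points that sit closer to another cluster than to their own; this is what makes the measure usable for comparing feature representations without labels.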
[CV-78] Efficient Non-Exemplar Class-Incremental Learning with Retrospective Feature Synthesis
Link: https://arxiv.org/abs/2411.01465
Authors: Liang Bai,Hong Song,Yucong Lin,Tianyu Fu,Deqiang Xiao,Danni Ai,Jingfan Fan,Jian Yang
Keywords: neural networks suffer, continuous data streams, deep neural networks, individual tasks, real-world scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures
Abstract:Despite the outstanding performance in many individual tasks, deep neural networks suffer from catastrophic forgetting when learning from continuous data streams in real-world scenarios. Current Non-Exemplar Class-Incremental Learning (NECIL) methods mitigate forgetting by storing a single prototype per class, which serves to inject previous information when sequentially learning new classes. However, these stored prototypes or their augmented variants often fail to simultaneously capture spatial distribution diversity and precision needed for representing old classes. Moreover, as the model acquires new knowledge, these prototypes gradually become outdated, making them less effective. To overcome these limitations, we propose a more efficient NECIL method that replaces prototypes with synthesized retrospective features for old classes. Specifically, we model each old class’s feature space using a multivariate Gaussian distribution and generate deep representations by sampling from high-likelihood regions. Additionally, we introduce a similarity-based feature compensation mechanism that integrates generated old class features with similar new class features to synthesize robust retrospective representations. These retrospective features are then incorporated into our incremental learning framework to preserve the decision boundaries of previous classes while learning new ones. Extensive experiments on CIFAR-100, TinyImageNet, and ImageNet-Subset demonstrate that our method significantly improves the efficiency of non-exemplar class-incremental learning and achieves state-of-the-art performance.
[CV-79] HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
Link: https://arxiv.org/abs/2411.01455
Authors: Zirui Wang,Xinran Zhao,Simon Stepputtis,Woojun Kim,Tongshuang Wu,Katia Sycara,Yaqi Xie
Keywords: Understanding and predicting, predicting human actions, long-standing challenge, crucial measure, measure of perception
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:
Abstract:Understanding and predicting human actions has been a long-standing challenge and is a crucial measure of perception in robotics AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real-world human activity: interactions. To address this gap in human-like forecasting within multi-agent environments, we present the Hierarchical Memory-Aware Transformer (HiMemFormer), a transformer-based model for online multi-agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent-specific features based on these global representations using a coarse-to-fine strategy. In contrast to previous approaches, HiMemFormer applies the global context hierarchically, together with agent-specific preferences, to avoid noisy or redundant information in multi-agent action anticipation. Extensive experiments on various multi-agent scenarios demonstrate the strong performance of HiMemFormer compared with other state-of-the-art methods.
[CV-80] A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Link: https://arxiv.org/abs/2411.01445
Authors: Fei Wang,Chengcheng Chen,Hongyu Chen,Yugang Chang,Weiming Zeng
Keywords: Current visual question, Synthetic Aperture Radar, require constructing multimodal, demands significant time, visual question answering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models, which demands significant time and resources. This has greatly hindered the application of VQA to downstream tasks, such as ship information analysis based on Synthetic Aperture Radar (SAR) imagery. To address this challenge, this letter proposes a novel VQA approach that integrates object detection networks with visual language models, specifically designed for analyzing ships in SAR images. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis, as well as risk behavior detection. Initially, we conducted baseline experiments using YOLO networks on two representative SAR ship detection datasets, SSDD and HRSID, to assess each model’s performance in terms of detection accuracy. Based on these results, we selected the optimal model, YOLOv8n, as the most suitable detection network for this task. Subsequently, leveraging the vision-language model Qwen2-VL, we designed and implemented a VQA task specifically for SAR scenes. This task employs the ship location and size information output by the detection network to generate multi-turn dialogues and scene descriptions for SAR imagery. Experimental results indicate that this method not only enables fundamental SAR scene question-answering without the need for additional datasets or fine-tuning but also dynamically adapts to complex, multi-turn dialogue requirements, demonstrating robust semantic understanding and adaptability.
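The key step described above is passing the detector's ship locations and sizes to a vision-language model as text. A minimal sketch of that hand-off follows; the prompt wording, the quadrant scheme, and the function `describe_detections` are hypothetical and are not taken from the paper, and no actual YOLO or Qwen2-VL call is made.

```python
def describe_detections(boxes, image_w, image_h):
    """Turn (x1, y1, x2, y2) pixel boxes into a textual scene summary.

    The summary could be prepended to a user question before it is sent
    to a vision-language model, so no fine-tuning is needed.
    """
    header = (
        f"The detector found {len(boxes)} ship(s) "
        f"in a {image_w}x{image_h} SAR image."
    )
    lines = [header]
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        horiz = "left" if cx < image_w / 2 else "right"
        vert = "upper" if cy < image_h / 2 else "lower"
        lines.append(
            f"Ship {i}: {vert}-{horiz} region, size {x2 - x1}x{y2 - y1} px"
        )
    return "\n".join(lines)

# Toy detections (assumed values).
prompt = describe_detections([(10, 10, 30, 20), (70, 80, 90, 95)], 100, 100)
```

Questions about ship density or relative position can then be answered from the text context alone, which matches the abstract's claim that no multimodal dataset construction is required.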
[CV-81] Activating Self-Attention for Multi-Scene Absolute Pose Regression NEURIPS2024
Link: https://arxiv.org/abs/2411.01443
Authors: Miso Lee,Jihwan Kim,Jae-Pil Heo
Keywords: Multi-scene absolute pose, absolute pose regression, pose regression addresses, memory-efficient camera pose, camera pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024
Abstract:Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, transformer-based models have been devised to regress the camera pose directly across multiple scenes. Despite their potential, transformer encoders are underutilized because the self-attention map collapses, leaving it with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped into completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of the query-key space and encouraging the model to find global relations through self-attention. In addition, fixed sinusoidal positional encoding is adopted instead of an undertrained learnable one to inject appropriate positional clues into the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.
[CV-82] Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning
Link: https://arxiv.org/abs/2411.01432
Authors: Fei Zhou,Peng Wang,Lei Zhang,Zhenghua Chen,Wei Wei,Chen Ding,Guosheng Lin,Yanning Zhang
Keywords: synthetic FSL tasks, synthetic FSL, FSL tasks, source domain, FSL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Meta-learning offers a promising avenue for few-shot learning (FSL), enabling models to glean a generalizable feature embedding through episodic training on synthetic FSL tasks in a source domain. Yet, in practical scenarios where the target task diverges from that in the source domain, meta-learning-based methods are susceptible to over-fitting. To overcome this, we introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning, which is crafted to comprehensively exploit the cross-domain transferable image prior that each image can be decomposed into complementary low-frequency content details and high-frequency robust structural characteristics. Motivated by this insight, we propose to decompose each query image into its high-frequency and low-frequency components and incorporate them in parallel into the feature embedding network to enhance the final category prediction. More importantly, we introduce a feature reconstruction prior and a prediction consistency prior to separately encourage the consistency of the intermediate feature as well as the final category prediction between the original query image and its decomposed frequency components. This allows for collectively guiding the network’s meta-learning process with the aim of learning generalizable image feature embeddings, while not introducing any extra computational cost in the inference phase. Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.
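The decomposition of an image into low- and high-frequency parts can be sketched as follows. The abstract does not say which filter the authors use; this example assumes a simple box blur as the low-pass filter and defines the high-frequency part as the residual, so the two components always sum back to the original image. All names are illustrative.

```python
def box_blur(img, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window, clipped at borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out

def decompose(img, radius=1):
    """Split a grayscale image into (low, high) with low + high == img."""
    low = box_blur(img, radius)
    high = [[p - l for p, l in zip(row_p, row_l)] for row_p, row_l in zip(img, low)]
    return low, high

# Toy image (assumed): a vertical edge between a dark and a bright half.
img = [[0.0, 0.0, 1.0, 1.0],
       [0.0, 0.0, 1.0, 1.0]]
low, high = decompose(img)

# A constant image has no high-frequency content.
flat_low, flat_high = decompose([[0.5] * 3 for _ in range(3)])
```

In the toy image the high-frequency component is nonzero only around the edge, which is the kind of structural cue the abstract describes as robust across domains.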
[CV-83] Mapping Global Floods with 10 Years of Satellite Radar Data
Link: https://arxiv.org/abs/2411.01411
Authors: Amit Misra,Kevin White,Simone Fobi Nsutezo,William Straka,Juan Lavista
Keywords: global damage annually, extensive global damage, effective monitoring essential, making effective monitoring, Synthetic Aperture Radar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, submitted
Abstract:Floods cause extensive global damage annually, making effective monitoring essential. While satellite observations have proven invaluable for flood detection and tracking, comprehensive global flood datasets spanning extended time periods remain scarce. In this study, we introduce a novel deep learning flood detection model that leverages the cloud-penetrating capabilities of Sentinel-1 Synthetic Aperture Radar (SAR) satellite imagery, enabling consistent flood extent mapping in any weather condition. By applying this model to nearly 10 years of SAR data, we create a unique, longitudinal global flood extent dataset with predictions unaffected by cloud coverage, offering comprehensive and consistent insights into historically flood-prone areas over the past decade. We use our model predictions to identify historically flood-prone areas in Ethiopia and demonstrate real-time disaster response capabilities during the May 2024 floods in Kenya. Additionally, our longitudinal analysis reveals potential increasing trends in global flood extent over time, although further validation is required to explore links to climate change. To maximize impact, we provide public access to both our model predictions and a code repository, empowering researchers and practitioners worldwide to advance flood monitoring and enhance disaster response strategies.
[CV-84] MambaReg: Mamba-Based Disentangled Convolutional Sparse Coding for Unsupervised Deformable Multi-Modal Image Registration
Link: https://arxiv.org/abs/2411.01399
Authors: Kaiang Wen,Bin Xie,Bin Duan,Yan Yan
Keywords: inherent feature discrepancies, feature discrepancies poses, discrepancies poses, poses a pivotal, deformable image registration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Precise alignment of multi-modal images with inherent feature discrepancies poses a pivotal challenge in deformable image registration. Traditional learning-based approaches often treat registration networks as black boxes without interpretability. One core insight is that disentangling alignment features and non-alignment features across modalities brings benefits. Meanwhile, it is challenging for the prominent methods for image registration tasks, such as convolutional neural networks, to capture long-range dependencies with their local receptive fields. These methods often fail when the given image pair has a large misalignment, because they do not effectively learn long-range dependencies and correspondences. In this paper, we propose MambaReg, a novel Mamba-based architecture that integrates Mamba’s strong capability in capturing long sequences to address these challenges. With several proposed sub-modules, MambaReg can effectively disentangle modality-independent features responsible for registration from modality-dependent, non-aligning features. By selectively attending to the relevant features, our network adeptly captures the correlation between multi-modal images, enabling focused deformation field prediction and precise image alignment. The Mamba-based architecture seamlessly integrates the local feature extraction power of convolutional layers with the long-range dependency modeling capabilities of Mamba. Experiments on public non-rigid RGB-IR image datasets demonstrate the superiority of our method, outperforming existing approaches in terms of registration accuracy and deformation field smoothness.
[CV-85] A New Logic For Pediatric Brain Tumor Segmentation
Link: https://arxiv.org/abs/2411.01390
Authors: Max Bengtsson,Elif Keles,Gorkem Durak,Syed Anwar,Yuri S. Velichko,Marius G. Linguraru,Angela J. Waanders,Ulas Bagci
Keywords: deep learning architecture, pediatric brain tumors, segmenting pediatric brain, PED BraTS, Children Brain Tumor
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:In this paper, we present a novel approach for segmenting pediatric brain tumors using a deep learning architecture, inspired by expert radiologists’ segmentation strategies. Our model delineates four distinct tumor labels and is benchmarked on a held-out PED BraTS 2024 test set (i.e., pediatric brain tumor datasets introduced by BraTS). Furthermore, we evaluate our model’s performance against the state-of-the-art (SOTA) model using a new external dataset of 30 patients from CBTN (Children’s Brain Tumor Network), labeled in accordance with the PED BraTS 2024 guidelines. We compare segmentation outcomes with the winning algorithm from the PED BraTS 2023 challenge as the SOTA model. Our proposed algorithm achieved an average Dice score of 0.642 and an HD95 of 73.0 mm on the CBTN test data, outperforming the SOTA model, which achieved a Dice score of 0.626 and an HD95 of 84.0 mm. Our results indicate that the proposed model is a step towards providing more accurate segmentation for pediatric brain tumors, which is essential for evaluating therapy response and monitoring patient progress.
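For readers unfamiliar with the Dice score quoted above, here is a short sketch of how it is computed for one binary mask. It uses the standard definition 2|P∩T| / (|P| + |T|); the flat-list mask format and the convention for two empty masks are assumptions made for illustration, and HD95 is not covered.

```python
def dice_score(pred, truth):
    """Dice coefficient for two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks: one of two predicted voxels overlaps the ground truth.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A per-label score like this is typically averaged over tumor labels and patients to produce a single figure such as the 0.642 reported on the CBTN data.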
[CV-86] Optimizing Violence Detection in Video Classification Accuracy through 3D Convolutional Neural Networks
Link: https://arxiv.org/abs/2411.01348
Authors: Aarjav Kavathia,Simeon Sayer
Keywords: violent crimes continue, rapidly identify moments, continue to happen, violent crimes, crimes continue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures
Abstract:As violent crimes continue to happen, it becomes necessary to have security cameras that can rapidly identify moments of violence with excellent accuracy. The purpose of this study is to identify how many frames should be analyzed at a time in order to optimize a violence detection model’s accuracy as a parameter of the depth of a 3D convolutional network. Previous violence classification models have been created, but their application to live footage may be flawed. In this project, a convolutional neural network was created to analyze optical flow frames of each video. The number of frames analyzed at a time would vary with one, two, three, ten, and twenty frames, and each model would be trained for 20 epochs. The greatest validation accuracy was 94.87% and occurred with the model that analyzed three frames at a time. This means that machine learning models to detect violence may function better when analyzing three frames at a time for this dataset. The methodology used to identify the optimal number of frames to analyze at a time could be used in other applications of video classification, especially those of complex or abstract actions, such as violence.
[CV-87] Diffusion Models as Cartoonists! The Curious Case of High Density Regions
Link: https://arxiv.org/abs/2411.01293
Authors: Rafał Karczewski,Markus Heinonen,Vikas Garg
Keywords: investigate what kind, high-density regions, images lie, diffusion models, higher likelihood
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We investigate what kind of images lie in the high-density regions of diffusion models. We introduce a theoretical mode-tracking process capable of pinpointing the exact mode of the denoising distribution, and we propose a practical high-probability sampler that consistently generates images of higher likelihood than usual samplers. Our empirical findings reveal the existence of significantly higher likelihood samples that typical samplers do not produce, often manifesting as cartoon-like drawings or blurry images depending on the noise level. Curiously, these patterns emerge in datasets devoid of such examples. We also present a novel approach to track sample likelihoods in diffusion SDEs, which remarkably incurs no additional computational cost.
[CV-88] Confidence Aware Learning for Reliable Face Anti-spoofing
Link: https://arxiv.org/abs/2411.01263
Authors: Xingming Long,Jie Zhang,Shiguang Shan
Keywords: Current Face Anti-spoofing, Aware Face Anti-spoofing, make overly confident, encountering unfamiliar scenarios, Face Anti-spoofing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: v1
Abstract:Current Face Anti-spoofing (FAS) models tend to make overly confident predictions even when encountering unfamiliar scenarios or unknown presentation attacks, which leads to serious potential risks. To solve this problem, we propose a Confidence Aware Face Anti-spoofing (CA-FAS) model, which is aware of its capability boundary, thus achieving reliable liveness detection within this boundary. To enable the CA-FAS to “know what it doesn’t know”, we propose to estimate its confidence during the prediction of each sample. Specifically, we build Gaussian distributions for both the live faces and the known attacks. The prediction confidence for each sample is subsequently assessed using the Mahalanobis distance between the sample and the Gaussians for the “known data”. We further introduce the Mahalanobis distance-based triplet mining to optimize the parameters of both the model and the constructed Gaussians as a whole. Extensive experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence and thus achieve much more reliable performance than other FAS models by filtering out samples that are beyond its reliable range.
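The confidence estimate described above rests on the Mahalanobis distance between a sample and Gaussians fitted to known data. The sketch below illustrates that idea with a rejection threshold. It simplifies heavily: the covariance is assumed diagonal, the features are 2-D toy values, and the threshold and function names are hypothetical; the paper's triplet mining and joint optimization are not shown.

```python
import math

def fit_diag_gaussian(samples):
    """Fit per-dimension mean and variance (diagonal covariance assumed)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    # A small epsilon keeps the variance strictly positive.
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n + 1e-6 for i in range(d)]
    return mean, var

def mahalanobis(x, mean, var):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def predict_with_rejection(x, gaussians, threshold):
    """Return (label, distance); label is None when x is far from all classes."""
    label, dist = min(
        ((name, mahalanobis(x, mean, var)) for name, (mean, var) in gaussians.items()),
        key=lambda pair: pair[1],
    )
    return (label if dist <= threshold else None), dist

# Toy features (assumed) for the two kinds of "known data".
gaussians = {
    "live": fit_diag_gaussian([(0, 0), (1, 0), (0, 1), (1, 1)]),
    "known_attack": fit_diag_gaussian([(10, 10), (11, 10), (10, 11), (11, 11)]),
}
near_label, near_dist = predict_with_rejection((0.4, 0.6), gaussians, threshold=3.0)
far_label, far_dist = predict_with_rejection((5.0, 5.0), gaussians, threshold=3.0)
```

The sample halfway between the two clusters is rejected rather than forced into a class, which mirrors the "know what it doesn’t know" behaviour the abstract aims for.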
[CV-89] MonoPlane: Exploiting Monocular Geometric Cues for Generalizable 3D Plane Reconstruction IROS2024
Link: https://arxiv.org/abs/2411.01226
Authors: Wang Zhao,Jiachen Liu,Sheng Zhang,Yishu Li,Sili Chen,Sharon X Huang,Yong-Jin Liu,Hengkai Guo
Keywords: framework named MonoPlane, presents a generalizable, paper presents, reconstruction framework named, plane
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: IROS 2024 (oral)
Abstract:This paper presents a generalizable 3D plane detection and reconstruction framework named MonoPlane. Unlike previous robust estimator-based works (which require multiple images or RGB-D input) and learning-based works (which suffer from domain shift), MonoPlane combines the best of two worlds and establishes a plane reconstruction pipeline based on monocular geometric cues, resulting in accurate, robust and scalable 3D plane detection and reconstruction in the wild. Specifically, we first leverage large-scale pre-trained neural networks to obtain the depth and surface normals from a single image. These monocular geometric cues are then incorporated into a proximity-guided RANSAC framework to sequentially fit each plane instance. We exploit effective 3D point proximity and model such proximity via a graph within RANSAC to guide the plane fitting from noisy monocular depths, followed by image-level multi-plane joint optimization to improve the consistency among all plane instances. We further design a simple but effective pipeline to extend this single-view solution to sparse-view 3D plane reconstruction. Extensive experiments on a list of datasets demonstrate our superior zero-shot generalizability over baselines, achieving state-of-the-art plane reconstruction performance in a transferring setting. Our code is available at this https URL .
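As background for the plane-fitting step, the following sketch shows plain RANSAC fitting a single plane to a 3D point set. It is the textbook baseline only: the proximity graph that guides sampling in MonoPlane, sequential multi-plane fitting, and the joint optimization are not implemented, and the tolerance, iteration count, and toy points are assumptions.

```python
import random

def plane_from_points(p, q, r):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-9:
        return None
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))

def ransac_plane(points, iters=200, tol=0.05, seed=0):
    """Keep the 3-point hypothesis with the most points within `tol`."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = [p for p in points
                   if abs(sum(n[i] * p[i] for i in range(3)) + d) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy scene (assumed): a 5x5 grid on the plane z = 0 plus five outliers.
points = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
points += [(0.5, 0.5, 1.0), (1.5, 2.5, 2.0), (3.2, 1.1, -1.5),
           (2.2, 3.3, 0.7), (4.1, 0.2, 3.0)]
(normal, offset), inliers = ransac_plane(points)
```

Uniform random sampling wastes many iterations on triples that span different surfaces; restricting samples to spatially close points, as the abstract describes, is one way to reduce that waste on noisy monocular depth.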
[CV-90] RLE: A Unified Perspective of Data Augmentation for Cross-Spectral Re-identification NEURIPS2024
Link: https://arxiv.org/abs/2411.01225
Authors: Lei Tan,Yukang Zhang,Keke Han,Pingyang Dai,Yan Zhang,Yongjian Wu,Rongrong Ji
Keywords: Random Linear Enhancement, Radical Random Linear, Random Linear, Moderate Random Linear, Linear Enhancement
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024
Abstract:This paper makes a step towards modeling the modality discrepancy in the cross-spectral re-identification task. Based on the Lambertian model, we observe that the non-linear modality discrepancy mainly comes from diverse linear transformations acting on the surface of different materials. From this view, we unify all data augmentation strategies for cross-spectral re-identification by mimicking such local linear transformations and categorizing them into moderate transformation and radical transformation. By extending the observation, we propose a Random Linear Enhancement (RLE) strategy which includes Moderate Random Linear Enhancement (MRLE) and Radical Random Linear Enhancement (RRLE) to push the boundaries of both types of transformation. Moderate Random Linear Enhancement is designed to provide diverse image transformations that satisfy the original linear correlations under constrained conditions, whereas Radical Random Linear Enhancement seeks to generate local linear transformations directly without relying on external information. The experimental results not only demonstrate the superiority and effectiveness of RLE but also confirm its great potential as a general-purpose data augmentation for cross-spectral re-identification. The code is available at this https URL.
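The idea of a local linear transformation can be illustrated with a very small augmentation: pick a random rectangle and scale its pixel intensities by a random factor. This is only a loose sketch in the spirit of the "radical" variant described above; the actual RLE formulation, its parameter ranges, and its patch selection are not given in the abstract, so everything below (function name, scale range, clamping) is an assumption.

```python
import random

def random_linear_patch(img, rng, scale_range=(0.5, 1.5)):
    """Scale intensities inside one random rectangle of a grayscale image.

    Returns the augmented copy, the patch bounds (y0, y1, x0, x1) with
    exclusive upper bounds, and the scale factor that was applied.
    """
    h, w = len(img), len(img[0])
    y0 = rng.randrange(h)
    y1 = rng.randrange(y0, h) + 1
    x0 = rng.randrange(w)
    x1 = rng.randrange(x0, w) + 1
    a = rng.uniform(*scale_range)

    out = [row[:] for row in img]
    for y in range(y0, y1):
        for x in range(x0, x1):
            # Linear map followed by clamping to the valid range [0, 1].
            out[y][x] = min(1.0, max(0.0, a * img[y][x]))
    return out, (y0, y1, x0, x1), a

# Toy image (assumed): uniform gray, fixed seed for repeatability.
img = [[0.5] * 4 for _ in range(4)]
aug, (y0, y1, x0, x1), a = random_linear_patch(img, random.Random(0))
```

Pixels outside the rectangle are left untouched, so the augmentation changes local appearance while keeping the overall image layout.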
[CV-91] Real-Time Spatio-Temporal Reconstruction of Dynamic Endoscopic Scenes with 4D Gaussian Splatting
Link: https://arxiv.org/abs/2411.01218
Authors: Fengze Li,Jishuai He,Jieming Ma,Zhijing Wu
Keywords: robotic minimally invasive, providing crucial spatial, minimally invasive surgery, crucial spatial information, enhances surgical precision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dynamic scene reconstruction is essential in robotic minimally invasive surgery, providing crucial spatial information that enhances surgical precision and outcomes. However, existing methods struggle to address the complex, temporally dynamic nature of endoscopic scenes. This paper presents ST-Endo4DGS, a novel framework that models the spatio-temporal volume of dynamic endoscopic scenes using unbiased 4D Gaussian Splatting (4DGS) primitives, parameterized by anisotropic ellipses with flexible 4D rotations. This approach enables precise representation of deformable tissue dynamics, capturing intricate spatial and temporal correlations in real time. Additionally, we extend spherindrical harmonics to represent time-evolving appearance, achieving realistic adaptations to lighting and view changes. A new endoscopic normal alignment constraint (ENAC) further enhances geometric fidelity by aligning rendered normals with depth-derived geometry. Extensive evaluations show that ST-Endo4DGS outperforms existing methods in both visual quality and real-time performance, establishing a new state-of-the-art in dynamic scene reconstruction for endoscopic surgery.
[CV-92] MultiPull: Detailing Signed Distance Functions by Pulling Multi-Level Queries at Multi-Step NEURIPS2024
Link: https://arxiv.org/abs/2411.01208
Authors: Takeshi Noda,Chao Chen,Weiqi Zhang,Xinhai Liu,Yu-Shen Liu,Zhizhong Han
Keywords: Reconstructing a continuous, challenging task, Reconstructing, point cloud, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024. Project page: this https URL
Abstract:Reconstructing a continuous surface from a raw 3D point cloud is a challenging task. Recent methods usually train neural networks to overfit on single point clouds to infer signed distance functions (SDFs). However, neural networks tend to smooth local details due to the lack of ground truth signed distances or normals, which limits the performance of overfitting-based methods in reconstruction tasks. To resolve this issue, we propose a novel method, named MultiPull, to learn multi-scale implicit fields from raw point clouds by optimizing accurate SDFs from coarse to fine. We achieve this by mapping 3D query points into a set of frequency features, which makes it possible to leverage multi-level features during optimization. Meanwhile, we introduce optimization constraints from the perspective of spatial distance and normal consistency, which play a key role in point cloud reconstruction based on multi-scale optimization strategies. Our experiments on widely used object and scene benchmarks demonstrate that our method outperforms the state-of-the-art methods in surface reconstruction.
[CV-93] HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction
Link: https://arxiv.org/abs/2411.01139
Authors: Rujiao Long,Pengfei Wang,Zhibo Yang,Cong Yao
Keywords: visual information extraction, including text spotting, word grouping, visual information, information extraction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:End-to-end visual information extraction (VIE) aims at integrating the hierarchical subtasks of VIE, including text spotting, word grouping, and entity labeling, into a unified framework. Dealing with the gaps among the three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods heavily rely on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, might produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities. Furthermore, we devise corresponding hierarchical pre-training strategies, categorized as image reconstruction, layout learning, and language enhancement, to reinforce the cross-modality representation of the hierarchical encoders. Quantitative experiments on public benchmarks demonstrate that HIP outperforms previous state-of-the-art methods, while qualitative results show its excellent interpretability.
[CV-94] X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios
Link: https://arxiv.org/abs/2411.01123
Authors: Yichen Xie,Chenfeng Xu,Chensheng Peng,Shuqi Zhao,Nhat Ho,Alexander T. Pham,Mingyu Ding,Masayoshi Tomizuka,Wei Zhan
Keywords: Recent advancements, exploited diffusion models, camera image data, LiDAR point clouds, point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling single-modality data marginal distribution, there is an under-exploration in the mutual reliance between different modalities to describe complex driving scenes. To fill in this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at this https URL.
[CV-95] OnlineTAS: An Online Baseline for Temporal Action Segmentation NEURIPS2024
Link: https://arxiv.org/abs/2411.01122
Authors: Qing Zhong,Guodong Ding,Angela Yao
Keywords: temporal action segmentation, Temporal context plays, plays a significant, significant role, temporal action
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 4 figures, 12 tables. Accepted to NeurIPS 2024
Abstract:Temporal context plays a significant role in temporal action segmentation. In an offline setting, the context is typically captured by the segmentation network after observing the entire sequence. However, capturing and using such context information in an online setting remains an under-explored problem. This work presents an online framework for temporal action segmentation. At the core of the framework is an adaptive memory designed to accommodate dynamic changes in context over time, alongside a feature augmentation module that enhances the frames with the memory. In addition, we propose a post-processing approach to mitigate the severe over-segmentation in the online setting. On three common segmentation benchmarks, our approach achieves state-of-the-art performance.
[CV-96] AquaFuse: Waterbody Fusion for Physics Guided View Synthesis of Underwater Scenes
Link: https://arxiv.org/abs/2411.01119
Authors: Md Abu Bakr Siddique,Jiayi Wu,Ioannis Rekleitis,Md Jahidul Islam
Keywords: synthesizing waterbody properties, introduce the idea, physics-based method, method for synthesizing, underwater imagery
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:We introduce the idea of AquaFuse, a physics-based method for synthesizing waterbody properties in underwater imagery. We formulate a closed-form solution for waterbody fusion that facilitates realistic data augmentation and geometrically consistent underwater scene rendering. AquaFuse leverages the physical characteristics of light propagation underwater to synthesize the waterbody from one scene to the object contents of another. Unlike data-driven style transfer, AquaFuse preserves the depth consistency and object geometry in an input scene. We validate this unique feature by comprehensive experiments over diverse underwater scenes. We find that the AquaFused images preserve over 94% depth consistency and 90-95% structural similarity of the input scenes. We also demonstrate that it generates accurate 3D view synthesis by preserving object geometry while adapting to the inherent waterbody fusion process. AquaFuse opens up a new research direction in data augmentation by geometry-preserving style transfer for underwater imaging and robot vision applications.
[CV-97] Test-Time Adaptation in Point Clouds: Leveraging Sampling Variation with Weight Averaging
Link: https://arxiv.org/abs/2411.01116
Authors: Ali Bahri,Moslem Yazdanpanah,Mehrdad Noori,Sahar Dastani Oghani,Milad Cheraghalikhani,David Osowiech,Farzad Beizaee,Gustavo Adolfo Vargas Hakim,Ismail Ben Ayed,Christian Desrosiers
Keywords: Test-Time Adaptation, source data, addresses distribution shifts, access to source, Farthest Point Sampling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Test-Time Adaptation (TTA) addresses distribution shifts during testing by adapting a pretrained model without access to source data. In this work, we propose a novel TTA approach for 3D point cloud classification, combining sampling variation with weight averaging. Our method leverages Farthest Point Sampling (FPS) and K-Nearest Neighbors (KNN) to create multiple point cloud representations, adapting the model for each variation using the TENT algorithm. The final model parameters are obtained by averaging the adapted weights, leading to improved robustness against distribution shifts. Extensive experiments on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C datasets, with different backbones (Point-MAE, PointNet, DGCNN), demonstrate that our approach consistently outperforms existing methods while maintaining minimal resource overhead. The proposed method effectively enhances model generalization and stability in challenging real-world conditions.
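As a rough illustration of two ingredients named in the abstract, the sketch below shows farthest point sampling with different random starting points (one way to obtain sampling variation) and element-wise averaging of several adapted parameter sets. It is a minimal sketch rather than the authors' implementation. The TENT adaptation step and the KNN grouping are omitted, and the `bn_scale` parameter and its values are hypothetical stand-ins for adapted weights.

```python
import math
import random


def farthest_point_sampling(points, k, seed=0):
    """Return k indices, each chosen as the point farthest from those already picked."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


def average_weights(weight_sets):
    """Element-wise mean of parameter dicts that share keys and flat list values."""
    names = weight_sets[0].keys()
    count = len(weight_sets)
    return {
        name: [sum(vals) / count for vals in zip(*(w[name] for w in weight_sets))]
        for name in names
    }


rng = random.Random(1)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
# Different seeds give different subsamples of the same cloud.
views = [[cloud[i] for i in farthest_point_sampling(cloud, 32, seed=s)] for s in range(3)]
# Hypothetical per-view adapted parameters (the adaptation itself is not shown).
adapted = [{"bn_scale": [1.0 + 0.1 * s, 2.0]} for s in range(3)]
final = average_weights(adapted)
```

In a real pipeline each view would drive its own adaptation pass before the averaging step. Here the adapted values are fixed by hand so the averaging can be checked in isolation.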
[CV-98] LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding
Link: https://arxiv.org/abs/2411.01106
Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A. Rossi, Changyou Chen, Tong Sun
Keywords: recently shown great, shown great progress, Large multimodal models, text-rich image understanding, struggle with complex
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Currently Under Review
Abstract:Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to LMMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), which broadens the capabilities of any LMM to support long-document understanding. We demonstrate that LMMs can effectively serve as multimodal retrievers, fetching relevant pages to answer user questions based on these pages. LoCAL is implemented with two specific LMM adapters: one for evidence page retrieval and another for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of LoCAL.
[CV-99] Few-Class Arena: A Benchmark for Efficient Selection of Vision Models and Dataset Difficulty Measurement
Link: https://arxiv.org/abs/2411.01099
Authors: Bryan Bo Cao, Lawrence O’Gorman, Michael Coss, Shubham Jain
Keywords: propose Few-Class Arena, image classification models, Few-Class Arena, testing efficient image, efficient image classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 27 pages including References and Appendix, 20 figures, 5 tables
Abstract:We propose Few-Class Arena (FCA), as a unified benchmark with focus on testing efficient image classification models for few classes. A wide variety of benchmark datasets with many classes (80-1000) have been created to assist Computer Vision architectural evolution. An increasing number of vision models are evaluated with these many-class datasets. However, real-world applications often involve substantially fewer classes of interest (2-10). This gap between many and few classes makes it difficult to predict performance of the few-class applications using models trained on the available many-class datasets. To date, little has been offered to evaluate models in this Few-Class Regime. We conduct a systematic evaluation of the ResNet family trained on ImageNet subsets from 2 to 1000 classes, and test a wide spectrum of Convolutional Neural Networks and Transformer architectures over ten datasets by using our newly proposed FCA tool. Furthermore, to aid an up-front assessment of dataset difficulty and a more efficient selection of models, we incorporate a difficulty measure as a function of class similarity. FCA offers a new tool for efficient machine learning in the Few-Class Regime, with goals ranging from a new efficient class similarity proposal, to lightweight model architecture design, to a new scaling law. FCA is user-friendly and can be easily extended to new models and datasets, facilitating future research work. Our benchmark is available at this https URL.
[CV-100] MultiDepth: Multi-Sample Priors for Refining Monocular Metric Depth Estimations in Indoor Scenes
Link: https://arxiv.org/abs/2411.01048
Authors: Sanghyun Byun, Jacob Song, Woo Seong Chung
Keywords: indoor scene reconstruction, Monocular metric depth, metric depth estimation, edge devices, Monocular metric
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Monocular metric depth estimation (MMDE) is a crucial task to solve for indoor scene reconstruction on edge devices. Despite this importance, existing models are sensitive to factors such as boundary frequency of objects in the scene and scene complexity, failing to fully capture many indoor scenes. In this work, we propose to close this gap through the task of monocular metric depth refinement (MMDR) by leveraging state-of-the-art MMDE models. MultiDepth proposes a solution by taking samples of the image along with the initial depth map prediction made by a pre-trained MMDE model. Compared to existing iterative depth refinement techniques, MultiDepth does not employ normal map prediction as part of its architecture, effectively lowering the model size and computation overhead while outputting impactful changes from refining iterations. MultiDepth implements a lightweight encoder-decoder architecture for the refinement network, processing multiple samples from the given image, including segmentation masking. We evaluate MultiDepth on four datasets and compare them to state-of-the-art methods to demonstrate its effective refinement with minimal overhead, displaying accuracy improvement upward of 45%.
[CV-101] FISHing in Uncertainty: Synthetic Contrastive Learning for Genetic Aberration Detection
Link: https://arxiv.org/abs/2411.01025
Authors: Simon Gutwein, Martin Kampel, Sabine Taschner-Mandl, Roxane Licandro
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-102] Re-thinking Richardson-Lucy without Iteration Cutoffs: Physically Motivated Bayesian Deconvolution
Link: https://arxiv.org/abs/2411.00991
Authors: Zachary H. Hendrix, Peter T. Brown, Tim Flanagan, Douglas P. Shepherd, Ayush Saurabh, Steve Pressé
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an); Optics (physics.optics)
Comments: 5 figures
[CV-103] Retrieval-enriched zero-shot image classification in low-resource domains EMNLP2024
Link: https://arxiv.org/abs/2411.00988
Authors: Nicola Dall’Asen, Yiming Wang, Enrico Fini, Elisa Ricci
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EMNLP 2024 (Main)
[CV-104] Inter-Feature-Map Differential Coding of Surveillance Video
Link: https://arxiv.org/abs/2411.00984
Authors: Kei Iino, Miho Takahashi, Hiroshi Watanabe, Ichiro Morinaga, Shohei Enomoto, Xu Shi, Akira Sakamoto, Takeharu Eda
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
[CV-105] Raspberry PhenoSet: A Phenology-based Dataset for Automated Growth Detection and Yield Estimation
Link: https://arxiv.org/abs/2411.00967
Authors: Parham Jafary, Anna Bazangeya, Michelle Pham, Lesley G. Campbell, Sajad Saeedi, Kourosh Zareinia, Habiba Bougherara
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
[CV-106] AI-EDI-SPACE: A Co-designed Dataset for Evaluating the Quality of Public Spaces CVPR2024
Link: https://arxiv.org/abs/2411.00956
Authors: Shreeyash Gowaikar, Hugo Berard, Rashid Mushkani, Emmanuel Beaudry Marchand, Toumadher Ammar, Shin Koseki
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Presented at CVPR 2024 Workshop on Responsible Data
[CV-107] Technical Report for ActivityNet Challenge 2022 – Temporal Action Localization
Link: https://arxiv.org/abs/2411.00883
Authors: Shimin Chen, Wei Li, Jianyang Gu, Chen Chen, Yandong Guo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2204.02674
[CV-108] Technical Report for Soccernet 2023 – Dense Video Captioning
Link: https://arxiv.org/abs/2411.00882
Authors: Zheng Ruan, Ruixuan Liu, Shimin Chen, Mengying Zhou, Xinquan Yang, Wei Li, Chen Chen, Wei Shen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-109] Technical Report for SoccerNet Challenge 2022 – Replay Grounding Task
Link: https://arxiv.org/abs/2411.00881
Authors: Shimin Chen, Wei Li, Jiaming Chu, Chen Chen, Chen Zhang, Yandong Guo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-110] Unsupervised Object Discovery: A Comprehensive Survey and Unified Taxonomy
Link: https://arxiv.org/abs/2411.00868
Authors: José-Fabian Villa-Vásquez, Marco Pedersoli
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-111] Deep Learning for 3D Point Cloud Enhancement: A Survey
Link: https://arxiv.org/abs/2411.00857
Authors: Siwen Quan, Junhao Yu, Ziming Nie, Muze Wang, Sijia Feng, Pei An, Jiaqi Yang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-112] Video prediction using score-based conditional density estimation
Link: https://arxiv.org/abs/2411.00842
Authors: Pierre-Étienne H. Fiquet, Eero P. Simoncelli
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
[CV-113] Yoga Pose Classification Using Transfer Learning
Link: https://arxiv.org/abs/2411.00833
Authors: M. M. Akash, Rahul Deb Mohalder, Md. Al Mamun Khan, Laboni Paul, Ferdous Bin Ali
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-114] Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data CONLL2024
Link: https://arxiv.org/abs/2411.00828
Authors: Badr AlKhamissi, Yingtian Tang, Abdülkadir Gökce, Johannes Mehrer, Martin Schrimpf
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to BabyLM Challenge at CoNLL 2024
[CV-115] Leaving Some Facial Features Behind
Link: https://arxiv.org/abs/2411.00824
Authors: Cheng Qiu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures
[CV-116] Robust plug-and-play methods for highly accelerated non-Cartesian MRI reconstruction
Link: https://arxiv.org/abs/2411.01955
Authors: Pierre-Antoine Comby (MIND, JOLIOT), Benjamin Lapostolle (MIND), Matthieu Terris (MIND), Philippe Ciuciu (MIND, JOLIOT)
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-117] A Novel Deep Learning Tractography Fiber Clustering Framework for Functionally Consistent White Matter Parcellation Using Multimodal Diffusion MRI and Functional MRI
Link: https://arxiv.org/abs/2411.01859
Authors: Jin Wang, Bocheng Guo, Yijie Li, Junyi Wang, Yuqian Chen, Jarrett Rushmore, Nikos Makris, Yogesh Rathi, Lauren J O’Donnell, Fan Zhang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures
[CV-118] Disentangled PET Lesion Segmentation
Link: https://arxiv.org/abs/2411.01758
Authors: Tanya Gatsak, Kumar Abhishek, Hanene Ben Yedder, Saeid Asgari Taghanaki, Ghassan Hamarneh
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, 1 table
[CV-119] MamT4: Multi-view Attention Networks for Mammography Cancer Classification
Link: https://arxiv.org/abs/2411.01669
Authors: Alisher Ibragimov, Sofya Senotrusova, Arsenii Litvinov, Egor Ushakov, Evgeny Karpulevich, Yury Markin
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The crop model is available here: this https URL
[CV-120] HC3L-Diff: Hybrid conditional latent diffusion with high frequency enhancement for CBCT-to-CT synthesis
Link: https://arxiv.org/abs/2411.01575
Authors: Shi Yin, Hongqi Tan, Li Ming Chong, Haofeng Liu, Hui Liu, Kang Hao Lee, Jeffrey Kit Loong Tuan, Dean Ho, Yueming Jin
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures
[CV-121] DSDE: Using Proportion Estimation to Improve Model Selection for Out-of-Distribution Detection
Link: https://arxiv.org/abs/2411.01487
Authors: Jingyao Geng, Yuan Zhang, Jiaqi Huang, Feng Xue, Falong Tan, Chuanlong Xie, Shumei Zhang
Keywords:
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 2 figures
[CV-122] TPOT: Topology Preserving Optimal Transport in Retinal Fundus Image Enhancement
Link: https://arxiv.org/abs/2411.01403
Authors: Xuanzhao Dong, Wenhui Zhu, Xin Li, Guoxin Sun, Yi Su, Oana M. Dumitrascu, Yalin Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-123] Deep Multi-contrast Cardiac MRI Reconstruction via vSHARP with Auxiliary Refinement Network
Link: https://arxiv.org/abs/2411.01291
Authors: George Yiasemis, Nikita Moriakov, Jan-Jakob Sonke, Jonas Teuwen
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 11 pages, 1 figure, 3 tables, CMRxRecon Challenge 2024
[CV-124] Enhancing Diabetic Retinopathy Detection with CNN-Based Models: A Comparative Study of UNET and Stacked UNET Architectures
Link: https://arxiv.org/abs/2411.01251
Authors: Ameya Uppina, S Navaneetha Krishnan, Talluri Krishna Sai Teja, Nikhil N Iyer, Joe Dhanith P R
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-125] Rotational Odometry using Ultra Low Resolution Thermal Cameras
Link: https://arxiv.org/abs/2411.01227
Authors: Ali Safa
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
[CV-126] MIC: Medical Image Classification Using Chest X-ray (COVID-19 and Pneumonia) Dataset with the Help of CNN and Customized CNN
Link: https://arxiv.org/abs/2411.01163
Authors: Nafiz Fahad, Fariha Jahan, Md Kishor Morol, Rasel Ahmed, Md. Abdullah-Al-Jubair
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at ICCA 2024
[CV-127] A lightweight Convolutional Neural Network based on U shape structure and Attention Mechanism for Anterior Mediastinum Segmentation
Link: https://arxiv.org/abs/2411.01019
Authors: Sina Soleimani-Fard, Won Gi Jeong, Francis Ferri Ripalda, Hasti Sasani, Younhee Choi, S Deiva, Gong Yong Jin, Seok-bum Ko
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-128] Automated Assessment of Residual Plots with Computer Vision Models
Link: https://arxiv.org/abs/2411.01001
Authors: Weihao Li, Dianne Cook, Emi Tanaka, Susan VanderPlas, Klaus Ackermann
Keywords:
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
[CV-129] Multiplex Imaging Analysis in Pathology: a Comprehensive Review on Analytical Approaches and Digital Toolkits
Link: https://arxiv.org/abs/2411.00948
Authors: Mohamed Omar, Giuseppe Nicolo Fanelli, Fabio Socciarelli, Varun Ullanat, Sreekar Reddy Puchala, James Wen, Alex Chowdhury, Itzel Valencia, Cristian Scatena, Luigi Marchionni, Renato Umeton, Massimo Loda
Keywords:
Categories: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
Comments: 54 pages (39 manuscript + 14 supplementary), 3 figures (figure 1, 2 and supplementary figure 1), 6 Tables (Table 1, 2, 3 and supplementary table 1,2,3)
[CV-130] Lung tumor segmentation in MRI mice scans using 3D nnU-Net with minimum annotations
Link: https://arxiv.org/abs/2411.00922
Authors: Piotr Kaniewski, Fariba Yousefi, Yeman Brhane Hagos, Nikolay Burlutskiy
Keywords: lung tumor segmentation, accurate lung tumor, lung tumor, tumor, drug discovery
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments:
Abstract:In drug discovery, accurate lung tumor segmentation is an important step for assessing tumor size and its progression using in-vivo imaging such as MRI. While deep learning models have been developed to automate this process, the focus has predominantly been on human subjects, neglecting the pivotal role of animal models in pre-clinical drug development. In this work, we focus on optimizing lung tumor segmentation in mice. First, we demonstrate that the nnU-Net model outperforms the U-Net, U-Net3+, and DeepMeta models. Most importantly, we achieve better results with nnU-Net 3D models than 2D models, indicating the importance of spatial context for segmentation tasks in MRI mice scans. This study demonstrates the importance of 3D input over 2D input images for lung tumor segmentation in MRI scans. Finally, we outperform the prior state-of-the-art approach that involves the combined segmentation of lungs and tumors within the lungs. Our work achieves comparable results using only lung tumor annotations requiring fewer annotations, saving time and annotation efforts. This work (see this https URL) is an important step in automating pre-clinical animal studies to quantify the efficacy of experimental drugs, particularly in assessing tumor changes.
[CV-131] Zero-Shot Self-Consistency Learning for Seismic Irregular Spatial Sampling Reconstruction
Link: https://arxiv.org/abs/2411.00911
Authors: Junheng Peng, Yingtian Liu, Mingwei Wang, Yong Li, Huating Li
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments: 12 pages, 8 figures
[CV-132] Intensity Field Decomposition for Tissue-Guided Neural Tomography
Link: https://arxiv.org/abs/2411.00900
Authors: Meng-Xun Li, Jin-Gang Yu, Yuan Gao, Cui Huang, Gui-Song Xia
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[CV-133] Multiscale texture separation
Link: https://arxiv.org/abs/2411.00894
Authors: Jerome Gilles
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
Comments:
[CV-134] Blind Time-of-Flight Imaging: Sparse Deconvolution on the Continuum with Unknown Kernels
Link: https://arxiv.org/abs/2411.00893
Authors: Ruiming Guo, Ayush Bhandari
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments: 27 pages
[CV-135] Topology-Aware Graph Augmentation for Predicting Clinical Trajectories in Neurocognitive Disorders
Link: https://arxiv.org/abs/2411.00888
Authors: Qianqian Wang, Wei Wang, Yuqi Fang, Hong-Jun Li, Andrea Bozoki, Mingxia Liu
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
[CV-136] Federated Learning for Diabetic Retinopathy Diagnosis: Enhancing Accuracy and Generalizability in Under-Resourced Regions
Link: https://arxiv.org/abs/2411.00869
Authors: Gajan Mohan Raj, Michael G. Morley, Mohammad Eslami
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Machine Learning
[LG-0] Boulder2Vec: Modeling Climber Performances in Professional Bouldering Competitions
Link: https://arxiv.org/abs/2411.02343
Authors: Ethan Baron, Victor Hau, Zeke Weng
Keywords: Probabilistic Matrix Factorization, PMF, climber, climbers, predict climber results
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Using data from professional bouldering competitions from 2008 to 2022, we train a logistic regression to predict climber results and measure climber skill. However, this approach is limited, as a single numeric coefficient per climber cannot adequately capture the intricacies of climbers’ varying strengths and weaknesses in different boulder problems. For example, some climbers might prefer more static, technical routes while other climbers may specialize in powerful, dynamic problems. To this end, we apply Probabilistic Matrix Factorization (PMF), a framework commonly used in recommender systems, to represent the unique characteristics of climbers and problems with latent, multi-dimensional vectors. In this framework, a climber’s performance on a given problem is predicted by taking the dot product of the corresponding climber vector and problem vectors. PMF effectively handles sparse datasets, such as our dataset where only a subset of climbers attempt each particular problem, by extrapolating patterns from similar climbers. We contrast the empirical