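This post lists the latest papers fetched from arXiv each day. It is updated automatically every day at around 11:30 a.m. and is organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily list by email, leave your email address in the comments; the email is also sent automatically at around 11:30 each day.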
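
The post does not say how the daily list is collected. As a minimal sketch, assuming the public arXiv API at `export.arxiv.org` is the source, the following builds a query for the newest submissions in one category and parses the Atom feed it returns. The function names and `SAMPLE_FEED` are illustrative, not the blog's actual script, and no network request is made here.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API feed
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(category, max_results=100):
    """Build an arXiv API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract link, title and authors from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(ATOM + "entry"):
        papers.append({
            "link": entry.findtext(ATOM + "id", default="").strip(),
            # Titles may be wrapped over several lines; collapse the whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "authors": [a.findtext(ATOM + "name", default="")
                        for a in entry.iter(ATOM + "author")],
        })
    return papers


# Hypothetical one-entry feed, used here instead of a live request.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2405.20341v1</id>
    <title>From Zero to Hero:
      Cold-Start Anomaly Detection</title>
    <author><name>Tal Reiss</name></author>
    <author><name>George Kour</name></author>
  </entry>
</feed>"""

for paper in parse_feed(SAMPLE_FEED):
    print(paper["link"], paper["title"], ", ".join(paper["authors"]))
```

A scheduled job (for example, cron) could call `build_query_url("cs.CL")`, download the result, and pass the response text to `parse_feed`.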

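Overview (2024-05-31)

525 papers were updated today, including:

  • Natural Language Processing: 77 papers (Computation and Language (cs.CL))
  • Computer Vision: 144 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 154 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 182 papers (Machine Learning (cs.LG))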

Natural Language Processing

[NLP-0] From Zero to Hero: Cold-Start Anomaly Detection

Link: https://arxiv.org/abs/2405.20341
Authors: Tal Reiss, George Kour, Naama Zwerdling, Ateret Anaby-Tavor, Yedid Hoshen
Keywords: making data-driven approaches, data-driven approaches ineffective, anomaly detection system, queries in chatbots, observed data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ACL 2024. Our code is available at this https URL

Abstract:When first deploying an anomaly detection system, e.g., to detect out-of-scope queries in chatbots, there are no observed data, making data-driven approaches ineffective. Zero-shot anomaly detection methods offer a solution to such “cold-start” cases, but unfortunately they are often not accurate enough. This paper studies the realistic but underexplored cold-start setting where an anomaly detection model is initialized using zero-shot guidance, but subsequently receives a small number of contaminated observations (namely, that may include anomalies). The goal is to make efficient use of both the zero-shot guidance and the observations. We propose ColdFusion, a method that effectively adapts the zero-shot anomaly detector to contaminated observations. To support future development of this new setting, we propose an evaluation suite consisting of evaluation protocols and metrics.

[NLP-1] Xwin-LM: Strong and Scalable Alignment Practice for LLMs

Link: https://arxiv.org/abs/2405.20335
Authors: Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu
Keywords: large language models, alignment methodologies, methodologies for large, large language, comprehensive suite
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this work, we present Xwin-LM, a comprehensive suite of alignment methodologies for large language models (LLMs). This suite encompasses several key techniques, including supervised finetuning (SFT), reward modeling (RM), rejection sampling finetuning (RS), and direct preference optimization (DPO). The key components are as follows: (1) Xwin-LM-SFT, models initially finetuned with high-quality instruction data; (2) Xwin-Pair, a large-scale, multi-turn preference dataset meticulously annotated using GPT-4; (3) Xwin-RM, reward models trained on Xwin-Pair, developed at scales of 7B, 13B, and 70B parameters; (4) Xwin-Set, a multiwise preference dataset in which each prompt is linked to 64 unique responses generated by Xwin-LM-SFT and scored by Xwin-RM; (5) Xwin-LM-RS, models finetuned with the highest-scoring responses from Xwin-Set; (6) Xwin-LM-DPO, models further optimized on Xwin-Set using the DPO algorithm. Our evaluations on AlpacaEval and MT-bench demonstrate consistent and significant improvements across the pipeline, demonstrating the strength and scalability of Xwin-LM. The repository this https URL will be continually updated to foster community research.
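
The rejection-sampling step described above (keep the highest-scoring of the sampled responses) can be sketched as follows. This is a toy illustration, not the Xwin-LM code: the scores are made-up numbers standing in for a reward model, and pairing the best response against the worst for DPO is an assumption, since the abstract does not say how preference pairs are built from Xwin-Set.

```python
def best_of_n(responses, scores):
    """Return the response with the highest reward score (first one wins ties)."""
    if not responses or len(responses) != len(scores):
        raise ValueError("need at least one response and one score per response")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return responses[best_index]


def preference_pair(responses, scores):
    """Return (chosen, rejected) as the highest- and lowest-scored responses.

    Assumption: best-vs-worst pairing is only one plausible way to derive
    DPO training pairs from scored responses.
    """
    chosen = best_of_n(responses, scores)
    worst_index = min(range(len(scores)), key=lambda i: scores[i])
    return chosen, responses[worst_index]


# Toy candidates; the scores are invented stand-ins for reward-model output.
candidates = ["answer A", "answer B", "answer C"]
reward_scores = [0.2, 0.9, 0.5]

print(best_of_n(candidates, reward_scores))
print(preference_pair(candidates, reward_scores))
```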

[NLP-2] CausalQuest: Collecting Natural Causal Questions for AI Agents

Link: https://arxiv.org/abs/2405.20318
Authors: Roberto Ceraolo, Dmitrii Kharlapenko, Amélie Reymond, Rada Mihalcea, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Keywords: innate drive, drive to seek, questions, causal questions, dataset
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments:

Abstract:Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.

[NLP-3] ANAH: Analytical Annotation of Hallucinations in Large Language Models

Link: https://arxiv.org/abs/2405.20315
Authors: Ziwei Ji, Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
Keywords: Large Language Models, Language Models, Large Language, problem of Large, wide applications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2024

Abstract:Reducing the "hallucination" problem of Large Language Models (LLMs) is crucial for their wide applications. A comprehensive and fine-grained measurement of the hallucination is the first key step for the governance of this issue but is under-explored in the community. Thus, we present ANAH, a bilingual dataset that offers ANalytical Annotation of Hallucinations in LLMs within Generative Question Answering. Each answer sentence in our dataset undergoes rigorous annotation, involving the retrieval of a reference fragment, the judgment of the hallucination type, and the correction of hallucinated content. ANAH consists of ~12k sentence-level annotations for ~4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs progressively accumulate in the answer and use ANAH to train and evaluate hallucination annotators. We conduct extensive experiments on studying generative and discriminative annotators and show that, although current open-source LLMs have difficulties in fine-grained hallucination annotation, the generative annotator trained with ANAH can surpass all open-source LLMs and GPT-3.5, obtain performance competitive with GPT-4, and exhibits better generalization ability on unseen questions.

[NLP-4] S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs

Link: https://arxiv.org/abs/2405.20314
Authors: Wei Zhong, Manasa Bharadwaj
Keywords: research attention due, LLM inference, Speculative decoding, Simultaneous Speculative Decoding, Skippy Simultaneous Speculative
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more effective SD model based on Phi-3. It is 1.4 to 2 times faster than the quantized EAGLE model and operates in half-precision while using less VRAM.
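
For readers new to speculative decoding, here is a minimal sketch of the generic draft-then-verify loop under greedy decoding. It is not S3D itself: S3D drafts with the same model by skipping middle layers and decodes several tokens at once, whereas this toy uses two hypothetical callables, `draft_next` and `target_next`, that map a token list to the next token.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """Run one draft-then-verify step and return the extended token list."""
    # 1) The cheap draft model proposes num_draft tokens autoregressively.
    drafted = []
    for _ in range(num_draft):
        drafted.append(draft_next(prefix + drafted))

    # 2) The target model checks every drafted position. A real system scores
    #    all positions in one batched forward pass; this loop is for clarity.
    accepted = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return prefix + accepted
        accepted.append(token)

    # 3) Every draft was accepted, so verification yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted


def toy_target(tokens):
    """Toy target model: count upwards modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step([0], toy_draft, toy_target))
```

Because every accepted token equals what the target model would have produced, the output matches plain greedy decoding with the target; the speedup comes from verifying several drafted tokens per target pass.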

[NLP-5] Large Language Models Can Self-Improve At Web Agent Tasks

Link: https://arxiv.org/abs/2405.20309
Authors: Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
Keywords: typically been challenging, challenging due, due to lack, complex environment, effectively navigate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

[NLP-6] Group Robust Preference Optimization in Reward-free RLHF

Link: https://arxiv.org/abs/2405.20304
Authors: Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
Keywords: Adapting large language, large language models, Adapting large, human feedback, traditional RLHF approaches
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers’ groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a “one-size-fits-all” approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups’ preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
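
The idea of re-weighting groups so that those with worse cumulative loss get priority can be illustrated with a multiplicative-weights update, as commonly used in group-robust optimization. This is a sketch under that assumption, not the paper's exact GRPO update; `step_size` and the loss values are made up.

```python
import math


def update_group_weights(weights, group_losses, step_size=1.0):
    """Multiplicative-weights update: groups with larger loss gain weight."""
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize onto the simplex


def weighted_objective(weights, group_losses):
    """Group-weighted loss that the policy update would then minimize."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Two groups with equal starting weight; group 1 currently has the worse loss.
weights = [0.5, 0.5]
losses = [1.0, 2.0]
for _ in range(3):
    # Applying the update repeatedly makes the weights track cumulative loss.
    weights = update_group_weights(weights, losses)
    print([round(w, 4) for w in weights], round(weighted_objective(weights, losses), 4))
```

As the weight mass shifts toward the worst-off group, the weighted objective approaches the worst-case group loss, which is the quantity a robust policy tries to reduce.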

[NLP-7] Who Writes the Review, Human or AI?

Link: https://arxiv.org/abs/2405.20285
Authors: Panagiotis C. Theocharopoulos, Spiros V. Georgakopoulos, Sotiris K. Tasoulis, Vassilis P. Plagianakos
Keywords: Natural Language Processing, Artificial Intelligence, Intelligence in Natural, Language Processing, Natural Language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:With the increasing use of Artificial Intelligence in Natural Language Processing, concerns have been raised regarding the detection of AI-generated text in various domains. This study aims to investigate this issue by proposing a methodology to accurately distinguish AI-generated and human-written book reviews. Our approach utilizes transfer learning, enabling the model to identify generated text across different topics while improving its ability to detect variations in writing style and vocabulary. To evaluate the effectiveness of the proposed methodology, we developed a dataset consisting of real book reviews and AI-generated reviews using the recently proposed Vicuna open-source language model. The experimental results demonstrate that it is feasible to detect the original source of text, achieving an accuracy rate of 96.86%. Our efforts are oriented toward the exploration of the capabilities and limitations of Large Language Models in the context of text identification. Expanding our knowledge in these aspects will be valuable for effectively navigating similar models in the future and ensuring the integrity and authenticity of human-generated content.

[NLP-8] ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection

Link: https://arxiv.org/abs/2405.20274
Authors: Siva Uday Sampreeth Chebolu, Franck Dernoncourt, Nedim Lipka, Thamar Solorio
Keywords: Aspect-Based Sentiment Analysis, experienced tremendous expansion, workshops and Germeval, shared tasks spanning, Aspect Sentiment Target
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2309.13297

Abstract:Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion and diversity due to various shared tasks spanning several languages and fields and organized via SemEval workshops and Germeval. Nonetheless, a few shortcomings still need to be addressed, such as the lack of low-resource language evaluations and the emphasis on sentence-level analysis. To thoroughly assess ABSA techniques in the context of complete reviews, this research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST). ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level. We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research by incorporating low-resource languages, numerous languages, and a variety of topics. Through this effort, ABSA research will be able to cover more ground and get a deeper comprehension of the task and its practical application in a variety of languages and domains (this https URL).

[NLP-9] ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections

Link: https://arxiv.org/abs/2405.20271
Authors: Massimo Bini, Karsten Roth, Zeynep Akata, Anna Khoreva
Keywords: adapt foundation models, downstream task requirements, generalization ability, ubiquitous to adapt, adapt foundation
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2024. Code available at this https URL

Abstract:Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (~10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at this https URL.
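
A hyperplane reflection is the classic Householder matrix H = I - 2uu^T for a unit vector u, so a d x d transformation is determined by only d numbers. The sketch below shows that construction in plain Python and applies it to a small weight matrix. How ETHER parameterizes and places these reflections inside a model is not described in the abstract, so treat this as the underlying linear algebra only.

```python
import math


def householder(direction):
    """Return H = I - 2 u u^T for the unit vector u along `direction`."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    u = [x / norm for x in direction]
    n = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# One trainable vector defines the whole 2 x 2 reflection.
H = householder([1.0, 1.0])
W = [[1.0, 2.0], [3.0, 4.0]]  # toy stand-in for a frozen pretrained weight
print(matmul(H, W))           # reflected weight
print(matmul(H, H))           # approximately the identity: H is its own inverse
```

Since H is orthogonal, multiplying by it preserves the norm of each column of W, which is consistent with the abstract's claim that such transformations are unlikely to degrade the model sharply.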

[NLP-10] IsraParlTweet: The Israeli Parliamentary and Twitter Resource

Link: https://arxiv.org/abs/2405.20269
Authors: Guy Mor-Lan, Effi Levi, Tamir Sheafer, Shaul R. Shenhav
Keywords: million Hebrew tokens, Hebrew-language parliamentary discussions, Twitter posts made, Israeli Parliament, million Hebrew
Categories: Computation and Language (cs.CL)
Comments: Presented at LREC-COLING 2024

Abstract:We introduce IsraParlTweet, a new linked corpus of Hebrew-language parliamentary discussions from the Knesset (Israeli Parliament) between the years 1992-2023 and Twitter posts made by Members of the Knesset between the years 2008-2023, containing a total of 294.5 million Hebrew tokens. In addition to raw text, the corpus contains comprehensive metadata on speakers and Knesset sessions as well as several linguistic annotations. As a result, IsraParlTweet can be used to conduct a wide variety of quantitative and qualitative analyses and provide valuable insights into political discourse in Israel.

[NLP-11] Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

Link: https://arxiv.org/abs/2405.20267
Authors: Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, Lidong Bing
Keywords: daily basis, timely fashion, Chatbot Arena, trustworthy evaluation method, robust evaluation results
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLM’s true performance gaps become visible. Finally, a committee of LLM judges collectively discuss and determine the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.
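
The three stages in the abstract (examiner query, multi-round peer battle, committee verdict) can be sketched as a small orchestration loop. All participants below are hypothetical stub functions rather than LLM calls, and the committee is reduced to a simple majority vote; the paper's judges also discuss before deciding, which this sketch leaves out.

```python
from collections import Counter


def run_match(examiner, candidate_a, candidate_b, judges, rounds=2):
    """Return "A", "B" or "tie" for a single examiner query."""
    query = examiner()
    transcript = [("examiner", query)]
    for _ in range(rounds):
        # Each candidate sees the full transcript so far, including the rival's turns.
        transcript.append(("A", candidate_a(transcript)))
        transcript.append(("B", candidate_b(transcript)))
    votes = Counter(judge(transcript) for judge in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def strict_judge(transcript):
    """Stub judge: prefer whichever candidate last answered "4"."""
    last = {speaker: text for speaker, text in transcript}
    if last.get("A") == "4" and last.get("B") != "4":
        return "A"
    if last.get("B") == "4" and last.get("A") != "4":
        return "B"
    return "tie"


winner = run_match(
    examiner=lambda: "What is 2 + 2?",
    candidate_a=lambda transcript: "4",
    candidate_b=lambda transcript: "5",
    judges=[strict_judge, strict_judge, strict_judge],
)
print(winner)
```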

[NLP-12] Evaluating Large Language Model Biases in Persona-Steered Generation

Link: https://arxiv.org/abs/2405.20253
Authors: Andy Liu, Mona Diab, Daniel Fried
Keywords: requires large language, generation requires large, large language models, task of persona-steered, requires large
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2024. Code and data available at this https URL

Abstract:The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.

[NLP-13] Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

Link: https://arxiv.org/abs/2405.20252
Authors: Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng
Keywords: Large language models, shown great progress, Large language, language models, diverse applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.

[NLP-14] Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use

Link: https://arxiv.org/abs/2405.20245
Authors: Franz Louis Cesista, Rui Aguiar, Jason Kim, Paolo Acilo
Keywords: Business Document Information, Document Information Extraction, Business Document, Line Items Recognition, scanned documents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2024

Abstract:Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpasses current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE.

[NLP-15] TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models

链接: https://arxiv.org/abs/2405.20215
作者: Chen Zhang,Chengguang Tang,Dading Chong,Ke Shi,Guohua Tang,Feng Jiang,Haizhou Li
关键词: aligning large language, require periodic updates, Mainstream approaches, large language models, models require periodic
中文关键词: 对齐大型语言,需要定期更新,主流方法,大型语言模型,模型需要定期更新
类目: Computation and Language (cs.CL)
备注:

Abstract:Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the “TS-Align” framework, which fine-tunes a policy model using pairwise feedback data automatically mined from its outputs. This automatic mining process is efficiently accomplished through the collaboration between a large-scale teacher model and a small-scale student model. The policy fine-tuning process can be iteratively repeated using on-policy generations within our proposed teacher-student collaborative framework. Through extensive experiments, we demonstrate that our final aligned policy outperforms the base policy model with an average win rate of 69.7% across seven conversational or instruction-following datasets. Furthermore, we show that the ranking capability of the teacher is effectively distilled into the student through our pipeline, resulting in a small-scale yet effective reward model for policy model alignment.
摘要:大型语言模型(LLM)对齐的主流方法严重依赖人类偏好数据,在模型需要定期更新时尤其如此。LLM迭代对齐的标准流程需要为每次更新收集新的人类反馈。然而,数据收集过程成本高昂且难以扩展。为了解决这个问题,我们引入了“TS-Align”框架,该框架使用从策略模型自身输出中自动挖掘的成对反馈数据来微调策略模型。这一自动挖掘过程通过大规模教师模型与小规模学生模型之间的协作高效完成。在我们提出的师生协作框架内,可以使用同策略(on-policy)生成结果迭代地重复策略微调过程。通过大量实验,我们证明最终对齐的策略优于基础策略模型,在七个对话或指令遵循数据集上的平均胜率为69.7%。此外,我们表明教师的排序能力通过我们的流程被有效地蒸馏到学生模型中,从而得到一个小规模但有效的、用于策略模型对齐的奖励模型。

[NLP-16] PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization
[NLP-16] PostDoc:使用深度次模优化从长多模态文档生成海报

链接: https://arxiv.org/abs/2405.20213
作者: Vijay Jaisankar,Sambaran Bandyopadhyay,Kalp Vyas,Varre Chaitanya,Shwetha Somasundaram
关键词: summary presented, long input document, good design elements, text and images, input document
中文关键词: 呈现的摘要、长输入文档、良好的设计元素、文本和图像、输入文档
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:A poster from a long input document can be considered as a one-page easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a very less studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity and alignment of text and images. Then, we use an LLM based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
摘要:由长输入文档生成的海报可以被视为一页易于阅读的多模态(文本和图像)摘要,呈现在具有良好设计元素的精美模板上。将长文档自动转换为海报是一项研究很少但具有挑战性的任务。它包括对输入文档进行内容摘要,然后进行模板生成和协调。在这项工作中,我们提出了一种新颖的深度次模函数,它可以在真实标注摘要上进行训练,以从文档中提取多模态内容,并显式确保文本和图像具有良好的覆盖率、多样性和对齐性。然后,我们使用基于LLM的改写器,并提出根据输入内容生成包含多种设计要素的模板。我们通过广泛的自动评估和人工评估展示了我们方法的优点。

[NLP-17] Jina CLIP: Your CLIP Model Is Also Your Text Retriever
[NLP-17] Jina CLIP:您的CLIP模型也是您的文本检索器

链接: https://arxiv.org/abs/2405.20204
作者: Andreas Koukounas,Georgios Mastrapas,Michael Günther,Bo Wang,Scott Martens,Isabelle Mohr,Saba Sturua,Mohammad Kalim Akram,Joan Fontanals Martínez,Saahil Ognawala,Susana Guzman,Maximilian Werk,Nan Wang,Han Xiao
关键词: Contrastive Language-Image Pretraining, Language-Image Pretraining, common embedding space, fixed-sized vectors, align images
中文关键词: 对比语言-图像预训练、语言-图像预训练、公共嵌入空间、固定大小的向量、对齐图像
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 4 pages, ICML2024 workshop submission

Abstract:Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
摘要:对比语言-图像预训练(CLIP)被广泛用于训练模型,通过将图像和文本映射为固定大小的向量,使它们在公共嵌入空间中对齐。这些模型是多模态信息检索及相关任务的关键。然而,与专门的文本模型相比,CLIP模型在纯文本任务中的表现通常较差。这使得信息检索系统需要为纯文本任务和多模态任务分别保留嵌入和模型,从而造成效率低下。我们提出了一种新颖的多任务对比训练方法来解决这个问题,并用它训练jina-clip-v1模型,在文本-图像和文本-文本检索任务上均达到最先进的性能。

[NLP-18] TAIA: Large Language Models are Out-of-Distribution Data Learners
[NLP-18] TAIA:大型语言模型是分布外数据的学习者

链接: https://arxiv.org/abs/2405.20192
作者: Shuyang Jiang,Yusheng Liao,Ya Zhang,Yu Wang,Yanfeng Wang
关键词: task-specific question-answer pairs, instruction-tuned large language, large language models, task-specific question-answer, question-answer pairs
中文关键词: 特定于任务的问答对、指令调优的大型语言、大型语言模型、特定于任务的问答、问答对
类目: Computation and Language (cs.CL)
备注: 25 pages

Abstract:Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set’s distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.
摘要:对特定于任务的问答对进行微调,是提高指令调优的大型语言模型(LLM)在下游任务上性能的主要方法。然而,在某些专业领域,如医疗保健或无害内容生成,几乎不可能获得与下游分布相匹配的大量高质量数据。为了利用领域不匹配的数据提高LLM在数据稀缺领域中的性能,我们重新评估了Transformer架构,发现微调过程中并不是所有的参数更新都对下游性能有积极贡献。我们的分析表明,在自注意力网络和前馈网络中,当训练集的分布与测试集不完全一致时,只有微调后的注意力参数特别有益。基于这一认识,我们提出了一种有效的推理时干预方法:训练所有参数,但仅使用注意力进行推理(Training All parameters but Inferring with only Attention,TAIA)。我们使用两个通用指令调优数据集对TAIA进行了实证验证,并在七个涉及数学、推理和知识理解的下游任务上,针对不同参数规模和微调技术的LLM对其进行了评估。我们的综合实验表明,在大多数情况下,TAIA相比完全微调的模型和基础模型都取得了更大的改进,性能提升显著。TAIA对数据不匹配的高容忍度使其能够抵抗越狱微调,并能利用通用数据增强专门任务。
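
The idea of training all parameters but inferring with only the attention updates can be illustrated with a small sketch. This is one possible reading of the abstract, not the authors' implementation: it assumes that, at inference time, fine-tuned values are kept for attention parameters while every other parameter reverts to its base value. The parameter names, the `.attn.` naming convention, and the scalar "weights" are all hypothetical.

```python
def merge_attention_only(base_params, finetuned_params, is_attention):
    """Keep fine-tuned values for attention parameters; use base values elsewhere."""
    merged = {}
    for name, base_value in base_params.items():
        if is_attention(name) and name in finetuned_params:
            merged[name] = finetuned_params[name]
        else:
            merged[name] = base_value
    return merged


def is_attention(name):
    # Hypothetical naming convention: attention weights contain ".attn."
    return ".attn." in name


# Toy scalar "weights" standing in for real tensors.
base = {"layer0.attn.q": 1.0, "layer0.ffn.w1": 2.0}
tuned = {"layer0.attn.q": 1.5, "layer0.ffn.w1": 2.7}
merged = merge_attention_only(base, tuned, is_attention)
```

In this toy case the attention entry takes its fine-tuned value and the feed-forward entry falls back to the base value.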

[NLP-19] Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs
[NLP-19] Robo-Instruct:用于微调代码LLM的模拟器增强指令对齐

链接: https://arxiv.org/abs/2405.20179
作者: Zichao Hu,Junyi Jessy Li,Arjun Guha,Joydeep Biswas
关键词: application programming interfaces, shown great promise, robot application programming, Large language models, smaller open-weight LLMs
中文关键词: 应用程序编程接口,表现出巨大的前景,机器人应用程序编程,大型语言模型,较小的开放权重LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

Abstract:Large language models (LLMs) have shown great promise at generating robot programs from natural language given domain-specific robot application programming interfaces (APIs). However, the performance gap between proprietary LLMs and smaller open-weight LLMs remains wide. This raises a question: Can we fine-tune smaller open-weight LLMs for generating domain-specific robot programs to close the performance gap with proprietary LLMs? While Self-Instruct is a promising solution by generating a diverse set of training data, it cannot verify the correctness of these programs. In contrast, a robot simulator with a well-defined world can identify execution errors but limits the diversity of programs that it can verify. In this work, we introduce Robo-Instruct, which brings the best of both worlds – it promotes the diversity of Self-Instruct while providing the correctness of simulator-based checking. Robo-Instruct introduces RoboSim to synthesize a consistent world state on the fly by inferring properties relevant to the program being checked, and simulating actions accordingly. Furthermore, the instructions and programs generated by Self-Instruct may be subtly inconsistent – such as the program missing a step implied by the instruction. Robo-Instruct further addresses this with InstAlign, an instruction-program alignment procedure that revises the task instruction to reflect the actual results of the generated program. Given a few seed task descriptions and the robot APIs, Robo-Instruct is capable of generating a training dataset using only a small open-weight model. This dataset can then be used to fine-tune small open-weight language models, enabling them to match or even exceed the performance of several proprietary LLMs, such as GPT-3.5-Turbo and Gemini-Pro.
摘要:在给定特定领域的机器人应用编程接口(API)的情况下,大型语言模型(LLM)在从自然语言生成机器人程序方面显示出巨大的潜力。然而,专有LLM与较小的开放权重LLM之间的性能差距仍然很大。这提出了一个问题:我们能否微调较小的开放权重LLM来生成特定领域的机器人程序,从而缩小与专有LLM的性能差距?虽然Self-Instruct能够生成多样化的训练数据,是一种很有前途的解决方案,但它无法验证这些程序的正确性。相比之下,具有明确定义的世界的机器人模拟器可以识别执行错误,但限制了它可以验证的程序的多样性。在这项工作中,我们引入了Robo-Instruct,它兼具两者的优点:既保持了Self-Instruct的多样性,又提供了基于模拟器检查的正确性。Robo-Instruct引入了RoboSim,通过推断与被检查程序相关的属性并相应地模拟动作,动态合成一致的世界状态。此外,Self-Instruct生成的指令和程序可能存在细微的不一致,例如程序缺少指令所隐含的某个步骤。Robo-Instruct通过InstAlign进一步解决了这一问题,InstAlign是一个指令-程序对齐过程,它修改任务指令以反映所生成程序的实际结果。在给定少量种子任务描述和机器人API的情况下,Robo-Instruct仅使用一个小型开放权重模型就能生成训练数据集。然后,该数据集可用于微调小型开放权重语言模型,使它们的性能达到甚至超过GPT-3.5-Turbo和Gemini-Pro等几个专有LLM。

[NLP-20] InstructionCP: A fast approach to transfer Large Language Models into target language
[NLP-20] InstructionCP:将大型语言模型迁移到目标语言的快速方法

链接: https://arxiv.org/abs/2405.20175
作者: Kuang-Ming Chen,Hung-yi Lee
关键词: English, focused on English, exclusively in English, continual pre-training, rapid development
中文关键词: 英语,专注英语,专门用英语,持续预训练,快速发展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure

Abstract:The rapid development of large language models (LLMs) in recent years has largely focused on English, resulting in models that respond exclusively in English. To adapt these models to other languages, continual pre-training (CP) is often employed, followed by supervised fine-tuning (SFT) to maintain conversational abilities. However, CP and SFT can reduce a model’s ability to filter harmful content. We propose Instruction Continual Pre-training (InsCP), which integrates instruction tags into the CP process to prevent loss of conversational proficiency while acquiring new languages. Our experiments demonstrate that InsCP retains conversational and Reinforcement Learning from Human Feedback (RLHF) abilities. Empirical evaluations on language alignment, reliability, and knowledge benchmarks confirm the efficacy of InsCP. Notably, this approach requires only 0.1 billion tokens of high-quality instruction-following data, thereby reducing resource consumption.
摘要:近年来大型语言模型(LLM)的快速发展主要集中在英语上,导致模型仅以英语响应。为了使这些模型适应其他语言,通常采用持续预训练(CP),然后进行监督微调(SFT)以保持对话能力。然而,CP和SFT可能会降低模型过滤有害内容的能力。我们提出指令持续预训练(InsCP),它将指令标签集成到CP流程中,以防止在学习新语言时丧失对话能力。我们的实验表明,InsCP保留了对话能力和基于人类反馈的强化学习(RLHF)能力。在语言对齐、可靠性和知识基准上的实证评估证实了InsCP的有效性。值得注意的是,这种方法仅需要1亿个词元(token)的高质量指令遵循数据,从而减少了资源消耗。

[NLP-21] Iterative Feature Boosting for Explainable Speech Emotion Recognition
[NLP-21] 可解释语音情感识别的迭代特征增强

链接: https://arxiv.org/abs/2405.20172
作者: Alaa Nfissi,Wassim Bouachir,Nizar Bouguila,Brian Mishara
关键词: high dimensional datasets, including redundant, irrelevant information, lead to high, high dimensional
中文关键词: 多维数据集,包括冗余的、不相关的信息,导致高维度
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)

Abstract:In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high dimensional datasets, including redundant and irrelevant information. Consequently, high-dimensional learning often results in decreasing model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach allows thus to balance the benefits between model performance and transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.
摘要:在语音情感识别(SER)中,使用预定义的特征而不考虑它们的实际重要性可能会导致高维数据集,其中包含冗余和不相关的信息。因此,高维学习往往会降低模型精度,同时增加计算复杂度。我们的工作强调了仔细考虑和分析特征对于构建高效SER系统的重要性。我们提出了一种基于高效特征工程方法的新的有监督SER方法。我们特别关注结果的可解释性,以评估特征相关性并精炼特征集。这一过程通过特征评估循环迭代执行,使用Shapley值来加强特征选择并提高整体框架性能。因此,我们的方法能够在模型性能和透明度之间取得平衡。所提出的方法在TESS数据集上的情感识别性能优于人类水平性能(HLP)和最先进的机器学习方法。
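
To make the role of Shapley values in feature selection concrete, here is a minimal sketch of one pass of such an evaluation loop. It is not the paper's pipeline: it computes exact Shapley values by enumerating feature orderings (only feasible for a handful of features), and the feature names and the `toy_score` validation function are hypothetical stand-ins for a real model evaluation.

```python
from itertools import permutations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values: average marginal gain of each feature over all orderings."""
    totals = {f: 0.0 for f in features}
    for order in permutations(features):
        seen = frozenset()
        for f in order:
            with_f = seen | {f}
            totals[f] += value(with_f) - value(seen)
            seen = with_f
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in totals.items()}


def toy_score(subset):
    # Hypothetical validation score: pitch and energy help, noise does not.
    gains = {"pitch": 0.25, "energy": 0.5, "noise": 0.0}
    return sum(gains[f] for f in subset)


values = shapley_values(["pitch", "energy", "noise"], toy_score)
# One refinement step: keep only features with a positive contribution.
kept = [f for f, v in values.items() if v > 0]
```

In an iterative setup, the model would be retrained on `kept` and the evaluation repeated until the feature set stops changing. Real systems typically approximate Shapley values rather than enumerating every ordering.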

[NLP-22] Reasoning about concepts with LLMs: Inconsistencies abound
[NLP-22] 用LLM推理概念:不一致之处比比皆是

链接: https://arxiv.org/abs/2405.20163
作者: Rosario Uceda-Sosa,Karthikeyan Natesan Ramamurthy,Maria Chang,Moninder Singh
关键词: ability to summarize, summarize and organize, abstract concepts, Abstract, organize knowledge
中文关键词: 总结、总结和组织的能力,抽象概念,抽象,组织知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, 3 tables

Abstract:The ability to summarize and organize knowledge into abstract concepts is key to learning and reasoning. Many industrial applications rely on the consistent and systematic use of concepts, especially when dealing with decision-critical knowledge. However, we demonstrate that, when methodically questioned, large language models (LLMs) often display and demonstrate significant inconsistencies in their knowledge. Computationally, the basic aspects of the conceptualization of a given domain can be represented as Is-A hierarchies in a knowledge graph (KG) or ontology, together with a few properties or axioms that enable straightforward reasoning. We show that even simple ontologies can be used to reveal conceptual inconsistencies across several LLMs. We also propose strategies that domain experts can use to evaluate and improve the coverage of key domain concepts in LLMs of various sizes. In particular, we have been able to significantly enhance the performance of LLMs of various sizes with openly available weights using simple knowledge-graph (KG) based prompting strategies.
摘要:将知识总结和组织成抽象概念的能力是学习和推理的关键。许多工业应用依赖于概念的一致和系统使用,特别是在处理决策关键知识时。然而,我们证明,在经过系统性提问时,大型语言模型(LLM)往往在其知识中表现出显著的不一致。在计算上,给定领域概念化的基本方面可以表示为知识图谱(KG)或本体中的Is-A层次结构,以及少量支持直接推理的属性或公理。我们表明,即使是简单的本体也可以用来揭示多个LLM中的概念不一致。我们还提出了一些策略,领域专家可以用它们来评估和提高不同规模的LLM对关键领域概念的覆盖率。特别是,我们已经能够使用简单的基于知识图谱(KG)的提示策略,显著提高各种规模的开放权重LLM的性能。
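
One kind of inconsistency that an Is-A hierarchy can expose is a transitivity violation: a model affirms "A is a B" and "B is a C" but denies "A is a C". The sketch below checks recorded yes/no answers for that pattern. The answer format and the example concepts are assumptions made for illustration; the paper's own probing protocol may differ.

```python
from itertools import permutations


def transitivity_violations(answers):
    """answers maps (child, parent) -> bool, as asserted by the model under test."""
    concepts = {concept for pair in answers for concept in pair}
    violations = []
    for a, b, c in permutations(sorted(concepts), 3):
        affirmed_chain = answers.get((a, b)) and answers.get((b, c))
        if affirmed_chain and answers.get((a, c)) is False:
            violations.append((a, b, c))
    return violations


# Hypothetical model answers to three Is-A questions.
answers = {
    ("poodle", "dog"): True,
    ("dog", "mammal"): True,
    ("poodle", "mammal"): False,
}
violations = transitivity_violations(answers)
```

Unanswered pairs are ignored rather than counted as denials, so the check only flags contradictions among answers the model actually gave.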

[NLP-23] Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers
[NLP-23] Heidelberg-Boston @ SIGTYP 2024共享任务:利用字符感知分层Transformer增强低资源语言分析

链接: https://arxiv.org/abs/2405.20145
作者: Frederick Riemenschneider,Kevin Krahn
关键词: Historical languages present, present unique challenges, languages present unique, NLP community, Historical languages
中文关键词: 存在的历史语言,存在的独特挑战,存在的独特语言,NLP社区,历史语言
类目: Computation and Language (cs.CL)
备注: Accepted for publication at the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP-WS) 2024; 11 pages, 1 figure, 9 tables

Abstract:Historical languages present unique challenges to the NLP community, with one prominent hurdle being the limited resources available in their closed corpora. This work describes our submission to the constrained subtask of the SIGTYP 2024 shared task, focusing on PoS tagging, morphological tagging, and lemmatization for 13 historical languages. For PoS and morphological tagging we adapt a hierarchical tokenization method from Sun et al. (2023) and combine it with the advantages of the DeBERTa-V3 architecture, enabling our models to efficiently learn from every character in the training data. We also demonstrate the effectiveness of character-level T5 models on the lemmatization task. Pre-trained from scratch with limited data, our models achieved first place in the constrained subtask, nearly reaching the performance levels of the unconstrained task’s winner. Our code is available at this https URL
摘要:历史语言给NLP社区带来了独特的挑战,其中一个突出的障碍是其封闭语料库中可用的资源有限。这项工作描述了我们提交给SIGTYP 2024共享任务受约束子任务的系统,重点关注13种历史语言的词性(PoS)标注、形态标注和词形还原。对于词性标注和形态标注,我们改造了Sun等人(2023)的分层分词方法,并将其与DeBERTa-V3架构的优势相结合,使我们的模型能够有效地从训练数据中的每个字符中学习。我们还展示了字符级T5模型在词形还原任务上的有效性。我们的模型使用有限的数据从头开始预训练,在受约束子任务中获得了第一名,几乎达到了无约束任务获胜者的性能水平。我们的代码可在此https URL获取

[NLP-24] GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning
[NLP-24] GNN-RAG:用于大型语言模型推理的图神经检索

链接: https://arxiv.org/abs/2405.20139
作者: Costas Mavromatis,George Karypis
关键词: human-crafted factual knowledge, factual knowledge, Knowledge Graphs, represent human-crafted factual, collectively form
中文关键词: 人造事实知识、事实知识、知识图,代表人造事实的集体形式
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions outperforming competing approaches by 8.9–15.5% points at answer F1.
摘要:知识图谱(KG)以三元组(头、关系、尾)的形式表示人类构建的事实知识,这些三元组共同构成一个图。知识图谱问答(KGQA)是依据KG提供的信息进行推理来回答自然语言问题的任务。大型语言模型(LLM)因其理解自然语言的非凡能力而成为问答任务的最先进模型。另一方面,图神经网络(GNN)由于能够处理存储在KG中的复杂图信息而被广泛应用于KGQA。在这项工作中,我们介绍了GNN-RAG,这是一种以检索增强生成(RAG)的方式将LLM的语言理解能力与GNN的推理能力相结合的新方法。首先,GNN在密集的KG子图上进行推理,为给定问题检索候选答案。其次,提取KG中连接问题实体和候选答案的最短路径来表示KG推理路径。提取的路径被转换为文本,并作为输入用于结合RAG的LLM推理。在我们的GNN-RAG框架中,GNN充当密集子图推理器来提取有用的图信息,而LLM利用其自然语言处理能力完成最终的KGQA。此外,我们开发了一种检索增强(RA)技术来进一步提高GNN-RAG的KGQA性能。实验结果表明,GNN-RAG在两个广泛使用的KGQA基准(WebQSP和CWQ)上达到了最先进的性能,使用经过微调的7B LLM即可超过或匹配GPT-4的性能。此外,GNN-RAG在多跳和多实体问题上表现出色,在答案F1上比竞争方法高出8.9–15.5个百分点。
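
The path-extraction and verbalization steps described in the abstract can be sketched as follows. This is a minimal illustration, not the GNN-RAG code: the candidate answer is assumed to be given (in the paper it comes from the GNN), triples are treated as directed edges, the shortest path is found with plain breadth-first search, and the arrow-separated text format and toy triples are assumptions.

```python
from collections import deque


def shortest_path(triples, source, target):
    """Breadth-first search over (head, relation, tail) triples as directed edges."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # no path connects the two entities


def verbalize(path):
    """Turn a list of triples into a single line of text for an LLM prompt."""
    if not path:
        return ""
    text = path[0][0]
    for _, relation, tail in path:
        text += f" -> {relation} -> {tail}"
    return text


# Toy knowledge graph; "Jamaica" is the question entity, "Germanic" a candidate answer.
triples = [
    ("Jamaica", "official_language", "English"),
    ("Jamaica", "located_in", "Caribbean"),
    ("English", "language_family", "Germanic"),
]
path = shortest_path(triples, "Jamaica", "Germanic")
prompt_line = verbalize(path)
```

The resulting line would be placed in the LLM prompt alongside the question, one line per extracted reasoning path.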

[NLP-25] Language Models Need Inductive Biases to Count Inductively
[NLP-25] 语言模型需要归纳偏差才能进行归纳计算

链接: https://arxiv.org/abs/2405.20131
作者: Yingshan Chang,Yonatan Bisk
关键词: Peano axioms defining, lens of Peano, Peano axioms, cognitive science literature, learning to count
中文关键词: 皮亚诺公理定义,皮亚诺镜头,皮亚诺公理,认知科学文献,学习数数
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano’s axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer “reasoning” to the simplest case of counting, investigating length generalization does occur throughout the literature. In the “train short, test long” paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automata. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs, Transformers, State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.
摘要:无论是从定义自然数的皮亚诺公理这一数学视角来看,还是从儿童学习计数的认知科学文献来看,计数都是泛化的一个基本例子。在这两种情况下,论点都成立:学会计数意味着学会无限地计数。虽然很少有论文尝试将Transformer的“推理”提炼为最简单的计数情形,但对长度泛化的研究确实遍布于文献之中。在NLP的“短训练、长测试”范式中,长度指的是训练句子的长度。在形式语言识别中,长度指的是输入序列的长度,或下推自动机所产生的最大堆栈大小。在一般的问题求解中,长度指的是演绎推理链中的跳数或递归深度。在所有情况下,计数都是任务成功的核心。至关重要的是,归纳式地泛化计数是在分布外(OOD)实例上取得成功的核心。这项工作为训练语言模型进行计数提供了广泛的实证结果。我们试验了RNN、Transformer、状态空间模型和RWKV等多种架构。我们提出了精心设计的任务格式、辅助任务和位置嵌入,以避免OOD位置和OOD词汇带来的泛化限制。我们发现,虽然传统RNN可以轻而易举地实现归纳计数,但Transformer必须依赖位置嵌入才能进行域外计数。由于计数是许多关于Transformer表达能力的论证的基础,我们的发现呼吁社区重新审视形式化刻画中所定义的原语函数的适用范围。最后,现代RNN在归纳式泛化计数方面也大大落后于传统RNN。我们讨论了使现代RNN能够并行化训练的设计选择如何导致它们失去循环特性的优点。

[NLP-26] Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
[NLP-26] 填补空白!将自我监督表示学习与神经音频合成相结合用于语音修复

链接: https://arxiv.org/abs/2405.20101
作者: Ihab Asaad,Maxime Jacquelin,Olivier Perrotin,Laurent Girin,Thomas Hueber
关键词: speech self-supervised learning, predicting missing parts, causal prediction, non-causal prediction, speech SSL model
中文关键词: 语音自我监督学习、预测缺失部分、因果预测、非因果预测、语音SSL模型
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

Abstract:Most speech self-supervised learning (SSL) models are trained with a pretext task which consists in predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input, by freezing one and fine-tuning the other, and vice versa. Performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Performances show that if both solutions allow to correctly reconstruct signal portions up to the size of 200ms (and even 400ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting case, while freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
摘要:大多数语音自监督学习(SSL)模型都是通过前置任务(pretext task)训练的,该任务是预测输入信号中缺失的部分,可以是未来的片段(因果预测),也可以是输入中任意位置被掩蔽的片段(非因果预测)。学习到的语音表示随后可以有效地迁移到下游任务(例如自动语音识别或说话人识别)。在本研究中,我们研究了将语音SSL模型用于语音修复,即根据周围上下文重建语音信号的缺失部分,也就是完成一个与前置任务非常相似的下游任务。为此,我们将SSL编码器HuBERT与充当解码器的神经声码器HiFiGAN相结合。具体而言,我们提出了两种使HuBERT输出与HiFiGAN输入相匹配的方案:冻结其中一个并微调另一个,或者反过来。我们在单说话人和多说话人设置下,针对知情和盲修复两种配置(即掩码位置分别为已知或未知),使用不同的客观指标和感知评估对两种方法的性能进行了评估。结果表明,虽然两种方案都能正确重建长达200ms(在某些情况下甚至400ms)的信号部分,但在单说话人设置下,微调SSL编码器能提供更准确的信号重建,而在处理多说话人数据时,冻结它(转而训练神经声码器)是更好的策略。

[NLP-27] Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation
[NLP-27] 分治遇上共识:释放代码生成中函数的力量

链接: https://arxiv.org/abs/2405.20092
作者: Jingchang Chen,Hongxuan Tang,Zheng Chu,Qianglong Chen,Zekun Wang,Ming Liu,Bing Qin
关键词: recent progress made, large language models, progress made, made by large, large language
中文关键词: 最近取得的进展,大型语言模型,大型语言取得的进展
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

Abstract:Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverage self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4’s performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.
摘要:尽管大型语言模型最近在代码生成方面取得了进展,但它们仍然难以生成满足复杂需求的程序。最近的工作利用“规划-求解”式分解来降低复杂度,并利用自测试来精炼生成的程序。然而,提前规划深层次的需求可能很有挑战性,而且测试必须准确才能实现自我改进。为此,我们提出了FunCoder,这是一个将分治策略与函数共识相结合的代码生成框架。具体地说,FunCoder在代码生成期间递归地将子函数作为更小的目标分支出来,并用树状层次结构表示。然后将这些子函数组合起来,以实现更复杂的目标。此外,我们通过识别程序行为的相似性形成共识来选定函数,从而减轻错误传播。在使用GPT-3.5和GPT-4时,FunCoder在HumanEval、MBPP、xCodeEval和MATH上的表现平均比最先进的方法高出9.8%。此外,我们的方法在较小的模型上也表现出优越性:借助FunCoder,StableCode-3b比GPT-3.5高出18.6%,并在HumanEval上达到了GPT-4性能的97.7%。进一步的分析表明,我们提出的动态函数分解能够处理复杂的需求,并且在正确性评估中,函数共识优于自测试。
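
Selecting a function by consensus on program behavior can be sketched roughly as below. This is an illustrative reading of the abstract, not FunCoder's actual procedure: several candidate implementations are run on the same probe inputs, and the candidate whose outputs agree most often with the others is kept. The candidates, the probe inputs, and the agreement score are all hypothetical.

```python
def behavior(func, inputs):
    """Record a candidate's outputs (or error types) on a shared set of inputs."""
    outputs = []
    for x in inputs:
        try:
            outputs.append(repr(func(x)))
        except Exception as error:
            outputs.append(f"error:{type(error).__name__}")
    return tuple(outputs)


def select_by_consensus(candidates, inputs):
    """Return the candidate whose behavior agrees most with all other candidates."""
    signatures = [behavior(candidate, inputs) for candidate in candidates]

    def agreement(i):
        return sum(
            sum(a == b for a, b in zip(signatures[i], signatures[j]))
            for j in range(len(candidates))
            if j != i
        )

    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


# Three hypothetical candidates for "square a number"; the last one is buggy.
candidates = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * 2,
]
chosen = select_by_consensus(candidates, [0, 1, 2, 3])
```

The appeal of this design is that it needs no ground-truth tests: a buggy candidate is outvoted as long as most candidates behave correctly on the probe inputs. Ties are broken here by taking the earliest candidate.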

[NLP-28] The Fine-Tuning Paradox: Boosting Translation Quality Without Sacrificing LLM Abilities
[NLP-28] 微调悖论:在不牺牲LLM能力的情况下提高翻译质量

链接: https://arxiv.org/abs/2405.20089
作者: David Stap,Eva Hasler,Bill Byrne,Christof Monz,Ke Tran
关键词: large language models, Fine-tuning large language, machine translation, translation, large language
中文关键词: 大型语言模型、微调大型语言、机器翻译、翻译、大型语言
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 (long, main)

Abstract:Fine-tuning large language models (LLMs) for machine translation has shown improvements in overall translation quality. However, it is unclear what is the impact of fine-tuning on desirable LLM behaviors that are not present in neural machine translation models, such as steerability, inherent document-level translation abilities, and the ability to produce less literal translations. We perform an extensive translation evaluation on the LLaMA and Falcon family of models with model size ranging from 7 billion up to 65 billion parameters. Our results show that while fine-tuning improves the general translation quality of LLMs, several abilities degrade. In particular, we observe a decline in the ability to perform formality steering, to produce technical translations through few-shot examples, and to perform document-level translation. On the other hand, we observe that the model produces less literal translations after fine-tuning on parallel data. We show that by including monolingual data as part of the fine-tuning data we can maintain the abilities while simultaneously enhancing overall translation quality. Our findings emphasize the need for fine-tuning strategies that preserve the benefits of LLMs for machine translation.
摘要:针对机器翻译微调大型语言模型(LLM)已经显示出整体翻译质量的改善。然而,目前尚不清楚微调对神经机器翻译模型所不具备的理想LLM行为有何影响,例如可操控性、固有的文档级翻译能力以及产生较少直译的能力。我们对LLaMA和Falcon系列模型进行了广泛的翻译评估,模型规模从70亿到650亿个参数不等。我们的结果表明,微调虽然提高了LLM的整体翻译质量,但有几项能力会退化。特别是,我们观察到执行正式度控制的能力、通过少样本示例生成技术翻译的能力以及执行文档级翻译的能力都有所下降。另一方面,我们观察到,在平行数据上微调后,该模型产生的直译较少。我们表明,通过将单语数据作为微调数据的一部分,我们可以在保持这些能力的同时提高整体翻译质量。我们的发现强调,需要采用能够保留LLM在机器翻译方面优势的微调策略。

[NLP-29] Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning
[NLP-29] 学生答案预测:Transformer驱动的语言学习答案选择预测

链接: https://arxiv.org/abs/2405.20079
作者: Elena Grazia Gado,Tommaso Martorella,Luca Zunino,Paola Mejia-Domenzain,Vinitra Swamy,Jibril Frej,Tanja Käser
关键词: Intelligent Tutoring Systems, Intelligent Tutoring, Tutoring Systems, specific answer choices, answer choices
中文关键词: 智能辅导系统,智能辅导,辅导系统,具体答案选择,答案选择
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted as a poster paper at EDM 2024: 17th International Conference on Educational Data Mining in Atlanta, USA

Abstract:Intelligent Tutoring Systems (ITS) enhance personalized learning by predicting student answers to provide immediate and customized instruction. However, recent research has primarily focused on the correctness of the answer rather than the student’s performance on specific answer choices, limiting insights into students’ thought processes and potential misconceptions. To address this gap, we present MCQStudentBert, an answer forecasting model that leverages the capabilities of Large Language Models (LLMs) to integrate contextual understanding of students’ answering history along with the text of the questions and answers. By predicting the specific answer choices students are likely to make, practitioners can easily extend the model to new answer choices or remove answer choices for the same multiple-choice question (MCQ) without retraining the model. In particular, we compare MLP, LSTM, BERT, and Mistral 7B architectures to generate embeddings from students’ past interactions, which are then incorporated into a finetuned BERT’s answer-forecasting mechanism. We apply our pipeline to a dataset of language learning MCQ, gathered from an ITS with over 10,000 students to explore the predictive accuracy of MCQStudentBert, which incorporates student interaction patterns, in comparison to correct answer prediction and traditional mastery-learning feature-based approaches. This work opens the door to more personalized content, modularization, and granular support.
摘要:智能教学系统(ITS)通过预测学生的答案来提供即时和定制的教学,从而增强了个性化学习。然而,最近的研究主要集中在答案的正确性上,而不是学生在具体答案选择上的表现,限制了对学生思维过程和潜在误解的洞察。为了解决这一差距,我们提出了一个答案预测模型MCQStudentBert,该模型利用大型语言模型(LLM)的能力来整合对学生回答历史的上下文理解以及问题和答案的文本。通过预测学生可能做出的具体答案选择,实践者可以轻松地将模型扩展到新的答案选择或删除同一多项选择题(MCQ)的答案选择,而无需重新训练模型。特别是,我们比较了MLP、LSTM、BERT和Mistral 7B架构,以从学生过去的交互中生成嵌入,然后将其合并到微调后的BERT的答案预测机制中。我们将我们的流程应用于从有10,000多名学生的ITS收集的语言学习MCQ数据集,以探索MCQStudentBert的预测准确性,它包含了学生的交互模式,与正确答案预测和传统的基于掌握学习特征的方法进行比较。这项工作为更个性化的内容、模块化和细粒度支持打开了大门。

[NLP-30] Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads
[NLP-30] 我会骗你吗?使用直接偏好头进行语言模型的推理时间对齐

链接: https://arxiv.org/abs/2405.20053
作者: Avelina Asada Hadji-Kyriacou,Ognjen Arandjelovic
关键词: Pre-trained Language Models, Direct Preference Optimization, in-context learning capabilities, exhibit strong zero-shot, Pre-trained Language
中文关键词: 预训练的语言模型、直接偏好优化、上下文学习能力、展现出强大的零样本、预训练语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model’s reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
摘要:预训练语言模型(LM)具有很强的零样本和上下文学习能力,但其行为往往难以控制。通过利用基于人类反馈的强化学习(RLHF),可以微调无监督的LM以遵循指令并产生反映人类偏好的输出。尽管RLHF有好处,但已被证明有可能损害语言模型的推理能力,并引入幻觉等假象,即模型可能会捏造事实。为了解决这个问题,我们引入了直接偏好头(DPH),这是一个微调框架,使LM能够通过辅助奖励头学习人类偏好信号,而不直接影响语言建模头的输出分布。我们对我们的目标函数进行了理论分析,发现它与保守直接偏好优化(cDPO)有很强的联系。最后,我们在GLUE、RACE和GPT4All评估套件上对我们的模型进行了评估,并证明了我们的方法产生的模型比仅使用监督微调(SFT)或直接偏好优化(DPO)进行微调的模型获得了更高的分数。

[NLP-31] Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
[NLP-31] 核语言熵:根据语义相似性对LLM进行细粒度不确定性量化

链接: https://arxiv.org/abs/2405.20003
作者: Alexander Nikitin,Jannik Kossen,Yarin Gal,Pekka Marttinen
关键词: Large Language Models, Large Language, reliability are important, crucial for applications, applications where safety
中文关键词: 大型语言模型、大型语言、可靠性很重要,对于应用程序、安全应用程序至关重要
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model’s semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.
摘要:在安全性和可靠性要求很高的应用中,大型语言模型(LLM)中的不确定性量化是至关重要的。特别是,不确定性可以通过检测与事实不符的模型回答(通常称为幻觉)来提高LLM的可信度。关键是,我们应该设法捕捉模型的语义不确定性,即关于LLM输出含义的不确定性,而不是不影响答案正确性的词汇或句法变化的不确定性。为了解决这个问题,我们提出了一种新的白盒和黑盒LLM不确定性估计方法:核语言熵(KLE)。KLE定义了半正定单位迹核来编码LLM输出的语义相似性,并使用von Neumann熵来量化不确定性。它考虑了答案(或语义簇)之间的成对语义依赖关系,比以往基于答案硬聚类的方法提供了更细粒度的不确定性估计。我们从理论上证明了KLE推广了以往最先进的语义熵方法,并通过实验证明了它提高了多个自然语言生成数据集和LLM体系结构上的不确定性量化性能。
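
The abstract above says that KLE scores uncertainty with the von Neumann entropy of a positive semidefinite, unit-trace kernel. The sketch below only illustrates that entropy, VNE(K) = -Tr(K log K), computed from the eigenvalues of K. The function name and the toy kernels are assumptions made for illustration, and the way the paper actually builds its semantic kernels from LLM outputs is not reproduced here.

```python
import numpy as np


def von_neumann_entropy(kernel):
    """Von Neumann entropy of a symmetric PSD matrix, rescaled to unit trace."""
    k = np.asarray(kernel, dtype=float)
    k = (k + k.T) / 2.0              # enforce symmetry against rounding noise
    k = k / np.trace(k)              # rescale so the eigenvalues sum to 1
    eigvals = np.linalg.eigvalsh(k)  # real eigenvalues of a symmetric matrix
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    return float(-np.sum(eigvals * np.log(eigvals)))


# Hypothetical toy kernels over three sampled answers.
all_same_meaning = np.ones((3, 3))  # every answer fully similar to every other
all_different = np.eye(3)           # no similarity between any two answers

low = von_neumann_entropy(all_same_meaning)
high = von_neumann_entropy(all_different)
```

With these toy inputs the fully similar kernel gives an entropy of about zero, while the identity kernel gives log(3), the maximum for three answers. Kernels with partial similarity fall between the two, which is the fine-grained behaviour the abstract contrasts with hard clustering.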

[NLP-32] Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification
[NLP-32] 利用双重编码和基于阈值的重新分类改进的范围外意图分类

链接: https://arxiv.org/abs/2405.19967
作者: Hossam M. Zawbaa,Wael Rashwan,Sourav Dutta,Haytham Assem
关键词: essential for task-oriented, task-oriented dialogues, Universal Sentence Encoder, Detecting, DETER
中文关键词: 对于面向任务、面向任务的对话、通用句子编码器、检测、DETER至关重要
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting out-of-scope user utterances is essential for task-oriented dialogues and intent classification. Current methodologies face difficulties with the unpredictable distribution of outliers and often rely on assumptions about data distributions. We present the Dual Encoder for Threshold-Based Re-Classification (DETER) to address these challenges. This end-to-end framework efficiently detects out-of-scope intents without requiring assumptions on data distributions or additional post-processing steps. The core of DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and the Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance embeddings, which are classified through a branched neural architecture. Further, DETER generates synthetic outliers using self-supervision and incorporates out-of-scope phrases from open-domain datasets. This approach ensures a comprehensive training set for out-of-scope detection. Additionally, a threshold-based re-classification mechanism refines the model’s initial predictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77 datasets demonstrate DETER’s efficacy. Our model outperforms previous benchmarks, with F1 score gains of up to 13% and 5% for known and unknown intents on CLINC-150 and Stackoverflow, and 16% for known and 24% for unknown intents on Banking77. The source code has been released at this https URL_Classification_OOS.
摘要:检测超出范围的用户话语对于面向任务的对话和意图分类至关重要。当前的方法难以应对异常值不可预测的分布,并且往往依赖于关于数据分布的假设。为了解决这些问题,我们提出了用于基于阈值的重新分类的双编码器(DETER)。这个端到端框架可以有效地检测超出范围的意图,而不需要假设数据分布或额外的后处理步骤。DETER的核心利用双重文本编码器,即通用句子编码器(USE)和基于Transformer的去噪自动编码器(TSDAE)来生成用户话语嵌入,并通过分支神经体系结构对其进行分类。此外,DETER使用自监督生成合成异常值,并从开放领域数据集中纳入超出范围的短语。这种方法确保了用于范围外检测的全面训练集。此外,基于阈值的重新分类机制改进了模型的初始预测。对CLINC-150、Stackoverflow和Banking77数据集的评估证明了DETER的有效性。我们的模型比以前的基准测试性能更好,在CLINC-150和Stackoverflow上,已知和未知意图的F1分数分别最多提高了13%和5%,在Banking77上,已知意图和未知意图的F1分数分别提高了16%和24%。源代码已在 this https URL_Classification_OOS 发布。
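
The abstract mentions a threshold-based re-classification step but does not spell out its rule. One common form of such a step, shown here purely as an assumption and not as DETER's actual mechanism, is to keep the top in-scope prediction only when its confidence clears a threshold and to relabel the utterance as out-of-scope otherwise. The function name, the label strings, and the 0.5 threshold are all hypothetical.

```python
def reclassify(probs, labels, threshold=0.5, oos_label="out_of_scope"):
    """Keep the most confident in-scope label, or fall back to out-of-scope.

    probs  -- per-intent confidences from some classifier (assumed given)
    labels -- intent names aligned with probs
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and aligned")
    best = max(range(len(probs)), key=probs.__getitem__)
    # Illustrative rule only: low confidence is taken as a sign of an unknown intent.
    return labels[best] if probs[best] >= threshold else oos_label


intents = ["check_balance", "transfer_money", "card_lost"]
confident = reclassify([0.1, 0.7, 0.2], intents)
uncertain = reclassify([0.4, 0.35, 0.25], intents)
```

In this sketch the first call keeps `transfer_money` and the second falls back to `out_of_scope`. The threshold trades known-intent accuracy against out-of-scope recall, so in practice it would be tuned on held-out data.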

[NLP-33] Multi-Aspect Controllable Text Generation with Disentangled Counterfactual Augmentation
[NLP-33] 具有解开反事实增强的多方面可控文本生成

链接: https://arxiv.org/abs/2405.19958
作者: Yi Liu,Xiangyu Liu,Xiangrong Zhu,Wei Hu
关键词: Multi-aspect controllable text, multiple aspects, controllable text generation, text generation aims, attribute correlations
中文关键词: 多方面可控文本、多方面、可控文本生成、文本生成目标、属性相关性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

点击查看摘要

Abstract:Multi-aspect controllable text generation aims to control the generated texts in attributes from multiple aspects (e.g., “positive” from sentiment and “sport” from topic). For ease of obtaining training samples, existing works neglect attribute correlations formed by the intertwining of different attributes. Particularly, the stereotype formed by imbalanced attribute correlations significantly affects multi-aspect control. In this paper, we propose MAGIC, a new multi-aspect controllable text generation method with disentangled counterfactual augmentation. We alleviate the issue of imbalanced attribute correlations during training using counterfactual feature vectors in the attribute latent space by disentanglement. During inference, we enhance attribute correlations by target-guided counterfactual augmentation to further improve multi-aspect control. Experiments show that MAGIC outperforms state-of-the-art baselines in both imbalanced and balanced attribute correlation scenarios. Our source code and data are available at this https URL.
摘要:多方面可控文本生成的目的是从多个方面控制生成的文本的属性(例如,情感上的“积极”和话题上的“体育”)。为了便于训练样本的获取,现有的研究忽略了不同属性相互交织形成的属性相关性。其中,属性关联不平衡形成的刻板印象显著影响多方面控制。在本文中,我们提出了一种新的多方面可控文本生成方法MAGIC,该方法带有解缠的反事实增强。我们利用属性潜在空间中的反事实特征向量,通过解缠来缓解训练过程中属性相关性不平衡的问题。在推理过程中,我们通过目标引导的反事实增强来增强属性相关性,以进一步改善多方面控制。实验表明,MAGIC在不平衡和平衡属性关联场景中的性能都优于最新的基线。我们的源代码和数据可以在这个HTTPS URL上找到。

[NLP-34] GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection, Localization, Reasoning, and Remediation
[NLP-34] GenKubeSec:基于LLM的Kubernetes错误配置检测、定位、推理和修复

链接: https://arxiv.org/abs/2405.19954
作者: Ehud Malul,Yair Meidan,Dudu Mimran,Yuval Elovici,Asaf Shabtai
关键词: Kubernetes configuration files, configuration files, complex and error-prone, operational setbacks, highly complex
中文关键词: Kubernetes配置文件,配置文件,复杂且容易出错,操作挫折,高度复杂
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A key challenge associated with Kubernetes configuration files (KCFs) is that they are often highly complex and error-prone, leading to security vulnerabilities and operational setbacks. Rule-based (RB) tools for KCF misconfiguration detection rely on static rule sets, making them inherently limited and unable to detect newly-discovered misconfigurations. RB tools also suffer from misdetection, since mistakes are likely when coding the detection rules. Recent methods for detecting and remediating KCF misconfigurations are limited in terms of their scalability and detection coverage, or due to the fact that they have high expertise requirements and do not offer automated remediation along with misconfiguration detection. Novel approaches that employ LLMs in their pipeline rely on API-based, general-purpose, and mainly commercial models. Thus, they pose security challenges, have inconsistent classification performance, and can be costly. In this paper, we propose GenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition to detecting a wide variety of KCF misconfigurations, also identifies the exact location of the misconfigurations and provides detailed reasoning about them, along with suggested remediation. When empirically compared with three industry-standard RB tools, GenKubeSec achieved equivalent precision (0.990) and superior recall (0.999). When a random sample of KCFs was examined by a Kubernetes security expert, GenKubeSec’s explanations as to misconfiguration localization, reasoning and remediation were 100% correct, informative and useful. To facilitate further advancements in this domain, we share the unique dataset we collected, a unified misconfiguration index we developed for label standardization, our experimentation code, and GenKubeSec itself as an open-source tool.
摘要:与Kubernetes配置文件(KCF)相关的一个关键挑战是,它们往往非常复杂和容易出错,导致安全漏洞和操作受挫。用于检测KCF错误配置的基于规则(RB)的工具依赖于静态规则集,这使得它们具有固有的局限性,并且无法检测新发现的错误配置。RB工具也会受到误检的影响,因为在编码检测规则时可能会出错。目前用于检测和修复KCF错误配置的方法在可扩展性和检测覆盖范围方面受到限制,或者是因为它们具有很高的专业知识要求,并且不能在检测错误配置的同时提供自动修复。在其流程中采用LLM的新方法依赖于基于API的、通用的、主要是商业的模型。因此,它们构成了安全挑战,具有不一致的分类性能,并且可能成本高昂。在本文中,我们提出了GenKubeSec,这是一个全面的、自适应的、基于LLM的方法,除了检测各种KCF错误配置外,还识别错误配置的确切位置,提供关于错误配置的详细推理,以及建议的补救措施。与三种行业标准的RB工具进行经验比较时,GenKubeSec获得了同等的准确率(0.990)和更好的召回率(0.999)。当Kubernetes安全专家对随机抽样的KCF进行检查时,GenKubeSec关于错误配置定位、推理和补救的解释是100%正确的、信息丰富的和有用的。为了促进这一领域的进一步发展,我们共享了我们收集的独特数据集、我们为标签标准化开发的统一错误配置索引、我们的实验代码以及作为开源工具的GenKubeSec本身。

[NLP-35] KNOW: A Real-World Ontology for Knowledge Capture with Large Language Models
[NLP-35] KNOW:用于使用大型语言模型知识捕获的现实世界的本体

链接: https://arxiv.org/abs/2405.19877
作者: Arto Bendiken
关键词: Knowledge Navigator Ontology, augment large language, Knowledge Navigator, Navigator Ontology, personal AI assistants
中文关键词: 知识导航器Ontology、增强大型语言、知识导航器、导航器Ontology、个人AI助手
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:We present KNOW–the Knowledge Navigator Ontology for the World–the first ontology designed to capture everyday knowledge to augment large language models (LLMs) in real-world generative AI use cases such as personal AI assistants. Our domain is human life, both its everyday concerns and its major milestones. We have limited the initial scope of the modeled concepts to only established human universals: spacetime (places, events) plus social (people, groups, organizations). The inclusion criteria for modeled concepts are pragmatic, beginning with universality and utility. We compare and contrast previous work such as this http URL and Cyc–as well as attempts at a synthesis of knowledge graphs and language models–noting how LLMs already encode internally much of the commonsense tacit knowledge that took decades to capture in the Cyc project. We also make available code-generated software libraries for the 12 most popular programming languages, enabling the direct use of ontology concepts in software engineering. We emphasize simplicity and developer experience in promoting AI interoperability.
摘要:我们提出了KNOW(Knowledge Navigator Ontology for the World),这是第一个旨在捕获日常知识的本体,以增强现实世界生成式AI用例(如个人AI助手)中的大型语言模型(LLM)。我们的领域是人类的生活,既包括日常事务,也包括重大里程碑。我们已将建模概念的初始范围限制在已确立的人类共性:时空(地点、事件)加上社会(人、团体、组织)。建模概念的纳入标准是务实的,从普遍性和实用性开始。我们比较和对比了以前的工作,如this http URL和Cyc,以及将知识图谱和语言模型相结合的尝试,并指出LLM已经在内部编码了许多常识性隐性知识,而这些知识在Cyc项目中花了几十年才捕获。我们还为12种最流行的编程语言提供了代码生成的软件库,从而能够在软件工程中直接使用本体概念。我们在促进AI互操作性方面强调简单性和开发人员体验。

[NLP-36] Is In-Context Learning Sufficient for Instruction Following in LLMs?
[NLP-36] 上下文学习足以实现LLM的指令遵循吗?

链接: https://arxiv.org/abs/2405.19874
作者: Hao Zhao,Maksym Andriushchenko,Francesco Croce,Nicolas Flammarion
关键词: potentially learn, changing their weights, promising capability, In-context learning, ICL
中文关键词: 潜在学习、改变权重、有前途的能力、上下文学习、ICL
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Code at this https URL

点击查看摘要

Abstract:In-context learning (ICL) allows LLMs to learn from examples without changing their weights, which is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for tasks such as classification, translation, or summarization, adding more ICL demonstrations for long-context LLMs does not systematically improve instruction following performance. To address this limitation, we derive a greedy selection approach for ICL examples that noticeably improves performance, yet without bridging the gap to instruction fine-tuning. Finally, we provide a series of ablation studies to better understand the reasons behind the remaining gap, and we show how some aspects of ICL depart from the existing knowledge and are specific to the instruction tuning setting. Overall, our work advances the understanding of ICL as an alignment technique. We provide our code at this https URL.
摘要:上下文学习(ICL)允许LLM在不改变权重的情况下从示例中学习,这对于可以从许多示例中学习的长上下文LLM来说是一种特别有前途的能力。最近,Lin等人(2024)提出了URIAL,这是一种仅使用三个上下文示例来对齐基础LLM的方法,实现了不平凡的指令遵循性能。在这项工作中,我们表明,尽管使用URIAL的ICL对齐是有效的,但在MT-Bench和AlpacaEval 2.0(LC)等已建立的基准测试上,其表现仍然不及指令微调,特别是在能力更强的基础LM上。与分类、翻译或摘要等任务不同,为长上下文LLM添加更多的ICL演示并不能系统地提高指令遵循性能。为了解决这一局限性,我们为ICL示例推导了一种贪婪选择方法,该方法显著提高了性能,但仍未弥合与指令微调的差距。最后,我们提供了一系列消融研究,以更好地了解剩余差距背后的原因,并展示了ICL的某些方面如何偏离现有认知,并且是指令微调设置所特有的。总体而言,我们的工作促进了对ICL作为一种对齐技术的理解。我们在此HTTPS URL上提供我们的代码。

[NLP-37] DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
[NLP-37] DevEval:与现实世界代码库保持一致的手动注释代码生成基准

链接: https://arxiv.org/abs/2405.19856
作者: Jia Li,Ge Li,Yunfei Zhao,Yongmin Li,Huanyu Liu,Hao Zhu,Lecheng Wang,Kaibo Liu,Zheng Fang,Lanshen Wang,Jiazheng Ding,Xuanming Zhang,Yuqi Zhu,Yihong Dong,Zhi Jin,Binhua Li,Fei Huang,Yongbin Li
关键词: Large Language Models, Language Models, Large Language, coding abilities, abilities of Large
中文关键词: 大型语言模型,语言模型,大型语言,编码能力,大型能力
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted by the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). arXiv admin note: substantial text overlap with arXiv:2404.00599, arXiv:2401.06401

点击查看摘要

Abstract:How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs’ coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs’ failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs’ predictions have been released.
摘要:如何评价大型语言模型(LLM)的编码能力一直是一个悬而未决的问题。我们发现,现有的基准测试与真实世界的代码库的一致性很差,不足以评估LLM的编码能力。为了解决这一知识鸿沟,我们提出了一个名为DevEval的新基准,该基准有三个方面的进步。(1)DevEval在多个维度上与现实世界的存储库保持一致,例如代码分布和依赖分布。(2)DevEval由13个开发人员进行注释,包含全面的注释(例如,需求、原始存储库、参考代码和参考依赖项)。(3)DevEval包含来自117个仓库的1874个测试样本,覆盖10个热门领域(如互联网、数据库)。基于DevEval,我们提出了存储库级代码生成,并在DevEval上对8种流行的LLM(如gpt-4、gpt-3.5、StarCoder 2、DeepSeek Coder、CodeLLaMa)进行了评估。我们的实验揭示了这些LLM在真实世界代码库中的编码能力。例如,在我们的实验中,gpt-4-turbo的最高Pass@1仅为53.04%。我们还分析了LLM的失败案例,并总结了它们的不足之处。我们希望DevEval能够促进实际代码库中LLM的开发。DevEval、提示词和LLM的预测已经发布。
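
The abstract reports a Pass@1 score without defining the metric. A widely used definition is the unbiased pass@k estimator introduced with the HumanEval benchmark: with n sampled completions of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether DevEval computes Pass@1 in exactly this way is an assumption here; the sketch is given only as background on the metric.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n -- number of sampled completions
    c -- number of those completions that pass all tests
    k -- sampling budget being evaluated
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 samples for one problem, 3 of them correct.
single_try = pass_at_k(10, 3, 1)
five_tries = pass_at_k(10, 3, 5)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n. A benchmark-level score is then the mean of this value over all problems.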

[NLP-38] Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model
[NLP-38] Quest:以查询为中心的数据合成方法,用于大型语言模型的长上下文扩展

链接: https://arxiv.org/abs/2405.19846
作者: Chaochen Gao,Xing Wu,Qi Fu,Songlin Hu
关键词: handle longer texts, initially pre-trained, handle longer, longer texts, texts by continuing
中文关键词: 处理更长的文本,最初预先训练,处理更长的文本,通过继续处理更长的文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.
摘要:大型语言模型最初是用有限的上下文长度进行预训练的,通过在具有扩展上下文的语料库上继续训练,可以更好地处理较长的文本。然而,由于长文档在不同领域中的稀缺性和不均匀分布,获取有效的长上下文数据是具有挑战性的。为了解决这个问题,我们提出了一种以查询为中心的数据合成方法,简称Quest。Quest是一种可解释的方法,基于这样的观察,即通过类似查询检索的文档是相关的,但冗余度较低,因此非常适合合成长上下文数据。该方法还具有可伸缩性,能够构建大量的长上下文数据。使用Quest,我们合成了长达128k的长上下文数据集,在多个长上下文基准数据集上的性能显著优于其他数据合成方法。此外,我们还通过标度律实验进一步验证了Quest方法是可预测的,使其成为改进长上下文模型的可靠解决方案。

[NLP-39] Improve Student’s Reasoning Generalizability through Cascading Decomposed CoTs Distillation
[NLP-39] 通过级联分解CoTs蒸馏提高学生模型推理的泛化能力

链接: https://arxiv.org/abs/2405.19842
作者: Chengwei Dai,Kun Li,Wei Zhou,Songlin Hu
关键词: Large language models, Large language, exhibit enhanced reasoning, exhibit enhanced, larger scales
中文关键词: 大型语言模型,大型语言,表现出增强的推理,表现出增强的、更大的规模
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers’ generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives – removing the answer from outputs and concatenating the question with the rationale as input – CasCoD’s two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at this https URL.
摘要:大型语言模型(LLM)在更大的规模上展示了更强的推理能力,推动了通过师生学习将这些能力蒸馏到更小模型中的努力。以前的工作只是根据教师生成的思维链(CoT)数据对学生模型进行微调。虽然这些方法提高了域内(IND)推理性能,但它们难以推广到域外(OOD)任务。我们认为,问题和答案之间普遍存在的虚假相关性可能会导致模型预设一个特定的答案,这限制了其推理过程的多样性和泛化能力。在本文中,我们提出了级联分解CoTs蒸馏(CasCoD)来解决这些问题,将传统的单步学习过程分解为两个级联学习步骤。具体地说,通过调整训练目标–从输出中去掉答案,并将问题与推理依据拼接作为输入–CasCoD的两步学习过程确保学生模型专注于学习推理依据,而不受预设答案的干扰,从而提高了推理的泛化能力。大量实验证明了CasCoD在IND和OOD基准推理数据集上的有效性。代码可以在此HTTPS URL中找到。

[NLP-40] Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text
[NLP-40] 只需再重写一遍:一种增强差分隐私重写文本语义相似性和隐私保护的后处理方法

链接: https://arxiv.org/abs/2405.19831
作者: Stephen Meisenbacher,Florian Matthes
关键词: Natural Language Processing, Natural Language, Language Processing, implicit private information, study of Differential
中文关键词: 自然语言处理,自然语言,语言处理,隐性私人信息,差异研究
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 2 tables. Accepted to ARES 2024 (IWAPS)

点击查看摘要

Abstract:The study of Differential Privacy (DP) in Natural Language Processing often views the task of text privatization as a rewriting task, in which sensitive input texts are rewritten to hide explicit or implicit private information. In order to evaluate the privacy-preserving capabilities of a DP text rewriting mechanism, empirical privacy tests are frequently employed. In these tests, an adversary is modeled, who aims to infer sensitive information (e.g., gender) about the author behind a (privatized) text. Looking to improve the empirical protections provided by DP rewriting methods, we propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts, where DP rewritten texts are rewritten again. Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations. Therefore, our approach raises the bar for DP rewriting methods in their empirical privacy evaluations, providing an extra layer of protection against malicious adversaries.
摘要:在自然语言处理中的差分隐私(DP)研究中,通常将文本私有化视为一种重写任务,即对敏感的输入文本进行重写,以隐藏显性或隐含的隐私信息。为了评估DP文本重写机制的隐私保护能力,经常使用经验性隐私测试。在这些测试中,模拟了一个对手,他的目标是推断(私有化)文本背后关于作者的敏感信息(例如,性别)。为了改善DP重写方法提供的经验保护,我们提出了一种简单的后处理方法,其目标是将重写的文本与原始文本对齐,其中DP重写的文本被再次重写。我们的结果表明,这种方法不仅产生的输出在语义上更接近原始输入,而且文本在经验隐私评估中的平均得分更高。因此,我们的方法提高了DP重写方法在其经验隐私评估中的门槛,提供了针对恶意攻击者的额外一层保护。

[NLP-41] Unsupervised Mutual Learning of Dialogue Discourse Parsing and Topic Segmentation
[NLP-41] 对话话语解析和话题分割的无监督相互学习

链接: https://arxiv.org/abs/2405.19799
作者: Jiahui Xu,Feng Jiang,Anningzhe Gao,Haizhou Li
关键词: large language models, advancement of large, large language, propelled the development, dialogue systems
中文关键词: 大型语言模型,大型语言的进步,推动了对话系统的发展
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of large language models (LLMs) has propelled the development of dialogue systems. Unlike the popular ChatGPT-like assistant model, which only satisfies the user’s preferences, task-oriented dialogue systems have also faced new requirements and challenges in the broader business field. They are expected to provide correct responses at each dialogue turn while achieving the overall goal defined by the task. By understanding rhetorical structures and topic structures via topic segmentation and discourse parsing, a dialogue system can plan better to achieve both objectives. However, while both structures belong to discourse structure in linguistics, rhetorical structure and topic structure are mostly modeled separately or with one assisting the other in the prior work. The interaction between these two structures has not been considered for joint modeling and mutual learning. Furthermore, unsupervised learning techniques to achieve the above are not well explored. To fill this gap, we propose an unsupervised mutual learning framework of two structures leveraging the global and local connections between them. We extend the topic modeling between non-adjacent discourse units to ensure global structural relevance with rhetorical structures. We also incorporate rhetorical structures into the topic structure through a graph neural network model to ensure local coherence consistency. Finally, we utilize the similarity between the two fused structures for mutual learning. The experimental results demonstrate that our methods outperform all strong baselines on two dialogue rhetorical datasets (STAC and Molweni), as well as dialogue topic datasets (Doc2Dial and TIAGE).
摘要:大型语言模型(LLM)的进步推动了对话系统的发展。与流行的类似ChatGPT的助手模式只满足用户的偏好不同,面向任务的对话系统在更广泛的商业领域也面临着新的要求和挑战。它们被期望在每一轮对话中做出正确的回答,同时实现任务规定的总体目标。通过话题切分和语篇分析来理解修辞结构和话题结构,对话系统可以更好地规划以实现这两个目标。然而,虽然这两种结构在语言学中都属于语篇结构,但修辞结构和话题结构在以往的工作中大多是单独建模或一方辅助另一方的。在联合建模和相互学习中,没有考虑这两个结构之间的相互作用。此外,实现上述目标的无监督学习技术还没有得到很好的探索。为了填补这一空白,我们提出了一个针对这两种结构的无监督相互学习框架,利用它们之间的全局和局部联系。我们扩展了不相邻语篇单元之间的主题建模,以确保全局结构与修辞结构的相关性。我们还通过图神经网络模型将修辞结构融入到主题结构中,以确保局部连贯一致性。最后,我们利用两种融合结构之间的相似性进行相互学习。实验结果表明,我们的方法在两个对话修辞数据集(STAC和Molweni)以及对话主题数据集(Doc2Dial和TIAGE)上的性能都优于所有强基线。

[NLP-42] SLM as Guardian: Pioneering AI Safety with Small Language Models
[NLP-42] 作为守护者的SLM:利用小语言模型开创人工智能安全

链接: https://arxiv.org/abs/2405.19795
作者: Ohjoon Kwon,Donghyeon Jeon,Nayoung Choi,Gyu-Hwung Cho,Changbong Kim,Hyunwoo Lee,Inho Kang,Sun Kim,Taiwoo Park
关键词: prior safety research, large language models, research of large, focused on enhancing, enhancing the alignment
中文关键词: 先前的安全研究、大型语言模型、大型研究、专注于增强、增强一致性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most prior safety research of large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. To overcome such challenges, a modular approach employing a smaller LLM to detect harmful user queries is regarded as a convenient solution in designing LLM-based systems with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, achieving harmful query detection and safeguard response performance on par with or surpassing that of publicly available LLMs.
摘要:以往对大语言模型(LLM)的安全性研究主要集中在增强LLM的对齐,以更好地适应人类的安全需求。然而,将这些保障功能内化到更大的模型中带来了训练成本更高和有用性意外下降的挑战。为了克服这些挑战,利用较小的LLM来检测有害用户查询的模块化方法,被认为是设计具有安全要求的基于LLM的系统的一种方便的解决方案。在本文中,我们利用较小的LLM来进行有害查询检测和安全响应生成。我们介绍了我们的安全需求和危害类别的分类体系,然后提出了一种将两个任务融合到一个模型中的多任务学习机制。我们证明了我们的方法的有效性,与公开可用的LLM相比,在有害查询检测和安全响应性能上达到了同等或更高的水平。

[NLP-43] PDDLEGO: Iterative Planning in Textual Environments
[NLP-43] PDDLEGO:文本环境中的迭代规划

链接: https://arxiv.org/abs/2405.19793
作者: Li Zhang,Peter Jansen,Tianyi Zhang,Peter Clark,Chris Callison-Burch,Niket Tandon
关键词: current models, long-standing challenge, textual environments, representation, Planning in textual
中文关键词: 当前模型、长期挑战、文本环境、表示、文本规划
类目: Computation and Language (cs.CL)
备注: In *SEM 2024

点击查看摘要

Abstract:Planning in textual environments has been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially insufficient information to plan for the end-goal. We propose PDDLEGO, which iteratively constructs a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).
摘要:文本环境中的规划已被证明是一个长期存在的挑战,即使对当前的模型也是如此。最近,一个前景看好的研究方向使用LLM来生成环境的形式化表示,该表示可以由符号规划器求解。然而,现有的方法依赖于完全可观察的环境,其中所有实体状态最初都是已知的,因此可以构建一次性表示,从而产生完整的计划。相比之下,我们处理的是部分可观察的环境,在这些环境中,最初没有足够的信息来为最终目标制定计划。我们提出了PDDLEGO,它迭代地构造一个规划表示,该表示可以导出给定子目标的部分计划。通过完成子目标,获得更多的信息来扩充该表示,最终实现最终目标。我们表明,少样本PDDLEGO生成的计划比在Coin Collector模拟上端到端生成计划的效率高43%,在更复杂的Cooking World模拟上具有很强的性能(98%),而端到端LLM在该模拟上无法生成连贯的计划(4%)。

[NLP-44] From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers

Link: https://arxiv.org/abs/2405.19787
Authors: Dylan Zhang, Justin Wang, Francois Charton
Keywords: tuning large language, large language models, instruction-output pairs, real world, tuning large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)

Abstract: Instruction tuning (tuning large language models on instruction-output pairs) is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model’s capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called the Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model’s ability to follow instructions and perform tasks.
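
For readers unfamiliar with Markov algorithms, the following minimal interpreter shows the string-rewriting model the abstract refers to: an ordered rule list, where the first applicable rule rewrites its leftmost match and the process repeats. The function name `run_markov` and the `max_steps` guard are illustrative choices; the binary-to-unary rule set is a standard textbook example, not data from the paper.

```python
def run_markov(rules, text, max_steps=10_000):
    """Apply the first applicable rule to its leftmost match until none applies.

    rules: ordered list of (pattern, replacement, is_terminal) triples.
    """
    for _ in range(max_steps):
        for pattern, replacement, is_terminal in rules:
            if pattern in text:
                text = text.replace(pattern, replacement, 1)  # leftmost match only
                if is_terminal:
                    return text
                break  # restart from the highest-priority rule
        else:
            return text  # no rule matched: the algorithm halts
    raise RuntimeError("step limit reached; the rule set may not terminate")


# Textbook example: rewrite a binary numeral into unary tally marks.
BINARY_TO_UNARY = [
    ("|0", "0||", False),
    ("1", "0|", False),
    ("0", "", False),
]
```

With these rules, `run_markov(BINARY_TO_UNARY, "101")` produces five tally marks. Each rule set defines one "task", which is presumably what makes the number and variety of tasks easy to control in a synthetic setting of this kind.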

[NLP-45] Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion

Link: https://arxiv.org/abs/2405.19782
Authors: Wei Cheng, Yuhan Wu, Wei Hu
Keywords: Recent years, code language models, code intelligence tasks, language models, years have witnessed
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Abstract: Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it is challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, which is insufficiently relevant to completion targets. In this paper, we propose a dataflow-guided retrieval augmentation approach, called DraCo, for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever code completion is triggered, DraCo precisely retrieves relevant background knowledge from the repo-specific context graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and applicable efficiency of DraCo, improving code exact match by 3.43% and identifier F1-score by 3.27% on average compared to the state-of-the-art approach.

[NLP-46] Enhancing Consistency and Role-Specific Knowledge Capturing by Rebuilding Fictional Characters Persona

Link: https://arxiv.org/abs/2405.19778
Authors: Jeiyoon Park, Chanjun Park, Heuiseok Lim
Keywords: Assistants API, document-based language models, Assistants, recent introduction, expected that document-based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract: With the recent introduction of the Assistants API, it is expected that document-based language models will be actively used in various domains, especially role-playing. However, a key challenge lies in utilizing the protagonist’s persona: the Assistants API often fails to achieve this with its search because the information extraction part is different each time and it often omits important information such as the protagonist’s backstory or relationships. It is hard to maintain a consistent persona simply by using the persona document as input to the Assistants API. To address the challenge of achieving stable persona consistency, we propose CharacterGPT, a novel persona reconstruction framework to alleviate the shortcomings of the Assistants API. Our method involves Character Persona Training (CPT), an effective persona rebuilding process that updates the character persona by extracting the character’s traits from a given summary of the novel for each character, as if the story in the novel were progressing. In our experiments, we ask each character to take the Big Five Inventory personality test in various settings and analyze the results. To assess whether it can think outside the box, we let each character generate short novels. Extensive experiments and human evaluation demonstrate that CharacterGPT presents new possibilities for role-playing agent research.

[NLP-47] Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

Link: https://arxiv.org/abs/2405.19763
Authors: Kuo Liao, Shuang Li, Meng Zhao, Liqun Liu, Mengge Xue, Zhenyu Hu, Honglin Han, Chengguo Yin
Keywords: significantly enhance generation, Natural Language Understanding, Recent strides, leveraging reinforcement learning, yielded remarkable performance
Categories: Computation and Language (cs.CL)
Comments: Accepted at ACL 2024 Main

Abstract: Recent strides in large language models (LLMs) have yielded remarkable performance, leveraging reinforcement learning from human feedback (RLHF) to significantly enhance generation and alignment capabilities. However, RLHF encounters numerous challenges, including the objective mismatch issue, leading to suboptimal performance in Natural Language Understanding (NLU) tasks. To address this limitation, we propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR) to amplify the performance of LLMs in NLU tasks. By incorporating label-sensitive pairs into reinforcement learning, our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding. Experiments conducted on five diverse foundation models across eight tasks showcase promising results. In comparison to Supervised Fine-tuning models (SFT), RLLR demonstrates an average performance improvement of 1.54%. Compared with RLHF models, the improvement averages 0.69%. These results reveal the effectiveness of our method for LLMs in NLU tasks. Code and data available at: this https URL.

[NLP-48] X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions

Link: https://arxiv.org/abs/2405.19744
Authors: Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong
Keywords: Large language models, Large language, instruction, English, cross-lingual instruction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2024. Our code, data and model weights are available at this https URL

Abstract: Large language models respond well in high-resource languages like English but struggle in low-resource languages. This may arise from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages can be a solution but is unreliable, leading to responses with translation errors and lacking language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with instructions in English and responses in low-resource languages. Specifically, the language model first learns to generate appropriate English instructions according to the natural web texts in other languages as responses. The candidate cross-lingual instruction tuning samples are further refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset on 10 languages, namely X-Instruction. The instruction data built using our method incorporate more language-specific knowledge compared with the naive translation method. Experimental results have shown that the response quality of the model tuned on X-Instruction greatly exceeds that of the model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow the instruction in the output language without further tuning.

[NLP-49] PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

Link: https://arxiv.org/abs/2405.19740
Authors: Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, Wei Lin
Keywords: large language models, Expert-designed close-ended benchmarks, Expert-designed close-ended, language models, serve as vital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 23 pages, 12 figures, 10 tables

Abstract: Expert-designed close-ended benchmarks serve as vital tools in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs’ knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of transition analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs’ genuine knowledge capacity. Six state-of-the-art LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 21% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs’ uncertainty to specious knowledge, potentially being resolved through rote memorization and leading to inflated performance. We also find that the detailed transition analyses by PertEval could illuminate weaknesses in existing LLMs’ knowledge mastery and guide the development of refinement. Given these insights, we posit that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs, marking a significant step toward more trustworthy LLM evaluation.
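
The idea of comparing raw and perturbed test sets per question can be sketched as a small transition table. This is an illustrative assumption about what such an analysis might look like, not PertEval's actual code: the function names, the four transition labels, and the sample data are all hypothetical.

```python
from collections import Counter


def transition_table(raw_correct, perturbed_correct):
    """Count how each question's outcome changes from the raw to the perturbed set."""
    if len(raw_correct) != len(perturbed_correct):
        raise ValueError("both runs must cover the same questions")
    names = {
        (True, True): "stays correct",
        (True, False): "correct to wrong",
        (False, True): "wrong to correct",
        (False, False): "stays wrong",
    }
    return Counter(
        names[(bool(r), bool(p))] for r, p in zip(raw_correct, perturbed_correct)
    )


def accuracy(flags):
    """Fraction of questions answered correctly."""
    return sum(flags) / len(flags)


# Made-up per-question results for one model on five questions.
raw = [True, True, True, False, True]
perturbed = [True, False, False, False, True]
table = transition_table(raw, perturbed)
drop = accuracy(raw) - accuracy(perturbed)  # absolute accuracy lost under perturbation
```

A large "correct to wrong" count with few "wrong to correct" cases is the pattern one would expect when raw-benchmark accuracy overstates what the model actually knows.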

[NLP-50] Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

Link: https://arxiv.org/abs/2405.19737
Authors: Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu
Keywords: Large Language Models, Smaller Language Models, compact Smaller Language, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (≈4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning of student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher’s reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that further aids SLMs in learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. Code can be found at this https URL.
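
To make the "minimum edit distance on dual CoTs" step concrete, here is a minimal token-level Levenshtein alignment that reports where two reasoning chains diverge. It is a sketch under assumptions: whitespace tokenization, the function name `align`, and the example sentences are illustrative and not taken from the paper.

```python
def align(a, b):
    """Token-level Levenshtein distance plus the aligned edit operations."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    # Walk back through the table to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            kind = "keep" if a[i - 1] == b[j - 1] else "sub"
            ops.append((kind, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", a[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, b[j - 1]))
            j -= 1
    return d[n][m], ops[::-1]


correct_cot = "3 apples plus 4 apples is 7 apples".split()
wrong_cot = "3 apples plus 4 apples is 8 apples".split()
distance, ops = align(correct_cot, wrong_cot)
key_steps = [op for op in ops if op[0] != "keep"]
```

Here `key_steps` isolates the single substituted token where the two chains diverge; in a distillation setting, positions like this would be the natural candidates for extra training weight.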

[NLP-51] Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Link: https://arxiv.org/abs/2405.19732
Authors: Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo
Keywords: skill generally relies, Learning a skill, insightful high-level guidance, skill generally, generally relies
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: Learning a skill generally relies on both practical experience by the doer and insightful high-level guidance by the instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making a locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at this https URL.
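
The interleaving pattern (gradient stage, then a high-level proposal used as the next restart point) can be shown on a one-dimensional non-convex function. This is a toy sketch, not the paper's method: `propose_restart` is a hand-written placeholder standing in for the LLM, and the objective `f`, the step size, and the round count are arbitrary assumptions.

```python
def f(x):
    """Toy non-convex objective with a local and a global minimum."""
    return x**4 - 3 * x**2 + x


def grad(x):
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, steps=500, lr=0.01):
    """The 'doer': plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def propose_restart(trajectory):
    """Placeholder for the LLM 'instructor' (an assumption of this sketch).

    It inspects the recorded (x, f(x)) pairs and suggests mirroring the best
    point so far; a real system would query a language model here instead.
    """
    best_x, _ = min(trajectory, key=lambda pair: pair[1])
    return -best_x


def interleaved_optimize(x0, rounds=3):
    """Alternate gradient stages with proposed restarts; return the best point."""
    trajectory, x = [], x0
    for _ in range(rounds):
        x = gradient_descent(x)
        trajectory.append((x, f(x)))
        x = propose_restart(trajectory)
    return min(trajectory, key=lambda pair: pair[1])
```

Starting from `x0 = 2.0`, gradient descent alone settles in the local minimum near `x ≈ 1.13`, while the interleaved loop also visits the basin near `x ≈ -1.30` and keeps the better of the two.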

[NLP-52] Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Link: https://arxiv.org/abs/2405.19716
Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang
Keywords: Large vision language, integrate large language, large language models, pre-trained vision encoders, vision language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, 14 figures, 6 tables

Abstract: Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging the model’s own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

[NLP-53] SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Link: https://arxiv.org/abs/2405.19715
Authors: Kaixuan Huang, Xudong Guo, Mengdi Wang
Keywords: target large language, large language model, Markov Decision Process, Speculative decoding reduces, faster draft model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K, the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B and 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
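
The threshold stopping rule in this abstract reduces to simple arithmetic: if each drafted token has a predicted conditional acceptance probability, the probability that at least one is rejected is one minus their product. The sketch below assumes those per-token probabilities are already available (in the paper they come from a trained prediction head); the function name, the `threshold` values, and the convention that the token crossing the threshold is still counted are illustrative assumptions.

```python
def candidate_length(accept_probs, threshold):
    """Number of draft tokens to generate before handing off to verification.

    accept_probs: predicted conditional acceptance probability of each
    successive draft token (hypothetical inputs for this sketch).
    """
    all_accepted = 1.0
    for k, p in enumerate(accept_probs, start=1):
        all_accepted *= p
        # P(at least one rejection among the first k tokens) = 1 - prod(p_i)
        if 1.0 - all_accepted > threshold:
            return k
    return len(accept_probs)
```

With four tokens at 0.9 each and a threshold of 0.25, the rejection probability grows as 0.10, 0.19, 0.271, so drafting stops after the third token; with very confident predictions the loop simply runs to the end of the list.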

[NLP-54] Significance of Chain of Thought in Gender Bias Mitigation for English-Dravidian Machine Translation

Link: https://arxiv.org/abs/2405.19701
Authors: Lavanya Prahallad, Radhika Mamidi
Keywords: machine translation systems, Gender bias, achieving accurate, accurate and inclusive, examines gender bias
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract: Gender bias in machine translation (MT) systems poses a significant challenge to achieving accurate and inclusive translations. This paper examines gender bias in machine translation systems for languages such as Telugu and Kannada from the Dravidian family, analyzing how gender inflections affect translation accuracy and neutrality using Google Translate and ChatGPT. It finds that while plural forms can reduce bias, individual-centric sentences often maintain the bias due to historical stereotypes. The study evaluates the Chain of Thought processing, noting significant bias mitigation from 80% to 4% in Telugu and from 40% to 0% in Kannada. It also compares Telugu and Kannada translations, emphasizing the need for language-specific strategies to address these challenges and suggesting directions for future research to enhance fairness in both data preparation and prompts during inference.

[NLP-55] One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Link: https://arxiv.org/abs/2405.19670
Authors: Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen
Keywords: large language models, improve large language, Retrieval-augmented generation, language models, generating more factual
Categories: Computation and Language (cs.CL)
Comments: Work in progress, repo: this https URL

Abstract: Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune the LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs’ general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs’ original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs’ performance but also preserves their general generation capacities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across nine question-answering tasks demonstrate the superiority of our approach.

[NLP-56] PATIENT-Psi: Using Large Language Models to Simulate Patients for Training Mental Health Professionals

Link: https://arxiv.org/abs/2405.19660
Authors: Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Shaun M. Eack, Travis Labrum, Samuel M. Murphy, Nev Jones, Kate Hardy, Hong Shen, Fei Fang, Zhiyu Zoey Chen
Keywords: Psi, patient, Mental illness remains, public health issues, critical public health
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract: Mental illness remains one of the most critical public health issues, with a significant gap between the available mental health support and patient needs. Many mental health professionals highlight a disconnect between their training and real-world patient interactions, leaving some trainees feeling unprepared and potentially affecting their early career success. In this paper, we propose PATIENT-Ψ, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-Ψ, we constructed diverse patient profiles and their corresponding cognitive models based on CBT principles, and then used large language models (LLMs) programmed with the patient cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-Ψ-TRAINER, for mental health trainees to practice a key skill in CBT (formulating the cognitive model of the patient) through role-playing a therapy session with PATIENT-Ψ. To evaluate PATIENT-Ψ, we conducted a user study of 4 mental health trainees and 10 experts. The results demonstrate that practice using PATIENT-Ψ-TRAINER greatly enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts’ perceptions, PATIENT-Ψ is perceived to be closer to real patient interactions than GPT-4, and PATIENT-Ψ-TRAINER holds strong promise to improve trainee competencies. Our pioneering patient simulation training framework, using LLMs, holds great potential to enhance and advance mental health training, ultimately leading to improved patient care and outcomes. We will release all our data, code, and the training platform.

[NLP-57] Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Link: https://arxiv.org/abs/2405.19648
Authors: Ernesto Quevedo, Jorge Yero, Rachel Koerner, Pablo Rivas, Tomas Cerny
Keywords: Large Language Models, produce inaccurate outputs, Language Models, Large Language, propensity of Large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICAI’24 - The 26th Int’l Conf on Artificial Intelligence

Abstract: Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources and rely on extensive LLMs, or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses that are difficult to reproduce, and they largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers utilizing only four numerical features derived from tokens and vocabulary probabilities obtained from other LLM evaluators, which are not necessarily the same. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly at this https URL.
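
The abstract does not name its four numerical features, so the sketch below only illustrates the general idea of summarizing an evaluator model's token and vocabulary probabilities into a fixed-size feature vector. The four features chosen here, their names, and the input format are assumptions made for illustration and should not be read as the paper's definitions.

```python
def token_probability_features(steps):
    """Summarize per-token probabilities into four scalar features.

    steps: list of (generated_token_prob, vocabulary_probs) pairs, one per
    generated token, as scored by some evaluator LLM (hypothetical format).
    """
    chosen = [p for p, _ in steps]
    return {
        # Least likely generated token.
        "min_token_prob": min(chosen),
        # Mean probability of the generated tokens.
        "avg_token_prob": sum(chosen) / len(chosen),
        # Largest gap between the evaluator's top choice and the generated token.
        "max_prob_deviation": max(max(v) - p for p, v in steps),
        # Flattest next-token distribution seen (smallest max-min range).
        "min_prob_spread": min(max(v) - min(v) for _, v in steps),
    }


# Two generated tokens with made-up three-word vocabularies.
example = [(0.9, [0.9, 0.05, 0.05]), (0.2, [0.5, 0.3, 0.2])]
features = token_probability_features(example)
```

A feature dictionary like this could then be fed to any small classifier trained on labeled hallucinated and faithful outputs; which classifier to use is left open here.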

[NLP-58] GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment

Link: https://arxiv.org/abs/2405.19635
Authors: Yao Yao, Zuchao Li, Hai Zhao
Keywords: Large Language Models, elevated resource demands, Large Language, Language Models, resource demands
Categories: Computation and Language (cs.CL)

Abstract: The burgeoning size of Large Language Models (LLMs) has led to enhanced capabilities in generating responses, albeit at the expense of increased inference times and elevated resource demands. Existing methods of acceleration, predominantly hinged on knowledge distillation, generally necessitate fine-tuning of considerably large models, such as Llama-7B, posing a challenge for average users. Furthermore, present techniques for expediting inference and reducing costs operate independently. To address these issues, we introduce a novel and intuitive Guidance-based Knowledge Transfer (GKT) framework. This approach leverages a larger LLM as a "teacher" to create guidance prompts, paired with a smaller "student" model to finalize responses. Remarkably, GKT requires no fine-tuning and doesn’t necessitate the teacher and student models to have the same vocabulary, allowing for extensive batch generation to accelerate the process while ensuring user customization. GKT can be seamlessly integrated into cloud-edge collaboration architectures, and is versatile enough for plug-and-play application across various models. It excels in both efficiency and affordability, epitomizing a "cheap and cheerful" solution. GKT achieves a maximum accuracy improvement of 14.18%, along with a 10.72 times speed-up on GSM8K and an accuracy improvement of 14.00% along with a 7.73 times speed-up in CSQA. When utilizing ChatGPT as teacher model and Llama2-70B as the student model, we can achieve 95.00% of ChatGPT’s performance at 52% of the cost. The results highlight substantial enhancements in accuracy and processing speed on the GSM8K and CSQA datasets, surpassing the performance of using either the student or teacher models in isolation.
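The teacher and student flow described in the abstract above can be sketched roughly as follows. This is a minimal illustration that assumes the guidance is passed as plain text, which is one way two models could cooperate without sharing a vocabulary. The names `guided_answer`, `toy_teacher`, and `toy_student` are hypothetical stand-ins, not part of GKT.

```python
from typing import Callable


def guided_answer(question: str,
                  teacher: Callable[[str], str],
                  student: Callable[[str], str]) -> str:
    """Ask the large model for short guidance, then let the small model finish."""
    guidance = teacher(question)          # short text produced by the larger model
    prompt = f"{question}\n{guidance}"    # guidance travels as text, not token ids
    return student(prompt)                # smaller model completes the response


# Hypothetical stand-ins for real model calls.
def toy_teacher(question: str) -> str:
    return "Hint: add the two numbers step by step."


def toy_student(prompt: str) -> str:
    return prompt + "\nAnswer: 4"


answer = guided_answer("What is 2 + 2?", toy_teacher, toy_student)
```

In a cloud-edge setting, `teacher` would presumably run remotely and `student` on the device, but that split is not shown here.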

[NLP-59] Easy Problems That LLMs Get Wrong

Link: https://arxiv.org/abs/2405.19616
Authors: Sean Williams, James Huckle
Keywords: comprehensive Linguistic Benchmark, Linguistic Benchmark designed, Large Language Models, Large Language, Linguistic Benchmark
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: AutogenAI Ltd. Associated code at this https URL

Abstract:We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.

[NLP-60] SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors

Link: https://arxiv.org/abs/2405.19597
Authors: Vijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, Sujay Sanghavi
Keywords: Popular parameter-efficient fine-tuning, freeze pre-trained model, Popular parameter-efficient, pre-trained model weights, inject learnable matrices
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 5 figures, 14 tables

Abstract: Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its variants, freeze pre-trained model weights W and inject learnable matrices ΔW. These ΔW matrices are structured for efficient parameterization, often using techniques like low-rank approximations or scaling vectors. However, these methods typically show a performance gap compared to full fine-tuning. Although recent PEFT methods have narrowed this gap, they do so at the cost of additional learnable parameters. We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on ΔW depends on the specific weight matrix W. Specifically, SVFT updates W as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. This approach allows fine-grained control over expressivity through the number of coefficients. Extensive experiments on language and vision benchmarks show that SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.
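The update described in the abstract above can be written out as a short worked equation. This is a sketch inferred from the abstract's wording only; the symbols U, Σ, V, M, m_{ij}, and the index set Ω are notation assumed here for illustration and may differ from the paper's.

```latex
% Singular value decomposition of the frozen pre-trained weight matrix
W = U \Sigma V^{\top}, \qquad U = [u_1, \dots, u_r], \quad V = [v_1, \dots, v_r]

% Update as a sparse combination of outer products of singular vectors;
% only the coefficients m_{ij} with (i, j) in the sparsity pattern \Omega are trained
\Delta W = U M V^{\top} = \sum_{(i,j) \in \Omega} m_{ij} \, u_i v_j^{\top}

% Fine-tuned weights
W' = W + \Delta W = U (\Sigma + M) V^{\top}
```

Under this reading, the number of trainable parameters equals the size of Ω, which matches the abstract's claim that expressivity is controlled through the number of coefficients. If Ω contains only diagonal entries, the update simply rescales the singular values of W.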

[NLP-61] Why Larger Language Models Do In-context Learning Differently?

Link: https://arxiv.org/abs/2405.19592
Authors: Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
Keywords: unseen tasks based, in-context learning, unseen tasks, tasks based, ICL behaviors
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

[NLP-62] A Deep Convolutional Neural Network-based Model for Aspect and Polarity Classification in Hausa Movie Reviews

Link: https://arxiv.org/abs/2405.19575
Authors: Umar Ibrahim, Abubakar Yakubu Zandam, Fatima Muhammad Adam, Aminu Musa
Keywords: Aspect-based Sentiment Analysis, Convolutional Neural Network, Deep Convolutional Neural, understanding sentiment nuances, Aspect-based Sentiment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in the proceedings of ICCAIT 2023

Abstract: Aspect-based Sentiment Analysis (ABSA) is crucial for understanding sentiment nuances in text, especially across diverse languages and cultures. This paper introduces a novel Deep Convolutional Neural Network (CNN)-based model tailored for aspect and polarity classification in Hausa movie reviews, an underrepresented language in sentiment analysis research. A comprehensive Hausa ABSA dataset is created, filling a significant gap in resource availability. The dataset, preprocessed using scikit-learn for TF-IDF transformation, includes manually annotated aspect-level feature ontology words and sentiment polarity assignments. The proposed model combines CNNs with attention mechanisms for aspect-word prediction, leveraging contextual information and sentiment polarities. With 91% accuracy on aspect term extraction and 92% on sentiment polarity classification, the model outperforms traditional machine models, offering insights into specific aspects and sentiments. This study advances ABSA research, particularly in underrepresented languages, with implications for cross-cultural linguistic research.

[NLP-63] Unlearning Climate Misinformation in Large Language Models

Link: https://arxiv.org/abs/2405.19563
Authors: Michael Fore, Simranjit Singh, Chaehong Lee, Amritanshu Pandey, Antonios Anastasopoulos, Dimitrios Stamoulis
Keywords: threats to humanity, key roadblock, roadblock in addressing, climate change, climate
Categories: Computation and Language (cs.CL)

Abstract:Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled QA data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model’s responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.

[NLP-64] Selective Explanations

Link: https://arxiv.org/abs/2405.19562
Authors: Lucas Monteiro Paes, Dennis Wei, Flavio P. Calmon
Keywords: explain black-box machine, assigning importance scores, methods explain black-box, black-box machine learning, Feature attribution
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there have been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores with only one inference. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers and their high-quality counterparts.
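The selection step in the abstract above, where a chosen fraction of samples receives the improved explanation, can be sketched as below. This is only an illustration under assumptions: the per-sample `uncertainty` score, the `refine` callback standing in for "explanations with initial guess", and both function names are hypothetical, since the abstract does not define how low-quality explanations are detected or improved.

```python
import math


def select_for_refinement(uncertainty, fraction):
    """Return indices of the samples whose amortized explanations look least reliable."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    k = math.ceil(fraction * len(uncertainty))
    # Rank samples from most to least uncertain and keep the first k.
    ranked = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return sorted(ranked[:k])


def selective_explanations(amortized, uncertainty, fraction, refine):
    """Keep cheap explanations where they look fine; refine the rest."""
    chosen = set(select_for_refinement(uncertainty, fraction))
    return [refine(i, e) if i in chosen else e for i, e in enumerate(amortized)]


# Toy example: four samples, refine the most uncertain half.
picked = select_for_refinement([0.1, 0.9, 0.4, 0.7], fraction=0.5)
result = selective_explanations(
    amortized=["a0", "a1", "a2", "a3"],
    uncertainty=[0.1, 0.9, 0.4, 0.7],
    fraction=0.5,
    refine=lambda i, e: e + "-refined",
)
```

The `fraction` argument plays the role of the practitioner-specified budget mentioned in the abstract.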

[NLP-65] Quo Vadis ChatGPT? From Large Language Models to Large Knowledge Models

Link: https://arxiv.org/abs/2405.19561
Authors: Venkat Venkatasubramanian, Arijit Chakraborty
Keywords: natural language processing, transformer-based generative neural, generative neural network, neural network architecture, large language models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The startling success of ChatGPT and other large language models (LLMs) using transformer-based generative neural network architecture in applications such as natural language processing and image synthesis has many researchers excited about potential opportunities in process systems engineering (PSE). The almost human-like performance of LLMs in these areas is indeed very impressive, surprising, and a major breakthrough. Their capabilities are very useful in certain tasks, such as writing first drafts of documents, code writing assistance, text summarization, etc. However, their success is limited in highly scientific domains as they cannot yet reason, plan, or explain due to their lack of in-depth domain knowledge. This is a problem in domains such as chemical engineering as they are governed by fundamental laws of physics and chemistry (and biology), constitutive relations, and highly technical knowledge about materials, processes, and systems. Although purely data-driven machine learning has its immediate uses, the long-term success of AI in scientific and engineering domains would depend on developing hybrid AI systems that use first principles and technical knowledge effectively. We call these hybrid AI systems Large Knowledge Models (LKMs), as they will not be limited to only NLP-based techniques or NLP-like applications. In this paper, we discuss the challenges and opportunities in developing such systems in chemical engineering.

[NLP-66] CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts Images and Patients

Link: https://arxiv.org/abs/2405.19538
Authors: Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, Curtis P. Langlotz
Keywords: original CheXpert paper, years ago, paper five years, original CheXpert, CheXpert paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages

Abstract: Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: this https URL. Models are available at the following URL: this https URL.

[NLP-67] Preference Learning Algorithms Do Not Learn Preference Rankings

Link: https://arxiv.org/abs/2405.19534
Authors: Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho
Keywords: Preference learning algorithms, ranking accuracy, Preference learning, produce generations, preferred outputs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract: Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap, i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
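The ranking accuracy metric in the abstract above measures how often a model assigns a higher likelihood to the preferred output than to the less preferred one. A minimal sketch follows, assuming each preference pair has already been scored with sequence log-likelihoods. The function name and input layout are illustrative, and counting ties as errors is an assumption made here.

```python
def ranking_accuracy(pairs):
    """Fraction of pairs where the preferred output has the higher log-likelihood.

    pairs: iterable of (logp_preferred, logp_dispreferred) tuples.
    Ties count as errors in this sketch, which is an assumption.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for preferred, dispreferred in pairs if preferred > dispreferred)
    return correct / len(pairs)


# Toy log-likelihoods for four preference pairs.
accuracy = ranking_accuracy([
    (-1.0, -2.0),   # ranked correctly
    (-3.0, -2.5),   # ranked incorrectly
    (-0.5, -0.7),   # ranked correctly
    (-4.0, -4.0),   # tie, counted as incorrect
])
```

On this toy input the function returns 0.5, meaning the model ranks half of the pairs correctly.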

[NLP-68] Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data

Link: https://arxiv.org/abs/2405.19519
Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker
Keywords: Retrieval augmented generation, relevant in-context text, providing relevant in-context, generative model outputs, Retrieval augmented
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource constrained settings to enable researchers in obtaining near real-time data from users.
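The abstract above does not spell out what the two layers do. One plausible reading, given the finite context window it mentions, is that a first layer summarizes small groups of retrieved posts and a second layer summarizes those summaries. The sketch below illustrates that reading only. The function `two_layer_answer`, the `summarize` callback, and the `chunk_size` parameter are assumptions, not the authors' design.

```python
def two_layer_answer(query, documents, summarize, chunk_size=2):
    """Summarize retrieved documents in two passes so each call stays small."""
    # Layer 1: query-focused summary of each small group of documents.
    partial_summaries = []
    for start in range(0, len(documents), chunk_size):
        group = documents[start:start + chunk_size]
        partial_summaries.append(summarize(query, group))
    # Layer 2: query-focused summary of the layer-1 summaries.
    return summarize(query, partial_summaries)


# Hypothetical stand-in for an LLM call; it records how many texts it received.
calls = []


def toy_summarize(query, texts):
    calls.append(len(texts))
    return f"summary of {len(texts)} texts about {query}"


answer = two_layer_answer(
    "drug X side effects",
    ["post 1", "post 2", "post 3", "post 4", "post 5"],
    toy_summarize,
    chunk_size=2,
)
```

With five posts and a chunk size of two, the summarizer is called three times in the first layer and once in the second.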

[NLP-69] A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Link: https://arxiv.org/abs/2405.19487
Authors: Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Yuanjun Xiong, Wei Xia
Keywords: full-duplex manner, present a generative, capable of operating, called neural FSM, neural FSM
Categories: Computation and Language (cs.CL)

Abstract: We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allowing the system to simultaneously speak and listen to the user. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next token prediction on a serialized view of the dialogue in real-time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency more than threefold compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running an LLM with only 8 billion parameters, our system exhibits an 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
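The two-state machine driven by control tokens, as described in the abstract above, can be illustrated with a small sketch. The state names, the control token strings `<speak>` and `<listen>`, and the rule that ordinary tokens are only voiced in the speaking state are all assumptions made for this example. The abstract does not give the actual token vocabulary.

```python
from enum import Enum


class State(Enum):
    LISTEN = "listen"
    SPEAK = "speak"


SPEAK_TOKEN = "<speak>"    # hypothetical control token: start or resume talking
LISTEN_TOKEN = "<listen>"  # hypothetical control token: stop and wait for the user


def run_fsm(tokens, state=State.LISTEN):
    """Consume a token stream, switching state on control tokens."""
    spoken = []
    for tok in tokens:
        if tok == SPEAK_TOKEN:
            state = State.SPEAK
        elif tok == LISTEN_TOKEN:
            state = State.LISTEN
        elif state is State.SPEAK:
            spoken.append(tok)  # ordinary tokens are voiced only while speaking
        # Ordinary tokens produced while listening are dropped in this sketch.
    return state, spoken


final_state, spoken = run_fsm(
    ["hm", "<speak>", "Hello", "there", "<listen>", "ignored"]
)
```

Here the model starts in the listening state, speaks two tokens after emitting `<speak>`, and returns to listening when it emits `<listen>`.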

[NLP-70] Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

Link: https://arxiv.org/abs/2405.19462
Authors: Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, Sara Hooker
Keywords: Neural Machine Translation, Neural Machine, Machine Translation models, Machine Translation, data
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract:Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
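The abstract above says CAT uses early training dynamics to pick the most relevant data points, without giving the scoring rule. The sketch below assumes one simple rule for illustration: score each example by how much its loss changes between two early checkpoints, then keep the highest-scoring fraction. The function name, the scoring rule, and the inputs are assumptions, not the paper's definition of CAT.

```python
def prune_by_early_dynamics(loss_ckpt_a, loss_ckpt_b, keep_fraction):
    """Return sorted indices of the examples to keep.

    loss_ckpt_a, loss_ckpt_b: per-example losses at two early checkpoints.
    Scoring by absolute loss change is an assumption for illustration.
    """
    if len(loss_ckpt_a) != len(loss_ckpt_b):
        raise ValueError("both checkpoints must score the same examples")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    n = len(loss_ckpt_a)
    scores = [abs(a - b) for a, b in zip(loss_ckpt_a, loss_ckpt_b)]
    k = max(1, int(keep_fraction * n))
    # Keep the k examples whose loss moved the most early in training.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


kept = prune_by_early_dynamics(
    loss_ckpt_a=[4.0, 3.0, 5.0, 2.0],
    loss_ckpt_b=[3.5, 1.0, 4.75, 2.0],
    keep_fraction=0.5,
)
```

With `keep_fraction=0.5`, half of the training data is pruned, which corresponds to the upper end of the pruning range reported in the abstract.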

[NLP-71] Beyond Agreement: Diagnosing the Rationale Alignment of Automated Essay Scoring Methods based on Linguistically-informed Counterfactuals

Link: https://arxiv.org/abs/2405.19433
Authors: Yupei Wang, Renfen Hu, Zhe Zhao
Keywords: show high agreement, current automated essay, methods show high, human raters, fully explored
Categories: Computation and Language (cs.CL)

Abstract:While current automated essay scoring (AES) methods show high agreement with human raters, their scoring mechanisms are not fully explored. Our proposed method, using counterfactual intervention assisted by Large Language Models (LLMs), reveals that when scoring essays, BERT-like models primarily focus on sentence-level features, while LLMs are attuned to conventions, language complexity, as well as organization, indicating a more comprehensive alignment with scoring rubrics. Moreover, LLMs can discern counterfactual interventions during feedback. Our approach improves understanding of neural AES methods and can also apply to other domains seeking transparency in model-driven decisions. The codes and data will be released at GitHub.

[NLP-72] Deep Learning for Assessment of Oral Reading Fluency

Link: https://arxiv.org/abs/2405.19426
Authors: Mithilesh Vaidya, Binaya Kumar Sahoo, Preeti Rao
Keywords: early education interventions, monitor early education, Reading fluency assessment, Reading fluency, literacy programmes
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract: Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children’s audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due to its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.

[NLP-73] Adaptive In-conversation Team Building for Language Model Agents

Link: https://arxiv.org/abs/2405.19425
Authors: Linxin Song, Jiale Liu, Jieyu Zhang, Shaokun Zhang, Ao Luo, Shijian Wang, Qingyun Wu, Chi Wang
Keywords: Leveraging multiple large, multiple large language, large language model, tackling complex tasks, Leveraging multiple
Categories: Computation and Language (cs.CL)

Abstract: Leveraging multiple large language model (LLM) agents has been shown to be a promising approach for tackling complex tasks, while the effective design of multiple agents for a particular application remains an art. It is thus intriguing to answer a critical question: Given a task, how can we build a team of LLM agents to solve it effectively? Our new adaptive team-building paradigm offers a flexible solution, realized through a novel agent design named Captain Agent. It dynamically forms and manages teams for each step of a task-solving process, utilizing nested group conversations and reflection to ensure diverse expertise and prevent stereotypical outputs. It allows for a flexible yet structured approach to problem-solving and can help reduce redundancy and enhance output diversity. A comprehensive evaluation across six real-world scenarios demonstrates that Captain Agent significantly outperforms existing multi-agent methods with 21.94% improvement in average accuracy, providing outstanding performance without requiring task-specific prompt engineering.
摘要:利用多个大型语言模型(LLM)代理已被证明是处理复杂任务的一种有前途的方法,而针对特定应用的多个代理的有效设计仍然是一门艺术。因此,回答一个关键问题是耐人寻味的:给定一个任务,我们如何建立一支LLM代理团队来有效地解决它?我们新的自适应团队建设范例提供了一个灵活的解决方案,通过一个名为队长代理的新代理设计实现。它为任务解决过程的每一步动态组建和管理团队,利用嵌套的小组对话和反思来确保多样化的专业知识和防止陈规陋习的产出。它允许采取灵活而有条理的办法来解决问题,并有助于减少冗余和增强产出多样性。对六个真实场景的综合评估表明,队长代理的性能显著优于现有的多代理方法,平均准确率提高21.94%,在不需要特定任务的提示工程的情况下提供了出色的性能。

[NLP-74] Luganda Speech Intent Recognition for IoT Applications

Link: https://arxiv.org/abs/2405.19343
Authors: Andrew Katumba, Sudi Murindanyi, John Trevor Kasule, Elvis Mugume
Keywords: Internet of Things, generated massive interest, advent of Internet, voice-controlled smart homes, Luganda voice commands
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Presented as a conference paper at ICLR 2024/AfricaNLP

Abstract:The advent of Internet of Things (IoT) technology has generated massive interest in voice-controlled smart homes. While many voice-controlled smart home systems are designed to understand and support widely spoken languages like English, speakers of low-resource languages like Luganda may need more support. This research project aimed to develop a Luganda speech intent classification system for IoT applications to integrate local languages into smart home environments. The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers. The Raspberry Pi processes Luganda voice commands, the Wio Terminal is a display device, and the ESP32 nodes control the IoT devices. The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi. The NLP model utilized Mel Frequency Cepstral Coefficients (MFCCs) as acoustic features and a Convolutional Neural Network (Conv2D) architecture for speech intent classification. A dataset of Luganda voice commands was curated for this purpose and this has been made open-source. This work addresses the localization challenges and linguistic diversity in IoT applications by incorporating Luganda voice commands, enabling users to interact with smart home devices without English proficiency, especially in regions where local languages are predominant.
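The abstract names Mel Frequency Cepstral Coefficients as the acoustic features. As a small illustration of the mel scale behind them, here is a minimal sketch of where the triangular mel filters would be centred. It assumes the common 2595·log10(1 + f/700) conversion; the function names and the 10-filter, 0 to 8000 Hz setup are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using the widely used 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 10 filters covering 0 to 8000 Hz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
print([round(c) for c in centers])
```

The centres are evenly spaced in mel but grow further apart in Hz, which is why MFCCs resolve low frequencies more finely than high ones.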

[NLP-75] Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants

Link: https://arxiv.org/abs/2405.19342
Authors: Chloé Sekkat, Fanny Leroy, Salima Mdhaffar, Blake Perry Smith, Yannick Estève, Joseph Dureau, Alice Coucke
Keywords: Recent works demonstrate, Recent works, Sonos Voice Control, North American English, Voice Control Bias
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
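The abstract does not say which statistical tests the authors use for their univariate analysis. As a generic stand-in, the sketch below runs a two-group permutation test on per-utterance scores, which is one common way to check whether a performance gap between two demographic groups is larger than chance. The function name, the toy scores and the choice of a permutation test are all assumptions for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Toy data: 1.0 = request understood correctly, 0.0 = not (hypothetical scores).
group_young = [1.0] * 20
group_old = [0.0] * 20
print(permutation_test(group_young, group_old))
```

A multivariate analysis such as the one the abstract mentions would need a model that accounts for several demographic factors at once; this sketch covers only the single-factor case.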

[NLP-76] DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

Link: https://arxiv.org/abs/2405.15306
Authors: Jonas Belouadi, Simone Paolo Ponzetto, Steffen Eger
Keywords: Creating high-quality scientific, Creating high-quality, high-quality scientific figures, scientific figures, time-consuming and challenging
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex. To tackle this problem, we introduce DeTikZify, a novel multimodal language model that automatically synthesizes scientific figures as semantics-preserving TikZ graphics programs based on sketches and existing figures. To achieve this, we create three new datasets: DaTikZv2, the largest TikZ dataset to date, containing over 360k human-created TikZ graphics; SketchFig, a dataset that pairs hand-drawn sketches with their corresponding scientific figures; and SciCap++, a collection of diverse scientific figures and associated metadata. We train DeTikZify on SciCap++ and DaTikZv2, along with synthetically generated sketches learned from SketchFig. We also introduce an MCTS-based inference algorithm that enables DeTikZify to iteratively refine its outputs without the need for additional training. Through both automatic and human evaluation, we demonstrate that DeTikZify outperforms commercial Claude 3 and GPT-4V in synthesizing TikZ programs, with the MCTS algorithm effectively boosting its performance. We make our code, models, and datasets publicly available.

Computer Vision

[CV-0] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

Link: https://arxiv.org/abs/2405.20343
Authors: Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma
Keywords: efficiently generating high-quality, Score Distillation Sampling, generating high-quality, meshes from single-view, strong generalizability
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time with inconsistent issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.

[CV-1] MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Link: https://arxiv.org/abs/2405.20340
Authors: Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang, Benyou Wang, Ruimao Zhang, Lei Zhang
Keywords: Large Language Models, Language Models, Large Language, capabilities of Large, human behavior understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MotionLLM version 1.0, project page see https://lhchen.top/MotionLLM

Abstract:This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with careful manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in captioning, spatial-temporal comprehension, and reasoning ability.

[CV-2] Visual Perception by Large Language Model's Weights

Link: https://arxiv.org/abs/2405.20339
Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
Keywords: Multimodal Large Language, Existing Multimodal Large, Large Language Models, Large Language, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.
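To make the idea of merging low-rank "perceptual weights" into a model's weights concrete, here is a minimal sketch of a LoRA-style update, W' = W + U·D, in plain Python. It is only an illustration of the general low-rank form the abstract refers to: the names `merge_low_rank`, `up` and `down`, the toy matrices, and the way the factors are produced are assumptions, not the VLoRA implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def merge_low_rank(weight, up, down):
    """Return weight + up @ down without modifying the inputs.

    weight is d_out x d_in, up is d_out x r, down is r x d_in
    (hypothetical names for the two low-rank factors).
    """
    delta = matmul(up, down)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def low_rank_param_count(d_out, d_in, rank):
    """Number of values needed to describe a rank-`rank` update."""
    return rank * (d_out + d_in)


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen weight matrix
up = [[1.0], [2.0]]                # d_out x r, with r = 1
down = [[3.0, 4.0]]                # r x d_in
merged = merge_low_rank(weight, up, down)
print(merged)
```

The appeal of the low-rank form is size: for a 4096 x 4096 layer, a rank-64 update needs 64 · (4096 + 4096) values instead of 4096 · 4096, and once merged it adds no tokens to the input sequence.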

[CV-3] OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

Link: https://arxiv.org/abs/2405.20337
Authors: Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu
Keywords: important for effective, effective autonomous driving, Understanding the evolution, Understanding, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: this https URL.

[CV-4] RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

Link: https://arxiv.org/abs/2405.20336
Authors: Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan
Keywords: holistic body meshes, holistic body, holistic body motions, existing works, simultaneously generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project website: this https URL

Abstract:In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at this https URL.

[CV-5] VividDream: Generating 3D Scene with Ambient Dynamics

Link: https://arxiv.org/abs/2405.20334
Authors: Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y. Feng, Jia-Bin Huang
Keywords: single input image, generating explorable, method for generating, input image, single input
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:We introduce VividDream, a method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. VividDream first expands an input image into a static 3D point cloud through iterative inpainting and geometry merging. An ensemble of animated videos is then generated using video diffusion models with quality refinement techniques and conditioned on renderings of the static 3D scene from the sampled camera trajectories. We then optimize a canonical 4D scene representation using an animated video ensemble, with per-video motion embeddings and visibility masks to mitigate inconsistencies. The resulting 4D scene enables free-view exploration of a 3D scene with plausible ambient scene dynamics. Experiments demonstrate that VividDream can provide human viewers with compelling 4D experiences generated based on diverse real images and text prompts.

[CV-6] SurgiTrack: Fine-Grained Multi-Class Multi-Tool Tracking in Surgical Videos

Link: https://arxiv.org/abs/2405.20333
Authors: Chinedu Innocent Nwoye, Nicolas Padoy
Keywords: tool, computer-assisted intervention, success of computer-assisted, Accurate tool tracking, tracking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures, 9 tables, 1 video. Supplementary video available at: this https URL

Abstract:Accurate tool tracking is essential for the success of computer-assisted intervention. Previous efforts often modeled tool trajectories rigidly, overlooking the dynamic nature of surgical procedures, especially tracking scenarios like out-of-body and out-of-camera views. Addressing this limitation, the new CholecTrack20 dataset provides detailed labels that account for multiple tool trajectories in three perspectives: (1) intraoperative, (2) intracorporeal, and (3) visibility, representing the different types of temporal duration of tool tracks. These fine-grained labels enhance tracking flexibility but also increase the task complexity. Re-identifying tools after occlusion or re-insertion into the body remains challenging due to high visual similarity, especially among tools of the same category. This work recognizes the critical role of the tool operators in distinguishing tool track instances, especially those belonging to the same tool category. The operators’ information is, however, not explicitly captured in surgical videos. We therefore propose SurgiTrack, a novel deep learning method that leverages YOLOv7 for precise tool detection and employs an attention mechanism to model the originating direction of the tools, as a proxy to their operators, for tool re-identification. To handle diverse tool trajectory perspectives, SurgiTrack employs a harmonizing bipartite matching graph, minimizing conflicts and ensuring accurate tool identity association. Experimental results on CholecTrack20 demonstrate SurgiTrack’s effectiveness, outperforming baselines and state-of-the-art methods with real-time inference capability. This work sets a new standard in surgical tool tracking, providing dynamic trajectories for more adaptable and precise assistance in minimally invasive surgeries.

[CV-7] 4DHands: Reconstructing Interactive Hands in 4D with Transformers

Link: https://arxiv.org/abs/2405.20330
Authors: Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Yebin Liu, Wei Jing, Qi Yan, Qianying Wang, Hongwen Zhang
Keywords: hand, monocular inputs, recovering interactive hand, hand image inputs, Spatio-temporal Interaction Reasoning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: More demo videos can be seen at our project page: this https URL

Abstract:In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transformer-based architecture with novel tokenization and feature fusion strategies. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performances of our approach for interactive hand reconstruction. More video results can be found on the project page: this https URL.

[CV-8] GECO: Generative Image-to-3D within a SECOnd

Link: https://arxiv.org/abs/2405.20327
Authors: Chen Wang, Jiatao Gu, Xiaoxiao Long, Yuan Liu, Lingjie Liu
Keywords: recent years, remarkable progress, progress in recent, efficiency
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in current methods through a two-stage approach. In the initial stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency from the multi-view prediction. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D generation with an unprecedented level of efficiency.

[CV-9] MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

Link: https://arxiv.org/abs/2405.20325
Authors: Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
Keywords: altering video attributes, modifying motion information, diffusion-based video editing, impressive advancements, advancements in diffusion-based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 18 figures. Project page at this https URL

Abstract:Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist’s appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhance the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists’ appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions.

[CV-10] Don't drop your samples! Coherence-aware training benefits Conditional diffusion

Link: https://arxiv.org/abs/2405.20324
Authors: Nicolas Dufour, Victor Besnier, Vicky Kalogeiton, David Picard
Keywords: conditional information, powerful generative models, segmentation masks, Conditional, class labels
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2024 as a Highlight. Project page: this https URL

Abstract:Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.

[CV-11] S³Gaussian: Self-Supervised Street Gaussians for Autonomous Driving

Link: https://arxiv.org/abs/2405.20323
Authors: Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang
Keywords: developing real-world simulators, Neural Radiance Fields, critical technique, technique for developing, developing real-world
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian (S³Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our S³Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: this https URL.

[CV-12] Vision-based Manipulation from Single Human Video with Open-World Object Graphs

Link: https://arxiv.org/abs/2405.20321
Authors: Yifeng Zhu, Arisrei Lim, Peter Stone, Yuke Zhu
Keywords: vision-based manipulation skills, single human video, learn vision-based manipulation, approach to empower, single human
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world. Videos can be found on the project website this https URL.

[CV-13] Improving the Training of Rectified Flows

Link: https://arxiv.org/abs/2405.20320
Authors: Sangyun Lee, Zinan Lin, Giulia Fanti
Keywords: shown great promise, expensive numerical integration, Diffusion models, models requires expensive, requires expensive numerical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 72% in the 1 NFE setting on CIFAR-10. On ImageNet 64×64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at this https URL.
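For readers new to rectified flows, the sketch below shows the two ingredients the abstract builds on: the straight-line interpolation between a noise sample and a data sample, with its constant velocity target, and a timestep sampler that favours the two ends of [0, 1]. The interpolation is the standard rectified-flow construction. The U-shaped density used here, proportional to cosh(sharpness · (t − 0.5)), and the `sharpness` parameter are illustrative assumptions and may differ from the distribution used in the paper.

```python
import math
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field: constant along a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def sample_u_shaped_t(rng, sharpness=4.0):
    """Rejection-sample t in [0, 1] with density proportional to
    cosh(sharpness * (t - 0.5)), an assumed U-shaped choice."""
    peak = math.cosh(sharpness / 2.0)  # density maximum, reached at t = 0 and t = 1
    while True:
        t = rng.random()
        if rng.random() * peak <= math.cosh(sharpness * (t - 0.5)):
            return t


rng = random.Random(0)
x0, x1 = [0.0, 0.0], [2.0, 4.0]  # toy 2-D noise and data points
t = sample_u_shaped_t(rng)
print(t, interpolate(x0, x1, t), velocity_target(x0, x1))
```

If the learned velocity matched this target exactly, a single Euler step from x0 with step size 1 would land on x1, which is the intuition behind aiming for the 1 NFE setting.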

[CV-14] ParSEL: Parameterized Shape Editing with Language

Link: https://arxiv.org/abs/2405.20319
Authors: Aditya Ganeshan, Ryan Y. Huang, Xianghao Xu, R. Kenny Jones, Daniel Ritchie
Keywords: natural language, natural language presents, content creation, presents a compelling, compelling paradigm
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC)

Abstract:The ability to edit 3D assets from natural language presents a compelling paradigm to aid in the democratization of 3D content creation. However, while natural language is often effective at communicating general intent, it is poorly suited for specifying precise manipulation. To address this gap, we introduce ParSEL, a system that enables controllable editing of high-quality 3D assets from natural language. Given a segmented 3D mesh and an editing request, ParSEL produces a parameterized editing program. Adjusting the program parameters allows users to explore shape variations with a precise control over the magnitudes of edits. To infer editing programs which align with an input edit request, we leverage the abilities of large-language models (LLMs). However, while we find that LLMs excel at identifying initial edit operations, they often fail to infer complete editing programs, and produce outputs that violate shape semantics. To overcome this issue, we introduce Analytical Edit Propagation (AEP), an algorithm which extends a seed edit with additional operations until a complete editing program has been formed. Unlike prior methods, AEP searches for analytical editing operations compatible with a range of possible user edits through the integration of computer algebra systems for geometric analysis. Experimentally we demonstrate ParSEL’s effectiveness in enabling controllable editing of 3D objects through natural language requests over alternative system designs.

[CV-15] A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction

Link: https://arxiv.org/abs/2405.20310
Authors: Jianghao Shen, Tianfu Wu
Keywords: Splatter Image method, Splatter Image, Hierarchical Splatter Image, Gaussians, long-standing fundamental problem
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint, under review

Abstract:Learning 3D scene representation from a single-view image is a long-standing fundamental problem in computer vision, with the inherent ambiguity in predicting contents unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis via learning a single 3D Gaussian for each pixel based on the U-Net feature map of an input image. However, it has limited expressive power to represent occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussians. Specifically, each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) which takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise way. The joint condition of input image features from eyes of the parent Gaussians and the target camera position facilitates learning to allocate child Gaussians to “see the unseen”, recovering the occluded details that are often missed by parent Gaussians. In experiments, the proposed method is tested on the ShapeNet-SRN and CO3D datasets with state-of-the-art performance obtained, especially showing promising capabilities of reconstructing occluded contents in the input view.

[CV-16] Can’t make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

Link: https://arxiv.org/abs/2405.20305
Authors: Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee
Keywords: plausible action sequence, action sequences, action, action sequence learning, large video-language model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024

Abstract:We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions, prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation, we explore the generative capability of a large video-language model in our work and further, develop the understanding of plausibility in an action sequence by introducing two objective functions, a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.

[CV-17] Scaling White-Box Transformers for Vision

Link: https://arxiv.org/abs/2405.20299
Authors: Jinrui Yang, Xianhang Li, Druv Pai, Yuyin Zhou, Yi Ma, Yaodong Yu, Cihang Xie
Keywords: standard vision transformers, CRATE, white-box transformer architecture, inherent mathematical interpretability, vision transformers
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL

Abstract:CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question which this paper aims to address. Specifically, we propose CRATE-α, featuring strategic yet minimal modifications to the sparse coding block in the CRATE architecture design, and a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-α can effectively scale with larger model sizes and datasets. For example, our CRATE-α-B substantially outperforms the prior best CRATE-B model accuracy on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-α-L obtains an ImageNet classification accuracy of 85.1%. More notably, these model performance improvements are achieved while preserving, and potentially even enhancing the interpretability of learned CRATE models, as we demonstrate through showing that the learned token representations of increasingly larger trained CRATE-α models yield increasingly higher-quality unsupervised object segmentation of images. The project page is this https URL.

[CV-18] Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness

Link: https://arxiv.org/abs/2405.20291
Authors: Weilin Lin, Li Liu, Shaokui Wei, Jianze Li, Hui Xiong
Keywords: deep neural networks, neural networks, security threat, central concern, concern for deep
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The security threat of backdoor attacks is a central concern for deep neural networks (DNNs). Recently, unlearning a model on clean data and then learning a pruning mask has been shown to aid backdoor defense without access to poisoned data. Additionally, vanilla fine-tuning with those clean data can help recover the lost clean accuracy. However, the behavior of clean unlearning is still under-explored, and vanilla fine-tuning unintentionally reintroduces the backdoor effect. In this work, we first investigate model unlearning from the perspective of weight changes and gradient norms, and find two interesting observations in the backdoored model: 1) the weight changes between poison and clean unlearning are positively correlated, making it possible for us to identify the backdoor-related neurons without using poisoned data; 2) the neurons of the backdoored model are more active (i.e., larger changes in gradient norm) than those in the clean model, suggesting the need to suppress the gradient norm during fine-tuning. Then, we propose an effective two-stage defense method. In the first stage, an efficient Neuron Weight Change (NWC)-based Backdoor Reinitialization is proposed based on observation 1). In the second stage, based on observation 2), we design an Activeness-Aware Fine-Tuning to replace the vanilla fine-tuning. Extensive experiments, involving eight backdoor attacks on three benchmark datasets, demonstrate the superior performance of our proposed method compared to recent state-of-the-art backdoor defense approaches.

[CV-19] TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes

Link: https://arxiv.org/abs/2405.20283
Authors: Minghao Guo, Bohan Wang, Kaiming He, Wojciech Matusik
Keywords: present TetSphere splatting, Lagrangian representation, TetSphere splatting, mesh quality, Lagrangian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We present TetSphere splatting, an explicit, Lagrangian representation for reconstructing 3D shapes with high-quality geometry. In contrast to conventional object reconstruction methods which predominantly use Eulerian representations, including both neural implicit (e.g., NeRF, NeuS) and explicit representations (e.g., DMTet), and often struggle with high computational demands and suboptimal mesh quality, TetSphere splatting utilizes an underused but highly effective geometric primitive – tetrahedral meshes. This approach directly yields superior mesh quality without relying on neural networks or post-processing. It deforms multiple initial tetrahedral spheres to accurately reconstruct the 3D shape through a combination of differentiable rendering and geometric energy optimization, resulting in significant computational efficiency. Serving as a robust and versatile geometry representation, TetSphere splatting seamlessly integrates into diverse applications, including single-view 3D reconstruction and image-/text-to-3D content generation. Experimental results demonstrate that TetSphere splatting outperforms existing representations, delivering faster optimization speed, enhanced mesh quality, and reliable preservation of thin structures.

[CV-20] SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

Link: https://arxiv.org/abs/2405.20282
Authors: Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, Ming-Hsuan Yang
Keywords: perception and generation, Semantic, visual perception, semantic image synthesis, Semantic segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Semantic segmentation and semantic image synthesis are two representative tasks in visual perception and generation. While existing methods consider them as two distinct tasks, we propose a unified diffusion-based framework (SemFlow) and model them as a pair of reverse problems. Specifically, motivated by rectified flow theory, we train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks. As the training object is symmetric, samples belonging to the two distributions, images and semantic masks, can be effortlessly transferred reversibly. For semantic segmentation, our approach solves the contradiction between the randomness of diffusion outputs and the uniqueness of segmentation results. For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories. Experiments show that our SemFlow achieves competitive results on semantic segmentation and semantic image synthesis tasks. We hope this simple framework will motivate people to rethink the unification of low-level and high-level vision. Project page: this https URL.

[CV-21] CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Link: https://arxiv.org/abs/2405.20279
Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan
Keywords: Variational Autoencoders, OpenAI SORA, SORA and numerous, VAE, video models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Project Page: this https URL

Abstract:Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI’s SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. The temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering the compatibility with existing T2I models will result in a latent space gap between them, which takes huge computational resources to bridge, even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE of latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.

[CV-22] ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections

Link: https://arxiv.org/abs/2405.20271
Authors: Massimo Bini, Karsten Roth, Zeynep Akata, Anna Khoreva
Keywords: adapt foundation models, downstream task requirements, generalization ability, ubiquitous to adapt, adapt foundation
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2024. Code available at this https URL

Abstract:Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (roughly 10-100 times fewer than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at this https URL.
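
For readers unfamiliar with hyperplane reflections, the sketch below builds a Householder matrix H = I - 2uuᵀ/(uᵀu), which reflects vectors across the hyperplane orthogonal to u. It is only an illustration of the underlying linear algebra: applying H to a frozen weight matrix by left multiplication is an assumption made here, and `householder`, `matmul`, `u`, and `W` are hypothetical names and toy values, not the released implementation.

```python
def householder(u):
    """Return H = I - 2 * u u^T / (u^T u) as a list of rows."""
    norm_sq = sum(x * x for x in u)
    d = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq
             for j in range(d)] for i in range(d)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Only d numbers (the entries of u) parameterize the d x d transformation.
u = [1.0, 2.0, 2.0]
H = householder(u)

# Hypothetical frozen 3x2 weight matrix, adapted by left multiplication (assumed).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_adapted = matmul(H, W)

# A reflection is its own inverse, so applying it twice gives the identity.
HH = matmul(H, H)
```

Because H is orthogonal and determined by a single vector, the parameter count grows linearly with the layer width in this toy setup.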

[CV-23] FaceMixup: Enhancing Facial Expression Recognition through Mixed Face Regularization

Link: https://arxiv.org/abs/2405.20259
Authors: Fabio A. Faria, Mateus M. Souza, Raoni F. da S. Teixeira, Mauricio P. Segundo
Keywords: pose significant challenges, large annotated datasets, annotated datasets pose, datasets pose significant, scarcity of large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 9 figures, paper is under review on journal

Abstract:The proliferation of deep learning solutions and the scarcity of large annotated datasets pose significant challenges in real-world applications. Various strategies have been explored to overcome this challenge, with data augmentation (DA) approaches emerging as prominent solutions. DA approaches involve generating additional examples by transforming existing labeled data, thereby enriching the dataset and helping deep learning models achieve improved generalization without succumbing to overfitting. One real application where deep learning solutions are widely used is facial expression recognition (FER), which plays an essential role in human communication and benefits a range of knowledge areas (e.g., medicine, security, and marketing). In this paper, we propose a simple and comprehensive face data augmentation approach based on mixed face component regularization that outperforms the classical DA approaches from the literature, including MixAugment, an approach designed specifically for the target task, on two well-known FER datasets.

[CV-24] KerasCV and KerasNLP: Vision and Language Power-Ups

Link: https://arxiv.org/abs/2405.20247
Authors: Matthew Watson, Divyashree Shivakumar Sreepathihalli, Francois Chollet, Martin Gorner, Kiranbir Sodhia, Ramesh Sampath, Tirth Patel, Haifeng Jin, Neel Kovelamudi, Gabriel Rasskin, Samaneh Saadat, Luke Wood, Chen Qian, Jonathan Bischof, Ian Stenbit
Keywords: Natural Language Processing, Language Processing workflows, Keras domain packages, Computer Vision, Vision and Natural
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Submitted to Journal of Machine Learning Open Source Software

Abstract:We present the Keras domain packages KerasCV and KerasNLP, extensions of the Keras API for Computer Vision and Natural Language Processing workflows, capable of running on either JAX, TensorFlow, or PyTorch. These domain packages are designed to enable fast experimentation, with a focus on ease-of-use and performance. We adopt a modular, layered design: at the library’s lowest level of abstraction, we provide building blocks for creating models and data preprocessing pipelines, and at the library’s highest level of abstraction, we provide pretrained “task” models for popular architectures such as Stable Diffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have built-in preprocessing, pretrained weights, and can be fine-tuned on raw inputs. To enable efficient training, we support XLA compilation for all models, and run all preprocessing via a compiled graph of TensorFlow operations using the tf.data API. The libraries are fully open-source (Apache 2.0 license) and available on GitHub.

[CV-25] Feature Fusion for Improved Classification: Combining Dempster-Shafer Theory and Multiple CNN Architectures

Link: https://arxiv.org/abs/2405.20230
Authors: Ayyub Alzahem, Wadii Boulila, Maha Driss, Anis Koubaa
Keywords: Deep Learning, make reliable predictions, Addressing uncertainty, uncertainty in Deep, decisions in complex
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Addressing uncertainty in Deep Learning (DL) is essential, as it enables the development of models that can make reliable predictions and informed decisions in complex, real-world environments where data may be incomplete or ambiguous. This paper introduces a novel algorithm leveraging Dempster-Shafer Theory (DST) to integrate multiple pre-trained models to form an ensemble capable of providing more reliable and enhanced classifications. The main steps of the proposed method include feature extraction, mass function calculation, fusion, and expected utility calculation. Several experiments have been conducted on CIFAR-10 and CIFAR-100 datasets, demonstrating superior classification accuracy of the proposed DST-based method, achieving improvements of 5.4% and 8.4%, respectively, compared to the best individual pre-trained models. Results highlight the potential of DST as a robust framework for managing uncertainties related to data when applying DL in real-world scenarios.
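
The fusion step mentioned above can be illustrated with Dempster's rule of combination, the standard way to merge two mass functions in Dempster-Shafer Theory. The sketch is a generic textbook version, not the paper's pipeline: `dempster_combine`, the two-class frame, and the mass values are all hypothetical, and how the paper derives masses from CNN features is not shown.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass landing on the empty set measures disagreement between sources.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize so the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

CAT, DOG = "cat", "dog"
# Hypothetical evidence from two models; mass on {cat, dog} encodes ignorance.
m_model_a = {frozenset({CAT}): 0.6, frozenset({CAT, DOG}): 0.4}
m_model_b = {frozenset({DOG}): 0.5, frozenset({CAT, DOG}): 0.5}

fused = dempster_combine(m_model_a, m_model_b)
```

In this toy case 0.3 of the joint mass is conflicting, and the remaining mass is renormalized over {cat}, {dog}, and {cat, dog}.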

[CV-26] EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Link: https://arxiv.org/abs/2405.20224
Authors: Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, Yonghong Tian
Keywords: demonstrated exceptional capabilities, Assisted Gaussian Splatting, Gaussian Splatting, Stream Assisted Gaussian, demonstrated exceptional
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D scene reconstruction and novel view synthesis. However, its training heavily depends on high-quality, sharp images and accurate camera poses. Fulfilling these requirements can be challenging in non-ideal real-world scenarios, where motion-blurred images are commonly encountered in high-speed moving cameras or low-light environments that require long exposure times. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that integrates event streams captured by an event camera to assist in reconstructing high-quality 3D-GS from blurry images. Capitalizing on the high temporal resolution and dynamic range offered by the event camera, we leverage the event streams to explicitly model the formation process of motion-blurred images and guide the deblurring reconstruction of 3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion trajectories during the exposure time, our method can robustly facilitate the acquisition of high-fidelity novel views with intricate texture details. We comprehensively evaluated our method and compared it with previous state-of-the-art deblurring rendering methods. Both qualitative and quantitative comparisons demonstrate that our method surpasses existing techniques in restoring fine details from blurry images and producing high-fidelity novel views.

[CV-27] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Link: https://arxiv.org/abs/2405.20222
Authors: Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng
Keywords: additional controllable signals, image animation method, present MOFA-Video, human landmarks reference, advanced controllable image
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We present MOFA-Video, an advanced controllable image animation method that generates video from the given image using various additional controllable signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. This is different from previous methods, which can only work on a specific motion domain or show weak control abilities with diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For MOFA-Adapters, we consider the temporal motion consistency of the video and generate the dense motion flow from the given sparse control conditions first, and then, the multi-scale features of the given image are wrapped as a guided feature for stable video diffusion generation. We naively train two motion adapters for the manual trajectories and the human landmarks individually since they both contain sparse information about the control. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.

[CV-28] Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

Link: https://arxiv.org/abs/2405.20216
Authors: Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee
Keywords: challenging task, human image generation, image generation, significant yet challenging, Direct Preference Optimization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 18 figures

Abstract:The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment.
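
For context, the generic DPO objective for a single preference pair can be sketched as below. This is the standard formulation written over plain log-probabilities, not the modified loss the abstract proposes, and diffusion models typically need a surrogate for these likelihoods that is not shown here. `dpo_loss`, its argument names, the example values, and `beta=0.1` are all hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy favors the preferred sample than the reference does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss sits at log(2).
loss_neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Policy raises the preferred sample and lowers the rejected one: the loss drops.
loss_better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only depends on how the policy shifts relative to the frozen reference, which is why no separate reward model appears in the sketch.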

[CV-29] Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Link: https://arxiv.org/abs/2405.20204
Authors: Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
Keywords: Contrastive Language-Image Pretraining, Language-Image Pretraining, common embedding space, fixed-sized vectors, align images
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 4 pages, ICML2024 workshop submission

Abstract:Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
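
The contrastive alignment described above is commonly implemented as a symmetric InfoNCE loss over a batch of matched pairs. The sketch below shows that generic loss in plain Python; it does not reproduce the paper's multi-task training scheme, and `contrastive_loss`, the toy embeddings, and `temperature=0.07` are assumptions for illustration.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(a_embs, b_embs, temperature=0.07):
    """Symmetric InfoNCE: pair (i, i) is the positive, all other pairs are negatives."""
    a = [normalize(v) for v in a_embs]
    b = [normalize(v) for v in b_embs]
    n = len(a)
    # Cosine similarities scaled by the temperature.
    sims = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
             for j in range(n)] for i in range(n)]
    sims_t = [list(col) for col in zip(*sims)]
    loss_ab = -sum(log_softmax_row(sims[i])[i] for i in range(n)) / n
    loss_ba = -sum(log_softmax_row(sims_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)

# Toy 2-D embeddings: two texts and their images, correctly paired vs. swapped.
texts = [[1.0, 0.0], [0.0, 1.0]]
images_aligned = [[0.9, 0.1], [0.1, 0.9]]
images_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(texts, images_aligned)
loss_swapped = contrastive_loss(texts, images_swapped)
```

The same loss applies unchanged to text-text pairs, since it only sees two lists of embeddings.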

[CV-30] SPARE: Symmetrized Point-to-Plane Distance for Robust Non-Rigid Registration

Link: https://arxiv.org/abs/2405.20188
Authors: Yuxin Yao, Bailin Deng, Junhui Hou, Juyong Zhang
Keywords: error metric based, registration typically minimize, source surface, target surface, alignment error metric
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Existing optimization-based methods for non-rigid registration typically minimize an alignment error metric based on the point-to-point or point-to-plane distance between corresponding point pairs on the source surface and target surface. However, these metrics can result in slow convergence or a loss of detail. In this paper, we propose SPARE, a novel formulation that utilizes a symmetrized point-to-plane distance for robust non-rigid registration. The symmetrized point-to-plane distance relies on both the positions and normals of the corresponding points, resulting in a more accurate approximation of the underlying geometry and can achieve higher accuracy than existing methods. To solve this optimization problem efficiently, we propose an alternating minimization solver using a majorization-minimization strategy. Moreover, for effective initialization of the solver, we incorporate a deformation graph-based coarse alignment that improves registration quality and efficiency. Extensive experiments show that the proposed method greatly improves the accuracy of non-rigid registration problems and maintains relatively high solution efficiency. The code is publicly available at this https URL.
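
To make the difference between the metrics concrete, the sketch below compares the classic point-to-plane error with one common symmetrized form known from the rigid ICP literature, which projects the offset onto the sum of both normals. Whether SPARE uses exactly this expression is an assumption; the function names and the unit-circle example are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def point_to_plane(p, q, n_q):
    """Squared distance from p to the tangent plane at q (uses the target normal only)."""
    return dot(sub(p, q), n_q) ** 2

def symmetric_point_to_plane(p, q, n_p, n_q):
    """Squared offset projected onto the sum of source and target normals."""
    return dot(sub(p, q), add(n_p, n_q)) ** 2

# Two points on the unit circle in the z=0 plane; normals equal the positions.
p, n_p = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
q, n_q = [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]

err_plane = point_to_plane(p, q, n_q)
err_symmetric = symmetric_point_to_plane(p, q, n_p, n_q)
```

In this example both points lie on the same curved surface, so the symmetrized error vanishes while the point-to-plane error does not, which hints at why using both normals can approximate curved geometry more faithfully.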

[CV-31] Transformers and Slot Encoding for Sample Efficient Physical World Modelling

Link: https://arxiv.org/abs/2405.20180
Authors: Francesco Petri, Luigi Asprino, Aldo Gangemi
Keywords: World modelling, predict its evolution, rules that govern, essential ability, physical world
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to the problem of world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to work only at the image level thus disregarding that the environment is composed of objects interacting with each other. In this paper, we propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples. The code for our architecture and experiments is available at this https URL

[CV-32] Landslide mapping from Sentinel-2 imagery through change detection

Link: https://arxiv.org/abs/2405.20161
Authors: Tommaso Monopoli, Fabio Montello, Claudio Rossi
Keywords: destructive geohazards, critical and destructive, Digital Elevation Model, Landslides, frequency and destructive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: to be published in IEEE IGARSS 2024 conference proceedings

Abstract:Landslides are one of the most critical and destructive geohazards. Widespread development of human activities and settlements combined with the effects of climate change on weather are resulting in a high increase in the frequency and destructive power of landslides, making them a major threat to human life and the economy. In this paper, we explore methodologies to map newly-occurred landslides using Sentinel-2 imagery automatically. All approaches presented are framed as a bi-temporal change detection problem, requiring only a pair of Sentinel-2 images, taken respectively before and after a landslide-triggering event. Furthermore, we introduce a novel deep learning architecture for fusing Sentinel-2 bi-temporal image pairs with Digital Elevation Model (DEM) data, showcasing its promising performances w.r.t. other change detection models in the literature. As a parallel task, we address limitations in existing datasets by creating a novel geodatabase, which includes manually validated open-access landslide inventories over heterogeneous ecoregions of the world. We release both code and dataset with an open-source license.

[CV-33] MotionDreamer: Zero-Shot 3D Mesh Animation from Video Diffusion Models

Link: https://arxiv.org/abs/2405.20155
Authors: Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer
Keywords: techniques bring digital, bring digital, worlds and characters, characters to life, Animation techniques bring
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:Animation techniques bring digital 3D worlds and characters to life. However, manual animation is tedious and automated techniques are often specialized to narrow shape classes. In our work, we propose a technique for automatic re-animation of arbitrary 3D shapes based on a motion prior extracted from a video diffusion model. Unlike existing 4D generation methods, we focus solely on the motion, and we leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines. Furthermore, our use of diffusion features enhances the accuracy of our motion fitting. We analyze the efficacy of these features for animation fitting and experimentally validate our approach for two different diffusion models and four animation models. Finally, we demonstrate in a user study that our time-efficient zero-shot method achieves superior performance in re-animating a diverse set of 3D shapes compared to existing techniques. The project website is located at this https URL.

[CV-34] Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals

Link: https://arxiv.org/abs/2405.20152
Authors: Phillip Howard, Kathleen C. Fraser, Anahita Bhiwandiwalla, Svetlana Kiritchenko
Keywords: Large Language Models, Large Vision-Language Models, Large Language, possessing increasingly impressive, increasingly impressive capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the advent of Large Language Models (LLMs) possessing increasingly impressive capabilities, a number of Large Vision-Language Models (LVLMs) have been proposed to augment LLMs with visual inputs. Such models condition generated text on both an input image and a text prompt, enabling a variety of use cases such as visual question answering and multimodal chat. While prior studies have examined the social biases contained in text generated by LLMs, this topic has been relatively unexplored in LVLMs. Examining social biases in LVLMs is particularly challenging due to the confounding contributions of bias induced by information contained across the text and visual modalities. To address this challenging problem, we conduct a large-scale study of text generated by different LVLMs under counterfactual changes to input images. Specifically, we present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets, where each set contains images which are largely identical in their depiction of a common subject (e.g., a doctor), but vary only in terms of intersectional social attributes (e.g., race and gender). We comprehensively evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs. Our multi-dimensional analysis reveals that social attributes such as race, gender, and physical characteristics depicted in input images can significantly influence the generation of toxic content, competency-associated words, harmful stereotypes, and numerical ratings of depicted individuals. We additionally explore the relationship between social bias in LVLMs and their corresponding LLMs, as well as inference-time strategies to mitigate bias.

[CV-35] OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2405.20141
Authors: Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum
Keywords: Vision Language Models, Language Models, Vision Language, dynamic image-language interactions, transformed image understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classifications to dynamic image-language interactions, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy due to their reliance on ambiguous image captions and lack of domain-specific knowledge. We, therefore, introduce a new task, domain adaptation for open-vocabulary segmentation, which enhances VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy. This strategy is designed to enhance open-vocabulary generalization while adapting to the visual domain. Our results outperform other parameter-efficient adaptation strategies in open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, our approach is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be plug-and-play integrated into existing open-vocabulary segmentation pipelines, improving OV-Seg by +6.0% mIoU on ADE20K, and OpenMask3D by +4.1% AP on ScanNet++ Offices without any changes to the methods.

[CV-36] A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent Dementia

Link: https://arxiv.org/abs/2405.20136
Authors: Liyun Deng, Lei Jin, Guangcheng Wang, Quan Shi, Han Wang
Keywords: aggravating aging population, intelligent early warning, elderly vulnerable groups, population in China, early warning system
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures

Abstract:In response to the social issue of the increasing number of elderly vulnerable groups going missing due to the aggravating aging population in China, our team has developed a wearable anti-loss device and intelligent early warning system for elderly individuals with intermittent dementia using artificial intelligence and IoT technology. This system comprises an anti-loss smart helmet, a cloud computing module, and an intelligent early warning application on the caregiver’s mobile device. The smart helmet integrates a miniature camera module, a GPS module, and a 5G communication module to collect first-person images and location information of the elderly. Data is transmitted remotely via 5G, FTP, and TCP protocols. In the cloud computing module, our team has proposed for the first time a multimodal dangerous state recognition network based on scene and location information to accurately assess the risk of elderly individuals going missing. Finally, the application software interface designed for the caregiver’s mobile device implements multi-level early warnings. The system developed by our team requires no operation or response from the elderly, achieving fully automatic environmental perception, risk assessment, and proactive alarming. This overcomes the limitations of traditional monitoring devices, which require active operation and response, thus avoiding the issue of the digital divide for the elderly. It effectively prevents accidental loss and potential dangers for elderly individuals with dementia.

[CV-37] Federated and Transfer Learning for Cancer Detection Based on Image Analysis

Link: https://arxiv.org/abs/2405.20126
Authors: Amine Bechar, Youssef Elmir, Yassine Himeur, Rafik Medjoudj, Abbes Amira
Keywords: review article discusses, cancer detection based, cancer detection, machine learning, image analysis
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This review article discusses the roles of federated learning (FL) and transfer learning (TL) in cancer detection based on image analysis. These two strategies powered by machine learning have drawn a lot of attention due to their potential to increase the precision and effectiveness of cancer diagnosis in light of the growing importance of machine learning techniques in cancer detection. FL enables the training of machine learning models on data distributed across multiple sites without the need for centralized data sharing, while TL allows for the transfer of knowledge from one task to another. A comprehensive assessment of the two methods, including their strengths and weaknesses, is presented. Next, their applications in cancer detection are discussed, including potential directions for the future. Finally, this article offers a thorough description of the functions of TL and FL in image-based cancer detection. The authors also make insightful suggestions for additional study in this rapidly developing area.

[CV-38] Infinite 3D Landmarks: Improving Continuous 2D Facial Landmark Detection

Link: https://arxiv.org/abs/2405.20117
Authors: Prashanth Chandran, Gaspard Zoss, Paulo Gotardo, Derek Bradley
Keywords: specific architectural modifications, facial landmark detectors, landmark, landmark detector, optimal face normalization
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 12 pages, 13 figures

Abstract:In this paper, we examine three important issues in the practical use of state-of-the-art facial landmark detectors and show how a combination of specific architectural modifications can directly improve their accuracy and temporal stability. First, many facial landmark detectors require face normalization as a preprocessing step, which is accomplished by a separately-trained neural network that crops and resizes the face in the input image. There is no guarantee that this pre-trained network performs the optimal face normalization for landmark detection. We instead analyze the use of a spatial transformer network that is trained alongside the landmark detector in an unsupervised manner, and jointly learn optimal face normalization and landmark detection. Second, we show that modifying the output head of the landmark predictor to infer landmarks in a canonical 3D space can further improve accuracy. To convert the predicted 3D landmarks into screen-space, we additionally predict the camera intrinsics and head pose from the input image. As a side benefit, this allows predicting the 3D face shape from a given image using only 2D landmarks as supervision, which is useful in determining landmark visibility among other things. Finally, when training a landmark detector on multiple datasets at the same time, annotation inconsistencies across datasets force the network to produce a suboptimal average. We propose to add a semantic correction network to address this issue. This additional lightweight neural network is trained alongside the landmark detector, without requiring any additional supervision. While the insights of this paper can be applied to most common landmark detectors, we specifically target a recently-proposed continuous 2D landmark detector to demonstrate how each of our additions leads to meaningful improvements over the state-of-the-art on standard benchmarks.

[CV-39] RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection

Link: https://arxiv.org/abs/2405.20112
Authors: Zhiyuan He, Pin-Yu Chen, Tsung-Yi Ho
Keywords: highly realistic images, arbitrary content, raising concerns, misuse and harm, rapid advances
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The rapid advances in generative AI models have empowered the creation of highly realistic images with arbitrary content, raising concerns about potential misuse and harm, such as Deepfakes. Current research focuses on training detectors using large datasets of generated images. However, these training-based solutions are often computationally expensive and show limited generalization to unseen generated images. In this paper, we propose a training-free method to distinguish between real and AI-generated images. We first observe that real images are more robust to tiny noise perturbations than AI-generated images in the representation space of vision foundation models. Based on this observation, we propose RIGID, a training-free and model-agnostic method for robust AI-generated image detection. RIGID is a simple yet effective approach that identifies whether an image is AI-generated by comparing the representation similarity between the original and the noise-perturbed counterpart. Our evaluation on a diverse set of AI-generated images and benchmarks shows that RIGID significantly outperforms existing training-based and training-free detectors. In particular, the average performance of RIGID exceeds the current best training-free method by more than 25%. Importantly, RIGID exhibits strong generalization across different image generation methods and robustness to image corruptions.
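
The comparison described in this abstract can be sketched roughly as follows. This is not the authors' implementation: `embed` stands in for a vision foundation model, and `toy_embed`, `noise_std`, `trials`, and `threshold` are hypothetical names and values chosen only for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def perturbation_similarity(image, embed, noise_std=0.05, trials=8, seed=0):
    """Average similarity between the embedding of an image and the
    embeddings of Gaussian-noise-perturbed copies of that image."""
    rng = random.Random(seed)
    clean = embed(image)
    sims = []
    for _ in range(trials):
        noisy = [p + rng.gauss(0.0, noise_std) for p in image]
        sims.append(cosine_similarity(clean, embed(noisy)))
    return sum(sims) / len(sims)


def looks_generated(image, embed, threshold, **kwargs):
    """Flag an image when its representation is not robust to noise,
    i.e. when the average similarity falls below the threshold."""
    return perturbation_similarity(image, embed, **kwargs) < threshold


def toy_embed(pixels):
    """Hypothetical stand-in for a real feature extractor."""
    return [sum(pixels), sum(p * p for p in pixels), max(pixels)]
```

In practice the threshold would have to be calibrated on images known to be real; the abstract does not say how the authors choose it.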

[CV-40] FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation Models

Link: https://arxiv.org/abs/2405.20109
Authors: Edoardo Arnaudo, Jacopo Lungo Vaschetti, Lorenzo Innocenti, Luca Barco, Davide Lisi, Vanina Fissore, Claudio Rossi
Keywords: Very-High Resolution, effective machine learning, machine learning applications, increasingly accessible, remote sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IGARSS 2024, 5 pages

Abstract:Very-High Resolution (VHR) remote sensing imagery is increasingly accessible, but often lacks annotations for effective machine learning applications. Recent foundation models like GroundingDINO and Segment Anything (SAM) provide opportunities to automatically generate annotations. This study introduces FMARS (Foundation Model Annotations in Remote Sensing), a methodology leveraging VHR imagery and foundation models for fast and robust annotation. We focus on disaster management and provide a large-scale dataset with labels obtained from pre-event imagery over 19 disaster events, derived from the Maxar Open Data initiative. We train segmentation models on the generated labels, using Unsupervised Domain Adaptation (UDA) techniques to increase transferability to real-world scenarios. Our results demonstrate the effectiveness of leveraging foundation models to automatically annotate remote sensing data at scale, enabling robust downstream models for critical applications. Code and dataset are available at this https URL.

[CV-41] Rapid Wildfire Hotspot Detection Using Self-Supervised Learning on Temporal Remote Sensing Data

Link: https://arxiv.org/abs/2405.20093
Authors: Luca Barco, Angelica Urbanelli, Claudio Rossi
Keywords: Rapid detection, detection and well-timed, well-timed intervention, intervention are essential, essential to mitigate
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Rapid detection and well-timed intervention are essential to mitigate the impacts of wildfires. Leveraging remotely sensed data from satellite networks and advanced AI models to automatically detect hotspots (i.e., thermal anomalies caused by active fires) is an effective way to build wildfire monitoring systems. In this work, we propose a novel dataset containing time series of remotely sensed data related to European fire events and a Self-Supervised Learning (SSL)-based model able to analyse multi-temporal data and identify hotspots in potentially near real time. We train and evaluate the performance of our model using our dataset and Thraws, a dataset of thermal anomalies including several fire events, obtaining an F1 score of 63.58.

[CV-42] Visual Attention Analysis in Online Learning

Link: https://arxiv.org/abs/2405.20091
Authors: Navarro Miriam, Becerra Álvaro, Daza Roberto, Cobos Ruth, Morales Aythami, Fierrez Julian
Keywords: Multimodal Learning Analytics, Learning Analytics field, Analytics field, Multimodal Learning, Learning Analytics
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted in CEDI 2024 (VII Congreso Español de Informática), A Coruña, Spain

Abstract:In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.

[CV-43] Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models

Link: https://arxiv.org/abs/2405.20090
Authors: Hao Cheng, Erjia Xiao, Jiahang Cao, Le Yang, Kaidi Xu, Jindong Gu, Renjing Xu
Keywords: Multimodal Large Language, Large Language Models, Artificial Intelligence, Multimodal Large, attracted wide attention
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Following the advent of the Artificial Intelligence (AI) era of large models, Multimodal Large Language Models (MLLMs) with the ability to understand cross-modal interactions between vision and text have attracted wide attention. Adversarial examples with human-imperceptible perturbation are shown to possess a characteristic known as transferability, which means that a perturbation generated by one model could also mislead another different model. Augmenting the diversity in input data is one of the most significant methods for enhancing adversarial transferability. This method has been certified as a way to significantly enlarge the threat impact under black-box conditions. Research works also demonstrate that MLLMs can be exploited to generate adversarial examples in the white-box scenario. However, the adversarial transferability of such perturbations is quite limited, failing to achieve effective black-box attacks across different models. In this paper, we propose the Typographic-based Semantic Transfer Attack (TSTA), which is inspired by: (1) MLLMs tend to process semantic-level information; (2) Typographic Attack could effectively distract the visual information captured by MLLMs. In the scenarios of Harmful Word Insertion and Important Information Protection, our TSTA demonstrates superior performance.

[CV-44] Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach

Link: https://arxiv.org/abs/2405.20084
Authors: Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal
Keywords: Human pose estimation, Human pose, interactive systems, key task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages (with references)

Abstract:Human pose estimation is a key task in computer vision with various applications such as activity recognition and interactive systems. However, the lack of consistency in the annotated skeletons across different datasets poses challenges in developing universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, containing 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations, improving cross-dataset generalization. Our joint models achieved an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. Moreover, we also evaluate all 21 predicted points by our two models by reporting an AP of 66.84 and 72.75 on the Halpe dataset. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application - the inconsistency in skeletal annotations.

[CV-45] NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Link: https://arxiv.org/abs/2405.20081
Authors: Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, Chengjie Wang
Keywords: Multimodal large language, large language models, Multimodal large, large language, contribute a powerful
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: updating

Abstract:Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information by building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions for images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, leading to excessive dependence on linguistic tokens while neglecting vision information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations for MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights among visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost pioneers semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% with human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at this https URL.

[CV-46] Faces of the Mind: Unveiling Mental Health States Through Facial Expressions in 11427 Adolescents

Link: https://arxiv.org/abs/2405.20072
Authors: Xiao Xu, Keyin Zhou, Yan Zhang, Yang Wang, Fei Wang, Xizhe Zhang
Keywords: Mood disorders, estimating mood disorder, mood disorder severity, facial, facial expressions
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Mood disorders, including depression and anxiety, often manifest through facial expressions. While previous research has explored the connection between facial features and emotions, machine learning algorithms for estimating mood disorder severity have been hindered by small datasets and limited real-world application. To address this gap, we analyzed facial videos of 11,427 participants, a dataset two orders of magnitude larger than previous studies. This comprehensive collection includes standardized facial expression videos from reading tasks, along with a detailed psychological scale that measures depression, anxiety, and stress. By examining the relationships among these emotional states and employing clustering analysis, we identified distinct subgroups embodying different emotional profiles. We then trained tree-based classifiers and deep learning models to estimate emotional states from facial features. Results indicate that models previously effective on small datasets experienced decreased performance when applied to our large dataset, highlighting the importance of data scale and mitigating overfitting in practical settings. Notably, our study identified subtle shifts in pupil dynamics and gaze orientation as potential markers of mood disorders, providing valuable information on the interaction between facial expressions and mental health. This research marks the first large-scale and comprehensive investigation of facial expressions in the context of mental health, laying the groundwork for future data-driven advancements in this field.

[CV-47] N-Dimensional Gaussians for Fitting of High Dimensional Functions

Link: https://arxiv.org/abs/2405.20067
Authors: Stavros Diolatzis, Tobias Zirr, Alexandr Kuznetsov, Georgios Kopanas, Anton Kaplanyan
Keywords: exhibit promising performance, explicitly learned representations, learned representations exhibit, representations exhibit promising, representing high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination, or time. In this paper, we tackle these challenges for an explicit representation based on Gaussian mixture models. With our solutions, we arrive at efficient fitting of compact N-dimensional Gaussian mixtures and enable efficient evaluation at render time: For fast fitting and evaluation, we introduce a high-dimensional culling scheme that efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For adaptive refinement yet compact representation, we introduce a loss-adaptive density control scheme that incrementally guides the use of additional capacity towards missing details. With these tools we can for the first time represent complex appearance that depends on many input dimensions beyond position or viewing angle within a compact, explicit representation optimized in minutes and rendered in milliseconds.

[CV-48] Can the accuracy bias by facial hairstyle be reduced through balancing the training data?

Link: https://arxiv.org/abs/2405.20062
Authors: Kagan Ozturk, Haiyu Wu, Kevin W. Bowyer
Keywords: facial, facial hair, facial hairstyles, beard and mustache, hair
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The appearance of a face can be greatly altered by growing a beard and mustache. The facial hairstyles in a pair of images can cause marked changes to the impostor distribution and the genuine distribution. Also, different distributions of facial hairstyle across demographics could cause a false impression of relative accuracy across demographics. We first show that, even though larger training sets boost the recognition accuracy on all facial hairstyles, accuracy variations caused by facial hairstyles persist regardless of the size of the training set. Then, we analyze the impact of having different fractions of the training data represent facial hairstyles. We created balanced training sets using a set of identities available in Webface42M that have both clean-shaven and facial hair images. We find that, even when a face recognition model is trained with a balanced clean-shaven / facial hair training set, accuracy variation on the test data does not diminish. Next, data augmentation is employed to further investigate the effect of facial hair distribution in training data by manipulating facial hair pixels with the help of facial landmark points and a facial hair segmentation model. Our results show facial hair causes an accuracy gap between clean-shaven and facial hair images, and this impact can be significantly different between African-Americans and Caucasians.

[CV-49] Enhancing Plant Disease Detection: A Novel CNN-Based Approach with Tensor Subspace Learning and HOWSVD-MD

Link: https://arxiv.org/abs/2405.20058
Authors: Abdelmalik Ouamane, Ammar Chouchane, Yassine Himeur, Abderrazak Debilou, Abbes Amira, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
Keywords: maintaining crop health, Machine learning, tomato leaf diseases, revolutionized the field, maintaining crop
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures and 8 tables

Abstract:Machine learning has revolutionized the field of agricultural science, particularly in the early detection and management of plant diseases, which are crucial for maintaining crop health and productivity. Leveraging advanced algorithms and imaging technologies, researchers are now able to identify and classify plant diseases with unprecedented accuracy and speed. Effective management of tomato diseases is crucial for enhancing agricultural productivity. The development and application of tomato disease classification methods are central to this objective. This paper introduces a cutting-edge technique for the detection and classification of tomato leaf diseases, utilizing insights from the latest pre-trained Convolutional Neural Network (CNN) models. We propose a sophisticated approach within the domain of tensor subspace learning, known as Higher-Order Whitened Singular Value Decomposition (HOWSVD), designed to boost the discriminatory power of the system. Our approach to Tensor Subspace Learning is methodically executed in two phases, beginning with HOWSVD and culminating in Multilinear Discriminant Analysis (MDA). The efficacy of this innovative method was rigorously tested through comprehensive experiments on two distinct datasets, namely PlantVillage and the Taiwan dataset. The findings reveal that HOWSVD-MDA outperforms existing methods, underscoring its capability to markedly enhance the precision and dependability of diagnosing tomato leaf diseases. For instance, up to 98.36% and 89.39% accuracy scores have been achieved under PlantVillage and the Taiwan datasets, respectively.

[CV-50] A Point-Neighborhood Learning Framework for Nasal Endoscope Image Segmentation

Link: https://arxiv.org/abs/2405.20044
Authors: Pengyu Jie, Wanquan Liu, Chenqiang Gao, Yihui Wen, Rui He, Pengcheng Li, Jintao Zhang, Deyu Meng
Keywords: ambiguous features, challenging due, complex and ambiguous, labeling burden, lesion segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures

Abstract:Lesion segmentation on endoscopic images is challenging due to the complex and ambiguous features of lesions. Fully-supervised deep learning segmentation methods can achieve good performance with fully pixel-level labeled datasets but greatly increase experts’ labeling burden. Semi-supervised and weakly supervised methods can ease the labeling burden, but substantially increase the learning difficulty. To alleviate this difficulty, weakly semi-supervised segmentation adopts a new annotation protocol of adding a large number of point annotation samples into a few pixel-level annotation samples. However, existing methods only mine points’ limited information while ignoring the reliable prior surrounding the point annotations. In this paper, we propose a weakly semi-supervised method called the Point-Neighborhood Learning (PNL) framework. To mine the prior of the pixels surrounding the annotated point, we transform a single-point annotation into a circular area named a point-neighborhood. We propose a point-neighborhood supervision loss and a pseudo-label scoring mechanism to enhance training supervision. Point-neighborhoods are also used to augment the data diversity. Our method greatly improves performance without changing the structure of the segmentation network. Comprehensive experiments show the superiority of our method over the other existing methods, demonstrating its effectiveness in point-annotated medical images. The project code will be available on: this https URL.
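
The step of turning a single-point annotation into a circular point-neighborhood can be illustrated with a minimal sketch. The function name, the list-of-lists mask format, and the `radius` parameter are assumptions made here for illustration; the abstract does not say how the radius is chosen or how the mask enters the loss.

```python
def point_neighborhood_mask(height, width, center_row, center_col, radius):
    """Return a binary mask (list of rows) in which every pixel within
    `radius` of the annotated point is 1 and all other pixels are 0."""
    radius_sq = radius * radius
    return [
        [
            1 if (r - center_row) ** 2 + (c - center_col) ** 2 <= radius_sq else 0
            for c in range(width)
        ]
        for r in range(height)
    ]
```

For example, `point_neighborhood_mask(5, 5, 2, 2, 1)` marks the annotated pixel and its four direct neighbors.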

[CV-51] Structure Gaussian SLAM with Manhattan World Hypothesis

Link: https://arxiv.org/abs/2405.20031
Authors: Shuhong Liu, Heng Zhou, Liuzhuozheng Li, Yun Liu, Tianchen Deng, Yiming Zhou, Mingrui Li
Keywords: Gaussian SLAM systems, Gaussian SLAM, Manhattan Gaussian SLAM, made significant advancements, made significant
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Gaussian SLAM systems have made significant advancements in improving the efficiency and fidelity of real-time reconstructions. However, these systems often encounter incomplete reconstructions in complex indoor environments, characterized by substantial holes due to unobserved geometry caused by obstacles or limited view angles. To address this challenge, we present Manhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan World hypothesis to enhance geometric accuracy and completeness. By seamlessly integrating fused line segments derived from structured scenes, MG-SLAM ensures robust tracking in textureless indoor areas. Moreover, the extracted lines and planar surface assumption allow strategic interpolation of new Gaussians in regions of missing geometry, enabling efficient scene completion. Extensive experiments conducted on both synthetic and real-world scenes demonstrate that these advancements enable our method to achieve state-of-the-art performance, marking a substantial improvement in the capabilities of Gaussian SLAM systems.

[CV-52] EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos

链接: https://arxiv.org/abs/2405.20030
作者: Masashi Hatano,Ryo Hachiuma,Hideo Saito
关键词: Predicting future human, human intention understanding, Predicting future, intention understanding, challenging but critical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Predicting future human behavior from egocentric videos is a challenging but critical task for human intention understanding. Existing methods for forecasting 2D hand positions rely on visual representations and mainly focus on hand-object interactions. In this paper, we investigate the hand forecasting task and tackle two significant issues that persist in the existing methods: (1) 2D hand positions in future frames are severely affected by ego-motions in egocentric videos; (2) prediction based on visual information tends to overfit to background or scene textures, posing a challenge for generalization on novel scenes or human behaviors. To solve the aforementioned problems, we propose EMAG, an ego-motion-aware and generalizable 2D hand forecasting method. In response to the first problem, we propose a method that considers ego-motion, represented by a sequence of homography matrices of two consecutive frames. We further leverage modalities such as optical flow, trajectories of hands and interacting objects, and ego-motions, thereby alleviating the second issue. Extensive experiments on two large-scale egocentric video datasets, Ego4D and EPIC-Kitchens 55, verify the effectiveness of the proposed method. In particular, our model outperforms prior methods by 7.0% on cross-dataset evaluations. Project page: this https URL
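
To make the homography representation of ego-motion more concrete, here is a minimal sketch of how a 3x3 homography between two consecutive frames maps a 2D hand position from one frame into the next. The function name `apply_homography` and the example matrix are assumptions for illustration; the abstract does not describe how EMAG estimates the homographies or feeds them to the model.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists).

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied
    by H, and divided by the resulting third coordinate.
    """
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


# Assumed example: the camera motion shifts image content by (+5, -3) pixels,
# which is the special case of a homography that is a pure translation.
H = [
    [1.0, 0.0, 5.0],
    [0.0, 1.0, -3.0],
    [0.0, 0.0, 1.0],
]

hand_in_next_frame = apply_homography(H, (10.0, 20.0))
print(hand_in_next_frame)  # (15.0, 17.0)
```

Even a hand that does not move in the world changes its 2D position when the camera moves, which is why a forecasting model benefits from seeing this mapping explicitly.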

[CV-53] From Forest to Zoo: Great Ape Behavior Recognition with ChimpBehave

链接: https://arxiv.org/abs/2405.20025
作者: Michael Fuchs,Emilie Genty,Adrian Bangerter,Klaus Zuberbühler,Paul Cotofrei
关键词: non-human primates, specifically focusing, Computer Vision, addresses the significant, Pattern Recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling In conjunction with Computer Vision and Pattern Recognition 2024

点击查看摘要

Abstract:This paper addresses the significant challenge of recognizing behaviors in non-human primates, specifically focusing on chimpanzees. Automated behavior recognition is crucial for both conservation efforts and the advancement of behavioral research. However, it is significantly hindered by the labor-intensive process of manual video annotation. Despite the availability of large-scale animal behavior datasets, the effective application of machine learning models across varied environmental settings poses a critical challenge, primarily due to the variability in data collection contexts and the specificity of annotations. In this paper, we introduce ChimpBehave, a novel dataset featuring over 2 hours of video (approximately 193,000 video frames) of zoo-housed chimpanzees, meticulously annotated with bounding boxes and behavior labels for action recognition. ChimpBehave uniquely aligns its behavior classes with existing datasets, allowing for the study of domain adaptation and cross-dataset generalization methods between different visual settings. Furthermore, we benchmark our dataset using a state-of-the-art CNN-based action recognition model, providing the first baseline results for both within and cross-dataset settings. The dataset, models, and code can be accessed at: this https URL

[CV-54] Sharing Key Semantics in Transformer Makes Efficient Image Restoration

链接: https://arxiv.org/abs/2405.20008
作者: Bin Ren,Yawei Li,Jingyun Liang,Rakesh Ranjan,Mengyuan Liu,Rita Cucchiara,Luc Van Gool,Ming-Hsuan Yang,Nicu Sebe
关键词: classic low-level vision, effectively model global, witnessed significant advancements, low-level vision task, deep models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages

点击查看摘要

Abstract:Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. During computation, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable at high input resolution, because it requires processing irrelevant information. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide especially relevant information for restoration, as they contribute contextual cues that are crucial for accurate reconstruction. To address these challenges, we propose boosting IR’s performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR’s state-of-the-art performance, quantitatively and qualitatively showcasing advancements.

[CV-55] DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild

链接: https://arxiv.org/abs/2405.19996
作者: Honghao Fu,Yufei Wang,Wenhan Yang,Bihan Wen
关键词: selecting high-quality images, Image quality assessment, Image quality, IQA method called, plays a critical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Image quality assessment (IQA) plays a critical role in selecting high-quality images and guiding compression and enhancement methods in a series of applications. Blind IQA, which assesses the quality of in-the-wild images containing complex authentic distortions without reference images, poses greater challenges. Existing methods are limited to modeling a uniform distribution with local patches and are hampered by the gap between low-level and high-level vision (caused by widely adopted pre-trained classification networks). In this paper, we propose a novel IQA method called diffusion priors-based IQA (DP-IQA), which leverages the prior knowledge of a pre-trained diffusion model to bridge semantic gaps in the perception of the visual quality of images. Specifically, we use pre-trained Stable Diffusion as the backbone, extract multi-level features from the denoising U-Net during the upsampling process at a specified timestep, and decode them to estimate the image quality score. Text and image adapters are adopted to mitigate the domain gap for downstream tasks and correct the information loss caused by the variational autoencoder bottleneck. Finally, we distill the knowledge in the above model into a CNN-based student model, significantly reducing the parameter count to enhance applicability; surprisingly, the student model performs similarly to or even better than the teacher model. Experimental results demonstrate that our DP-IQA achieves state-of-the-art results on various in-the-wild datasets with better generalization capability, which shows the superiority of our method in global modeling and in utilizing the hierarchical feature clues of diffusion for evaluating image quality.

[CV-56] DiffPhysBA: Diffusion-based Physical Backdoor Attack against Person Re-Identification in Real-World

链接: https://arxiv.org/abs/2405.19990
作者: Wenli Sun,Xinyang Jiang,Dongsheng Li,Cairong Zhao
关键词: significant security risk, systems pose, allowing adversaries, person ReID models, pose a significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Person Re-Identification (ReID) systems pose a significant security risk from backdoor attacks, allowing adversaries to evade tracking or impersonate others. Beyond recognizing this issue, we investigate how backdoor attacks can be deployed in real-world scenarios, where a ReID model is typically trained on data collected in the digital domain and then deployed in a physical environment. This attack scenario requires an attack flow that embeds backdoor triggers in the digital domain realistically enough to also activate the buried backdoor in person ReID models in the physical domain. This paper realizes this attack flow by leveraging a diffusion model to generate realistic accessories on pedestrian images (e.g., bags, hats, etc.) as backdoor triggers. However, the noticeable domain gap between the triggers generated by the off-the-shelf diffusion model and their physical counterparts results in a low attack success rate. Therefore, we introduce a novel diffusion-based physical backdoor attack (DiffPhysBA) method that adopts a training-free similarity-guided sampling process to enhance the resemblance between generated and physical triggers. Consequently, DiffPhysBA can generate realistic attributes as semantic-level triggers in the digital domain and provides higher physical ASR compared to the direct paste method by 25.6% on the real-world test set. Through evaluations on newly proposed real-world and synthetic ReID test sets, DiffPhysBA demonstrates an impressive success rate exceeding 90% in both the digital and physical domains. Notably, it excels in digital stealth metrics and can effectively evade state-of-the-art defense methods.

[CV-57] PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting

链接: https://arxiv.org/abs/2405.19957
作者: Qiaowei Miao,Yawei Luo,Yi Yang
关键词: research community focus, Score Distillation Sampling, text-conditioned diffusion models, identify Score Distillation, achieve breakthroughs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As text-conditioned diffusion models (DMs) achieve breakthroughs in image, video, and 3D generation, the research community’s focus has shifted to the more challenging task of text-to-4D synthesis, which introduces a temporal dimension to generate dynamic 3D objects. In this context, we identify Score Distillation Sampling (SDS), a widely used technique for text-to-3D synthesis, as a significant hindrance to text-to-4D performance due to its Janus-faced and texture-unrealistic problems coupled with high computational costs. In this paper, we propose Pixel-Level Alignments for Text-to-4D Gaussian Splatting (PLA4D), a novel method that utilizes text-to-video frames as explicit pixel alignment targets to generate static 3D objects and inject motion into them. Specifically, we introduce Focal Alignment to calibrate camera poses for rendering and GS-Mesh Contrastive Learning to distill geometry priors from rendered image contrasts at the pixel level. Additionally, we develop Motion Alignment using a deformation network to drive changes in Gaussians and implement Reference Refinement for smooth 4D object surfaces. These techniques enable 4D Gaussian Splatting to align geometry, texture, and motion with generated videos at the pixel level. Compared to previous methods, PLA4D produces synthesized outputs with better texture details in less time and effectively mitigates the Janus-faced problem. PLA4D is fully implemented using open-source models, offering an accessible, user-friendly, and promising direction for 4D digital content creation. Our project page: this https URL.

[CV-58] Hyper-Transformer for Amodal Completion

链接: https://arxiv.org/abs/2405.19949
作者: Jianxiong Gao,Xuelin Qian,Longfei Liang,Junwei Han,Yanwei Fu
关键词: Amodal object completion, object based, complex task, task that involves, involves predicting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information. Learning shape priors is crucial for effective amodal completion, but traditional methods often rely on two-stage processes or additional information, leading to inefficiencies and potential error accumulation. To address these shortcomings, we introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks. Specifically, H-TAN uses a dual-branch structure to extract multi-scale features from both images and masks. The multi-scale features from the image branch guide the hyper transformer in learning shape priors and in generating the weights for dynamic convolution tailored to each instance. The dynamic convolution head then uses the features from the mask branch to predict precise amodal masks. We extensively evaluate our model on three benchmark datasets: KINS, COCOA-cls, and D2SA, where H-TAN demonstrated superior performance compared to existing methods. Additional experiments validate the effectiveness and stability of the novel hyper transformer in our framework.

[CV-59] Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting

链接: https://arxiv.org/abs/2405.19943
作者: Qi Zhang,Yunfei Gong,Daijie Chen,Antoni B. Chan,Hui Huang
关键词: Recent deep learning-based, Recent deep, deep learning-based multi-view, MVD, deep learning-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: AAAI 2024

点击查看摘要

Abstract:Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information under large scenes. Besides, a large synthetic dataset is adopted to enhance the model’s generalization ability and enable more practical evaluation and comparison. The model’s performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. See code here: https://vcc.tech/research/2024/MVD.

[CV-60] Exploring Diffusion Models Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

链接: https://arxiv.org/abs/2405.19931
作者: Xiaoyu Wu,Jiaru Zhang,Yang Hua,Bohan Lyu,Hao Wang,Tao Song,Haibing Guan
关键词: reducing training costs, Diffusion Models, significantly reducing training, key advancement, personalized AI applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Preprint. Under review

点击查看摘要

Abstract:Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by theoretically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks.

[CV-61] MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion

链接: https://arxiv.org/abs/2405.19921
作者: Angel Villar-Corrales,Moritz Austermann,Sven Behnke
关键词: Autonomous systems, reliable semantic environment, semantic environment perception, self-driving cars, rely on reliable
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.

[CV-62] Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

链接: https://arxiv.org/abs/2405.19917
作者: Masashi Hatano,Ryo Hachiuma,Ryo Fuji,Hideo Saito
关键词: few-shot learning task, cross-domain few-shot learning, egocentric action recognition, action recognition, egocentric action
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We address a novel cross-domain few-shot learning task (CD-FSL) with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domain) and (2) the computational cost for real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we propose the incorporation of multimodal distillation into the student RGB model using teacher models. Each teacher model is trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model’s adaptability to the target domain. We further introduce ensemble masked inference, a technique that reduces the number of input tokens through masking. In this approach, ensemble prediction mitigates the performance degradation caused by masking, effectively addressing the second issue. Our approach outperformed the state-of-the-art CD-FSL approaches with a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points for 1-shot/5-shot settings while achieving 2.2 times faster inference speed. Project page: this https URL
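
The ensemble masked inference idea in this abstract can be sketched roughly as follows: run the model several times, each time on a random subset of the input tokens, and average the predicted class probabilities. Everything named here (`toy_model`, `keep_ratio`, `num_masks`, the random masking strategy, and simple averaging) is an assumption for illustration; the abstract does not specify how MM-CDFSL selects masks or combines predictions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(tokens):
    """Hypothetical stand-in classifier: mean-pool token features, then softmax."""
    n = len(tokens)
    dim = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / n for i in range(dim)]
    return softmax(pooled)


def ensemble_masked_inference(tokens, model, keep_ratio=0.5, num_masks=4, seed=0):
    """Average model predictions over several randomly masked token subsets."""
    rng = random.Random(seed)
    keep = max(1, int(len(tokens) * keep_ratio))
    total = None
    for _ in range(num_masks):
        kept = rng.sample(tokens, keep)  # masking = dropping the other tokens
        probs = model(kept)
        total = probs if total is None else [a + b for a, b in zip(total, probs)]
    return [p / num_masks for p in total]


# Four toy tokens with two feature dimensions that double as class scores.
tokens = [[2.0, 0.0], [1.5, 0.5], [3.0, -1.0], [2.5, 0.0]]
probs = ensemble_masked_inference(tokens, toy_model)
print(probs)
```

Each forward pass sees only a fraction of the tokens, which lowers per-pass cost, while averaging over several masks smooths out the information lost by any single mask. Whether the total cost drops depends on how the model's cost scales with the number of tokens.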

[CV-63] Towards RGB-NIR Cross-modality Image Registration and Beyond

链接: https://arxiv.org/abs/2405.19914
作者: Huadong Li,Shichao Dong,Jin Wang,Rong Fu,Minhao Jing,Jiajun Liang,Haoqiang Fan,Renhe Ji
关键词: cross-modality image registration, area of RGB, downstream vision tasks, complementary information present, RGB-NIR cross-modality registration
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration, which is crucial for many downstream vision tasks to fully leverage the complementary information present in visible and infrared images. In this field, researchers face two primary challenges - the absence of a correctly-annotated benchmark with viewpoint variations for evaluating RGB-NIR cross-modality registration methods and the problem of inconsistent local features caused by the appearance discrepancy between RGB-NIR cross-modality images. To address these challenges, we first present the RGB-NIR Image Registration (RGB-NIR-IRegis) benchmark, which, for the first time, enables fair and comprehensive evaluations for the task of RGB-NIR cross-modality image registration. Evaluations of previous methods highlight the significant challenges posed by our RGB-NIR-IRegis benchmark, especially on RGB-NIR image pairs with viewpoint variations. To analyze the causes of the unsatisfying performance, we then design several metrics to reveal the toxic impact of inconsistent local features between visible and infrared images on the model performance. This further motivates us to develop a baseline method named Semantic Guidance Transformer (SGFormer), which utilizes high-level semantic guidance to mitigate the negative impact of local inconsistent features. Despite the simplicity of our motivation, extensive experimental results show the effectiveness of our method.

[CV-64] Open-Set Domain Adaptation for Semantic Segmentation

链接: https://arxiv.org/abs/2405.19899
作者: Seun-An Choe,Ah-Hyung Shin,Keon-Hee Park,Jinwoo Choi,Gyeong-Moon Park
关键词: Unsupervised domain adaptation, semantic segmentation aims, unlabeled target domain, unknown classes, labeled source domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures, 13 tables, CVPR 2024 Poster

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer the pixel-wise knowledge from the labeled source domain to the unlabeled target domain. However, current UDA methods typically assume a shared label space between source and target, limiting their applicability in real-world scenarios where novel categories may emerge in the target domain. In this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) for the first time, where the target domain includes unknown classes. We identify two major problems in the OSDA-SS scenario as follows: 1) the existing UDA methods struggle to predict the exact boundary of the unknown classes, and 2) they fail to accurately predict the shape of the unknown classes. To address these issues, we propose Boundary and Unknown Shape-Aware open-set domain adaptation, coined BUS. Our BUS can accurately discern the boundaries between known and unknown classes in a contrastive manner using a novel dilation-erosion-based contrastive loss. In addition, we propose OpenReMix, a new domain mixing augmentation method that guides our model to effectively learn domain and size-invariant features for improving the shape detection of the known and unknown classes. Through extensive experiments, we demonstrate that our proposed BUS effectively detects unknown classes in the challenging OSDA-SS scenario compared to the previous methods by a large margin. The code is available at this https URL.
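
As background for the dilation-erosion-based loss named in this abstract, the sketch below shows how morphological dilation and erosion of a binary class mask yield thin bands just outside and just inside the class boundary. The 3x3 window, the helper names, and the use of the two bands are assumptions for illustration; the abstract does not describe how BUS builds its contrastive loss from such regions.

```python
def _window(mask, r, c):
    """Values in the 3x3 neighborhood of (r, c), clipped at the image border."""
    h, w = len(mask), len(mask[0])
    return [
        mask[rr][cc]
        for rr in range(max(0, r - 1), min(h, r + 2))
        for cc in range(max(0, c - 1), min(w, c + 2))
    ]


def dilate(mask):
    """A pixel becomes 1 if any pixel in its 3x3 neighborhood is 1."""
    return [[max(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def erode(mask):
    """A pixel stays 1 only if every pixel in its 3x3 neighborhood is 1."""
    return [[min(_window(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def boundary_bands(mask):
    """Return (outer, inner): pixels added by dilation and removed by erosion."""
    d, e = dilate(mask), erode(mask)
    h, w = len(mask), len(mask[0])
    outer = [[d[r][c] - mask[r][c] for c in range(w)] for r in range(h)]
    inner = [[mask[r][c] - e[r][c] for c in range(w)] for r in range(h)]
    return outer, inner


# A 5x5 mask with a 3x3 block of the class of interest in the center.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
outer, inner = boundary_bands(mask)
print(sum(map(sum, outer)), sum(map(sum, inner)))  # 16 8
```

Pixels in the inner and outer bands lie on opposite sides of the same boundary, which makes them natural candidates for a contrastive objective that pushes their features apart.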

[CV-65] PixOOD: Pixel-Level Out-of-Distribution Detection

链接: https://arxiv.org/abs/2405.19882
作者: Tomáš Vojíř,Jan Šochman,Jiří Matas
关键词: traditional training biases, dense image prediction, avoids traditional training, training biases, image prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review at ECCV 2024

点击查看摘要

Abstract:We propose a dense image prediction out-of-distribution detection algorithm, called PixOOD, which does not require training on samples of anomalous data and is not designed for a specific application, which avoids traditional training biases. In order to model the complex intra-class variability of the in-distribution data at the pixel level, we propose an online data condensation algorithm which is more robust than standard K-means and is easily trainable through SGD. We evaluate PixOOD on a wide range of problems. It achieved state-of-the-art results on four out of seven datasets, while being competitive on the rest. The source code is available at this https URL.

[CV-66] IReNe: Instant Recoloring in Neural Radiance Fields

链接: https://arxiv.org/abs/2405.19876
作者: Alessio Mazzucchelli,Adrian Garcia-Garcia,Elena Garces,Fernando Rivas-Manzaneque,Francesc Moreno-Noguer,Adrian Penate-Sanchez
关键词: Advances in NERFs, Advances, view synthesis, scene reconstructions, color
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Advances in NeRFs have allowed for 3D scene reconstructions and novel view synthesis. Yet, efficiently editing these representations while retaining photorealism is an emerging challenge. Recent methods face three primary limitations: they’re slow for interactive use, lack precision at object boundaries, and struggle to ensure multi-view consistency. We introduce IReNe to address these limitations, enabling swift, near real-time color editing in NeRF. Leveraging a pre-trained NeRF model and a single training image with user-applied color edits, IReNe adjusts network parameters in seconds. This adjustment allows the model to generate new scene views, accurately representing the color changes from the training image while also controlling object boundaries and view-specific effects. Object boundary control is achieved by integrating a trainable segmentation module into the model. The process gains efficiency by retraining only the weights of the last network layer. We observed that neurons in this layer can be classified into those responsible for view-dependent appearance and those contributing to diffuse appearance. We introduce an automated classification approach to identify these neuron types and exclusively fine-tune the weights of the diffuse neurons. This further accelerates training and ensures consistent color edits across different views. A thorough validation on a new dataset, with edited object colors, shows significant quantitative and qualitative advancements over competitors, accelerating speeds by 5x to 500x.

[CV-67] Hierarchical Object-Centric Learning with Capsule Networks

链接: https://arxiv.org/abs/2405.19861
作者: Riccardo Renzulli
关键词: Toggle, neural networks limitations, convolutional neural networks, Code, Papers
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Updated version of my PhD thesis (Nov 2023), with fixed typos. Will keep updated as new typos are discovered!

点击查看摘要

Abstract:Capsule networks (CapsNets) were introduced to address convolutional neural networks limitations, learning object-centric representations that are more robust, pose-aware, and interpretable. They organize neurons into groups called capsules, where each capsule encodes the instantiation parameters of an object or one of its parts. Moreover, a routing algorithm connects capsules in different layers, thereby capturing hierarchical part-whole relationships in the data. This thesis investigates the intriguing aspects of CapsNets and focuses on three key questions to unlock their full potential. First, we explore the effectiveness of the routing algorithm, particularly in small-sized networks. We propose a novel method that anneals the number of routing iterations during training, enhancing performance in architectures with fewer parameters. Secondly, we investigate methods to extract more effective first-layer capsules, also known as primary capsules. By exploiting pruned backbones, we aim to improve computational efficiency by reducing the number of capsules while achieving high generalization. This approach reduces CapsNets memory requirements and computational effort. Third, we explore part-relationship learning in CapsNets. Through extensive research, we demonstrate that capsules with low entropy can extract more concise and discriminative part-whole relationships compared to traditional capsule networks, even with reasonable network sizes. Lastly, we showcase how CapsNets can be utilized in real-world applications, including autonomous localization of unmanned aerial vehicles, quaternion-based rotations prediction in synthetic datasets, and lung nodule segmentation in biomedical imaging. The findings presented in this thesis contribute to a deeper understanding of CapsNets and highlight their potential to address complex computer vision challenges.

[CV-68] RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection

Link: https://arxiv.org/abs/2405.19854
Authors: Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides
Keywords: requires solid modeling, requires solid, region-semantic relationship, Open-vocabulary object detection, solid modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report

Click to view abstract

Abstract:Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform multiple region-level image captioning with various prompts and select the best matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also introduce a localization-aware region-text contrastive loss that learns object proposals tailored with different localization qualities. Extensive experiments demonstrate that our RTGen can serve as a scalable, semantically rich, and effective source for open-vocabulary object detection and continue to improve the model performance when more data is utilized, delivering superior performance compared to the existing state-of-the-art methods.
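
The region-to-text step described above (several candidate captions per region, with the best one chosen by CLIP similarity) can be sketched roughly as follows. This is only an illustration: the toy 3-dimensional vectors stand in for real CLIP embeddings, the function names are hypothetical, and using cosine similarity with a plain argmax is an assumption rather than the authors' exact procedure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_best_caption(region_embedding, caption_embeddings):
    """Return (index, score) of the caption embedding closest to the region embedding."""
    scores = [cosine_similarity(region_embedding, c) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy vectors standing in for CLIP image/text embeddings (assumption).
region = [1.0, 0.0, 1.0]
captions = [
    [0.0, 1.0, 0.0],    # unrelated caption
    [1.0, 0.1, 0.9],    # well-matched caption
    [-1.0, 0.0, -1.0],  # opposite direction
]
best_index, best_score = select_best_caption(region, captions)
```

In a real pipeline the embeddings would come from a CLIP image encoder (for the region crop) and text encoder (for each generated caption).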

[CV-69] KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation

Link: https://arxiv.org/abs/2405.19833
Authors: Fengyuan Yang, Kerui Gu, Angela Yao
Keywords: refine estimated, additional cue, cue to refine, human meshes, keypoints are commonly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR24

Click to view abstract

Abstract:2D keypoints are commonly used as an additional cue to refine estimated 3D human meshes. Current methods optimize the pose and shape parameters with a reprojection loss on the provided 2D keypoints. Such an approach, while simple and intuitive, has limited effectiveness because the optimal solution is hard to find in ambiguous parameter space and may sacrifice depth. Additionally, divergent gradients from distal joints complicate and deviate the refinement of proximal joints in the kinematic chain. To address these, we introduce Kinematic-Tree Rotation (KITRO), a novel mesh refinement strategy that explicitly models depth and human kinematic-tree structure. KITRO treats refinement from a bone-wise perspective. Unlike previous methods which perform gradient-based optimizations, our method calculates bone directions in closed form. By accounting for the 2D pose, bone length, and parent joint’s depth, the calculation results in two possible directions for each child joint. We then use a decision tree to trace binary choices for all bones along the human skeleton’s kinematic-tree to select the most probable hypothesis. Our experiments across various datasets and baseline models demonstrate that KITRO significantly improves 3D joint estimation accuracy and achieves an ideal 2D fit simultaneously. Our code is available at this https URL.
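
To see why the closed-form calculation yields exactly two hypotheses per child joint, here is a minimal sketch under a simplifying assumption that is not taken from the paper: an orthographic camera, with 2D coordinates expressed in the same metric units as the bone length. The function name is hypothetical. With the in-plane offset fixed by the 2D pose, the bone-length constraint leaves only the sign of the depth offset undetermined.

```python
import math

def child_depth_candidates(parent_xy, child_xy, parent_depth, bone_length):
    """Return the two depths (nearer, farther) consistent with a fixed bone length.

    Assumes an orthographic camera, so the 2D offset equals the in-plane
    part of the 3D bone vector (a simplification for illustration).
    """
    dx = child_xy[0] - parent_xy[0]
    dy = child_xy[1] - parent_xy[1]
    planar_sq = dx * dx + dy * dy
    # Noisy keypoints can make the projection longer than the bone; clamp at 0.
    dz = math.sqrt(max(bone_length ** 2 - planar_sq, 0.0))
    return parent_depth - dz, parent_depth + dz

# A bone of length 5 whose projection has length 3 leaves a depth offset of +/- 4.
near, far = child_depth_candidates((0.0, 0.0), (3.0, 0.0), parent_depth=10.0, bone_length=5.0)
```

Choosing between the two candidates for every bone is the binary decision that the abstract says is traced along the kinematic tree.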

[CV-70] Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology

Link: https://arxiv.org/abs/2405.19822
Authors: Frank A. Ruis, Alma M. Liezenga, Friso G. Heslinga, Luca Ballan, Thijs A. Eker, Richard J. M. den Hollander, Martin C. van Leeuwen, Judith Dijk, Wyke Huizinga
Keywords: Collecting and annotating, synthetic data, data, synthetic, annotating real-world data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Submitted to and presented at SPIE Defense + Commercial Sensing 2024, 13 pages, 4 figures, 3 tables

Click to view abstract

Abstract:Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but have limited ability to perform well when trained on synthetic data. For example, some architectures allow for fine-tuning with the expectation of large quantities of training data and are prone to overfitting on synthetic data. Related work usually ignores various best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Based on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic data transfer methods, we show that our methods improve the state of the art on synthetic data trained object detection for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.

[CV-71] Gated Fields: Learning Scene Reconstruction from Gated Videos

Link: https://arxiv.org/abs/2405.19819
Authors: Andrea Ramazzina, Stefanie Walz, Pragyan Dahal, Mario Bijelic, Felix Heide
Keywords: Reconstructing outdoor, temporal observations, challenge that recent, recent work, Reconstructing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Reconstructing outdoor 3D scenes from temporal observations is a challenge that recent work on neural fields has offered a new avenue for. However, existing methods that recover scene properties, such as geometry, appearance, or radiance, solely from RGB captures often fail when handling poorly-lit or texture-deficient regions. Similarly, recovering scenes with scanning LiDAR sensors is also difficult due to their low angular sampling rate which makes recovering expansive real-world scenes difficult. Tackling these gaps, we introduce Gated Fields - a neural scene reconstruction method that utilizes active gated video sequences. To this end, we propose a neural rendering approach that seamlessly incorporates time-gated capture and illumination. Our method exploits the intrinsic depth cues in the gated videos, achieving precise and dense geometry reconstruction irrespective of ambient illumination conditions. We validate the method across day and night scenarios and find that Gated Fields compares favorably to RGB and LiDAR reconstruction methods. Our code and datasets are available at this https URL.

[CV-72] WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark

Link: https://arxiv.org/abs/2405.19818
Authors: Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang
Keywords: tracing submerged entities, UOT, UOT datasets, foundational task, task for identifying
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: GitHub project: this https URL

Click to view abstract

Abstract:Underwater object tracking (UOT) is a foundational task for identifying and tracing submerged entities in underwater video sequences. However, current UOT datasets suffer from limitations in scale, diversity of target categories and scenarios covered, hindering the training and evaluation of modern tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, i.e., the largest public UOT benchmark to date, sourced from complex and realistic underwater environments. It comprises 1.1 million frames across 1,500 video clips filtered from 408 target categories, largely surpassing previous UOT datasets, e.g., UVOT400. Through meticulous manual annotation and verification, we provide high-quality bounding boxes for underwater targets. Additionally, WebUOT-1M includes language prompts for video sequences, expanding its application areas, e.g., underwater vision-language tracking. Most existing trackers are tailored for open-air environments, leading to performance degradation when applied to UOT due to domain gaps. Retraining and fine-tuning these trackers are challenging due to sample imbalances and limited real-world underwater datasets. To tackle these challenges, we propose a novel omni-knowledge distillation framework based on WebUOT-1M, incorporating various strategies to guide the learning of the student Transformer. To the best of our knowledge, this framework is the first to effectively transfer open-air domain knowledge to the UOT model through knowledge distillation, as demonstrated by results on both existing UOT datasets and the newly proposed WebUOT-1M. Furthermore, we comprehensively evaluate WebUOT-1M using 30 deep trackers, showcasing its value as a benchmark for UOT research by presenting new challenges and opportunities for future studies. The complete dataset, code, and tracking results will be made publicly available.

[CV-73] Performance Examination of Symbolic Aggregate Approximation in IoT Applications

Link: https://arxiv.org/abs/2405.19817
Authors: Suzana Veljanovska, Hans Dermot Doran
Keywords: Symbolic Aggregate approXimation, Symbolic Aggregate, Aggregate approXimation, time-series data, common dimensionality reduction
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Embedded World Conference, Nuremberg, 2024

Click to view abstract

Abstract:Symbolic Aggregate approXimation (SAX) is a common dimensionality reduction approach for time-series data which has been employed in a variety of domains, including classification and anomaly detection in time-series data. Domains also include shape recognition, where the shape outline is converted into time-series data, for instance epoch classification of archived arrowheads. In this paper we propose a dimensionality reduction and shape recognition approach based on the SAX algorithm, an application which requires responses on cost-efficient, IoT-like platforms. The challenge is largely dealing with the computational expense of the SAX algorithm in IoT-like applications, from simple time-series dimension reduction through shape recognition. The approach is based on lowering the dimensional space while capturing and preserving the most representative features of the shape. We present three scenarios of increasing computational complexity, backing up our statements with measurements of performance characteristics.
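
For readers unfamiliar with SAX, the standard algorithm can be sketched in a few lines: z-normalize the series, average it over equal-length segments (Piecewise Aggregate Approximation), then map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. This is a generic textbook-style sketch, not the paper's IoT implementation; the function name and the choice to drop trailing samples that do not fill a segment are assumptions.

```python
import statistics
from bisect import bisect_right

def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    # 1. Z-normalize so the Gaussian breakpoints apply.
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]

    # 2. Piecewise Aggregate Approximation: mean of each equal-length segment.
    #    Trailing samples that do not fill a whole segment are dropped.
    seg_len = len(z) // n_segments
    paa = [statistics.fmean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # 3. Breakpoints cut the standard normal into len(alphabet) equiprobable bins.
    normal = statistics.NormalDist()
    k = len(alphabet)
    breakpoints = [normal.inv_cdf(i / k) for i in range(1, k)]

    # 4. Each segment mean becomes the letter of the bin it falls into.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)

word = sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
```

The eight-sample ramp is reduced to a four-letter word, which illustrates the dimensionality reduction the abstract refers to.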

[CV-74] Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera

Link: https://arxiv.org/abs/2405.19794
Authors: Inpyo Song, Minjun Joo, Joonhyung Kwon, Jangwon Lee
Keywords: daily challenges encountered, navigation difficulties, access to information, social interaction, limited access
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2024 EgoVis Workshop

Click to view abstract

Abstract:This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction. To alleviate these challenges, we introduce a novel visual question answering dataset. Our dataset offers two significant advancements over previous datasets: Firstly, it features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings, departing from the static image-centric nature of prior datasets. Secondly, unlike datasets centered on singular challenges, ours addresses multiple real-life obstacles simultaneously through an innovative visual-question answering framework. We validate our dataset using various state-of-the-art VideoQA methods and diverse metrics. Results indicate that while progress has been made, satisfactory performance levels for AI-powered assistive services remain elusive for visually impaired individuals. Additionally, our evaluation highlights the distinctive features of the proposed dataset, featuring ego-motion in videos captured via 360-degree cameras across varied scenarios.

[CV-75] Instruction-Guided Visual Masking

Link: https://arxiv.org/abs/2405.19783
Authors: Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan
Keywords: contemporary LLM, crucial in contemporary, LLM, multimodal, Instruction-guided Visual Masking
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: preprint, 21 pages

Click to view abstract

Abstract:Instruction following is crucial in contemporary LLMs. However, when extended to multimodal settings, it often suffers from misalignment between a specific textual instruction and the targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at this https URL.

[CV-76] Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network

Link: https://arxiv.org/abs/2405.19775
Authors: Sizhe Zheng, Pan Gao, Peng Zhou, Jie Qin
Keywords: original structure, aims to render, maintaining the original, Style, Style transfer aims
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 11 figures, to be published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Click to view abstract

Abstract:Style transfer aims to render an image with the artistic features of a style image, while maintaining the original structure. Various methods have been put forward for this task, but some challenges still exist. For instance, it is difficult for CNN-based methods to handle global information and long-range dependencies between input images, for which transformer-based methods have been proposed. Although transformers can better model the relationship between content and style images, they require high-cost hardware and time-consuming inference. To address these issues, we design a novel transformer model that includes only the encoder, thus significantly reducing the computational cost. In addition, we find that existing style transfer methods may produce images that are under-stylized or missing content. In order to achieve better stylization, we design a content feature extractor and a style feature extractor, based on which pure content and style images can be fed to the transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure content and style feature fusion network. Through qualitative and quantitative experiments, we demonstrate the advantages of our model compared to state-of-the-art ones in the literature.

[CV-77] VQA Training Sets are Self-play Environments for Generating Few-shot Pools

Link: https://arxiv.org/abs/2405.19773
Authors: Tautvydas Misiunas, Hassan Mansoor, Jasper Uijlings, Oriana Riva, Victor Carbune
Keywords: visual-question answering benchmarks, solving compositional reasoning, answering benchmarks, Large-language models, increasingly capable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Large-language models and large-vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual-question answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or APIs, can significantly improve performance by breaking down high-level reasoning questions into sub-questions that are answerable by individual tools, but this approach has similar dataset construction costs to teach fine-tuned models how to use the available tools. We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards. This enables a model to autonomously teach itself to use itself or another model as a tool. By doing so, we augment training sets by integrating external signals. The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set. Our experiments showcase how Gemini learns how to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets. Our approach successfully generalizes and improves upon zero-shot performance on charts, infographics, and document visual question-answering datasets.
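
The idea of starting from a zero-shot prompt and adding few-shot examples that raise the training-set metric can be illustrated with a simple greedy loop. The abstract does not say how the selection is actually performed, so the greedy strategy, the function names, and the toy scoring function below are all assumptions made purely for illustration; in practice `score_fn` would run the model on the training set with the given examples in the prompt.

```python
def greedy_few_shot_selection(candidates, score_fn, max_shots):
    """Greedily add the example that most improves score_fn; stop when none helps."""
    selected = []
    best_score = score_fn(selected)  # zero-shot baseline
    for _ in range(max_shots):
        best_candidate = None
        for cand in candidates:
            if cand in selected:
                continue
            score = score_fn(selected + [cand])
            if score > best_score:
                best_score, best_candidate = score, cand
        if best_candidate is None:
            break  # no remaining example improves the metric
        selected.append(best_candidate)
    return selected, best_score

# Toy stand-in for "task metric on the training set" (hypothetical values).
EXAMPLE_GAIN = {"ex_a": 0.10, "ex_b": 0.25, "ex_c": -0.05}

def toy_metric(examples):
    return 0.5 + sum(EXAMPLE_GAIN[e] for e in examples)

shots, final_score = greedy_few_shot_selection(list(EXAMPLE_GAIN), toy_metric, max_shots=3)
```

Here the harmful example `ex_c` is never selected because adding it would lower the metric.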

[CV-78] All-In-One Medical Image Restoration via Task-Adaptive Routing

Link: https://arxiv.org/abs/2405.19769
Authors: Zhiwen Yang, Haowei Chen, Ziniu Qian, Yang Yi, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Keywords: medical image restoration, witnessed remarkable success, image restoration, medical image, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been early accepted by MICCAI 2024

Click to view abstract

Abstract:Although single-task medical image restoration (MedIR) has witnessed remarkable success, the limited generalizability of these methods poses a substantial obstacle to wider application. In this paper, we focus on the task of all-in-one medical image restoration, aiming to address multiple distinct MedIR tasks with a single universal model. Nonetheless, due to significant differences between different MedIR tasks, training a universal model often encounters task interference issues, where different tasks with shared parameters may conflict with each other in the gradient update direction. This task interference leads to deviation of the model update direction from the optimal path, thereby affecting the model’s performance. To tackle this issue, we propose a task-adaptive routing strategy, allowing conflicting tasks to select different network paths in spatial and channel dimensions, thereby mitigating task interference. Experimental results demonstrate that our proposed All-in-one Medical Image Restoration (AMIR) network achieves state-of-the-art performance in three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis, both in single-task and all-in-one settings. The code and data will be available at this https URL.

[CV-79] Towards Unified Multi-granularity Text Detection with Interactive Attention

Link: https://arxiv.org/abs/2405.19765
Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang
Keywords: Existing OCR engines, Existing OCR, systems typically rely, significant computational complexity, OCR engines
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICML 2024

Click to view abstract

Abstract:Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce “Detect Any Text” (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including word, line, paragraph and page. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performances across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT’s accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.

[CV-80] Mitigating annotation shift in cancer classification using single image generative models

Link: https://arxiv.org/abs/2405.19754
Authors: Marta Buetas Arcas, Richard Osuala, Karim Lekadir, Oliver Díaz
Keywords: Artificial Intelligence, detection and diagnosis, valuable tool, tool for assisting, assisting radiologists
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint of paper accepted at SPIE IWBI 2024 Conference

Click to view abstract

Abstract:Artificial Intelligence (AI) has emerged as a valuable tool for assisting radiologists in breast cancer detection and diagnosis. However, the success of AI applications in this domain is restricted by the quantity and quality of available data, posing challenges due to limited and costly data annotation procedures that often lead to annotation shifts. This study simulates, analyses and mitigates annotation shifts in cancer classification in the breast mammography domain. First, a high-accuracy cancer risk prediction model is developed, which effectively distinguishes benign from malignant lesions. Next, model performance is used to quantify the impact of annotation shift. We uncover a substantial impact of annotation shift on multiclass classification performance particularly for malignant lesions. We thus propose a training data augmentation approach based on single-image generative models for the affected class, requiring as few as four in-domain annotations to considerably mitigate annotation shift, while also addressing dataset imbalance. Lastly, we further increase performance by proposing and validating an ensemble architecture based on multiple models trained under different data augmentation regimes. Our study offers key insights into annotation shift in deep learning breast cancer classification and explores the potential of single-image generative models to overcome domain shift challenges.

[CV-81] HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

Link: https://arxiv.org/abs/2405.19751
Authors: Wenxuan Liu, Saiqian Zhang
Keywords: outperforming traditional diffusion, traditional diffusion models, Diffusion Transformers, visual generation capabilities, recently gained substantial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones. To address these challenges, we introduce the Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference. Compared to fixed-point quantization (e.g., INT8), FP quantization, complemented by our proposed clipping range selection mechanism, naturally aligns with the data distribution within DiT, resulting in a minimal quantization error. Furthermore, HQ-DiT also implements a universal identity mathematical transform to mitigate the serious quantization error caused by the outliers. The experimental results demonstrate that DiT can achieve extremely low-precision quantization (i.e., 4 bits) with negligible impact on performance. Our approach marks the first instance where both weights and activations in DiTs are quantized to just 4 bits, with only a 0.12 increase in sFID on ImageNet.
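
To make "4-bit floating-point quantization with a clipping range" concrete, here is a minimal sketch that rounds values to the nearest point on an FP4 grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6; the abstract does not state which FP4 layout HQ-DiT uses or how its clipping range is chosen, so the format, the `clip_max` parameter, and the function name are illustrative assumptions.

```python
import math

# Representable magnitudes of the E2M1 4-bit float format (assumed layout).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values, clip_max):
    """Clip to [-clip_max, clip_max] (clip_max > 0), then round to the scaled FP4 grid."""
    scale = clip_max / FP4_E2M1_MAGNITUDES[-1]  # maps the grid maximum onto clip_max
    quantized = []
    for v in values:
        clipped = max(-clip_max, min(clip_max, v))
        magnitude = abs(clipped) / scale
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(g - magnitude))
        quantized.append(math.copysign(nearest * scale, clipped))
    return quantized

# With clip_max = 6.0 the scale is 1, so values snap directly onto the grid.
result = quantize_fp4([0.1, -0.7, 2.4, 10.0], clip_max=6.0)
```

Note how the grid is denser near zero than near the clipping bound, which is the property that lets floating-point formats fit bell-shaped value distributions better than a uniform integer grid.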

[CV-82] DenseSeg: Joint Learning for Semantic Segmentation and Landmark Detection Using Dense Image-to-Shape Representation

Link: https://arxiv.org/abs/2405.19746
Authors: Ron Keuth, Lasse Hansen, Maren Balks, Ronja Jäger, Anne-Nele Schröder, Ludger Tüshaus, Mattias Heinrich
Keywords: medical image processing, Semantic segmentation, image processing, facilitating further analysis, landmark detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Purpose: Semantic segmentation and landmark detection are fundamental tasks of medical image processing, facilitating further analysis of anatomical objects. Although deep learning-based pixel-wise classification has set a new state of the art for segmentation, it falls short in landmark detection, a strength of shape-based approaches. Methods: In this work, we propose a dense image-to-shape representation that enables the joint learning of landmarks and semantic segmentation by employing a fully convolutional architecture. Our method intuitively allows the extraction of arbitrary landmarks due to its representation of anatomical correspondences. We benchmark our method against the state of the art for semantic segmentation (nnUNet), a shape-based approach employing geometric deep learning, and a CNN-based method for landmark detection. Results: We evaluate our method on two medical datasets: one common benchmark featuring the lungs, heart, and clavicle from thorax X-rays, and another with 17 different bones in the paediatric wrist. While our method is on par with the landmark detection baseline in the thorax setting (error in mm of 2.6 ± 0.9 vs 2.7 ± 0.9), it substantially surpasses it in the more complex wrist setting (1.1 ± 0.6 vs 1.9 ± 0.5). Conclusion: We demonstrate that dense geometric shape representation is beneficial for challenging landmark detection tasks and outperforms the previous state of the art using heatmap regression. Moreover, it does not require explicit training on the landmarks themselves, allowing new landmarks to be added without retraining.

[CV-83] GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Link: https://arxiv.org/abs/2405.19745
Authors: Boming Zhao, Yuan Li, Ziyu Sun, Lin Zeng, Yujun Shen, Rui Ma, Yinda Zhang, Hujun Bao, Zhaopeng Cui
Keywords: Forecasting future scenarios, Forecasting future, decision-making and navigation, vision and robotics, essential for intelligent
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to SIGGRAPH 2024 Conference. Project Page: this https URL

Click to view abstract

Abstract:Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments.

[CV-84] May the Dance be with You: Dance Generation Framework for Non-Humanoids

Link: https://arxiv.org/abs/2405.19743
Authors: Hyemin Ahn
Keywords: visual rhythm, optical flow, music, visual, rhythm
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 13 pages, 6 Figures, Rejected at Neurips 2023

Click to view abstract

Abstract:We hypothesize dance as a motion that forms a visual rhythm from music, where the visual rhythm can be perceived from an optical flow. If an agent can recognize the relationship between visual rhythm and music, it will be able to dance by generating a motion to create a visual rhythm that matches the music. Based on this, we propose a framework for any kind of non-humanoid agent to learn how to dance from human videos. Our framework works in two processes: (1) training a reward model which perceives the relationship between optical flow (visual rhythm) and music from human dance videos, and (2) training the non-humanoid dancer with reinforcement learning based on that reward model. Our reward model consists of two feature encoders for optical flow and music. They are trained with contrastive learning, which increases the similarity between concurrent optical flow and music features. With this reward model, the agent learns dancing by getting a higher reward when its action creates an optical flow whose feature has a higher similarity with the given music feature. Experimental results show that the generated dance motion can align with the music beat properly, and a user study indicates that our framework is preferred by humans over the baselines. To the best of our knowledge, our work on non-humanoid agents which learn dance from human videos is unprecedented. An example video can be found at this https URL.
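
The contrastive training of the two encoders can be pictured with an InfoNCE-style loss, in which each optical-flow feature should be most similar to the music feature from the same moment and less similar to the others in the batch. The abstract only says "contrastive learning", so the specific loss, the temperature value, and the function names here are assumptions, and the toy 2-dimensional vectors stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(flow_feats, music_feats, temperature=0.1):
    """Mean InfoNCE loss where flow_feats[i] and music_feats[i] are concurrent pairs."""
    n = len(flow_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(flow_feats[i], music_feats[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all music features in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

flow = [[1.0, 0.0], [0.0, 1.0]]
aligned_music = [[1.0, 0.0], [0.0, 1.0]]   # concurrent pairs match
shuffled_music = [[0.0, 1.0], [1.0, 0.0]]  # pairs are mismatched
aligned_loss = info_nce(flow, aligned_music)
shuffled_loss = info_nce(flow, shuffled_music)
```

After training, the same similarity score between a generated motion's optical-flow feature and the music feature could serve as the reward signal described in the abstract.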

[CV-85] Twin Deformable Point Convolutions for Point Cloud Semantic Segmentation in Remote Sensing Scenes

Link: https://arxiv.org/abs/2405.19735
Authors: Yong-Qiang Mao, Hanbo Bi, Xuexue Li, Kaiqiang Chen, Zhirui Wang, Xian Sun, Kun Fu
Keywords: remote sensing fields, Deformable point Convolution, point cloud processing, remote sensing, point cloud segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Thanks to the application of deep learning technology in point cloud processing of the remote sensing field, point cloud segmentation has become a research hotspot in recent years, which can be applied to real-world 3D, smart cities, and other fields. Although existing solutions have made unprecedented progress, they ignore the inherent characteristics of point clouds in remote sensing fields that are strictly arranged according to latitude, longitude, and altitude, which brings great convenience to the segmentation of point clouds in remote sensing fields. To consider this property cleverly, we propose novel convolution operators, termed Twin Deformable point Convolutions (TDConvs), which aim to achieve adaptive feature learning by learning deformable sampling points in the latitude-longitude plane and altitude direction, respectively. First, to model the characteristics of the latitude-longitude plane, we propose a Cylinder-wise Deformable point Convolution (CyDConv) operator, which generates a two-dimensional cylinder map by constructing a cylinder-like grid in the latitude-longitude direction. Furthermore, to better integrate the features of the latitude-longitude plane and the spatial geometric features, we perform a multi-scale fusion of the extracted latitude-longitude features and spatial geometric features, and realize it through the aggregation of adjacent point features of different scales. In addition, a Sphere-wise Deformable point Convolution (SpDConv) operator is introduced to adaptively offset the sampling points in three-dimensional space by constructing a sphere grid structure, aiming at modeling the characteristics in the altitude direction. Experiments on existing popular benchmarks conclude that our TDConvs achieve the best segmentation performance, surpassing the existing state-of-the-art methods.

[CV-86] Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Link: https://arxiv.org/abs/2405.19732
Authors: Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo
Keywords: skill generally relies, Learning a skill, insightful high-level guidance, skill generally, generally relies
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at this https URL.
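
The interleaving described in this abstract (gradient stages whose recorded trajectories feed a high-level proposer, which returns a restart point for the next stage) can be sketched on a toy problem. This is only an illustration under strong assumptions: a one-dimensional parameter, a hand-written gradient, and a stand-in `propose_restart` callable where the paper would query an LLM. All names here are hypothetical.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (parameter, loss) pairs from one stage


def interleaved_optimize(
    loss: Callable[[float], float],
    grad: Callable[[float], float],
    propose_restart: Callable[[Trajectory], float],
    x0: float,
    rounds: int = 3,
    steps_per_round: int = 20,
    lr: float = 0.1,
) -> Tuple[float, float]:
    """Alternate gradient-descent stages with restarts from a proposer."""
    x = x0
    best_x, best_loss = x0, loss(x0)
    for r in range(rounds):
        trajectory: Trajectory = []
        for _ in range(steps_per_round):
            x = x - lr * grad(x)              # local gradient step (the "doer")
            current = loss(x)
            trajectory.append((x, current))
            if current < best_loss:
                best_x, best_loss = x, current
        if r < rounds - 1:
            # High-level step (the "instructor"): in the paper this would be
            # an LLM reading the task description and the trajectory.
            x = propose_restart(trajectory)
    return best_x, best_loss


def best_point_proposer(trajectory: Trajectory) -> float:
    """Stand-in for the LLM: restart from the lowest-loss point seen."""
    return min(trajectory, key=lambda point: point[1])[0]


if __name__ == "__main__":
    result = interleaved_optimize(
        loss=lambda x: (x - 3.0) ** 2,
        grad=lambda x: 2.0 * (x - 3.0),
        propose_restart=best_point_proposer,
        x0=0.0,
    )
    print(result)
```

The stand-in proposer adds nothing over plain gradient descent on this convex toy loss; the point of the sketch is only the control flow, in which any proposer with the same signature can be swapped in.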

[CV-87] Automatic Dance Video Segmentation for Understanding Choreography

Link: https://arxiv.org/abs/2405.19727
Authors: Koki Endo, Shuhei Tsuchida, Tsukasa Fukusato, Takeo Igarashi
Keywords: Segmenting dance video, easily understand dance, Segmenting dance, dance video, dance
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 11 figures

Abstract:Segmenting dance video into short movements is a popular way to easily understand dance choreography. However, it is currently done manually and requires a significant amount of effort by experts. That is, even if many dance videos are available on social media (e.g., TikTok and YouTube), it remains difficult for people, especially novices, to casually watch short video segments to practice dance choreography. In this paper, we propose a method to automatically segment a dance video into each movement. Given a dance video as input, we first extract visual and audio features: the former is computed from the keypoints of the dancer in the video, and the latter is computed from the Mel spectrogram of the music in the video. Next, these features are passed to a Temporal Convolutional Network (TCN), and segmentation points are estimated by picking peaks of the network output. To build our training dataset, we annotate segmentation points to dance videos in the AIST Dance Video Database, which is a shared database containing original street dance videos with copyright-cleared dance music. The evaluation study shows that the proposed method (i.e., combining the visual and audio features) can estimate segmentation points with high accuracy. In addition, we developed an application to help dancers practice choreography using the proposed method.
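
The final step in this abstract, estimating segmentation points by picking peaks of the network output, can be sketched as follows. The abstract does not give the peak-picking rule, so the threshold and minimum-gap parameters below are assumptions, and `pick_peaks` is a hypothetical helper operating on a plain list of per-frame scores.

```python
from typing import List, Sequence


def pick_peaks(scores: Sequence[float], threshold: float = 0.5, min_gap: int = 1) -> List[int]:
    """Return frame indices that are local maxima with score >= threshold.

    If two peaks are closer than `min_gap` frames, only the higher one is kept.
    """
    peaks: List[int] = []
    for i in range(1, len(scores) - 1):
        is_local_max = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if not (is_local_max and scores[i] >= threshold):
            continue
        if peaks and i - peaks[-1] < min_gap:
            # Too close to the previous peak: keep whichever scores higher.
            if scores[i] > scores[peaks[-1]]:
                peaks[-1] = i
        else:
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    frame_scores = [0.1, 0.8, 0.2, 0.1, 0.9, 0.3]
    print(pick_peaks(frame_scores))             # [1, 4]
    print(pick_peaks(frame_scores, min_gap=5))  # [4]
```

Dividing each returned index by the video frame rate would convert it to a timestamp for cutting the video into movements.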

[CV-88] Streaming Video Diffusion: Online Video Editing with Diffusion Models

Link: https://arxiv.org/abs/2405.19726
Authors: Feng Chen, Zhen Yang, Bohan Zhuang, Qi Wu
Keywords: task called online, maintaining temporal consistency, online video editing, called online video, video editing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present a novel task called online video editing, which is designed to edit streaming frames while maintaining temporal consistency. Unlike existing offline video editing assuming all frames are pre-established and accessible, online video editing is tailored to real-life applications such as live streaming and online chat, requiring (1) fast continual step inference, (2) long-term temporal modeling, and (3) zero-shot video editing capability. To solve these issues, we propose Streaming Video Diffusion (SVDiff), which incorporates the compact spatial-aware temporal recurrence into off-the-shelf Stable Diffusion and is trained with the segment-level scheme on large-scale long videos. This simple yet effective setup allows us to obtain a single model that is capable of executing a broad range of videos and editing each streaming frame with temporal coherence. Our experiments indicate that our model can edit long, high-quality videos with remarkable results, achieving a real-time inference speed of 15.2 FPS at a resolution of 512x512.

[CV-89] Encoding and Controlling Global Semantics for Long-form Video Question Answering

Link: https://arxiv.org/abs/2405.19723
Authors: Thong Thanh Nguyen, Zhiyuan Hu, Xiaobao Wu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Keywords: Seeking answers effectively, Seeking answers, answers effectively, essential to build, video question answering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computations. However, this fails to reason over the whole sequence of video, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage global semantics aligned with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new as well as existing datasets.

[CV-90] QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering

Link: https://arxiv.org/abs/2405.19722
Authors: Xuan-Bac Nguyen, Hoang-Quan Nguyen, Samuel Yen-Chi Chen, Samee U. Khan, Hugh Churchill, Khoa Luu
Keywords: yielding significant outcomes, studied for decades, yielding significant, significant outcomes, outcomes across numerous
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unsupervised vision clustering, a cornerstone in computer vision, has been studied for decades, yielding significant outcomes across numerous vision tasks. However, these algorithms involve substantial computational demands when confronted with vast amounts of unlabeled data. Conversely, Quantum computing holds promise in expediting unsupervised algorithms when handling large-scale databases. In this study, we introduce QClusformer, a pioneering Transformer-based framework leveraging Quantum machines to tackle unsupervised vision clustering challenges. Specifically, we design the Transformer architecture, including the self-attention module and transformer blocks, from a Quantum perspective to enable execution on Quantum hardware. In addition, we present QClusformer, a variant based on the Transformer architecture, tailored for unsupervised vision clustering tasks. By integrating these elements into an end-to-end framework, QClusformer consistently outperforms previous methods running on classical computers. Empirical evaluations across diverse benchmarks, including MS-Celeb-1M and DeepFashion, underscore the superior performance of QClusformer compared to state-of-the-art methods.

[CV-91] LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising

Link: https://arxiv.org/abs/2405.19718
Authors: Yuxing Duan, Shihan Peng, Lin Zhu, Wei Zhang, Yi Chang, Sheng Zhong, Luxin Yan
Keywords: significant advantages, advantages in capturing, challenging conditions, dynamic scene information, capturing dynamic scene
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2024

Abstract:Event camera has significant advantages in capturing dynamic scene information while being prone to noise interference, particularly in challenging conditions like low threshold and low illumination. However, most existing research focuses on gentle situations, hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field, we construct a new paired real-world event denoising dataset (LED), including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger-scale with high-resolution, and high-quality GT. Specifically, it contains stepped parameters and varying illumination with diverse scenarios. Moreover, based on the property of noise events inconsistency and signal events consistency, we propose a novel effective denoising framework (DED) using homogeneous dual events to generate the GT with better separating noise from the raw. Furthermore, we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising. The experimental results demonstrate the remarkable performance of the proposed approach on different datasets. The dataset and code are at this https URL.

[CV-92] Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Link: https://arxiv.org/abs/2405.19716
Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang
Keywords: Large vision language, integrate large language, large language models, pre-trained vision encoders, vision language models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 19 pages, 14 figures, 6 tables

Abstract:Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model’s own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.

[CV-93] HINT: Learning Complete Human Neural Representations from Limited Viewpoints

Link: https://arxiv.org/abs/2405.19712
Authors: Alessandro Sanvito, Andrea Ramazzina, Stefanie Walz, Mario Bijelic, Felix Heide
Keywords: animated humanoid avatars, augmented application, animated humanoid, humanoid avatars, human
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:No augmented application is possible without animated humanoid avatars. At the same time, generating human replicas from real-world monocular hand-held or robotic sensor setups is challenging due to the limited availability of views. Previous work showed the feasibility of virtual avatars but required the presence of 360 degree views of the targeted subject. To address this issue, we propose HINT, a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles. We achieve this by introducing a symmetry prior, regularization constraints, and training cues from large human datasets. In particular, we introduce a sagittal plane symmetry prior to the appearance of the human, directly supervise the density function of the human model using explicit 3D body modeling, and leverage a co-learned human digitization network as additional supervision for the unseen angles. As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR compared to previous state-of-the-art algorithms.

[CV-94] Text Guided Image Editing with Automatic Concept Locating and Forgetting

Link: https://arxiv.org/abs/2405.19708
Authors: Jia Li, Lijie Hu, Zhixian He, Jingfeng Zhang, Tianhang Zheng, Di Wang
Keywords: diffusion models guided, significant progress, image, text-guided image editing, image editing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:With the advancement of image-to-image diffusion models guided by text, significant progress has been made in image editing. However, a persistent challenge remains in seamlessly incorporating objects into images based on textual instructions, without relying on extra user-provided guidance. Text and images are inherently distinct modalities, bringing out difficulties in fully capturing the semantic intent conveyed through language and accurately translating that into the desired visual modifications. Therefore, text-guided image editing models often produce generations with residual object attributes that do not fully align with human expectations. To address this challenge, the models should comprehend the image content effectively away from a disconnect between the provided textual editing prompts and the actual modifications made to the image. In our paper, we propose a novel method called Locate and Forget (LaF), which effectively locates potential target concepts in the image for modification by comparing the syntactic trees of the target prompt and scene descriptions in the input image, intending to forget their existence clues in the generated image. Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.

[CV-95] DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Link: https://arxiv.org/abs/2405.19707
Authors: Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, Huaxiong Li
Keywords: video, AI-generated video detection, fake AI-generated videos, Recently, videos
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors’ performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba’s superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be available at this https URL.

[CV-96] Towards a Better Evaluation of Out-of-Domain Generalization

Link: https://arxiv.org/abs/2405.19703
Authors: Duhun Hwang, Suhyun Kang, Moonjung Eo, Jimyeong Kim, Wonjong Rhee
Keywords: unseen test distributions, previously unseen test, average measure, achieving high performance, domain generalization performance
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The objective of Domain Generalization (DG) is to devise algorithms and models capable of achieving high performance on previously unseen test distributions. In the pursuit of this objective, average measure has been employed as the prevalent measure for evaluating models and comparing algorithms in the existing DG studies. Despite its significance, a comprehensive exploration of the average measure has been lacking and its suitability in approximating the true domain generalization performance has been questionable. In this study, we carefully investigate the limitations inherent in the average measure and propose worst+gap measure as a robust alternative. We establish theoretical grounds of the proposed measure by deriving two theorems starting from two different assumptions. We conduct extensive experimental investigations to compare the proposed worst+gap measure with the conventional average measure. Given the indispensable need to access the true DG performance for studying measures, we modify five existing datasets to come up with SR-CMNIST, C-CatsDogs, L-CIFAR10, PACS-corrupted, and VLCS-corrupted datasets. The experiment results unveil an inferior performance of the average measure in approximating the true DG performance and confirm the robustness of the theoretically supported worst+gap measure.

[CV-97] Distribution Aligned Semantics Adaption for Lifelong Person Re-Identification

Link: https://arxiv.org/abs/2405.19695
Authors: Qizao Wang, Xuelin Qian, Bin Li, Xiangyang Xue
Keywords: Lifelong person Re-IDentification, person Re-IDentification, real-world scenarios, space and time, Semantics Adaption
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In real-world scenarios, person Re-IDentification (Re-ID) systems need to be adaptable to changes in space and time. Therefore, the adaptation of Re-ID models to new domains while preserving previously acquired knowledge is crucial, known as Lifelong person Re-IDentification (LReID). Advanced LReID methods rely on replaying exemplars from old domains and applying knowledge distillation in logits with old models. However, due to privacy concerns, retaining previous data is inappropriate. Additionally, the fine-grained and open-set characteristics of Re-ID limit the effectiveness of the distillation paradigm for accumulating knowledge. We argue that a Re-ID model trained on diverse and challenging pedestrian images at a large scale can acquire robust and general human semantic knowledge. These semantics can be readily utilized as shared knowledge for lifelong applications. In this paper, we identify the challenges and discrepancies associated with adapting a pre-trained model to each application domain, and introduce the Distribution Aligned Semantics Adaption (DASA) framework. It efficiently adjusts Batch Normalization (BN) to mitigate interference from data distribution discrepancy and freezes the pre-trained convolutional layers to preserve shared knowledge. Additionally, we propose the lightweight Semantics Adaption (SA) module, which effectively adapts learned semantics to enhance pedestrian representations. Extensive experiments demonstrate the remarkable superiority of our proposed framework over advanced LReID methods, and it exhibits significantly reduced storage consumption. DASA presents a novel and cost-effective perspective on effectively adapting pre-trained models for LReID.

[CV-98] Uncertainty-aware sign language video retrieval with probability distribution modeling

Link: https://arxiv.org/abs/2405.19689
Authors: Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu
Keywords: Sign language video, facilitating information access, Sign language, language video, sign language retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract:Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet), that conceptualizes the mapping process of sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).

[CV-99] DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric Details

Link: https://arxiv.org/abs/2405.19688
Authors: Haitao Cao, Baoping Cheng, Qiran Pu, Haocheng Zhang, Bin Luo, Yixiang Zhuang, Juncong Lin, Liyan Chen, Xuan Cheng
Keywords: modeling human faces, bodies and hands, graphics tasks, parametric model, enabled a wide
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Parametric 3D models have enabled a wide variety of computer vision and graphics tasks, such as modeling human faces, bodies and hands. In 3D face modeling, 3DMM is the most widely used parametric model, but can’t generate fine geometric details solely from identity and expression inputs. To tackle this limitation, we propose a neural parametric model named DNPM for the facial geometric details, which utilizes deep neural network to extract latent codes from facial displacement maps encoding details and wrinkles. Built upon DNPM, a novel 3DMM named Detailed3DMM is proposed, which augments traditional 3DMMs by including the synthesis of facial details only from the identity and expression inputs. Moreover, we show that DNPM and Detailed3DMM can facilitate two downstream applications: speech-driven detailed 3D facial animation and 3D face reconstruction from a degraded image. Extensive experiments have shown the usefulness of DNPM and Detailed3DMM, and the progressiveness of two proposed applications.

[CV-100] Autonomous Driving with Spiking Neural Networks

Link: https://arxiv.org/abs/2405.19687
Authors: Rui-Jie Zhu, Ziqing Wang, Leilani Gilpin, Jason K. Eshraghian
Keywords: Autonomous driving demands, Autonomous driving, strict energy constraints, Spiking Autonomous Driving, Spiking Neural Network
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Autonomous driving demands an integrated approach that encompasses perception, prediction, and planning, all while operating under strict energy constraints to enhance scalability and environmental sustainability. We present Spiking Autonomous Driving (SAD), the first unified Spiking Neural Network (SNN) to address the energy challenges faced by autonomous driving systems through its event-driven and energy-efficient nature. SAD is trained end-to-end and consists of three main modules: perception, which processes inputs from multi-view cameras to construct a spatiotemporal bird’s eye view; prediction, which utilizes a novel dual-pathway with spiking neurons to forecast future states; and planning, which generates safe trajectories considering predicted occupancy, traffic rules, and ride comfort. Evaluated on the nuScenes dataset, SAD achieves competitive performance in perception, prediction, and planning tasks, while drawing upon the energy efficiency of SNNs. This work highlights the potential of neuromorphic computing to be applied to energy-efficient autonomous driving, a critical step toward sustainable and safety-critical automotive technology. Our code is available at this https URL.

[CV-101] A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning

Link: https://arxiv.org/abs/2405.19684
Authors: Xiaofeng Cong, Yu Zhao, Jie Gui, Junming Hou, Dacheng Tao
Keywords: Underwater image enhancement, Underwater image, image enhancement, computer vision, field of computer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: A survey on the underwater image enhancement task

Abstract:Underwater image enhancement (UIE) is a challenging research task in the field of computer vision. Although hundreds of UIE algorithms have been proposed, a comprehensive and systematic review is still lacking. To promote future research, we summarize the UIE task from multiple perspectives. First, the physical models, data construction processes, evaluation metrics, and loss functions are introduced. Second, according to the contributions brought by different literatures, recent proposed algorithms are discussed and classified from six perspectives, namely network architecture, learning strategy, learning stage, assistance task, domain perspective and disentanglement fusion, respectively. Third, considering the inconsistencies in experimental settings in different literatures, a comprehensive and fair comparison does not yet exist. To this end, we quantitatively and qualitatively evaluate state-of-the-art algorithms on multiple benchmark datasets. Finally, issues worthy of further research in the UIE task are raised. A collection of useful materials is available at this https URL.

[CV-102] Fully Test-Time Adaptation for Monocular 3D Object Detection

Link: https://arxiv.org/abs/2405.19682
Authors: Hongbin Lin, Yifan Zhang, Shuaicheng Niu, Shuguang Cui, Zhen Li
Keywords: single RGB image, RGB image, single RGB, Mono, test
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Monocular 3D object detection (Mono 3Det) aims to identify 3D objects from a single RGB image. However, existing methods often assume training and test data follow the same distribution, which may not hold in real-world test scenarios. To address the out-of-distribution (OOD) problems, we explore a new adaptation paradigm for Mono 3Det, termed Fully Test-time Adaptation. It aims to adapt a well-trained model to unlabeled test data by handling potential data distribution shifts at test time without access to training data and test labels. However, applying this paradigm in Mono 3Det poses significant challenges due to OOD test data causing a remarkable decline in object detection scores. This decline conflicts with the pre-defined score thresholds of existing detection methods, leading to severe object omissions (i.e., rare positive detections and many false negatives). Consequently, the limited positive detection and plenty of noisy predictions cause test-time adaptation to fail in Mono 3Det. To handle this problem, we propose a novel Monocular Test-Time Adaptation (MonoTTA) method, based on two new strategies. 1) Reliability-driven adaptation: we empirically find that high-score objects are still reliable and the optimization of high-score objects can enhance confidence across all detections. Thus, we devise a self-adaptive strategy to identify reliable objects for model adaptation, which discovers potential objects and alleviates omissions. 2) Noise-guard adaptation: since high-score objects may be scarce, we develop a negative regularization term to exploit the numerous low-score objects via negative learning, preventing overfitting to noise and trivial solutions. Experimental results show that MonoTTA brings significant performance gains for Mono 3Det models in OOD test scenarios, approximately 190% gains by average on KITTI and 198% gains on nuScenes.

[CV-103] View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Link: https://arxiv.org/abs/2405.19678
Authors: Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas
Keywords: Large-scale vision foundation, demonstrate impressive performance, Large-scale vision, vision foundation models, demonstrate impressive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model’s 3D hierarchical segmentations in real-world scenes. The code and dataset are available at:
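
As a small illustration of why an ultrametric feature space yields a hierarchy, the sketch below thresholds a toy distance matrix at several scales. It is not the authors' implementation: the matrix `toy_dist`, the helper names `is_ultrametric` and `threshold_groups`, and the thresholds are all hypothetical. Under the ultrametric inequality d(x, z) <= max(d(x, y), d(y, z)), the relation "distance <= t" is transitive, so a single pass against group representatives is enough, and raising the threshold can only merge groups.

```python
def is_ultrametric(dist, tol=1e-9):
    """Check d(i, k) <= max(d(i, j), d(j, k)) for every triple of points."""
    n = len(dist)
    return all(
        dist[i][k] <= max(dist[i][j], dist[j][k]) + tol
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def threshold_groups(dist, threshold):
    """Label points so that two points share a label iff their distance <= threshold.

    This one-pass grouping is only valid when `dist` is an ultrametric,
    because only then is "distance <= threshold" an equivalence relation.
    """
    n = len(dist)
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label  # point i becomes the representative of a new group
        for j in range(i + 1, n):
            if labels[j] == -1 and dist[i][j] <= threshold:
                labels[j] = next_label
        next_label += 1
    return labels


# Hypothetical 4-point ultrametric: {0, 1} merge at 0.2, {2, 3} at 0.4, everything at 0.7.
toy_dist = [
    [0.0, 0.2, 0.7, 0.7],
    [0.2, 0.0, 0.7, 0.7],
    [0.7, 0.7, 0.0, 0.4],
    [0.7, 0.7, 0.4, 0.0],
]

for t in (0.1, 0.3, 0.5, 0.8):
    print(t, threshold_groups(toy_dist, t))
```

Each printed labeling is a coarsening of the previous one, which is the nested, multi-granularity structure the abstract describes; with a Euclidean distance the same thresholding would not be guaranteed to produce consistent groups.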

[CV-104] Knowledge-grounded Adaptation Strategy for Vision-language Models: Building Unique Case-set for Screening Mammograms for Residents Training

Link: https://arxiv.org/abs/2405.19675
Authors: Aisha Urooj Khan, John Garrett, Tyler Bradshaw, Lonie Salkowski, Jiwoong Jason Jeong, Amara Tariq, Imon Banerjee
Keywords: text pairs poses, medical contexts due, pre-trained on natural, natural images, images and text
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:A visual-language model (VLM) pre-trained on natural images and text pairs poses a significant barrier when applied to medical contexts due to domain shift. Yet, adapting or fine-tuning these VLMs for medical use presents considerable hurdles, including domain misalignment, limited access to extensive datasets, and high class imbalances. Hence, there is a pressing need for strategies to effectively adapt these VLMs to the medical domain, as such adaptations would prove immensely valuable in healthcare applications. In this study, we propose a framework designed to adeptly tailor VLMs to the medical domain, employing selective sampling and hard-negative mining techniques for enhanced performance in retrieval tasks. We validate the efficacy of our proposed approach by implementing it across two distinct VLMs: an in-domain VLM (MedCLIP) and an out-of-domain VLM (ALBEF). We assess the performance of these models both in their original off-the-shelf state and after undergoing our proposed training strategies, using two extensive datasets containing mammograms and their corresponding reports. Our evaluation spans zero-shot, few-shot, and supervised scenarios. Through our approach, we observe a notable enhancement in Recall@K performance for the image-text retrieval task.
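
Recall@K, the metric reported above, can be sketched as follows for image-text retrieval. This is a generic formulation rather than the paper's evaluation code: it assumes a hypothetical score matrix `similarity` in which query i has exactly one correct candidate, located at index i.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate appears in the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for 3 image queries against 3 report candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.7, 0.6],
]

print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
```

In this toy case only the first query ranks its match first, while all three matches fall within the top two, so Recall@K rises with K as expected.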

[CV-105] CRIS: Collaborative Refinement Integrated with Segmentation for Polyp Segmentation

Link: https://arxiv.org/abs/2405.19672
Authors: Ankush Gajanan Arudkar, Bernard J.E. Evans
Keywords: early prevention heavily, prevention heavily rely, Accurate detection, precise polyp identification, gastrointestinal colonoscopy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Accurate detection of colorectal cancer and early prevention heavily rely on precise polyp identification during gastrointestinal colonoscopy. Due to limited data, many current state-of-the-art deep learning methods for polyp segmentation often rely on post-processing of masks to reduce noise and enhance results. In this study, we propose an approach that integrates mask refinement and binary semantic segmentation, leveraging a novel collaborative training strategy that surpasses current widely-used refinement strategies. We demonstrate the superiority of our approach through comprehensive evaluation on established benchmark datasets and its successful application across various medical image segmentation architectures.

[CV-106] GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction

Link: https://arxiv.org/abs/2405.19671
Authors: Haodong Xiang, Xinghui Li, Xiansong Lai, Wanting Zhang, Zhichao Liao, Kai Cheng, Xueping Liu
Keywords: revolutionized neural rendering, Gaussian Splatting, high-quality rendering, real-time speed, neural SDF
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, 3D Gaussian Splatting (3DGS) has revolutionized neural rendering with its high-quality rendering and real-time speed. However, when it comes to indoor scenes with a significant number of textureless areas, 3DGS yields incomplete and noisy reconstruction results due to the poor initialization of the point cloud and under-constrained optimization. Inspired by the continuity of signed distance field (SDF), which naturally has advantages in modeling surfaces, we present a unified optimization framework integrating neural SDF with 3DGS. This framework incorporates a learnable neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to accurately model scenes even with poorly initialized point clouds. At the same time, the geometry represented by Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we regularize the optimization with normal and edge priors to eliminate geometry ambiguity in textureless areas and improve the details. Extensive experiments in ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis.

[CV-107] Texture-guided Coding for Deep Features

Link: https://arxiv.org/abs/2405.19669
Authors: Lei Xiong, Xin Luo, Zihao Wang, Chaofan He, Shuyuan Zhu, Bing Zeng
Keywords: machine vision technology, vision technology, machine vision, feature, feature compression
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the rapid development of machine vision technology in recent years, many researchers have begun to focus on feature compression that is better suited for machine vision tasks. The target of feature compression is deep features, which arise from convolution in the middle layer of a pre-trained convolutional neural network. However, due to the large volume of data and high level of abstraction of deep features, their application is primarily limited to machine-centric scenarios, which poses significant constraints in situations requiring human-computer interaction. This paper investigates features and textures and proposes a texture-guided feature compression strategy based on their characteristics. Specifically, the strategy comprises feature layers and texture layers. The feature layers serve the machine, including a feature selection module and a feature reconstruction network. With the assistance of texture images, they selectively compress and transmit channels relevant to visual tasks, reducing feature data while providing high-quality features for the machine. The texture layers primarily serve humans and consist of an image reconstruction network. This image reconstruction network leverages features and texture images to reconstruct preview images for humans. Our method fully exploits the characteristics of texture and features. It eliminates feature redundancy, reconstructs high-quality preview images for humans, and supports decision-making. The experimental results demonstrate excellent performance when employing our proposed method to compress the deep features.

[CV-108] AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization

Link: https://arxiv.org/abs/2405.19668
Authors: Jiawei Chen, Xiao Yang, Zhengwei Fang, Yu Tian, Yinpeng Dong, Zhaoxia Yin, Hang Su
Keywords: defense mechanisms ineffective, large language models, recent studies, mechanisms ineffective, widespread application
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Despite the widespread application of large language models (LLMs) across various tasks, recent studies indicate that they are susceptible to jailbreak attacks, which can render their defense mechanisms ineffective. However, previous jailbreak research has frequently been constrained by limited universality, suboptimal efficiency, and a reliance on manual crafting. In response, we rethink the approach to jailbreaking LLMs and formally define three essential properties from the attacker’s perspective, which contributes to guiding the design of jailbreak methods. We further introduce AutoBreach, a novel method for jailbreaking LLMs that requires only black-box access. Inspired by the versatility of wordplay, AutoBreach employs a wordplay-guided mapping rule sampling strategy to generate a variety of universal mapping rules for creating adversarial prompts. This generation process leverages LLMs’ automatic summarization and reasoning capabilities, thus alleviating the manual burden. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Additionally, we propose a two-stage mapping rule optimization strategy that initially optimizes mapping rules before querying target LLMs to enhance the efficiency of AutoBreach. AutoBreach can efficiently identify security vulnerabilities across various LLMs, including three proprietary models (Claude-3, GPT-3.5, GPT-4 Turbo) and two LLMs’ web platforms (Bingchat, GPT-4 Web), achieving an average success rate of over 80% with fewer than 10 queries.

[CV-109] CSANet: Channel Spatial Attention Network for Robust 3D Face Alignment and Reconstruction

Link: https://arxiv.org/abs/2405.19659
Authors: Yilin Liu, Xuezhou Guo, Xinqi Wang, Fangzhou Du
Keywords: Depth-wise Separable Convolution, face alignment, reconstruction network, Spatial Group-wise Enhancement, project proposes
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 10 pages, 6 figures

Abstract:Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built from a Bottle-Neck structure based on Depth-wise Separable Convolution. We integrate the Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For a more stable training process and better convergence, we jointly use Wing loss and the Weighted Parameter Distance Cost to learn the parameters of the 3D Morphable Model and the 3D vertices. Our proposed model outperforms all baseline models both quantitatively and qualitatively.
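
The Wing loss mentioned above has a simple closed form: logarithmic for small errors and linear for large ones, with an offset that keeps the two pieces continuous. The sketch below uses illustrative values for the width `w` and curvature `epsilon`; the abstract does not state which values CSANet uses, nor how the loss is weighted against the Weighted Parameter Distance Cost.

```python
import math


def wing_loss(error, w=10.0, epsilon=2.0):
    """Wing loss for a single scalar residual (e.g. one landmark coordinate).

    w and epsilon are illustrative defaults, not values taken from the abstract.
    """
    abs_err = abs(error)
    if abs_err < w:
        # Logarithmic region: amplifies the influence of small and medium errors.
        return w * math.log(1.0 + abs_err / epsilon)
    # Linear region, shifted by c so the loss is continuous at |error| == w.
    c = w - w * math.log(1.0 + w / epsilon)
    return abs_err - c


for residual in (0.0, 1.0, 10.0, 20.0):
    print(residual, wing_loss(residual))
```

In practice the per-coordinate values would be averaged over all landmarks in a batch; that reduction is left out here to keep the formula visible.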

[CV-110] Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian

Link: https://arxiv.org/abs/2405.19657
Authors: Wei Sun, Qi Zhang, Yanzhao Zhou, Qixiang Ye, Jianbin Jiao, Yuan Li
Keywords: demonstrated impressive performance, Gaussian splatting, splatting has demonstrated, demonstrated impressive, impressive performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:3D Gaussian splatting has demonstrated impressive performance in real-time novel view synthesis. However, achieving successful reconstruction from RGB images generally requires multiple input views captured under static conditions. To address the challenge of sparse input views, previous approaches have incorporated depth supervision into the training of 3D Gaussians to mitigate overfitting, using dense predictions from pretrained depth networks as pseudo-ground truth. Nevertheless, depth predictions from monocular depth estimation models inherently exhibit significant uncertainty in specific areas. Relying solely on pixel-wise L2 loss may inadvertently incorporate detrimental noise from these uncertain areas. In this work, we introduce a novel method to supervise the depth distribution of 3D Gaussians, utilizing depth priors with integrated uncertainty estimates. To address these localized errors in depth predictions, we integrate a patch-wise optimal transport strategy to complement traditional L2 loss in depth supervision. Extensive experiments conducted on the LLFF, DTU, and Blender datasets demonstrate that our approach, UGOT, achieves superior novel view synthesis and consistently outperforms state-of-the-art methods.

[CV-111] Dual sparse training framework: inducing activation map sparsity via Transformed ℓ1 regularization

Link: https://arxiv.org/abs/2405.19652
Authors: Xiaolong Yu, Cong Tian
Keywords: achieved rapid development, deep convolutional neural, convolutional neural networks, activation sparsity induction, rapid development
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Although deep convolutional neural networks have achieved rapid development, it is challenging to widely promote and apply these models on low-power devices, due to computational and storage limitations. To address this issue, researchers have proposed techniques such as model compression, activation sparsity induction, and hardware accelerators. This paper presents a method to induce the sparsity of activation maps based on Transformed ℓ1 regularization, so as to improve the research in the field of activation sparsity induction. Further, the method is innovatively combined with traditional pruning, constituting a dual sparse training framework. Compared to previous methods, Transformed ℓ1 can achieve higher sparsity and better adapt to different network structures. Experimental results show that the method achieves improvements by more than 20% in activation map sparsity on most models and corresponding datasets without compromising the accuracy. Specifically, it achieves a 27.52% improvement for ResNet18 on the ImageNet dataset, and a 44.04% improvement for LeNet5 on the MNIST dataset. In addition, the dual sparse training framework can greatly reduce the computational load and provide potential for reducing the required storage during runtime. Specifically, the ResNet18 and ResNet50 models obtained by the dual sparse training framework respectively reduce 81.7% and 84.13% of multiplicative floating-point operations, while maintaining accuracy and a low pruning rate.
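
For readers unfamiliar with the penalty, the transformed ℓ1 function is commonly written as ρ_a(x) = (a + 1)|x| / (a + |x|) with a > 0; it approaches the ℓ0 indicator as a goes to 0 and approaches |x| as a grows. The sketch below applies it to a list of activation values. The parameter value and the helper `activation_penalty` are assumptions for illustration; the abstract does not specify how the paper sets `a` or aggregates the penalty across layers.

```python
def transformed_l1(x, a=1.0):
    """Transformed l1 penalty rho_a(x) = (a + 1)|x| / (a + |x|), for a > 0."""
    abs_x = abs(x)
    return (a + 1.0) * abs_x / (a + abs_x)


def activation_penalty(activations, a=1.0):
    """Hypothetical regularizer: sum of the penalty over one activation map."""
    return sum(transformed_l1(v, a) for v in activations)


toy_activations = [0.0, 0.05, 0.5, 2.0]

print(activation_penalty(toy_activations, a=1.0))
# Small a behaves like a count of non-zeros; large a behaves like the plain l1 norm.
print(activation_penalty(toy_activations, a=1e-6))
print(activation_penalty(toy_activations, a=1e6))
```

During training such a term would be scaled by a regularization weight and added to the task loss, so that gradient descent pushes small activations toward exactly zero.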

[CV-112] FaceLift: Semi-supervised 3D Facial Landmark Localization

Link: https://arxiv.org/abs/2405.19646
Authors: David Ferman, Pablo Garrido, Gaurav Bharaj
Keywords: face tracking, face modeling, face reconstruction, facial landmark localization, face
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024

Abstract:3D facial landmark localization has proven to be of particular use for applications, such as face tracking, 3D face modeling, and image-based 3D face reconstruction. In the supervised learning case, such methods usually rely on 3D landmark datasets derived from 3DMM-based registration that often lack spatial definition alignment, as compared with that chosen by hand-labeled human consensus, e.g., how are eyebrow landmarks defined? This creates a gap between landmark datasets generated via high-quality 2D human labels and 3DMMs, and it ultimately limits their effectiveness. To address this issue, we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment, without the need for 3D landmark datasets. To lift 2D landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised learning 3D landmark localization methods on both 3DMM labeled and photogrammetric ground truth evaluation datasets. Project Page: this https URL

[CV-113] EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos

Link: https://arxiv.org/abs/2405.19644
Authors: Ryo Fujii, Masashi Hatano, Hideo Saito, Hiroki Kajita
Keywords: Surgical phase recognition, modern operating room, Surgical phase, phase recognition, open surgery video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Early accepted by MICCAI 2024

Abstract:Surgical phase recognition has gained significant attention due to its potential to offer solutions to numerous demands of the modern operating room. However, most existing methods concentrate on minimally invasive surgery (MIS), leaving surgical phase recognition for open surgery understudied. This discrepancy is primarily attributed to the scarcity of publicly available open surgery video datasets for surgical phase recognition. To address this issue, we introduce a new egocentric open surgery video dataset for phase recognition, named EgoSurgery-Phase. This dataset comprises 15 hours of real open surgery videos spanning 9 distinct surgical phases, all captured using an egocentric camera attached to the surgeon’s head. In addition to video, EgoSurgery-Phase offers eye gaze. As far as we know, it is the first publicly available real open surgery video dataset for surgical phase recognition. Furthermore, inspired by the notable success of masked autoencoders (MAEs) in video understanding tasks (e.g., action recognition), we propose a gaze-guided masked autoencoder (GGMAE). Considering that the regions where surgeons’ gaze focuses are often critical for surgical phase recognition (e.g., the surgical field), in our GGMAE the gaze information acts as an empirical semantic richness prior that guides the masking process, promoting better attention to semantically rich spatial regions. GGMAE significantly improves on the previous state-of-the-art recognition method (6.4% in Jaccard) and the masked autoencoder-based method (3.1% in Jaccard) on EgoSurgery-Phase. The dataset will be released at this https URL.
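
The Jaccard figures quoted above refer to the overlap between predicted and annotated phase labels. A minimal frame-wise sketch, assuming one integer phase label per video frame and a plain average over phases, is shown below; the exact protocol used for EgoSurgery-Phase (for example, per-video averaging) is not given in the abstract, and the label sequences are hypothetical.

```python
def phase_jaccard(pred, target, phase):
    """Intersection over union of the frames assigned to one phase."""
    pairs = list(zip(pred, target))
    intersection = sum(1 for p, t in pairs if p == phase and t == phase)
    union = sum(1 for p, t in pairs if p == phase or t == phase)
    return intersection / union


def mean_jaccard(pred, target):
    """Average the per-phase Jaccard index over every phase that occurs."""
    # Every phase in this set appears in pred or target, so its union is non-zero.
    phases = sorted(set(pred) | set(target))
    scores = [phase_jaccard(pred, target, ph) for ph in phases]
    return sum(scores) / len(scores)


# Hypothetical per-frame phase labels for a six-frame clip.
toy_target = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

print(mean_jaccard(toy_pred, toy_target))
```

Unlike frame accuracy, this score penalizes both missed frames and spurious frames for each phase, which matters when phases have very different durations.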

[CV-114] Learning Robust Correlation with Foundation Model for Weakly-Supervised Few-Shot Segmentation

Link: https://arxiv.org/abs/2405.19638
Authors: Xinyang Huang, Chuang Zhu, Kebin Liu, Ruiying Ren, Shengjie Liu
Keywords: precise pixel masks, segmenting unseen categories, learn robust correlation, segmenting unseen, Existing few-shot segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing few-shot segmentation (FSS) only considers learning support-query correlation and segmenting unseen categories under the precise pixel masks. However, the cost of a large number of pixel masks during training is expensive. This paper considers a more challenging scenario, weakly-supervised few-shot segmentation (WS-FSS), which only provides category (i.e., image-level) labels. It requires the model to learn robust support-query information when the generated mask is inaccurate. In this work, we design a Correlation Enhancement Network (CORENet) with foundation model, which utilizes multi-information guidance to learn robust correlation. Specifically, correlation-guided transformer (CGT) utilizes self-supervised ViT tokens to learn robust correlation from both local and global perspectives. From the perspective of semantic categories, the class-guided module (CGM) guides the model to locate valuable correlations through the pre-trained CLIP. Finally, the embedding-guided module (EGM) implicitly guides the model to supplement the inevitable information loss during the correlation learning by the original appearance embedding and finally generates the query mask. Extensive experiments on PASCAL-5^i and COCO-20^i have shown that CORENet exhibits excellent performance compared to existing methods.

[CV-115] YotoR-You Only Transform One Representation

Link: https://arxiv.org/abs/2405.19629
Authors: José Ignacio Díaz Villa, Patricio Loncomilla, Javier Ruiz-del-Solar
Keywords: Transform One Representation, combines Swin Transformers, Swin Transformers, deep learning model, Swin Transformer models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures

Abstract:This paper introduces YotoR (You Only Transform One Representation), a novel deep learning model for object detection that combines Swin Transformers and YoloR architectures. Transformers, a revolutionary technology in natural language processing, have also significantly impacted computer vision, offering the potential to enhance accuracy and computational efficiency. YotoR combines the robust Swin Transformer backbone with the YoloR neck and head. In our experiments, YotoR models TP5 and BP4 consistently outperform YoloR P6 and Swin Transformers in various evaluations, delivering improved object detection performance and faster inference speeds than Swin Transformer models. These results highlight the potential for further model combinations and improvements in real-time object detection with Transformers. The paper concludes by emphasizing the broader implications of YotoR, including its potential to enhance transformer-based models for image-related tasks.

[CV-116] SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Link: https://arxiv.org/abs/2405.19620
Authors: Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, Sifa Zheng
Keywords: well-established modular autonomous, suffering from information, well-established modular, system is decoupled, information loss
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g., perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird’s eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy, which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-art methods by a large margin in performance of all tasks, while achieving much higher training and inference efficiency. Code will be available at this https URL for facilitating future research.

[CV-117] SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations

Link: https://arxiv.org/abs/2405.19609
Authors: Yujiao Jiang, Qingmin Liao, Zhaolong Wang, Xiangru Lin, Zongqing Lu, Yuxi Zhao, Hanqing Wei, Jingrui Ye, Yu Zhang, Zhijing Shao
Keywords: including virtual reality, Recovering photorealistic, numerous applications, including virtual, virtual reality
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: ICME 2024; Project page: this https URL

Abstract:Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets have either images or textured models and lack parametric models which fit clothes well. We propose a new parametric model, SMPLX-Lite-D, which can fit detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present the SMPLX-Lite dataset, the most comprehensive clothing avatar dataset with multi-view RGB sequences, keypoint annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input, and generates a photorealistic drivable human avatar.

[CV-118] The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset

Link: https://arxiv.org/abs/2405.19595
Authors: Jeffrey D. Rudie, Hui-Ming Lin, Robyn L. Ball, Sabeena Jalal, Luciano M. Prevedello, Savvas Nicolaou, Brett S. Marinelli, Adam E. Flanders, Kirti Magudia, George Shih, Melissa A. Davis, John Mongan, Peter D. Chang, Ferco H. Berger, Sebastiaan Hermans, Meng Law, Tyler Richards, Jan-Peter Grunz, Andreas Steven Kunz, Shobhit Mathur, Sandro Galea-Soler, Andrew D. Chung, Saif Afat, Chin-Chi Kuo, Layal Aweidah, Ana Villanueva Campos, Arjuna Somasundaram, Felipe Antonio Sanchez Tijmes, Attaporn Jantarangkoon, Leonardo Kayat Bittencourt, Michael Brassil, Ayoub El Hajjami, Hakan Dogan, Muris Becircic, Agrahara G. Bharatkumar, Eduardo Moreno Júdice de Mattos Farina, Dataset Curator Group, Dataset Contributor Group, Dataset Annotator Group, Errol Colak
Keywords: RSNA Abdominal Traumatic, largest publicly, publicly available collection, collection of adult, dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 40 pages, 2 figures, 3 tables

Abstract:The RSNA Abdominal Traumatic Injury CT (RATIC) dataset is the largest publicly available collection of adult abdominal CT studies annotated for traumatic injuries. This dataset includes 4,274 studies from 23 institutions across 14 countries. The dataset is freely available for non-commercial use via Kaggle at this https URL. Created for the RSNA 2023 Abdominal Trauma Detection competition, the dataset encourages the development of advanced machine learning models for detecting abdominal injuries on CT scans. The dataset encompasses detection and classification of traumatic injuries across multiple organs, including the liver, spleen, kidneys, bowel, and mesentery. Annotations were created by expert radiologists from the American Society of Emergency Radiology (ASER) and Society of Abdominal Radiology (SAR). The dataset is annotated at multiple levels, including the presence of injuries in three solid organs with injury grading, image-level annotations for active extravasations and bowel injury, and voxelwise segmentations of each of the potentially injured organs. With the release of this dataset, we hope to facilitate research and development in machine learning and abdominal trauma that can lead to improved patient care and outcomes.

[CV-119] SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Link: https://arxiv.org/abs/2405.19586
Authors: Junjie Zhang, Chenjia Bai, Haoran He, Wenke Xia, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li
Keywords: Acquiring a multi-task, multi-task imitation policy, manipulation poses challenges, challenges in terms, poses challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICML 2024. Project page: this https URL

Abstract:Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot’s end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.

[CV-120] Blind Image Restoration via Fast Diffusion Inversion

Link: https://arxiv.org/abs/2405.19572
Authors: Hamadi Chihaoui, Abdelhak Lemkhenter, Paolo Favaro
Keywords: solve Image Restoration, Image Restoration, diffusion model leading, Blind Image Restoration, image restoration tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, various methods have been proposed to solve Image Restoration (IR) tasks using a pre-trained diffusion model leading to state-of-the-art performance. However, most of these methods assume that the degradation operator in the IR task is completely known. Furthermore, a common characteristic among these approaches is that they alter the diffusion sampling process in order to satisfy the consistency with the degraded input image. This choice has recently been shown to be sub-optimal and to cause the restored image to deviate from the data manifold. To address these issues, we propose Blind Image Restoration via fast Diffusion inversion (BIRD), a blind IR method that jointly optimizes for the degradation model parameters and the restored image. To ensure that the restored images lie on the data manifold, we propose a novel sampling technique on a pre-trained diffusion model. A key idea in our method is not to modify the reverse sampling, i.e., not to alter all the intermediate latents, once an initial noise is sampled. This is ultimately equivalent to casting the IR task as an optimization problem in the space of the input noise. Moreover, to mitigate the computational cost associated with inverting a fully unrolled diffusion model, we leverage the inherent capability of these models to skip ahead in the forward diffusion process using large time steps. We experimentally validate BIRD on several image restoration tasks and show that it achieves state-of-the-art performance on all of them. Our code is available at this https URL.

[CV-121] Improved Convex Decomposition with Ensembling and Boolean Primitives

Link: https://arxiv.org/abs/2405.19569
Authors: Vaibhav Vavilala, Florian Kluger, Seemandhar Jain, Bodo Rosenhahn, David Forsyth
Keywords: geometrically simple shapes, established vision problem, geometrically simple, abstraction of structure, simple shapes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures, 5 tables

Abstract:Describing a scene in terms of primitives – geometrically simple shapes that offer a parsimonious but accurate abstraction of structure – is an established vision problem. This is a good model of a difficult fitting problem: different scenes require different numbers of primitives and primitives interact strongly, but any proposed solution can be evaluated at inference time. The state of the art method involves a learned regression procedure to predict a start point consisting of a fixed number of primitives, followed by a descent method to refine the geometry and remove redundant primitives. Methods are evaluated by accuracy in depth and normal prediction and in scene segmentation. This paper shows that very significant improvements in accuracy can be obtained by (a) incorporating a small number of negative primitives and (b) ensembling over a number of different regression procedures. Ensembling is by refining each predicted start point, then choosing the best by fitting loss. Extensive experiments on a standard dataset confirm that negative primitives are useful in a large fraction of images, and that our refine-then-choose strategy outperforms choose-then-refine, confirming that the fitting problem is very difficult.

[CV-122] Organizing Background to Explore Latent Classes for Incremental Few-shot Semantic Segmentation

Link: https://arxiv.org/abs/2405.19568
Authors: Lianlei Shan,Wenzhang Zhou,Wei Li,Xingyu Ding
Keywords: Few-shot Semantic Segmentation, incremental Few-shot Semantic, extend pre-trained segmentation, Few-shot Semantic, Semantic Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:The goal of incremental Few-shot Semantic Segmentation (iFSS) is to extend pre-trained segmentation models to new classes via few annotated images without access to old training data. While novel classes are learned incrementally, the data distribution of old classes is destroyed, leading to catastrophic forgetting. Meanwhile, the novel classes have only a few samples, making it impossible for models to learn satisfactory representations of them. For the iFSS problem, we propose a network called OINet, i.e., the background embedding space Organization and prototype Inherit Network. Specifically, when training base classes, OINet uses multiple classification heads for the background and sets multiple sub-class prototypes to reserve embedding space for the latent novel classes. When learning novel classes incrementally, we propose a strategy to select the sub-class prototypes that best match the novel classes currently being learned and make the novel classes inherit the selected prototypes' embedding space. This operation allows the novel classes to be registered in the embedding space using few samples without affecting the distribution of the base classes. Results on Pascal-VOC and COCO show that OINet achieves a new state of the art.

[CV-123] Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding

Link: https://arxiv.org/abs/2405.19567
Authors: Shenghuan Sun,Gregory M. Goldgof,Alexander Schubert,Zhiqing Sun,Thomas Hartvigsen,Atul J. Butte,Ahmed Alaa
Keywords: natural language interactions, treatment tasks, support clinicians, images and engaging, engaging in natural
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code available at: this https URL

Abstract:Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit “hallucinogenic” behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we do not only require VLM outputs to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.

[CV-124] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Link: https://arxiv.org/abs/2405.19547
Authors: Yiping Wang,Yifang Chen,Wendan Yan,Alex Fang,Wenjing Zhou,Kevin Jamieson,Simon Shaolei Du
Keywords: noisy web-curated datasets, visual-language model pretraining, large-scale visual-language model, Data selection, CLIP
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper supersedes our previous VAS paper (arXiv:2402.02055)

Abstract:Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark DataComp (Gadre et al., 2023). Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN (Fang et al., 2023) and HYPE (Kim et al., 2024), we can boost average performance on downstream tasks by 0.9%, achieving a new state of the art.
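To make the "extra normalization term" concrete, here is a minimal sketch that scores each image-text pair by the negative of its per-sample symmetric CLIP loss. The abstract does not give the exact formula, batching, or temperature, so the function `neg_clip_loss`, the temperature `tau=0.07`, and the toy similarity matrix are assumptions for illustration rather than the authors' implementation.

```python
import math


def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def neg_clip_loss(sim, tau=0.07):
    """Per-sample score: alignment term minus a contrastive normalizer.

    sim[i][j] is the cosine similarity between image i and text j in a batch.
    The first term is the usual CLIPScore-style alignment sim[i][i] / tau;
    the second averages the image-to-text and text-to-image log-normalizers.
    """
    n = len(sim)
    scores = []
    for i in range(n):
        row = [sim[i][j] / tau for j in range(n)]  # image i vs all texts
        col = [sim[j][i] / tau for j in range(n)]  # text i vs all images
        scores.append(sim[i][i] / tau - 0.5 * (logsumexp(row) + logsumexp(col)))
    return scores


# Toy batch: pairs 0 and 1 have the same diagonal similarity (0.30), but
# pair 1 is almost as similar to its negatives as to its own match.
sim = [
    [0.30, 0.05, 0.02],
    [0.04, 0.30, 0.28],
    [0.03, 0.29, 0.30],
]
scores = neg_clip_loss(sim)
print([round(s, 3) for s in scores])
```

A plain CLIPScore would rank pairs 0 and 1 equally here, while the normalized score penalizes pair 1 because its similarity to other samples in the batch is nearly as high as to its own match.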

[CV-125] CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts Images and Patients

Link: https://arxiv.org/abs/2405.19538
Authors: Pierre Chambon,Jean-Benoit Delbrouck,Thomas Sounack,Shih-Cheng Huang,Zhihong Chen,Maya Varma,Steven QH Truong,Chu The Chuong,Curtis P. Langlotz
Keywords: original CheXpert paper, years ago, paper five years, original CheXpert, CheXpert paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: this https URL Models are available at the following URL: this https URL

[CV-126] Lifelong Learning Using a Dynamically Growing Tree of Sub-networks for Domain Generalization in Video Object Segmentation

Link: https://arxiv.org/abs/2405.19525
Authors: Islam Osman,Mohamed S. Shehata
Keywords: achieved great success, massive labeled training, video object segmentation, labeled training datasets, video object
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current state-of-the-art video object segmentation models have achieved great success using supervised learning with massive labeled training datasets. However, these models are trained using a single source domain and evaluated using videos sampled from the same source domain. When these models are evaluated using videos sampled from a different target domain, their performance degrades significantly due to poor domain generalization, i.e., their inability to learn from multi-domain sources simultaneously using traditional supervised learning. In this paper, we propose a dynamically growing tree of sub-networks (DGT) to learn effectively from multi-domain sources. DGT uses a novel lifelong learning technique that allows the model to continuously and effectively learn from new domains without forgetting the previously learned domains. Hence, the model can generalize to out-of-domain videos. The proposed work is evaluated using single-source in-domain (traditional video object segmentation), multi-source in-domain, and multi-source out-of-domain video object segmentation. The results of DGT show a single-source in-domain performance gain of 0.2% and 3.5% on the DAVIS16 and DAVIS17 datasets, respectively. However, when DGT is evaluated using in-domain multi-sources, the results show superior performance compared to state-of-the-art video object segmentation and other lifelong learning techniques, with an average performance increase in the F-score of 6.9% with minimal catastrophic forgetting. Finally, in the out-of-domain experiment, the performance of DGT is 2.7% and 4% better than the state of the art in the 1-shot and 5-shot settings, respectively.

[CV-127] MDS-ViTNet: Improving saliency prediction for Eye-Tracking with Vision Transformer

Link: https://arxiv.org/abs/2405.19501
Authors: Polezhaev Ignat,Goncharenko Igor,Iurina Natalya
Keywords: Multi Decoder Saliency, visual saliency prediction, enhancing visual saliency, Vision Transformer Network, Vision Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In this paper, we present a novel methodology we call MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network) for enhancing visual saliency prediction or eye-tracking. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure, with the encoder utilizing a Swin transformer to efficiently embed most important features. This process involves a Transfer Learning method, wherein layers from the Vision Transformer are converted by the Encoder Transformer and seamlessly integrated into a CNN Decoder. This methodology ensures minimal information loss from the original input image. The decoder employs a multi-decoding technique, utilizing dual decoders to generate two distinct attention maps. These maps are subsequently combined into a singular output via an additional CNN model. Our trained model MDS-ViTNet achieves state-of-the-art results across several benchmarks. Committed to fostering further collaboration, we intend to make our code, models, and datasets accessible to the public.

[CV-128] RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

Link: https://arxiv.org/abs/2405.19465
Authors: Meng Cao,Haoran Tang,Jinfa Huang,Peng Jin,Can Zhang,Ruyang Liu,Long Chen,Xiaodan Liang,Li Yuan,Ge Li
Keywords: natural language queries, align relevant video, relevant video content, aims to align, language queries
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACL 2024 Findings

Abstract:Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.

[CV-129] Clustering-Based Validation Splits for Domain Generalisation

Link: https://arxiv.org/abs/2405.19461
Authors: Andrea Napoli,Paul White
Keywords: model selection, domain shift, problem of model, validation sets increases, validation sets
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper considers the problem of model selection under domain shift. In this setting, it is proposed that a high maximum mean discrepancy (MMD) between the training and validation sets increases the generalisability of selected models. A data splitting algorithm based on kernel k-means clustering, which maximises this objective, is presented. The algorithm leverages linear programming to control the size, label, and (optionally) group distributions of the splits, and comes with convergence guarantees. The technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation (DG) and unsupervised domain adaptation (UDA) tasks. Analysis also shows the MMD between the training and validation sets to be strongly rank-correlated ($\rho=0.63$) with test domain accuracy, further substantiating the validity of this approach.
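The quantity being maximised above, the MMD between a training split and a validation split, can be estimated directly from samples. The sketch below uses the standard biased estimator of squared MMD with an RBF kernel; the kernel choice, the bandwidth `gamma`, and the toy 2D points are assumptions for illustration, and the paper's kernel k-means splitting algorithm itself is not reproduced here.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    # Gaussian (RBF) kernel between two points given as tuples of floats.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""

    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy data: one training cluster and two candidate validation splits.
train = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1)]
val_near = [(0.05, 0.0), (0.0, 0.05), (-0.05, -0.05)]
val_far = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]

print("MMD^2 to nearby split:", mmd_squared(train, val_near))
print("MMD^2 to distant split:", mmd_squared(train, val_far))
```

Under the abstract's hypothesis, the distant split (higher MMD) would be the more informative validation set for selecting models that must generalise to shifted test domains.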

[CV-130] MemControl: Mitigating Memorization in Medical Diffusion Models via Automated Parameter Selection

Link: https://arxiv.org/abs/2405.19458
Authors: Raman Dutt,Pedro Sanchez,Ondrej Bohdal,Sotirios A. Tsaftaris,Timothy Hospedales
Keywords: Diffusion models show, Diffusion models, remarkable ability, ability in generating, closely mirror
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Diffusion models show a remarkable ability in generating images that closely mirror the training distribution. However, these models are prone to training data memorization, leading to significant privacy, ethical, and legal concerns, particularly in sensitive fields such as medical imaging. We hypothesize that memorization is driven by the overparameterization of deep models, suggesting that regularizing model capacity during fine-tuning could be an effective mitigation strategy. Parameter-efficient fine-tuning (PEFT) methods offer a promising approach to capacity control by selectively updating specific parameters. However, finding the optimal subset of learnable parameters that balances generation quality and memorization remains elusive. To address this challenge, we propose a bi-level optimization framework that guides automated parameter selection by utilizing memorization and generation quality metrics as rewards. Our framework successfully identifies the optimal parameter set to be updated to satisfy the generation-memorization tradeoff. We perform our experiments for the specific task of medical image generation and outperform existing state-of-the-art training-time mitigation strategies by fine-tuning as few as 0.019% of model parameters. Furthermore, we show that the strategies learned through our framework are transferable across different datasets and domains. Our proposed framework is scalable to large datasets and agnostic to the choice of reward functions. Finally, we show that our framework can be combined with existing approaches for further memorization mitigation.

[CV-131] FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining

Link: https://arxiv.org/abs/2405.19450
Authors: Dong Li,Yidi Liu,Xueyang Fu,Senyan Xu,Zheng-Jun Zha
Keywords: restore clear backgrounds, Fourier space, Image deraining, Image deraining aims, remove rain streaks
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, because it acts as an effective frequency prior for capturing rain streaks. However, although low and high frequencies in images depend on each other, these Fourier-based methods rarely exploit the correlation between different frequencies to couple their learning procedures, limiting the full utilization of frequency information for image deraining. Alternatively, the recently emerged Mamba technique has shown effectiveness and efficiency for modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into the so far unexplored Fourier space to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-to-high frequency ordering differs between the spatial dimension (unarranged along the axis) and the channel dimension (arranged along the axis). Therefore, we design FourierMamba to correlate Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial dimension of Fourier space, we introduce zigzag coding to scan the frequencies and rearrange them from low to high, thereby correlating the connections between frequencies in order; in the channel dimension of Fourier space, where the frequencies are already arranged in order along the axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation.
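As a rough illustration of scanning a 2D frequency grid from low to high frequencies, the sketch below generates a JPEG-style zigzag order over an array whose (0, 0) entry is the lowest frequency. The abstract does not specify the exact scan used by FourierMamba, so the function `zigzag_order`, the traversal direction, and the assumption that frequency grows with both indices (as in a DCT-like layout, not a raw wrapped FFT) are all hypothetical.

```python
def zigzag_order(height, width):
    """Return (row, col) indices of a height x width grid in zigzag order.

    Cells are visited anti-diagonal by anti-diagonal, so the index sum
    row + col (a proxy for frequency here) never decreases along the scan.
    """
    order = []
    for s in range(height + width - 1):
        # All cells on the anti-diagonal row + col == s, top-right first.
        diagonal = [(u, s - u) for u in range(height) if 0 <= s - u < width]
        if s % 2 == 0:
            # Alternate direction on every other diagonal to get the zigzag.
            diagonal.reverse()
        order.extend(diagonal)
    return order


order = zigzag_order(3, 3)
print(order)
```

Flattening the spectrum with such an ordering yields a 1D sequence in which neighbouring tokens have similar frequencies, which is the kind of input a sequence model such as Mamba consumes.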

[CV-132] Large-scale DSM registration via motion averaging

Link: https://arxiv.org/abs/2405.19442
Authors: Ningli Xu,Rongjun Qin
Keywords: Generating wide-area digital, digital surface models, wide-area digital surface, Generating wide-area, surface models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 Figures

Abstract:Generating wide-area digital surface models (DSMs) requires registering a large number of individual, partially overlapping DSMs. This presents a challenging problem for a typical registration algorithm, since considering a large number of observations from these multiple DSMs at once may easily cause memory overflow. Sequential registration algorithms, although they can significantly reduce the computation, are especially vulnerable to pairs with small overlap, leading to a large error accumulation. In this work, we propose a novel solution that builds the DSM registration task as a motion averaging problem: pair-wise DSMs are registered to build a scene graph, with edges representing relative poses between DSMs. Specifically, based on the grid structure of the large DSM, the pair-wise registration is performed using a novel nearest neighbor search method. We show that the scene graph can be optimized via an extremely fast motion averaging algorithm with O(N) complexity (N refers to the number of images). Evaluation on high-resolution satellite-derived DSMs demonstrates significant improvement in computation and accuracy.

[CV-133] Conformal Recursive Feature Elimination

Link: https://arxiv.org/abs/2405.19429
Authors: Marcos López-De-Castro(1 and 2),Alberto García-Galindo(1 and 2),Rubén Armañanzas(1 and 2) ((1) DATAI - Institute of Data Science and Artificial Intelligence, Universidad de Navarra, Pamplona, Spain,(2) TECNUN School of Engineering, Universidad de Navarra, Donostia-San Sebastian, Spain)
Keywords: Unlike traditional statistical, individual predictions based, accurate confidence levels, Recursive Feature Elimination, Unlike traditional
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Unlike traditional statistical methods, Conformal Prediction (CP) allows for the determination of valid and accurate confidence levels associated with individual predictions based only on exchangeability of the data. We here introduce a new feature selection method that takes advantage of the CP framework. Our proposal, named Conformal Recursive Feature Elimination (CRFE), identifies and recursively removes features that increase the non-conformity of a dataset. We also present an automatic stopping criterion for CRFE, as well as a new index to measure consistency between subsets of features. CRFE selections are compared to the classical Recursive Feature Elimination (RFE) method on several multiclass datasets by using multiple partitions of the data. The results show that CRFE clearly outperforms RFE in half of the datasets, while achieving similar performance in the rest. The automatic stopping criterion provides subsets of effective and non-redundant features without computing any classification performance.
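The Conformal Prediction building block that CRFE relies on can be sketched in a few lines. This is the generic split-conformal p-value, not the CRFE feature-elimination procedure: the function names, the calibration scores, and the candidate labels below are hypothetical, and how nonconformity scores are computed from a classifier is left out.

```python
def conformal_p_value(calibration_scores, test_score):
    """Split-conformal p-value for one candidate label.

    Counts how many calibration nonconformity scores are at least as large
    as the test score; the +1 terms account for the test point itself.
    """
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_scores, candidate_scores, epsilon=0.1):
    # Keep every label whose p-value exceeds the significance level epsilon.
    return {
        label
        for label, score in candidate_scores.items()
        if conformal_p_value(calibration_scores, score) > epsilon
    }


# Hypothetical nonconformity scores from a held-out calibration set.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
# Hypothetical scores of one test example under each candidate label.
candidates = {"cat": 0.15, "dog": 0.95}

print(conformal_p_value(calibration, 0.15))
print(prediction_set(calibration, candidates, epsilon=0.1))
```

Under exchangeability, a set built this way contains the true label with probability at least 1 - epsilon, which is the validity property the abstract refers to.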

[CV-134] Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies

Link: https://arxiv.org/abs/2405.19424
Authors: Yipu Chen,Haotian Xue,Yongxin Chen
Keywords: behavior cloning, promising approach, approach for behavior, Diffusion, Diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models (DMs) have emerged as a promising approach for behavior cloning (BC). Diffusion policies (DP) based on DMs have elevated BC performance to new heights, demonstrating robust efficacy across diverse tasks, coupled with their inherent flexibility and ease of implementation. Despite the increasing adoption of DP as a foundation for policy generation, the critical issue of safety remains largely unexplored. While previous attempts have targeted deep policy networks, DP uses diffusion models as the policy network, and its chained structure and injected randomness make previous attack methods ineffective against it. In this paper, we undertake a comprehensive examination of DP safety concerns by introducing adversarial scenarios, encompassing offline and online attacks, and global and patch-based attacks. We propose DP-Attacker, a suite of algorithms that can craft effective adversarial attacks across all aforementioned scenarios. We conduct attacks on pre-trained diffusion policies across various manipulation tasks. Through extensive experiments, we demonstrate that DP-Attacker has the capability to significantly decrease the success rate of DP for all scenarios. Particularly in offline scenarios, DP-Attacker can generate highly transferable perturbations applicable to all frames. Furthermore, we illustrate the creation of adversarial physical patches that, when applied to the environment, effectively deceive the model. Video results are put in: this https URL.

[CV-135] Evaluating Vision-Language Models on Bistable Images

Link: https://arxiv.org/abs/2405.19423
Authors: Artemis Panagopoulou,Coby Melkin,Chris Callison-Burch
Keywords: present visual stimuli, Bistable images, present visual, ambiguous or reversible, visual stimuli
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors significantly affect model interpretations more than image manipulations showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data is open sourced.

[CV-136] VisTA-SR: Improving the Accuracy and Resolution of Low-Cost Thermal Imaging Cameras for Agriculture

Link: https://arxiv.org/abs/2405.19413
Authors: Heesup Yun,Sassoum Lo,Christine H. Diepenbrock,Brian N. Bailey,J. Mason Earles
Keywords: low-cost thermal cameras, Thermal cameras, important photochemical, Thermal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Thermal cameras are an important tool for agricultural research because they allow for non-invasive measurement of plant temperature, which relates to important photochemical, hydraulic, and agronomic traits. Utilizing low-cost thermal cameras can lower the barrier to introducing thermal imaging in agricultural research and production. This paper presents an approach to improve the temperature accuracy and image quality of low-cost thermal imaging cameras for agricultural applications. Leveraging advancements in computer vision techniques, particularly deep learning networks, we propose a method, called VisTA-SR (Visual & Thermal Alignment and Super-Resolution Enhancement) that combines RGB and thermal images to enhance the capabilities of low-resolution thermal cameras. The research includes calibration and validation of temperature measurements, acquisition of paired image datasets, and the development of a deep learning network tailored for agricultural thermal imaging. Our study addresses the challenges of image enhancement in the agricultural domain and explores the potential of low-cost thermal cameras to replace high-resolution industrial cameras. Experimental results demonstrate the effectiveness of our approach in enhancing temperature accuracy and image sharpness, paving the way for more accessible and efficient thermal imaging solutions in agriculture.

[CV-137] Video Anomaly Detection in 10 Years: A Survey and Outlook

Link: https://arxiv.org/abs/2405.19387
Authors: Moshira Abdalla,Sajid Javed,Muaz Al Radi,Anwaar Ulhaq,Naoufel Werghi
Keywords: holds immense importance, holds immense, environmental monitoring, immense importance, importance across diverse
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms including large-scale datasets, features extraction, learning methods, loss functions, regularization, and anomaly score prediction. Moreover, this review also investigates the vision language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems leveraging the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.

[CV-138] DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

Link: https://arxiv.org/abs/2405.15306
Authors: Jonas Belouadi,Simone Paolo Ponzetto,Steffen Eger
Keywords: Creating high-quality scientific, Creating high-quality, high-quality scientific figures, scientific figures, time-consuming and challenging
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex. To tackle this problem, we introduce DeTikZify, a novel multimodal language model that automatically synthesizes scientific figures as semantics-preserving TikZ graphics programs based on sketches and existing figures. To achieve this, we create three new datasets: DaTikZv2, the largest TikZ dataset to date, containing over 360k human-created TikZ graphics; SketchFig, a dataset that pairs hand-drawn sketches with their corresponding scientific figures; and SciCap++, a collection of diverse scientific figures and associated metadata. We train DeTikZify on SciCap++ and DaTikZv2, along with synthetically generated sketches learned from SketchFig. We also introduce an MCTS-based inference algorithm that enables DeTikZify to iteratively refine its outputs without the need for additional training. Through both automatic and human evaluation, we demonstrate that DeTikZify outperforms commercial Claude 3 and GPT-4V in synthesizing TikZ programs, with the MCTS algorithm effectively boosting its performance. We make our code, models, and datasets publicly available.

[CV-139] Quantum Visual Feature Encoding Revisited

Link: https://arxiv.org/abs/2405.19725
Authors: Xuan-Bac Nguyen,Hoang-Quan Nguyen,Hugh Churchill,Samee U. Khan,Khoa Luu
Keywords: quantum machine learning, machine learning, quantum machine, machine learning algorithms, quantum
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Although quantum machine learning has been around for a while, its applications in computer vision are still limited. This paper, therefore, revisits the quantum visual encoding strategies, the initial step in quantum machine learning. Investigating the root cause, we uncover that the existing quantum encoding design fails to ensure information preservation of the visual features after the encoding process, thus complicating the learning process of the quantum machine learning models. In particular, the problem, termed “Quantum Information Gap” (QIG), leads to a gap of information between classical and corresponding quantum features. We provide theoretical proof and practical demonstrations of this finding and underscore the significance of QIG, as it directly impacts the performance of quantum machine learning algorithms. To tackle this challenge, we introduce a simple but efficient new loss function named Quantum Information Preserving (QIP) to minimize this gap, resulting in enhanced performance of quantum machine learning algorithms. Extensive experiments validate the effectiveness of our approach, showcasing superior performance compared to current methodologies and consistently achieving state-of-the-art results in quantum modeling.

[CV-140] Enabling Visual Recognition at Radio Frequency

Link: https://arxiv.org/abs/2405.19516
Authors: Haowen Lai,Gaoxiang Luo,Yifei Liu,Mingmin Zhao
Keywords: paper introduces PanoRadar, paper introduces, providing resilience, resilience against conditions, conditions challenging
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar’s robust performance across 12 buildings.

[CV-141] TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR images

Link: https://arxiv.org/abs/2405.19492
Authors: Tugba Akinci D’Antonoli,Lucas K. Berger,Ashraya K. Indrakanti,Nathan Vishwanathan,Jakob Weiß,Matthias Jung,Zeynep Berkarda,Alexander Rau,Marco Reisert,Thomas Küstner,Alexandra Walter,Elmar M. Merkle,Martin Segeroth,Joshy Cyriac,Shan Yang,Jakob Wasserthal
Keywords: automatically and robustly, Dice score, Dice, Purpose, anatomical structures
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Purpose: To develop an open-source and easy-to-use segmentation model that can automatically and robustly segment most major anatomical structures in MR images independently of the MR sequence. Materials and Methods: In this study we extended the capabilities of TotalSegmentator to MR images. 298 MR scans and 227 CT scans were used to segment 59 anatomical structures (20 organs, 18 bones, 11 muscles, 7 vessels, 3 tissue types) relevant for use cases such as organ volumetry, disease characterization, and surgical planning. The MR and CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, contrasts, echo times, repetition times, field strengths, slice thicknesses and sites). We trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model’s performance. Results: The model showed a Dice score of 0.824 (CI: 0.801, 0.842) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed two other publicly available segmentation models (Dice score, 0.824 versus 0.762; p < 0.001 and 0.762 versus 0.542; p < 0.001). On the CT image test set of the original TotalSegmentator paper it almost matches the performance of the original TotalSegmentator (Dice score, 0.960 versus 0.970; p < 0.001). Conclusion: Our proposed model extends the capabilities of TotalSegmentator to MR images. The annotated dataset (this https URL) and open-source toolkit (this https URL) are publicly available.
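
The Dice similarity coefficient used for evaluation above is defined for a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that formula on flat binary masks; the function name, the toy masks, and the empty-mask convention are illustrative assumptions and are not taken from the TotalSegmentator toolkit.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (for a 3D volume, flatten the voxels first).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels in each mask, added together.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 overlapping voxels, 3 foreground voxels in each mask.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(pred, truth))
```

A score of 1.0 means perfect overlap and 0.0 means no overlap, so the reported 0.824 corresponds to substantial but imperfect agreement with the reference segmentation.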

[CV-142] Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame Attention

Link: https://arxiv.org/abs/2405.19349
Authors: Shuai Shao,Yu Guan,Victor Sanchez
Keywords: Human Activity Recognition, Activity Recognition, Convolutional Neural Networks, Human Activity, ubiquitous computing
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.

[CV-143] Accurate Patient Alignment without Unnecessary Imaging Dose via Synthesizing Patient-specific 3D CT Images from 2D kV Images

Link: https://arxiv.org/abs/2405.19338
Authors: Yuzhen Ding,Jason M. Holmes,Hongying Feng,Baoxin Li,Lisa A. McGee,Jean-Claude M. Rwigema,Sujay A. Vora,Daniel J. Ma,Robert L. Foote,Samir H. Patel,Wei Liu
Keywords: orthogonally projected, OBI, patient, patient alignment, Abstract
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures and tables

Abstract:In radiotherapy, 2D orthogonally projected kV images are used for patient alignment when 3D on-board imaging (OBI) is unavailable. However, tumor visibility is constrained by the projection of the patient’s anatomy onto a 2D plane, potentially leading to substantial setup errors. In treatment rooms with 3D-OBI such as cone beam CT (CBCT), the field of view (FOV) of CBCT is limited and the imaging dose is unnecessarily high, which is unfavorable for pediatric patients. A solution to this dilemma is to reconstruct 3D CT from kV images obtained at the treatment position. Here, we propose a dual-models framework built with hierarchical ViT blocks. Unlike a proof-of-concept approach, our framework takes kV images as the sole input and can synthesize accurate, full-size 3D CT in real time (within milliseconds). We demonstrate the feasibility of the proposed approach on 10 patients with head and neck (HN) cancer using image quality (MAE: 45HU), dosimetrical accuracy (Gamma passing rate (2%/2mm/10%)97%) and patient position uncertainty (shift error: 0.4mm). The proposed framework can generate accurate 3D CT faithfully mirroring real-time patient position, thus significantly improving patient setup accuracy, keeping imaging dose to a minimum, and maintaining treatment veracity.

Machine Learning

[LG-0] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

Link: https://arxiv.org/abs/2405.20343
Authors: Kailu Wu,Fangfu Liu,Zhihan Cai,Runjie Yan,Hanyang Wang,Yating Hu,Yueqi Duan,Kaisheng Ma
Keywords: efficiently generating high-quality, Score Distillation Sampling, generating high-quality, meshes from single-view, strong generalizability
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time with inconsistent issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.

[LG-1] From Zero to Hero: Cold-Start Anomaly Detection

Link: https://arxiv.org/abs/2405.20341
Authors: Tal Reiss,George Kour,Naama Zwerdling,Ateret Anaby-Tavor,Yedid Hoshen
Keywords: making data-driven approaches, data-driven approaches ineffective, anomaly detection system, queries in chatbots, observed data
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: ACL 2024. Our code is available at this https URL

Abstract:When first deploying an anomaly detection system, e.g., to detect out-of-scope queries in chatbots, there are no observed data, making data-driven approaches ineffective. Zero-shot anomaly detection methods offer a solution to such “cold-start” cases, but unfortunately they are often not accurate enough. This paper studies the realistic but underexplored cold-start setting where an anomaly detection model is initialized using zero-shot guidance, but subsequently receives a small number of contaminated observations (namely, that may include anomalies). The goal is to make efficient use of both the zero-shot guidance and the observations. We propose ColdFusion, a method that effectively adapts the zero-shot anomaly detector to contaminated observations. To support future development of this new setting, we propose an evaluation suite consisting of evaluation protocols and metrics.

[LG-2] CoSy: Evaluating Textual Explanations of Neurons

Link: https://arxiv.org/abs/2405.20331
Authors: Laura Kopf,Philine Lou Bommer,Anna Hedström,Sebastian Lapuschkin,Marina M.-C. Höhne,Kirill Bykov
Keywords: Deep Neural Networks, Neural Networks, Deep Neural, explain learned concepts, nature of Deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:A crucial aspect of understanding the complex nature of Deep Neural Networks (DNNs) is the ability to explain learned concepts within their latent representations. While various methods exist to connect neurons to textual descriptions of human-understandable concepts, evaluating the quality of these explanation methods presents a major challenge in the field due to a lack of unified, general-purpose quantitative evaluation. In this work, we introduce CoSy (Concept Synthesis) – a novel, architecture-agnostic framework to evaluate the quality of textual explanations for latent neurons. Given textual explanations, our proposed framework leverages a generative model conditioned on textual input to create data points representing the textual explanation. Then, the neuron’s response to these explanation data points is compared with the response to control data points, providing a quality estimate of the given explanation. We ensure the reliability of our proposed framework in a series of meta-evaluation experiments and demonstrate practical value through insights from benchmarking various concept-based textual explanation methods for Computer Vision tasks, showing that tested explanation methods significantly differ in quality.

[LG-3] Don't drop your samples! Coherence-aware training benefits Conditional diffusion

Link: https://arxiv.org/abs/2405.20324
Authors: Nicolas Dufour,Victor Besnier,Vicky Kalogeiton,David Picard
Keywords: conditional information, powerful generative models, segmentation masks, Conditional, class labels
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2024 as a Highlight. Project page: this https URL

Abstract:Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many real-world scenarios, conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a novel method that integrates coherence in conditional information into diffusion models, allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated coherence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the coherence score. In this way, the model learns to ignore or discount the conditioning when the coherence is low. We show that CAD is theoretically sound and empirically effective on various conditional generation tasks. Moreover, we show that leveraging coherence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low coherence have been discarded.

[LG-4] Vision-based Manipulation from Single Human Video with Open-World Object Graphs

Link: https://arxiv.org/abs/2405.20321
Authors: Yifeng Zhu,Arisrei Lim,Peter Stone,Yuke Zhu
Keywords: vision-based manipulation skills, single human video, learn vision-based manipulation, approach to empower, single human
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos. We investigate the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world. Videos can be found in the project website this https URL.

[LG-5] Improving the Training of Rectified Flows

Link: https://arxiv.org/abs/2405.20320
Authors: Sangyun Lee,Zinan Lin,Giulia Fanti
Keywords: shown great promise, expensive numerical integration, Diffusion models, models requires expensive, requires expensive numerical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 72% in the 1 NFE setting on CIFAR-10. On ImageNet 64×64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at this https URL.

[LG-6] CausalQuest: Collecting Natural Causal Questions for AI Agents

Link: https://arxiv.org/abs/2405.20318
Authors: Roberto Ceraolo,Dmitrii Kharlapenko,Amélie Reymond,Rada Mihalcea,Mrinmaya Sachan,Bernhard Schölkopf,Zhijing Jin
Keywords: innate drive, drive to seek, questions, causal questions, dataset
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)

Abstract:Humans have an innate drive to seek out causality. Whether fuelled by curiosity or specific goals, we constantly question why things happen, how they are interconnected, and many other related phenomena. To develop AI agents capable of addressing this natural human quest for causality, we urgently need a comprehensive dataset of natural causal questions. Unfortunately, existing datasets either contain only artificially-crafted questions that do not reflect real AI usage scenarios or have limited coverage of questions from specific sources. To address this gap, we present CausalQuest, a dataset of 13,500 naturally occurring questions sourced from social networks, search engines, and AI assistants. We formalize the definition of causal questions and establish a taxonomy for finer-grained classification. Through a combined effort of human annotators and large language models (LLMs), we carefully label the dataset. We find that 42% of the questions humans ask are indeed causal, with the majority seeking to understand the causes behind given effects. Using this dataset, we train efficient classifiers (up to 2.85B parameters) for the binary task of identifying causal questions, achieving high performance with F1 scores of up to 0.877. We conclude with a rich set of future research directions that can build upon our data and models.

[LG-7] Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Link: https://arxiv.org/abs/2405.20313
Authors: Guillaume Huguet,James Vuckovic,Kilian Fatras,Eric Thibodeau-Laufer,Pablo Lemos,Riashat Islam,Cheng-Hao Liu,Jarrid Rector-Brooks,Tara Akhound-Sadegh,Michael Bronstein,Alexander Tong,Avishek Joey Bose
Keywords: amino acid sequences, amino acid, functions from complex, processes and derive, derive their diverse
Categories: Machine Learning (cs.LG)
Comments: preprint

Abstract:Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples – crucial for de-novo drug design – we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

[LG-8] Large Language Models Can Self-Improve At Web Agent Tasks

Link: https://arxiv.org/abs/2405.20309
Authors: Ajay Patel,Markus Hofmarcher,Claudiu Leoveanu-Condrei,Marius-Constantin Dinu,Chris Callison-Burch,Sepp Hochreiter
Keywords: typically been challenging, challenging due, due to lack, complex environment, effectively navigate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

[LG-9] Group Robust Preference Optimization in Reward-free RLHF

Link: https://arxiv.org/abs/2405.20304
Authors: Shyam Sundhar Ramesh,Yifan Hu,Iason Chaimalas,Viraj Mehta,Pier Giuseppe Sessa,Haitham Bou Ammar,Ilija Bogunovic
Keywords: Adapting large language, large language models, Adapting large, human feedback, traditional RLHF approaches
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers’ groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a “one-size-fits-all” approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups’ preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
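
The idea of adaptively re-weighting groups so that those with worse cumulative loss receive more attention can be illustrated with an exponentiated-gradient update over group weights. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the step size, and the per-group loss values are all hypothetical, and the abstract does not specify the exact update rule.

```python
import math


def update_group_weights(weights, group_losses, step_size=1.0):
    """One exponentiated-gradient step on the group weights.

    Each weight is multiplied by exp(step_size * loss) and the result is
    renormalised, so groups with higher loss gain relative weight.
    """
    scaled = [w * math.exp(step_size * loss)
              for w, loss in zip(weights, group_losses)]
    norm = sum(scaled)
    return [s / norm for s in scaled]


def weighted_loss(weights, group_losses):
    """Objective a policy update would minimise under the current weights."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Three hypothetical labeler groups, starting from uniform weights.
weights = [1 / 3, 1 / 3, 1 / 3]
# Hypothetical per-group preference losses observed in two rounds.
for losses in ([0.2, 0.9, 0.4], [0.3, 0.8, 0.4]):
    weights = update_group_weights(weights, losses)
    print(weights, weighted_loss(weights, losses))
```

After several rounds the weights are proportional to the exponential of each group's cumulative loss, so as the step size grows the weighted objective approaches the worst-case group loss.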

[LG-10] DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Link: https://arxiv.org/abs/2405.20289
Authors: Zachary Novack,Julian McAuley,Taylor Berg-Kirkpatrick,Nicholas Bryan
Keywords: AI-based music creation, human-centered AI-based music, control design trade-offs, design trade-offs, critical for human-centered
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation over 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at this https URL.

[LG-11] Flexible SE(2) graph neural networks with applications to PDE surrogates

Link: https://arxiv.org/abs/2405.20287
Authors: Maria Bånkestad,Olof Mogren,Aleksis Pirinen
Keywords: constructing graph neural, graph neural networks, neural networks equivariant, rotations and translations, non-gridded domains
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Comments: 9 pages

Abstract:This paper presents a novel approach for constructing graph neural networks equivariant to 2D rotations and translations and leveraging them as PDE surrogates on non-gridded domains. We show that aligning the representations with the principal axis allows us to sidestep many constraints while preserving SE(2) equivariance. By applying our model as a surrogate for fluid flow simulations and conducting thorough benchmarks against non-equivariant models, we demonstrate significant gains in terms of both data efficiency and accuracy.

[LG-12] Length independent generalization bounds for deep SSM architectures with stability constraints

Link: https://arxiv.org/abs/2405.20278
Authors: Dániel Rácz,Mihály Petreczky,Bálint Daróczy
Keywords: combining State-Space Models, stable SSM blocks, sequential blocks combining, blocks combining State-Space, SSM blocks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, no figures, under submission

Abstract:Many state-of-the-art models trained on long-range sequences, for example S4, S5 or LRU, are made of sequential blocks combining State-Space Models (SSMs) with neural networks. In this paper we provide a PAC bound that holds for this kind of architecture with stable SSM blocks and does not depend on the length of the input sequence. Imposing stability of the SSM blocks is a standard practice in the literature, and it is known to help performance. Our results provide a theoretical justification for the use of stable SSM blocks, as the proposed PAC bound decreases as the degree of stability of the SSM blocks increases.

[LG-13] ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection

Link: https://arxiv.org/abs/2405.20274
Authors: Siva Uday Sampreeth Chebolu,Franck Dernoncourt,Nedim Lipka,Thamar Solorio
Keywords: Aspect-Based Sentiment Analysis, experienced tremendous expansion, workshops and Germeval, shared tasks spanning, Aspect Sentiment Target
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2309.13297

Abstract:Aspect-Based Sentiment Analysis (ABSA) has experienced tremendous expansion and diversity due to various shared tasks spanning several languages and fields and organized via SemEval workshops and Germeval. Nonetheless, a few shortcomings still need to be addressed, such as the lack of low-resource language evaluations and the emphasis on sentence-level analysis. To thoroughly assess ABSA techniques in the context of complete reviews, this research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST). ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level. We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research by incorporating low-resource languages, numerous languages, and a variety of topics. Through this effort, ABSA research will be able to cover more ground and get a deeper comprehension of the task and its practical application in a variety of languages and domains (this https URL).

[LG-14] Reconstruction Attacks on Machine Unlearning: Simple Models are Vulnerable

Link: https://arxiv.org/abs/2405.20272
Authors: Martin Bertran,Shuai Tang,Michael Kearns,Jamie Morgenstern,Aaron Roth,Zhiwei Steven Wu
Keywords: Machine unlearning, data influence removed, unlearning is motivated, motivated by desire, influence removed
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning is motivated by desire for data autonomy: a person can request to have their data’s influence removed from deployed models, and those models should be updated as if they were retrained without the person’s data. We show that, counter-intuitively, these updates expose individuals to high-accuracy reconstruction attacks which allow the attacker to recover their data in its entirety, even when the original models are so simple that privacy risk might not otherwise have been a concern. We show how to mount a near-perfect attack on the deleted data point from linear regression models. We then generalize our attack to other loss functions and architectures, and empirically demonstrate the effectiveness of our attacks across a wide range of datasets (capturing both tabular and image data). Our work highlights that privacy risk is significant even for extremely simple model classes when individuals can request deletion of their data from the model.
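
Why a before/after pair of linear regression models can leak the deleted point follows from the normal equations. If C is the Gram matrix of the remaining data, θ_before is the least-squares fit on all data, and θ_after is the fit after deleting (x, y), then C(θ_before − θ_after) = x · (y − xᵀθ_before), a scalar multiple of the deleted feature vector. The sketch below checks this identity on toy data. It is an illustration under a strong assumption (the attacker knows C exactly), not the authors' attack, and all names and numbers are hypothetical.

```python
def fit_least_squares(X, y):
    """Ordinary least squares for 2 features, no intercept.

    Returns (theta, gram) where gram is the 2x2 matrix X^T X.
    """
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    p = sum(r[0] * t for r, t in zip(X, y))
    q = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    # Closed-form solution of the 2x2 normal equations.
    theta = [(d * p - b * q) / det, (a * q - b * p) / det]
    return theta, [[a, b], [b, d]]


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


# Toy data (hypothetical): four retained samples and one deleted sample.
remaining_X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [0.5, -1.0]]
remaining_y = [1.0, 0.0, 2.0, 5.0]
deleted_x, deleted_y = [2.0, -3.0], -4.0

theta_before, _ = fit_least_squares(remaining_X + [deleted_x],
                                    remaining_y + [deleted_y])
theta_after, gram_after = fit_least_squares(remaining_X, remaining_y)

# The parameter change, mapped through the Gram matrix, points along deleted_x.
delta = [tb - ta for tb, ta in zip(theta_before, theta_after)]
direction = matvec(gram_after, delta)
print(direction[1] / direction[0], deleted_x[1] / deleted_x[0])
```

The two printed ratios should agree up to floating-point error, meaning the deleted feature vector is recovered up to scale. If the features include a known constant such as a bias term, the scale can be fixed as well.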

[LG-15] ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections

Link: https://arxiv.org/abs/2405.20271
Authors: Massimo Bini,Karsten Roth,Zeynep Akata,Anna Khoreva
Keywords: adapt foundation models, downstream task requirements, generalization ability, ubiquitous to adapt, adapt foundation
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2024. Code available at this https URL

Abstract:Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (roughly 10 to 100 times fewer than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility. The code is available at this https URL.
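
To make the idea of a hyperplane reflection concrete, here is a minimal sketch in plain Python. It assumes the reflection takes the standard Householder form H = I - 2uu^T with a unit vector u, and that it is applied by left-multiplying a frozen weight matrix. The matrix `W`, the vector `u`, and the helper names are hypothetical; this is not the authors' implementation, and the ETHER+ relaxation is not covered.

```python
import math

def householder(u):
    """Build H = I - 2 * u u^T after normalising u to unit length."""
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    d = len(u)
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(d)]
            for i in range(d)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical frozen pretrained weight (3x2) and trainable vector u (3 values).
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
u = [1.0, 2.0, 2.0]

H = householder(u)
W_adapted = matmul(H, W)   # reflected weights used in place of W
HH = matmul(H, H)          # applying a reflection twice gives the identity
```

In this sketch the only trainable quantity is `u`, so the number of adapted parameters grows with one dimension of the weight matrix rather than with its full size. Because a reflection is orthogonal, the Frobenius norm of `W_adapted` equals that of `W`.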

[LG-16] KerasCV and KerasNLP: Vision and Language Power-Ups

Link: https://arxiv.org/abs/2405.20247
Authors: Matthew Watson, Divyashree Shivakumar Sreepathihalli, Francois Chollet, Martin Gorner, Kiranbir Sodhia, Ramesh Sampath, Tirth Patel, Haifeng Jin, Neel Kovelamudi, Gabriel Rasskin, Samaneh Saadat, Luke Wood, Chen Qian, Jonathan Bischof, Ian Stenbit
Keywords: Natural Language Processing, Language Processing workflows, Keras domain packages, Computer Vision, Vision and Natural
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Submitted to Journal of Machine Learning Open Source Software

Abstract:We present the Keras domain packages KerasCV and KerasNLP, extensions of the Keras API for Computer Vision and Natural Language Processing workflows, capable of running on either JAX, TensorFlow, or PyTorch. These domain packages are designed to enable fast experimentation, with a focus on ease-of-use and performance. We adopt a modular, layered design: at the library’s lowest level of abstraction, we provide building blocks for creating models and data preprocessing pipelines, and at the library’s highest level of abstraction, we provide pretrained “task” models for popular architectures such as Stable Diffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have built-in preprocessing, pretrained weights, and can be fine-tuned on raw inputs. To enable efficient training, we support XLA compilation for all models, and run all preprocessing via a compiled graph of TensorFlow operations using the tf.data API. The libraries are fully open-source (Apache 2.0 license) and available on GitHub.

[LG-17] Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use

Link: https://arxiv.org/abs/2405.20245
Authors: Franz Louis Cesista, Rui Aguiar, Jason Kim, Paolo Acilo
Keywords: Business Document Information, Document Information Extraction, Business Document, Line Items Recognition, scanned documents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2024

Abstract:Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state of the art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, General Line Items Recognition Metric (GLIRM), that is more aligned with practical BDIE use cases compared to existing metrics, such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG is oftentimes superior given real-world applications and constraints of BDIE.

[LG-18] Grokfast: Accelerated Grokking by Amplifying Slow Gradients

Link: https://arxiv.org/abs/2405.20233
Authors: Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee
Keywords: machine learning dubbed, learning dubbed grokking, machine learning practitioners, machine learning, achieved tenfolds
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures. Project page: this https URL

Abstract:One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under grokking phenomenon. By regarding a series of gradients of a parameter over training iterations as a random signal over time, we can spectrally decompose the parameter trajectories under gradient descent into two components: the fast-varying, overfitting-yielding component and the slow-varying, generalization-inducing component. This analysis allows us to accelerate the grokking phenomenon by more than 50 times with only a few lines of code that amplify the slow-varying components of gradients. The experiments show that our algorithm applies to diverse tasks involving images, languages, and graphs, enabling practical availability of this peculiar artifact of sudden generalization. Our code is available at this https URL.
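
One simple way to amplify the slow-varying part of a gradient signal is to low-pass filter it with an exponential moving average (EMA) and add the filtered value back to the raw gradient. The sketch below illustrates that idea in plain Python. The function name, the `alpha` and `lamb` values, and the synthetic gradient stream are illustrative assumptions and are not taken from the authors' code.

```python
def amplify_slow_gradients(grad, ema, alpha=0.98, lamb=2.0):
    """Return (filtered gradient, updated EMA) for one training step.

    grad and ema are equal-length lists, one entry per parameter.
    The EMA acts as a low-pass filter that keeps the slow component.
    """
    new_ema = [alpha * h + (1.0 - alpha) * g for h, g in zip(ema, grad)]
    filtered = [g + lamb * h for g, h in zip(grad, new_ema)]
    return filtered, new_ema

# Synthetic gradient for a single parameter: a slow constant part (1.0)
# plus a fast part that flips sign every step (+5, -5, +5, ...).
ema = [0.0]
for step in range(500):
    fast = 5.0 if step % 2 == 0 else -5.0
    raw = [1.0 + fast]
    filtered, ema = amplify_slow_gradients(raw, ema)
```

After enough steps the EMA settles close to the slow component (about 1.0 here), while the alternating component is almost entirely averaged out. The filtered gradient handed to the optimizer therefore carries the slow part with extra weight.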

[LG-19] The Empirical Impact of Neural Parameter Symmetries or Lack Thereof

Link: https://arxiv.org/abs/2405.20231
Authors: Derek Lim, Moe Putterman, Robin Walters, Haggai Maron, Stefanie Jegelka
Keywords: Bayesian neural network, neural network, parameter space symmetries, neural network function, underlying neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 27 pages. Preparing code for release

Abstract:Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries – transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.

[LG-20] Feature Fusion for Improved Classification: Combining Dempster-Shafer Theory and Multiple CNN Architectures

Link: https://arxiv.org/abs/2405.20230
Authors: Ayyub Alzahem, Wadii Boulila, Maha Driss, Anis Koubaa
Keywords: Deep Learning, make reliable predictions, Addressing uncertainty, uncertainty in Deep, decisions in complex
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Addressing uncertainty in Deep Learning (DL) is essential, as it enables the development of models that can make reliable predictions and informed decisions in complex, real-world environments where data may be incomplete or ambiguous. This paper introduces a novel algorithm leveraging Dempster-Shafer Theory (DST) to integrate multiple pre-trained models to form an ensemble capable of providing more reliable and enhanced classifications. The main steps of the proposed method include feature extraction, mass function calculation, fusion, and expected utility calculation. Several experiments have been conducted on CIFAR-10 and CIFAR-100 datasets, demonstrating superior classification accuracy of the proposed DST-based method, achieving improvements of 5.4% and 8.4%, respectively, compared to the best individual pre-trained models. Results highlight the potential of DST as a robust framework for managing uncertainties related to data when applying DL in real-world scenarios.
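
The fusion step can be illustrated with Dempster's rule of combination, the standard way to merge two mass functions in Dempster-Shafer Theory. The sketch below is a generic textbook version. The class names and mass values are invented for illustration, and the abstract does not say how the paper derives masses from CNN features, so that part is left out.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels to a mass in [0, 1].
    Products landing on an empty intersection count as conflict and are
    removed by renormalising the remaining masses.
    """
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

CAT, DOG = frozenset({"cat"}), frozenset({"dog"})
EITHER = CAT | DOG  # mass on the whole frame expresses "not sure"

# Hypothetical masses from two pre-trained models for one image.
m_a = {CAT: 0.6, DOG: 0.1, EITHER: 0.3}
m_b = {CAT: 0.5, DOG: 0.3, EITHER: 0.2}

fused = dempster_combine(m_a, m_b)
```

With these made-up inputs the two models agree more on "cat" than on "dog", so the fused mass on `CAT` ends up higher than either model assigned on its own.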

[LG-21] Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback

Link: https://arxiv.org/abs/2405.20216
Authors: Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee
Keywords: challenging task, human image generation, image generation, significant yet challenging, Direct Preference Optimization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 18 figures

Abstract:The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment.

[LG-22] PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization

Link: https://arxiv.org/abs/2405.20213
Authors: Vijay Jaisankar, Sambaran Bandyopadhyay, Kalp Vyas, Varre Chaitanya, Shwetha Somasundaram
Keywords: summary presented, long input document, good design elements, text and images, input document
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:A poster from a long input document can be considered as a one-page easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a little-studied but challenging task. It involves content summarization of the input document followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity and alignment of text and images. Then, we use an LLM-based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.

[LG-23] Unified Explanations in Machine Learning Models: A Perturbation Approach

Link: https://arxiv.org/abs/2405.20200
Authors: Jacob Dineen, Don Kridel, Daniel Dolk, David Castillo
Keywords: Explainable Artificial Intelligence, Explainable Artificial, Artificial Intelligence, high-velocity paradigm shift, recent years
Categories: Machine Learning (cs.LG)

Abstract:A high-velocity paradigm shift towards Explainable Artificial Intelligence (XAI) has emerged in recent years. Highly complex Machine Learning (ML) models have flourished in many tasks of intelligence, and the questions have started to shift away from traditional metrics of validity towards something deeper: What is this model telling me about my data, and how is it arriving at these conclusions? Inconsistencies between XAI and modeling techniques can have the undesirable effect of casting doubt upon the efficacy of these explainability approaches. To address these problems, we propose a systematic, perturbation-based analysis against a popular, model-agnostic method in XAI, SHapley Additive exPlanations (Shap). We devise algorithms to generate relative feature importance in settings of dynamic inference amongst a suite of popular machine learning and deep learning methods, and metrics that allow us to quantify how well explanations generated under the static case hold. We propose a taxonomy for feature importance methodology, measure alignment, and observe quantifiable similarity amongst explanation models across several datasets.

[LG-24] Occam Gradient Descent

Link: https://arxiv.org/abs/2405.20194
Authors: B.N. Kausik
Keywords: avoid overfitting training, overfitting training data, gradient descent, deep learning models, overprovisioned deep learning
Categories: Machine Learning (cs.LG)

Abstract:Deep learning neural network models must be large enough to adapt to their problem domain, while small enough to avoid overfitting training data during gradient descent. To balance these competing demands, overprovisioned deep learning models such as transformers are trained for a single epoch on large data sets, and are hence inefficient with both computing resources and training data. In response to these inefficiencies, we exploit learning theory to derive Occam Gradient Descent, an algorithm that interleaves adaptive reduction of model size to minimize generalization error, with gradient descent on model weights to minimize fitting error. In contrast, traditional gradient descent greedily minimizes fitting error without regard to generalization error. Our algorithm simultaneously descends the space of weights and topological size of any neural network without modification, and is effective in our experiments in outperforming traditional gradient descent with or without post-train pruning in accuracy, compute and model compression.

[LG-25] Transformers and Slot Encoding for Sample Efficient Physical World Modelling

Link: https://arxiv.org/abs/2405.20180
Authors: Francesco Petri, Luigi Asprino, Aldo Gangemi
Keywords: World modelling, predict its evolution, rules that govern, essential ability, physical world
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to the problem of world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to work only at the image level thus disregarding that the environment is composed of objects interacting with each other. In this paper, we propose an architecture combining Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over the existing solutions in terms of sample efficiency and a reduction of the variation of the performance over the training examples. The code for our architecture and experiments is available at this https URL

[LG-26] Non-intrusive data-driven model order reduction for circuits based on Hammerstein architectures

Link: https://arxiv.org/abs/2405.20178
Authors: Joshua Hanson, Biliana Paskaleva, Pavel Bochev
Keywords: data-driven system identification, system identification techniques, key building blocks, model order reduction, non-intrusive model order
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 13 pages, 13 figures; submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Abstract:We demonstrate that data-driven system identification techniques can provide a basis for effective, non-intrusive model order reduction (MOR) for common circuits that are key building blocks in microelectronics. Our approach is motivated by the practical operation of these circuits and utilizes a canonical Hammerstein architecture. To demonstrate the approach we develop a parsimonious Hammerstein model for a non-linear CMOS differential amplifier. We train this model on a combination of direct current (DC) and transient Spice (Xyce) circuit simulation data using a novel sequential strategy to identify the static nonlinear and linear dynamical parts of the model. Simulation results show that the Hammerstein model is an effective surrogate for the differential amplifier circuit that accurately and efficiently reproduces its behavior over a wide range of operating points and input frequencies.
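
A Hammerstein model is a static nonlinearity followed by a linear dynamical block. The toy sketch below shows that structure with a `tanh` nonlinearity and a first-order linear filter. Both choices, together with the coefficients `a` and `b`, are assumptions made for illustration and say nothing about the model the authors fit to the amplifier.

```python
import math

def hammerstein_step(u, state, a=0.9, b=0.1):
    """Advance the model by one time step.

    Static nonlinear part: v = tanh(u)
    Linear dynamic part:   y[t] = a * y[t-1] + b * v
    """
    v = math.tanh(u)
    return a * state + b * v

def simulate(inputs, a=0.9, b=0.1):
    """Run the model over an input sequence, starting from rest."""
    y, outputs = 0.0, []
    for u in inputs:
        y = hammerstein_step(u, y, a, b)
        outputs.append(y)
    return outputs

# Step response to a constant input of 2.0.
response = simulate([2.0] * 200)
```

For a constant input the output settles at `b * tanh(u) / (1 - a)`, so the nonlinearity sets the steady-state level and the linear block sets how fast the output gets there. That split mirrors the abstract's sequential identification of the static and dynamic parts.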

[LG-27] Tropical Expressivity of Neural Networks

Link: https://arxiv.org/abs/2405.20174
Authors: Shiv Bhatia, Yueqi Cao, Paul Lezeau, Anthea Monod
Keywords: linear activation neural, activation neural networks, neural networks, algebraic geometric framework, tropical
Categories: Machine Learning (cs.LG)

Abstract:We propose an algebraic geometric framework to study the expressivity of linear activation neural networks. A particular quantity that has been actively studied in the field of deep learning is the number of linear regions, which gives an estimate of the information capacity of the architecture. To study and evaluate information capacity and expressivity, we work in the setting of tropical geometry – a combinatorial and polyhedral variant of algebraic geometry – where there are known connections between tropical rational maps and feedforward neural networks. Our work builds on and expands this connection to capitalize on the rich theory of tropical geometry to characterize and study various architectural aspects of neural networks. Our contributions are threefold: we provide a novel tropical geometric approach to selecting sampling domains among linear regions; an algebraic result allowing for a guided restriction of the sampling domain for network architectures with symmetries; and an open source library to analyze neural networks as tropical Puiseux rational maps. We provide a comprehensive set of proof-of-concept numerical experiments demonstrating the breadth of neural network architectures to which tropical geometric theory can be applied to reveal insights on expressivity characteristics of a network. Our work provides the foundations for the adaptation of both theory and existing software from computational tropical geometry and symbolic computation to deep learning.

[LG-28] Iterative Feature Boosting for Explainable Speech Emotion Recognition

Link: https://arxiv.org/abs/2405.20172
Authors: Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
Keywords: high dimensional datasets, including redundant, irrelevant information, lead to high, high dimensional
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Published in: 2023 International Conference on Machine Learning and Applications (ICMLA)

Abstract:In speech emotion recognition (SER), using predefined features without considering their practical importance may lead to high dimensional datasets, including redundant and irrelevant information. Consequently, high-dimensional learning often results in decreasing model accuracy while increasing computational complexity. Our work underlines the importance of carefully considering and analyzing features in order to build efficient SER systems. We present a new supervised SER method based on an efficient feature engineering approach. We pay particular attention to the explainability of results to evaluate feature relevance and refine feature sets. This is performed iteratively through a feature evaluation loop, using Shapley values to boost feature selection and improve overall framework performance. Our approach thus makes it possible to balance model performance against transparency. The proposed method outperforms human-level performance (HLP) and state-of-the-art machine learning methods in emotion recognition on the TESS dataset.
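
To show what a Shapley value measures in feature selection, the sketch below computes exact Shapley values by averaging each feature's marginal contribution over every ordering of the features. The feature names and the `toy_accuracy` scoring function are invented for illustration. The paper's actual features, model, and evaluation loop are not described in enough detail here to reproduce.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal gain over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        chosen = set()
        for f in order:
            before = value(frozenset(chosen))
            chosen.add(f)
            totals[f] += value(frozenset(chosen)) - before
    return {f: t / len(orders) for f, t in totals.items()}

def toy_accuracy(subset):
    """Hypothetical accuracy of a model trained on a feature subset."""
    score = 0.5
    if "pitch" in subset:
        score += 0.2
    if "energy" in subset:
        score += 0.1
    if "pitch" in subset and "energy" in subset:
        score += 0.1  # the two features help more together
    return score      # "noise" never changes the score

phi = shapley_values(["pitch", "energy", "noise"], toy_accuracy)
```

A feature whose Shapley value is zero, such as `noise` in this toy setup, contributes nothing in any context, which makes it a natural candidate to drop in a selection loop. The exact computation grows factorially with the number of features, so it is only practical for very small feature sets.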

[LG-29] GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning

Link: https://arxiv.org/abs/2405.20139
Authors: Costas Mavromatis, George Karypis
Keywords: human-crafted factual knowledge, factual knowledge, Knowledge Graphs, represent human-crafted factual, collectively form
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions while grounding the reasoning in the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions, outperforming competing approaches by 8.9–15.5 percentage points in answer F1.
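
The path-extraction and verbalization steps can be sketched with a breadth-first search over KG triplets. In the sketch below the answer candidate is hard-coded, whereas in the described method it would come from the GNN. The tiny graph, the helper names, and the arrow-style text format are all assumptions made for illustration.

```python
from collections import deque

def shortest_path(triples, source, target):
    """Breadth-first search over directed (head, relation, tail) triples.

    Returns the list of triples on a shortest path, or None if unreachable.
    """
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None

def verbalize(path):
    """Turn a reasoning path into plain text for an LLM prompt."""
    return " ; ".join(f"{h} -> {r} -> {t}" for h, r, t in path)

triples = [
    ("Jamaica", "official_language", "English"),
    ("Jamaica", "located_in", "Caribbean"),
    ("English", "language_family", "Germanic"),
]

# "Jamaica" plays the question entity; "Germanic" stands in for a
# candidate answer that a GNN retriever would normally supply.
path = shortest_path(triples, "Jamaica", "Germanic")
prompt_context = verbalize(path)
```

The resulting `prompt_context` string would be placed in the LLM prompt next to the question, so the model reasons over an explicit two-hop chain rather than the raw graph.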

[LG-30] Near Optimal Decentralized Optimization with Compression and Momentum Tracking

Link: https://arxiv.org/abs/2405.20114
Authors: Rustem Islamov, Yuan Gao, Sebastian U. Stich
Keywords: Machine Learning applications, decentralized Machine Learning, Machine Learning, large-scale decentralized Machine, garnered significant attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of quantized information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, the current results suffer from various issues such as non-scalability with the number of clients, requirements for large batches, or bounded gradient assumption. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Momentum Tracking and Error Feedback. Our analysis demonstrates that MoTEF achieves most of the desired properties, and significantly outperforms existing methods under arbitrary data heterogeneity. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.

[LG-31] Low-dimensional approximations of the conditional law of Volterra processes: a non-positive curvature approach

Link: https://arxiv.org/abs/2405.20094
Authors: Reza Arabpour, John Armstrong, Luca Galimberti, Anastasis Kratsios, Giulia Livieri
Keywords: mathematical finance, Predicting the conditional, Volterra process, stochastic volatility, crucial challenge
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Differential Geometry (math.DG)
Comments: Main body: 25 Pages, Appendices 29 Pages, 14 Tables, 6 Figures

Abstract:Predicting the conditional evolution of Volterra processes with stochastic volatility is a crucial challenge in mathematical finance. While deep neural network models offer promise in approximating the conditional law of such processes, their effectiveness is hindered by the curse of dimensionality caused by the infinite dimensionality and non-smooth nature of these problems. To address this, we propose a two-step solution. Firstly, we develop a stable dimension reduction technique, projecting the law of a reasonably broad class of Volterra process onto a low-dimensional statistical manifold of non-positive sectional curvature. Next, we introduce a sequentially deep learning model tailored to the manifold’s geometry, which we show can approximate the projected conditional law of the Volterra process. Our model leverages an auxiliary hypernetwork to dynamically update its internal parameters, allowing it to encode non-stationary dynamics of the Volterra process, and it can be interpreted as a gating mechanism in a mixture of expert models where each expert is specialized at a specific point in time. Our hypernetwork further allows us to achieve approximation rates that would seemingly only be possible with very large networks.

[LG-32] Visual Attention Analysis in Online Learning

Link: https://arxiv.org/abs/2405.20091
Authors: Navarro Miriam, Becerra Álvaro, Daza Roberto, Cobos Ruth, Morales Aythami, Fierrez Julian
Keywords: Multimodal Learning Analytics, Learning Analytics field, Analytics field, Multimodal Learning, Learning Analytics
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted in CEDI 2024 (VII Congreso Español de Informática), A Coruña, Spain

Abstract:In this paper, we present an approach in the Multimodal Learning Analytics field. Within this approach, we have developed a tool to visualize and analyze eye movement data collected during learning sessions in online courses. The tool is named VAAD (an acronym for Visual Attention Analysis Dashboard). These eye movement data have been gathered using an eye-tracker and subsequently processed and visualized for interpretation. The purpose of the tool is to conduct a descriptive analysis of the data by facilitating its visualization, enabling the identification of differences and learning patterns among various learner populations. Additionally, it integrates a predictive module capable of anticipating learner activities during a learning session. Consequently, VAAD holds the potential to offer valuable insights into online learning behaviors from both descriptive and predictive perspectives.

[LG-33] Soft Partitioning of Latent Space for Semantic Channel Equalization

Link: https://arxiv.org/abs/2405.20085
Authors: Tomás Huttebraucker, Mohamed Sana, Emilio Calvanese Strinati
Keywords: multi-user semantic communications, address language mismatch, Semantic channel equalization, semantic space, Semantic
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Semantic channel equalization has emerged as a solution to address language mismatch in multi-user semantic communications. This approach aims to align the latent spaces of an encoder and a decoder which were not jointly trained and it relies on a partition of the semantic (latent) space into atoms based on the semantic meaning. In this work we explore the role of the semantic space partition in scenarios where the task structure involves a one-to-many mapping between the semantic space and the action space. In such scenarios, partitioning based on hard inference results leads to a loss of information which degrades the equalization performance. We propose a soft criterion to derive the atoms of the partition which leverages the soft decoder’s output and offers a more comprehensive understanding of the semantic space’s structure. Through empirical validation, we demonstrate that soft partitioning yields a more descriptive and regular partition of the space, consequently enhancing the performance of the equalization algorithm.

[LG-34] Segment Shuffle and Stitch: A Simple Mechanism for Improving Time-Series Representations

Link: https://arxiv.org/abs/2405.20082
Authors: Shivam Grover, Amin Jalali, Ali Etemad
Keywords: time-steps intact, time-series representation learning, Existing approaches, representation learning, original order
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Existing approaches for learning representations of time-series keep the temporal arrangement of the time-steps intact with the presumption that the original order is optimal for learning. However, non-adjacent sections of real-world time-series may have strong dependencies. Accordingly we raise the question: Is there an alternative arrangement for time-series which could enable more effective representation learning? To address this, we propose a simple plug-and-play mechanism called Segment, Shuffle, and Stitch (S3) designed to improve time-series representation learning of existing models. S3 works by creating non-overlapping segments from the original sequence and shuffling them in a learned manner that is optimal for the task at hand. It then re-attaches the shuffled segments back together and performs a learned weighted sum with the original input to capture both the newly shuffled sequence along with the original sequence. S3 is modular and can be stacked to create various degrees of granularity, and can be added to many forms of neural architectures including CNNs or Transformers with negligible computation overhead. Through extensive experiments on several datasets and state-of-the-art baselines, we show that incorporating S3 results in significant improvements for the tasks of time-series classification and forecasting, improving performance on certain datasets by up to 68%. We also show that S3 makes the learning more stable with a smoother training loss curve and loss landscape compared to the original baseline. The code is available at this https URL.
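
The three steps named in the abstract (segment, shuffle, stitch) plus the weighted sum with the original input can be sketched on a plain list. In the described method the segment order and the two weights are learned. Here they are fixed by hand, which is a simplifying assumption, and the function name is hypothetical.

```python
def segment_shuffle_stitch(series, num_segments, order, w_shuffled, w_original):
    """Apply a fixed-order version of segment, shuffle, and stitch.

    series:       list of numbers, length divisible by num_segments
    order:        permutation of range(num_segments) giving the new order
    w_shuffled,
    w_original:   weights of the final sum (learned in the paper, fixed here)
    """
    if len(series) % num_segments != 0:
        raise ValueError("series length must be divisible by num_segments")
    seg_len = len(series) // num_segments
    # Segment: cut the series into non-overlapping pieces.
    segments = [series[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]
    # Shuffle and stitch: concatenate the pieces in the new order.
    stitched = [x for idx in order for x in segments[idx]]
    # Weighted sum with the original input.
    return [w_shuffled * s + w_original * o for s, o in zip(stitched, series)]

out = segment_shuffle_stitch([1, 2, 3, 4, 5, 6], num_segments=3,
                             order=[2, 0, 1], w_shuffled=0.5, w_original=0.5)
```

Here the stitched sequence is `[5, 6, 1, 2, 3, 4]`, and each output value averages it with the original series at the same position, so both the original and the rearranged ordering reach the downstream model.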

[LG-35] Student Answer Forecasting: Transformer-Driven Answer Choice Prediction for Language Learning

Link: https://arxiv.org/abs/2405.20079
Authors: Elena Grazia Gado, Tommaso Martorella, Luca Zunino, Paola Mejia-Domenzain, Vinitra Swamy, Jibril Frej, Tanja Käser
Keywords: Intelligent Tutoring Systems, Intelligent Tutoring, Tutoring Systems, specific answer choices, answer choices
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted as a poster paper at EDM 2024: 17th International Conference on Educational Data Mining in Atlanta, USA

Abstract:Intelligent Tutoring Systems (ITS) enhance personalized learning by predicting student answers to provide immediate and customized instruction. However, recent research has primarily focused on the correctness of the answer rather than the student’s performance on specific answer choices, limiting insights into students’ thought processes and potential misconceptions. To address this gap, we present MCQStudentBert, an answer forecasting model that leverages the capabilities of Large Language Models (LLMs) to integrate contextual understanding of students’ answering history along with the text of the questions and answers. By predicting the specific answer choices students are likely to make, practitioners can easily extend the model to new answer choices or remove answer choices for the same multiple-choice question (MCQ) without retraining the model. In particular, we compare MLP, LSTM, BERT, and Mistral 7B architectures to generate embeddings from students’ past interactions, which are then incorporated into a finetuned BERT’s answer-forecasting mechanism. We apply our pipeline to a dataset of language learning MCQ, gathered from an ITS with over 10,000 students to explore the predictive accuracy of MCQStudentBert, which incorporates student interaction patterns, in comparison to correct answer prediction and traditional mastery-learning feature-based approaches. This work opens the door to more personalized content, modularization, and granular support.

[LG-36] Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Link: https://arxiv.org/abs/2405.20053
Authors: Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic
Keywords: Pre-trained Language Models, Direct Preference Optimization, in-context learning capabilities, exhibit strong zero-shot, Pre-trained Language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model’s reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.

[LG-37] Threshold-Independent Fair Matching through Score Calibration

Link: https://arxiv.org/abs/2405.20051
Authors: Mohammad Hossein Moslemi, Mostafa Milani
Keywords: public administration, identifies records, records that refer, Entity Matching, Entity
Subjects: Machine Learning (cs.LG); Databases (cs.DB)

Abstract:Entity Matching (EM) is a critical task in numerous fields, such as healthcare, finance, and public administration, as it identifies records that refer to the same entity within or across different databases. EM faces considerable challenges, particularly with false positives and negatives. These are typically addressed by generating matching scores and applying thresholds to balance false positives and negatives in various contexts. However, adjusting these thresholds can affect the fairness of the outcomes, a critical factor that remains largely overlooked in current fair EM research. The existing body of research on fair EM tends to concentrate on static thresholds, neglecting their critical impact on fairness. To address this, we introduce a new approach in EM using recent metrics for evaluating biases in score-based binary classification, particularly through the lens of distributional parity. This approach enables the application of various bias metrics like equalized odds, equal opportunity, and demographic parity without depending on threshold settings. Our experiments with leading matching methods reveal potential biases, and by applying a calibration technique for EM scores using Wasserstein barycenters, we not only mitigate these biases but also preserve accuracy across real-world datasets. This paper contributes to the field of fairness in data cleaning, especially within EM, which is a central task in data cleaning, by promoting a method for generating matching scores that reduce biases across different thresholds.

[LG-38] Iterative Learning Control of Fast Nonlinear Oscillatory Dynamics (Preprint)

Link: https://arxiv.org/abs/2405.20045
Authors: John W. Brooks, Christine M. Greve
Keywords: Gaussian Process Regression, called instabilities, sudden onset, onset of deleterious, Time-Lagged Phase Portraits
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)

Abstract:The sudden onset of deleterious and oscillatory dynamics (often called instabilities) is a known challenge in many fluid, plasma, and aerospace systems. These dynamics are difficult to address because they are nonlinear, chaotic, and are often too fast for active control schemes. In this work, we develop an alternative active controls system using an iterative, trajectory-optimization and parameter-tuning approach based on Iterative Learning Control (ILC), Time-Lagged Phase Portraits (TLPP) and Gaussian Process Regression (GPR). The novelty of this approach is that it can control a system’s dynamics despite the controller being much slower than the dynamics. We demonstrate this controller on the Lorenz system of equations where it iteratively adjusts (tunes) the system’s input parameters to successfully reproduce a desired oscillatory trajectory or state. Additionally, we investigate the system’s dynamical sensitivity to its control parameters, identify continuous and bounded regions of desired dynamical trajectories, and demonstrate that the controller is robust to missing information and uncontrollable parameters as long as certain requirements are met. The controller presented in this work provides a framework for low-speed control for a variety of fast, nonlinear systems that may aid in instability suppression and mitigation.

[LG-39] CycleFormer: TSP Solver Based on Language Modeling

Link: https://arxiv.org/abs/2405.20042
Authors: Jieun Yook, Junpyo Seo, Joon Huh, Han Joon Byun, Byung-ro Mooon
Keywords: Traveling Salesman Problem, Salesman Problem, Traveling Salesman, conventional transformer model, transformer model
Subjects: Machine Learning (cs.LG)

Abstract:We propose a new transformer model for the Traveling Salesman Problem (TSP) called CycleFormer. We identified distinctive characteristics that need to be considered when applying a conventional transformer model to TSP and aimed to fully incorporate these elements into the TSP-specific transformer. Unlike the token sets in typical language models, which are limited and static, the token (node) set in TSP is unlimited and dynamic. To exploit this fact to the fullest, we equated the encoder output with the decoder linear layer and directly connected the context vector of the encoder to the decoder encoding. Additionally, we added a positional encoding to the encoder tokens that reflects the two-dimensional nature of TSP, and devised a circular positional encoding for the decoder tokens that considers the cyclic properties of a tour. By incorporating these ideas, CycleFormer outperforms state-of-the-art (SOTA) transformer models for TSP from TSP-50 to TSP-500. Notably, on TSP-500, the optimality gap was reduced by approximately 2.8 times, from 3.09% to 1.10%, compared to the existing SOTA. The code will be made available at this https URL.
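The "approximately 2.8 times" figure follows directly from the two optimality gaps quoted in the abstract. The short check below assumes the usual definition of the optimality gap as the relative excess of a tour's length over the optimal length; the tour lengths in the example are made-up numbers used only to exercise the function.

```python
def optimality_gap(tour_length, optimal_length):
    """Relative excess over the optimal tour length, in percent."""
    return (tour_length - optimal_length) / optimal_length * 100.0


# Gaps on TSP-500 as quoted in the abstract (previous SOTA vs. CycleFormer).
previous_gap = 3.09
cycleformer_gap = 1.10
reduction_factor = previous_gap / cycleformer_gap

if __name__ == "__main__":
    # Hypothetical lengths: a tour of 103.09 against an optimum of 100.
    print(optimality_gap(103.09, 100.0))
    print(round(reduction_factor, 2))
```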

[LG-40] A Random Forest-based Prediction Model for Turning Points in Antagonistic event-group Competitions

Link: https://arxiv.org/abs/2405.20029
Authors: Zishuo Zhu
Keywords: provide real-time feedback, athletes’ state information, prediction studies related, event-group competitions focus, antagonistic event-group competitions
Subjects: Machine Learning (cs.LG)

Abstract:At present, most prediction studies on antagonistic event-group competitions focus on predicting competition results rather than the competition process. As a result, they cannot provide real-time feedback on the athletes’ state during the actual competition, and thus cannot analyze changes in the competition situation. To solve this problem, this paper proposes a Random Forest-based prediction model for turning points in antagonistic event-group competitions. First, a quantitative equation of competitive potential energy is proposed. Second, the quantitative value of competitive potential energy is obtained using the dynamic combination of weights method, and the turning points of the competition situation are marked according to the quantitative time series graph. Finally, a random forest prediction model optimised with the KM-SMOTE algorithm and the grid search method is established. The experimental analysis shows that the quantitative equation of competitive potential energy can effectively reflect the dynamic situation of the competition, and that the model can effectively predict turning points of the competition situation, with a recall of 86.13% on the test set. The model has certain significance for future study of the competition situation of the antagonistic event-group.

[LG-41] A Simple and Adaptive Learning Rate for FTRL in Online Learning with Minimax Regret of \Theta(T^{2/3}) and its Application to Best-of-Both-Worlds

Link: https://arxiv.org/abs/2405.20028
Authors: Taira Tsuchiya, Shinji Ito
Keywords: minimax regret, Theta, learning rate, learning, online learning problems
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 31 pages

Abstract:Follow-the-Regularized-Leader (FTRL) is a powerful framework for various online learning problems. By designing its regularizer and learning rate to be adaptive to past observations, FTRL is known to work adaptively to various properties of an underlying environment. However, most existing adaptive learning rates are for online learning problems with a minimax regret of \Theta(\sqrt{T}) for the number of rounds T, and there are only a few studies on adaptive learning rates for problems with a minimax regret of \Theta(T^{2/3}), which include several important problems dealing with indirect feedback. To address this limitation, we establish a new adaptive learning rate framework for problems with a minimax regret of \Theta(T^{2/3}). Our learning rate is designed by matching the stability, penalty, and bias terms that naturally appear in regret upper bounds for problems with a minimax regret of \Theta(T^{2/3}). As applications of this framework, we consider two major problems dealing with indirect feedback: partial monitoring and graph bandits. We show that FTRL with our learning rate and the Tsallis entropy regularizer improves existing Best-of-Both-Worlds (BOBW) regret upper bounds, which achieve simultaneous optimality in the stochastic and adversarial regimes. The resulting learning rate is surprisingly simple compared to the existing learning rates for BOBW algorithms for problems with a minimax regret of \Theta(T^{2/3}).
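As a rough illustration of why matching stability, penalty, and bias terms leads to a T^{2/3} rate, consider the following schematic bound. It is a textbook-style sketch, not the bound from the paper: the constant learning rate \eta, the forced-exploration rate \gamma, and the constant c are assumptions introduced here, whereas the paper's learning rate is adaptive.

```latex
% Schematic regret decomposition (assumed form, for illustration only)
R_T \;\lesssim\;
  \underbrace{\frac{c}{\eta}}_{\text{penalty}}
  \;+\; \underbrace{\frac{\eta\, T}{\gamma}}_{\text{stability}}
  \;+\; \underbrace{\gamma\, T}_{\text{bias}}

% Choosing \gamma = T^{-1/3} and \eta = T^{-2/3} makes all three terms match:
\frac{c}{\eta} = c\,T^{2/3}, \qquad
\frac{\eta\, T}{\gamma} = T^{-2/3} \cdot T \cdot T^{1/3} = T^{2/3}, \qquad
\gamma\, T = T^{2/3}
\quad\Longrightarrow\quad R_T = O\!\left(T^{2/3}\right)
```

The extra bias term, which comes from exploring in order to obtain indirect feedback, is what separates this regime from the \Theta(\sqrt{T}) case, where only the penalty and stability terms need to be balanced.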

[LG-42] Safe Multi-agent Reinforcement Learning with Natural Language Constraints

Link: https://arxiv.org/abs/2405.20018
Authors: Ziyan Wang, Meng Fang, Tristan Tomilin, Fei Fang, Yali Du
Keywords: Safe Multi-agent Reinforcement, Multi-agent Reinforcement Learning, natural language constraints, Safe MARL, Reinforcement Learning
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 23 pages, 6 figures

Abstract:The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked. While Safe MARL has vast potential, especially in fields like robotics and autonomous vehicles, its full potential is limited by the need to define constraints in pre-designed mathematical terms, which requires extensive domain expertise and reinforcement learning knowledge, hindering its broader adoption. To address this limitation and make Safe MARL more accessible and adaptable, we propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL). Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings that capture the essence of prohibited states and behaviours. These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards. To evaluate the effectiveness of SMALL, we introduce the LaMaSafe, a multi-task benchmark designed to assess the performance of multiple agents in adhering to natural language constraints. Empirical evaluations across various environments demonstrate that SMALL achieves comparable rewards and significantly fewer constraint violations, highlighting its effectiveness in understanding and enforcing natural language constraints.

[LG-43] subMFL: Compatiple subModel Generation for Federated Learning in Device Heterogenous Environment

Link: https://arxiv.org/abs/2405.20014
Authors: Zeyneddin Oz, Ceylan Soygul Oz, Abdollah Malekjafarian, Nima Afraz, Fatemeh Golpayegani
Keywords: Federated Learning, devices, Deep Neural Networks, systems with distributed, model
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, European Conference on Parallel Processing, pp. between 52 and 64, Springer, 2023

Abstract:Federated Learning (FL) is commonly used in systems with distributed and heterogeneous devices with access to varying amounts of data and diverse computing and storage capacities. The FL training process enables such devices to update the weights of a shared model locally using their local data, and then a trusted central server combines all of those models to generate a global model. In this way, a global model is generated while the data remains local to devices to preserve privacy. However, training large models such as Deep Neural Networks (DNNs) on resource-constrained devices can take a prohibitively long time and consume a large amount of energy. In the current process, the low-capacity devices are excluded from the training process, although they might have access to unseen data. To overcome this challenge, we propose a model compression approach that enables heterogeneous devices with varying computing capacities to participate in the FL process. In our approach, the server shares a dense model with all devices to train it. Afterwards, the trained model is gradually compressed to obtain submodels with varying levels of sparsity to be used as suitable initial global models for resource-constrained devices that were not capable of training the first dense model. This results in an increased participation rate of resource-constrained devices while the transferred weights from the previous round of training are preserved. Our validation experiments show that despite reaching about 50 per cent global sparsity, the generated submodels maintain their accuracy while they can be shared to increase participation by around 50 per cent.

[LG-44] FlexiDrop: Theoretical Insights and Practical Advances in Random Dropout Method on GNNs

Link: https://arxiv.org/abs/2405.20012
Authors: Zhiheng Zhou, Sihao Liu, Weichen Zhao
Keywords: Graph Neural Networks, Graph Neural, Neural Networks, handling graph-type data, random dropout methods
Subjects: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) are powerful tools for handling graph-type data. Recently, GNNs have been widely applied in various domains, but they also face some issues, such as overfitting, over-smoothing and non-robustness. The existing research indicates that random dropout methods are an effective way to address these issues. However, random dropout methods in GNNs still face unresolved problems. Currently, the choice of dropout rate, often determined by heuristic or grid search methods, can increase the generalization error, contradicting the principal aims of dropout. In this paper, we propose a novel random dropout method for GNNs called FlexiDrop. First, we conduct a theoretical analysis of dropout in GNNs using Rademacher complexity and demonstrate that the generalization error of traditional random dropout methods is constrained by a function related to the dropout rate. Subsequently, we use this function as a regularizer to unify the dropout rate and empirical loss within a single loss function, optimizing them simultaneously. Therefore, our method enables adaptive adjustment of the dropout rate and theoretically balances the trade-off between model complexity and generalization ability. Furthermore, extensive experimental results on benchmark datasets show that FlexiDrop outperforms traditional random dropout methods in GNNs.

[LG-45] Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities

Link: https://arxiv.org/abs/2405.20003
Authors: Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen
Keywords: Large Language Models, Large Language, reliability are important, crucial for applications, applications where safety
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model’s semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.
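For reference, the von Neumann entropy mentioned in the abstract has a simple closed form for a positive semidefinite kernel with unit trace. The notation below (K for the kernel, \lambda_i for its eigenvalues, n for the number of sampled answers) is chosen here for illustration and may differ from the paper's.

```latex
% K is positive semidefinite with \operatorname{Tr}(K) = 1,
% so its eigenvalues satisfy \lambda_i \ge 0 and \sum_i \lambda_i = 1.
\mathrm{VNE}(K) \;=\; -\operatorname{Tr}\!\left(K \log K\right)
               \;=\; -\sum_{i=1}^{n} \lambda_i \log \lambda_i ,
\qquad \text{with the convention } 0 \log 0 = 0 .
```

When K is diagonal, the eigenvalues are the diagonal entries and the expression reduces to the Shannon entropy of that distribution. Off-diagonal entries, which here encode semantic similarity between answers, lower the entropy relative to treating every answer as unrelated.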

[LG-46] Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

Link: https://arxiv.org/abs/2405.19988
Authors: Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
Keywords: Natural language, convenient modality, modality for humans, Natural, data
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages in the main text, 16 pages including references and supplementary materials. 4 figures and 3 tables in the main text, 1 table in supplementary materials

Abstract:Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate reinforcement learning actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.

[LG-47] Domain Adaptation with Cauchy-Schwarz Divergence

Link: https://arxiv.org/abs/2405.19978
Authors: Wenzhe Yin, Shujian Yu, Yicong Lin, Jie Liu, Jan-Jakob Sonke, Efstratios Gavves
Keywords: Domain adaptation aims, training data, learn a hypothesis, Domain adaptation, adaptation aims
Subjects: Machine Learning (cs.LG)
Comments: Accepted by UAI-24

Abstract:Domain adaptation aims to use training data from one or multiple source domains to learn a hypothesis that can be generalized to a different, but related, target domain. As such, having a reliable measure for evaluating the discrepancy of both marginal and conditional distributions is crucial. We introduce Cauchy-Schwarz (CS) divergence to the problem of unsupervised domain adaptation (UDA). The CS divergence offers a theoretically tighter generalization error bound than the popular Kullback-Leibler divergence. This holds for the general case of supervised learning, including multi-class classification and regression. Furthermore, we illustrate that the CS divergence enables a simple estimator on the discrepancy of both marginal and conditional distributions between source and target domains in the representation space, without requiring any distributional assumptions. We provide multiple examples to illustrate how the CS divergence can be conveniently used in both distance metric- or adversarial training-based UDA frameworks, resulting in compelling performance.
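To make the quantity concrete, the sketch below computes one common form of the Cauchy-Schwarz divergence for two discrete distributions, D_CS(p; q) = -log((Σ p·q)² / (Σ p² · Σ q²)). The discrete setting and the function name are assumptions made for illustration; the paper works with an estimator in a learned representation space, which is not reproduced here.

```python
import math


def cauchy_schwarz_divergence(p, q):
    """Cauchy-Schwarz divergence between two discrete distributions.

    Uses D_CS = -log( (sum p*q)^2 / (sum p^2 * sum q^2) ), which is
    zero when p and q are proportional and grows as their overlap shrinks.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    cross = sum(a * b for a, b in zip(p, q))
    norm_p = sum(a * a for a in p)
    norm_q = sum(b * b for b in q)
    if norm_p == 0 or norm_q == 0:
        raise ValueError("distributions must not be all zeros")
    if cross == 0:
        # Disjoint supports: the divergence is unbounded.
        return math.inf
    return -math.log(cross * cross / (norm_p * norm_q))


if __name__ == "__main__":
    print(cauchy_schwarz_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))
    print(cauchy_schwarz_divergence([1.0, 0.0], [0.5, 0.5]))
```

By the Cauchy-Schwarz inequality the ratio inside the logarithm never exceeds 1, so the divergence is always non-negative. It is also symmetric in p and q, unlike the Kullback-Leibler divergence.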

[LG-48] Consistent Submodular Maximization

Link: https://arxiv.org/abs/2405.19977
Authors: Paul Dütting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam
Keywords: Maximizing monotone submodular, monotone submodular functions, classic optimization task, Maximizing monotone, machine learning
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: To appear at ICML 24

Abstract:Maximizing monotone submodular functions under cardinality constraints is a classic optimization task with several applications in data mining and machine learning. In this paper we study this problem in a dynamic environment with consistency constraints: elements arrive in a streaming fashion and the goal is maintaining a constant approximation to the optimal solution while having a stable solution (i.e., the number of changes between two consecutive solutions is bounded). We provide algorithms in this setting with different trade-offs between consistency and approximation quality. We also complement our theoretical results with an experimental analysis showing the effectiveness of our algorithms in real-world instances.
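For readers unfamiliar with the underlying problem, the sketch below shows the classic offline greedy algorithm for a monotone submodular function (set coverage) under a cardinality constraint k. This is not the paper's algorithm: the streaming setting and the consistency constraint are not modelled, and the data and function name are hypothetical.

```python
def greedy_max_coverage(sets, k):
    """Pick at most k named sets greedily by marginal coverage gain.

    Coverage is a monotone submodular function, so this classic greedy
    rule gives a (1 - 1/e) approximation in the offline setting.
    """
    chosen = []
    covered = set()
    for _ in range(k):
        best_name, best_gain = None, 0
        for name, items in sets.items():
            if name in chosen:
                continue
            gain = len(items - covered)  # marginal gain of adding this set
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no remaining set adds anything new
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, covered


if __name__ == "__main__":
    example = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    print(greedy_max_coverage(example, k=2))
```

In the dynamic setting studied in the paper, re-running such a procedure after every arrival could change the solution arbitrarily, which is the instability that a consistency constraint is meant to bound.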

[LG-49] GasTrace: Detecting Sandwich Attack Malicious Accounts in Ethereum

Link: https://arxiv.org/abs/2405.19971
Authors: Zekai Liu, Xiaoqi Li, Hongli Peng, Wenkai Li
Keywords: Ethereum transaction data, Automated Market Maker, transparency of Ethereum, transaction data make, executing malicious attacks
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The openness and transparency of Ethereum transaction data make it easy to be exploited by any entities, executing malicious attacks. The sandwich attack manipulates the Automated Market Maker (AMM) mechanism, profiting from manipulating the market price through front or after-running transactions. To identify and prevent sandwich attacks, we propose a cascade classification framework GasTrace. GasTrace analyzes various transaction features to detect malicious accounts, notably through the analysis and modeling of Gas features. In the initial classification, we utilize the Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel to generate the predicted probabilities of accounts, further constructing a detailed transaction network. Subsequently, the behavior features are captured by the Graph Attention Network (GAT) technique in the second classification. Through cascade classification, GasTrace can analyze and classify the sandwich attacks. Our experimental results demonstrate that GasTrace achieves a remarkable detection and generation capability, performing an accuracy of 96.73% and an F1 score of 95.71% for identifying sandwich attack accounts.

[LG-50] Improved Out-of-Scope Intent Classification with Dual Encoding and Threshold-based Re-Classification

Link: https://arxiv.org/abs/2405.19967
Authors: Hossam M. Zawbaa, Wael Rashwan, Sourav Dutta, Haytham Assem
Keywords: essential for task-oriented, task-oriented dialogues, Universal Sentence Encoder, Detecting, DETER
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Detecting out-of-scope user utterances is essential for task-oriented dialogues and intent classification. Current methodologies face difficulties with the unpredictable distribution of outliers and often rely on assumptions about data distributions. We present the Dual Encoder for Threshold-Based Re-Classification (DETER) to address these challenges. This end-to-end framework efficiently detects out-of-scope intents without requiring assumptions on data distributions or additional post-processing steps. The core of DETER utilizes dual text encoders, the Universal Sentence Encoder (USE) and the Transformer-based Denoising AutoEncoder (TSDAE), to generate user utterance embeddings, which are classified through a branched neural architecture. Further, DETER generates synthetic outliers using self-supervision and incorporates out-of-scope phrases from open-domain datasets. This approach ensures a comprehensive training set for out-of-scope detection. Additionally, a threshold-based re-classification mechanism refines the model’s initial predictions. Evaluations on the CLINC-150, Stackoverflow, and Banking77 datasets demonstrate DETER’s efficacy. Our model outperforms previous benchmarks, increasing up to 13% and 5% in F1 score for known and unknown intents on CLINC-150 and Stackoverflow, and 16% for known and 24% for unknown intents on Banking77. The source code has been released at this https URL_Classification_OOS.
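The abstract does not spell out the re-classification rule, so the sketch below shows only a generic confidence-threshold scheme of the kind commonly used for out-of-scope detection. The function name, the default threshold of 0.7, and the `out_of_scope` label are all assumptions for illustration and are not taken from the DETER code.

```python
def reclassify_with_threshold(probabilities, labels, threshold=0.7,
                              oos_label="out_of_scope"):
    """Keep the top predicted intent only if its probability is high enough.

    Assumed rule: predictions whose top probability falls below
    `threshold` are relabelled as out-of-scope.
    """
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need one probability per label")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] < threshold:
        return oos_label
    return labels[best_index]


if __name__ == "__main__":
    intents = ["check_balance", "transfer_money", "card_lost"]
    print(reclassify_with_threshold([0.90, 0.05, 0.05], intents))
    print(reclassify_with_threshold([0.40, 0.35, 0.25], intents))
```

In practice the threshold would be tuned on held-out data, since a higher value trades known-intent accuracy for better out-of-scope recall.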

[LG-51] Collective Variable Free Transition Path Sampling with Generative Flow Network

Link: https://arxiv.org/abs/2405.19961
Authors: Kiyoung Seong, Seonghyun Park, Seonghwan Kim, Woo Youn Kim, Sungsoo Ahn
Keywords: Understanding transition paths, Understanding transition, drug discovery, meta-stable states, fundamental for material
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 2 tables

Abstract:Understanding transition paths between meta-stable states in molecular systems is fundamental for material design and drug discovery. However, sampling these paths via molecular dynamics simulations is computationally prohibitive due to the high-energy barriers between the meta-stable states. Recent machine learning approaches are often restricted to simple systems or rely on collective variables (CVs) extracted from expensive domain knowledge. In this work, we propose to leverage generative flow networks (GFlowNets) to sample transition paths without relying on CVs. We reformulate the problem as amortized energy-based sampling over molecular trajectories and train a bias potential by minimizing the squared log-ratio between the target distribution and the generator, derived from the flow matching objective of GFlowNets. Our evaluation on three proteins (Alanine Dipeptide, Polyproline, and Chignolin) demonstrates that our approach, called TPS-GFN, generates more realistic and diverse transition paths than the previous CV-free machine learning approach.

[LG-52] GenKubeSec: LLM-Based Kubernetes Misconfiguration Detection, Localization, Reasoning, and Remediation

Link: https://arxiv.org/abs/2405.19954
Authors: Ehud Malul, Yair Meidan, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Keywords: Kubernetes configuration files, configuration files, complex and error-prone, operational setbacks, highly complex
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:A key challenge associated with Kubernetes configuration files (KCFs) is that they are often highly complex and error-prone, leading to security vulnerabilities and operational setbacks. Rule-based (RB) tools for KCF misconfiguration detection rely on static rule sets, making them inherently limited and unable to detect newly-discovered misconfigurations. RB tools also suffer from misdetection, since mistakes are likely when coding the detection rules. Recent methods for detecting and remediating KCF misconfigurations are limited in terms of their scalability and detection coverage, or due to the fact that they have high expertise requirements and do not offer automated remediation along with misconfiguration detection. Novel approaches that employ LLMs in their pipeline rely on API-based, general-purpose, and mainly commercial models. Thus, they pose security challenges, have inconsistent classification performance, and can be costly. In this paper, we propose GenKubeSec, a comprehensive and adaptive, LLM-based method, which, in addition to detecting a wide variety of KCF misconfigurations, also identifies the exact location of the misconfigurations and provides detailed reasoning about them, along with suggested remediation. When empirically compared with three industry-standard RB tools, GenKubeSec achieved equivalent precision (0.990) and superior recall (0.999). When a random sample of KCFs was examined by a Kubernetes security expert, GenKubeSec’s explanations as to misconfiguration localization, reasoning and remediation were 100% correct, informative and useful. To facilitate further advancements in this domain, we share the unique dataset we collected, a unified misconfiguration index we developed for label standardization, our experimentation code, and GenKubeSec itself as an open-source tool.

[LG-53] MM-Lego: Modular Biomedical Multimodal Models with Minimal Fine-Tuning

Link: https://arxiv.org/abs/2405.19950
Authors: Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik
Keywords: Learning holistic computational, biological systems requires, holistic computational representations, chemical or biological, holistic computational
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a modular and general-purpose fusion and model merging framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We achieve this by introducing a wrapper for unimodal encoders that enforces lightweight dimensionality assumptions between modalities and harmonises their representations by learning features in the frequency domain to enable model merging with little signal interference. We show that MM-Lego 1) can be used as a model merging method which achieves competitive performance with end-to-end fusion models without any fine-tuning, 2) can operate on any unimodal encoder, and 3) is a model fusion method that, with minimal fine-tuning, achieves state-of-the-art results on six benchmarked multimodal biomedical tasks.

[LG-54] Learning Latent Graph Structures and their Uncertainty

Link: https://arxiv.org/abs/2405.19933
Authors: Alessandro Manenti, Daniele Zambon, Cesare Alippi
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, prediction task, downstream prediction task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Within a prediction task, Graph Neural Networks (GNNs) use relational information as an inductive bias to enhance the model’s accuracy. As task-relevant relations might be unknown, graph structure learning approaches have been proposed to learn them while solving the downstream prediction task. In this paper, we demonstrate that minimization of a point-prediction loss function, e.g., the mean absolute error, does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that a suitable loss function on the stochastic model outputs simultaneously grants (i) the unknown adjacency matrix latent distribution and (ii) optimal performance on the prediction task. Finally, we propose a sampling-based method that solves this joint learning task. Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.

[LG-55] Exploring Diffusion Models Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

Link: https://arxiv.org/abs/2405.19931
Authors: Xiaoyu Wu, Jiaru Zhang, Yang Hua, Bohan Lyu, Hao Wang, Tao Song, Haibing Guan
Keywords: reducing training costs, Diffusion Models, significantly reducing training, key advancement, personalized AI applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by theoretically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks.

[LG-56] BAN: Detecting Backdoors Activated by Adversarial Neuron Noise

Link: https://arxiv.org/abs/2405.19928
Authors: Xiaoyun Xu, Zhuoran Liu, Stefanos Koffas, Shujian Yu, Stjepan Picek
Keywords: deep learning represent, gained significant attention, Backdoor, research community, attacks on deep
Categories: Machine Learning (cs.LG)

Abstract:Backdoor attacks on deep learning represent a recent threat that has gained significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we also find that it overly relies on prominent backdoor features that are highly distinguishable from benign features. To tackle these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate our defense, BAN, is $1.37\times$ (on CIFAR-10) and $5.11\times$ (on ImageNet200) more efficient with 9.99% higher detect success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at https://anonymous.4open.science/r/ban-4B32.

[LG-57] Unraveling the Impact of Heterophilic Structures on Graph Positive-Unlabeled Learning

Link: https://arxiv.org/abs/2405.19919
Authors: Yuhao Wu, Jiangchao Yao, Bo Han, Lina Yao, Tongliang Liu
Keywords: Label Propagation Loss, real-world scenarios, remains under-explored, data still remains, graph data
Categories: Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:While Positive-Unlabeled (PU) learning is vital in many real-world scenarios, its application to graph data still remains under-explored. We unveil that a critical challenge for PU learning on graph lies on the edge heterophily, which directly violates the irreducibility assumption for Class-Prior Estimation (class prior is essential for building PU learning algorithms) and degenerates the latent label inference on unlabeled nodes during classifier training. In response to this challenge, we introduce a new method, named Graph PU Learning with Label Propagation Loss (GPL). Specifically, GPL considers learning from PU nodes along with an intermediate heterophily reduction, which helps mitigate the negative impact of the heterophilic structure. We formulate this procedure as a bilevel optimization that reduces heterophily in the inner loop and efficiently learns a classifier in the outer loop. Extensive experiments across a variety of datasets have shown that GPL significantly outperforms baseline methods, confirming its effectiveness and superiority.

[LG-58] Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2405.19909
Authors: Tenglong Liu, Yang Li, Yixing Lan, Hao Gao, Wei Pan, Xin Xu
Keywords: offline reinforcement learning, reinforcement learning, policy, Advantage-guided Policy Regularization, behavior policy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICML 2024, 19 pages

Abstract:In offline reinforcement learning, the challenge of out-of-distribution (OOD) is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from the issue of unnecessary conservativeness, hampering policy improvement. This occurs due to the indiscriminate use of all actions from the behavior policy that generates the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), obtaining high-advantage actions from an augmented behavior policy combined with VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism from OOD actions. This is achieved by harnessing the VAE capacity to generate samples matching the distribution of the data points. We theoretically prove that the improvement of the behavior policy is guaranteed. Besides, it effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets reveal that A2PR exhibits superior performance. Code is available at this https URL.

[LG-59] Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection

Link: https://arxiv.org/abs/2405.19902
Authors: Suyeon Kim, Dongha Lee, SeongKu Kang, Sukang Chae, Sanghwan Jang, Hwanjo Yu
Keywords: commonly found, noisy labels, found in real-world, detrimental impact, training signals
Categories: Machine Learning (cs.LG)
Comments: Accepted to CVPR 2024

Abstract:Label noise, commonly found in real-world datasets, has a detrimental impact on a model’s generalization. To effectively detect incorrectly labeled instances, previous works have mostly relied on distinguishable training signals, such as training loss, as indicators to differentiate between clean and noisy labels. However, they have limitations in that the training signals incompletely reveal the model’s behavior and are not effectively generalized to various noise types, resulting in limited detection accuracy. In this paper, we propose DynaCor framework that distinguishes incorrectly labeled instances from correctly labeled ones based on the dynamics of the training signals. To cope with the absence of supervision for clean and noisy labels, DynaCor first introduces a label corruption strategy that augments the original dataset with intentionally corrupted labels, enabling indirect simulation of the model’s behavior on noisy labels. Then, DynaCor learns to identify clean and noisy instances by inducing two clearly distinguishable clusters from the latent representations of training dynamics. Our comprehensive experiments show that DynaCor outperforms the state-of-the-art competitors and shows strong robustness to various noise types and noise rates.

[LG-60] Urban Air Pollution Forecasting: a Machine Learning Approach leveraging Satellite Observations and Meteorological Forecasts

Link: https://arxiv.org/abs/2405.19901
Authors: Giacomo Blanco, Luca Barco, Lorenzo Innocenti, Claudio Rossi
Keywords: health and well-being, poses a significant, significant threat, Air pollution poses, public health
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, submitted to IEEE MetroLivEnv 2024

Abstract:Air pollution poses a significant threat to public health and well-being, particularly in urban areas. This study introduces a series of machine-learning models that integrate data from the Sentinel-5P satellite, meteorological conditions, and topological characteristics to forecast future levels of five major pollutants. The investigation delineates the process of data collection, detailing the combination of diverse data sources utilized in the study. Through experiments conducted in the Milan metropolitan area, the models demonstrate their efficacy in predicting pollutant levels for the forthcoming day, achieving a percentage error of around 30%. The proposed models are advantageous as they are independent of monitoring stations, facilitating their use in areas without existing infrastructure. Additionally, we have released the collected dataset to the public, aiming to stimulate further research in this field. This research contributes to advancing our understanding of urban air quality dynamics and emphasizes the importance of amalgamating satellite, meteorological, and topographical data to develop robust pollution forecasting models.

[LG-61] Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

Link: https://arxiv.org/abs/2405.19893
Authors: Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, Lei Liang, Jun Zhou
Keywords: retrieval augmented generation, made remarkable achievements, retrieval augmented, augmented generation, large language models
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:In recent years, large language models (LLMs) have made remarkable achievements in various domains. However, the untimeliness and cost of knowledge updates coupled with hallucination issues of LLMs have curtailed their applications in knowledge intensive tasks, where retrieval augmented generation (RAG) can be of help. Nevertheless, existing retrieval augmented models typically use similarity as a bridge between queries and documents and follow a retrieve then read procedure. In this work, we argue that similarity is not always the panacea and totally relying on similarity would sometimes degrade the performance of retrieval augmented generation. To this end, we propose MetRag, a Multi layEred Thoughts enhanced Retrieval Augmented Generation framework. To begin with, beyond existing similarity oriented thought, we embrace a small scale utility model that draws supervision from an LLM for utility oriented thought and further come up with a smarter model by comprehensively combining the similarity and utility oriented thoughts. Furthermore, given the fact that the retrieved document set tends to be huge and using them in isolation makes it difficult to capture the commonalities and characteristics among them, we propose to make an LLM as a task adaptive summarizer to endow retrieval augmented generation with compactness-oriented thought. Finally, with multi layered thoughts from the precedent stages, an LLM is called for knowledge augmented generation. Extensive experiments on knowledge-intensive tasks have demonstrated the superiority of MetRag.

[LG-62] Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Link: https://arxiv.org/abs/2405.19888
Authors: Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
Keywords: LLM, public LLM services, large language models, LLM applications, public LLM
Categories: Machine Learning (cs.LG)
Comments: To appear on USENIX OSDI 2024

Abstract:The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today’s public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.

[LG-63] Federated Learning with Multi-resolution Model Broadcast

Link: https://arxiv.org/abs/2405.19886
Authors: Henrik Rydén, Reza Moosavi, Erik G. Larsson
Keywords: low-SNR receiver, periodically broadcast, multi-resolution coding, receiver, low-SNR
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)

Abstract:In federated learning, a server must periodically broadcast a model to the agents. We propose to use multi-resolution coding and modulation (also known as non-uniform modulation) for this purpose. In the simplest instance, broadcast transmission is used, whereby all agents are targeted with one and the same transmission (typically without any particular favored beam direction), which is coded using multi-resolution coding/modulation. This enables high-SNR agents, with high path gains to the server, to receive a more accurate model than the low-SNR agents do, without consuming more downlink resources. As one implementation, we use transmission with a non-uniform 8-PSK constellation, where a high-SNR receiver (agent) can separate all 8 constellation points (hence receive 3 bits) whereas a low-SNR receiver can only separate 4 points (hence receive 2 bits). By encoding the least significant information in the third bit, the high-SNR receivers can obtain the model with higher accuracy, while the low-SNR receiver can still obtain the model although with reduced accuracy, thereby facilitating at least some basic participation of the low-SNR receiver. We show the effectiveness of our proposed scheme via experimentation using federated learning with the MNIST data-set.
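
The 8-PSK idea in this abstract can be illustrated with a small sketch. It assumes one possible non-uniform layout: four clusters at the QPSK angles, each holding two points separated by a small angular offset. The offset value `DELTA`, the bit-to-point mapping, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical angular offset between the two points of a cluster (assumption).
DELTA = np.pi / 16


def modulate(bits3):
    """Map (b0, b1, b2) to a unit-energy symbol.

    b0 and b1 select the quadrant (coarse information);
    b2 selects one of the two points inside that quadrant (fine information).
    """
    b0, b1, b2 = bits3
    quadrant = 2 * b0 + b1
    centre = np.pi / 4 + quadrant * np.pi / 2
    angle = centre + (DELTA if b2 else -DELTA)
    return np.exp(1j * angle)


def demod_coarse(symbol):
    """Low-SNR receiver: recover only the quadrant, i.e. the first 2 bits."""
    phase = np.angle(symbol) % (2 * np.pi)
    quadrant = int(np.floor(phase / (np.pi / 2))) % 4
    return (quadrant >> 1, quadrant & 1)


def demod_fine(symbol):
    """High-SNR receiver: recover all 3 bits."""
    b0, b1 = demod_coarse(symbol)
    centre = np.pi / 4 + (2 * b0 + b1) * np.pi / 2
    # Signed angular deviation from the cluster centre.
    deviation = np.angle(symbol * np.exp(-1j * centre))
    return (b0, b1, int(deviation > 0))
```

Because the two points of a cluster sit close together, noise corrupts the third bit long before it pushes a symbol into another quadrant, which is why the least significant part of each model parameter would be placed in that bit.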

[LG-64] Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning

Link: https://arxiv.org/abs/2405.19885
Authors: Hengkai Tan, Songming Liu, Kai Ma, Chengyang Ying, Xingxing Zhang, Hang Su, Jun Zhu
Keywords: embodied learning scenarios, obtain generalized low-level, Reinforcement learning, generalized low-level robot, low-level robot policies
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning is able to obtain generalized low-level robot policies on diverse robotics datasets in embodied learning scenarios, and Transformer has been widely used to model time-varying features. However, it still suffers from the issues of low data efficiency and high inference latency. In this paper, we propose to investigate the task from a new perspective of the frequency domain. We first observe that the energy density in the frequency domain of a robot’s trajectory is mainly concentrated in the low-frequency part. Then, we present the Fourier Controller Network (FCNet), a new network that utilizes the Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency domain interpolation. We further achieve parallel training and efficient recurrent inference by using FFT and Sliding DFT methods in the model architecture for real-time decision-making. Comprehensive analyses in both simulated (e.g., D4RL) and real-world environments (e.g., robot locomotion) demonstrate FCNet’s substantial efficiency and effectiveness over existing methods such as Transformer, e.g., FCNet outperforms Transformer on multi-environmental robotics datasets of all types of sizes (from 1.9M to 120M). The project page and code can be found at this https URL.
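
The Sliding DFT mentioned in this abstract is a standard signal-processing recurrence, and a minimal sketch may help show why it suits recurrent inference: each new sample updates the windowed spectrum in O(N) instead of recomputing a full FFT. The code below is the generic textbook recurrence, not FCNet's architecture, and the function name `sliding_dft` is an illustrative assumption.

```python
import numpy as np


def sliding_dft(signal, window):
    """Return the DFT of every length-`window` slice of `signal`.

    The first spectrum is computed with an FFT; each later one is derived from
    its predecessor with the sliding DFT recurrence
        X_k(n) = (X_k(n-1) + x[n] - x[n-window]) * exp(2j*pi*k/window).
    """
    signal = np.asarray(signal, dtype=complex)
    k = np.arange(window)
    twiddle = np.exp(2j * np.pi * k / window)

    spectrum = np.fft.fft(signal[:window])
    spectra = [spectrum]
    for n in range(window, len(signal)):
        # Add the incoming sample, drop the outgoing one, rotate every bin.
        spectrum = (spectrum + signal[n] - signal[n - window]) * twiddle
        spectra.append(spectrum)
    return np.array(spectra)
```

In floating point the recurrence accumulates rounding error over long streams, so practical implementations often re-anchor with a full FFT from time to time.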

[LG-65] From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

Link: https://arxiv.org/abs/2405.19883
Authors: Jianliang He, Siyu Chen, Fengzhuo Zhang, Zhuoran Yang
Keywords: solve decision-making problems, LLM Planner, large language model, LLM Planner navigates, pretrained LLM Planner
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2024

Abstract:In this work, from a theoretical lens, we aim to understand why large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. To this end, consider a hierarchical reinforcement learning (RL) model where the LLM Planner and the Actor perform high-level task planning and low-level execution, respectively. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. Under proper assumptions on the pretraining data, we prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning. Additionally, we highlight the necessity for exploration beyond the subgoals derived from BAIL by proving that naively executing the subgoals returned by LLM leads to a linear regret. As a remedy, we introduce an $\epsilon$-greedy exploration strategy to BAIL, which is proven to incur sublinear regret when the pretraining error is small. Finally, we extend our theoretical framework to include scenarios where the LLM Planner serves as a world model for inferring the transition model of the environment and to multi-agent settings, enabling coordination among multiple Actors.
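
As a rough illustration of the exploration idea in this abstract, the sketch below applies generic epsilon-greedy selection to subgoals: with probability `epsilon` a random candidate is tried, otherwise the planner's suggestion is followed. The abstract does not specify the paper's exact procedure, so the function name, its parameters, and the uniform random choice are all assumptions.

```python
import random


def epsilon_greedy_subgoal(planner_subgoal, candidate_subgoals, epsilon, rng=random):
    """Pick a subgoal with generic epsilon-greedy exploration.

    planner_subgoal:    the subgoal proposed by the (hypothetical) LLM planner.
    candidate_subgoals: a non-empty list of subgoals available for exploration.
    epsilon:            probability of exploring instead of following the planner.
    rng:                any object exposing random() and choice().
    """
    if rng.random() < epsilon:
        # Explore: ignore the planner and sample uniformly.
        return rng.choice(candidate_subgoals)
    # Exploit: follow the planner's suggestion.
    return planner_subgoal
```

Setting `epsilon` to zero recovers the naive strategy of always executing the planner's subgoal, which is the case the abstract says leads to linear regret.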

[LG-66] Learning from Random Demonstrations: Offline Reinforcement Learning with Importance-Sampled Diffusion Models

Link: https://arxiv.org/abs/2405.19878
Authors: Zeyu Fang, Tian Lan
Keywords: generate synthetic data, Generative models, world model, offline reinforcement learning, generates diffusion models
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:Generative models such as diffusion have been employed as world models in offline reinforcement learning to generate synthetic data for more effective learning. Existing work either generates diffusion models one-time prior to training or requires additional interaction data to update it. In this paper, we propose a novel approach for offline reinforcement learning with closed-loop policy evaluation and world-model adaptation. It iteratively leverages a guided diffusion world model to directly evaluate the offline target policy with actions drawn from it, and then performs an importance-sampled world model update to adaptively align the world model with the updated policy. We analyzed the performance of the proposed method and provided an upper bound on the return gap between our method and the real environment under an optimal policy. The result sheds light on various factors affecting learning performance. Evaluations in the D4RL environment show significant improvement over state-of-the-art baselines, especially when only random or medium-expertise demonstrations are available – thus requiring improved alignment between the world model and offline policy evaluation.

[LG-67] Is In-Context Learning Sufficient for Instruction Following in LLMs?

Link: https://arxiv.org/abs/2405.19874
Authors: Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
Keywords: potentially learn, changing their weights, promising capability, In-context learning, ICL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Code at this https URL

Abstract:In-context learning (ICL) allows LLMs to learn from examples without changing their weights, which is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LMs. Unlike for tasks such as classification, translation, or summarization, adding more ICL demonstrations for long-context LLMs does not systematically improve instruction following performance. To address this limitation, we derive a greedy selection approach for ICL examples that noticeably improves performance, yet without bridging the gap to instruction fine-tuning. Finally, we provide a series of ablation studies to better understand the reasons behind the remaining gap, and we show how some aspects of ICL depart from the existing knowledge and are specific to the instruction tuning setting. Overall, our work advances the understanding of ICL as an alignment technique. We provide our code at this https URL.

[LG-68] On Vessel Location Forecasting and the Effect of Federated Learning

Link: https://arxiv.org/abs/2405.19870
Authors: Andreas Tritsarolis, Nikos Pelekis, Konstantina Bereta, Dimitris Zissis, Yannis Theodoridis
Keywords: Automatic Identification System, Identification System, Automatic Identification, maritime analytics operations, spread of Automatic
Categories: Machine Learning (cs.LG)

Abstract:The wide spread of Automatic Identification System (AIS) has motivated several maritime analytics operations. Vessel Location Forecasting (VLF) is one of the most critical operations for maritime awareness. However, accurate VLF is a challenging problem due to the complexity and dynamic nature of maritime traffic conditions. Furthermore, as privacy concerns and restrictions have grown, training data has become increasingly fragmented, resulting in dispersed databases of several isolated data silos among different organizations, which in turn decreases the quality of learning models. In this paper, we propose an efficient VLF solution based on LSTM neural networks, in two variants, namely Nautilus and FedNautilus for the centralized and the federated learning approach, respectively. We also demonstrate the superiority of the centralized approach with respect to current state of the art and discuss the advantages and disadvantages of the federated against the centralized approach.

[LG-69] Out-of-distribution Reject Option Method for Dataset Shift Problem in Early Disease Onset Prediction

Link: https://arxiv.org/abs/2405.19864
Authors: Taisei Tosaki, Eiichiro Uchino, Ryosuke Kojima, Yohei Mineharu, Mikio Arita, Nobuyuki Miyai, Yoshinori Tamada, Tatsuya Mikami, Koichi Murashita, Shigeyuki Nakaji, Yasushi Okuno
Keywords: Machine learning, OOD detection, OOD detection models, predict lifestyle-related disease, OOD
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Machine learning is increasingly used to predict lifestyle-related disease onset using health and medical data. However, the prediction effectiveness is hindered by dataset shift, which involves discrepancies in data distribution between the training and testing datasets, misclassifying out-of-distribution (OOD) data. To diminish dataset shift effects, this paper proposes the out-of-distribution reject option for prediction (ODROP), which integrates OOD detection models to preclude OOD data from the prediction phase. We investigated the efficacy of five OOD detection methods (variational autoencoder, neural network ensemble std, neural network ensemble epistemic, neural network energy, and neural network gaussian mixture based energy measurement) across two datasets, the Hirosaki and Wakayama health checkup data, in the context of three disease onset prediction tasks: diabetes, dyslipidemia, and hypertension. To evaluate the ODROP method, we trained disease onset prediction models and OOD detection models on Hirosaki data and used AUROC-rejection curve plots from Wakayama data. The variational autoencoder method showed superior stability and magnitude of improvement in Area Under the Receiver Operating Curve (AUROC) in five cases: AUROC in the Wakayama data was improved from 0.80 to 0.90 at a 31.1% rejection rate for diabetes onset and from 0.70 to 0.76 at a 34% rejection rate for dyslipidemia. We categorized dataset shifts into two types using SHAP clustering - those that considerably affect predictions and those that do not. We expect that this classification will help standardize measuring instruments. This study is the first to apply OOD detection to actual health and medical data, demonstrating its potential to substantially improve the accuracy and reliability of disease prediction models amidst dataset shift.
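
The reject-option evaluation described in this abstract can be sketched as follows: samples with the highest OOD scores are discarded, and AUROC is computed on the remainder. This is a minimal illustration under assumed inputs. The function names are hypothetical, the pairwise AUROC is a generic estimator, and nothing here reproduces the ODROP implementation or its detectors.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def auroc_after_rejection(labels, scores, ood_scores, reject_rate):
    """Drop the `reject_rate` fraction with the highest OOD scores, then compute AUROC."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    keep_n = int(round(len(labels) * (1.0 - reject_rate)))
    # Indices of the samples that look most in-distribution.
    keep = np.argsort(ood_scores)[:keep_n]
    return auroc(labels[keep], scores[keep])
```

Sweeping `reject_rate` from 0 upward and plotting the returned values gives one form of AUROC-rejection curve. The kept subset must still contain both classes for the metric to be defined.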

[LG-70] The Merit of River Network Topology for Neural Flood Forecasting

Link: https://arxiv.org/abs/2405.19836
Authors: Nikolas Kirschstein, Yixuan Sun
Keywords: Climate change exacerbates, exacerbates riverine floods, change exacerbates riverine, Climate change, riverine floods
Categories: Machine Learning (cs.LG)
Comments: this https URL

Abstract:Climate change exacerbates riverine floods, which occur with higher frequency and intensity than ever. The much-needed forecasting systems typically rely on accurate river discharge predictions. To this end, the SOTA data-driven approaches treat forecasting at spatially distributed gauge stations as isolated problems, even within the same river network. However, incorporating the known topology of the river network into the prediction model has the potential to leverage the adjacency relationship between gauges. Thus, we model river discharge for a network of gauging stations with GNNs and compare the forecasting performance achieved by different adjacency definitions. Our results show that the model fails to benefit from the river network topology information, both on the entire network and small subgraphs. The learned edge weights correlate with neither of the static definitions and exhibit no regular pattern. Furthermore, the GNNs struggle to predict sudden, narrow discharge spikes. Our work hints at a more general underlying phenomenon of neural prediction not always benefitting from graphical structure and may inspire a systematic study of the conditions under which this happens.

[LG-71] Joint Selective State Space Model and Detrending for Robust Time Series Anomaly Detection

Link: https://arxiv.org/abs/2405.19823
Authors: Junqi Chen, Xu Tan, Sylwan Rahardja, Jiawei Yang, Susanto Rahardja
Keywords: Series Anomaly Detection, Time Series Anomaly, Deep learning-based sequence, sequential modeling capabilities, effective sequential modeling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Signal Processing Letters

Abstract:Deep learning-based sequence models are extensively employed in Time Series Anomaly Detection (TSAD) tasks due to their effective sequential modeling capabilities. However, the ability of TSAD is limited by two key challenges: (i) the ability to model long-range dependency and (ii) the generalization issue in the presence of non-stationary data. To tackle these challenges, an anomaly detector that leverages the selective state space model known for its proficiency in capturing long-term dependencies across various domains is proposed. Additionally, a multi-stage detrending mechanism is introduced to mitigate the prominent trend component in non-stationary data to address the generalization issue. Extensive experiments conducted on real-world public datasets demonstrate that the proposed methods surpass all 12 compared baseline methods.

[LG-72] Approximate Global Convergence of Independent Learning in Multi-Agent Systems

Link: https://arxiv.org/abs/2405.19811
Authors: Ruiyang Jin, Zaiwei Chen, Yiheng Lin, Jie Song, Adam Wierman
Keywords: large-scale multi-agent systems, global convergence guarantees, lacks global convergence, multi-agent systems, global convergence
Categories: Machine Learning (cs.LG)

Abstract:Independent learning (IL), despite being a popular approach in practice to achieve scalability in large-scale multi-agent systems, usually lacks global convergence guarantees. In this paper, we study two representative algorithms, independent $Q$-learning and independent natural actor-critic, within value-based and policy-based frameworks, and provide the first finite-sample analysis for approximate global convergence. The results imply a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ up to an error term that captures the dependence among agents and characterizes the fundamental limit of IL in achieving global convergence. To establish the result, we develop a novel approach for analyzing IL by constructing a separable Markov decision process (MDP) for convergence analysis and then bounding the gap due to model difference between the separable MDP and the original one. Moreover, we conduct numerical experiments using a synthetic MDP and an electric vehicle charging example to verify our theoretical findings and to demonstrate the practical applicability of IL.

[LG-73] MetaCURL: Non-stationary Concave Utility Reinforcement Learning

Link: https://arxiv.org/abs/2405.19807
Authors: Bianca Marin Moreno (UGA, Thoth, EDF R&D, FiME Lab), Margaux Brégère (LPSM, EDF R&D), Pierre Gaillard (UGA, Thoth), Nadia Oudjane (EDF R&D, FiME Lab)
Keywords: episodic loop-free Markov, loop-free Markov decision, Markov decision processes, Concave Utility Reinforcement, loop-free Markov
Categories: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:We explore online learning in episodic loop-free Markov decision processes in non-stationary environments (changing losses and probability transitions). Our focus is on the Concave Utility Reinforcement Learning problem (CURL), an extension of classical RL for handling convex performance criteria in state-action distributions induced by agent policies. While various machine learning problems can be written as CURL, its non-linearity invalidates traditional Bellman equations. Despite recent solutions to classical CURL, none address non-stationary MDPs. This paper introduces MetaCURL, the first CURL algorithm for non-stationary MDPs. It employs a meta-algorithm running multiple black-box algorithm instances over different intervals, aggregating outputs via a sleeping expert framework. The key hurdle is partial information due to MDP uncertainty. Under partial information on the probability transitions (uncertainty and non-stationarity coming only from external noise, independent of agent state-action pairs), we achieve optimal dynamic regret without prior knowledge of MDP changes. Unlike approaches for RL, MetaCURL handles full adversarial losses, not just stochastic ones. We believe our approach for managing non-stationarity with experts can be of interest to the RL community.

[LG-74] Preference Alignment with Flow Matching

Link: https://arxiv.org/abs/2405.19806
Authors: Minu Kim,Yongsik Lee,Sehyeok Kang,Jihwan Oh,Song Chong,Seyoung Yun
Keywords: preference-based reinforcement learning, Flow Matching, Preference Flow Matching, PFM utilizes flow, reinforcement learning
Categories: Machine Learning (cs.LG)

Abstract:We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method’s alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference.

[LG-75] Complexity of Deciding Injectivity and Surjectivity of ReLU Neural Networks

Link: https://arxiv.org/abs/2405.19805
Authors: Vincent Froese,Moritz Grillo,Martin Skutella
Keywords: modern machine learning, ReLU activation play, Neural networks, machine learning, activation play
Categories: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Neural networks with ReLU activation play a key role in modern machine learning. In view of safety-critical applications, the verification of trained networks is of great importance and necessitates a thorough understanding of essential properties of the function computed by a ReLU network, including characteristics like injectivity and surjectivity. Recently, Puthawala et al. [JMLR 2022] came up with a characterization for injectivity of a ReLU layer, which implies an exponential time algorithm. However, the exact computational complexity of deciding injectivity remained open. We answer this question by proving coNP-completeness of deciding injectivity of a ReLU layer. On the positive side, as our main result, we present a parameterized algorithm which yields fixed-parameter tractability of the problem with respect to the input dimension. In addition, we also characterize surjectivity for two-layer ReLU networks with one-dimensional output. Remarkably, the decision problem turns out to be the complement of a basic network verification task. We prove NP-hardness for surjectivity, implying a stronger hardness result than previously known for the network verification problem. Finally, we reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem.

[LG-76] Exploring Key Factors for Long-Term Vessel Incident Risk Prediction

Link: https://arxiv.org/abs/2405.19804
Authors: Tianyi Chen,Hua Wang,Yutong Cai,Maohan Liang,Qiang Meng
Keywords: Factor analysis acts, Factor analysis, acts a pivotal, pivotal role, role in enhancing
Categories: Machine Learning (cs.LG)

Abstract:Factor analysis plays a pivotal role in enhancing maritime safety. Most previous studies conduct factor analysis within the framework of incident-related label prediction, where the developed models can be categorized into short-term and long-term prediction models. The long-term models offer a more strategic approach, enabling more proactive risk management, compared to the short-term ones. Nevertheless, few studies have been devoted to rigorously identifying the key factors for long-term prediction and undertaking comprehensive factor analysis. Hence, this study aims to delve into the key factors for predicting the incident risk levels in the subsequent year given a specific datestamp. The majority of candidate factors potentially contributing to the incident risk are collected from vessels’ historical safety performance data spanning up to five years. An improved embedded feature selection method, which integrates a Random Forest classifier with a feature filtering process, is proposed to identify key risk-contributing factors from the candidate pool. The results demonstrate superior performance of the proposed method in incident prediction and factor interpretability. Comprehensive analysis is conducted upon the key factors, which could help maritime stakeholders formulate management strategies for incident prevention.

[LG-77] Estimating before Debiasing: A Bayesian Approach to Detaching Prior Bias in Federated Semi-Supervised Learning

Link: https://arxiv.org/abs/2405.19789
Authors: Guogang Zhu,Xuefeng Liu,Xinghao Wu,Shaojie Tang,Chao Tang,Jianwei Niu,Hao Su
Keywords: Federated Semi-Supervised Learning, Federated Semi-Supervised, Semi-Supervised Learning, introduce prediction bias, leverages both labeled
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by IJCAI 2024

Abstract:Federated Semi-Supervised Learning (FSSL) leverages both labeled and unlabeled data on clients to collaboratively train a model. In FSSL, the heterogeneous data can introduce prediction bias into the model, causing the model’s prediction to skew towards certain classes. Existing FSSL methods primarily tackle this issue by enhancing consistency in model parameters or outputs. However, as the models themselves are biased, merely constraining their consistency is not sufficient to alleviate prediction bias. In this paper, we explore this bias from a Bayesian perspective and demonstrate that it principally originates from label prior bias within the training data. Building upon this insight, we propose a debiasing method for FSSL named FedDB. FedDB utilizes the Average Prediction Probability of Unlabeled Data (APP-U) to approximate the biased prior. During local training, FedDB employs APP-U to refine pseudo-labeling through Bayes’ theorem, thereby significantly reducing the label prior bias. Concurrently, during the model aggregation, FedDB uses APP-U from participating clients to formulate unbiased aggregate weights, thereby effectively diminishing bias in the global model. Experimental results show that FedDB can surpass existing FSSL methods. The code is available at this https URL.
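
The prior-correction idea in this abstract can be illustrated with a small Python sketch. It is not FedDB's actual procedure: it assumes the biased label prior is approximated by the mean predicted probability over unlabeled samples, and that a prediction is debiased by dividing by that prior and renormalizing (Bayes' rule with a uniform target prior). The function names `average_prediction` and `debias` are hypothetical.

```python
def average_prediction(probs):
    """Mean predicted class distribution over unlabeled samples (assumed prior estimate)."""
    n = len(probs)
    k = len(probs[0])
    return [sum(row[c] for row in probs) / n for c in range(k)]


def debias(prob, prior, eps=1e-12):
    """Divide a prediction by the estimated prior and renormalize.

    Assumption: the target prior is uniform, so p'(y|x) is proportional to p(y|x) / prior(y).
    """
    weights = [p / max(q, eps) for p, q in zip(prob, prior)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical predictions from a model that over-predicts class 0.
unlabeled_probs = [[0.8, 0.2], [0.6, 0.4]]
prior = average_prediction(unlabeled_probs)
refined = debias([0.6, 0.4], prior)
pseudo_label = max(range(len(refined)), key=lambda c: refined[c])
```

In this toy case the raw prediction favors class 0, but after correcting for the skewed prior the pseudo-label flips to class 1.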

[LG-78] From Symbolic Tasks to Code Generation: Diversification Yields Better Task Performers

Link: https://arxiv.org/abs/2405.19787
Authors: Dylan Zhang,Justin Wang,Francois Charton
Keywords: tuning large language, large language models, instruction-output pairs, real world, tuning large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)

Abstract:Instruction tuning – tuning large language models on instruction-output pairs – is a promising technique for making models better adapted to the real world. Yet, the key factors driving the model’s capability to understand and follow instructions not seen during training remain under-explored. Our investigation begins with a series of synthetic experiments within the theoretical framework of a Turing-complete algorithm called Markov algorithm, which allows fine-grained control over the instruction-tuning data. Generalization and robustness with respect to the training distribution emerge once a diverse enough set of tasks is provided, even though very few examples are provided for each task. We extend these initial results to a real-world application scenario of code generation and find that a more diverse instruction set, extending beyond code-related tasks, improves the performance of code generation. Our observations suggest that a more diverse semantic space for instruction-tuning sets greatly improves the model’s ability to follow instructions and perform tasks.

[LG-79] Recurrent Deep Kernel Learning of Dynamical Systems

Link: https://arxiv.org/abs/2405.19785
Authors: Nicolò Botteghi,Paolo Motta,Andrea Manzoni,Paolo Zunino,Mengwu Guo
Keywords: Digital twins require, computationally-efficient reduced-order models, twins require computationally-efficient, require computationally-efficient reduced-order, accurately describe complex
Categories: Machine Learning (cs.LG)

Abstract:Digital twins require computationally-efficient reduced-order models (ROMs) that can accurately describe complex dynamics of physical assets. However, constructing ROMs from noisy high-dimensional data is challenging. In this work, we propose a data-driven, non-intrusive method that utilizes stochastic variational deep kernel learning (SVDKL) to discover low-dimensional latent spaces from data and a recurrent version of SVDKL for representing and predicting the evolution of latent dynamics. The proposed method is demonstrated with two challenging examples – a double pendulum and a reaction-diffusion system. Results show that our framework is capable of (i) denoising and reconstructing measurements, (ii) learning compact representations of system states, (iii) predicting system evolution in low-dimensional latent spaces, and (iv) quantifying modeling uncertainties.

[LG-80] PixelsDB: Serverless and Natural-Language-Aided Data Analytics with Flexible Service Levels and Prices

Link: https://arxiv.org/abs/2405.19784
Authors: Haoqiong Bian,Dongyang Geng,Haoyang Li,Anastasia Ailamaki
Keywords: including automated hardware, increasingly popular due, Serverless query processing, Serverless query, serverless query engine
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Abstract:Serverless query processing has become increasingly popular due to its advantages, including automated hardware and software management, high elasticity, and pay-as-you-go pricing. For users who are not system experts, serverless query processing greatly reduces the cost of owning a data analytic system. However, it is still a significant challenge for non-expert users to transform their complex and evolving data analytic needs into proper SQL queries and select a serverless query engine that delivers satisfactory performance and price for each type of query. This paper presents PixelsDB, an open-source data analytic system that allows users who lack system or SQL expertise to explore data efficiently. It allows users to generate and debug SQL queries using a natural language interface powered by fine-tuned language models. The queries are then executed by a serverless query engine that offers varying prices for different service levels on query urgency. The service levels are natively supported by dedicated architecture design and heterogeneous resource scheduling that can apply cost-efficient resources to process non-urgent queries. We envision that the combination of a serverless paradigm, a natural-language-aided interface, and flexible service levels and prices will substantially improve the user experience in data analysis.

[LG-81] Instruction-Guided Visual Masking

Link: https://arxiv.org/abs/2405.19783
Authors: Jinliang Zheng,Jianxiong Li,Sijie Cheng,Yinan Zheng,Jiaming Li,Jihao Liu,Yu Liu,Jingjing Liu,Xianyuan Zhan
Keywords: contemporary LLM, crucial in contemporary, LLM, multimodal, Instruction-guided Visual Masking
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: preprint, 21 pages

Abstract:Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMM and robot model. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at this https URL.

[LG-82] Automatic Graph Topology-Aware Transformer

Link: https://arxiv.org/abs/2405.19779
Authors: Chao Wang,Jiaxuan Zhao,Lingling Li,Licheng Jiao,Fang Liu,Shuyuan Yang
Keywords: Existing efforts, graph Transformer, model representation capabilities, graph Transformer architecture, Transformer
Categories: Neural and Evolutionary Computing (cs.NE); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE (Under Second Review). Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Existing efforts are dedicated to designing many topologies and graph-aware strategies for the graph Transformer, which greatly improve the model’s representation capabilities. However, manually determining the suitable Transformer architecture for a specific graph dataset or task requires extensive expert knowledge and laborious trials. This paper proposes an evolutionary graph Transformer architecture search framework (EGTAS) to automate the construction of strong graph Transformers. We build a comprehensive graph Transformer search space with the micro-level and macro-level designs. EGTAS evolves graph Transformer topologies at the macro level and graph-aware strategies at the micro level. Furthermore, a surrogate model based on generic architectural coding is proposed to directly predict the performance of graph Transformers, substantially reducing the evaluation cost of evolutionary search. We demonstrate the efficacy of EGTAS across a range of graph-level and node-level tasks, encompassing both small-scale and large-scale graph datasets. Experimental results and ablation studies show that EGTAS can construct high-performance architectures that rival state-of-the-art manual and automated baselines.

[LG-83] Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering

Link: https://arxiv.org/abs/2405.19757
Authors: Sungchul Hong,Seunghwan An,Jong-June Jeon
Keywords: generative neural network, network model extend, neural network model, Recent advances, data augmentation methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in generative neural network models extend the development of data augmentation methods. However, the augmentation methods based on the modern generative models fail to achieve notable performance for class imbalance data compared to the conventional model, the SMOTE. We investigate the problem of the generative model for imbalanced classification and introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE). Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty. Then, the data points potentially degrading the augmentation are systematically excluded, and the neighboring observations are directly augmented on the data space. Empirical studies on several imbalanced datasets show that this simple process improves the conventional SMOTE algorithm and outperforms the deep learning models. Consequently, we conclude that the selection of minority data and the interpolation in the data space are beneficial for imbalanced classification problems with a relatively small number of data points.
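
For readers unfamiliar with SMOTE, the data-space interpolation that this abstract builds on can be sketched in Python as follows. This is plain SMOTE-style interpolation only, simplified to a single nearest neighbor (standard SMOTE picks among k neighbors); the VAE-based filtering proposed in the paper is not reproduced here. The function names are hypothetical.

```python
import random


def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` in squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def smote_sample(minority, rng):
    """Create one synthetic minority point on the segment between a
    randomly chosen minority point and its nearest minority neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = nearest_neighbor(base, others)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]


# Toy minority class; all points lie on the diagonal, so samples do too.
minority = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
synthetic = smote_sample(minority, random.Random(0))
```

A filtering step such as the one the abstract describes would remove unsuitable points from `minority` before this interpolation is applied.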

[LG-84] Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed Bandits

Link: https://arxiv.org/abs/2405.19752
Authors: Yuchen He,Zichun Ye,Chihao Zhang
Keywords: pass streaming model, stochastic multi-armed bandit, multi-armed bandit problem, pass streaming, streaming model
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)

Abstract:We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m$, $n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^{P}}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^{P}}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
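
To get a feel for how the number of passes changes this bound, the short Python sketch below evaluates the three exponents, reading them as $1+\frac{2^P-2}{2^{P+1}-1}$ on $(n-m)$, $\frac{2-2^{P+1}}{2^{P+1}-1}$ on $n$, and $\frac{2^P}{2^{P+1}-1}$ on $T$. The function name `regret_exponents` is hypothetical and the code only does the arithmetic; it is not part of the paper.

```python
from fractions import Fraction


def regret_exponents(P):
    """Exact exponents of (n - m), n and T in the regret bound for P passes."""
    denom = 2 ** (P + 1) - 1
    gap_exp = 1 + Fraction(2 ** P - 2, denom)   # exponent of (n - m)
    n_exp = Fraction(2 - 2 ** (P + 1), denom)   # exponent of n
    t_exp = Fraction(2 ** P, denom)             # exponent of T
    return gap_exp, n_exp, t_exp


for passes in (1, 2, 3, 10):
    gap_exp, n_exp, t_exp = regret_exponents(passes)
    print(passes, gap_exp, n_exp, t_exp, float(t_exp))
```

With a single pass the exponent of $T$ is $2/3$, and it decreases toward $1/2$ as $P$ grows.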

[LG-85] Understanding and mitigating difficulties in posterior predictive evaluation

Link: https://arxiv.org/abs/2405.19747
Authors: Abhinav Agrawal,Justin Domke
Keywords: Predictive posterior densities, approximate Bayesian inference, Predictive posterior, approximate Bayesian, simple Monte Carlo
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Predictive posterior densities (PPDs) are of interest in approximate Bayesian inference. Typically, these are estimated by simple Monte Carlo (MC) averages using samples from the approximate posterior. We observe that the signal-to-noise ratio (SNR) of such estimators can be extremely low. An analysis for exact inference reveals SNR decays exponentially as there is an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of the test data relative to the training data. Further analysis extends these results to approximate inference. To remedy the low SNR problem, we propose replacing simple MC sampling with importance sampling using a proposal distribution optimized at test time on a variational proxy for the SNR and demonstrate that this yields greatly improved estimates.

[LG-86] Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Link: https://arxiv.org/abs/2405.19732
Authors: Zixian Guo,Ming Liu,Zhilong Ji,Jinfeng Bai,Yiwen Guo,Wangmeng Zuo
Keywords: skill generally relies, Learning a skill, insightful high-level guidance, skill generally, generally relies
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at this https URL.

[LG-87] Research on Foundation Model for Spatial Data Intelligence: China's 2024 White Paper on Strategic Development of Spatial Data Intelligence

Link: https://arxiv.org/abs/2405.19730
Authors: Shaohua Wang(1),Xing Xie(2),Yong Li(3),Danhuai Guo(4),Zhi Cai(5),Yu Liu(6),Yang Yue(7),Xiao Pan(8),Feng Lu(9),Huayi Wu(10),Zhipeng Gui(10),Zhiming Ding(11),Bolong Zheng(12),Fuzheng Zhang(13),Tao Qin(2),Jingyuan Wang(14),Chuang Tao(15),Zhengchao Chen(1),Hao Lu(16),Jiayi Li(10),Hongyang Chen(17),Peng Yue(10),Wenhao Yu(18),Yao Yao(18),Leilei Sun(14),Yong Zhang(5),Longbiao Chen(19),Xiaoping Du(20),Xiang Li(21),Xueying Zhang(22),Kun Qin(10),Zhaoya Gong(6),Weihua Dong(23),Xiaofeng Meng(24) ((1) Aerospace Information Research Institute, Chinese Academy of Sciences, (2) Microsoft Research Asia, (3) Tsinghua University, (4) Beijing University of Chemical Technology, (5) Beijing University of Technology, (6) Peking University, (7) Shenzhen University, (8) Shijiazhuang Tiedao University, (9) Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, (10) Wuhan University, (11) Institute of Software, Chinese Academy of Sciences, (12) Huazhong University of Science and Technology, (13) Kuaishou Natural Language Processing Center and Audio Center, (14) Beijing University of Aeronautics and Astronautics, (15) Shanghai Figure Interesting Information Technology Co., Ltd., (16) SuperMap Software Co., Ltd., (17) Zhejiang Lab, (18) China University of Geosciences (Wuhan), (19) Xiamen University, (20) Key Laboratory of Digital Earth, Chinese Academy of Sciences, (21) East China Normal University, (22) Nanjing Normal University, (23) Beijing Normal University, (24) Renmin University of China)
Keywords: data intelligent large, spatial data intelligent, intelligent large models, data intelligent, intelligent large
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in Chinese language

Abstract:This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.

[LG-88] Dynamic feature selection in medical predictive monitoring by reinforcement learning

Link: https://arxiv.org/abs/2405.19729
Authors: Yutong Chen,Jiandong Gao,Ji Wu
Keywords: multivariate time-series scenario, investigate dynamic feature, investigate dynamic, common occurrence, time-series scenario
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preview version

Abstract:In this paper, we investigate dynamic feature selection within multivariate time-series scenario, a common occurrence in clinical prediction monitoring where each feature corresponds to a bio-test result. Many existing feature selection methods fall short in effectively leveraging time-series information, primarily because they are designed for static data. Our approach addresses this limitation by enabling the selection of time-varying feature subsets for each patient. Specifically, we employ reinforcement learning to optimize a policy under maximum cost restrictions. The prediction model is subsequently updated using synthetic data generated by trained policy. Our method can seamlessly integrate with non-differentiable prediction models. We conducted experiments on a sizable clinical dataset encompassing regression and classification tasks. The results demonstrate that our approach outperforms strong feature selection baselines, particularly when subjected to stringent cost limitations. Code will be released once paper is accepted.

[LG-89] SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Link: https://arxiv.org/abs/2405.19715
Authors: Kaixuan Huang,Xudong Guo,Mengdi Wang
Keywords: target large language, large language model, Markov Decision Process, Speculative decoding reduces, faster draft model
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K – the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B and 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
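
The stopping rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the prediction head outputs a conditional acceptance probability for each drafted token, so the probability that all tokens so far are accepted is their product. The names `should_stop` and `candidate_length`, and the `max_len` cap, are hypothetical.

```python
def should_stop(accept_probs, threshold):
    """True when P(at least one drafted token is rejected) exceeds the threshold."""
    p_all_accepted = 1.0
    for p in accept_probs:
        p_all_accepted *= p
    return 1.0 - p_all_accepted > threshold


def candidate_length(accept_probs, threshold, max_len):
    """Draft tokens one by one and stop as soon as the rule fires (or at max_len).

    Returns how many tokens are handed to the target model for verification.
    """
    drafted = []
    for p in accept_probs[:max_len]:
        drafted.append(p)
        if should_stop(drafted, threshold):
            break
    return len(drafted)


# Hypothetical head outputs: each token is accepted with probability 0.9.
length = candidate_length([0.9, 0.9, 0.9, 0.9, 0.9], threshold=0.25, max_len=10)
```

In this example the rule fires after the third token, since $1 - 0.9^3 \approx 0.271$ is the first value above the threshold of 0.25.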

[LG-90] Universal Online Convex Optimization with 1 Projection per Round

Link: https://arxiv.org/abs/2405.19705
Authors: Wenhao Yang,Yibo Wang,Peng Zhao,Lijun Zhang
Keywords: online convex optimization, attain minimax rates, universal OCO algorithms, recent progress, address the uncertainty
Categories: Machine Learning (cs.LG)

Abstract:To address the uncertainty in function types, recent progress in online convex optimization (OCO) has spurred the development of universal algorithms that simultaneously attain minimax rates for multiple types of convex functions. However, for a $T$-round online problem, state-of-the-art methods typically conduct $O(\log T)$ projections onto the domain in each round, a process potentially time-consuming with complicated feasible sets. In this paper, inspired by the black-box reduction of Cutkosky and Orabona (2018), we employ a surrogate loss defined over simpler domains to develop universal OCO algorithms that only require 1 projection. Embracing the framework of prediction with expert advice, we maintain a set of experts for each type of functions and aggregate their predictions via a meta-algorithm. The crux of our approach lies in a uniquely designed expert-loss for strongly convex functions, stemming from an innovative decomposition of the regret into the meta-regret and the expert-regret. Our analysis sheds new light on the surrogate loss, facilitating a rigorous examination of the discrepancy between the regret of the original loss and that of the surrogate loss, and carefully controlling meta-regret under the strong convexity condition. In this way, with only 1 projection per round, we establish optimal regret bounds for general convex, exponentially concave, and strongly convex functions simultaneously. Furthermore, we enhance the expert-loss to exploit the smoothness property, and demonstrate that our algorithm can attain small-loss regret for multiple types of convex and smooth functions.

[LG-91] Towards a Better Evaluation of Out-of-Domain Generalization

Link: https://arxiv.org/abs/2405.19703
Authors: Duhun Hwang,Suhyun Kang,Moonjung Eo,Jimyeong Kim,Wonjong Rhee
Keywords: unseen test distributions, previously unseen test, average measure, achieving high performance, domain generalization performance
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The objective of Domain Generalization (DG) is to devise algorithms and models capable of achieving high performance on previously unseen test distributions. In the pursuit of this objective, average measure has been employed as the prevalent measure for evaluating models and comparing algorithms in the existing DG studies. Despite its significance, a comprehensive exploration of the average measure has been lacking and its suitability in approximating the true domain generalization performance has been questionable. In this study, we carefully investigate the limitations inherent in the average measure and propose worst+gap measure as a robust alternative. We establish theoretical grounds of the proposed measure by deriving two theorems starting from two different assumptions. We conduct extensive experimental investigations to compare the proposed worst+gap measure with the conventional average measure. Given the indispensable need to access the true DG performance for studying measures, we modify five existing datasets to come up with SR-CMNIST, C-CatsDogs, L-CIFAR10, PACS-corrupted, and VLCS-corrupted datasets. The experiment results unveil an inferior performance of the average measure in approximating the true DG performance and confirm the robustness of the theoretically supported worst+gap measure.

[LG-92] Diffusion Policies creating a Trust Region for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2405.19690
Authors: Tianyu Chen, Zhendong Wang, Mingyuan Zhou
Keywords: Offline reinforcement learning, leverages pre-collected datasets, train optimal policies, reinforcement learning, leverages pre-collected
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Offline reinforcement learning (RL) leverages pre-collected datasets to train optimal policies. Diffusion Q-Learning (DQL), introducing diffusion models as a powerful and expressive policy class, significantly boosts the performance of offline RL. However, its reliance on iterative denoising sampling to generate actions slows down both training and inference. While several recent attempts have tried to accelerate diffusion-QL, the improvement in training and/or inference speed often results in degraded performance. In this paper, we introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy. We bridge the two policies by a newly introduced diffusion trust region loss. The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy. DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient. We evaluate its effectiveness and algorithmic characteristics against popular Kullback-Leibler (KL) based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds. The PyTorch implementation will be made available.

[LG-93] Breaking Indistinguishability with Transfer Learning: A First Look at SPECK32/64 Lightweight Block Ciphers

Link: https://arxiv.org/abs/2405.19683
Authors: Jimmy Dani, Kalyan Nakka, Nitesh Saxena
Keywords: Cipher Block Chaining, Block Chaining, CBC mode, algorithm in CBC, attack framework
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:In this research, we introduce MIND-Crypt, a novel attack framework that uses deep learning (DL) and transfer learning (TL) to challenge the indistinguishability of block ciphers, specifically the SPECK32/64 encryption algorithm in CBC mode (Cipher Block Chaining) against Known Plaintext Attacks (KPA). Our methodology includes training a DL model with ciphertexts of two messages encrypted using the same key. The selected messages have the same byte-length and differ by only one bit at the binary level. This DL model employs a residual network architecture. For the TL, we use the trained DL model as a feature extractor, and these features are then used to train a shallow machine learning model, such as XGBoost. This dual strategy aims to distinguish ciphertexts of two encrypted messages, addressing traditional cryptanalysis challenges. Our findings demonstrate that the DL model achieves an accuracy of approximately 99% under consistent cryptographic conditions (Same Key or Rounds) with the SPECK32/64 cipher. However, performance degrades to random guessing levels (50%) when tested with ciphertext generated from different keys or different encryption rounds of SPECK32/64. To enhance the results, the DL model requires retraining with different keys or encryption rounds using larger datasets (10^7 samples). To overcome this limitation, we implement TL, achieving an accuracy of about 53% with just 10,000 samples, which is better than random guessing. Further training with 580,000 samples increases accuracy to nearly 99%, showing a substantial reduction in data requirements by over 94%. This shows that an attacker can utilize machine learning models to break indistinguishability by accessing pairs of plaintexts and their corresponding ciphertexts encrypted with the same key, without directly interacting with the communicating parties.
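
The "over 94%" figure can be checked with a line of arithmetic. The sketch below assumes the reduction compares the 580,000 transfer-learning samples against the 10^7 samples quoted for full retraining; the abstract does not state the baseline explicitly, and the function name is illustrative.

```python
# Sample counts quoted in the abstract above.
full_retraining_samples = 10**7      # DL model retrained on new keys/rounds
transfer_learning_samples = 580_000  # TL pipeline reaching nearly 99% accuracy


def relative_reduction(baseline: int, reduced: int) -> float:
    """Fraction of the baseline sample count saved by using the reduced count."""
    return 1.0 - reduced / baseline


# Assumption: the baseline for the quoted reduction is the 10^7-sample retraining.
reduction = relative_reduction(full_retraining_samples, transfer_learning_samples)
print(f"data reduction: {reduction:.1%}")
```

Under that assumption the saving is 1 - 580,000 / 10,000,000 = 0.942, which is consistent with the "over 94%" claim.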

[LG-94] Efficient Trajectory Inference in Wasserstein Space Using Consecutive Averaging

Link: https://arxiv.org/abs/2405.19679
Authors: Amartya Banerjee, Harlin Lee, Nir Sharon, Caroline Moosmüller
Keywords: computational biology, Capturing data, cross-sectional measurements, dynamic processes, Capturing
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)

Abstract:Capturing data from dynamic processes through cross-sectional measurements is seen in many fields such as computational biology. Trajectory inference deals with the challenge of reconstructing continuous processes from such observations. In this work, we propose methods for B-spline approximation and interpolation of point clouds through consecutive averaging that is intrinsic to the Wasserstein space. Combining subdivision schemes with optimal transport-based geodesics, our methods carry out trajectory inference at a chosen level of precision and smoothness, and can automatically handle scenarios where particles undergo division over time. We rigorously evaluate our method by providing convergence guarantees and testing it on simulated cell data characterized by bifurcations and merges, comparing its performance against state-of-the-art trajectory inference and interpolation methods. The results not only underscore the effectiveness of our method in inferring trajectories, but also highlight the benefit of performing interpolation and approximation that respect the inherent geometric properties of the data.

[LG-95] Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models

Link: https://arxiv.org/abs/2405.19673
Authors: Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani
Keywords: protein sequence design, AI-driven design problems, feasible design space, protein sequence, generative modeling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Under review

Abstract:AI-driven design problems, such as DNA/protein sequence design, are commonly tackled from two angles: generative modeling, which efficiently captures the feasible design space (e.g., natural images or biological sequences), and model-based optimization, which utilizes reward models for extrapolation. To combine the strengths of both approaches, we adopt a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL. Although prior work has explored similar avenues, they primarily focus on scenarios where accurate reward models are accessible. In contrast, we concentrate on an offline setting where a reward model is unknown, and we must learn from static offline datasets, a common scenario in scientific domains. In offline scenarios, existing approaches tend to suffer from overoptimization, as they may be misled by the reward model in out-of-distribution regions. To address this, we introduce a conservative fine-tuning approach, BRAID, by optimizing a conservative reward model, which includes additional penalization outside of offline data distributions. Through empirical and theoretical analysis, we demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models while avoiding the generation of invalid designs through pre-trained diffusion models.

[LG-96] CRIS: Collaborative Refinement Integrated with Segmentation for Polyp Segmentation

Link: https://arxiv.org/abs/2405.19672
Authors: Ankush Gajanan Arudkar, Bernard J.E. Evans
Keywords: early prevention heavily, prevention heavily rely, Accurate detection, precise polyp identification, gastrointestinal colonoscopy
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Accurate detection of colorectal cancer and early prevention heavily rely on precise polyp identification during gastrointestinal colonoscopy. Due to limited data, many current state-of-the-art deep learning methods for polyp segmentation often rely on post-processing of masks to reduce noise and enhance results. In this study, we propose an approach that integrates mask refinement and binary semantic segmentation, leveraging a novel collaborative training strategy that surpasses current widely-used refinement strategies. We demonstrate the superiority of our approach through comprehensive evaluation on established benchmark datasets and its successful application across various medical image segmentation architectures.

[LG-97] Reconciling Model Multiplicity for Downstream Decision Making

Link: https://arxiv.org/abs/2405.19667
Authors: Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu
Keywords: downstream loss function, predictive models, individual probability prediction, loss function, best-response actions
Subjects: Machine Learning (cs.LG)
Comments: 16 pages main body, 6 figures

Abstract:We consider the problem of model multiplicity in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on the best-response action for a downstream loss function. We show that even when the two predictive models approximately agree on their individual predictions almost everywhere, it is still possible for their induced best-response actions to differ on a substantial portion of the population. We address this issue by proposing a framework that calibrates the predictive models with regard to both the downstream decision-making problem and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true probability distribution to the decision-maker. We extend our results to the setting where one does not have direct access to the true probability distribution and instead relies on a set of i.i.d. data to be the empirical distribution. Finally, we provide a set of experiments to empirically evaluate our methods: compared to existing work, our proposed algorithm creates a pair of predictive models that both improve downstream decision-making losses and agree on their best-response actions almost everywhere.
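
A small illustration of the phenomenon the abstract describes, not of the paper's reconciliation algorithm: two hypothetical models whose predictions never differ by more than 0.02 can still induce different best-response actions when the predictions straddle a decision threshold. The loss table, action names, and prediction values below are all made up for the example.

```python
def best_response(p: float, loss: dict) -> str:
    """Return the action with the lowest expected loss when P(y = 1) = p.

    loss[action] = (loss if y = 0, loss if y = 1).
    """
    return min(loss, key=lambda a: (1 - p) * loss[a][0] + p * loss[a][1])


# Hypothetical downstream loss: "treat" costs 1 when y = 0 and "wait" costs 1
# when y = 1, so the best response flips at p = 0.5.
LOSS = {"treat": (1.0, 0.0), "wait": (0.0, 1.0)}

# Two hypothetical models that agree to within 0.02 on every individual.
model_a = [0.49, 0.49, 0.90, 0.10]
model_b = [0.51, 0.51, 0.91, 0.09]

actions_a = [best_response(p, LOSS) for p in model_a]
actions_b = [best_response(p, LOSS) for p in model_b]

# Fraction of individuals on which the induced actions disagree.
disagreement = sum(a != b for a, b in zip(actions_a, actions_b)) / len(model_a)
print(actions_a, actions_b, disagreement)
```

Here the two models disagree on the action for half of the individuals even though their probability estimates are nearly identical, which is the gap the proposed calibration framework targets.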

[LG-98] A novel fault localization with data refinement for hydroelectric units

Link: https://arxiv.org/abs/2405.19665
Authors: Jialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai Fan
Keywords: traditional hydroelectric unit, unit fault localization, hydroelectric unit fault, hydroelectric units, fault localization methods
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, Conference on Decision and Control (CDC)

Abstract:Due to the scarcity of fault samples and the complexity of data with non-linear and non-smooth characteristics in hydroelectric units, most traditional hydroelectric unit fault localization methods struggle to localize faults accurately. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)-manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, an SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Because the signals involve non-linear and non-smooth characteristics, an improved WNR, which combines soft and hard thresholding, and local linear embedding (LLE) are used in the data preprocessing module to reduce noise and effectively capture local features. In addition, to seek higher performance, a novel Adaptive Boost (AdaBoost) combined with multiple deep learning models is proposed to achieve accurate fault localization. The experimental results show that SG-WMBDL can locate faults in hydroelectric units from a small number of fault samples with non-linear and non-smooth characteristics with higher precision and accuracy than other frontier methods, which verifies the effectiveness and practicality of the proposed method.

[LG-99] MGCP: A Multi-Grained Correlation based Prediction Network for Multivariate Time Series

Link: https://arxiv.org/abs/2405.19661
Authors: Zhicheng Chen, Xi Xiao, Ke Xu, Zhong Zhang, Yu Rong, Qing Li, Guojun Gan, Zhiqiang Xu, Peilin Zhao
Keywords: Multivariate time series, time series prediction, poses significant challenges, significant challenges due, time series
Subjects: Machine Learning (cs.LG)

Abstract:Multivariate time series prediction is widely used in daily life, which poses significant challenges due to the complex correlations that exist at multi-grained levels. Unfortunately, the majority of current time series prediction models fail to simultaneously learn the correlations of multivariate time series at multi-grained levels, resulting in suboptimal performance. To address this, we propose a Multi-Grained Correlations-based Prediction (MGCP) Network, which simultaneously considers the correlations at three granularity levels to enhance prediction performance. Specifically, MGCP utilizes Adaptive Fourier Neural Operators and Graph Convolutional Networks to learn the global spatiotemporal correlations and inter-series correlations, enabling the extraction of potential features from multivariate time series at fine-grained and medium-grained levels. Additionally, MGCP employs adversarial training with an attention mechanism-based predictor and conditional discriminator to optimize prediction results at coarse-grained level, ensuring high fidelity between the generated forecast results and the actual data distribution. Finally, we compare MGCP with several state-of-the-art time series prediction algorithms on real-world benchmark datasets, and our results demonstrate the generality and effectiveness of the proposed model.

[LG-100] SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems

Link: https://arxiv.org/abs/2405.19653
Authors: Patrick Emami, Zhaonan Li, Saumya Sinha, Truc Nguyen
Keywords: computational scientists study, Data-driven simulation surrogates, Data-driven simulation, scientists study complex, computational scientists
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 17 pages. Under review

Abstract:Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a “system caption”, or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.

[LG-101] Few for Many: Tchebycheff Set Scalarization for Many-Objective Optimization

Link: https://arxiv.org/abs/2405.19650
Authors: Xi Lin, Yilu Liu, Xiaoyuan Zhang, Fei Liu, Zhenkun Wang, Qingfu Zhang
Keywords: Multi-objective optimization, real-world applications, Pareto solutions, Multi-objective, objectives
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)

Abstract:Multi-objective optimization can be found in many real-world applications where some conflicting objectives cannot be optimized by a single solution. Existing optimization methods often focus on finding a set of Pareto solutions with different optimal trade-offs among the objectives. However, the required number of solutions to well approximate the whole Pareto optimal set could be exponentially large with respect to the number of objectives, which makes these methods unsuitable for handling many optimization objectives. In this work, instead of finding a dense set of Pareto solutions, we propose a novel Tchebycheff set scalarization method to find a few representative solutions (e.g., 5) to cover a large number of objectives (e.g., 100) in a collaborative and complementary manner. In this way, each objective can be well addressed by at least one solution in the small solution set. In addition, we further develop a smooth Tchebycheff set scalarization approach for efficient optimization with good theoretical guarantees. Experimental studies on different problems with many optimization objectives demonstrate the effectiveness of our proposed method.
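
The idea that "each objective can be well addressed by at least one solution" suggests a max-min scalarization: score each objective by the best solution in the set, then take the worst weighted gap across objectives. The sketch below is one reading of that idea, assuming minimization; the weights, ideal point, and objective values are hypothetical, and the exact formulation in the paper may differ.

```python
def tch_set_scalarization(objective_values, weights, ideal_point):
    """Worst weighted gap over objectives, where each objective is scored
    by the best (smallest) value any solution in the set achieves.

    objective_values[k][i] = value of objective i at solution k.
    """
    num_objectives = len(weights)
    return max(
        weights[i] * (min(row[i] for row in objective_values) - ideal_point[i])
        for i in range(num_objectives)
    )


# Two hypothetical solutions covering four objectives.
values = [
    [0.1, 0.2, 0.9, 0.8],  # solution 0 handles objectives 0 and 1 well
    [0.9, 0.7, 0.1, 0.3],  # solution 1 handles objectives 2 and 3 well
]
weights = [1.0, 1.0, 1.0, 1.0]
ideal = [0.0, 0.0, 0.0, 0.0]

score_pair = tch_set_scalarization(values, weights, ideal)        # both solutions
score_single = tch_set_scalarization(values[:1], weights, ideal)  # solution 0 only
print(score_pair, score_single)
```

With both solutions the worst-covered objective has a gap of 0.3, while solution 0 alone leaves a gap of 0.9, so the pair complements each other in the sense the abstract describes. The outer `max` is non-smooth, which is presumably what motivates the smooth variant mentioned above.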

[LG-102] Towards Deeper Understanding of PPR-based Embedding Approaches: A Topological Perspective

Link: https://arxiv.org/abs/2405.19649
Authors: Xingyi Zhang, Zixuan Weng, Sibo Wang
Keywords: learns low-dimensional vectors, embedding learns low-dimensional, Node embedding learns, node embedding approaches, embedding approaches
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Node embedding learns low-dimensional vectors for nodes in the graph. Recent state-of-the-art embedding approaches take Personalized PageRank (PPR) as the proximity measure and factorize the PPR matrix or its adaptation to generate embeddings. However, little previous work analyzes what information is encoded by these approaches, and how the information correlates with their superb performance in downstream tasks. In this work, we first show that state-of-the-art embedding approaches that factorize a PPR-related matrix can be unified into a closed-form framework. Then, we study whether the embeddings generated by this strategy can be inverted to better recover the graph topology information than random-walk based embeddings. To achieve this, we propose two methods for recovering graph topology via PPR-based embeddings, including the analytical method and the optimization method. Extensive experimental results demonstrate that the embeddings generated by factorizing a PPR-related matrix maintain more topological information, such as common edges and community structures, than that generated by random walks, paving a new way to systematically comprehend why PPR-based node embedding approaches outperform random walk-based alternatives in various downstream tasks. To the best of our knowledge, this is the first work that focuses on the interpretability of PPR-based node embedding approaches.

[LG-103] Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Link: https://arxiv.org/abs/2405.19648
Authors: Ernesto Quevedo, Jorge Yero, Rachel Koerner, Pablo Rivas, Tomas Cerny
Keywords: Large Language Models, produce inaccurate outputs, Language Models, Large Language, propensity of Large
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICAI’24 - The 26th Int’l Conf on Artificial Intelligence

Abstract:Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources and rely on extensive LLMs or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses difficult to reproduce and largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers utilizing only four numerical features derived from tokens and vocabulary probabilities obtained from other LLM evaluators, which are not necessarily the same. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly at this https URL.

[LG-104] FTS: A Framework to Find a Faithful TimeSieve

Link: https://arxiv.org/abs/2405.19647
Authors: Songning Lai, Ninghui Feng, Haochen Sui, Ze Ma, Hao Wang, Zichen Song, Hang Zhao, Yutao Yue
Keywords: demonstrates impressive performance, time series forecasting, garnered significant attention, recent years, prompting the development
Subjects: Machine Learning (cs.LG)

Abstract:The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds and minute input noise perturbations. Recognizing these challenges, we embark on a quest to define the concept of Faithful TimeSieve (FTS), a model that consistently delivers reliable and robust predictions. To address these issues, we propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model’s stability and resilience, ensuring that its outputs are less susceptible to the aforementioned factors. Experimentation validates the effectiveness of our proposed framework, demonstrating improved faithfulness in the model’s behavior. Looking forward, we plan to expand our experimental scope to further validate and optimize our algorithm, ensuring comprehensive faithfulness across a wide range of scenarios. Ultimately, we aspire for this framework to be applicable to enhancing the faithfulness of not just TimeSieve but also other state-of-the-art temporal methods, thereby contributing to the reliability and robustness of temporal modeling as a whole.

[LG-105] EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos

Link: https://arxiv.org/abs/2405.19644
Authors: Ryo Fujii, Masashi Hatano, Hideo Saito, Hiroki Kajita
Keywords: Surgical phase recognition, modern operating room, Surgical phase, phase recognition, open surgery video
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Early accepted by MICCAI 2024

Abstract:Surgical phase recognition has gained significant attention due to its potential to offer solutions to numerous demands of the modern operating room. However, most existing methods concentrate on minimally invasive surgery (MIS), leaving surgical phase recognition for open surgery understudied. This discrepancy is primarily attributed to the scarcity of publicly available open surgery video datasets for surgical phase recognition. To address this issue, we introduce a new egocentric open surgery video dataset for phase recognition, named EgoSurgery-Phase. This dataset comprises 15 hours of real open surgery videos spanning 9 distinct surgical phases all captured using an egocentric camera attached to the surgeon’s head. In addition to video, EgoSurgery-Phase offers eye gaze. As far as we know, it is the first real open surgery video dataset for surgical phase recognition publicly available. Furthermore, inspired by the notable success of masked autoencoders (MAEs) in video understanding tasks (e.g., action recognition), we propose a gaze-guided masked autoencoder (GGMAE). Considering the regions where surgeons’ gaze focuses are often critical for surgical phase recognition (e.g., surgical field), in our GGMAE, the gaze information acts as an empirical semantic richness prior to guide the masking process, promoting better attention to semantically rich spatial regions. GGMAE significantly improves on the previous state-of-the-art recognition method (6.4% in Jaccard) and the masked autoencoder-based method (3.1% in Jaccard) on EgoSurgery-Phase. The dataset will be released at this https URL.

[LG-106] Easy Problems That LLMs Get Wrong

Link: https://arxiv.org/abs/2405.19616
Authors: Sean Williams, James Huckle
Keywords: comprehensive Linguistic Benchmark, Linguistic Benchmark designed, Large Language Models, Large Language, Linguistic Benchmark
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: AutogenAI Ltd. Associated code at this https URL

Abstract:We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.

[LG-107] Do spectral cues matter in contrast-based graph self-supervised learning?

Link: https://arxiv.org/abs/2405.19600
Authors: Xiangru Jian, Xinjian Zhao, Wei Pang, Chaolong Ying, Yimu Wang, Yaoyao Xu, Tianshu Yu
Keywords: graph self-supervised learning, contrast-based graph self-supervised, self-supervised learning, graph self-supervised, recent surge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The recent surge in contrast-based graph self-supervised learning has prominently featured an intensified exploration of spectral cues. However, an intriguing paradox emerges, as methods grounded in seemingly conflicting assumptions or heuristic approaches regarding the spectral domain demonstrate notable enhancements in learning performance. This paradox prompts a critical inquiry into the genuine contribution of spectral information to contrast-based graph self-supervised learning. This study undertakes an extensive investigation into this inquiry, conducting a thorough study of the relationship between spectral characteristics and the learning outcomes of contemporary methodologies. Based on this analysis, we claim that the effectiveness and significance of spectral information need to be questioned. Instead, we revisit simple edge perturbation: random edge dropping designed for node-level self-supervised learning and random edge adding intended for graph-level self-supervised learning. Compelling evidence is presented that these simple yet effective strategies consistently yield superior performance while demanding significantly fewer computational resources compared to all prior spectral augmentation methods. The proposed insights represent a significant leap forward in the field, potentially reshaping the understanding and implementation of graph self-supervised learning.
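
The two edge perturbations named in the abstract are simple enough to sketch on a plain edge list. This is a minimal illustration, assuming an undirected graph with integer node IDs; the function names, the drop probability, and the number of added edges are hypothetical choices, not settings from the paper.

```python
import random


def drop_edges(edges, drop_prob, rng):
    """Keep each edge independently with probability 1 - drop_prob."""
    return [e for e in edges if rng.random() >= drop_prob]


def add_edges(edges, num_nodes, num_new, rng):
    """Add up to num_new random undirected edges that are not already present."""
    existing = {tuple(sorted(e)) for e in edges}
    candidates = [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if (u, v) not in existing
    ]
    extra = rng.sample(candidates, min(num_new, len(candidates)))
    return list(edges) + extra


rng = random.Random(0)  # fixed seed so the augmented views are reproducible
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
dropped_view = drop_edges(edges, drop_prob=0.5, rng=rng)        # node-level view
added_view = add_edges(edges, num_nodes=4, num_new=1, rng=rng)  # graph-level view
```

Neither function needs an eigendecomposition of the graph, which is consistent with the abstract's point that these augmentations are far cheaper than spectral ones. Enumerating all candidate edges is quadratic in the number of nodes, so a sampling-based variant would be preferable for large graphs.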

[LG-108] SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors

Link: https://arxiv.org/abs/2405.19597
Authors: Vijay Lingam, Atula Tejaswi, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Alex Dimakis, Eunsol Choi, Aleksandar Bojchevski, Sujay Sanghavi
Keywords: Popular parameter-efficient fine-tuning, freeze pre-trained model, Popular parameter-efficient, pre-trained model weights, inject learnable matrices
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 5 figures, 14 tables

Abstract:Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its variants, freeze pre-trained model weights (W) and inject learnable matrices (\Delta W). These (\Delta W) matrices are structured for efficient parameterization, often using techniques like low-rank approximations or scaling vectors. However, these methods typically show a performance gap compared to full fine-tuning. Although recent PEFT methods have narrowed this gap, they do so at the cost of additional learnable parameters. We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on (\Delta W) depends on the specific weight matrix (W). Specifically, SVFT updates (W) as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. This approach allows fine-grained control over expressivity through the number of coefficients. Extensive experiments on language and vision benchmarks show that SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.
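
The update described in words above can be written out as follows. The notation is an assumption made for illustration: u_i and v_j denote the left and right singular vectors of W, \Omega is a fixed sparsity pattern, and m_{ij} are the trainable coefficients; the abstract itself does not fix these symbols.

```latex
W = U \Sigma V^{\top}, \qquad
\Delta W = \sum_{(i,j) \in \Omega} m_{ij}\, u_i v_j^{\top} = U M V^{\top}, \qquad
W' = W + \Delta W
```

Here M is the sparse matrix with entries m_{ij} on \Omega and zeros elsewhere. U and V stay frozen, so the number of trainable parameters equals |\Omega|, which is how the number of coefficients controls expressivity. Choosing \Omega as the diagonal, for example, would train one coefficient per singular direction.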

[LG-109] Why Larger Language Models Do In-context Learning Differently?

Link: https://arxiv.org/abs/2405.19592
Authors: Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang
Keywords: unseen tasks based, in-context learning, unseen tasks, tasks based, ICL behaviors
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

[LG-110] Weights Augmentation: it has never ever ever ever let her model down

Link: https://arxiv.org/abs/2405.19590
Authors: Junbin Zhuang, Guiguang Din, Yunyi Yan
Keywords: deep learning network, Weight Augmentation Strategy, Weight, weight augmentation, play an essential
Subjects: Machine Learning (cs.LG)

Abstract:Weights play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of Weight Augmentation Strategy (WAS) is to adopt random transformed weight coefficients training and transformed coefficients, named Shadow Weight (SW), for networks that can be used to calculate loss function to affect parameter updates. However, stochastic gradient descent is applied to Plain Weight (PW), which is referred to as the original weight of the network before the random transformation. During training, numerous SW collectively form high-dimensional space, while PW is directly learned from the distribution of SW instead of the data. The weight of the accuracy-oriented mode (AOM) relies on PW, which guarantees the network is highly robust and accurate. The desire-oriented mode (DOM) weight uses SW, which is determined by the network model’s unique functions based on WAT’s performance desires, such as lower computational complexity, lower sensitivity to particular data, etc. The two modes can be switched at any time if needed. WAT extends the augmentation technique from data augmentation to weight, and it is easy to understand and implement, but it can improve almost all networks amazingly. Our experimental results show that convolutional neural networks, such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobileNetV2, and EfficientNet-Lite, can benefit much at little or no cost. Model accuracy on the CIFAR100 and CIFAR10 datasets increases by 7.32% and 9.28%, respectively, with the highest gains being 13.42% and 18.93%, respectively. In addition, DOM can reduce floating point operations (FLOPs) by up to 36.33%. The code is available at this https URL.

[LG-111] SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Link: https://arxiv.org/abs/2405.19586
Authors: Junjie Zhang,Chenjia Bai,Haoran He,Wenke Xia,Zhigang Wang,Bin Zhao,Xiu Li,Xuelong Li
Keywords: Acquiring a multi-task, multi-task imitation policy, manipulation poses challenges, challenges in terms, poses challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICML 2024. Project page: this https URL

Abstract:Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot’s end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.

[LG-112] Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding

Link: https://arxiv.org/abs/2405.19567
Authors: Shenghuan Sun,Gregory M. Goldgof,Alexander Schubert,Zhiqing Sun,Thomas Hartvigsen,Atul J. Butte,Ahmed Alaa
Keywords: natural language interactions, treatment tasks, support clinicians, images and engaging, engaging in natural
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code available at: this https URL

Abstract:Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit “hallucinogenic” behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we do not only require VLM outputs to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.

[LG-113] Selective Explanations

Link: https://arxiv.org/abs/2405.19562
Authors: Lucas Monteiro Paes,Dennis Wei,Flavio P. Calmon
Keywords: explain black-box machine, assigning importance scores, methods explain black-box, black-box machine learning, Feature attribution
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there have been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores with only one inference. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers and their high-quality counterparts.

[LG-114] Clustering Mixtures of Discrete Distributions: A Note on Mitra's Algorithm

Link: https://arxiv.org/abs/2405.19559
Authors: Mohamed Seif,Yanxi Chen
Keywords: classifying general discrete, general discrete mixture, discrete mixture distribution, Mitra algorithm
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In this note, we provide a refined analysis of Mitra’s algorithm \cite{mitra2008clustering} for classifying general discrete mixture distribution models. Built upon spectral clustering \cite{mcsherry2001spectral}, this algorithm offers compelling conditions for probability distributions. We enhance this analysis by tailoring the model to bipartite stochastic block models, resulting in more refined conditions. We thereby obtain improved separation conditions compared with those derived in \cite{mitra2008clustering}.

[LG-115] Stress-Testing Capability Elicitation With Password-Locked Models

Link: https://arxiv.org/abs/2405.19550
Authors: Ryan Greenblatt,Fabien Roger,Dmitrii Krasheninnikov,David Krueger
Keywords: capabilities, large language models, determine the safety, safety of large, large language
Categories: Machine Learning (cs.LG)
Comments:

Abstract:To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM’s full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models’ (hidden) capabilities exceed those of human demonstrators.

[LG-116] RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning

Link: https://arxiv.org/abs/2405.19548
Authors: Mingqi Yuan,Roger Creus Castanyer,Bo Li,Xin Jin,Glen Berseth,Wenjun Zeng
Keywords: guide reinforcement learning, effectively guide reinforcement, Extrinsic rewards, extrinsic rewards frequently, reinforcement learning
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 19 figures

Abstract:Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. The source code for RLeXplore is available at this https URL.

[LG-117] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Link: https://arxiv.org/abs/2405.19547
Authors: Yiping Wang,Yifang Chen,Wendan Yan,Alex Fang,Wenjing Zhou,Kevin Jamieson,Simon Shaolei Du
Keywords: noisy web-curated datasets, visual-language model pretraining, large-scale visual-language model, Data selection, CLIP
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper supersedes our previous VAS paper ( arXiv:2402.02055 )

Abstract:Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp \cite{gadre2023datacomp}. Compared to the best baseline using only OpenAI’s CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN \cite{fang2023data} and HYPE \cite{kim2024hype}, we can boost average performance on downstream tasks by 0.9%, achieving a new state-of-the-art.
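
To make the negCLIPLoss idea more concrete, here is a minimal sketch of one plausible reading of the abstract: the per-sample score is the usual image-text alignment minus a normalization term computed from the sample's contrastive pairs, which is the negative of a per-sample symmetric CLIP loss. The function names, the in-batch similarity matrix `sim`, and the temperature value are assumptions made for illustration; the paper's exact definition (batching, temperature, averaging) may differ.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def neg_clip_loss_scores(sim, temperature=0.07):
    """Score each image-text pair in a batch (hypothetical helper).

    sim[i][j] is the cosine similarity between image i and text j.
    The score is the pair's own alignment minus the average of the
    image-to-text and text-to-image normalization terms.
    Higher (closer to zero) suggests a cleaner pair.
    """
    n = len(sim)
    scores = []
    for i in range(n):
        row = [sim[i][j] / temperature for j in range(n)]  # image i vs all texts
        col = [sim[j][i] / temperature for j in range(n)]  # text i vs all images
        norm = 0.5 * (logsumexp(row) + logsumexp(col))
        scores.append(sim[i][i] / temperature - norm)
    return scores


# Hypothetical batch: pairs 0 and 1 are well aligned, pair 2 is ambiguous.
sim = [
    [0.90, 0.10, 0.10],
    [0.10, 0.90, 0.10],
    [0.30, 0.30, 0.35],
]
scores = neg_clip_loss_scores(sim)
print(scores)
```

Unlike a plain CLIP score, which would only look at the diagonal entries, this score also penalizes a pair whose caption matches many other images in the batch about as well as its own.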

[LG-118] One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Link: https://arxiv.org/abs/2405.19544
Authors: Xinmeng Huang,Shuo Li,Edgar Dobriban,Osbert Bastani,Hamed Hassani,Dongsheng Ding
Keywords: Large Language Models, surrounding Large Language, concerns surrounding Large, Language Models, Large Language
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:The growing safety concerns surrounding Large Language Models (LLMs) raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, common Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, thus greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based scenarios (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness of our methods.

[LG-119] CheXpert Plus: Hundreds of Thousands of Aligned Radiology Texts Images and Patients

Link: https://arxiv.org/abs/2405.19538
Authors: Pierre Chambon,Jean-Benoit Delbrouck,Thomas Sounack,Shih-Cheng Huang,Zhihong Chen,Maya Varma,Steven QH Truong,Chu The Chuong,Curtis P. Langlotz
Keywords: original CheXpert paper, years ago, paper five years, original CheXpert, CheXpert paper
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: this https URL. Models are available at the following URL: this https URL.

[LG-120] Preference Learning Algorithms Do Not Learn Preference Rankings

Link: https://arxiv.org/abs/2405.19534
Authors: Angelica Chen,Sadhika Malladi,Lily H. Zhang,Xinyi Chen,Qiuyi Zhang,Rajesh Ranganath,Kyunghyun Cho
Keywords: Preference learning algorithms, ranking accuracy, Preference learning, produce generations, preferred outputs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via \textit{ranking accuracy}. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the \textit{idealized ranking accuracy} that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant \textit{alignment gap}, \textit{i.e.}, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
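
As a rough illustration of the ranking accuracy metric described above, the following sketch counts how often a model assigns a higher log-likelihood to the preferred output than to the less preferred one. The function name and the example log-likelihood values are hypothetical; the paper may normalize likelihoods (for example by length) in ways this sketch does not.

```python
def ranking_accuracy(pairs):
    """Fraction of preference pairs the model ranks correctly.

    pairs: iterable of (logp_preferred, logp_dispreferred) tuples, where
    each value is the model's log-likelihood of that completion.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for logp_w, logp_l in pairs if logp_w > logp_l)
    return correct / len(pairs)


# Hypothetical log-likelihoods for four preference pairs.
example_pairs = [(-12.3, -15.0), (-20.1, -18.7), (-9.4, -9.9), (-30.2, -25.5)]
print(ranking_accuracy(example_pairs))  # prints 0.5
```

A value near 0.5 means the model is no better than chance at ordering the pairs, which is the regime the abstract reports for many preference-tuned models (below 60%).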

[LG-121] Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Link: https://arxiv.org/abs/2405.19532
Authors: Zoe Piran,Michal Klein,James Thornton,Marco Cuturi
Keywords: Learning meaningful representations, Learning meaningful, machine learning, meaningful representations, representations of complex
Categories: Machine Learning (cs.LG)
Comments: To be presented at ICML 2024

Abstract:Learning meaningful representations of complex objects that can be seen through multiple ($k \geq 3$) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to $k$ views, either by instantiating $\tfrac{1}{2}k(k-1)$ loss-pairs, or by using reduced embeddings, following a \textit{one vs. average-of-rest} strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all $k$ views. Given a batch of $n$ points, each seen as a $k$-tuple of views subsequently transformed into $k$ embeddings, our loss contrasts the cost of matching these $n$ ground-truth $k$-tuples with the MM-OT polymatching cost, which seeks $n$ optimally arranged $k$-tuples chosen within these $n \times k$ vectors. While the exponential complexity $O(n^k)$ of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., $k = 3 \sim 6$ views using mini-batches of size $64 \sim 128$. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.

[LG-122] Real-Time Dynamic Robot-Assisted Hand-Object Interaction via Motion Primitives

Link: https://arxiv.org/abs/2405.19531
Authors: Mingqi Yuan,Huijiang Wang,Kai-Fung Chu,Fumiya Iida,Bo Li,Wenjun Zeng
Keywords: Advances in artificial, artificial intelligence, propelling the evolution, Advances, achieving seamless interactions
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 10 figures

Abstract:Advances in artificial intelligence (AI) have been propelling the evolution of human-robot interaction (HRI) technologies. However, significant challenges remain in achieving seamless interactions, particularly in tasks requiring physical contact with humans. These challenges arise from the need for accurate real-time perception of human actions, adaptive control algorithms for robots, and the effective coordination between human and robotic movements. In this paper, we propose an approach to enhancing physical HRI with a focus on dynamic robot-assisted hand-object interaction (HOI). Our methodology integrates hand pose estimation, adaptive robot control, and motion primitives to facilitate human-robot collaboration. Specifically, we employ a transformer-based algorithm to perform real-time 3D modeling of human hands from single RGB images, based on which a motion primitives model (MPM) is designed to translate human hand motions into robotic actions. The robot’s action implementation is dynamically fine-tuned using the continuously updated 3D hand models. Experimental validations, including a ring-wearing task, demonstrate the system’s effectiveness in adapting to real-time movements and assisting in precise task executions.

[LG-123] Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items

Link: https://arxiv.org/abs/2405.19521
Authors: Seong Woo Han,Ozan Adıgüzel,Bob Carpenter
Keywords: gold standards, machine learning, applied statistics, statistics and machine, Dawid and Skene
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In applied statistics and machine learning, the “gold standards” used for training are often biased and almost always noisy. Dawid and Skene’s justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model’s goodness of fit with posterior predictive checks, the Bayesian analogue of $\chi^2$ tests. Dawid and Skene’s model is rejected by goodness of fit tests, whereas our new model, which adjusts for item heterogeneity, is not rejected. We illustrate our new model with two well-studied data sets, binary rating data for caries in dental X-rays and implication in natural language.

[LG-124] Decentralized Optimization in Time-Varying Networks with Arbitrary Delays

Link: https://arxiv.org/abs/2405.19513
Authors: Tomas Ortega,Hamid Jafarkhani
Keywords: decentralized optimization problem, optimization problem, networks, decentralized optimization, Stochastic Gradient Descent
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2401.11344

Abstract:We consider a decentralized optimization problem for networks affected by communication delays. Examples of such networks include collaborative machine learning, sensor networks, and multi-agent systems. To mimic communication delays, we add virtual non-computing nodes to the network, resulting in directed graphs. This motivates investigating decentralized optimization solutions on directed graphs. Existing solutions assume nodes know their out-degrees, resulting in limited applicability. To overcome this limitation, we introduce a novel gossip-based algorithm, called DT-GO, that does not need to know the out-degrees. The algorithm is applicable in general directed networks, for example networks with delays or limited acknowledgment capabilities. We derive convergence rates for both convex and non-convex objectives, showing that our algorithm achieves the same complexity order as centralized Stochastic Gradient Descent. In other words, the effects of the graph topology and delays are confined to higher-order terms. Additionally, we extend our analysis to accommodate time-varying network topologies. Numerical simulations are provided to support our theoretical findings.

[LG-125] Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous Environments

Link: https://arxiv.org/abs/2405.19499
Authors: Han Wang,Sihong He,Zhili Zhang,Fei Miao,James Anderson
Keywords: Federated Reinforcement Learning, Reinforcement Learning, Federated Reinforcement, agents collaboratively learn, explore a Federated
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We explore a Federated Reinforcement Learning (FRL) problem where $N$ agents collaboratively learn a common policy without sharing their trajectory data. To date, existing FRL work has primarily focused on agents operating in the same or “similar” environments. In contrast, our problem setup allows for arbitrarily large levels of environment heterogeneity. To obtain the optimal policy which maximizes the average performance across all potentially completely different environments, we propose two algorithms: FedSVRPG-M and FedHAPG-M. In contrast to existing results, we demonstrate that both FedSVRPG-M and FedHAPG-M, both of which leverage momentum mechanisms, can exactly converge to a stationary point of the average performance function, regardless of the magnitude of environment heterogeneity. Furthermore, by incorporating the benefits of variance-reduction techniques or Hessian approximation, both algorithms achieve state-of-the-art convergence results, characterized by a sample complexity of $\mathcal{O}\left(\epsilon^{-\frac{3}{2}}/N\right)$. Notably, our algorithms enjoy linear convergence speedups with respect to the number of agents, highlighting the benefit of collaboration among agents in finding a common policy.

[LG-126] Participation in the age of foundation models

Link: https://arxiv.org/abs/2405.19479
Authors: Harini Suresh,Emily Tseng,Meg Young,Mary L. Gray,Emma Pierson,Karen Levy
Keywords: Growing interest, foundation models, interest and investment, impact a wide, wide array
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures. Appeared at FAccT '24

Abstract:Growing interest and investment in the capabilities of foundation models has positioned such systems to impact a wide array of public services. Alongside these opportunities is the risk that these systems reify existing power imbalances and cause disproportionate harm to marginalized communities. Participatory approaches hold promise to instead lend agency and decision-making power to marginalized stakeholders. But existing approaches in participatory AI/ML are typically deeply grounded in context - how do we apply these approaches to foundation models, which are, by design, disconnected from context? Our paper interrogates this question. First, we examine existing attempts at incorporating participation into foundation models. We highlight the tension between participation and scale, demonstrating that it is intractable for impacted communities to meaningfully shape a foundation model that is intended to be universally applicable. In response, we develop a blueprint for participatory foundation models that identifies more local, application-oriented opportunities for meaningful participation. In addition to the “foundation” layer, our framework proposes the “subfloor” layer, in which stakeholders develop shared technical infrastructure, norms and governance for a grounded domain, and the “surface” layer, in which affected communities shape the use of a foundation model for a specific downstream task. The intermediate “subfloor” layer scopes the range of potential harms to consider, and affords communities more concrete avenues for deliberation and intervention. At the same time, it avoids duplicative effort by scaling input across relevant use cases. Through three case studies in clinical care, financial services, and journalism, we illustrate how this multi-layer model can create more meaningful opportunities for participation than solely intervening at the foundation layer.
Journal reference: In The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), June 3-6, 2024, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 13 pages. Related DOI: https://doi.org/10.1145/3630106.3658992

[LG-127] The Data Minimization Principle in Machine Learning

Link: https://arxiv.org/abs/2405.19471
Authors: Prakhar Ganesh,Cuong Tran,Reza Shokri,Ferdinando Fioretto
Keywords: unauthorized access, processed or retained, potential for misuse, data minimization, data minimization aims
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The principle of data minimization aims to reduce the amount of data collected, processed or retained to minimize the potential for misuse, unauthorized access, or data breaches. Rooted in privacy-by-design principles, data minimization has been endorsed by various global data protection regulations. However, its practical implementation remains a challenge due to the lack of a rigorous formulation. This paper addresses this gap and introduces an optimization framework for data minimization based on its legal definitions. It then adapts several optimization algorithms to perform data minimization and conducts a comprehensive evaluation in terms of their compliance with minimization objectives as well as their impact on user privacy. Our analysis underscores the mismatch between the privacy expectations of data minimization and the actual privacy benefits, emphasizing the need for approaches that account for multiple facets of real-world privacy risks.

[LG-128] Posterior Sampling via Autoregressive Generation

Link: https://arxiv.org/abs/2405.19466
Authors: Kelly W Zhang,Tiffany (Tianhui) Cai,Hongseok Namkoong,Daniel Russo
Keywords: actively gather information, Real-world decision-making requires, decision-making requires grappling, environments change, intelligent agents
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Real-world decision-making requires grappling with a perpetual lack of data as environments change; intelligent agents must comprehend uncertainty and actively gather information to resolve it. We propose a new framework for learning bandit algorithms from massive historical data, which we demonstrate in a cold-start recommendation problem. First, we use historical data to pretrain an autoregressive model to predict a sequence of repeated feedback/rewards (e.g., responses to news articles shown to different users over time). In learning to make accurate predictions, the model implicitly learns an informed prior based on rich action features (e.g., article headlines) and how to sharpen beliefs as more rewards are gathered (e.g., clicks as each article is recommended). At decision-time, we autoregressively sample (impute) an imagined sequence of rewards for each action, and choose the action with the largest average imputed reward. Far from a heuristic, our approach is an implementation of Thompson sampling (with a learned prior), a prominent active exploration algorithm. We prove our pretraining loss directly controls online decision-making performance, and we demonstrate our framework on a news recommendation task where we integrate end-to-end fine-tuning of a pretrained language model to process news article headline text to improve performance.
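
The decision-time step described above (impute an imagined reward sequence per action, then pick the action with the largest average) can be sketched as follows. This is only an illustration under strong assumptions: a simple Beta-Bernoulli posterior predictive (a Pólya urn) stands in for the pretrained autoregressive model, action features such as headlines are ignored, and all names (`predict_next_reward_prob`, `imputed_mean_reward`, `choose_action`, `horizon`) are hypothetical.

```python
import random


def predict_next_reward_prob(history, prior_successes=1.0, prior_failures=1.0):
    """Stand-in for a pretrained sequence model: P(next reward = 1 | history).

    Assumption: a Beta(1, 1) posterior predictive replaces the learned model.
    """
    total = prior_successes + prior_failures + len(history)
    return (prior_successes + sum(history)) / total


def imputed_mean_reward(observed, horizon, rng):
    """Autoregressively sample unseen rewards until `horizon` rewards exist,
    then return the mean over observed and imputed rewards."""
    history = list(observed)
    while len(history) < horizon:
        p = predict_next_reward_prob(history)
        history.append(1 if rng.random() < p else 0)
    return sum(history) / len(history)


def choose_action(observed_by_action, horizon, rng):
    """Pick the action whose imagined reward sequence has the highest mean."""
    scores = {
        action: imputed_mean_reward(observed, horizon, rng)
        for action, observed in observed_by_action.items()
    }
    return max(scores, key=scores.get)


# Hypothetical click histories for three articles.
rng = random.Random(0)
history = {"article_a": [1, 1, 0], "article_b": [0], "article_c": []}
print(choose_action(history, horizon=50, rng=rng))
```

Because each call samples a fresh imagined future, actions with little data sometimes receive optimistic imputations and get explored, which is the Thompson-sampling behavior the abstract refers to.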

[LG-129] Clustering-Based Validation Splits for Domain Generalisation

Link: https://arxiv.org/abs/2405.19461
Authors: Andrea Napoli,Paul White
Keywords: model selection, domain shift, problem of model, validation sets increases, validation sets
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper considers the problem of model selection under domain shift. In this setting, it is proposed that a high maximum mean discrepancy (MMD) between the training and validation sets increases the generalisability of selected models. A data splitting algorithm based on kernel k-means clustering, which maximises this objective, is presented. The algorithm leverages linear programming to control the size, label, and (optionally) group distributions of the splits, and comes with convergence guarantees. The technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation (DG) and unsupervised domain adaptation (UDA) tasks. Analysis also shows the MMD between the training and validation sets to be strongly rank-correlated ($\rho = 0.63$) with test domain accuracy, further substantiating the validity of this approach.
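
For readers unfamiliar with the quantity being maximised, here is a minimal sketch of a biased estimate of the squared MMD between two samples under an RBF kernel. It only illustrates the discrepancy measure itself, not the paper's kernel k-means splitting algorithm or its linear-programming constraints; the kernel choice, the `gamma` value, and the toy data are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, gamma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D features: one validation split near the training data,
# one far away from it.
train = [(0.0, 0.0), (0.1, 0.0)]
val_near = [(0.0, 0.1), (0.1, 0.1)]
val_far = [(3.0, 3.0), (3.1, 3.0)]
print(mmd_squared(train, val_near), mmd_squared(train, val_far))
```

Under the abstract's proposal, a split like `val_far` (higher MMD from the training set) would be preferred for model selection under domain shift.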

[LG-130] Deep Grokking: Would Deep Neural Networks Generalize Better?

Link: https://arxiv.org/abs/2405.19454
Authors: Simin Fan, Razvan Pascanu, Martin Jaggi
Keywords: networks’ training dynamics, Recent research, neural networks’ training, deep neural networks, illuminated the intricacies
Categories: Machine Learning (cs.LG)

Abstract:Recent research on the grokking phenomenon has illuminated the intricacies of neural networks’ training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network’s generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focuses on shallow networks such as 2-layer MLPs and 1-layer Transformers, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decrease in feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model’s generalization behavior compared to the weight-norm. We believe our work is the first to dive into grokking in deep neural networks and to investigate the relationship between feature rank and generalization performance.

[LG-131] Gaitor: Learning a Unified Representation Across Gaits for Real-World Quadruped Locomotion

Link: https://arxiv.org/abs/2405.19452
Authors: Alexander L. Mitchell, Wolfgang Merkt, Aristotelis Papatheodorou, Ioannis Havoutis, Ingmar Posner
Keywords: trot and crawl, traversal but requires, requires the segmentation, discrete set, gait types
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures, 2 tables

Abstract:The current state-of-the-art in quadruped locomotion is able to produce robust motion for terrain traversal but requires the segmentation of a desired robot trajectory into a discrete set of locomotion skills such as trot and crawl. In contrast, in this work we demonstrate the feasibility of learning a single, unified representation for quadruped locomotion enabling continuous blending between gait types and characteristics. We present Gaitor, which learns a disentangled representation of locomotion skills, thereby sharing information common to all gait types seen during training. The structure emerging in the learnt representation is interpretable in that it is found to encode phase correlations between the different gait types. These can be leveraged to produce continuous gait transitions. In addition, foot swing characteristics are disentangled and directly addressable. Together with a rudimentary terrain encoding and a learned planner operating in this structured latent representation, Gaitor is able to take motion commands including desired gait type and characteristics from a user while reacting to uneven terrain. We evaluate Gaitor in both simulated and real-world settings on the ANYmal C platform. To the best of our knowledge, this is the first work learning such a unified and interpretable latent representation for multiple gaits, resulting in on-demand continuous blending between different locomotion modes on a real quadruped robot.

[LG-132] On the Convergence of Multi-objective Optimization under Generalized Smoothness

Link: https://arxiv.org/abs/2405.19440
Authors: Qi Zhang, Peiyao Xiao, Kaiyi Ji, Shaofeng Zou
Keywords: Smooth Multi-objective Gradient, Generalized Smooth Multi-objective, Multi-objective Gradient descent, Smooth Multi-objective, Multi-objective Gradient
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which typically do not hold for neural networks such as recurrent neural networks (RNNs) and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where a total of $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed for the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose a practical variant of GSMGrad named GSMGrad-FA using only constant-level time and space, while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
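
As a rough illustration of a direction that improves every objective at once, the sketch below computes the classical min-norm (MGDA-style) common descent direction for two gradients in closed form. This is the textbook two-objective construction, not GSMGrad or SGSMGrad themselves, and the function names are hypothetical.

```python
def min_norm_weight(g1, g2):
    """Weight w in [0, 1] minimising ||w * g1 + (1 - w) * g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same combination
    w = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    return min(1.0, max(0.0, w))  # clip to the simplex

def common_descent_direction(g1, g2):
    """Negative of the min-norm convex combination of the two gradients."""
    w = min_norm_weight(g1, g2)
    return [-(w * a + (1.0 - w) * b) for a, b in zip(g1, g2)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Two partially conflicting gradients (hypothetical values).
g1, g2 = [1.0, 0.0], [-0.5, 1.0]
d = common_descent_direction(g1, g2)
```

A negative inner product between `d` and each gradient means a small step along `d` decreases both objectives to first order.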

[LG-133] Using Contrastive Learning with Generative Similarity to Learn Spaces that Capture Human Inductive Biases

Link: https://arxiv.org/abs/2405.19420
Authors: Raja Marjieh, Sreejan Kumar, Declan Campbell, Liyi Zhang, Gianluca Bencomo, Jake Snell, Thomas L. Griffiths
Keywords: human inductive biases, strong inductive biases, inductive biases, rely on strong, information from sensory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Humans rely on strong inductive biases to learn from few examples and abstract useful information from sensory data. Instilling such biases in machine learning models has been shown to improve their performance on various benchmarks including few-shot learning, robustness, and alignment. However, finding effective training procedures to achieve that goal can be challenging as psychologically-rich training data such as human similarity judgments are expensive to scale, and Bayesian models of human inductive biases are often intractable for complex, realistic domains. Here, we address this challenge by introducing a Bayesian notion of generative similarity whereby two datapoints are considered similar if they are likely to have been sampled from the same distribution. This measure can be applied to complex generative processes, including probabilistic programs. We show that generative similarity can be used to define a contrastive learning objective even when its exact form is intractable, enabling learning of spatial embeddings that express specific inductive biases. We demonstrate the utility of our approach by showing how it can be used to capture human inductive biases for geometric shapes, and to better distinguish different abstract drawing styles that are parameterized by probabilistic programs.

[LG-134] Safety through Permissibility: Shield Construction for Fast and Safe Reinforcement Learning

Link: https://arxiv.org/abs/2405.19414
Authors: Alexander Politowicz, Sahisnu Mazumder, Bing Liu
Keywords: Designing Reinforcement Learning, real-life problems remains, Designing Reinforcement, Reinforcement Learning, significant challenge
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Designing Reinforcement Learning (RL) solutions for real-life problems remains a significant challenge. A major area of concern is safety. “Shielding” is a popular technique to enforce safety in RL by turning user-defined safety specifications into safe agent behavior. However, these methods either suffer from extreme learning delays, demand extensive human effort in designing models and safe domains in the problem, or require pre-computation. In this paper, we propose a new permissibility-based framework to deal with safety and shield construction. Permissibility was originally designed for eliminating (non-permissible) actions that will not lead to an optimal solution to improve RL training efficiency. This paper shows that safety can be naturally incorporated into this framework, i.e. extending permissibility to include safety, and thereby we can achieve both safety and improved efficiency. Experimental evaluation using three standard RL applications shows the effectiveness of the approach.

[LG-135] Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation

Link: https://arxiv.org/abs/2405.19383
Authors: Bruno Deprez, Toon Vanderschueren, Wouter Verbeke, Bart Baesens, Tim Verdonck
Keywords: financing illegal activities, Money laundering, Money laundering presents, money laundering necessarily, detect money laundering
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Money laundering presents a pervasive challenge, burdening society by financing illegal activities. To more effectively combat and detect money laundering, the use of network information is increasingly being explored, exploiting the fact that money laundering necessarily involves interconnected parties. This has led to a surge in literature on network analytics (NA) for anti-money laundering (AML). The literature, however, is fragmented and a comprehensive overview of existing work is missing. This results in limited understanding of the methods that may be applied and their comparative detection power. Therefore, this paper presents an extensive and systematic review of the literature. We identify and analyse 97 papers in the Web of Science and Scopus databases, resulting in a taxonomy of approaches following the fraud analytics framework of Bockel-Rickermann et al. Moreover, this paper presents a comprehensive experimental framework to evaluate and compare the performance of prominent NA methods in a uniform setup. The framework is applied to the publicly available Elliptic data set and implements manual feature engineering, random walk-based methods, and deep learning GNNs. We conclude from the results that network analytics increases the predictive power of the AML model, with graph neural networks giving the best results. An open source implementation of the experimental framework is provided to help researchers and practitioners extend these results and experiment on proprietary data. As such, we aim to promote a standardised approach towards the analysis and evaluation of network analytics for AML.

[LG-136] PureEBM: Universal Poison Purification via Mid-Run Dynamics of Energy-Based Models

Link: https://arxiv.org/abs/2405.19376
Authors: Omead Pooladzandi, Jeffrey Jiang, Sunay Bhat, Gregory Pottie
Keywords: target distribution test, poisoning attacks pose, distribution test data, Data poisoning attacks, machine learning models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.18627

Abstract:Data poisoning attacks pose a significant threat to the integrity of machine learning models by leading to misclassification of target distribution test data by injecting adversarial examples during training. Existing state-of-the-art (SoTA) defense methods suffer from a variety of limitations, such as significantly reduced generalization performance, specificity to particular attack types and classifiers, and significant overhead during training, making them impractical or limited for real-world applications. In response to this challenge, we introduce a universal data purification method that defends naturally trained classifiers from malicious white-, gray-, and black-box image poisons by applying a universal stochastic preprocessing step $\Psi_T(x)$, realized by iterative Langevin sampling of a convergent Energy Based Model (EBM) initialized with an image $x$. Mid-run dynamics of $\Psi_T(x)$ purify poison information with minimal impact on features important to the generalization of a classifier network. We show that the contrastive learning process of EBMs allows them to remain universal purifiers, even in the presence of poisoned EBM training data, and to achieve SoTA defense on leading triggered poison Narcissus and triggerless poisons Gradient Matching and Bullseye Polytope. This work is a subset of a larger framework introduced in PureGen with a more detailed focus on EBM purification and poison defense.
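
The preprocessing step described here is iterative Langevin sampling started from the input. The sketch below shows the generic Langevin update on a one-dimensional toy energy; the double-well energy, the step size and the number of steps are hypothetical stand-ins for the paper's trained image EBM and its settings.

```python
import math
import random

def energy_grad(x):
    """Gradient of the toy energy E(x) = (x^2 - 1)^2 / 4 (a hypothetical stand-in)."""
    return x * (x * x - 1.0)

def langevin_purify(x0, steps=50, step_size=0.01, noise_scale=1.0, rng=None):
    """Run `steps` Langevin updates starting from the input x0.

    Update: x <- x - (step_size / 2) * dE/dx + noise_scale * sqrt(step_size) * N(0, 1).
    A moderate number of steps loosely mirrors the "mid-run" idea in the abstract.
    """
    rng = rng or random.Random(0)
    x = x0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size * energy_grad(x) + noise_scale * math.sqrt(step_size) * noise
    return x
```

With `noise_scale=0.0` the loop reduces to plain gradient descent on the energy, which is a convenient way to check the drift term in isolation.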

[LG-137] Improving global awareness of linkset predictions using Cross-Attentive Modulation tokens

Link: https://arxiv.org/abs/2405.19375
Authors: Félix Marcoccia, Cédric Adjih, Paul Mühlethaler
Keywords: Graph Neural Networks, Neural Networks, generation techniques rely, graph generation techniques, form proper link
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures, not published nor submitted yet

Abstract:Most multiple-link prediction and graph generation techniques rely on the attention mechanism or on Graph Neural Networks (GNNs), which leverage node-level information exchanges in order to form proper link predictions. Such node-level interactions do not process nodes as an ordered sequence, which would imply some kind of natural ordering of the nodes: they are said to be permutation invariant mechanisms. They are well suited for graph problems, but struggle to provide a global orchestration of the predicted links, which can result in a loss of performance. Typical issues include the difficulty of ensuring high-level properties such as global connectedness or fixed diameter, and of avoiding information bottleneck effects such as oversmoothing and oversquashing, which respectively consist in abundant smoothing in dense areas leading to a loss of information and a tendency to exclude isolated nodes from the message passing scheme, and often result in irrelevant, unbalanced link predictions. To tackle this problem, we present Cross-Attentive Modulation (CAM) tokens, which introduce cross-attentive units used to condition node and edge-level modulations in order to enable context-aware computations that improve the global consistency of the predicted links. We implement them on a few permutation invariant architectures, and showcase benchmarks that demonstrate the merits of our work.

[LG-138] Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants

Link: https://arxiv.org/abs/2405.19342
Authors: Chloé Sekkat, Fanny Leroy, Salima Mdhaffar, Blake Perry Smith, Yannick Estève, Joseph Dureau, Alice Coucke
Keywords: Recent works demonstrate, Recent works, Sonos Voice Control, North American English, Voice Control Bias
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.

[LG-139] Recasting Continual Learning as Sequence Modeling

Link: https://arxiv.org/abs/2310.11952
Authors: Soochan Lee, Jaehyeon Son, Gunhee Kim
Keywords: machine learning research, continual learning, aim to establish, establish a strong, strong connection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2023

Abstract:In this work, we aim to establish a strong connection between two significant bodies of machine learning research: continual learning and sequence modeling. That is, we propose to formulate continual learning as a sequence modeling problem, allowing advanced sequence models to be utilized for continual learning. Under this formulation, the continual learning process becomes the forward pass of a sequence model. By adopting the meta-continual learning (MCL) framework, we can train the sequence model at the meta-level, on multiple continual learning episodes. As a specific example of our new formulation, we demonstrate the application of Transformers and their efficient variants as MCL methods. Our experiments on seven benchmarks, covering both classification and regression, show that sequence models can be an attractive solution for general MCL.

[LG-140] Entropy annealing for policy mirror descent in continuous time and space

Link: https://arxiv.org/abs/2405.20250
Authors: Deven Sethi, David Šiška, Yufei Zhang
Keywords: additional regularization bias, Entropy regularization, landscape and accelerate, cost of introducing, introducing an additional
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Entropy regularization has been extensively used in policy optimization algorithms to regularize the optimization landscape and accelerate convergence; however, it comes at the cost of introducing an additional regularization bias. This work quantifies the impact of entropy regularization on the convergence of policy gradient methods for stochastic exit time control problems. We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function and adjusts the strength of entropy regularization as the algorithm progresses. We prove that with a fixed entropy level, the dynamics converges exponentially to the optimal solution of the regularized problem. We further show that when the entropy level decays at suitable polynomial rates, the annealed flow converges to the solution of the unregularized problem at a rate of $\mathcal{O}(1/S)$ for discrete action spaces and, under suitable conditions, at a rate of $\mathcal{O}(1/\sqrt{S})$ for general action spaces, with $S$ being the gradient flow time. This paper explains how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate.
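
The contrast between a fixed and a decaying entropy level can be seen in a much simpler setting than the paper's. The sketch below is a discrete-time, single-state (bandit) analogue with known costs, which is an assumption made purely for illustration; the step size and the schedule `1 / (k + 1)` are hypothetical choices, not the paper's rates.

```python
import math

def mirror_descent_step(policy, costs, step, tau):
    """One KL-proximal (mirror descent) step on <pi, cost> + tau * sum(pi * log(pi)).

    Closed form: pi_new(a) is proportional to pi(a)^(1 - step * tau) * exp(-step * cost(a)).
    """
    logits = [
        (1.0 - step * tau) * math.log(max(p, 1e-300)) - step * c  # clamp avoids log(0)
        for p, c in zip(policy, costs)
    ]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # shift by the max for stability
    z = sum(weights)
    return [w / z for w in weights]

def run(costs, tau_schedule, step=0.5, iters=200):
    """Iterate from the uniform policy; tau_schedule(k) is the entropy level at step k."""
    policy = [1.0 / len(costs)] * len(costs)
    for k in range(iters):
        policy = mirror_descent_step(policy, costs, step, tau_schedule(k))
    return policy

costs = [1.0, 2.0, 3.0]
fixed = run(costs, lambda k: 1.0)                # converges to softmax(-cost / tau)
annealed = run(costs, lambda k: 1.0 / (k + 1))   # concentrates on the cheapest action
```

With a fixed level the iterates settle on the regularized optimum, while the decaying schedule removes the regularization bias over time, matching the qualitative picture in the abstract.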

[LG-141] Training-efficient density quantum machine learning

Link: https://arxiv.org/abs/2405.20237
Authors: Brian Coyle, El Amine Cherrat, Nishant Jain, Natansh Mathur, Snehal Raj, Skander Kazdaghli, Iordanis Kerenidis
Keywords: quantum neural networks, learning requires powerful, neural networks, quantum neural, solving challenging problems
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages main text, 9 pages appendices. 9 figures

Abstract:Quantum machine learning requires powerful, flexible and efficiently trainable models to be successful in solving challenging problems. In this work, we present density quantum neural networks, a learning model incorporating randomisation over a set of trainable unitaries. These models generalise quantum neural networks using parameterised quantum circuits, and allow a trade-off between expressibility and efficient trainability, particularly on quantum hardware. We demonstrate the flexibility of the formalism by applying it to two recently proposed model families. The first are commuting-block quantum neural networks (QNNs) which are efficiently trainable but may be limited in expressibility. The second are orthogonal (Hamming-weight preserving) quantum neural networks which provide well-defined and interpretable transformations on data but are challenging to train at scale on quantum devices. Density commuting QNNs improve capacity with minimal gradient complexity overhead, and density orthogonal neural networks admit a quadratic-to-constant gradient query advantage with minimal to no performance loss. We conduct numerical experiments on synthetic translationally invariant data and MNIST image data with hyperparameter optimisation to support our findings. Finally, we discuss the connection to post-variational quantum neural networks, measurement-based quantum machine learning and the dropout mechanism.

[LG-142] Disentangling and Mitigating the Impact of Task Similarity for Continual Learning

Link: https://arxiv.org/abs/2405.20236
Authors: Naoki Hiratani
Keywords: artificial neural networks, Continual learning, partially similar tasks, similar tasks poses, neural networks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Continual learning of partially similar tasks poses a challenge for artificial neural networks, as task similarity presents both an opportunity for knowledge transfer and a risk of interference and catastrophic forgetting. However, it remains unclear how task similarity in input features and readout patterns influences knowledge transfer and forgetting, as well as how they interact with common algorithms for continual learning. Here, we develop a linear teacher-student model with latent structure and show analytically that high input feature similarity coupled with low readout similarity is catastrophic for both knowledge transfer and retention. Conversely, the opposite scenario is relatively benign. Our analysis further reveals that task-dependent activity gating improves knowledge retention at the expense of transfer, while task-dependent plasticity gating does not affect either retention or transfer performance at the over-parameterized limit. In contrast, weight regularization based on the Fisher information metric significantly improves retention, regardless of task similarity, without compromising transfer performance. Nevertheless, its diagonal approximation and regularization in the Euclidean space are much less robust against task similarity. We demonstrate consistent results in a permuted MNIST task with latent variables. Overall, this work provides insights into when continual learning is difficult and how to mitigate it.

[LG-143] Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

Link: https://arxiv.org/abs/2405.20165
Authors: Wooseong Cho, Taehyun Hwang, Joongkyu Lee, Min-hwan Oh
Keywords: Markov decision processes, study reinforcement learning, underlying transition probability, transition probability kernel, Markov decision
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study reinforcement learning with multinomial logistic (MNL) function approximation where the underlying transition probability kernel of the Markov decision processes (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency and establish that $\texttt{RRL-MNL}$ is both statistically and computationally efficient, achieving a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $\kappa^{-1}$, which is potentially large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve both computational and statistical efficiency. Numerical experiments demonstrate the superior performance of the proposed algorithms.

[LG-144] SPAM: Stochastic Proximal Point Method with Momentum Variance Reduction for Non-convex Cross-Device Federated Learning

Link: https://arxiv.org/abs/2405.20127
Authors: Avetik Karagulyan, Egor Shulgin, Abdurakhmon Sadiev, Peter Richtárik
Keywords: crucial subfield, cross-device federated learning, Cross-device training, federated learning, cross-device federated
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: The main part of the paper is around 9 pages. It contains the proposed algorithms, the main theoretical results and the experimental setting. The proofs of the main results and other technicalities are deferred to the Appendix

Abstract:Cross-device training is a crucial subfield of federated learning, where the number of clients can reach into the billions. Standard approaches and local methods are prone to issues such as client drift and insensitivity to data similarities. We propose a novel algorithm (SPAM) for cross-device federated learning with non-convex losses, which solves both issues. We provide sharp analysis under second-order (Hessian) similarity, a condition satisfied by a variety of machine learning problems in practice. Additionally, we extend our results to the partial participation setting, where a cohort of selected clients communicate with the server at each communication round. Our method is the first of its kind that does not require the smoothness of the objective and provably benefits from clients having similar data.

[LG-145] A Geometric Unification of Distributionally Robust Covariance Estimators: Shrinking the Spectrum by Inflating the Ambiguity Set

Link: https://arxiv.org/abs/2405.20124
Authors: Man-Chung Yue, Yves Rychener, Daniel Kuhn, Viet Anh Nguyen
Keywords: data-insensitive shrinkage target, sample covariance matrix, estimating high-dimensional covariance, high-dimensional covariance matrices, methods for estimating
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:The state-of-the-art methods for estimating high-dimensional covariance matrices all shrink the eigenvalues of the sample covariance matrix towards a data-insensitive shrinkage target. The underlying shrinkage transformation is either chosen heuristically - without compelling theoretical justification - or optimally in view of restrictive distributional assumptions. In this paper, we propose a principled approach to construct covariance estimators without imposing restrictive assumptions. That is, we study distributionally robust covariance estimation problems that minimize the worst-case Frobenius error with respect to all data distributions close to a nominal distribution, where the proximity of distributions is measured via a divergence on the space of covariance matrices. We identify mild conditions on this divergence under which the resulting minimizers represent shrinkage estimators. We show that the corresponding shrinkage transformations are intimately related to the geometrical properties of the underlying divergence. We also prove that our robust estimators are efficiently computable and asymptotically consistent and that they enjoy finite-sample performance guarantees. We exemplify our general methodology by synthesizing explicit estimators induced by the Kullback-Leibler, Fisher-Rao, and Wasserstein divergences. Numerical experiments based on synthetic and real data show that our robust estimators are competitive with state-of-the-art estimators.

[LG-146] Analysis of a multi-target linear shrinkage covariance estimator

Link: https://arxiv.org/abs/2405.20086
Authors: Benoit Oriol
Keywords: Multi-target linear shrinkage, single-target linear shrinkage, Multi-target linear, linear shrinkage, standard single-target linear
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Abstract:Multi-target linear shrinkage is an extension of the standard single-target linear shrinkage for covariance estimation. We combine several constant matrices - the targets - with the sample covariance matrix. We derive the oracle and a bona fide multi-target linear shrinkage estimator with exact and empirical mean. In both settings, we prove its convergence towards the oracle under Kolmogorov asymptotics. Finally, we show empirically that it outperforms other standard estimators in various situations.

[LG-147] A Staged Approach using Machine Learning and Uncertainty Quantification to Predict the Risk of Hip Fracture

Link: https://arxiv.org/abs/2405.20071
Authors: Anjum Shaik, Kristoffer Larsen, Nancy E. Lane, Chen Zhao, Kuan-Jui Su, Joyce H. Keyak, Qing Tian, Qiuying Sha, Hui Shen, Hong-Wen Deng, Weihua Zhou
Keywords: hip fractures impose, medical care, healthcare systems, DXA, advancements in medical
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments: 29 pages, 5 figures, 6 tables

Abstract:Despite advancements in medical care, hip fractures impose a significant burden on individuals and healthcare systems. This paper focuses on the prediction of hip fracture risk in older and middle-aged adults, where falls and compromised bone quality are predominant factors. We propose a novel staged model that combines advanced imaging and clinical data to improve predictive performance. By using CNNs to extract features from hip DXA images, along with clinical variables, shape measurements, and texture features, our method provides a comprehensive framework for assessing fracture risk. A staged machine learning-based model was developed using two ensemble models: Ensemble 1 (clinical variables only) and Ensemble 2 (clinical variables and DXA imaging features). This staged approach used uncertainty quantification from Ensemble 1 to decide if DXA features are necessary for further prediction. Ensemble 2 exhibited the highest performance, achieving an AUC of 0.9541, an accuracy of 0.9195, a sensitivity of 0.8078, and a specificity of 0.9427. The staged model also performed well, with an AUC of 0.8486, an accuracy of 0.8611, a sensitivity of 0.5578, and a specificity of 0.9249, outperforming Ensemble 1, which had an AUC of 0.5549, an accuracy of 0.7239, a sensitivity of 0.1956, and a specificity of 0.8343. Furthermore, the staged model suggested that 54.49% of patients did not require DXA scanning. It effectively balanced accuracy and specificity, offering a robust solution when DXA data acquisition is not always feasible. Statistical tests confirmed significant differences between the models, highlighting the advantages of the advanced modeling strategies. Our staged approach could identify individuals at risk with high accuracy while reducing unnecessary DXA scanning. It has great promise to guide interventions to prevent hip fractures with reduced cost and radiation.
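
The staged decision described in this abstract can be sketched as a simple gate: use the clinical-only ensemble when it is confident, and request DXA features otherwise. The abstract does not say how uncertainty is quantified, so the spread of ensemble member predictions, the `threshold` value and the function names below are all assumptions.

```python
import statistics

def staged_predict(clinical_probs, dxa_model, threshold=0.1):
    """Return (fracture probability, whether DXA was used).

    clinical_probs: probabilities from the members of the clinical-only ensemble.
    dxa_model: zero-argument callable returning the clinical + DXA probability;
               it is only called when the first stage is uncertain.
    threshold: hypothetical cut-off on the spread of the first-stage predictions.
    """
    mean = statistics.fmean(clinical_probs)
    spread = statistics.pstdev(clinical_probs)  # member disagreement as uncertainty
    if spread <= threshold:
        return mean, False   # confident: no DXA scan needed
    return dxa_model(), True  # uncertain: defer to the imaging-based ensemble

# Hypothetical patients: one with agreeing members, one with disagreeing members.
confident = staged_predict([0.10, 0.12, 0.11], lambda: 0.90)
uncertain = staged_predict([0.10, 0.90, 0.50], lambda: 0.80)
```

Passing the second stage as a callable keeps the expensive step lazy, which mirrors the goal of avoiding a scan unless the first stage calls for it.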

[LG-148] A Hardware-Efficient EMG Decoder with an Attractor-based Neural Network for Next-Generation Hand Prostheses

Link: https://arxiv.org/abs/2405.20052
Authors: Mohammad Kalbasi, MohammadAli Shaeri, Vincent Alexandre Mendez, Solaiman Shokur, Silvestro Micera, Mahsa Shoaran
Keywords: Robotic Prosthetic Hands, restoring hand functionality, Robotic Prosthetic, Prosthetic Hands, development of Robotic
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Advancements in neural engineering have enabled the development of Robotic Prosthetic Hands (RPHs) aimed at restoring hand functionality. Current commercial RPHs offer limited control through basic on/off commands. Recent progress in machine learning enables finger movement decoding with higher degrees of freedom, yet the high computational complexity of such models limits their application in portable devices. Future RPH designs must balance portability, low power consumption, and high decoding accuracy to be practical for individuals with disabilities. To this end, we introduce a novel attractor-based neural network to realize on-chip movement decoding for next-generation portable RPHs. The proposed architecture comprises an encoder, an attention layer, an attractor network, and a refinement regressor. We tested our model on four healthy subjects and achieved a decoding accuracy of 80.6 ± 3.3%. Our proposed model is over 120 and 50 times more compact than state-of-the-art LSTM and CNN models, respectively, with comparable (or superior) decoding accuracy. Therefore, it exhibits minimal hardware complexity and can be effectively integrated as a System-on-Chip.

[LG-149] Task-Agnostic Machine Learning-Assisted Inference

Link: https://arxiv.org/abs/2405.20039
Authors: Jiacheng Miao, Qiongshi Lu
Keywords: increasingly important role, Machine learning, playing an increasingly, increasingly important, important role
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened up a whole new field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples and then uses the predicted outcomes in downstream statistical inference. However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis. This is because any extension of these approaches to new, more sophisticated statistical tasks requires task-specific algebraic derivations and software implementations, which ignores the massive library of existing software tools already developed for complex inference tasks and severely constrains the scope of post-prediction inference in real applications. To address this challenge, we propose a novel statistical framework for task-agnostic ML-assisted inference. It provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routine. It delivers valid and efficient inference that is robust to arbitrary choices of ML models, while allowing nearly all existing analytical frameworks to be incorporated into the analysis of ML-predicted outcomes. Through extensive experiments, we showcase the validity, versatility, and superiority of our method compared to existing approaches.

[LG-150] Symmetries in Overparametrized Neural Networks: A Mean-Field View

Link: https://arxiv.org/abs/2405.19995
Authors: Javier Maass Martínez, Joaquin Fontbona
Keywords: Artificial Neural Networks, overparametrized Artificial Neural, Neural Networks, Artificial Neural, overparametrized Artificial
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:We develop a Mean-Field (MF) view of the learning dynamics of overparametrized Artificial Neural Networks (NN) under data symmetric in law with respect to the action of a general compact group $G$. We consider for this a class of generalized shallow NNs given by an ensemble of $N$ multi-layer units, jointly trained using stochastic gradient descent (SGD) and possibly symmetry-leveraging (SL) techniques, such as Data Augmentation (DA), Feature Averaging (FA) or Equivariant Architectures (EA). We introduce the notions of weakly and strongly invariant laws (WI and SI) on the parameter space of each single unit, corresponding, respectively, to $G$-invariant distributions, and to distributions supported on parameters fixed by the group action (which encode EA). This allows us to define symmetric models compatible with taking $N \to \infty$ and give an interpretation of the asymptotic dynamics of DA, FA and EA in terms of Wasserstein Gradient Flows describing their MF limits. When activations respect the group action, we show that, for symmetric data, DA, FA and freely-trained models obey the exact same MF dynamic, which stays in the space of WI laws and minimizes therein the population risk. We also give a counterexample to the general attainability of an optimum over SI laws. Despite this, quite remarkably, we show that the set of SI laws is also preserved by the MF dynamics even when freely trained. This sharply contrasts with the finite-$N$ setting, in which EAs are generally not preserved by unconstrained SGD. We illustrate the validity of our findings as $N$ gets larger in a teacher-student experimental setting, training a student NN to learn from a WI, SI or arbitrary teacher model through various SL schemes. Finally, we deduce a data-driven heuristic to discover the largest subspace of parameters supporting SI distributions for a problem, which could be used for designing EA with minimal generalization error.

[LG-151] Targeted Sequential Indirect Experiment Design

Link: https://arxiv.org/abs/2405.19985
Authors: Elisabeth Ailer, Niclas Dern, Jason Hartford, Niki Kilbertus
Keywords: Scientific hypotheses typically, influence environmental health, hypotheses typically concern, typically concern specific, concern specific aspects
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Comments:

Abstract:Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health. Such queries are inherently causal (rather than purely associational), but in many settings, experiments can not be conducted directly on the target variables of interest, but are indirect. Therefore, they perturb the target variable, but do not remove potential confounding factors. If, additionally, the resulting experimental measurements are multi-dimensional and the studied mechanisms nonlinear, the query of interest is generally not identified. We develop an adaptive strategy to design indirect experiments that optimally inform a targeted query about the ground truth mechanism in terms of sequentially narrowing the gap between an upper and lower bound on the query. While the general formulation consists of a bi-level optimization procedure, we derive an efficiently estimable analytical kernel-based estimator of the bounds for the causal effect, a query of key interest, and demonstrate the efficacy of our approach in confounded, multivariate, nonlinear synthetic settings.

[LG-152] Robust Kernel Hypothesis Testing under Data Corruption

Link: https://arxiv.org/abs/2405.19912
Authors: Antonin Schrab, Ilmun Kim
Keywords: constructing robust permutation, data corruption, propose two general, robust permutation tests, permutation tests
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 26 pages, 2 figures, 2 algorithms

Abstract:We propose two general methods for constructing robust permutation tests under data corruption. The proposed tests effectively control the non-asymptotic type I error under data corruption, and we prove their consistency in power under minimal conditions. This contributes to the practical deployment of hypothesis tests for real-world applications with potential adversarial attacks. One of our methods inherently ensures differential privacy, further broadening its applicability to private data analysis. For the two-sample and independence settings, we show that our kernel robust tests are minimax optimal, in the sense that they are guaranteed to be non-asymptotically powerful against alternatives uniformly separated from the null in the kernel MMD and HSIC metrics at some optimal rate (tight with matching lower bound). Finally, we provide publicly available implementations and empirically illustrate the practicality of our proposed tests.

[LG-153] Deep Joint Semantic Coding and Beamforming for Near-Space Airship-Borne Massive MIMO Network

Link: https://arxiv.org/abs/2405.19889
Authors: Minghui Wu, Zhen Gao, Zhaocheng Wang, Dusit Niyato, George K. Karagiannidis, Sheng Chen
Keywords: Near-space airship-borne communication, Near-space airship-borne, airship-borne communication network, future integrated, stratospheric altitudes
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Major Revision by IEEE JSAC

Abstract:Near-space airship-borne communication network is recognized to be an indispensable component of the future integrated ground-air-space network thanks to airships’ advantage of long-term residency at stratospheric altitudes, but it urgently needs reliable and efficient Airship-to-X link. To improve the transmission efficiency and capacity, this paper proposes to integrate semantic communication with massive multiple-input multiple-output (MIMO) technology. Specifically, we propose a deep joint semantic coding and beamforming (JSCBF) scheme for airship-based massive MIMO image transmission network in space, in which semantics from both source and channel are fused to jointly design the semantic coding and physical layer beamforming. First, we design two semantic extraction networks to extract semantics from image source and channel state information, respectively. Then, we propose a semantic fusion network that can fuse these semantics into complex-valued semantic features for subsequent physical-layer transmission. To efficiently transmit the fused semantic features at the physical layer, we then propose the hybrid data and model-driven semantic-aware beamforming networks. At the receiver, a semantic decoding network is designed to reconstruct the transmitted images. Finally, we perform end-to-end deep learning to jointly train all the modules, using the image reconstruction quality at the receivers as a metric. The proposed deep JSCBF scheme fully combines the efficient source compressibility and robust error correction capability of semantic communication with the high spectral efficiency of massive MIMO, achieving a significant performance improvement over existing approaches.

[LG-154] Identifiability of a statistical model with two latent vectors: Importance of the dimensionality relation and application to graph embedding

Link: https://arxiv.org/abs/2405.19760
Authors: Hiroaki Sasaki
Keywords: unsupervised representation learning, latent vectors, Identifiability, identifiability conditions, representation learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Identifiability of statistical models is a key notion in unsupervised representation learning. Recent work on nonlinear independent component analysis (ICA) employs auxiliary data and has established identifiability conditions. This paper proposes a statistical model of two latent vectors with single auxiliary data generalizing nonlinear ICA, and establishes various identifiability conditions. Unlike previous work, the two latent vectors in the proposed model can have arbitrary dimensions, and this property enables us to reveal an insightful dimensionality relation among two latent vectors and auxiliary data in identifiability conditions. Furthermore, surprisingly, we prove that the indeterminacies of the proposed model are the same as those of linear ICA under certain conditions: the elements in the latent vector can be recovered up to their permutation and scales. Next, we apply the identifiability theory to a statistical model for graph data. As a result, one of the identifiability conditions includes an appealing implication: identifiability of the statistical model could depend on the maximum value of link weights in graph data. Then, we propose a practical method for identifiable graph embedding. Finally, we numerically demonstrate that the proposed method recovers the latent vectors well and that model identifiability clearly depends on the maximum value of link weights, which supports the implication of our theoretical results.

[LG-155] Enhancing Sufficient Dimension Reduction via Hellinger Correlation

Link: https://arxiv.org/abs/2405.19704
Authors: Seungbeom Hong, Ilmun Kim, Jun Song
Keywords: single-index models, conditional independence, sufficient dimension reduction, supervised dimension reduction, dimension reduction based
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:In this work, we develop a new theory and method for sufficient dimension reduction (SDR) in single-index models, where SDR is a sub-field of supervised dimension reduction based on conditional independence. Our work is primarily motivated by the recent introduction of the Hellinger correlation as a dependency measure. Utilizing this measure, we develop a method capable of effectively detecting the dimension reduction subspace, complete with theoretical justification. Through extensive numerical experiments, we demonstrate that our proposed method significantly enhances and outperforms existing SDR methods. This improvement is largely attributed to our proposed method’s deeper understanding of data dependencies and the refinement of existing SDR techniques.

[LG-156] Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity

Link: https://arxiv.org/abs/2405.19697
Authors: Yan Yang, Bin Gao, Ya-xiang Yuan
Keywords: growing interest recently, features intertwined two-level, attracted growing interest, intertwined two-level problems, Bilevel reinforcement learning
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 1 figure, 1 table

Abstract:Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. This, remarkably, distinguishes our development of hyper-gradient from the general AID-based bilevel frameworks since we take advantage of the specific structure of RL problems. Moreover, we propose both model-based and model-free bilevel reinforcement learning algorithms, facilitated by access to the fully first-order hyper-gradient. Both algorithms provably enjoy the convergence rate $\mathcal{O}(\epsilon^{-1})$. To the best of our knowledge, this is the first time that AID-based bilevel RL gets rid of additional assumptions on the lower-level problem. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.

[LG-157] Bayesian Online Natural Gradient (BONG)

Link: https://arxiv.org/abs/2405.19681
Authors: Matt Jones, Peter Chang, Kevin Murphy
Keywords: Bayesian inference based, sequential Bayesian inference, variational Bayes, exact Bayesian inference, approach to sequential
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments: 41 pages, 11 figures

Abstract:We propose a novel approach to sequential Bayesian inference based on variational Bayes. The key insight is that, in the online setting, we do not need to add the KL term to regularize to the prior (which comes from the posterior at the previous timestep); instead we can optimize just the expected log-likelihood, performing a single step of natural gradient descent starting at the prior predictive. We prove this method recovers exact Bayesian inference if the model is conjugate, and empirically outperforms other online VB methods in the non-conjugate setting, such as online learning for neural networks, especially when controlling for computational costs.
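
The conjugate-case claim in this abstract can be checked by hand in a toy setting. The sketch below assumes a scalar Gaussian prior and a Gaussian likelihood with known variance, plus a unit step size; these choices and both function names are illustrative assumptions, not taken from the paper. It relies on the standard exponential-family identity that the natural gradient with respect to the natural parameters equals the ordinary gradient with respect to the mean parameters.

```python
def natural_gradient_step(m, v, y, s2):
    """One unit-size natural-gradient step on E_q[log N(y | theta, s2)],
    starting from the prior q = N(m, v). No KL term is added."""
    # Natural parameters of the Gaussian prior.
    eta1, eta2 = m / v, -0.5 / v
    # With mean parameters mu1 = E[theta] and mu2 = E[theta^2]:
    #   E_q[log p(y | theta)] = -(y^2 - 2*y*mu1 + mu2) / (2*s2) + const
    # so the gradient w.r.t. (mu1, mu2) is:
    g1, g2 = y / s2, -0.5 / s2
    eta1, eta2 = eta1 + g1, eta2 + g2
    # Convert back to mean and variance.
    v_new = -0.5 / eta2
    return eta1 * v_new, v_new


def exact_posterior(m, v, y, s2):
    """Closed-form Gaussian posterior for the same prior and observation."""
    precision = 1.0 / v + 1.0 / s2
    return (m / v + y / s2) / precision, 1.0 / precision
```

For example, with prior `N(0, 1)`, observation `y = 2` and noise variance `1`, both functions give mean `1.0` and variance `0.5`.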

[LG-158] Factor Augmented Tensor-on-Tensor Neural Networks

Link: https://arxiv.org/abs/2405.19610
Authors: Guanhao Zhou, Yuefeng Han, Xiufan Yu
Keywords: arbitrary tensor order, tensor factor models, multi-dimensional arrays, deep neural networks, Neural Network
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:This paper studies the prediction task of tensor-on-tensor regression in which both covariates and responses are multi-dimensional arrays (a.k.a., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focused on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employed black-box deep learning algorithms that failed to utilize the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin with summarizing and extracting useful predictive information (represented by the "factor tensor") from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as input of a temporal convolutional neural network. The proposed methods effectively handle nonlinearity between complex data structures, and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our proposed methods exploit the underlying latent factor structure to enhance the prediction, and in the meantime, drastically reduce the data dimensionality that speeds up the computation. The empirical performances of our proposed methods are demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our proposed algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods.

[LG-159] Convergence Bounds for Sequential Monte Carlo on Multimodal Distributions using Soft Decomposition

Link: https://arxiv.org/abs/2405.19553
Authors: Holden Lee, Matheau Santana-Gijzen
Keywords: Sequential Monte Carlo, Chain Monte Carlo, Markov Chain Monte, Markov chain, Monte Carlo
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments:

Abstract:We prove bounds on the variance of a function f under the empirical measure of the samples obtained by the Sequential Monte Carlo (SMC) algorithm, with time complexity depending on local rather than global Markov chain mixing dynamics. SMC is a Markov Chain Monte Carlo (MCMC) method, which starts by drawing N particles from a known distribution, and then, through a sequence of distributions, re-weights and re-samples the particles, at each instance applying a Markov chain for smoothing. In principle, SMC tries to alleviate problems from multi-modality. However, most theoretical guarantees for SMC are obtained by assuming global mixing time bounds, which are only efficient in the uni-modal setting. We show that bounds can be obtained in the truly multi-modal setting, with mixing times that depend only on local MCMC dynamics.
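
The re-weight, re-sample, and Markov-move loop that this abstract describes can be sketched on a toy bimodal target. The target, the geometric tempering path, multinomial resampling, the random-walk Metropolis kernel, and all names and parameter values below are assumptions chosen for illustration; the abstract does not specify them.

```python
import math
import random


def log_target(x):
    # Unnormalized equal mixture of N(-3, 1) and N(3, 1) (toy example).
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def log_init(x, scale=5.0):
    # Unnormalized N(0, scale^2), the known starting distribution.
    return -0.5 * (x / scale) ** 2


def smc(n=500, steps=10, mh_moves=5, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 5.0) for _ in range(n)]
    betas = [k / steps for k in range(steps + 1)]

    def log_pi(x, beta):
        # Geometric path from the initial law (beta=0) to the target (beta=1).
        return (1.0 - beta) * log_init(x) + beta * log_target(x)

    for b_prev, b in zip(betas, betas[1:]):
        # Re-weight: ratio of consecutive unnormalized densities.
        logw = [(b - b_prev) * (log_target(x) - log_init(x)) for x in xs]
        top = max(logw)
        weights = [math.exp(lw - top) for lw in logw]
        # Re-sample (multinomial).
        xs = rng.choices(xs, weights=weights, k=n)
        # Move: random-walk Metropolis steps that leave pi_b invariant.
        for i in range(n):
            x = xs[i]
            for _ in range(mh_moves):
                prop = x + rng.gauss(0.0, 1.0)
                diff = log_pi(prop, b) - log_pi(x, b)
                if rng.random() < math.exp(min(0.0, diff)):
                    x = prop
            xs[i] = x
    return xs
```

The Metropolis moves only need to mix within a mode at the later temperatures, which is the local-versus-global distinction the abstract emphasizes.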

[LG-160] Anatomical Region Recognition and Real-time Bone Tracking Methods by Dynamically Decoding A-Mode Ultrasound Signals

Link: https://arxiv.org/abs/2405.19542
Authors: Bangyu Lan, Stefano Stramigioli, Kenan Niu
Keywords: Accurate bone tracking, Accurate bone, anatomical region, bone tracking, prosthetic robotics
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Accurate bone tracking is crucial for kinematic analysis in orthopedic surgery and prosthetic robotics. Traditional methods (e.g., skin markers) are subject to soft tissue artifacts, and the bone pins used in surgery introduce the risk of additional trauma and infection. For electromyography (EMG), its inability to directly measure joint angles requires complex algorithms for kinematic estimation. To address these issues, A-mode ultrasound-based tracking has been proposed as a non-invasive and safe alternative. However, this approach suffers from limited accuracy in peak detection when processing received ultrasound signals. To build a precise and real-time bone tracking approach, this paper introduces a deep learning-based method for anatomical region recognition and bone tracking using A-mode ultrasound signals, specifically focused on the knee joint. The algorithm is capable of simultaneously performing bone tracking and identifying the anatomical region where the A-mode ultrasound transducer is placed. It uses full connections between all encoding and decoding layers of the cascaded U-Nets to focus only on the signal region that is most likely to contain the bone peak, thus pinpointing the exact location of the peak and classifying the anatomical region of the signal. The experiment showed a 97% accuracy in the classification of the anatomical regions and a precision of around 0.5 ± 1 mm under dynamic tracking conditions for various anatomical areas surrounding the knee joint. In general, this approach shows great potential beyond the traditional method, in terms of the accuracy achieved and the recognition of the anatomical region where the ultrasound has been attached as an additional functionality.

[LG-161] Exploring the Potential of Hybrid Machine-Learning/Physics-Based Modeling for Atmospheric/Oceanic Prediction Beyond the Medium Range

Link: https://arxiv.org/abs/2405.19518
Authors: Dhruvit Patel, Troy Arcomano, Brian Hunt, Istvan Szunyogh, Edward Ott
Keywords: combines machine learning, work of Arcomano, machine learning, medium range, combines machine
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments:

Abstract:This paper explores the potential of a hybrid modeling approach that combines machine learning (ML) with conventional physics-based modeling for weather prediction beyond the medium range. It extends the work of Arcomano et al. (2022), which tested the approach for short- and medium-range weather prediction, and the work of Arcomano et al. (2023), which investigated its potential for climate modeling. The hybrid model used for the forecast experiments of the paper is based on the low-resolution, simplified parameterization atmospheric general circulation model (AGCM) SPEEDY. In addition to the hybridized prognostic variables of SPEEDY, the current version of the model has three purely ML-based prognostic variables. One of these is 6 h cumulative precipitation, another is the sea surface temperature, while the third is the heat content of the top 300 m deep layer of the ocean. The model has skill in predicting the El Niño cycle and its global teleconnections with precipitation for 3-7 months depending on the season. The model captures equatorial variability of the precipitation associated with Kelvin and Rossby waves and MJO. Predictions of the precipitation in the equatorial region have skill for 15 days in the East Pacific and 11.5 days in the West Pacific. Though the model has low spatial resolution, for these tasks it has prediction skill comparable to what has been published for high-resolution, purely physics-based, conventional operational forecast models.

[LG-162] Enabling Visual Recognition at Radio Frequency

Link: https://arxiv.org/abs/2405.19516
Authors: Haowen Lai, Gaoxiang Luo, Yifei Liu, Mingmin Zhao
Keywords: paper introduces PanoRadar, paper introduces, providing resilience, resilience against conditions, conditions challenging
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar’s robust performance across 12 buildings.

[LG-163] Gaussian Flow Bridges for Audio Domain Transfer with Unpaired Data

Link: https://arxiv.org/abs/2405.19497
Authors: Eloi Moliner, Sebastian Braun, Hannes Gamper
Keywords: modifying audio signals, Gaussian Flow Bridges, Audio domain transfer, process of modifying, match characteristics
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Submitted to IWAENC 2024

Abstract:Audio domain transfer is the process of modifying audio signals to match characteristics of a different domain, while retaining the original content. This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem. The presented framework addresses the transport problem across different distributions of audio signals through the implementation of a series of two deterministic probability flows. The proposed framework facilitates manipulation of the target distribution properties through a continuous control variable, which defines a certain aspect of the target domain. Notably, this approach does not rely on paired examples for training. To address identified challenges in keeping the speech content consistent, we recommend a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise. Comparing our unsupervised method with established baselines, we find competitive performance in tasks of reverberation and distortion manipulation. Despite encountering limitations, the intriguing results obtained in this study underscore potential for further exploration.

[LG-164] Online Nonparametric Supervised Learning for Massive Data

Link: https://arxiv.org/abs/2405.19486
Authors: Mohamed Chaouch, Omama M. Al-Hamed
Keywords: imposed normal distribution, drawbacks including linearity, quadratic discriminant analysis, linear discriminant analysis, quadratic discriminant
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures

Abstract:Despite their benefits in terms of simplicity, low computational cost and data requirement, parametric machine learning algorithms, such as linear discriminant analysis, quadratic discriminant analysis or logistic regression, suffer from serious drawbacks including linearity, poor fit of features to the usually imposed normal distribution and high dimensionality. The batch kernel-based nonparametric classifier, which overcomes the linearity and normality constraints on the features, represents an interesting alternative for supervised classification problems. However, it suffers from the "curse of dimension". The problem can be alleviated by the explosive sample size in the era of big data, while large-scale data size presents some challenges in the storage of data and the calculation of the classifier. These challenges make the classical batch nonparametric classifier no longer applicable. This motivates us to develop a fast algorithm adapted to the real-time calculation of the nonparametric classifier in massive as well as streaming data frameworks. This online classifier includes two steps. First, we consider an online principal component analysis to reduce the dimension of the features with a very low computation cost. Then, a stochastic approximation algorithm is deployed to obtain a real-time calculation of the nonparametric classifier. The proposed methods are evaluated and compared to some commonly used machine learning algorithms for real-time fetal well-being monitoring. The study revealed that, in terms of accuracy, the offline (or batch) as well as the online classifiers are good competitors to the random forest algorithm. Moreover, we show that the online classifier gives the best trade-off between accuracy and computation cost compared to the offline classifier.

[LG-165] Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data

Link: https://arxiv.org/abs/2405.19463
Authors: Xuxing Chen, Abhishek Roy, Yifan Hu, Krishnakumar Balasubramanian
Keywords: instrumental variable regression, conditional stochastic optimization, variable regression, instrumental variable, stochastic optimization problem
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Optimization and Control (math.OC)
Comments:

Abstract:We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms require neither matrix inversions nor mini-batches and provide a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation that are of order $\mathcal{O}(\log T/T)$ and $\mathcal{O}(1/T^{1-\iota})$ for any $\iota > 0$, under the availability of two-sample and one-sample oracles, respectively, where $T$ is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure avoids explicitly modeling and estimating the relationship between confounder and the instrumental variables, demonstrating the benefit of the proposed approach over recent works based on reformulating the problem as minimax optimization problems. Numerical experiments are provided to corroborate the theoretical results.
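
A streaming update in the spirit of the two-sample oracle can be sketched for a scalar linear model. The data-generating process, the step-size schedule, the function name, and the specific update rule below are assumptions made for illustration; the exact algorithm in the paper may differ. The idea is that, given the same instrument value, the residual from one draw multiplied by an independent second draw of the regressor is an unbiased gradient estimate, so no matrix inversion or mini-batch is needed.

```python
import random


def simulate_stream(theta_star=2.0, steps=20000, seed=0):
    """Return (streaming IV estimate, naive least-squares estimate)."""
    rng = random.Random(seed)
    theta = 0.0
    sxy = sxx = 0.0  # running sums for the naive baseline
    for t in range(1, steps + 1):
        z = rng.gauss(0.0, 1.0)  # instrument
        u = rng.gauss(0.0, 1.0)  # unobserved confounder
        x = z + u + 0.1 * rng.gauss(0.0, 1.0)
        y = theta_star * x + u + 0.1 * rng.gauss(0.0, 1.0)
        # Second, independent draw of X given the same Z (two-sample oracle).
        x2 = z + rng.gauss(0.0, 1.0) + 0.1 * rng.gauss(0.0, 1.0)
        alpha = 1.0 / (t + 10)  # hypothetical decaying step size
        # E[(x*theta - y) * x2 | z] = (theta - theta_star) * z^2,
        # so the update drifts toward theta_star despite confounding.
        theta -= alpha * (x * theta - y) * x2
        sxy += x * y
        sxx += x * x
    return theta, sxy / sxx
```

In this toy setup the naive least-squares slope is biased upward by the confounder, while the streaming estimate should settle near `theta_star`.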

[LG-166] Neural Scaling Laws From Large-N Field Theory: Solvable Model Beyond the Ridgeless Limit

Link: https://arxiv.org/abs/2405.19398
Authors: Zhengkang Zhang
Keywords: machine learning models, learning models based, exhibit scaling laws, networks exhibit scaling, training data set
Categories: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Comments: 51 pages, 3 figures

点击查看摘要

Abstract:Many machine learning models based on neural networks exhibit scaling laws: their performance scales as power laws with respect to the sizes of the model and training data set. We use large-N field theory methods to solve a model recently proposed by Maloney, Roberts and Sully which provides a simplified setting to study neural scaling laws. Our solution extends the result in this latter paper to general nonzero values of the ridge parameter, which are essential to regularize the behavior of the model. In addition to obtaining new and more precise scaling laws, we also uncover a duality transformation at the diagrams level which explains the symmetry between model and training data set sizes. The same duality underlies recent efforts to design neural networks to simulate quantum field theories.

[LG-167] Ground state phases of the two-dimension electron gas with a unified variational approach

Link: https://arxiv.org/abs/2405.19397
Authors: Conor Smith, Yixiao Chen, Ryan Levy, Yubo Yang, Miguel A. Morales, Shiwei Zhang
Keywords: two-dimensional electron gas, drawing increasing interest, Monte Carlo calculations, quantum Monte Carlo, electron gas
Categories: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)

Abstract:The two-dimensional electron gas (2DEG) is a fundamental model, which is drawing increasing interest because of recent advances in experimental and theoretical studies of 2D materials. Current understanding of the ground state of the 2DEG relies on quantum Monte Carlo calculations, based on variational comparisons of different ansatze for different phases. We use a single variational ansatz, a general backflow-type wave function using a message-passing neural quantum state architecture, for a unified description across the entire density range. The variational optimization consistently leads to lower ground-state energies than previous best results. Transition into a Wigner crystal (WC) phase occurs automatically at $r_s = 37 \pm 1$, a density lower than currently believed. Between the liquid and WC phases, the same ansatz and variational search strongly suggest the existence of intermediate states in a broad range of densities, with enhanced short-range nematic spin correlations.

[LG-168] NeuralODEs for VLEO simulations: Introducing thermoNET for Thermosphere Modeling

Link: https://arxiv.org/abs/2405.19384
Authors: Dario Izzo, Giacomo Acciarini, Francesco Biscani
Keywords: architecture termed thermoNET, neural architecture termed, neural Ordinary Differential, Ordinary Differential Equation, represent thermospheric density
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments: Paper presented and published in the 29th ISSFD Conference, Darmstadt, Germany

Abstract:We introduce a novel neural architecture termed thermoNET, designed to represent thermospheric density in satellite orbital propagation using a reduced amount of differentiable computations. Due to the appearance of a neural network on the right-hand side of the equations of motion, the resulting satellite dynamics is governed by a NeuralODE, a neural Ordinary Differential Equation, characterized by its fully differentiable nature, allowing the derivation of variational equations (hence of the state transition matrix) and facilitating its use in connection to advanced numerical techniques such as Taylor-based numerical propagation and differential algebraic techniques. Efficient training of the network parameters occurs through two distinct approaches. In the first approach, the network undergoes training independently of spacecraft dynamics, engaging in a pure regression task against ground truth models, including JB-08 and NRLMSISE-00. In the second paradigm, network parameters are learned based on observed dynamics, adapting through ODE sensitivities. In both cases, the outcome is a flexible, compact model of the thermosphere density greatly enhancing numerical propagation efficiency while maintaining accuracy in the orbital predictions.

[LG-169] Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret

Link: https://arxiv.org/abs/2405.19380
Authors: Yeoneung Kim, Gihun Kim, Insoon Yang
Keywords: linear quadratic regulators, learns linear quadratic, improved Bayesian regret, approximate Thompson sampling, Thompson sampling algorithm
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 61 pages, 6 figures

Abstract:We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism. We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic restrictive assumptions on parameter sets that are often used in the literature.

[LG-170] Optimal Multiclass U-Calibration Error and Beyond

Link: https://arxiv.org/abs/2405.19374
Authors: Haipeng Luo, Spandan Senapati, Vatsal Sharan
Keywords: make sequential distributional, sequential distributional predictions, U-calibration error, online multiclass U-calibration, U-calibration
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We consider the problem of online multiclass U-calibration, where a forecaster aims to make sequential distributional predictions over K classes with low U-calibration error, that is, low regret with respect to all bounded proper losses simultaneously. Kleinberg et al. (2023) developed an algorithm with U-calibration error $O(K\sqrt{T})$ after T rounds and raised the open question of what the optimal bound is. We resolve this question by showing that the optimal U-calibration error is $\Theta(\sqrt{KT})$ – we start with a simple observation that the Follow-the-Perturbed-Leader algorithm of Daskalakis and Syrgkanis (2016) achieves this upper bound, followed by a matching lower bound constructed with a specific proper loss (which, as a side result, also proves the optimality of the algorithm of Daskalakis and Syrgkanis (2016) in the context of online learning against an adversary with finite choices). We also strengthen our results under natural assumptions on the loss functions, including $\Theta(\log T)$ U-calibration error for Lipschitz proper losses, $O(\log T)$ U-calibration error for a certain class of decomposable proper losses, U-calibration error bounds for proper losses with a low covering number, and others.

[LG-171] Multi-modal Mood Reader: Pre-trained Model Empowers Cross-Subject Emotion Recognition

Link: https://arxiv.org/abs/2405.19373
Authors: Yihang Dong, Xuhang Chen, Yanyan Shen, Michael Kwok-Po Ng, Tao Qian, Shuqiang Wang
Keywords: cross-subject emotion recognition, Emotion recognition, gained significant attention, EEG signals, cross-subject emotion
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted by International Conference on Neural Computing for Advanced Applications, 2024

Abstract:Emotion recognition based on Electroencephalography (EEG) has gained significant attention and diversified development in fields such as neural signal processing and affective computing. However, the unique brain anatomy of individuals leads to non-negligible natural differences in EEG signals across subjects, posing challenges for cross-subject emotion recognition. While recent studies have attempted to address these issues, they still face limitations in practical effectiveness and model framework unity. Current methods often struggle to capture the complex spatial-temporal dynamics of EEG signals and fail to effectively integrate multimodal information, resulting in suboptimal performance and limited generalizability across subjects. To overcome these limitations, we develop a pre-trained-model-based Multimodal Mood Reader for cross-subject emotion recognition that utilizes masked brain signal modeling and an interlinked spatial-temporal attention mechanism. The model learns universal latent representations of EEG signals through pre-training on a large-scale dataset, and employs the interlinked spatial-temporal attention mechanism to process Differential Entropy (DE) features extracted from EEG data. Subsequently, a multi-level fusion layer is proposed to integrate the discriminative features, maximizing the advantages of features across different dimensions and modalities. Extensive experiments on public datasets demonstrate Mood Reader’s superior performance in cross-subject emotion recognition tasks, outperforming state-of-the-art methods. Additionally, the model is dissected from an attention perspective, providing qualitative analysis of emotion-related brain areas and offering valuable insights for affective research in neural signal processing.

[LG-172] Medformer: A Multi-Granularity Patching Transformer for Medical Time-Series Classification

Link: https://arxiv.org/abs/2405.19363
Authors: Yihe Wang, Nan Huang, Taida Li, Yujun Yan, Xiang Zhang
Keywords: Medical time series, Medical time, time series data, time series, time series classification
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages (14 pages main paper + 6 pages supplementary materials)

Abstract:Medical time series data, such as Electroencephalography (EEG) and Electrocardiography (ECG), play a crucial role in healthcare, such as diagnosing brain and heart diseases. Existing methods for medical time series classification primarily rely on handcrafted biomarkers extraction and CNN-based models, with limited exploration of transformers tailored for medical time series. In this paper, we introduce Medformer, a multi-granularity patching transformer tailored specifically for medical time series classification. Our method incorporates three novel mechanisms to leverage the unique characteristics of medical time series: cross-channel patching to leverage inter-channel correlations, multi-granularity embedding for capturing features at different scales, and two-stage (intra- and inter-granularity) multi-granularity self-attention for learning features and correlations within and among granularities. We conduct extensive experiments on five public datasets under both subject-dependent and challenging subject-independent setups. Results demonstrate Medformer’s superiority over 10 baselines, achieving top averaged ranking across five datasets on all six evaluation metrics. These findings underscore the significant impact of our method on healthcare applications, such as diagnosing Myocardial Infarction, Alzheimer’s, and Parkinson’s disease. We release the source code at this https URL.

[LG-173] Modally Reduced Representation Learning of Multi-Lead ECG Signals through Simultaneous Alignment and Reconstruction

Link: https://arxiv.org/abs/2405.19359
Authors: Nabil Ibtehaz, Masood Mortazavi
Keywords: ECG signals, ECG, profiling the electrical, electrical activities, plethora of diagnostic
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted as a Workshop Paper at TS4H@ICLR2024

Abstract:Electrocardiogram (ECG) signals, profiling the electrical activities of the heart, are used for a plethora of diagnostic applications. However, ECG systems require multiple leads or channels of signals to capture the complete view of the cardiac system, which limits their application in smartwatches and wearables. In this work, we propose a modally reduced representation learning method for ECG signals that is capable of generating channel-agnostic, unified representations for ECG signals. Through joint optimization of reconstruction and alignment, we ensure that the embeddings of the different channels contain an amalgamation of the overall information across channels while also retaining their specific information. On an independent test dataset, we generated highly correlated channel embeddings from different ECG channels, leading to a moderate approximation of the 12-lead signals from a single-channel embedding. Our generated embeddings can work as competent features for ECG signals for downstream tasks.

[LG-174] An LSTM Feature Imitation Network for Hand Movement Recognition from sEMG Signals

Link: https://arxiv.org/abs/2405.19356
Authors: Chuheng Wu, S. Farokh Atashzar, Mohammad M. Ghassemi, Tuka Alhanai
Keywords: Surface Electromyography, hand movement recognition, hand movement patterns, diagnosis of diseases, control of prostheses
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This work has been submitted to RA-L, and under review

Abstract:Surface Electromyography (sEMG) is a non-invasive signal that is used in the recognition of hand movement patterns, the diagnosis of diseases, and the robust control of prostheses. Despite the remarkable success of recent end-to-end Deep Learning approaches, they are still limited by the need for large amounts of labeled data. To alleviate the requirement for big data, researchers utilize Feature Engineering, which involves decomposing the sEMG signal into several spatial, temporal, and frequency features. In this paper, we propose utilizing a feature-imitating network (FIN) for closed-form temporal feature learning over a 300ms signal window on Ninapro DB2, and applying it to the task of 17 hand movement recognition. We implement a lightweight LSTM-FIN network to imitate four standard temporal features (entropy, root mean square, variance, simple square integral). We then explore transfer learning capabilities by applying the pre-trained LSTM-FIN for tuning to a downstream hand movement recognition task. We observed that the LSTM network can achieve up to 99% $R^2$ accuracy in feature reconstruction and 80% accuracy in hand movement recognition. Our results also showed that the model can be robustly applied for both within- and cross-subject movement recognition, as well as simulated low-latency environments. Overall, our work demonstrates the potential of the FIN modeling paradigm in data-scarce scenarios for sEMG signal processing.
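
The four closed-form temporal features named in the abstract are the regression targets a feature-imitating network would learn. The sketch below computes them for one signal window in plain Python. The function names are hypothetical, and the entropy definition (Shannon entropy of an equal-width histogram) is an assumption, since the abstract does not say which entropy variant is used.

```python
import math


def root_mean_square(window):
    """Square root of the mean squared amplitude."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def variance(window):
    """Population variance of the window."""
    mean = sum(window) / len(window)
    return sum((x - mean) ** 2 for x in window) / len(window)


def simple_square_integral(window):
    """Sum of squared amplitudes (signal energy)."""
    return sum(x * x for x in window)


def histogram_entropy(window, bins=8):
    """Shannon entropy (bits) of an equal-width amplitude histogram.

    Assumed definition; other entropy variants are equally plausible.
    """
    lo, hi = min(window), max(window)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for x in window:
        idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def temporal_features(window):
    """Return the four features as a dict, one value per feature."""
    return {
        "entropy": histogram_entropy(window),
        "rms": root_mean_square(window),
        "variance": variance(window),
        "ssi": simple_square_integral(window),
    }
```

A network trained to reproduce `temporal_features(window)` from the raw window is the kind of imitation target the abstract describes; the downstream classifier would then be tuned from those pre-trained weights.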

[LG-175] Resonate-and-Fire Spiking Neurons for Target Detection and Hand Gesture Recognition: A Hybrid Approach

Link: https://arxiv.org/abs/2405.19351
Authors: Ahmed Shaaban, Zeineb Chaabouni, Maximilian Strobel, Wolfgang Furtner, Robert Weigel, Fabian Lurz
Keywords: fast Fourier transforms, expensive fast Fourier, fast Fourier, Fourier transforms, computationally expensive fast
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Hand gesture recognition using radar often relies on computationally expensive fast Fourier transforms. This paper proposes an alternative approach that bypasses fast Fourier transforms using resonate-and-fire neurons. These neurons directly detect the hand in the time-domain signal, eliminating the need for fast Fourier transforms to retrieve range information. Following detection, a simple Goertzel algorithm is employed to extract five key features, eliminating the need for a second fast Fourier transform. These features are then fed into a recurrent neural network, achieving an accuracy of 98.21% for classifying five gestures. The proposed approach demonstrates competitive performance with reduced complexity compared to traditional methods.
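
The Goertzel algorithm mentioned in the abstract evaluates a single DFT bin with a second-order recurrence, so a handful of frequencies can be probed without computing a full FFT. The following is a generic textbook sketch in plain Python, not the paper's feature extractor; the name `goertzel_power` and the toy test tone are assumptions for illustration.

```python
import math


def goertzel_power(samples, k):
    """Squared magnitude of DFT bin k of `samples` via the Goertzel recurrence."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # s[i] = x[i] + 2*cos(w)*s[i-1] - s[i-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


if __name__ == "__main__":
    n = 64
    # Unit-amplitude cosine that sits exactly on bin 5.
    tone = [math.cos(2.0 * math.pi * 5 * i / n) for i in range(n)]
    print(goertzel_power(tone, 5))  # energy concentrated in this bin
    print(goertzel_power(tone, 9))  # close to zero
```

Each bin costs one multiply and two additions per sample, which is why probing only a few bins can be cheaper than a full transform.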

[LG-176] Beyond Isolated Frames: Enhancing Sensor-Based Human Activity Recognition through Intra- and Inter-Frame Attention

Link: https://arxiv.org/abs/2405.19349
Authors: Shuai Shao, Yu Guan, Victor Sanchez
Keywords: Human Activity Recognition, Activity Recognition, Convolutional Neural Networks, Human Activity, ubiquitous computing
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Human Activity Recognition (HAR) has become increasingly popular with ubiquitous computing, driven by the popularity of wearable sensors in fields like healthcare and sports. While Convolutional Neural Networks (ConvNets) have significantly contributed to HAR, they often adopt a frame-by-frame analysis, concentrating on individual frames and potentially overlooking the broader temporal dynamics inherent in human activities. To address this, we propose the intra- and inter-frame attention model. This model captures both the nuances within individual frames and the broader contextual relationships across multiple frames, offering a comprehensive perspective on sequential data. We further enrich the temporal understanding by proposing a novel time-sequential batch learning strategy. This learning strategy preserves the chronological sequence of time-series data within each batch, ensuring the continuity and integrity of temporal patterns in sensor-based HAR.

[LG-177] NERULA: A Dual-Pathway Self-Supervised Learning Framework for Electrocardiogram Signal Analysis

Link: https://arxiv.org/abs/2405.19348
Authors: Gouthamaan Manimaran, Sadasivan Puthusserypady, Helena Domínguez, Adrian Atienza, Jakob E. Bardram
Keywords: diagnosing heart conditions, ECG, detailed cardiac patterns, Unsupervised Learning Algorithm, critical for diagnosing
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Paper in review

Abstract:Electrocardiogram (ECG) signals are critical for diagnosing heart conditions and capturing detailed cardiac patterns. As wearable single-lead ECG devices become more common, efficient analysis methods are essential. We present NERULA (Non-contrastive ECG and Reconstruction Unsupervised Learning Algorithm), a self-supervised framework designed for single-lead ECG signals. NERULA’s dual-pathway architecture combines ECG reconstruction and non-contrastive learning to extract detailed cardiac features. Our 50% masking strategy, using both masked and inverse-masked signals, enhances model robustness against real-world incomplete or corrupted data. The non-contrastive pathway aligns representations of masked and inverse-masked signals, while the reconstruction pathway comprehends and reconstructs missing features. We show that combining generative and discriminative paths into the training spectrum leads to better results by outperforming state-of-the-art self-supervised learning benchmarks in various tasks, demonstrating superior performance in ECG analysis, including arrhythmia classification, gender classification, age regression, and human activity recognition. NERULA’s dual-pathway design offers a robust, efficient solution for comprehensive ECG signal interpretation.
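
The abstract's 50% masking with masked and inverse-masked signals can be pictured as two complementary views of one recording. The sketch below is an illustrative guess rather than the NERULA implementation: it assumes per-sample masking with zero fill (the paper may mask contiguous segments or use a different fill value), and `complementary_views` is a hypothetical name.

```python
import random


def complementary_views(signal, mask_ratio=0.5, seed=0):
    """Split a 1-D signal into a masked view and its inverse-masked view.

    Positions hidden in one view are visible in the other, so the two
    views together cover every sample exactly once.
    """
    rng = random.Random(seed)
    n = len(signal)
    hidden = set(rng.sample(range(n), int(n * mask_ratio)))
    masked = [0.0 if i in hidden else x for i, x in enumerate(signal)]
    inverse_masked = [x if i in hidden else 0.0 for i, x in enumerate(signal)]
    return masked, inverse_masked


if __name__ == "__main__":
    ecg = [float(v) for v in range(1, 9)]  # stand-in for a single-lead ECG window
    a, b = complementary_views(ecg)
    print(a)
    print(b)
```

In a dual-pathway setup of the kind described, both views would go through the same encoder, with one loss aligning their representations and another reconstructing the hidden samples.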

[LG-178] Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach

Link: https://arxiv.org/abs/2405.19347
Authors: Mohammad Amir Fallah, Mehdi Monemi, Mehdi Rasti, Matti Latva-Aho
Keywords: conventional angular-domain beamforming, concentrates radiating power, Desired Focal Point, angular-domain beamforming, concentrates radiating
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:3D spot beamfocusing (SBF), in contrast to conventional angular-domain beamforming, concentrates radiating power within a very small volume in both radial and angular domains in the near-field zone. Recently, channel-state-information (CSI)-independent machine learning (ML)-based approaches have been developed for effective SBF using extremely-large-scale programmable metasurfaces (ELPMs). These methods involve dividing the ELPMs into subarrays and independently training them with Deep Reinforcement Learning to jointly focus the beam at the Desired Focal Point (DFP). This paper explores near-field SBF using ELPMs, addressing challenges associated with lengthy training times resulting from independent training of subarrays. To achieve a faster CSI-independent solution, inspired by the correlation between the beamfocusing matrices of the subarrays, we leverage transfer learning techniques. First, we introduce a novel similarity criterion based on the Phase Distribution Image of subarray apertures. Then we devise a subarray policy propagation scheme that transfers the knowledge from trained to untrained subarrays. We further enhance learning by introducing Quasi-Liquid-Layers as a revised version of the adaptive policy reuse technique. We show through simulations that the proposed scheme improves the training speed about 5 times. Furthermore, for dynamic DFP management, we devised a DFP policy blending process, which augments the convergence rate up to 8-fold.

[LG-179] Subject-Adaptive Transfer Learning Using Resting State EEG Signals for Cross-Subject EEG Motor Imagery Classification

Link: https://arxiv.org/abs/2405.19346
Authors: Sion An, Myeongkyun Kang, Soopil Kim, Philip Chikontwe, Li Shen, Sang Hyun Park
Keywords: EEG signals, EEG, motor imagery, inter-subject variability, signals
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Early Accepted at MICCAI 2024

Abstract:Electroencephalography (EEG) motor imagery (MI) classification is a fundamental, yet challenging task due to the variation of signals between individuals i.e., inter-subject variability. Previous approaches try to mitigate this using task-specific (TS) EEG signals from the target subject in training. However, recording TS EEG signals requires time and limits its applicability in various fields. In contrast, resting state (RS) EEG signals are a viable alternative due to ease of acquisition with rich subject information. In this paper, we propose a novel subject-adaptive transfer learning strategy that utilizes RS EEG signals to adapt models on unseen subject data. Specifically, we disentangle extracted features into task- and subject-dependent features and use them to calibrate RS EEG signals for obtaining task information while preserving subject characteristics. The calibrated signals are then used to adapt the model to the target subject, enabling the model to simulate processing TS EEG signals of the target subject. The proposed method achieves state-of-the-art accuracy on three public benchmarks, demonstrating the effectiveness of our method in cross-subject EEG MI classification. Our findings highlight the potential of leveraging RS EEG signals to advance practical brain-computer interface systems.

[LG-180] Review of Deep Representation Learning Techniques for Brain-Computer Interfaces and Recommendations

Link: https://arxiv.org/abs/2405.19345
Authors: Pierre Guetschel, Sara Ahmadi, Michael Tangermann
Keywords: gained substantial interest, deep representation learning, representation learning techniques, leveraging deep learning, deep learning techniques
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Submitted to: Journal of Neural Engineering (JNE)

Abstract:In the field of brain-computer interfaces (BCIs), the potential for leveraging deep learning techniques for representing electroencephalogram (EEG) signals has gained substantial interest. This review synthesizes empirical findings from a collection of articles using deep representation learning techniques for BCI decoding, to provide a comprehensive analysis of the current state-of-the-art. Each article was scrutinized based on three criteria: (1) the deep representation learning technique employed, (2) the underlying motivation for its utilization, and (3) the approaches adopted for characterizing the learned representations. Among the 81 articles finally reviewed in depth, our analysis reveals a predominance of 31 articles using autoencoders. We identified 13 studies employing self-supervised learning (SSL) techniques, among which ten were published in 2022 or later, attesting to the relative youth of the field. However, at the time being, none of these have led to standard foundation models that are picked up by the BCI community. Likewise, only a few studies have introspected their learned representations. We observed that the motivation in most studies for using representation learning techniques is for solving transfer learning tasks, but we also found more specific motivations such as to learn robustness or invariances, as an algorithmic bridge, or finally to uncover the structure of the data. Given the potential of foundation models to effectively tackle these challenges, we advocate for a continued dedication to the advancement of foundation models specifically designed for EEG signal decoding by using SSL techniques. We also underline the imperative of establishing specialized benchmarks and datasets to facilitate the development and continuous improvement of such foundation models.

[LG-181] Obtaining physical layer data of latest generation networks for investigating adversary attacks

Link: https://arxiv.org/abs/2405.19340
Authors: M.V. Ushakova, Yu. A. Ushakov, L.V. Legashev
Keywords: machine learning, fields of science, science and technology, developing rapidly, machine learning models
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:The field of machine learning is developing rapidly and is being used in various fields of science and technology. In this way, machine learning can be used to optimize the functions of latest generation data networks such as 5G and 6G. This also applies to functions at a lower level. A feature of the use of machine learning in the radio path for targeted radiation generation in modern ultra-massive MIMO, reconfigurable intelligent interfaces and other technologies is the complex acquisition and processing of data from the physical layer. Additionally, adversarial measures that manipulate the behaviour of intelligent machine learning models are becoming a major concern, as many machine learning models are sensitive to incorrect input data. To obtain data on attacks directly from processing service information, a simulation model is proposed that works in conjunction with machine learning applications.

Information Retrieval

[IR-0] Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Link: https://arxiv.org/abs/2405.20204
Authors: Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
Keywords: Contrastive Language-Image Pretraining, Language-Image Pretraining, common embedding space, fixed-sized vectors, align images
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 4 pages, ICML2024 workshop submission

Abstract:Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
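
For readers unfamiliar with how contrastive training aligns two sets of fixed-size vectors, here is a small sketch of the symmetric cross-entropy (InfoNCE-style) objective commonly associated with CLIP-like models. It is a generic illustration and not the multi-task method proposed in the paper; the function name, the plain-list similarity matrix, and the temperature default are assumptions.

```python
import math


def symmetric_contrastive_loss(sim, temperature=1.0):
    """Average of row-wise and column-wise cross-entropy on a similarity matrix.

    sim[i][j] is the similarity between item i of one side (for example an
    image or a query) and item j of the other side (for example a caption or
    a passage). Matching pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def diagonal_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [v / temperature for v in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    aligned = [[10.0, 0.0], [0.0, 10.0]]
    uninformative = [[0.0, 0.0], [0.0, 0.0]]
    print(symmetric_contrastive_loss(aligned))
    print(symmetric_contrastive_loss(uninformative))
```

The same loss shape applies whether the two sides are image and text or text and text, which is what makes a single model for both retrieval settings plausible.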

[IR-1] Generating Query Recommendations via LLMs

Link: