This post presents the latest paper list retrieved from arXiv.org on 2024-08-30. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically every morning at around 10:30.

Tip: If you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 10:30 each day.

Table of Contents

Overview (2024-08-30)

A total of 389 papers were updated today, including:

  • Natural Language Processing: 40 (Computation and Language (cs.CL))
  • Artificial Intelligence: 72 (Artificial Intelligence (cs.AI))
  • Computer Vision: 97 (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 117 (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

Link: https://arxiv.org/abs/2408.16768
Authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Chengzhuo Tong, Peng Gao, Chunyuan Li, Pheng-Ann Heng
Keywords: exploration adapting Segment, Segment Anything Model, preliminary exploration adapting, adapting Segment, preliminary exploration
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress. Online Demo: this https URL . Code: this https URL

Abstract:We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: this https URL . Code: this https URL .

[NLP-1] How Far Can Cantonese NLP Go? Benchmarking Cantonese Capabilities of Large Language Models

Link: https://arxiv.org/abs/2408.16756
Authors: Jiyue Jiang, Liheng Chen, Pengan Chen, Sheng Wang, Qinghang Bao, Lingpeng Kong, Yu Li, Chuan Wu
Keywords: Greater Bay Area, natural language processing, Kong-Macau Greater Bay, rapid evolution, evolution of large
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area, and in substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, which aim to advance open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development.

[NLP-2] Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

Link: https://arxiv.org/abs/2408.16753
Authors: Alec Solway
Keywords: align language models, human preference signals, Reinforcement learning, likelihood maximization, align language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task specific data. Since human preferences are often unavailable for the last step, it is performed using likelihood maximization as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and trains a model what to do under a range of scenarios as it explores the policy space. In addition, it also trains a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it garners performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. Use of the procedure produced significantly better results than likelihood maximization when comparing raw predictions. For the specific data tested, the gap could be bridged by employing post-processing of the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to include more complex classes of undesirable outputs to penalize and train against, such as hallucinations.
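The contrast the abstract draws between the two updates can be made concrete. Below is a toy sketch (our illustration, not the paper's code) of gradient weightings for a three-token softmax policy: the MLE gradient pushes up only the demonstrated token, while a REINFORCE-style reward-weighted update also actively pushes down a sampled continuation that received negative reward — the "what not to do" signal the abstract describes.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, token):
    """d log p(token) / d logits for a softmax policy: one-hot minus probs."""
    p = softmax(logits)
    return [(1.0 if i == token else 0.0) - p[i] for i in range(len(logits))]

logits = [0.0, 0.0, 0.0]

# MLE: imitate the demonstrated token 0 -- it only says "what to do".
mle_update = grad_log_prob(logits, 0)

# REINFORCE: weight sampled tokens by reward; a negative-reward sample
# (token 2) is actively pushed down -- "what not to do".
samples = [(0, +1.0), (2, -1.0)]
rl_update = [0.0, 0.0, 0.0]
for token, reward in samples:
    g = grad_log_prob(logits, token)
    rl_update = [u + reward * gi for u, gi in zip(rl_update, g)]

assert mle_update[2] < 0 < mle_update[0]  # MLE lowers token 2 only via softmax coupling
assert rl_update[2] < mle_update[2]       # RL penalizes the bad token much harder
```

The suppression of "competitive but poor actions" in the abstract corresponds to the second assertion: the negatively rewarded token's logit is driven down directly, not merely by normalization.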

[NLP-3] A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models

Link: https://arxiv.org/abs/2408.16751
Authors: Yi-Lin Tuan, William Yang Wang
Keywords: including unlikelihood training, maximum likelihood estimation, average treatment effect, exponential maximizing average, maximizing average treatment
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM) that optimizes good examples probabilities, many studies have explored ways that also penalize bad examples for enhancing the quality of output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and further provide a unified recipe for LM optimization, in this paper, we present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
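Of the objectives compared in the abstract, DPO has a particularly compact closed form. A minimal sketch of the per-pair loss (our illustration of the standard DPO objective, not the authors' gradient-analysis code): it rewards the preferred ("winner") completion and penalizes the rejected ("loser") one relative to a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred and rejected
    completions; ref_* are the same quantities under the reference
    model. The loss is -log sigmoid(beta * margin), so a larger
    implicit-reward margin in favor of the winner means a smaller loss.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin for the preferred answer pushes the loss below log 2;
# a negative margin pushes it above.
low = dpo_loss(logp_w=-5.0, logp_l=-20.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
high = dpo_loss(logp_w=-10.0, logp_l=-5.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
assert low < math.log(2.0) < high
```

The paper's gradient analysis studies exactly how the derivative of such losses weights the "good" and "bad" terms, compared with MLE and ExMATE.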

[NLP-4] Assessing Large Language Models for Online Extremism Research: Identification Explanation and New Knowledge

Link: https://arxiv.org/abs/2408.16749
Authors: Beidi Dong, Jin R. Lee, Ziwei Zhu, Balassubramanian Srinivasan
Keywords: United States, Bidirectional Encoder Representations, States has experienced, Generative Pre-Trained Transformers, GPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The United States has experienced a significant increase in violent extremism, prompting the need for automated tools to detect and limit the spread of extremist ideology online. This study evaluates the performance of Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Trained Transformers (GPT) in detecting and classifying online domestic extremist posts. We collected social media posts containing “far-right” and “far-left” ideological keywords and manually labeled them as extremist or non-extremist. Extremist posts were further classified into one or more of five contributing elements of extremism based on a working definitional framework. The BERT model’s performance was evaluated based on training data size and knowledge transfer between categories. We also compared the performance of GPT 3.5 and GPT 4 models using different prompts: naïve, layperson-definition, role-playing, and professional-definition. Results showed that the best performing GPT models outperformed the best performing BERT models, with more detailed prompts generally yielding better results. However, overly complex prompts may impair performance. Different versions of GPT have unique sensitives to what they consider extremist. GPT 3.5 performed better at classifying far-left extremist posts, while GPT 4 performed better at classifying far-right extremist posts. Large language models, represented by GPT models, hold significant potential for online extremism classification tasks, surpassing traditional BERT models in a zero-shot setting. Future research should explore human-computer interactions in optimizing GPT models for extremist detection and classification tasks to develop more efficient (e.g., quicker, less effort) and effective (e.g., fewer errors or mistakes) methods for identifying extremist content.

[NLP-5] Theoretical and Methodological Framework for Studying Texts Produced by Large Language Models

Link: https://arxiv.org/abs/2408.16740
Authors: Jiří Milička
Keywords: quantitative linguistics perspective, large language models, addresses the conceptual, methodological and technical, studying large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper addresses the conceptual, methodological and technical challenges in studying large language models (LLMs) and the texts they produce from a quantitative linguistics perspective. It builds on a theoretical framework that distinguishes between the LLM as a substrate and the entities the model simulates. The paper advocates for a strictly non-anthropomorphic approach to models while cautiously applying methodologies used in studying human linguistic behavior to the simulated entities. While natural language processing researchers focus on the models themselves, their architecture, evaluation, and methods for improving performance, we as quantitative linguists should strive to build a robust theory concerning the characteristics of texts produced by LLMs, how they differ from human-produced texts, and the properties of simulated entities. Additionally, we should explore the potential of LLMs as an instrument for studying human culture, of which language is an integral part.

[NLP-6] Smaller Weaker Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Link: https://arxiv.org/abs/2408.16737
Authors: Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi
Keywords: strong language models, strong language, data, high-quality synthetic data, common strategy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training on high-quality synthetic data from strong language models (LMs) is a common strategy to improve the reasoning performance of LMs. In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we investigate the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model. We evaluate the generated data across three key metrics: coverage, diversity, and false positive rate, and show that the data from WC models may have higher coverage and diversity, but also exhibit higher false positive rates. We then finetune LMs on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM. Our findings reveal that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models. These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.
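Of the three metrics the abstract names, coverage is the most mechanical to compute once correctness labels exist. A minimal sketch (the helper names and the reading of "false positive rate" as correct-answer-with-flawed-reasoning are our assumptions, not the authors' code):

```python
def coverage(samples_per_problem):
    """Fraction of problems solved by at least one sampled solution.

    samples_per_problem: one inner list of booleans per problem, each
    boolean marking whether that sampled solution was judged correct.
    """
    solved = sum(1 for samples in samples_per_problem if any(samples))
    return solved / len(samples_per_problem)

def false_positive_rate(flags):
    """Among samples with a correct final answer, the share whose
    reasoning was judged flawed. flags: non-empty list of
    (answer_correct, reasoning_correct) pairs."""
    answered = [r for a, r in flags if a]
    return sum(1 for r in answered if not r) / len(answered)

# 3 problems, 2 samples each: problems 0 and 2 are covered.
assert coverage([[True, False], [False, False], [False, True]]) == 2 / 3
assert false_positive_rate([(True, False), (True, True)]) == 0.5
```

The paper's point is that at matched compute the cheaper model yields more samples, which raises coverage and diversity even though each individual sample is weaker.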

[NLP-7] Mini-Omni: Language Models Can Hear Talk While Thinking in Streaming

Link: https://arxiv.org/abs/2408.16725
Authors: Zhifei Xie, Changqiao Wu
Keywords: achieved significant progress, Recent advances, significant progress, achieved significant, Recent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 10 pages

Abstract:Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model’s language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method “Any Model Can Talk”. We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

[NLP-8] Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Link: https://arxiv.org/abs/2408.16672
Authors: Rohan Jha, Bo Wang, Michael Günther, Saba Sturua, Mohammad Kalim Akram, Han Xiao
Keywords: proven highly effective, Multi-vector dense models, Multi-vector dense, proven highly, highly effective
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
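The late-interaction scoring the abstract refers to is simple to state: each query token embedding takes its maximum similarity over all document token embeddings, and these maxima are summed (MaxSim). A toy sketch with plain Python lists; real implementations use batched, normalized tensors:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_embs, doc_embs):
    """ColBERT-style MaxSim: sum over query tokens of the similarity of
    the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Two-token query against a three-token document (2-d toy embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
score = late_interaction_score(query, doc)
assert abs(score - 1.7) < 1e-9  # 0.9 for the first query token + 0.8 for the second
```

Per the abstract, Jina-ColBERT-v2's improvements are in the encoder architecture and training pipeline rather than in this scoring rule itself.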

[NLP-9] Iterative Graph Alignment

Link: https://arxiv.org/abs/2408.16667
Authors: Fangyuan Yu, Hardeep Singh Arora, Matt Johnson
Keywords: generalizable causal relationships, capturing generalizable causal, compressing diverse narratives, causal relationships, intelligence by capturing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 12 pages, 4 figures

Abstract:By compressing diverse narratives, LLMs go beyond memorization, achieving intelligence by capturing generalizable causal relationships. However, they suffer from local ‘representation gaps’ due to insufficient training data diversity, limiting their real-world utility, especially in tasks requiring strict alignment to rules. Traditional alignment methods relying on heavy human annotations are inefficient and unscalable. Recent self-alignment techniques also fall short, as they often depend on self-selection based prompting and memorization-based learning. To address these issues, we introduce Iterative Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical graphs and reference answers. The student model (LLM) identifies local knowledge gaps by attempting to align its responses with these references, collaborating with helper models to generate diverse answers. These aligned responses are then used for iterative supervised fine-tuning (SFT). Our evaluations across five rule-based scenarios demonstrate IGP’s effectiveness, with a 73.12% alignment improvement in Claude Sonnet 3.5, and Llama3-8B-Instruct achieving an 86.20% improvement, outperforming Claude Sonnet 3.5 in rule-based alignment.

[NLP-10] Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies

Link: https://arxiv.org/abs/2408.16586
Authors: Zhiyang Qi, Michimasa Inaba
Keywords: natural language processing, large language models, enhanced dialogue systems, Recent advancements, significantly enhanced dialogue
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the AIWolfDial2024 workshop at INLG 2024

Abstract:Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces a LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.

[NLP-11] Predictability maximization and the origins of word order harmony

Link: https://arxiv.org/abs/2408.16570
Authors: Ramon Ferrer-i-Cancho
Keywords: information theoretic perspective, head, predictability, theoretic perspective, address the linguistic
Categories: Computation and Language (cs.CL); Physics and Society (physics.soc-ph); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:We address the linguistic problem of the sequential arrangement of a head and its dependents from an information theoretic perspective. In particular, we consider the optimal placement of a head that maximizes the predictability of the sequence. We assume that dependents are statistically independent given a head, in line with the open-choice principle and the core assumptions of dependency grammar. We demonstrate the optimality of harmonic order, i.e., placing the head last maximizes the predictability of the head whereas placing the head first maximizes the predictability of dependents. We also show that postponing the head is the optimal strategy to maximize its predictability while bringing it forward is the optimal strategy to maximize the predictability of dependents. We unravel the advantages of the strategy of maximizing the predictability of the head over maximizing the predictability of dependents. Our findings shed light on the placements of the head adopted by real languages or emerging in different kinds of experiments.
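"Predictability" here can be read as low conditional entropy. The following toy computation (our illustration, with an invented two-head, two-dependent lexicon, not the paper's formalism) shows that under the abstract's conditional-independence assumption, each additional dependent observed before the head reduces the remaining uncertainty about the head, which is why head-last placement maximizes head predictability:

```python
import itertools
import math

def entropy(dist):
    """Shannon entropy (bits) of a {outcome: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy lexicon: two head types, dependents drawn conditionally
# independently given the head (the open-choice assumption).
p_head = {"h1": 0.5, "h2": 0.5}
p_dep_given_head = {
    "h1": {"a": 0.9, "b": 0.1},
    "h2": {"a": 0.1, "b": 0.9},
}

def cond_entropy_head_given_deps(n_deps):
    """H(head | dep_1..dep_n): uncertainty left about a head placed last."""
    h_total = 0.0
    for deps in itertools.product("ab", repeat=n_deps):
        # Joint P(head, deps), then P(head | deps) by Bayes' rule.
        joint = {h: p_head[h] * math.prod(p_dep_given_head[h][d] for d in deps)
                 for h in p_head}
        p_deps = sum(joint.values())
        posterior = {h: j / p_deps for h, j in joint.items()}
        h_total += p_deps * entropy(posterior)
    return h_total

# Each extra dependent seen before the head makes the head more predictable.
assert cond_entropy_head_given_deps(2) < cond_entropy_head_given_deps(1) < 1.0
```

Conversely, placing the head first conditions every dependent on it, which is the harmonic-order trade-off the abstract describes.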

[NLP-12] SALSA: Speedy ASR-LLM Synchronous Aggregation INTERSPEECH2024

Link: https://arxiv.org/abs/2408.16542
Authors: Ashish Mittal, Darshan Prabhu, Sunita Sarawagi, Preethi Jyothi
Keywords: Harnessing pre-trained LLMs, Harnessing pre-trained, improve ASR systems, ASR systems, ASR
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2024

Abstract:Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.

[NLP-13] CNIMA: A Universal Evaluation Framework and Automated Approach for Assessing Second Language Dialogues

Link: https://arxiv.org/abs/2408.16518
Authors: Rena Gao, Jingxuan Wu, Carsten Roever, Xuetong Wu, Jing Wu, Long Lv, Jey Han Lau
Keywords: Non-Native Interactivity Measurement, Chinese Non-Native Interactivity, Measurement and Automation, Interactivity Measurement, develop CNIMA
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We develop CNIMA (Chinese Non-Native Interactivity Measurement and Automation), a Chinese-as-a-second-language labelled dataset with 10K dialogues. We annotate CNIMA using an evaluation framework – originally introduced for English-as-a-second-language dialogues – that assesses micro-level features (e.g.\ backchannels) and macro-level interactivity labels (e.g.\ topic management) and test the framework’s transferability from English to Chinese. We found the framework robust across languages and revealed universal and language-specific relationships between micro-level and macro-level features. Next, we propose an approach to automate the evaluation and find strong performance, creating a new tool for automated second language assessment. Our system can be adapted to other languages easily as it uses large language models and as such does not require large-scale annotated training data.

[NLP-14] LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Link: https://arxiv.org/abs/2408.16502
Authors: Jan Cegin, Jakub Simko, Peter Brusilovsky
Keywords: generative large language, data augmentation tasks, large language models, generative large, large language
Categories: Computation and Language (cs.CL)
Comments: 20 pages

Abstract:The generative large language models (LLMs) are increasingly being used for data augmentation tasks, where text samples are LLM-paraphrased and then used for classifier fine-tuning. However, a research that would confirm a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) is the LLM-based augmentation advantageous, we compared the effects of recent LLM augmentation methods with established ones on 6 datasets, 3 classifiers and 2 fine-tuning methods. We also varied the number of seeds and collected samples to better explore the downstream model accuracy space. Finally, we performed a cost-benefit analysis and show that LLM-based methods are worthy of deployment only when very small number of seeds is used. Moreover, in many cases, established methods lead to similar or better model accuracies.

[NLP-15] Learning from Negative Samples in Generative Biomedical Entity Linking

Link: https://arxiv.org/abs/2408.16493
Authors: Chanhwi Kim, Hyunjae Kim, Sihyeon Park, Jiwoo Lee, Mujeen Sung, Jaewoo Kang
Keywords: biomedical entity linking, efficient memory usage, Generative Biomedical Entity, entity linking, biomedical entity
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Generative models have become widely used in biomedical entity linking (BioEL) due to their excellent performance and efficient memory usage. However, these models are usually trained only with positive samples–entities that match the input mention’s identifier–and do not explicitly learn from hard negative samples, which are entities that look similar but have different meanings. To address this limitation, we introduce ANGEL (Learning from Negative Samples in Generative Biomedical Entity Linking), the first framework that trains generative BioEL models using negative samples. Specifically, a generative model is initially trained to generate positive samples from the knowledge base for given input entities. Subsequently, both correct and incorrect outputs are gathered from the model’s top-k predictions. The model is then updated to prioritize the correct predictions through direct preference optimization. Our models fine-tuned with ANGEL outperform the previous best baseline models by up to an average top-1 accuracy of 1.4% on five benchmarks. When incorporating our framework into pre-training, the performance improvement further increases to 1.7%, demonstrating its effectiveness in both the pre-training and fine-tuning stages. Our code is available at this https URL.

[NLP-16] Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning
[NLP-16] 自我一致:通过上下文学习改善法学硕士文化价值观的一致性

链接: https://arxiv.org/abs/2408.16482
作者: Rochelle Choenni,Ekaterina Shutova
关键词-EN: increasingly important topic, Large Language Models, Large Language, important topic, increasingly important
关键词-ZH: 越来越重要的主题,大型语言模型,大型语言,重要的主题,越来越重要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Improving the alignment of Large Language Models (LLMs) with respect to the cultural values that they encode has become an increasingly important topic. In this work, we study whether we can exploit existing knowledge about cultural values at inference time to adjust model responses to cultural value probes. We present a simple and inexpensive method that uses a combination of in-context learning (ICL) and human survey data, and show that we can improve the alignment to cultural values across 5 models that include both English-centric and multilingual LLMs. Importantly, we show that our method could prove useful in test languages other than English and can improve alignment to the cultural values that correspond to a range of culturally diverse countries.
摘要:改善大型语言模型(LLM)与其编码的文化价值观的一致性已成为一个越来越重要的话题。在这项工作中,我们研究是否可以在推理时利用有关文化价值的现有知识来调整模型对文化价值调查的反应。我们提出了一种简单且廉价的方法,该方法结合了上下文学习(ICL)和人类调查数据,并表明我们可以改善5个模型与文化价值观的一致性,这些模型包括以英语为中心和多语言LLM。重要的是,我们表明,我们的方法在英语以外的测试语言中可能很有用,并且可以改善与一系列文化多元化国家相对应的文化价值观的一致性。
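摘要所述"在推理时结合上下文学习(ICL)与人类调查数据"的做法,大致可示意为把(调查问题, 某文化群体的真实回答)对拼成 few-shot 提示。以下为示意草图,提示模板与示例问题均为假设,并非论文原文:

```python
def build_icl_prompt(survey_pairs, probe):
    """把人类调查的 (问题, 回答) 对作为上下文示例,拼接在新的文化价值观探针之前。"""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in survey_pairs)
    return f"{demos}\n\nQ: {probe}\nA:"

# 假设的调查条目,仅演示提示的组装方式
prompt = build_icl_prompt(
    [("Is obedience an important quality for children?", "Rather important")],
    "Is hard work an important quality for children?",
)
```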

[NLP-17] Is text normalization relevant for classifying medieval charters?
[NLP-17] 文本规范化与中世纪宪章分类相关吗?

链接: https://arxiv.org/abs/2408.16446
作者: Florian Atzenhofer-Baumgartner,Tamás Kovács
关键词-EN: Middle High German, High German charters, specifically focusing, study examines, examines the impact
关键词-ZH: 中部高地德语、高地德语宪章,特别关注,研究审查,审查影响
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: This preprint has not undergone peer review or any post-submission improvements or corrections

点击查看摘要

Abstract:This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
摘要:本研究探讨了历史文本规范化对中世纪宪章分类的影响,特别关注文书的断代(年代判定)与定位。我们使用来自数字档案馆的中古高地德语宪章数据集,在进行与不进行规范化两种条件下评估了多种分类器,包括传统模型和基于Transformer的模型。结果表明,给定的规范化对定位任务的改进微乎其微,却降低了断代的准确性,这意味着原始文本包含规范化可能会掩盖的关键特征。我们发现支持向量机和梯度提升优于其他模型,这使人质疑Transformer在该用例中的效率。结果建议对历史文本规范化采取选择性方法,强调保留某些对文档分析中的分类任务至关重要的文本特征。

[NLP-18] SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
[NLP-18] SurveySum:用于将多篇科学文章汇总到调查部分的数据集

链接: https://arxiv.org/abs/2408.16444
作者: Leandro Carísio Fernandes,Gustavo Bartz Guedes,Thiago Soares Laitz,Thales Sales Almeida,Rodrigo Nogueira,Roberto Lotufo,Jayr Pereira
关键词-EN: Document summarization, task to shorten, shorten texts, texts into concise, concise and informative
关键词-ZH: 文件摘要,任务要缩短,将文本缩短,文本变得简洁、简洁、信息丰富
类目: Computation and Language (cs.CL)
备注: 15 pages, 6 figures, 1 table. Submitted to BRACIS 2024

点击查看摘要

Abstract:Document summarization is a task to shorten texts into concise and informative summaries. This paper introduces a novel dataset designed for summarizing multiple scientific articles into a section of a survey. Our contributions are: (1) SurveySum, a new dataset addressing the gap in domain-specific summarization tools; (2) two specific pipelines to summarize scientific articles into a section of a survey; and (3) the evaluation of these pipelines using multiple metrics to compare their performance. Our results highlight the importance of high-quality retrieval stages and the impact of different configurations on the quality of generated summaries.
摘要:文档摘要是将文本缩短为简洁且信息丰富的摘要的任务。本文介绍了一种新颖的数据集,旨在将多篇科学文章汇总到调查的一部分中。我们的贡献是:(1)SurveySum,一个新的数据集,解决了特定领域摘要工具的差距;(2)两个特定的管道,将科学文章总结到调查的一部分中;(3)使用多个指标对这些管道进行评估,以比较其性能。我们的结果强调了高质量检索阶段的重要性以及不同配置对生成摘要质量的影响。

[NLP-19] Instruction-tuned Large Language Models for Machine Translation in the Medical Domain
[NLP-19] 用于医疗领域机器翻译的指令调整大型语言模型

链接: https://arxiv.org/abs/2408.16440
作者: Miguel Rios
关键词-EN: Large Language Models, high resource language, resource language pairs, Large Language, shown promising results
关键词-ZH: 大型语言模型、高资源语言、资源语言对、大型语言显示出有希望的结果
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics.
摘要:大型语言模型(LLM)在高资源语言对和领域的机器翻译上已显示出令人鼓舞的结果。然而,在专业领域(例如医疗),LLM的表现低于标准神经机器翻译模型。术语翻译的一致性对专业领域的用户、研究人员和译者都至关重要。在这项研究中,我们比较了医疗领域中基线LLM与指令微调LLM的性能。此外,我们还将专业医学词典中的术语引入指令格式的数据集,用于微调LLM。在自动评测指标上,指令微调的LLM显著优于基线模型。
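摘要提到"将专业医学词典中的术语引入指令格式的数据集",一种常见做法是在翻译指令中附加命中的术语表条目。以下是示意草图:指令模板、目标语言(此处假设为西班牙语)与术语表条目均为假设,并非论文的实际模板:

```python
def add_terminology(source, glossary):
    """在翻译指令中附加源句命中的术语表条目(模板为示意)。"""
    hits = {s: t for s, t in glossary.items() if s in source.lower()}
    terms = "; ".join(f"{s} -> {t}" for s, t in sorted(hits.items()))
    return (f"Translate the following medical text to Spanish.\n"
            f"Terminology: {terms}\nSource: {source}")

glossary = {"myocardial infarction": "infarto de miocardio"}
prompt = add_terminology("Patient presented with myocardial infarction.", glossary)
```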

[NLP-20] MQM-Chat: Multidimensional Quality Metrics for Chat Translation
[NLP-20] MQM-Chat:聊天翻译的多维质量指标

链接: https://arxiv.org/abs/2408.16390
作者: Yunmeng Li,Jun Suzuki,Makoto Morishita,Kaori Abe,Kentaro Inui
关键词-EN: pose significant challenges, chats pose significant, Multidimensional Quality Metrics, machine translation models, chat translation
关键词-ZH: 构成重大挑战,聊天构成重大,多维质量收件箱,机器翻译模型,聊天翻译
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The complexities of chats pose significant challenges for machine translation models. Recognizing the need for a precise evaluation metric to address the issues of chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat). Through the experiments of five models using MQM-Chat, we observed that all models generated certain fundamental errors, while each of them has different shortcomings, such as omission, overly correcting ambiguous source content, and buzzword issues, resulting in the loss of stylized information. Our findings underscore the effectiveness of MQM-Chat in evaluating chat translation, emphasizing the importance of stylized content and dialogue consistency for future studies.
摘要:聊天的复杂性给机器翻译模型带来了重大挑战。鉴于需要一个精确的评估指标来解决聊天翻译问题,本研究引入了聊天翻译多维质量指标(MQM-Chat)。通过使用MQM-Chat对五个模型进行实验,我们观察到所有模型都会产生某些根本性错误,而每个模型又各有不同的缺点,例如遗漏、过度纠正模糊的源内容、流行语问题,导致风格化信息的丢失。我们的研究结果强调了MQM-Chat在评估聊天翻译方面的有效性,并强调了风格化内容和对话一致性对未来研究的重要性。

[NLP-21] The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization
[NLP-21] 核采样在缓解文本记忆方面的不合理无效性

链接: https://arxiv.org/abs/2408.16345
作者: Luka Borec,Philipp Sadler,David Schlangen
关键词-EN: large language models, text memorization behavior, nucleus sampling, work analyses, behavior of large
关键词-ZH: 大型语言模型、文本记忆行为、核心抽样、工作分析、大型行为
类目: Computation and Language (cs.CL)
备注: 9 pages, Accepted at INLG 2024 (International Natural Language Generation Conference)

点击查看摘要

Abstract:This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorization of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset interestingly shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in “hard” memorization – a verbatim reproduction of training samples – they may still display “soft” memorization whereby they generate outputs that echo the training data but without a complete one-by-one resemblance.
摘要:本文分析了大语言模型(LLM)在核采样条件下的文本记忆行为。核采样这类随机解码方法通常用来克服基于最大化的解码技术中常见的单调、重复的文本生成问题。我们假设,核采样也可能减少记忆模式的出现,因为它可能导致选中记忆序列之外的词元。为了验证这一假设,我们构建了一个重复分布已知的诊断数据集,使我们能够在一定程度上控制训练数据中某些部分被记忆的可能性。我们对两个在该数据集上微调的GPT-Neo模型的分析有趣地表明:(一)增大核的大小仅能适度减少记忆;(二)即使模型不进行"硬"记忆(对训练样本的逐字复现),它们仍可能表现出"软"记忆,即生成与训练数据相呼应、但并非完全逐一对应的输出。
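摘要讨论的核采样(top-p 采样)本身可以用几行代码实现。下面是一个纯 Python 示意,与论文的 GPT-Neo 实验设置无关,仅展示"截断到累计概率首次达到 p 的最小词元集合,再重归一化采样"的定义:

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=None):
    """从累计概率首次达到 p 的最小词元集合(nucleus)中采样一个词元 id。"""
    rng = rng or random.Random(0)
    # softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 按概率从高到低排序,累加到 p 为止
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break  # 截断:落在 nucleus 之外的词元概率归零
    # 在 nucleus 内重归一化并采样
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

论文的假设正对应这里的截断步骤:若被记忆序列的下一个词元恰好落在 nucleus 之外,逐字复现就会被打断。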

[NLP-22] Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic
[NLP-22] Critic-CoT:通过思想链提升大型语言模型的推理能力Critic

链接: https://arxiv.org/abs/2408.16326
作者: Xin Zheng,Jie Lou,Boxi Cao,Xueru Wen,Yuqiu Ji,Hongyu Lin,Yaojie Lu,Xianpei Han,Debing Zhang,Le Sun
关键词-EN: important mechanism, mechanism for enhancing, http URL address, Self-critic, CoT reasoning format
关键词-ZH: 重要机制、增强机制、http URL地址、自我批评、CoT推理格式
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Self-critic has become an important mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts without further training, which tend to be over-simplified, leading to limited accuracy. Moreover, there is a lack of in-depth investigation of the relationship between LLM’s ability to critique and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability, via step-wise CoT reasoning format and distant-supervision data construction, without the need for human annotation. Experiments on GSM8K and MATH show that via filtering out invalid solutions or iterative refinement, our enhanced model boosts task-solving performance, which demonstrates the effectiveness of our method. Further, we find that training on critique and refinement alone improves the generation. We hope our work could shed light on future research on improving the reasoning and critic ability of LLMs.
摘要:自我批评已成为提高LLM推理能力的一种重要机制。然而,目前的方法主要涉及未经进一步训练的基本提示,往往过于简化,导致准确率有限。此外,学界也缺乏对LLM的批评能力与其任务解决性能之间关系的深入研究。针对这些问题,我们提出了Critic-CoT框架,通过逐步的CoT推理格式和远程监督数据构建,将LLM推向类System-2的批评能力,而无需人工标注。在GSM8K和MATH上的实验表明,通过过滤无效解或迭代求精,增强后的模型提高了任务求解性能,证明了该方法的有效性。此外,我们发现,仅在批评和求精上进行训练就能改善生成。我们希望这项工作能为未来提高LLM推理和批评能力的研究提供启示。

[NLP-23] Physics of Language Models: Part 2.2 How to Learn From Mistakes on Grade-School Math Problems
[NLP-23] 语言模型物理学:第2.2部分如何从小学数学问题的错误中学习

链接: https://arxiv.org/abs/2408.16293
作者: Tian Ye,Zicheng Xu,Yuanzhi Li,Zeyuan Allen-Zhu
关键词-EN: demonstrated remarkable performance, solving reasoning tasks, occasionally make reasoning, make reasoning mistakes, Language models
关键词-ZH: 表现出色,解决推理任务,偶尔进行推理,犯推理错误,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2407.20311

点击查看摘要

Abstract:Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to “self-correct” their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating “error-correction” data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
摘要:语言模型在解决推理问题时表现出了显著的性能,但即使是最强的语言模型也偶尔会出现推理错误。最近,有一些积极的研究旨在提高推理的准确性,特别是通过使用预先训练的语言模型通过多轮提示来自我纠正错误。在本文中,我们遵循这一工作路线,但重点是理解将“纠错”数据直接纳入预培训阶段的有用性。这些数据包括错误的解决步骤,紧跟其后的是它们的更正。使用合成的数学数据集,我们得到了令人满意的结果:与在相同数量的无错误数据上进行预训练相比,这种类型的预训练数据可以帮助语言模型直接获得更高的推理精度(即通过简单的自动回归,而不需要多轮提示)。我们还深入研究了许多细节,例如(1)这种方法与波束搜索有何不同,(2)如何准备这种数据,(3)是否需要对错误的标记进行掩蔽,(4)所需的误差量,(5)这种数据是否可以推迟到微调阶段,等等。
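摘要所述"错误步骤后紧跟改正"的预训练样本,其构造可示意如下。[BACK] 撤回标记与具体文本格式均为假设,论文的真实格式以原文为准:

```python
def make_retry_sample(problem, steps, wrong_step, wrong_text, back_token="[BACK]"):
    """在第 wrong_step 步之前插入一条错误步骤并紧跟撤回标记,
    构造"错误 + 立即改正"式的预训练样本(格式为示意)。"""
    lines = [problem]
    for i, step in enumerate(steps):
        if i == wrong_step:
            lines.append(wrong_text)
            lines.append(back_token)
        lines.append(step)
    return "\n".join(lines)

sample = make_retry_sample(
    "Q: 3 apples + 4 apples = ?",
    ["3 + 4 = 7", "Answer: 7"],
    wrong_step=0,
    wrong_text="3 + 4 = 8",
)
```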

[NLP-24] Measuring the Accuracy of Automatic Speech Recognition Solutions
[NLP-24] 衡量自动语音识别解决方案的准确性

链接: https://arxiv.org/abs/2408.16287
作者: Korbinian Kuhn,Verena Kersken,Benedikt Reuter,Niklas Egger,Gottfried Zimmermann
关键词-EN: essential accessibility tool, Automatic Speech Recognition, Deaf and hard, hard of hearing, accessibility tool
关键词-ZH: 基本的无障碍工具、自动语音识别、聋哑人和重听、无障碍工具
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:For d/Deaf and hard of hearing (DHH) people, captioning is an essential accessibility tool. Significant developments in artificial intelligence (AI) mean that Automatic Speech Recognition (ASR) is now a part of many popular applications. This makes creating captions easy and broadly available - but transcription needs high levels of accuracy to be accessible. Scientific publications and industry report very low error rates, claiming AI has reached human parity or even outperforms manual transcription. At the same time the DHH community reports serious issues with the accuracy and reliability of ASR. There seems to be a mismatch between technical innovations and the real-life experience for people who depend on transcription. Independent and comprehensive data is needed to capture the state of ASR. We measured the performance of eleven common ASR services with recordings of Higher Education lectures. We evaluated the influence of technical conditions like streaming, the use of vocabularies, and differences between languages. Our results show that accuracy ranges widely between vendors and for the individual audio samples. We also measured a significant lower quality for streaming ASR, which is used for live events. Our study shows that despite the recent improvements of ASR, common services lack reliability in accuracy.
摘要:对于聋人和重听人(DHH)来说,字幕是一种必不可少的无障碍工具。人工智能(AI)的重大进展意味着自动语音识别(ASR)已成为许多流行应用的一部分。这使得字幕的制作变得容易且随处可得,但转录需要很高的准确率才称得上无障碍。科学出版物和业界报告的错误率非常低,声称AI已经达到与人类相当的水平,甚至超过人工转录。与此同时,DHH群体却报告ASR在准确性和可靠性方面存在严重问题。对于依赖转录的人来说,技术创新与现实体验之间似乎存在落差。因此需要独立而全面的数据来刻画ASR的现状。我们使用高等教育讲座的录音测量了11种常见ASR服务的性能,并评估了流式识别等技术条件、词表的使用以及语言之间差异的影响。结果表明,不同供应商之间以及不同音频样本之间的准确率差别很大。我们还测得用于现场活动的流式ASR质量明显更低。我们的研究表明,尽管ASR近来有所改进,常见服务在准确性上仍缺乏可靠性。
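衡量 ASR 准确率的标准指标是词错误率(WER),即词级编辑距离除以参考文本词数。论文具体的文本归一化与对齐流程未在摘要中给出,下面只是该指标通用定义的纯 Python 实现:

```python
def word_error_rate(reference, hypothesis):
    """WER = (替换数 + 删除数 + 插入数) / 参考词数,用动态规划求词级编辑距离。"""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]:ref 前 i 个词与 hyp 前 j 个词之间的编辑距离
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # 删除
                           dp[i][j - 1] + 1,          # 插入
                           dp[i - 1][j - 1] + cost)   # 替换/匹配
    return dp[len(ref)][len(hyp)] / len(ref)
```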

[NLP-25] Enhancing AI-Driven Psychological Consultation: Layered Prompts with Large Language Models
[NLP-25] 增强人工智能驱动的心理咨询:具有大型语言模型的分层预算

链接: https://arxiv.org/abs/2408.16276
作者: Rafael Souza,Jia-Hao Lim,Alexander Davis
关键词-EN: scalability issues limit, limit its accessibility, Psychological consultation, essential for improving, shortage of qualified
关键词-ZH: 可扩展性问题限制,限制其可及性,心理咨询,改进必不可少,合格人才短缺
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Psychological consultation is essential for improving mental health and well-being, yet challenges such as the shortage of qualified professionals and scalability issues limit its accessibility. To address these challenges, we explore the use of large language models (LLMs) like GPT-4 to augment psychological consultation services. Our approach introduces a novel layered prompting system that dynamically adapts to user input, enabling comprehensive and relevant information gathering. We also develop empathy-driven and scenario-based prompts to enhance the LLM’s emotional intelligence and contextual understanding in therapeutic settings. We validated our approach through experiments using a newly collected dataset of psychological consultation dialogues, demonstrating significant improvements in response quality. The results highlight the potential of our prompt engineering techniques to enhance AI-driven psychological consultation, offering a scalable and accessible solution to meet the growing demand for mental health support.
摘要:心理咨询对于提高心理健康和幸福感至关重要,但缺乏合格的专业人员和可扩展性问题等挑战限制了心理咨询的可及性。为了应对这些挑战,我们探索使用像GPT-4这样的大型语言模型(LLM)来增强心理咨询服务。我们的方法引入了一种新颖的分层提示系统,该系统动态适应用户的输入,使全面和相关的信息收集成为可能。我们还开发了同理心驱动的和基于情景的提示,以提高LLM在治疗环境中的情商和上下文理解。我们通过使用新收集的心理咨询对话数据集的实验来验证我们的方法,显示出响应质量的显著改善。这些结果突显了我们的即时工程技术在增强人工智能驱动的心理咨询方面的潜力,提供了一种可扩展和可访问的解决方案,以满足日益增长的心理健康支持需求。

[NLP-26] LoraMap: Harnessing the Power of LoRA Connections
[NLP-26] LoraMap:利用LoRA连接的力量

链接: https://arxiv.org/abs/2408.16264
作者: Hyeryun Park,Jeongwon Kwak,Dongsuk Jang,Sumin Park,Jinwook Choi
关键词-EN: Large Language Models, Large Language, Language Models, overcoming substantial computational, substantial computational overhead
关键词-ZH: 大型语言模型,大型语言,语言模型,克服大量计算,大量计算负担
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Large Language Models (LLMs) can benefit from mitigating hallucinations through fact-checking and overcoming substantial computational overhead with parameter-efficient techniques such as Low-Rank Adaptation (LoRA). While some studies have explored the parallel integration of multiple LoRAs, these approaches need attention to the connections between them. This paper investigates methods to establish connections among multiple LoRAs. We create three reasoning datasets tailored to fact-checking and fine-tune individual LoRAs, allowing them to view and reason from diverse perspectives. Then, we explore strategies for allocating these reasoning LoRAs and introduce LoraMap, an approach to map connections between them. The results on the fact-checking task demonstrate that the performance of LoraMap is superior to LoraHub, an existing LoRA composition method. LoraMap also outperforms with significantly fewer parameters than LoraConcat, which concatenates LoRAs and further fine-tunes them.
摘要:大型语言模型(LLM)可以通过事实核查来减少幻觉,并借助低秩自适应(LoRA)等参数高效技术克服巨大的计算开销。虽然已有研究探索了多个LoRA的并行整合,但这些方法需要关注它们之间的联系。本文研究在多个LoRA之间建立联系的方法。我们创建了三个为事实核查量身定做的推理数据集,并分别微调各个LoRA,使它们能够从不同角度进行观察和推理。然后,我们探索了分配这些推理LoRA的策略,并提出了LoraMap,一种在它们之间映射联系的方法。事实核查任务的结果表明,LoraMap的性能优于现有的LoRA组合方法LoraHub;与把多个LoRA拼接后再进一步微调的LoraConcat相比,LoraMap也以显著更少的参数取得了更好的表现。
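摘要中"并联多个 LoRA"的基本形式,可以示意为在冻结的基础权重上叠加若干低秩增量(y = x·(W + Σ BᵢAᵢ))。以下是纯 Python 玩具实现;LoraMap 学习连接的具体方式摘要未给出,此处不涉及:

```python
def apply_loras(x, base, loras, scale=1.0):
    """对单个输入向量应用冻结基础权重加多个低秩适配器:
    y = base·x + scale * Σ Bi·(Ai·x)。base 为 d_out×d_in 矩阵,
    每个适配器是 (Ai, Bi),Ai 为 r×d_in,Bi 为 d_out×r。"""
    def matvec(m, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]
    y = matvec(base, x)
    for a, b in loras:
        delta = matvec(b, matvec(a, x))   # 低秩增量 Bi·Ai·x
        y = [yi + scale * di for yi, di in zip(y, delta)]
    return y
```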

[NLP-27] Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers
[NLP-27] 充分利用你的模型:微调和应用预训练变形金刚的方法

链接: https://arxiv.org/abs/2408.16241
作者: Davis Yoshida
关键词-EN: make progress, transformer decoder, models, transformer, language models
关键词-ZH: 取得进展,Transformer解码器,模型,Transformer,语言模型
类目: Computation and Language (cs.CL)
备注: PhD thesis

点击查看摘要

Abstract:This thesis provides methods and analysis of models which make progress on this goal. The techniques outlined are task agnostic, and should provide benefit when used with nearly any transformer LM. We introduce two new finetuning methods which add new capabilities to the models they are used on. The first adds a recurrence mechanism, which removes the fixed-window sized constraint and improves the efficiency of a transformer decoder. The second allows masked language models (MLMs) to be used for initialization of both the encoder and decoder of a non-autoregressive sequence-to-sequence transformer, opening up generative applications of models which were previously only used for natural language understanding tasks. We also introduce two new techniques for improving the quality of predictions of any transformer decoder without additional finetuning. One, hidden state optimization, can be applied to any transformer decoder to improve the quality of predictions at inference time, especially for few-shot classification. The other, conditional beam search, allows practitioners to search for natural language generation (NLG) model outputs with high likelihood while conditioning on the event that the output is not degenerate (e.g. empty, repetitive, etc.). Finally, we provide theoretical and empirical insights on the divergence of model-likelihood and output quality which has widely been observed in prior work. These insights apply to any model which represents a distribution over text, and apply to language models which are not transformers or even autoregressive. We argue that the NLP community has, to some extent, misunderstood the implications of these findings, and encourage a point of view which has more nuance. 
摘要:本文提供了在这一目标上取得进展的方法与模型分析。所述技术与任务无关,几乎可用于任何Transformer语言模型并带来收益。我们引入了两种新的微调方法,为所应用的模型增添新能力:第一种加入递归机制,消除了固定窗口大小的限制,提高了Transformer解码器的效率;第二种允许用掩码语言模型(MLM)初始化非自回归序列到序列Transformer的编码器和解码器,为此前仅用于自然语言理解任务的模型开辟了生成式应用。我们还介绍了两种无需额外微调即可提高任何Transformer解码器预测质量的新技术:一种是隐状态优化,可应用于任何Transformer解码器以提高推理时的预测质量,尤其适用于少样本分类;另一种是条件束搜索,允许实践者在以"输出不退化(如为空、重复等)"为条件的前提下,搜索高似然的自然语言生成(NLG)模型输出。最后,我们就先前工作中广泛观察到的模型似然与输出质量之间的背离给出理论与经验上的见解。这些见解适用于任何表示文本分布的模型,也适用于非Transformer甚至非自回归的语言模型。我们认为,NLP社区在某种程度上误解了这些发现的含义,并倡导一种更细致的观点。

[NLP-28] M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
[NLP-28] M4 CXR:探索胸部X射线解释的多模式大型语言模型的多任务潜力

链接: https://arxiv.org/abs/2408.16213
作者: Jonggwon Park,Soobum Kim,Byungmu Yoon,Jihun Hyun,Kyoyun Choi
关键词-EN: large language models, including healthcare, artificial intelligence, impacted various domains, rapid evolution
关键词-ZH: 包括医疗保健、人工智能在内的大型语言模型影响了各个领域,快速发展
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR’s versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.
摘要:人工智能,尤其是大型语言模型(LLM)的快速发展,已对包括医疗保健在内的各个领域产生重大影响。在胸部X光(CXR)分析中,以往研究虽然使用了LLM,但存在局限:要么没有充分利用LLM的多任务能力,要么缺乏临床准确性。本文介绍了M4CXR,一种旨在增强CXR解读的多模态LLM。该模型在一个以对话格式整合多种任务数据集的视觉指令跟随数据集上训练,因此支持医疗报告生成(MRG)、视觉定位和视觉问答(VQA)等多种任务。M4CXR采用思维链提示策略,先识别CXR图像中的发现,再据此生成相应报告,从而在MRG上达到最先进的临床准确性。根据可用输入的不同(如单图、多图和多检查场景),该模型可适应多种MRG情境。除MRG外,M4CXR在视觉定位上达到了与专用模型相当的水平,在VQA上也表现突出。定量和定性评估均显示,M4CXR在MRG、视觉定位和VQA上兼具多功能性,同时始终保持临床准确性。

[NLP-29] From cart to truck: meaning shift through words in English in the last two centuries
[NLP-29] 从手推车到卡车:过去两个世纪英语单词的含义转变

链接: https://arxiv.org/abs/2408.16209
作者: Esteban Rodríguez Betancourt,Edgar Casasola Murillo
关键词-EN: historical word data, diachronic word embeddings, concepts over time, onomasiological study, diachronic word
关键词-ZH: 历史词数据、历时词嵌入、概念随时间的变化、专名学研究、历时词
类目: Computation and Language (cs.CL)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:This onomasiological study uses diachronic word embeddings to explore how different words represented the same concepts over time, using historical word data from 1800 to 2000. We identify shifts in energy, transport, entertainment, and computing domains, revealing connections between language and societal changes. Our approach consisted in using diachronic word embeddings trained using word2vec with skipgram and aligning them using orthogonal Procrustes. We discuss possible difficulties linked to the relationships the method identifies. Moreover, we look at the ethical aspects of interpreting results, highlighting the need for expert insights to understand the method’s significance.
摘要:本项专名学研究利用1800年至2000年的历史词语数据,采用历时词嵌入的方法,探讨不同词语如何随时间推移表示相同的概念。我们发现了能源、交通、娱乐和计算领域中的变化,揭示了语言与社会变迁之间的联系。我们的方法是使用word2vec(skip-gram)训练历时词嵌入,并用正交Procrustes将它们对齐。我们讨论了与该方法所识别的关系相关的潜在困难。此外,我们还探讨了结果解读的伦理层面,强调需要专家洞见来理解该方法的意义。
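文中"用正交 Procrustes 对齐不同时期的词嵌入"这一步有经典闭式解(Schönemann 解)。下面用 numpy 给出一个玩具示意:数据是随机构造的,把"1800 年"的嵌入旋转一个已知角度当作"2000 年"的嵌入,仅演示对齐能恢复出该旋转,并非论文所用语料:

```python
import numpy as np

def procrustes_align(source, target):
    """求正交矩阵 W,使 ||source @ W - target||_F 最小(闭式解:SVD)。"""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
emb_1800 = rng.normal(size=(100, 50))   # 某一时期的 100 个词向量(50 维)
rot = np.eye(50)                        # 在前两维上旋转 0.3 弧度
theta = 0.3
rot[0, 0] = rot[1, 1] = np.cos(theta)
rot[0, 1], rot[1, 0] = -np.sin(theta), np.sin(theta)
emb_2000 = emb_1800 @ rot               # 同一空间,仅做了旋转
w = procrustes_align(emb_1800, emb_2000)  # 应恢复出 rot
```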

[NLP-30] ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics
[NLP-30] ReXamine-Global:揭示放射学报告生成指标中不一致之处的框架

链接: https://arxiv.org/abs/2408.16208
作者: Oishi Banerjee,Agustina Saenz,Kay Wu,Warren Clements,Adil Zia,Dominic Buensalido,Helen Kavnoudias,Alain S. Abi-Ghanem,Nour El Ghawi,Cibele Luna,Patricia Castillo,Khaled Al-Surimi,Rayyan A. Daghistani,Yuh-Min Chen,Heng-sheng Chao,Lars Heiliger,Moon Kim,Johannes Haubold,Frederic Jonske,Pranav Rajpurkar
关键词-EN: rapidly expanding capabilities, rapidly expanding, expanding capabilities, capabilities of generative, generative AI models
关键词-ZH: 快速扩展的能力,快速扩展,扩展的能力,生成的、生成的人工智能模型的能力
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, a LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.
摘要:鉴于面向放射学的生成式人工智能模型能力迅速扩展,需要稳健的指标来准确衡量不同医院中AI生成的放射学报告的质量。我们开发了ReXamine-Global,一个由LLM驱动的多站点框架,可在不同写作风格和患者群体上测试指标,暴露其泛化方面的差距。首先,我们的方法检验一个指标是否对报告风格过度敏感,即它给出的分数是否取决于AI生成的报告在风格上是否与真实报告相似。其次,我们的方法衡量指标是否与专家可靠地一致,或者在某些站点上,指标与专家对AI生成报告质量的评分是否出现分歧。使用来自全球6家医院的240份报告,我们将ReXamine-Global应用于7个已确立的报告评估指标,发现它们在可泛化性方面存在严重差距。开发者在设计新的报告评估指标时可以应用ReXamine-Global,确保其跨站点的稳健性。此外,我们对现有指标的分析可以引导这些指标的使用者,在其关注的站点上采用可靠有效的评估流程。

[NLP-31] FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
[NLP-31] FRACTURED-SORRY-Bench:在SORRY-Bench之上揭示多轮对话攻击如何削弱拒绝效果与防御的框架

链接: https://arxiv.org/abs/2408.16163
作者: Aman Priyanshu,Supriti Vijay
关键词-EN: Large Language Models, Large Language, Language Models, paper introduces, framework for evaluating
关键词-ZH: 大型语言模型、大型语言、语言模型、论文介绍、评估框架
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 tables

点击查看摘要

Abstract:This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions. Our approach achieves a maximum increase of +46.22% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods. We demonstrate that this technique poses a challenge to current LLM safety measures and highlights the need for more robust defenses against subtle, multi-turn attacks.
摘要:本文介绍了FRACTURED-SORRY-Bench,一个用于评估大型语言模型(LLM)抵御多轮对话攻击安全性的框架。基于SORRY-Bench数据集,我们提出了一种简单而有效的方法:将有害查询分解为看似无害的子问题来生成对抗性提示。与基线方法相比,我们的方法在GPT-4、GPT-4o、GPT-4o-mini和GPT-3.5-Turbo模型上使攻击成功率(ASR)最多提高46.22个百分点。我们证明这种技术对当前的LLM安全措施构成挑战,并强调需要针对这类隐蔽的多轮攻击建立更稳健的防御。
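摘要中的攻击成功率(ASR)及其提升量的计算口径本身很简单,可示意如下;逐条判定结果为虚构数据,仅演示指标的算法,并非论文的实验数值:

```python
def attack_success_rate(judgments):
    """被判定为诱导出有害回复的对抗提示所占百分比。judgments: 0/1 列表。"""
    return 100.0 * sum(judgments) / len(judgments)

# 虚构的逐条判定(1 = 越狱成功),仅用于演示
baseline  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 整条有害查询直接提问
fractured = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 分解为多轮子问题后
delta = attack_success_rate(fractured) - attack_success_rate(baseline)
```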

[NLP-32] Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark
[NLP-32] 评估角色的计算表示:奥斯汀角色相似性基准

链接: https://arxiv.org/abs/2408.16131
作者: Funing Yang,Carolyn Jane Anderson
关键词-EN: English literature, analysis of English, aid computational analysis, developed to extract, extract information
关键词-ZH: 英语文献,英语分析,辅助计算分析,开发来提取,提取信息
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Several systems have been developed to extract information about characters to aid computational analysis of English literature. We propose character similarity grouping as a holistic evaluation task for these pipelines. We present AustenAlike, a benchmark suite of character similarities in Jane Austen’s novels. Our benchmark draws on three notions of character similarity: a structurally defined notion of similarity; a socially defined notion of similarity; and an expert defined set extracted from literary criticism. We use AustenAlike to evaluate character features extracted using two pipelines, BookNLP and FanfictionNLP. We build character representations from four kinds of features and compare them to the three AustenAlike benchmarks and to GPT-4 similarity rankings. We find that though computational representations capture some broad similarities based on shared social and narrative roles, the expert pairings in our third benchmark are challenging for all systems, highlighting the subtler aspects of similarity noted by human readers.
摘要:已经有多个系统被开发出来提取人物信息,以辅助英语文学的计算分析。我们建议将人物相似度分组作为这些流水线的整体评估任务。我们提出了AustenAlike,一个基于简·奥斯汀小说人物相似度的基准套件。该基准借鉴三种人物相似性概念:结构上定义的相似性、社会意义上定义的相似性,以及从文学批评中提取的专家定义集合。我们使用AustenAlike评估由BookNLP和FanfictionNLP两条流水线提取的人物特征。我们基于四类特征构建人物表示,并将其与三个AustenAlike基准以及GPT-4相似度排名进行比较。我们发现,尽管计算表示能基于共同的社会与叙事角色捕捉到一些宽泛的相似性,但第三个基准中的专家配对对所有系统都具有挑战性,凸显了人类读者所注意到的更微妙的相似性层面。
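按相似度为人物排序这类基准所依赖的核心运算是向量相似度比较,可用余弦相似度示意如下。人物名与向量均为虚构示例,并非论文所用的四类特征表示:

```python
import math

def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(name, reps):
    """按与 name 的余弦相似度从高到低排列其余人物。"""
    return sorted((c for c in reps if c != name),
                  key=lambda c: -cosine(reps[name], reps[c]))

reps = {  # 人物向量为虚构示意
    "Elizabeth Bennet": [1.0, 0.0],
    "Mr. Darcy": [0.9, 0.1],
    "Mr. Collins": [0.0, 1.0],
}
ranking = most_similar("Elizabeth Bennet", reps)
```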

[NLP-33] Structured Event Reasoning with Large Language Models
[NLP-33] 使用大型语言模型的结构化事件推理

链接: https://arxiv.org/abs/2408.16098
作者: Li Zhang
关键词-EN: unifying challenge, profound utility, fallacy in high-stake, high-stake applications, LLMs
关键词-ZH: 统一的挑战、深刻的效用、高风险应用程序、LLM中的谬误
类目: Computation and Language (cs.CL)
备注: PhD thesis

点击查看摘要

Abstract:Reasoning about real-life events is a unifying challenge in AI and NLP that has profound utility in a variety of domains, while fallacy in high-stake applications could be catastrophic. Able to work with diverse text in these domains, large language models (LLMs) have proven capable of answering questions and solving problems. However, I show that end-to-end LLMs still systematically fail to reason about complex events, and they lack interpretability due to their black-box nature. To address these issues, I propose three general approaches to use LLMs in conjunction with a structured representation of events. The first is a language-based representation involving relations of sub-events that can be learned by LLMs via fine-tuning. The second is a semi-symbolic representation involving states of entities that can be predicted and leveraged by LLMs via few-shot prompting. The third is a fully symbolic representation that can be predicted by LLMs trained with structured data and be executed by symbolic solvers. On a suite of event reasoning tasks spanning common-sense inference and planning, I show that each approach greatly outperforms end-to-end LLMs with more interpretability. These results suggest manners of synergy between LLMs and structured representations for event reasoning and beyond.
摘要:关于真实事件的推理是人工智能和自然语言处理中的一个统一挑战,在各种领域具有深远的实用价值,而在高风险应用程序中的谬误可能是灾难性的。大型语言模型(LLM)能够处理这些领域的不同文本,已被证明有能力回答问题和解决问题。然而,我指出,端到端的LLM仍然系统性地无法对复杂事件进行推理,而且由于它们的黑箱性质,它们缺乏可解释性。为了解决这些问题,我提出了三种将LLMS与事件的结构化表示结合使用的一般方法。第一种是基于语言的表示,涉及子事件之间的关系,LLMS可以通过微调学习这些关系。第二种是涉及实体状态的半符号表示,LLMS可以通过少镜头提示来预测和利用这些状态。第三种是完全符号表示,它可以由用结构化数据训练的LLMS预测,并由符号求解器执行。在一组跨越常识推理和规划的事件推理任务上,我证明了每种方法都大大优于端到端的LLM,具有更好的可解释性。这些结果表明,在事件推理和其他领域中,LLMS和结构化表征之间存在着协同作用。
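The fully symbolic route in the abstract (an LLM predicts structured representations that a symbolic solver then executes) can be illustrated with a toy STRIPS-style executor. The cooking domain, action names, and preconditions below are invented for illustration and are not from the thesis.

```python
# Each action carries preconditions and add/delete effects, as an LLM
# trained on structured data might emit them; the executor checks a plan.
actions = {
    "crack_egg": {"pre": {"have_egg"},               "add": {"egg_cracked"}, "del": {"have_egg"}},
    "whisk":     {"pre": {"egg_cracked"},            "add": {"egg_whisked"}, "del": set()},
    "fry":       {"pre": {"egg_whisked", "pan_hot"}, "add": {"omelette"},    "del": {"egg_whisked"}},
}

def execute(plan, state):
    """Apply each action in order; return (success, final_state)."""
    state = set(state)
    for a in plan:
        spec = actions[a]
        if not spec["pre"] <= state:  # precondition unmet: plan fails
            return False, state
        state |= spec["add"]
        state -= spec["del"]
    return True, state

ok, final = execute(["crack_egg", "whisk", "fry"], {"have_egg", "pan_hot"})
bad, _ = execute(["whisk", "crack_egg"], {"have_egg"})  # wrong order fails
```

Because the executor is symbolic, every success or failure is fully interpretable, which is the advantage the thesis claims over end-to-end LLM reasoning.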

[NLP-34] Is Personality Prediction Possible Based on Reddit Comments?
[NLP-34] 根据Reddit评论进行性格预测是否可能?

链接: https://arxiv.org/abs/2408.16089
作者: Robert Deimann,Till Preidt,Shaptarshi Roy,Jan Stanicki
关键词-EN: Myers-Briggs Type Indicator, Type Indicator, Reddit comments labeled, personality type, texts they wrote
关键词-ZH: Myers-Briggs类型指标、类型指标、Reddit评论标签、性格类型、他们写的文本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this assignment, we examine whether there is a correlation between the personality type of a person and the texts they wrote. In order to do this, we aggregated datasets of Reddit comments labeled with the Myers-Briggs Type Indicator (MBTI) of the author and built different supervised classifiers based on BERT to try to predict the personality of an author given a text. Despite experiencing issues with the unfiltered character of the dataset, we can observe potential in the classification.
摘要:在这项作业中,我们检查一个人的性格类型与他们所写的文本之间是否存在相关性。为了做到这一点,我们聚合了标有作者Myers-Briggs类型指标(MBTI)的Reddit评论数据集,并基于BERT构建了不同的监督分类器,以尝试根据给定文本预测作者的性格。尽管数据集未经过滤的性质带来了一些问题,但我们仍可以观察到这种分类的潜力。

[NLP-35] Logic-Enhanced Language Model Agents for Trustworthy Social Simulations
[NLP-35] 用于值得信赖的社交模拟的逻辑增强语言模型代理

链接: https://arxiv.org/abs/2408.16081
作者: Agnieszka Mensfelt,Kostas Stathis,Vince Trencsenyi
关键词-EN: utilize large language, Language Model Agents, Logic-Enhanced Language Model, large language models, introduce the Logic-Enhanced
关键词-ZH: 利用大型语言、语言模型代理、逻辑增强语言模型、大型语言模型,引入逻辑增强
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Logic in Computer Science (cs.LO)
备注: Source code: this https URL

点击查看摘要

Abstract:We introduce the Logic-Enhanced Language Model Agents (LELMA) framework, a novel approach to enhance the trustworthiness of social simulations that utilize large language models (LLMs). While LLMs have gained attention as agents for simulating human behaviour, their applicability in this role is limited by issues such as inherent hallucinations and logical inconsistencies. LELMA addresses these challenges by integrating LLMs with symbolic AI, enabling logical verification of the reasoning generated by LLMs. This verification process provides corrective feedback, refining the reasoning output. The framework consists of three main components: an LLM-Reasoner for producing strategic reasoning, an LLM-Translator for mapping natural language reasoning to logic queries, and a Solver for evaluating these queries. This study focuses on decision-making in game-theoretic scenarios as a model of human interaction. Experiments involving the Hawk-Dove game, Prisoner’s Dilemma, and Stag Hunt highlight the limitations of state-of-the-art LLMs, GPT-4 Omni and Gemini 1.0 Pro, in producing correct reasoning in these contexts. LELMA demonstrates high accuracy in error detection and improves the reasoning correctness of LLMs via self-refinement, particularly in GPT-4 Omni.
摘要:我们介绍了逻辑增强的语言模型代理(LELMA)框架,这是一种利用大型语言模型(LLM)增强社会模拟可信性的新方法。虽然LLM作为模拟人类行为的媒介获得了关注,但它们在这一角色中的适用性受到固有幻觉和逻辑不一致等问题的限制。LELMA通过将LLM与符号人工智能相集成来解决这些挑战,从而能够对LLM生成的推理进行逻辑验证。这一验证过程提供了纠正反馈,细化了推理输出。该框架由三个主要部分组成:用于产生策略推理的LLM推理器,用于将自然语言推理映射到逻辑查询的LLM翻译器,以及用于评估这些查询的求解器。本研究以博弈论情景中的决策作为人类互动的模型。涉及鹰鸽博弈、囚徒困境和猎鹿博弈的实验突出了最先进的LLM(GPT-4 Omni和Gemini 1.0 Pro)在这些背景下产生正确推理方面的局限性。LELMA在错误检测方面表现出很高的准确率,并通过自我求精提高了LLM的推理正确性,特别是在GPT-4 Omni中。
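A minimal sketch of the Solver component's role: given a game's payoff matrix, it can mechanically verify whether an LLM's claimed move is a best response and return corrective feedback. The function names and feedback strings are assumptions for illustration, not the framework's actual interface.

```python
# Standard Prisoner's Dilemma payoffs: (my_move, their_move) -> my payoff.
PD_PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(their_move):
    """The move maximizing my payoff against a fixed opponent move."""
    return max("CD", key=lambda m: PD_PAYOFF[(m, their_move)])

def verify_claim(claimed_move, their_move):
    """Return (is_valid, feedback), as the solver's corrective loop would."""
    correct = best_response(their_move)
    if claimed_move == correct:
        return True, "reasoning verified"
    return False, f"best response to {their_move} is {correct}, not {claimed_move}"

ok, msg = verify_claim("C", "C")  # cooperating is not a best response here
```

In the framework, feedback like `msg` would be fed back to the LLM-Reasoner for self-refinement rather than shown to a user.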

[NLP-36] Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings
[NLP-36] 使用大型语言模型创建人工智能角色以复制和预测媒体效果:对133项已发表实验研究结果的实证测试

链接: https://arxiv.org/abs/2408.16073
作者: Leo Yeykelis,Kaavya Pichai,James J. Cummings,Byron Reeves
关键词-EN: large language models, expedite accurate replication, published message effects, language models, Anthropic Claude Sonnet
关键词-ZH: 大型语言模型、加快准确复制、发布的消息效果、语言模型、Anthropic克劳德十四行诗
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 3 figures, 2 tables

点击查看摘要

Abstract:This report analyzes the potential for large language models (LLMs) to expedite accurate replication of published message effects studies. We tested LLM-powered participants (personas) by replicating 133 experimental findings from 14 papers containing 45 recent studies in the Journal of Marketing (January 2023-May 2024). We used a new software tool, Viewpoints AI (this https URL), that takes study designs, stimuli, and measures as input, automatically generates prompts for LLMs to act as a specified sample of unique personas, and collects their responses to produce a final output in the form of a complete dataset and statistical analysis. The underlying LLM used was Anthropic’s Claude Sonnet 3.5. We generated 19,447 AI personas to replicate these studies with the exact same sample attributes, study designs, stimuli, and measures reported in the original human research. Our LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication of studies in which people respond to media stimuli. When including interaction effects, the overall replication rate was 68% (90 out of 133). The use of LLMs to replicate and accelerate marketing research on media effects is discussed with respect to the replication crisis in social science, potential solutions to generalizability problems in sampling subjects and experimental conditions, and the ability to rapidly test consumer responses to various media stimuli. We also address the limitations of this approach, particularly in replicating complex interaction effects in media response studies, and suggest areas for future research and improvement in AI-assisted experimental replication of media effects.
摘要:这份报告分析了大型语言模型(LLM)加快准确复制已发表的消息效应研究的潜力。我们测试了LLM驱动的参与者(人物角色),复制了《营销杂志》(2023年1月至2024年5月)上包含45项最新研究的14篇论文中的133项实验结果。我们使用了一个新的软件工具Viewpoints AI(this https URL),它将研究设计、刺激和测量作为输入,自动为LLM生成提示,使其充当特定的独特人物角色样本,并收集它们的响应,以完整数据集和统计分析的形式生成最终输出。底层LLM为Anthropic的Claude Sonnet 3.5。我们生成了19,447个AI人物角色,以与原始人类研究中报告的完全相同的样本属性、研究设计、刺激和测量来复制这些研究。我们的LLM复制成功地复制了76%的原始主效应(111个中的84个),展示了人工智能辅助复制人们对媒体刺激做出反应的研究的强大潜力。当计入交互效应时,总体复制率为68%(133个中的90个)。本文结合社会科学中的复制危机、抽样对象和实验条件的概括性问题的潜在解决方案,以及快速测试消费者对各种媒体刺激的反应的能力,讨论了使用LLM来复制和加速关于媒体效应的营销研究。我们还讨论了这种方法的局限性,特别是在复制媒体反应研究中的复杂交互效应方面,并提出了人工智能辅助媒体效应实验复制未来研究和改进的方向。

[NLP-37] Using large language models to estimate features of multi-word expressions: Concreteness valence arousal
[NLP-37] 使用大型语言模型估计多词表达的特征:具体价唤起

链接: https://arxiv.org/abs/2408.16012
作者: Gonzalo Martínez,Juan Diego Molero,Sandra González,Javier Conde,Marc Brysbaert,Pedro Reviriego
关键词-EN: large language models, provide accurate estimates, multi-word expressions, large language, accurate estimates
关键词-ZH: 大型语言模型,提供准确的估计、多词表达、大型语言、准确的估计
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study investigates the potential of large language models (LLMs) to provide accurate estimates of concreteness, valence and arousal for multi-word expressions. Unlike previous artificial intelligence (AI) methods, LLMs can capture the nuanced meanings of multi-word expressions. We systematically evaluated ChatGPT-4o’s ability to predict concreteness, valence and arousal. In Study 1, ChatGPT-4o showed strong correlations with human concreteness ratings (r = .8) for multi-word expressions. In Study 2, these findings were repeated for valence and arousal ratings of individual words, matching or outperforming previous AI models. Study 3 extended the valence and arousal analysis to multi-word expressions and showed promising results despite the lack of large-scale human benchmarks. These findings highlight the potential of LLMs for generating valuable psycholinguistic data related to multiword expressions. To help researchers with stimulus selection, we provide datasets with AI norms of concreteness, valence and arousal for 126,397 English single words and 63,680 multi-word expressions.
摘要:本研究探讨了大语言模型(LLM)对多词表达的具体性、配价和唤醒提供准确估计的潜力。与以前的人工智能(AI)方法不同,LLM可以捕捉多词表达的细微差别含义。我们系统地评估了ChatGPT-4o预测具体性、配价和唤醒的能力。在研究1中,ChatGPT-4o与人类对多词表达的具体性评分有很强的相关性(r=0.8)。在研究2中,这些发现在单个单词的配价和唤醒评级上得到重复,与之前的人工智能模型持平或更优。研究3将配价和唤醒分析扩展到多词表达,尽管缺乏大规模的人类基准,但仍显示出令人振奋的结果。这些发现突显了LLM在产生与多词表达相关的有价值的心理语言学数据方面的潜力。为了帮助研究人员进行刺激选择,我们为126,397个英语单词和63,680个多词表达提供了具体性、配价和唤醒的AI常模。
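The headline result (r = .8 between model and human concreteness ratings) is a Pearson correlation, which can be computed with nothing beyond the standard library. The ratings below are invented toy numbers, not data from the study.

```python
import math

# Toy example: human concreteness norms vs. LLM ratings for five expressions.
human = [4.8, 1.9, 3.5, 4.2, 2.1]  # human ratings (e.g. 1-5 scale)
model = [4.6, 2.2, 3.1, 4.4, 1.8]  # model ratings for the same items

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(human, model)
```

With real norms, `human` would come from published rating datasets and `model` from the LLM's numeric responses for the same stimuli.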

[NLP-38] SSDM: Scalable Speech Dysfluency Modeling
[NLP-38] SSDM:可扩展语音不流畅建模

链接: https://arxiv.org/abs/2408.16221
作者: Jiachen Lian,Xuanru Zhou,Zoe Ezzes,Jet Vonk,Brittany Morin,David Baquirin,Zachary Mille,Maria Luisa Gorno Tempini,Gopala Anumanchipalli
关键词-EN: core module, module for spoken, Speech dysfluency modeling, speech therapy, dysfluency
关键词-ZH: 核心模块,口语模块,言语不流利建模,言语治疗,言语不流利
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Speech dysfluency modeling is the core module for spoken language learning, and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is not an effective learning framework. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling. Demo is available at this https URL.
摘要:言语不流畅建模是口语学习和言语治疗的核心模块。然而,存在三个挑战。首先,当前最先进的解决方案的可扩展性较差。其次,缺乏大规模的不流畅语料库。第三,没有有效的学习框架。在本文中,我们提出了SSDM:可扩展语音不流畅建模,它(1)采用发音手势作为可扩展的强制对齐;(2)引入连接主义子序列对齐器(CSA)来实现不流畅对齐;(3)引入一个名为Libri-Dys的大规模模拟不流畅语料库;(4)利用大型语言模型(LLM)的力量开发端到端系统。我们预计SSDM将成为不流畅建模领域的标准。演示可在this https URL上获取。

[NLP-39] Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction
[NLP-39] 在ASR-LLM设置上对日语语音识别进行基准测试,具有多遍增强生成式错误纠正

链接: https://arxiv.org/abs/2408.16180
作者: Yuka Ko,Sheng Li,Chao-Han Huck Yang,Tatsuya Kawahara
关键词-EN: automatic speech recognition, strong representational power, large language models, address ASR errors, generative error correction
关键词-ZH: 自动语音识别、强大的代表能力、大型语言模型、解决SVR错误、生成式错误纠正
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: submitted to SLT2024

点击查看摘要

Abstract:With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances. We also introduce a new multi-pass augmented generative error correction (MPA GER) by integrating multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merging them. To the best of our knowledge, this is the first investigation of the use of LLMs for Japanese GER, which involves second-pass language modeling on the output transcriptions generated by the ASR system (e.g., N-best hypotheses). Our experiments demonstrated performance improvement in the proposed methods of ASR quality and generalization both in SPREDS-U1-ja and CSJ data.
摘要:自动语音识别(ASR)中的生成纠错(GER)利用大型语言模型(LLMS)强大的表征能力,旨在提供语义和语音上的精化以解决ASR错误。这项工作探讨了基于LLM的GER如何增强和扩展日语处理能力,提出了第一个具有0.9-2.6k文本话语的日语ASR的GER基准。我们还提出了一种新的多通道增广生成纠错算法(MPAGER),它将输入端的多个系统假设和输出端的多个LLMS的校正整合在一起,然后将它们合并。据我们所知,这是第一次对LLMS在日语GER中的使用进行调查,这涉及到对ASR系统生成的输出转录进行第二遍语言建模(例如,N最佳假设)。我们的实验表明,在SPREDS-U1-JA和CSJ数据上,所提出的ASR质量和泛化方法的性能都有所提高。
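The two ingredients of MPA GER, packing N-best ASR hypotheses into a correction prompt on the input side and merging corrections from multiple LLMs on the output side, can be sketched as below. The prompt wording and the majority-vote merge are assumptions for illustration; the paper's exact prompt format and merging procedure may differ.

```python
from collections import Counter

def build_ger_prompt(nbest):
    """Pack N-best hypotheses into a single correction prompt."""
    lines = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return ("The following are N-best ASR hypotheses for one utterance.\n"
            f"{lines}\nOutput the corrected transcription.")

def merge_corrections(candidates):
    """Merge corrections from multiple LLMs by simple majority vote."""
    return Counter(candidates).most_common(1)[0][0]

prompt = build_ger_prompt(
    ["kyou wa ii tenki", "kyou wa ii denki", "kyo wa ii tenki"])
merged = merge_corrections(
    ["kyou wa ii tenki", "kyou wa ii tenki", "kyou wa ii denki"])
```

In practice the candidate strings would be Japanese transcriptions returned by the different LLMs, and a more refined merge (e.g. edit-distance based) could replace the vote.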

人工智能

[AI-0] SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

链接: https://arxiv.org/abs/2408.16768
作者: Ziyu Guo,Renrui Zhang,Xiangyang Zhu,Chengzhuo Tong,Peng Gao,Chunyuan Li,Pheng-Ann Heng
关键词-EN: exploration adapting Segment, Segment Anything Model, preliminary exploration adapting, adapting Segment, preliminary exploration
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress. Online Demo: this https URL . Code: this https URL

点击查看摘要

Abstract:We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: this https URL . Code: this https URL .

[AI-1] ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

链接: https://arxiv.org/abs/2408.16767
作者: Fangfu Liu,Wenqiang Sun,Hanyang Wang,Yikai Wang,Haowen Sun,Junliang Ye,Jun Zhang,Yueqi Duan
关键词-EN: producing realistic, results from hundreds, real world, scene reconstruction, scene
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.

[AI-2] A Score-Based Density Formula with Applications in Diffusion Generative Models

链接: https://arxiv.org/abs/2408.16765
作者: Gen Li,Yuling Yan
关键词-EN: Score-based generative models, achieving unprecedented success, diffusion generative models, Score-based generative, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving unprecedented success in generating realistic and diverse content. Despite empirical advances, the theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we address this question by establishing a density formula for a continuous-time diffusion process, which can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the connection between the target density and the score function associated with each step of the forward process. Building on this, we demonstrate that the minimizer of the optimization objective for training DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion loss.

[AI-3] Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

链接: https://arxiv.org/abs/2408.16757
作者: Hongjun Wang,Sagar Vaze,Kai Han
关键词-EN: Detecting test-time distribution, machine learning models, test-time distribution shift, safely deployed machine, deployed machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IJCV, preprint version

点击查看摘要

Abstract:Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: \urlthis https URL
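The finding that scoring rules sensitive to the deep feature magnitude show promise can be illustrated by contrasting maximum softmax probability (MSP), which normalizes magnitude away, with the max-logit score, which keeps it. The logit values below are toy numbers chosen for illustration, not outputs of any particular model.

```python
import math

def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def msp_score(logits):
    """Maximum softmax probability: magnitude-invariant under shifts."""
    return max(softmax(logits))

def max_logit_score(logits):
    """Max-logit score: directly sensitive to feature magnitude."""
    return max(logits)

in_dist = [8.0, 2.0, 1.0]  # confident, large-magnitude logits
ood     = [4.0, 1.0, 0.5]  # similar shape, smaller magnitude

gap_msp = msp_score(in_dist) - msp_score(ood)
gap_logit = max_logit_score(in_dist) - max_logit_score(ood)
```

Both inputs look confident to MSP, so `gap_msp` is small, while `gap_logit` cleanly separates them by magnitude, the behavior the paper observes scaling better than Outlier Exposure-style training.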

[AI-4] Assessing Large Language Models for Online Extremism Research: Identification Explanation and New Knowledge

链接: https://arxiv.org/abs/2408.16749
作者: Beidi Dong,Jin R. Lee,Ziwei Zhu,Balassubramanian Srinivasan
关键词-EN: United States, Bidirectional Encoder Representations, States has experienced, Generative Pre-Trained Transformers, GPT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The United States has experienced a significant increase in violent extremism, prompting the need for automated tools to detect and limit the spread of extremist ideology online. This study evaluates the performance of Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Trained Transformers (GPT) in detecting and classifying online domestic extremist posts. We collected social media posts containing “far-right” and “far-left” ideological keywords and manually labeled them as extremist or non-extremist. Extremist posts were further classified into one or more of five contributing elements of extremism based on a working definitional framework. The BERT model’s performance was evaluated based on training data size and knowledge transfer between categories. We also compared the performance of GPT 3.5 and GPT 4 models using different prompts: naïve, layperson-definition, role-playing, and professional-definition. Results showed that the best performing GPT models outperformed the best performing BERT models, with more detailed prompts generally yielding better results. However, overly complex prompts may impair performance. Different versions of GPT have unique sensitives to what they consider extremist. GPT 3.5 performed better at classifying far-left extremist posts, while GPT 4 performed better at classifying far-right extremist posts. Large language models, represented by GPT models, hold significant potential for online extremism classification tasks, surpassing traditional BERT models in a zero-shot setting. Future research should explore human-computer interactions in optimizing GPT models for extremist detection and classification tasks to develop more efficient (e.g., quicker, less effort) and effective (e.g., fewer errors or mistakes) methods for identifying extremist content.

[AI-5] Smaller Weaker Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

链接: https://arxiv.org/abs/2408.16737
作者: Hritik Bansal,Arian Hosseini,Rishabh Agarwal,Vinh Q. Tran,Mehran Kazemi
关键词-EN: strong language models, strong language, data, high-quality synthetic data, common strategy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training on high-quality synthetic data from strong language models (LMs) is a common strategy to improve the reasoning performance of LMs. In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we investigate the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model. We evaluate the generated data across three key metrics: coverage, diversity, and false positive rate, and show that the data from WC models may have higher coverage and diversity, but also exhibit higher false positive rates. We then finetune LMs on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM. Our findings reveal that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models. These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.
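The compute-matched comparison rests on simple arithmetic: inference FLOPs scale roughly linearly with parameter count, so at a fixed budget a weaker-but-cheaper (WC) model yields proportionally more samples per question than a stronger-but-expensive (SE) one. The parameter counts, token length, and FLOPs-per-token constant below are illustrative assumptions, not the paper's exact setup.

```python
def samples_at_equal_compute(budget_flops, params,
                             flops_per_token_per_param=2,
                             tokens_per_sample=512):
    """Samples affordable under a FLOPs budget, assuming cost ~ 2 * N * tokens."""
    cost_per_sample = flops_per_token_per_param * params * tokens_per_sample
    return budget_flops // cost_per_sample

params_se = 27e9  # hypothetical stronger/expensive model
params_wc = 9e9   # hypothetical weaker/cheap model
budget = 1e18

n_se = samples_at_equal_compute(budget, params_se)
n_wc = samples_at_equal_compute(budget, params_wc)
# The WC model gets roughly params_se / params_wc = 3x more samples.
```

It is this 3x-more-samples effect that buys the WC data its higher coverage and diversity at equal compute.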

[AI-6] Mini-Omni: Language Models Can Hear Talk While Thinking in Streaming

链接: https://arxiv.org/abs/2408.16725
作者: Zhifei Xie,Changqiao Wu
关键词-EN: achieved significant progress, Recent advances, significant progress, achieved significant, Recent
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 10 pages

点击查看摘要

Abstract:Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model’s language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method “Any Model Can Talk”. We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

[AI-7] A GREAT Architecture for Edge-Based Graph Problems Like TSP

链接: https://arxiv.org/abs/2408.16717
作者: Attila Lischka,Jiaming Wu,Morteza Haghir Chehreghani,Balázs Kulcsár
关键词-EN: tackle combinatorial optimization, combinatorial optimization problems, routing problems, neural network-based approaches, proposed to tackle
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:In the last years, many neural network-based approaches have been proposed to tackle combinatorial optimization problems such as routing problems. Many of these approaches are based on graph neural networks (GNNs) or related transformers, operating on the Euclidean coordinates representing the routing problems. However, GNNs are inherently not well suited to operate on dense graphs, such as in routing problems. Furthermore, models operating on Euclidean coordinates cannot be applied to non-Euclidean versions of routing problems that are often found in real-world settings. To overcome these limitations, we propose a novel GNN-related edge-based neural model called Graph Edge Attention Network (GREAT). We evaluate the performance of GREAT in the edge-classification task to predict optimal edges in the Traveling Salesman Problem (TSP). We can use such a trained GREAT model to produce sparse TSP graph instances, keeping only the edges GREAT finds promising. Compared to other, non-learning-based methods to sparsify TSP graphs, GREAT can produce very sparse graphs while keeping most of the optimal edges. Furthermore, we build a reinforcement learning-based GREAT framework which we apply to Euclidean and non-Euclidean asymmetric TSP. This framework achieves state-of-the-art results.

[AI-8] Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

链接: https://arxiv.org/abs/2408.16673
作者: Ziniu Li,Congliang Chen,Tian Xu,Zeyu Qin,Jiancong Xiao,Ruoyu Sun,Zhi-Quan Luo
关键词-EN: Large language models, Large language, rely on Supervised, language models rely, specialize in downstream
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT, but it often leads to overfitting and limited output diversity due to its aggressive updates to the data distribution. This paper aim to address these issues by introducing the maximum entropy principle, which favors models with flatter distributions that still effectively capture the data. Specifically, we develop a new distribution matching method called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to the UltraFeedback dataset to develop general instruction-following abilities, GEM exhibits reduced overfitting, evidenced by lower perplexity and better performance on the IFEval benchmark. Furthermore, GEM enhances output diversity, leading to performance gains of up to 7 points on math reasoning and code generation tasks using best-of-n sampling, even without domain-specific data. Second, when fine-tuning with domain-specific datasets for math reasoning and code generation, GEM also shows less overfitting and improvements of up to 10 points compared with CE.
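The intuition behind the entropy regularizer can be checked numerically on a toy categorical distribution: minimizing reverse KL(q || p) with an entropy bonus beta * H(q) admits the tempered solution q ∝ p^(1/(1+beta)), a flatter distribution than p. This closed form is a standard variational derivation sketched here for intuition, not a formula quoted from the paper.

```python
import math

def entropy(p):
    """Shannon entropy of a categorical distribution (nats)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def tempered(p, beta):
    """Minimizer of KL(q||p) - beta*H(q): q_i proportional to p_i^(1/(1+beta))."""
    w = [x ** (1.0 / (1.0 + beta)) for x in p]
    s = sum(w)
    return [x / s for x in w]

p = [0.7, 0.2, 0.1]        # a peaked data distribution
q = tempered(p, beta=1.0)  # entropy-regularized solution

flatter = entropy(q) > entropy(p)
```

The regularized solution keeps the ordering of `p` but spreads mass more evenly, which is the "less overfitting, better diversity" effect the abstract describes.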

[AI-9] Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

链接: https://arxiv.org/abs/2408.16672
作者: Rohan Jha,Bo Wang,Michael Günther,Saba Sturua,Mohammad Kalim Akram,Han Xiao
关键词-EN: proven highly effective, Multi-vector dense models, Multi-vector dense, proven highly, highly effective
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
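ColBERT's late interaction score is the sum, over query token embeddings, of the maximum similarity against any document token embedding (MaxSim). A minimal sketch with toy 2-D vectors (the embeddings are invented; real ColBERT vectors come from a trained bi-encoder):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_vecs, doc_vecs):
    """MaxSim scoring: for each query token, take its best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]   # covers only the first

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

A document covering both query tokens (`doc_a`) outscores one matching only the first (`doc_b`); this per-token matching is what late interaction preserves over single-vector retrieval while keeping bi-encoder indexing efficiency.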

[AI-10] Iterative Graph Alignment

链接: https://arxiv.org/abs/2408.16667
作者: Fangyuan Yu,Hardeep Singh Arora,Matt Johnson
关键词-EN: generalizable causal relationships, capturing generalizable causal, compressing diverse narratives, causal relationships, intelligence by capturing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:By compressing diverse narratives, LLMs go beyond memorization, achieving intelligence by capturing generalizable causal relationships. However, they suffer from local ‘representation gaps’ due to insufficient training data diversity, limiting their real-world utility, especially in tasks requiring strict alignment to rules. Traditional alignment methods relying on heavy human annotations are inefficient and unscalable. Recent self-alignment techniques also fall short, as they often depend on self-selection based prompting and memorization-based learning. To address these issues, we introduce Iterative Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical graphs and reference answers. The student model (LLM) identifies local knowledge gaps by attempting to align its responses with these references, collaborating with helper models to generate diverse answers. These aligned responses are then used for iterative supervised fine-tuning (SFT). Our evaluations across five rule-based scenarios demonstrate IGP’s effectiveness, with a 73.12% alignment improvement in Claude Sonnet 3.5, and Llama3-8B-Instruct achieving an 86.20% improvement, outperforming Claude Sonnet 3.5 in rule-based alignment.

[AI-11] DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

链接: https://arxiv.org/abs/2408.16647
作者: Yongjie Fu,Anmol Jain,Xuan Di,Xu Chen,Zhaobin Mo
关键词-EN: technologies necessitates increasingly, necessitates increasingly sophisticated, increasingly sophisticated methods, driving technologies necessitates, autonomous driving technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fréchet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.

[AI-12] RLCP: A Reinforcement Learning-based Copyright Protection Method for Text-to-Image Diffusion Model

链接: https://arxiv.org/abs/2408.16634
作者: Zhuan Shi,Jing Yan,Xiaoli Tang,Lingjuan Lyu,Boi Faltings
关键词-EN: Learning-based Copyright Protection, enforcing copyright infringement, copyright infringement criteria, increasing sophistication, led to complex
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: arXiv admin note: text overlap with arXiv:2403.12052 by other authors

点击查看摘要

Abstract:The increasing sophistication of text-to-image generative models has led to complex challenges in defining and enforcing copyright infringement criteria and protection. Existing methods, such as watermarking and dataset deduplication, fail to provide comprehensive solutions due to the lack of standardized metrics and the inherent complexity of addressing copyright infringement in diffusion models. To deal with these challenges, we propose a Reinforcement Learning-based Copyright Protection(RLCP) method for Text-to-Image Diffusion Model, which minimizes the generation of copyright-infringing content while maintaining the quality of the model-generated dataset. Our approach begins with the introduction of a novel copyright metric grounded in copyright law and court precedents on infringement. We then utilize the Denoising Diffusion Policy Optimization (DDPO) framework to guide the model through a multi-step decision-making process, optimizing it using a reward function that incorporates our proposed copyright metric. Additionally, we employ KL divergence as a regularization term to mitigate some failure modes and stabilize RL fine-tuning. Experiments conducted on 3 mixed datasets of copyright and non-copyright images demonstrate that our approach significantly reduces copyright infringement risk while maintaining image quality.
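The general shape of such a KL-regularized reward can be sketched as follows. This is a hypothetical form for illustration only; the paper's actual objective plugs its proposed copyright metric into the DDPO framework.

```python
def rlcp_style_reward(quality, copyright_risk, kl_to_pretrained,
                      lam=1.0, beta=0.1):
    """Hypothetical reward shape for KL-regularized RL fine-tuning:
    favor image quality, penalize the copyright metric, and keep the
    fine-tuned policy close to the pretrained model via a KL term."""
    return quality - lam * copyright_risk - beta * kl_to_pretrained

# An infringing sample should be rewarded less than a clean one of
# equal quality.
clean = rlcp_style_reward(quality=0.9, copyright_risk=0.1, kl_to_pretrained=0.2)
risky = rlcp_style_reward(quality=0.9, copyright_risk=0.8, kl_to_pretrained=0.2)
assert risky < clean
```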

[AI-13] Optimizing Automated Picking Systems in Warehouse Robots Using Machine Learning

链接: https://arxiv.org/abs/2408.16633
作者: Keqin Li,Jin Wang,Xubo Wu,Xirui Peng,Runmian Chang,Xiaoyu Deng,Yiwen Kang,Yue Yang,Fanghao Ni,Bo Hong
关键词-EN: global e-commerce, industry is increasing, rapid growth, growth of global, logistics industry
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid growth of global e-commerce, the demand for automation in the logistics industry is increasing. This study focuses on automated picking systems in warehouses, utilizing deep learning and reinforcement learning technologies to enhance picking efficiency and accuracy while reducing system failure rates. Through empirical analysis, we demonstrate the effectiveness of these technologies in improving robot picking performance and adaptability to complex environments. The results show that the integrated machine learning model significantly outperforms traditional methods, effectively addressing the challenges of peak order processing, reducing operational errors, and improving overall logistics efficiency. Additionally, by analyzing environmental factors, this study further optimizes system design to ensure efficient and stable operation under variable conditions. This research not only provides innovative solutions for logistics automation but also offers a theoretical and empirical foundation for future technological development and application.

[AI-14] Maelstrom Networks

链接: https://arxiv.org/abs/2408.16632
作者: Matthew Evanusa,Cornelia Fermüller,Yiannis Aloimonos
关键词-EN: incorporate working memory, Neural Networks, working memory, Networks, Artificial Neural Networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial neural network research has struggled to devise a way to incorporate working memory into neural networks. While "long term" memory can be seen as the learned weights, working memory likely consists more of dynamical activity, which is missing from feed-forward models. Current state-of-the-art models such as transformers tend to "solve" this by ignoring working memory entirely and simply processing the sequence as one entire piece of data; however, this means the network cannot process the sequence in an online fashion, and it leads to an immense explosion in memory requirements. Here, inspired by a combination of controls, reservoir computing, deep learning, and recurrent neural networks, we offer an alternative paradigm that combines the strength of recurrent networks with the pattern-matching capability of feed-forward neural networks, which we call the Maelstrom Networks paradigm. This paradigm leaves the recurrent component (the Maelstrom) unlearned, and offloads the learning to a powerful feed-forward network. This allows the network to leverage the strength of feed-forward training without unrolling the network, and allows the memory to be implemented in new neuromorphic hardware. It endows a neural network with a sequential memory that takes advantage of the inductive bias that data is organized causally in the temporal domain, and imbues the network with a state that represents the agent's "self", moving through the environment. This could also lead the way to continual learning, with the network modularized and "protected" from overwrites that come with new data. In addition to helping solve the performance problems that plague current non-temporal deep networks, this could finally lead towards endowing artificial networks with a sense of "self".

[AI-15] LLMs generate structurally realistic social networks but overestimate political homophily

链接: https://arxiv.org/abs/2408.16629
作者: Serina Chang,Alicja Chaszczewicz,Emma Wang,Maya Josifovska,Emma Pierson,Jure Leskovec
关键词-EN: Generating social networks, Generating social, epidemic modeling, networks, Generating
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Generating social networks is essential for many applications, such as epidemic modeling and social simulations. Prior approaches either involve deep learning models, which require many observed networks for training, or stylized models, which are limited in their realism and flexibility. In contrast, LLMs offer the potential for zero-shot and flexible network generation. However, two key questions are: (1) are LLM’s generated networks realistic, and (2) what are risks of bias, given the importance of demographics in forming social ties? To answer these questions, we develop three prompting methods for network generation and compare the generated networks to real social networks. We find that more realistic networks are generated with “local” methods, where the LLM constructs relations for one persona at a time, compared to “global” methods that construct the entire network at once. We also find that the generated networks match real networks on many characteristics, including density, clustering, community structure, and degree. However, we find that LLMs emphasize political homophily over all other types of homophily and overestimate political homophily relative to real-world measures.
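The homophily statistic being compared here can be measured directly on a generated network. A minimal sketch of edge homophily (one common definition; the paper may use a different estimator):

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share an attribute value --
    one simple way to quantify (e.g. political) homophily in a
    generated social network."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

# Toy network: two same-party ties out of three edges.
labels = {"a": "left", "b": "left", "c": "right", "d": "right"}
edges = [("a", "b"), ("c", "d"), ("a", "c")]
assert edge_homophily(edges, labels) == 2 / 3
```

Comparing this fraction between LLM-generated and real networks, per attribute (party, age, gender, ...), is how one would detect the overestimated political homophily the abstract reports.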

[AI-16] Towards Infusing Auxiliary Knowledge for Distracted Driver Detection KDD

链接: https://arxiv.org/abs/2408.16621
作者: Ishwar B Balappanawar,Ashmit Chamoli,Ruwan Wickramarachchi,Aditya Mishra,Ponnurangam Kumaraguru,Amit P. Sheth
关键词-EN: road accidents globally, accidents globally, Distracted driving, distracted driving involves, road accidents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at KiL 2024: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference

点击查看摘要

Abstract:Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.

[AI-17] Hyperdimensional Vector Tsetlin Machines with Applications to Sequence Learning and Generation

链接: https://arxiv.org/abs/2408.16620
作者: Christian D. Blakely
关键词-EN: adding numerous advantages, vanilla Tsetlin machines, Tsetlin machine clause, Tsetlin machines, machine learning model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We construct a two-layered model for learning and generating sequential data that is both computationally fast and competitive with vanilla Tsetlin machines, adding numerous advantages. Through the use of hyperdimensional vector computing (HVC) algebras and Tsetlin machine clause structures, we demonstrate that the combination of both inherits the generality of data encoding and decoding of HVC with the fast interpretable nature of Tsetlin machines to yield a powerful machine learning model. We apply the approach in two areas, namely in forecasting, generating new sequences, and classification. For the latter, we derive results for the entire UCR Time Series Archive and compare with the standard benchmarks to see how well the method competes in time series classification.
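The hyperdimensional vector computing (HVC) algebra the model builds on has two core operations, binding and bundling, over very wide random vectors. A minimal bipolar-vector sketch (generic HVC, not the paper's specific encoding):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervectors are typically very wide

def random_hv():
    """Random bipolar hypervector (+1/-1 entries)."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding (elementwise product): result is dissimilar to both inputs
    and invertible, since b * b = 1 elementwise."""
    return a * b

def bundle(*vs):
    """Bundling (majority vote): result stays similar to each input."""
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):
    return (a @ b) / D  # normalized dot product in [-1, 1]

x, y = random_hv(), random_hv()
xy = bind(x, y)
assert abs(sim(xy, x)) < 0.1                    # bound vector ~orthogonal
assert sim(bundle(x, y, random_hv()), x) > 0.3  # bundle preserves similarity
assert np.array_equal(bind(xy, y), x)           # unbinding recovers x
```

These properties are what give HVC its general encode/decode ability; the paper's contribution is pairing that with Tsetlin machine clause structures.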

[AI-18] Examination of Code generated by Large Language Models

链接: https://arxiv.org/abs/2408.16601
作者: Robin Beer,Alexander Feix,Tim Guttzeit,Tamara Muras,Vincent Müller,Maurice Rauscher,Florian Schäffler,Welf Löwe
关键词-EN: enable rapid prototyping, Large language models, transforming software development, automating code generation, ChatGPT and Copilot
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs), such as ChatGPT and Copilot, are transforming software development by automating code generation and, arguably, enabling rapid prototyping, supporting education, and boosting productivity. Therefore, the correctness and quality of the generated code should be on par with manually written code. To assess the current state of LLMs in generating correct code of high quality, we conducted controlled experiments with ChatGPT and Copilot: we let the LLMs generate simple algorithms in Java and Python along with the corresponding unit tests and assessed the correctness and the quality (coverage) of the generated (test) codes. We observed significant differences between the LLMs, between the languages, between algorithm and test codes, and over time. The present paper reports these results together with the experimental methods allowing repeated and comparable assessments for more algorithms, languages, and LLMs over time.

[AI-19] Enhancing Dialogue Generation in Werewolf Game Through Situation Analysis and Persuasion Strategies

链接: https://arxiv.org/abs/2408.16586
作者: Zhiyang Qi,Michimasa Inaba
关键词-EN: natural language processing, large language models, enhanced dialogue systems, Recent advancements, significantly enhanced dialogue
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to the AIWolfDial2024 workshop at INLG 2024

点击查看摘要

Abstract:Recent advancements in natural language processing, particularly with large language models (LLMs) like GPT-4, have significantly enhanced dialogue systems, enabling them to generate more natural and fluent conversations. Despite these improvements, challenges persist, such as managing continuous dialogues, memory retention, and minimizing hallucinations. The AIWolfDial2024 addresses these challenges by employing the Werewolf Game, an incomplete information game, to test the capabilities of LLMs in complex interactive environments. This paper introduces a LLM-based Werewolf Game AI, where each role is supported by situation analysis to aid response generation. Additionally, for the werewolf role, various persuasion strategies, including logical appeal, credibility appeal, and emotional appeal, are employed to effectively persuade other players to align with its actions.

[AI-20] Seeking the Sufficiency and Necessity Causal Features in Multimodal Representation Learning

链接: https://arxiv.org/abs/2408.16577
作者: Boyu Chen,Junjie Liu,Zhu Li,Mengyue yang
关键词-EN: learning models’ ability, enhance deep learning, deep learning models’, high Probability, models’ ability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning representations with a high Probability of Necessary and Sufficient Causes (PNS) has been shown to enhance deep learning models’ ability. This task involves identifying causal features that are both sufficient (guaranteeing the outcome) and necessary (without which the outcome cannot occur). However, current research predominantly focuses on unimodal data, and extending PNS learning to multimodal settings presents significant challenges. The challenges arise as the conditions for PNS identifiability, Exogeneity and Monotonicity, need to be reconsidered in a multimodal context, where sufficient and necessary causal features are distributed across different modalities. To address this, we first propose conceptualizing multimodal representations as comprising modality-invariant and modality-specific components. We then analyze PNS identifiability for each component, while ensuring non-trivial PNS estimation. Finally, we formulate tractable optimization objectives that enable multimodal models to learn high-PNS representations, thereby enhancing their predictive performance. Experiments demonstrate the effectiveness of our method on both synthetic and real-world data.
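For reference, the identifiability conditions the abstract alludes to come from Pearl's classic result: the Probability of Necessary and Sufficient causes is a counterfactual quantity, but under exogeneity and monotonicity it reduces to an observable contrast.

```latex
% PNS is the probability that Y responds to X in both directions:
%   \mathrm{PNS} = P(y_x,\, y'_{x'})
% Under exogeneity and monotonicity (Tian & Pearl), it is identifiable
% from observational data:
\mathrm{PNS} = P(y \mid x) - P(y \mid x')
```

The multimodal difficulty described above is that these two conditions must now hold for causal features split across modality-invariant and modality-specific components.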

[AI-21] SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks

链接: https://arxiv.org/abs/2408.16537
作者: Xing Ai,Guanyu Zhu,Yulin Zhu,Yu Zheng,Gaolei Li,Jianhua Li,Kai Zhou
关键词-EN: demonstrated commendable performance, Graph Neural Networks, graph-structured data, demonstrated commendable, commendable performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks because embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing robustness against adversarial structural attacks. A defender inevitably incurs heavy computational costs because it lacks prior knowledge of the modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%-162% speedup compared to advanced robust models, demonstrating superior robustness for node classification tasks.

[AI-22] Adaptive Variational Continual Learning via Task-Heuristic Modelling

链接: https://arxiv.org/abs/2408.16517
作者: Fan Yang
关键词-EN: Variational continual learning, turn-key learning algorithm, generalized variational continual, Variational continual, continual learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Variational continual learning (VCL) is a turn-key learning algorithm that has state-of-the-art performance among the best continual learning models. In our work, we explore an extension of the generalized variational continual learning (GVCL) model, named AutoVCL, which combines task heuristics for informed learning and model optimization. We demonstrate that our model outperforms the standard GVCL with fixed hyperparameters, benefiting from the automatic adjustment of the hyperparameter based on the difficulty and similarity of the incoming task compared to the previous tasks.
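As background, the objective that VCL and GVCL build on maximizes, at task $t$, an ELBO that anchors the new posterior to the previous one:

```latex
\mathcal{L}_t(q_t) \;=\; \mathbb{E}_{\theta \sim q_t}\!\left[\log p(\mathcal{D}_t \mid \theta)\right]
\;-\; \beta\,\mathrm{KL}\!\left(q_t(\theta)\,\|\,q_{t-1}(\theta)\right)
```

Roughly speaking, VCL fixes $\beta = 1$, GVCL treats $\beta$ as a tunable hyperparameter, and AutoVCL's contribution is adjusting it automatically from heuristics about the incoming task's difficulty and similarity to previous tasks.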

[AI-23] On-device AI: Quantization-aware Training of Transformers in Time-Series

链接: https://arxiv.org/abs/2408.16495
作者: Tianheng Ling,Gregor Schiele
关键词-EN: Artificial Intelligence, Transformer model, pervasive computing, Programmable Gate Arrays, Field Programmable Gate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper is accepted by 2023 IEEE International Conference on Pervasive Computing and Communications(PhD Forum)

点击查看摘要

Abstract:Artificial Intelligence (AI) models for time-series in pervasive computing keep getting larger and more complicated. The Transformer model is by far the most compelling of these AI models. However, it is difficult to obtain the desired performance when deploying such a massive model on a sensor device with limited resources. My research focuses on optimizing the Transformer model for time-series forecasting tasks. The optimized model will be deployed as hardware accelerators on embedded Field Programmable Gate Arrays (FPGAs). I will investigate the impact of applying Quantization-aware Training to the Transformer model to reduce its size and runtime memory footprint while maximizing the advantages of FPGAs.
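The core mechanism of Quantization-aware Training is "fake quantization": the forward pass simulates integer rounding so the model learns weights that survive it. A minimal numpy sketch (the straight-through estimator used in the backward pass is not modeled here):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate symmetric integer quantization in the forward pass:
    scale to the integer grid, round, clip, and dequantize."""
    max_abs = np.max(np.abs(x))
    qmax = 2 ** (num_bits - 1) - 1
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.51, -0.24, 0.02, 1.27])
wq = fake_quantize(w, num_bits=8)
# Quantization error is bounded by half a quantization step.
step = np.max(np.abs(w)) / 127
assert np.all(np.abs(w - wq) <= step / 2 + 1e-12)
```

Because the FPGA deployment only ever sees the integer codes `q`, training against this simulated rounding is what preserves accuracy at the reduced bit width.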

[AI-24] Integrating Features for Recognizing Human Activities through Optimized Parameters in Graph Convolutional Networks and Transformer Architectures

链接: https://arxiv.org/abs/2408.16442
作者: Mohammad Belal(1),Taimur Hassan(2),Abdelfatah Hassan(1),Nael Alsheikh(1),Noureldin Elhendawi(1),Irfan Hussain(1) ((1) Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates, (2) Abu Dhabi University, Abu Dhabi, United Arab Emirates)
关键词-EN: categorize human actions, employs computer vision, machine vision, computer vision, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 6 pages, 1 figure, conference

点击查看摘要

Abstract:Human activity recognition is a major field of study that employs computer vision, machine vision, and deep learning techniques to categorize human actions. The field of deep learning has made significant progress, with architectures that are extremely effective at capturing human dynamics. This study emphasizes the influence of feature fusion on the accuracy of activity recognition. This technique addresses the limitation of conventional models, which face difficulties in identifying activities because of their limited capacity to understand spatial and temporal features. The technique employs sensory data obtained from four publicly available datasets: HuGaDB, PKU-MMD, LARa, and TUG. The accuracy and F1-score of two deep learning models, specifically a Transformer model and a Parameter-Optimized Graph Convolutional Network (PO-GCN), were evaluated using these datasets. The feature fusion technique integrated the final layer features from both models and inputted them into a classifier. Empirical evidence demonstrates that PO-GCN outperforms standard models in activity recognition. HuGaDB demonstrated a 2.3% improvement in accuracy and a 2.2% increase in F1-score. TUG showed a 5% increase in accuracy and a 0.5% rise in F1-score. On the other hand, LARa and PKU-MMD achieved lower accuracies of 64% and 69% respectively. This indicates that the integration of features enhanced the performance of both the Transformer model and PO-GCN.

[AI-25] Gradient-free variational learning with conditional mixture networks

链接: https://arxiv.org/abs/2408.16429
作者: Conor Heins,Hao Wu,Dimitrije Markovic,Alexander Tschantz,Jeff Beck,Christopher Buckley
关键词-EN: Balancing computational efficiency, Balancing computational, robust predictive performance, critical applications, computational efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 16 pages main text (3 figures), including references. 9 pages supplementary material (5 figures)

点击查看摘要

Abstract:Balancing computational efficiency with robust predictive performance is crucial in supervised learning, especially for critical applications. Standard deep learning models, while accurate and scalable, often lack probabilistic features like calibrated predictions and uncertainty quantification. Bayesian methods address these issues but can be computationally expensive as model and data complexity increase. Previous work shows that fast variational methods can reduce the compute requirements of Bayesian methods by eliminating the need for gradient computation or sampling, but are often limited to simple models. We demonstrate that conditional mixture networks (CMNs), a probabilistic variant of the mixture-of-experts (MoE) model, are suitable for fast, gradient-free inference and can solve complex classification tasks. CMNs employ linear experts and a softmax gating network. By exploiting conditional conjugacy and Pólya-Gamma augmentation, we furnish Gaussian likelihoods for the weights of both the linear experts and the gating network. This enables efficient variational updates using coordinate ascent variational inference (CAVI), avoiding traditional gradient-based optimization. We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository. Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation, while maintaining competitive runtime and full posterior distributions over all model parameters. Moreover, as input size or the number of experts increases, computation time scales competitively with MLE and other gradient-based solutions like black-box variational inference (BBVI), making CAVI-CMN a promising tool for deep, fast, and gradient-free Bayesian networks.

[AI-26] COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation ECCV2024

链接: https://arxiv.org/abs/2408.16426
作者: Jiefeng Li,Ye Yuan,Davis Rempe,Haotian Zhang,Pavlo Molchanov,Cewu Lu,Jan Kautz,Umar Iqbal
关键词-EN: Estimating global human, Estimating global, motion, global human motion, human motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024

点击查看摘要

Abstract:Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss to alleviate the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms the state-of-the-art methods in terms of global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.

[AI-27] Fourier Spectral Physics Informed Neural Network: An Efficient and Low-Memory PINN

链接: https://arxiv.org/abs/2408.16414
作者: Tianchi Yu,Yiming Qi,Ivan Oseledets,Shiyi Chen
关键词-EN: solving partial differential, partial differential equations, physics-informed neural networks, growing investigations, investigations into solving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:With growing investigations into solving partial differential equations by physics-informed neural networks (PINNs), more accurate and efficient PINNs are required to meet the practical demands of scientific computing. One bottleneck of current PINNs is computing the high-order derivatives via automatic differentiation, which often necessitates substantial computing resources. In this paper, we focus on removing the automatic differentiation of the spatial derivatives and propose a spectral-based neural network that substitutes the differential operator with a multiplication. Compared to PINNs, our approach requires less memory and shorter training time. Thanks to the exponential convergence of the spectral basis, our approach is more accurate. Moreover, to handle the different situations between the physics domain and the spectral domain, we provide two strategies to train networks by their spectral information. Through a series of comprehensive experiments, we validate the aforementioned merits of our proposed network.
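The substitution described above (a differential operator becoming a multiplication in spectral space) can be shown in a few lines: for periodic data, differentiation is multiplication by $ik$ in Fourier space, with no automatic differentiation involved.

```python
import numpy as np

def spectral_derivative(u, L=2 * np.pi):
    """Differentiate a periodic, uniformly sampled function by
    multiplying its Fourier coefficients by (i * k)."""
    n = u.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)  # angular wavenumbers
    return np.real(np.fft.ifft(1j * k * np.fft.fft(u)))

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
du = spectral_derivative(np.sin(x))
# For band-limited functions the spectral derivative is exact to
# machine precision: d/dx sin(x) = cos(x).
assert np.allclose(du, np.cos(x), atol=1e-10)
```

This exactness for band-limited functions is the "exponential convergence of the spectral basis" the abstract credits for the accuracy gain over autodiff-based PINNs.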

[AI-28] DetectBERT: Towards Full App-Level Representation Learning to Detect Android Malware

链接: https://arxiv.org/abs/2408.16353
作者: Tiezhu Sun,Nadia Daoudi,Kisub Kim,Kevin Allix,Tegawendé F. Bissyandé,Jacques Klein
关键词-EN: complex malicious behaviors, function call graphs, capture complex malicious, significantly improved Android, Recent advancements
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accepted at ESEM 2024

点击查看摘要

Abstract:Recent advancements in ML and DL have significantly improved Android malware detection, yet many methodologies still rely on basic static analysis, bytecode, or function call graphs that often fail to capture complex malicious behaviors. DexBERT, a pre-trained BERT-like model tailored for Android representation learning, enriches class-level representations by analyzing Smali code extracted from APKs. However, its functionality is constrained by its inability to process multiple Smali classes simultaneously. This paper introduces DetectBERT, which integrates correlated Multiple Instance Learning (c-MIL) with DexBERT to handle the high dimensionality and variability of Android malware, enabling effective app-level detection. By treating class-level features as instances within MIL bags, DetectBERT aggregates these into a comprehensive app-level representation. Our evaluation demonstrates that DetectBERT not only surpasses existing state-of-the-art detection methods but also adapts to evolving malware threats. Moreover, the versatility of the DetectBERT framework holds promising potential for broader applications in app-level analysis and other software engineering tasks, offering new avenues for research and development.
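Treating class-level features as instances in a MIL bag means the app-level vector is an attention-weighted pool over Smali-class embeddings. A generic attention-MIL pooling sketch (the weights `w`, `v` are hypothetical stand-ins for learned parameters; DetectBERT's exact c-MIL formulation may differ):

```python
import numpy as np

def attention_mil_pool(instances, w, v):
    """Attention-based MIL pooling: score each instance, softmax the
    scores into attention weights, and return the weighted sum as the
    bag-level (here: app-level) representation."""
    scores = np.tanh(instances @ v) @ w        # (num_instances,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                            # attention weights
    return a @ instances                       # (embed_dim,)

rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 8))   # 5 Smali-class embeddings of dim 8
v = rng.normal(size=(8, 4))
w = rng.normal(size=4)
app_vec = attention_mil_pool(bag, w, v)
assert app_vec.shape == (8,)    # one fixed-size vector per app
```

The fixed-size output is what lets a single classifier make app-level malware decisions regardless of how many classes an APK contains.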

[AI-29] Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach

链接: https://arxiv.org/abs/2408.16343
作者: Yifei Chen,Shenghao Zhu,Zhaojie Fang,Chang Liu,Binfeng Zou,Yuhe Wang,Shuo Chang,Fan Jia,Feiwei Qin,Jin Fan,Yong Peng,Changmiao Wang
关键词-EN: Alzheimer Disease, complex neurodegenerative disorder, neurodegenerative disorder marked, executive dysfunction, memory loss
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Alzheimer’s Disease (AD) is a complex neurodegenerative disorder marked by memory loss, executive dysfunction, and personality changes. Early diagnosis is challenging due to subtle symptoms and varied presentations, often leading to misdiagnosis with traditional unimodal diagnostic methods due to their limited scope. This study introduces an advanced multimodal classification model that integrates clinical, cognitive, neuroimaging, and EEG data to enhance diagnostic accuracy. The model incorporates a feature tagger with a tabular data coding architecture and utilizes the TimesBlock module to capture intricate temporal patterns in Electroencephalograms (EEG) data. By employing Cross-modal Attention Aggregation module, the model effectively fuses Magnetic Resonance Imaging (MRI) spatial information with EEG temporal data, significantly improving the distinction between AD, Mild Cognitive Impairment, and Normal Cognition. Simultaneously, we have constructed the first AD classification dataset that includes three modalities: EEG, MRI, and tabular data. Our innovative approach aims to facilitate early diagnosis and intervention, potentially slowing the progression of AD. The source code and our private ADMC dataset are available at this https URL.
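The cross-modal fusion described above is, at its core, cross-attention: one modality provides queries, the other provides keys and values. A minimal sketch (learned projection matrices omitted; this is the generic mechanism, not the paper's exact module):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_attention(mri_feats, eeg_feats):
    """Cross-attention sketch: MRI spatial features act as queries that
    attend over EEG time steps, fusing spatial with temporal information."""
    d = mri_feats.shape[-1]
    attn = softmax(mri_feats @ eeg_feats.T / np.sqrt(d))  # (regions, steps)
    return attn @ eeg_feats                               # (regions, d)

mri = np.random.default_rng(0).normal(size=(6, 16))   # 6 brain regions
eeg = np.random.default_rng(1).normal(size=(20, 16))  # 20 EEG time steps
fused = cross_modal_attention(mri, eeg)
assert fused.shape == (6, 16)
```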

[AI-30] Self-Improving Diffusion Models with Synthetic Data

链接: https://arxiv.org/abs/2408.16333
作者: Sina Alemohammad,Ahmed Imtiaz Humayun,Shruti Agarwal,John Collomosse,Richard Baraniuk
关键词-EN: synthetic data, training increasingly large, data, increasingly large generative, synthetic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model’s generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model’s synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
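The "negative guidance" idea can be sketched in the style of classifier-free guidance, with the self-synthesized model serving as the negative direction. This is an assumed form for illustration; the paper's actual guidance rule may differ.

```python
import numpy as np

def sims_guided_score(score_base, score_synth, w=0.5):
    """Negative-guidance sketch (assumed form): extrapolate the
    denoising direction away from the self-synthesized data manifold
    and toward the base model's estimate of the real-data score."""
    return score_base + w * (score_base - score_synth)

# When the synthetic model agrees with the base model, guidance is a
# no-op; when it disagrees, the update moves further away from the
# synthetic estimate.
s_real = np.array([1.0, 0.0])
s_synth = np.array([0.0, 1.0])
assert np.allclose(sims_guided_score(s_real, s_real), s_real)
guided = sims_guided_score(s_real, s_synth, w=0.5)
assert np.linalg.norm(guided - s_synth) > np.linalg.norm(s_real - s_synth)
```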

[AI-31] Guided Reasoning: A Non-Technical Introduction

链接: https://arxiv.org/abs/2408.16331
作者: Gregor Betz
关键词-EN: Guided Reasoning, Guided Reasoning system, implementation of Guided, introduce the concept, Guided
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:We introduce the concept and a default implementation of Guided Reasoning. A multi-agent system is a Guided Reasoning system iff one agent (the guide) primarily interacts with other agents in order to improve reasoning quality. We describe Logikon’s default implementation of Guided Reasoning in non-technical terms. This is a living document we’ll gradually enrich with more detailed information and examples. Code: this https URL

[AI-32] FA-YOLO: Research On Efficient Feature Selection YOLO Improved Algorithm Based On FMDS and AGMF Modules

链接: https://arxiv.org/abs/2408.16313
作者: Yukang Huo,Mingyuan Yao,Qingbin Tian,Tonghao Wang,Ruifeng Wang,Haihua Wang
关键词-EN: YOLO series, FMDS Module, Module, AGMF Module, FMDS Module branch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages and 4 figures

点击查看摘要

Abstract:Over the past few years, the YOLO series of models has emerged as one of the dominant methodologies in the realm of object detection. Many studies have advanced these baseline models by modifying their architectures, enhancing data quality, and developing new loss functions. However, current models still exhibit deficiencies in processing feature maps, such as overlooking the fusion of cross-scale features and a static fusion approach that lacks the capability for dynamic feature adjustment. To address these issues, this paper introduces an efficient Fine-grained Multi-scale Dynamic Selection Module (FMDS Module), which applies a more effective dynamic feature selection and fusion method on fine-grained multi-scale feature maps, significantly enhancing the detection accuracy of small, medium, and large-sized targets in complex environments. Furthermore, this paper proposes an Adaptive Gated Multi-branch Focus Fusion Module (AGMF Module), which utilizes multiple parallel branches to perform complementary fusion of various features captured by the gated unit branch, FMDS Module branch, and TripletAttention branch. This approach further enhances the comprehensiveness, diversity, and integrity of feature fusion. This paper has integrated the FMDS Module, AGMF Module, into Yolov9 to develop a novel object detection model named FA-YOLO. Extensive experimental results show that under identical experimental conditions, FA-YOLO achieves an outstanding 66.1% mean Average Precision (mAP) on the PASCAL VOC 2007 dataset, representing 1.0% improvement over YOLOv9’s 65.1%. Additionally, the detection accuracies of FA-YOLO for small, medium, and large targets are 44.1%, 54.6%, and 70.8%, respectively, showing improvements of 2.0%, 3.1%, and 0.9% compared to YOLOv9’s 42.1%, 51.5%, and 69.9%.

[AI-33] Safe Bayesian Optimization for High-Dimensional Control Systems via Additive Gaussian Processes

链接: https://arxiv.org/abs/2408.16307
作者: Hongxuan Wang,Xiaocong Li,Adrish Bhaumik,Prahlad Vadakkepat
关键词-EN: fundamental problems, problems in robotics, robotics and mechatronic, mechatronic systems, safe Bayesian optimization
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Controller tuning and optimization have been among the most fundamental problems in robotics and mechatronic systems. The traditional methodology is usually model-based, but its performance heavily relies on an accurate mathematical model of the system. In control applications with complex dynamics, obtaining a precise model is often challenging, leading us towards a data-driven approach. While optimizing a single controller has been explored by various researchers, it remains a challenge to obtain the optimal controller parameters safely and efficiently when multiple controllers are involved. In this paper, we propose a high-dimensional safe Bayesian optimization method based on additive Gaussian processes to optimize multiple controllers simultaneously and safely. Additive Gaussian kernels replace the traditional squared-exponential kernels or Matérn kernels, enhancing the efficiency with which Gaussian processes update information on unknown functions. Experimental results on a permanent magnet synchronous motor (PMSM) demonstrate that compared to existing safe Bayesian optimization algorithms, our method can obtain optimal parameters more efficiently while ensuring safety.
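An additive Gaussian-process kernel sums low-dimensional component kernels instead of coupling all input dimensions at once, which is what keeps high-dimensional Bayesian optimization tractable. A minimal first-order sketch with 1-D squared-exponential components (the paper may also use higher-order additive terms; the names here are illustrative):

```python
import numpy as np

def se_kernel_1d(a, b, lengthscale=1.0):
    # 1-D squared-exponential kernel on a single input dimension.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def additive_kernel(X, Y, lengthscales):
    """First-order additive kernel: k(x, y) = sum_d k_d(x_d, y_d)."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for d, ell in enumerate(lengthscales):
        K += se_kernel_1d(X[:, d], Y[:, d], ell)
    return K

X = np.random.default_rng(0).normal(size=(5, 3))
K = additive_kernel(X, X, lengthscales=[1.0, 1.0, 1.0])
assert K.shape == (5, 5)
assert np.allclose(K, K.T)            # kernel matrix is symmetric
assert np.allclose(np.diag(K), 3.0)   # each 1-D SE kernel equals 1 on the diagonal
```

Because each component only sees one coordinate, the posterior updates decompose per dimension, which is the efficiency gain the abstract attributes to additive kernels over full squared-exponential or Matérn kernels.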

[AI-34] Physics of Language Models: Part 2.2 How to Learn From Mistakes on Grade-School Math Problems

链接: https://arxiv.org/abs/2408.16293
作者: Tian Ye,Zicheng Xu,Yuanzhi Li,Zeyuan Allen-Zhu
关键词-EN: demonstrated remarkable performance, solving reasoning tasks, occasionally make reasoning, make reasoning mistakes, Language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2407.20311

点击查看摘要

Abstract:Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to “self-correct” their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating “error-correction” data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
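The "error-correction" pretraining data, erroneous steps immediately followed by their corrections, can be assembled mechanically from a correct solution plus a planted mistake. The retraction marker `[BACK]` and the sequence layout below are hypothetical; the abstract does not specify the exact data format.

```python
def build_error_correction_example(steps, wrong_step_idx, wrong_text, marker="[BACK]"):
    """Build a pretraining sequence containing an erroneous step and its fix.

    `marker` is a hypothetical retraction token separating the mistake from
    the immediately following correct step.
    """
    seq = []
    for i, step in enumerate(steps):
        if i == wrong_step_idx:
            seq.append(wrong_text)   # erroneous step appears first...
            seq.append(marker)       # ...then is immediately retracted
        seq.append(step)             # followed by the correct step
    return " ".join(seq)

ex = build_error_correction_example(
    ["x = 2 + 3 = 5", "y = 5 * 4 = 20"], 1, "y = 5 * 4 = 25")
assert "[BACK]" in ex and ex.index("= 25") < ex.index("= 20")
```

Sequences like this let the model learn, via plain auto-regression, that a detected mistake is followed by its correction, which is the effect the paper studies against error-free pretraining data.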

[AI-35] OpenFGL: A Comprehensive Benchmark for Federated Graph Learning

链接: https://arxiv.org/abs/2408.16288
作者: Xunkai Li,Yinlin Zhu,Boyang Pang,Guochen Yan,Yeyu Yan,Zening Li,Zhengyu Wu,Wentao Zhang,Rong-Hua Li,Guoren Wang
关键词-EN: promising distributed training, distributed training paradigm, multiple local systems, graph neural networks, direct data sharing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
*备注: Under Review

点击查看摘要

Abstract:Federated graph learning (FGL) has emerged as a promising distributed training paradigm for graph neural networks across multiple local systems without direct data sharing. This approach is particularly beneficial in privacy-sensitive scenarios and offers a new perspective on addressing scalability challenges in large-scale graph learning. Despite the proliferation of FGL, the diverse motivations from practical applications, spanning various research backgrounds and experimental settings, pose a significant challenge to fair evaluation. To fill this gap, we propose OpenFGL, a unified benchmark designed for the primary FGL scenarios: Graph-FL and Subgraph-FL. Specifically, OpenFGL includes 38 graph datasets from 16 application domains, 8 federated data simulation strategies that emphasize graph properties, and 5 graph-based downstream tasks. Additionally, it offers 18 recently proposed SOTA FGL algorithms through a user-friendly API, enabling a thorough comparison and comprehensive evaluation of their effectiveness, robustness, and efficiency. Empirical results demonstrate the ability of FGL while also revealing its potential limitations, offering valuable insights for future exploration in this thriving field.

[AI-36] Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding

链接: https://arxiv.org/abs/2408.16272
作者: Kaijing Ma,Haojian Huang,Jin Chen,Haodong Chen,Pengliang Ji,Xianghao Zang,Han Fang,Chao Ban,Hao Sun,Mulin Chen,Xuelong Li
关键词-EN: Existing Video Temporal, Video Temporal Grounding, Temporal Grounding, Existing Video, overlook open-world challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Ongoing work: 28pages, 19 figures, 7 tables. Code is available at: https://kaijing.space/SRAM/

点击查看摘要

Abstract:Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say “I do not know” in scenarios beyond its handling capacity. However, the direct application of traditional DER theory and its regularizer reveals structural flaws, leading to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful attempt of DER in VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.

[AI-37] LoraMap: Harnessing the Power of LoRA Connections

链接: https://arxiv.org/abs/2408.16264
作者: Hyeryun Park,Jeongwon Kwak,Dongsuk Jang,Sumin Park,Jinwook Choi
关键词-EN: Large Language Models, Large Language, Language Models, overcoming substantial computational, substantial computational overhead
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 13 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Large Language Models (LLMs) can benefit from mitigating hallucinations through fact-checking and overcoming substantial computational overhead with parameter-efficient techniques such as Low-Rank Adaptation (LoRA). While some studies have explored the parallel integration of multiple LoRAs, these approaches need attention to the connections between them. This paper investigates methods to establish connections among multiple LoRAs. We create three reasoning datasets tailored to fact-checking and fine-tune individual LoRAs, allowing them to view and reason from diverse perspectives. Then, we explore strategies for allocating these reasoning LoRAs and introduce LoraMap, an approach to map connections between them. The results on the fact-checking task demonstrate that the performance of LoraMap is superior to LoraHub, an existing LoRA composition method. LoraMap also outperforms with significantly fewer parameters than LoraConcat, which concatenates LoRAs and further fine-tunes them.

[AI-38] Evaluating Time-Series Training Dataset through Lens of Spectrum in Deep State Space Models

链接: https://arxiv.org/abs/2408.16261
作者: Sekitoshi Kanai,Yasutoshi Ida,Kazuki Adachi,Mihiro Uchida,Tsukasa Yoshida,Shin’ya Yamaguchi
关键词-EN: state space models, deep SSMs, deep neural networks, SSMs, deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:This study investigates a method to evaluate time-series datasets in terms of the performance of deep neural networks (DNNs) with state space models (deep SSMs) trained on the dataset. SSMs have attracted attention as components inside DNNs to address time-series data. Since deep SSMs have powerful representation capacities, training datasets play a crucial role in solving a new task. However, the effectiveness of training datasets cannot be known until deep SSMs are actually trained on them. This can increase the cost of data collection for new tasks, as a trial-and-error process of data collection and time-consuming training are needed to achieve the necessary performance. To advance the practical use of deep SSMs, a dataset metric that estimates performance early in training can be one key element. To this end, we introduce the concept of data evaluation methods used in system identification. In system identification of linear dynamical systems, the effectiveness of datasets is evaluated by using the spectrum of input signals. We introduce this concept to deep SSMs, which are nonlinear dynamical systems. We propose the K-spectral metric, which is the sum of the top-K spectra of signals inside deep SSMs, by focusing on the fact that each layer of a deep SSM can be regarded as a linear dynamical system. Our experiments show that the K-spectral metric has a large absolute correlation with performance and can be used to evaluate the quality of training datasets.
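Reading each layer as a linear dynamical system, a simplified version of the K-spectral metric sums the top-K amplitude-spectrum lines of the signals entering the layers; richly excited inputs score higher. This is an illustrative reading of the abstract, not the paper's exact definition.

```python
import numpy as np

def k_spectral_metric(layer_signals, k=5):
    """Sum of the top-K spectral magnitudes over all layer input signals."""
    total = 0.0
    for sig in layer_signals:                 # one 1-D signal per layer
        mags = np.abs(np.fft.rfft(sig))       # amplitude spectrum
        total += np.sort(mags)[-k:].sum()     # top-K spectral lines
    return total

t = np.linspace(0, 1, 256, endpoint=False)
rich = [np.sin(2 * np.pi * f * t) for f in (3, 7, 11)]   # multi-tone excitation
poor = [np.zeros_like(t) for _ in range(3)]              # uninformative input
assert k_spectral_metric(rich) > k_spectral_metric(poor)
```

This mirrors the system-identification intuition the abstract borrows: datasets whose internal signals carry more spectral energy excite more of each (locally linear) layer's dynamics.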

[AI-39] Coalitions of AI-based Methods Predict 15-Year Risks of Breast Cancer Metastasis Using Real-World Clinical Data with AUC up to 0.9

链接: https://arxiv.org/abs/2408.16256
作者: Xia Jiang,Yijun Zhou,Alan Wells,Adam Brufsky
关键词-EN: breast cancers newly, Breast cancer, cancers responsible, Breast, deaths
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Breast cancer is one of the two cancers responsible for the most deaths in women, with about 42,000 deaths each year in the US. That there are over 300,000 breast cancers newly diagnosed each year suggests that only a fraction of the cancers result in mortality. Thus, most of the women undergo seemingly curative treatment for localized cancers, but a significant fraction later succumb to metastatic disease, for which current treatments are only temporizing for the vast majority. The current prognostic metrics are of little actionable value for 4 of the 5 women seemingly cured after local treatment, and many women are exposed to morbid and even mortal adjuvant therapies unnecessarily, with these adjuvant therapies reducing metastatic recurrence by only a third. Thus, there is a need for better prognostics to target aggressive treatment at those who are likely to relapse and spare those who were actually cured. While there is a plethora of molecular and tumor-marker assays in use and under development to detect recurrence early, these are time-consuming, expensive, and still often unvalidated as to actionable prognostic utility. A different approach would use large data techniques to determine clinical and histopathological parameters that would provide accurate prognostics using existing data. Herein, we report on machine learning, together with grid search and Bayesian Networks, to develop algorithms that present an AUC of up to 0.9 in ROC analyses, using only extant data. Such algorithms could be rapidly translated to clinical management as they do not require testing beyond routine tumor evaluations.
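The AUC reported here is the standard ROC AUC, which equals the Mann-Whitney probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counting half). A small self-contained sketch of that computation (the study's clinical features and fitted models are not reproduced here):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Perfect separation gives AUC = 1; inverted scores give 0.
assert roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]) == 1.0
assert roc_auc([0, 1, 0, 1], [0.9, 0.1, 0.8, 0.2]) == 0.0
```

An AUC of 0.9 therefore means a 90% chance the model ranks a relapsing patient above a non-relapsing one, which is the sense in which the reported prognostic is actionable.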

[AI-40] Enhancing Conditional Image Generation with Explainable Latent Space Manipulation

链接: https://arxiv.org/abs/2408.16232
作者: Kshitij Pathania
关键词-EN: gradient-based selective attention, Selective Attention Manipulation, gradient-based selective, selective attention, selective attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages , 5 figures

点击查看摘要

Abstract:In the realm of image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes a novel approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross attention maps of the cross attention layers and gradients for the denoised latent vector, deriving importance scores of elements of denoised latent vector related to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Frechet Inception Distance (FID) scores compared to baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation, offering a significant advancement in text-to-image synthesis tasks.

[AI-41] LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

链接: https://arxiv.org/abs/2408.16224
作者: Jingyi Wang,Jianzhong Ju,Jian Luan,Zhidong Deng
关键词-EN: typically employ vision, employ vision encoders, vision encoders based, Vision Transformer, Recent advances
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM’s performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding. Code and data would be available.

[AI-42] M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

链接: https://arxiv.org/abs/2408.16213
作者: Jonggwon Park,Soobum Kim,Byungmu Yoon,Jihun Hyun,Kyoyun Choi
关键词-EN: large language models, including healthcare, artificial intelligence, impacted various domains, rapid evolution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR’s versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.

[AI-43] Short-Term Electricity-Load Forecasting by Deep Learning: A Comprehensive Survey

链接: https://arxiv.org/abs/2408.16202
作者: Qi Dong,Rubing Huang,Chenhui Cui,Dave Towey,Ling Zhou,Jinyu Tian,Jianzhou Wang
关键词-EN: Short-Term Electricity-Load Forecasting, Short-Term Electricity-Load, power system, STELF, impact electricity demand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Short-Term Electricity-Load Forecasting (STELF) refers to the prediction of the immediate demand (in the next few hours to several days) for the power system. Various external factors, such as weather changes and the emergence of new electricity consumption scenarios, can impact electricity demand, causing load data to fluctuate and become non-linear, which increases the complexity and difficulty of STELF. In the past decade, deep learning has been applied to STELF, modeling and predicting electricity demand with high accuracy, and contributing significantly to the development of STELF. This paper provides a comprehensive survey on deep-learning-based STELF over the past ten years. It examines the entire forecasting process, including data pre-processing, feature extraction, deep-learning modeling and optimization, and results evaluation. This paper also identifies some research challenges and potential research directions to be further investigated in future work.

[AI-44] PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Birds-Eye-View

链接: https://arxiv.org/abs/2408.16200
作者: Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao
关键词-EN: polar BEV representation, Cartesian BEV representation, polar BEV, BEV representation, BEV
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Recently, LSS-based multi-view 3D object detection provides an economical and deployment-friendly solution for autonomous driving. However, all the existing LSS-based methods transform multi-view image features into a Cartesian Bird’s-Eye-View (BEV) representation, which does not take into account the non-uniform image information distribution and hardly exploits the view symmetry. In this paper, in order to adapt to the image information distribution and preserve the view symmetry by regular convolution, we propose to employ the polar BEV representation to substitute the Cartesian BEV representation. To achieve this, we elaborately tailor three modules: a polar view transformer to generate the polar BEV representation, a polar temporal fusion module for fusing historical polar BEV features and a polar detection head to predict the polar-parameterized representation of the object. In addition, we design a 2D auxiliary detection head and a spatial attention enhancement module to improve the quality of feature extraction in perspective view and BEV, respectively. Finally, we integrate the above improvements into a novel multi-view 3D object detector, PolarBEVDet. Experiments on nuScenes show that PolarBEVDet achieves superior performance. The code is available at this https URL.
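The core change, substituting a polar for a Cartesian BEV grid, amounts to binning BEV locations by range and azimuth rather than by x and y, so each azimuth bin gets equal angular resolution regardless of distance. A minimal sketch with illustrative bin counts (the paper's exact discretization is not given in the abstract):

```python
import numpy as np

def cartesian_to_polar_bins(x, y, n_range_bins, n_azimuth_bins, max_range):
    """Map BEV Cartesian coordinates to (range, azimuth) bin indices."""
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)                     # angle in [-pi, pi]
    r_idx = np.clip((r / max_range * n_range_bins).astype(int),
                    0, n_range_bins - 1)
    a_idx = ((theta + np.pi) / (2 * np.pi) * n_azimuth_bins).astype(int) % n_azimuth_bins
    return r_idx, a_idx

x = np.array([1.0, 0.0, -1.0])
y = np.array([0.0, 1.0, 0.0])
r_idx, a_idx = cartesian_to_polar_bins(x, y, n_range_bins=64,
                                       n_azimuth_bins=128, max_range=50.0)
assert r_idx.max() < 64 and a_idx.max() < 128
```

A rotation of the scene then becomes a circular shift along the azimuth axis, which is the view symmetry that regular convolutions can exploit on the polar grid.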

[AI-45] Real-Time Energy Pricing in New Zealand: An Evolving Stream Analysis PRICAI

链接: https://arxiv.org/abs/2408.16187
作者: Yibin Sun,Heitor Murilo Gomes,Bernhard Pfahringer,Albert Bifet
关键词-EN: Electricity Market Information, Market Information, Electricity Market, representing real-time time-series, Zealand government
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 Pages, 8 figures, short version accepted by PRICAI

点击查看摘要

Abstract:This paper introduces a group of novel datasets representing real-time time-series and streaming data of energy prices in New Zealand, sourced from the Electricity Market Information (EMI) website maintained by the New Zealand government. The datasets are intended to address the scarcity of proper datasets for streaming regression learning tasks. We conduct extensive analyses and experiments on these datasets, covering preprocessing techniques, regression tasks, prediction intervals, concept drift detection, and anomaly detection. Our experiments demonstrate the datasets’ utility and highlight the challenges and opportunities for future research in energy price forecasting.

[AI-46] LLM-assisted Labeling Function Generation for Semantic Type Detection VLDB’24

链接: https://arxiv.org/abs/2408.16173
作者: Chenjie Li,Dan Zhang,Jin Wang
关键词-EN: Detecting semantic types, semantic type detection, Detecting semantic, important application, semantic type
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: VLDB’24-DATAI

点击查看摘要

Abstract:Detecting semantic types of columns in data lake tables is an important application. A key bottleneck in semantic type detection is the availability of human annotation due to the inherent complexity of data lakes. In this paper, we propose using programmatic weak supervision to assist in annotating the training data for semantic type detection by leveraging labeling functions. One challenge in this process is the difficulty of manually writing labeling functions due to the large volume and low quality of the data lake table datasets. To address this issue, we explore employing Large Language Models (LLMs) for labeling function generation and introduce several prompt engineering strategies for this purpose. We conduct experiments on real-world web table datasets. Based on the initial results, we perform extensive analysis and provide empirical insights and future directions for researchers in this field.
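Programmatic weak supervision combines many noisy labeling functions, each voting for a label or abstaining, into training labels. The Snorkel-style sketch below uses invented column types and heuristics; in the paper such functions are generated by an LLM rather than written by hand.

```python
# Hypothetical labeling functions for semantic column-type detection.
ABSTAIN, DATE, PRICE = -1, 0, 1

def lf_looks_like_date(values):
    # Vote DATE if most values look like ISO dates (crude heuristic).
    hits = sum("-" in v and v[:4].isdigit() for v in values)
    return DATE if hits > len(values) / 2 else ABSTAIN

def lf_currency_symbol(values):
    # Vote PRICE if any value carries a dollar sign.
    return PRICE if any(v.startswith("$") for v in values) else ABSTAIN

def majority_vote(column, lfs):
    """Aggregate labeling-function votes for one column, ignoring abstentions."""
    votes = [v for v in (lf(column) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

col = ["2023-01-05", "2023-02-11", "2023-03-19"]
assert majority_vote(col, [lf_looks_like_date, lf_currency_symbol]) == DATE
```

Frameworks like Snorkel replace the majority vote with a learned label model, but the pipeline shape is the same: many cheap heuristics stand in for expensive human annotation.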

[AI-47] Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network

链接: https://arxiv.org/abs/2408.16169
作者: Duncan Taylor,Melissa Humphries
关键词-EN: DNA profiles, DNA profile electrophoretic, DNA, electrophoretic signal measuring, signal measuring fluorescence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 9 Figures

点击查看摘要

Abstract:DNA profiles are made up from multiple series of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts ‘read’ DNA profiles using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network, ANN, to carry out the task of classifying fluorescence types into categories in DNA profile electrophoretic signal. But the creation of the necessarily large amount of labelled training data for the ANN is time consuming and expensive, and a limiting factor in the ability to robustly train the ANN. If realistic, prelabelled, training data could be simulated then this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task. With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information, and then use the generator from the GAN as a ‘realism filter’ that applies the noise and artefact elements exhibited in typical electrophoretic signal.

[AI-48] FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench

链接: https://arxiv.org/abs/2408.16163
作者: Aman Priyanshu,Supriti Vijay
关键词-EN: Large Language Models, Large Language, Language Models, paper introduces, framework for evaluating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 tables

点击查看摘要

Abstract:This paper introduces FRACTURED-SORRY-Bench, a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions. Our approach achieves a maximum increase of +46.22% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods. We demonstrate that this technique poses a challenge to current LLM safety measures and highlights the need for more robust defenses against subtle, multi-turn attacks.

[AI-49] Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation Optimization and Evaluation INTERSPEECH2024

链接: https://arxiv.org/abs/2408.16126
作者: Ke Chen,Jiaqi Su,Taylor Berg-Kirkpatrick,Shlomo Dubnov,Zeyu Jin
关键词-EN: Achieving robust speech, Achieving robust, open challenge, robust speech separation, overlapping speakers
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: In Proceedings of the 25th Annual Conference of the International Speech Communication Association, Interspeech 2024

点击查看摘要

Abstract:Achieving robust speech separation for overlapping speakers in various acoustic environments with noise and reverberation remains an open challenge. Although existing datasets are available to train separators for specific scenarios, they do not effectively generalize across diverse real-world scenarios. In this paper, we present a novel data simulation pipeline that produces diverse training data from a range of acoustic environments and content, and propose new training paradigms to improve quality of a general speech separation model. Specifically, we first introduce AC-SIM, a data simulation pipeline that incorporates broad variations in both content and acoustics. Then we integrate multiple training objectives into the permutation invariant training (PIT) to enhance separation quality and generalization of the trained model. Finally, we conduct comprehensive objective and human listening experiments across separation architectures and benchmarks to validate our methods, demonstrating substantial improvement of generalization on both non-homologous and real-world test sets.
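Permutation invariant training (PIT), into which the paper integrates its multiple objectives, evaluates the loss under every assignment of estimated sources to reference speakers and keeps the best one, so the separator is not penalized for outputting sources in a different order. A minimal MSE-based sketch:

```python
import numpy as np
from itertools import permutations

def pit_loss(est, ref):
    """Permutation invariant loss: minimum MSE over speaker assignments.

    est, ref: arrays of shape (n_sources, n_samples).
    """
    n = est.shape[0]
    best = np.inf
    for perm in permutations(range(n)):
        loss = np.mean((est[list(perm)] - ref) ** 2)
        best = min(best, loss)
    return best

ref = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
swapped = ref[::-1]                       # same sources, swapped order
assert pit_loss(swapped, ref) == 0.0      # PIT ignores the ordering
assert pit_loss(np.zeros_like(ref), ref) > 0.0
```

The brute-force search over permutations is exponential in the number of speakers, which is acceptable for the two-to-three-speaker setups typical of separation benchmarks; the paper's choice of per-pair objective is not specified here, so plain MSE is used as a stand-in.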

[AI-50] ChartEye: A Deep Learning Framework for Chart Information Extraction

链接: https://arxiv.org/abs/2408.16123
作者: Osama Mustafa,Muhammad Khizer Ali,Momina Moetesum,Imran Siddiqi
关键词-EN: inspired recent research, automated chart understanding, data visualization, domains has inspired, inspired recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 Pages, and 11 Figures

点击查看摘要

Abstract:The widespread use of charts and infographics as a means of data visualization in various domains has inspired recent research in automated chart understanding. However, information extraction from chart images is a complex multitasked process due to style variations and, as a consequence, it is challenging to design an end-to-end system. In this study, we propose a deep learning-based framework that provides a solution for key steps in the chart information extraction pipeline. The proposed framework utilizes hierarchical vision transformers for the tasks of chart-type and text-role classification, and YOLOv7 for text detection. The detected text is then enhanced using Super Resolution Generative Adversarial Networks to improve the recognition output of the OCR. Experimental results on a benchmark dataset show that our proposed framework achieves excellent performance at every stage with F1-scores of 0.97 for chart-type classification, 0.91 for text-role classification, and a mean Average Precision of 0.95 for text detection.

[AI-51] Data Formulator 2: Iteratively Creating Rich Visualizations with AI

链接: https://arxiv.org/abs/2408.16119
作者: Chenglong Wang,Bongshin Lee,Steven Drucker,Dan Marshall,Jianfeng Gao
关键词-EN: Data Formulator, create rich visualizations, data, data transformation, create rich
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification to achieve their goals. To achieve this, analysts need not only proficiency in data transformation and visualization tools but also efforts to manage the branching history consisting of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs’ code generation ability. However, these systems do not work well for iterative visualization authoring, because they often require analysts to provide, in a single turn, a text-only prompt that fully describes the complex visualization task to be performed, which is unrealistic to both users and models in many cases. In this paper, we present Data Formulator 2, an LLM-powered visualization system to address these challenges. With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformation is delegated to AI. To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don’t need to start from scratch every time. In a user study with eight participants, we observed that Data Formulator 2 allows participants to develop their own iteration strategies to complete challenging data exploration sessions.

[AI-52] Ensuring Equitable Financial Decisions: Leveraging Counterfactual Fairness and Deep Learning for Bias

链接: https://arxiv.org/abs/2408.16088
作者: Saish Shinde
关键词-EN: machine learning models, recent years due, learning models, machine learning, raised in recent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Concerns regarding fairness and bias have been raised in recent years due to the growing use of machine learning models in crucial decision-making processes, especially when it comes to delicate characteristics like gender. In order to address biases in machine learning models, this research paper investigates advanced bias mitigation techniques, with a particular focus on counterfactual fairness in conjunction with data augmentation. The study looks into how these integrated approaches can lessen gender bias in the financial industry, specifically in loan approval procedures. We show that these approaches are effective in achieving more equitable results through thorough testing and assessment on a skewed financial dataset. The findings emphasize how crucial it is to use fairness-aware techniques when creating machine learning models in order to guarantee morally righteous and impartial decision-making.
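The counterfactual-fairness idea described above can be illustrated in a few lines (our sketch, not the paper's code; it flips the protected attribute directly and omits the causal-model machinery that full counterfactual fairness requires):

```python
def counterfactually_fair(model, x, protected_idx, values):
    """Check a single decision for counterfactual fairness.

    Flip the protected attribute across all its possible values and
    test whether the model's decision stays constant. (Simplified:
    proper counterfactual fairness propagates the flip through a
    causal model, which this sketch omits.)
    """
    predictions = set()
    for v in values:
        counterfactual = list(x)
        counterfactual[protected_idx] = v
        predictions.add(model(counterfactual))
    return len(predictions) == 1

# A toy loan model that ignores the protected attribute (index 0)
# passes the check; one that keys on it fails.
fair = lambda x: int(x[1] > 0.5)                  # decides on income only
biased = lambda x: int(x[1] > 0.5 or x[0] == 1)   # also keys on gender
applicant = [0, 0.4]                              # gender=0, low income
print(counterfactually_fair(fair, applicant, 0, [0, 1]))    # True
print(counterfactually_fair(biased, applicant, 0, [0, 1]))  # False
```

The check is per-individual; a model passes overall only if every applicant's decision is invariant to the flip.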

[AI-53] Logic-Enhanced Language Model Agents for Trustworthy Social Simulations

链接: https://arxiv.org/abs/2408.16081
作者: Agnieszka Mensfelt,Kostas Stathis,Vince Trencsenyi
关键词-EN: utilize large language, Language Model Agents, Logic-Enhanced Language Model, large language models, introduce the Logic-Enhanced
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Logic in Computer Science (cs.LO)
*备注: Source code: this https URL

点击查看摘要

Abstract:We introduce the Logic-Enhanced Language Model Agents (LELMA) framework, a novel approach to enhance the trustworthiness of social simulations that utilize large language models (LLMs). While LLMs have gained attention as agents for simulating human behaviour, their applicability in this role is limited by issues such as inherent hallucinations and logical inconsistencies. LELMA addresses these challenges by integrating LLMs with symbolic AI, enabling logical verification of the reasoning generated by LLMs. This verification process provides corrective feedback, refining the reasoning output. The framework consists of three main components: an LLM-Reasoner for producing strategic reasoning, an LLM-Translator for mapping natural language reasoning to logic queries, and a Solver for evaluating these queries. This study focuses on decision-making in game-theoretic scenarios as a model of human interaction. Experiments involving the Hawk-Dove game, Prisoner’s Dilemma, and Stag Hunt highlight the limitations of state-of-the-art LLMs, GPT-4 Omni and Gemini 1.0 Pro, in producing correct reasoning in these contexts. LELMA demonstrates high accuracy in error detection and improves the reasoning correctness of LLMs via self-refinement, particularly in GPT-4 Omni.

[AI-54] Verification methods for international AI agreements

链接: https://arxiv.org/abs/2408.16074
作者: Akash R. Wasil,Tom Reed,Jack William Miller,Peter Barnett
关键词-EN: verify compliance, methods, verification methods, verification, FLOP threshold
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:What techniques can be used to verify compliance with international agreements about advanced AI development? In this paper, we examine 10 verification methods that could detect two types of potential violations: unauthorized AI training (e.g., training runs above a certain FLOP threshold) and unauthorized data centers. We divide the verification methods into three categories: (a) national technical means (methods requiring minimal or no access from suspected non-compliant nations), (b) access-dependent methods (methods that require approval from the nation suspected of unauthorized activities), and (c) hardware-dependent methods (methods that require rules around advanced hardware). For each verification method, we provide a description, historical precedents, and possible evasion techniques. We conclude by offering recommendations for future work related to the verification and enforcement of international AI governance agreements.

[AI-55] Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings

链接: https://arxiv.org/abs/2408.16073
作者: Leo Yeykelis,Kaavya Pichai,James J. Cummings,Byron Reeves
关键词-EN: large language models, expedite accurate replication, published message effects, language models, Anthropic Claude Sonnet
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 24 pages, 3 figures, 2 tables

点击查看摘要

Abstract:This report analyzes the potential for large language models (LLMs) to expedite accurate replication of published message effects studies. We tested LLM-powered participants (personas) by replicating 133 experimental findings from 14 papers containing 45 recent studies in the Journal of Marketing (January 2023-May 2024). We used a new software tool, Viewpoints AI (this https URL), that takes study designs, stimuli, and measures as input, automatically generates prompts for LLMs to act as a specified sample of unique personas, and collects their responses to produce a final output in the form of a complete dataset and statistical analysis. The underlying LLM used was Anthropic’s Claude Sonnet 3.5. We generated 19,447 AI personas to replicate these studies with the exact same sample attributes, study designs, stimuli, and measures reported in the original human research. Our LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication of studies in which people respond to media stimuli. When including interaction effects, the overall replication rate was 68% (90 out of 133). The use of LLMs to replicate and accelerate marketing research on media effects is discussed with respect to the replication crisis in social science, potential solutions to generalizability problems in sampling subjects and experimental conditions, and the ability to rapidly test consumer responses to various media stimuli. We also address the limitations of this approach, particularly in replicating complex interaction effects in media response studies, and suggest areas for future research and improvement in AI-assisted experimental replication of media effects.

[AI-56] Efficient k-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures

链接: https://arxiv.org/abs/2408.16036
作者: Ala-Eddine Benrazek,Zineddine Kouahla,Brahim Farou,Hamid Seridi,Ibtissem Kemouguette
关键词-EN: Big IoT Data, Internet of Things, Big IoT, data space, proliferation of interconnected
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Performance (cs.PF)
*备注: 28 pages, 21 figures, 1 table

点击查看摘要

Abstract:The proliferation of interconnected devices in the Internet of Things (IoT) has led to an exponential increase in data, commonly known as Big IoT Data. Efficient retrieval of this heterogeneous data demands a robust indexing mechanism for effective organization. However, a significant challenge remains: the overlap in data space partitions during index construction. This overlap increases node access during search and retrieval, resulting in higher resource consumption, performance bottlenecks, and impedes system scalability. To address this issue, we propose three innovative heuristics designed to quantify and strategically reduce data space partition overlap. The volume-based method (VBM) offers a detailed assessment by calculating the intersection volume between partitions, providing deeper insights into spatial relationships. The distance-based method (DBM) enhances efficiency by using the distance between partition centers and radii to evaluate overlap, offering a streamlined yet accurate approach. Finally, the object-based method (OBM) provides a practical solution by counting objects across multiple partitions, delivering an intuitive understanding of data space dynamics. Experimental results demonstrate the effectiveness of these methods in reducing search time, underscoring their potential to improve data space partitioning and enhance overall system performance.
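The distance- and object-based heuristics can be pictured for ball-shaped partitions as follows (a toy rendering of the DBM and OBM ideas; the paper's actual index structures and formulas are not reproduced here):

```python
import math

def dbm_overlap(c1, r1, c2, r2):
    """Distance-based heuristic (DBM) for two ball-shaped partitions.

    With centers c1, c2 and covering radii r1, r2, the balls overlap
    exactly when the center distance is below the radius sum; a positive
    return value measures the overlap depth, non-positive means disjoint.
    """
    return (r1 + r2) - math.dist(c1, c2)

def obm_overlap(objects, partitions):
    """Object-based heuristic (OBM): count objects covered by >1 partition."""
    count = 0
    for obj in objects:
        covering = sum(1 for c, r in partitions if math.dist(obj, c) <= r)
        if covering > 1:
            count += 1
    return count

# Two partitions whose covering balls intersect by depth 1.0:
print(dbm_overlap((0.0, 0.0), 2.0, (3.0, 0.0), 2.0))  # 1.0
```

The volume-based heuristic (VBM) would instead integrate the intersection volume of the two balls, trading extra computation for a finer overlap measure.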

[AI-57] An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

链接: https://arxiv.org/abs/2408.16032
作者: Shuang Feng,Grace Feng
关键词-EN: understanding webpage contexts, enabled understanding webpage, Recent advancements, large language models, webpage contexts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity – a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO) as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates the RL agents trained using generative trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments demonstrate that agents trained on generated trajectories exhibited comparable task performance to those trained using human trajectories. This has demonstrated an example of an extremely low-cost data-efficient way of training reinforcement learning agents. Also, with limited training time (2 hours), without utilizing any images, a DPO agent achieved a 19% success rate after approximately 3000 steps or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate.
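For reference, the DPO objective mentioned above can be written down on a single preference pair (a generic sketch of the standard DPO loss, not the report's actual training code):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on a single preference pair.

    logp_w / logp_l are the policy's log-probabilities of the chosen
    and rejected responses; ref_logp_* come from the frozen reference
    model. A generic sketch, not the report's training code.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss falls as the policy prefers the chosen response more
# strongly than the reference model does:
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
learned = dpo_loss(-8.0, -12.0, -10.0, -10.0)
assert learned < no_preference
```

Because the reference log-probabilities are fixed, minimizing this loss needs no separate reward model, which is exactly what makes DPO cheap relative to PPO-style pipelines.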

[AI-58] EMP: Enhance Memory in Data Pruning

链接: https://arxiv.org/abs/2408.16031
作者: Jinying Xiao,Ping Li,Jie Nie,Zhe Tang
关键词-EN: shown strong performance, fine-tuning costs, research has shifted, memory, shown strong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning. Previous methods used sample loss as an evaluation criterion, aiming to select the most “difficult” samples for training. However, when the pruning rate increases, the number of times each sample is trained becomes more evenly distributed, which causes many critical or general samples to not be effectively fitted. We refer to this as Low-Frequency Learning (LFL). In other words, LFL prevents the model from remembering most samples. In our work, we decompose the scoring function of LFL, provide a theoretical explanation for the inefficiency of LFL, and propose adding a memory term to the scoring function to enhance the model’s memory capability, along with an approximation of this memory term. Similarly, we explore memory in Self-Supervised Learning (SSL), marking the first discussion on SSL memory. Using contrastive learning, we derive the memory term both theoretically and experimentally. Finally, we propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model’s memory of data, thereby improving its performance. We evaluated the performance of EMP in tasks such as image classification, natural language understanding, and model pre-training. The results show that EMP can improve model performance under extreme pruning rates. For example, in the CIFAR100-ResNet50 pre-training task, with 70% pruning, EMP outperforms current methods by 2.2%.
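The memory-augmented scoring idea can be sketched as follows (the 1/(1+n) memory term and the gamma weighting are our illustration, not the paper's exact approximation):

```python
def emp_score(loss, times_trained, gamma=0.5):
    """Difficulty score with an added memory term.

    Plain loss-based pruning keeps only the 'hardest' samples; the idea
    sketched here is to also reward samples the model has rarely fitted.
    The 1/(1+n) memory term and gamma weighting are illustrative, not
    the paper's exact approximation.
    """
    memory = 1.0 / (1.0 + times_trained)  # rarely-fitted samples score higher
    return loss + gamma * memory

def prune(samples, keep_ratio):
    """Keep the top-scoring fraction of (loss, times_trained) pairs."""
    ranked = sorted(samples, key=lambda s: emp_score(*s), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# A low-loss but never-fitted sample survives pruning ahead of a
# low-loss, frequently-trained one:
samples = [(0.1, 10), (0.9, 10), (0.2, 0)]
assert prune(samples, 0.67) == [(0.9, 10), (0.2, 0)]
```

Under pure loss ranking the (0.2, 0) sample would be dropped first; the memory term is what keeps it in the training set.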

[AI-59] A Deep Learning Approach to Localizing Multi-level Airway Collapse Based on Snoring Sounds

链接: https://arxiv.org/abs/2408.16030
作者: Ying-Chieh Hsu,Stanley Yung-Chuan Liu,Chao-Jung Huang,Chi-Wei Wu,Ren-Kai Cheng,Jane Yung-Jen Hsu,Shang-Ran Huang,Yuan-Ren Cheng,Fu-Shun Hsu
关键词-EN: obstructive sleep apnea, drug-induced sleep endoscopy, classify snoring sounds, snoring sounds excited, Support Vector Machine
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This study investigates the application of machine/deep learning to classify snoring sounds excited at different levels of the upper airway in patients with obstructive sleep apnea (OSA) using data from drug-induced sleep endoscopy (DISE). The snoring sounds of 39 subjects were analyzed and labeled according to the Velum, Oropharynx, Tongue Base, and Epiglottis (VOTE) classification system. The dataset, comprising 5,173 one-second segments, was used to train and test models, including Support Vector Machine (SVM), Bidirectional Long Short-Term Memory (BiLSTM), and ResNet-50. The ResNet-50, a convolutional neural network (CNN), showed the best overall performance in classifying snoring acoustics, particularly in identifying multi-level obstructions. The study emphasizes the potential of integrating snoring acoustics with deep learning to improve the diagnosis and treatment of OSA. However, challenges such as limited sample size, data imbalance, and differences between pharmacologically induced and natural snoring sounds were noted, suggesting further research to enhance model accuracy and generalizability.

[AI-60] Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis

链接: https://arxiv.org/abs/2408.16029
作者: Sijie Mai,Yu Zhao,Ying Zeng,Jianhua Yao,Haifeng Hu
关键词-EN: sentiment analysis aims, effectively integrate information, Multimodal sentiment analysis, Multimodal, unimodal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal sentiment analysis aims to effectively integrate information from various sources to infer sentiment, where in many cases there are no annotations for unimodal labels. Therefore, most works rely on multimodal labels for training. However, there exists the noisy label problem for the learning of unimodal signals as multimodal annotations are not always the ideal substitutes for the unimodal ones, failing to achieve finer optimization for individual modalities. In this paper, we explore the learning of unimodal labels under the weak supervision from the annotated multimodal labels. Specifically, we propose a novel meta uni-label generation (MUG) framework to address the above problem, which leverages the available multimodal labels to learn the corresponding unimodal labels by the meta uni-label correction network (MUCN). We first design a contrastive-based projection module to bridge the gap between unimodal and multimodal representations, so as to use multimodal annotations to guide the learning of MUCN. Afterwards, we propose unimodal and multimodal denoising tasks to train MUCN with explicit supervision via a bi-level optimization strategy. We then jointly train unimodal and multimodal learning tasks to extract discriminative unimodal features for multimodal inference. Experimental results suggest that MUG outperforms competitive baselines and can learn accurate unimodal labels.

[AI-61] oward Time-Continuous Data Inference in Sparse Urban CrowdSensing

链接: https://arxiv.org/abs/2408.16027
作者: Ziyu Sun,Haoyang Su,Hanqi Sun,En Wang,Wenbin Liu
关键词-EN: Mobile Crowd Sensing, leverages mobile users, Mobile Crowd, smart portable devices, leverages mobile
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 11 pages, 11 figures

点击查看摘要

Abstract:Mobile Crowd Sensing (MCS) is a promising paradigm that leverages mobile users and their smart portable devices to perform various real-world tasks. However, due to budget constraints and the inaccessibility of certain areas, Sparse MCS has emerged as a more practical alternative, collecting data from a limited number of target subareas and utilizing inference algorithms to complete the full sensing map. While existing approaches typically assume a time-discrete setting with data remaining constant within each sensing cycle, this simplification can introduce significant errors, especially when dealing with long cycles, as real-world sensing data often changes continuously. In this paper, we go from fine-grained completion, i.e., the subdivision of sensing cycles into minimal time units, towards a more accurate, time-continuous completion. We first introduce Deep Matrix Factorization (DMF) as a neural network-enabled framework and enhance it with a Recurrent Neural Network (RNN-DMF) to capture temporal correlations in these finer time slices. To further deal with the continuous data, we propose TIME-DMF, which captures temporal information across unequal intervals, enabling time-continuous completion. Additionally, we present the Query-Generate (Q-G) strategy within TIME-DMF to model the infinite states of continuous data. Extensive experiments across five types of sensing tasks demonstrate the effectiveness of our models and the advantages of time-continuous completion.
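The completion step can be illustrated with a bare-bones matrix factorization (linear factors stand in for DMF's neural ones; the variable names and hyperparameters are ours):

```python
import numpy as np

def complete_map(M, mask, rank=1, lr=0.02, steps=5000, seed=0):
    """Minimal matrix-factorization stand-in for DMF-style completion.

    M is the partial sensing map (subareas x time slices) with zeros at
    unobserved cells; mask is 1 where a reading was collected. We fit
    M ~ U @ V on the observed entries by gradient descent and return the
    full reconstruction. DMF/RNN-DMF replace these linear factors with
    neural networks; this sketch keeps only the core idea.
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(rank, m))
    for _ in range(steps):
        err = (U @ V - M) * mask        # error only on observed cells
        U = U - lr * err @ V.T
        V = V - lr * U.T @ err
    return U @ V

# Hide one reading of a rank-1 sensing map and recover it.
M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 0.0]])        # true value at (1, 2) is 6
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
recovered = complete_map(M, mask)
```

The RNN-DMF and TIME-DMF variants described above additionally condition the factors on when each reading was taken, which is what enables completion at arbitrary time points rather than fixed cycles.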

[AI-62] XG-NID: Dual-Modality Network Intrusion Detection using a Heterogeneous Graph Neural Network and Large Language Model

链接: https://arxiv.org/abs/2408.16021
作者: Yasir Ali Farrukh,Syed Wali,Irfan Khan,Nathaniel D. Bastian
关键词-EN: rapidly evolving field, largely untapped area, heterogeneous graph structure, flow-level and packet-level, intrusion detection remains
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 6 figures

点击查看摘要

Abstract:In the rapidly evolving field of cybersecurity, the integration of flow-level and packet-level information for real-time intrusion detection remains a largely untapped area of research. This paper introduces “XG-NID,” a novel framework that, to the best of our knowledge, is the first to fuse flow-level and packet-level data within a heterogeneous graph structure, offering a comprehensive analysis of network traffic. Leveraging a heterogeneous graph neural network (GNN) with graph-level classification, XG-NID uniquely enables real-time inference while effectively capturing the intricate relationships between flow and packet payload data. Unlike traditional GNN-based methodologies that predominantly analyze historical data, XG-NID is designed to accommodate the heterogeneous nature of network traffic, providing a robust and real-time defense mechanism. Our framework extends beyond mere classification; it integrates Large Language Models (LLMs) to generate detailed, human-readable explanations and suggest potential remedial actions, ensuring that the insights produced are both actionable and comprehensible. Additionally, we introduce a new set of flow features based on temporal information, further enhancing the contextual and explainable inferences provided by our model. To facilitate practical application and accessibility, we developed “GNN4ID,” an open-source tool that enables the extraction and transformation of raw network traffic into the proposed heterogeneous graph structure, seamlessly integrating flow and packet-level data. Our comprehensive quantitative comparative analysis demonstrates that XG-NID achieves an F1 score of 97% in multi-class classification, outperforming existing baseline and state-of-the-art methods. This sets a new standard in Network Intrusion Detection Systems by combining innovative data fusion with enhanced interpretability and real-time capabilities.
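The flow/packet fusion step can be pictured with plain dictionaries (an illustrative schema only; GNN4ID's real node and edge types and feature names are not reproduced here):

```python
def build_hetero_graph(flows):
    """Fuse flow- and packet-level records into one heterogeneous graph.

    Plain dicts stand in for a GNN library; the node/edge schema and
    feature names are illustrative, not GNN4ID's actual format.
    """
    graph = {"nodes": {"flow": [], "packet": []}, "edges": []}
    for i, flow in enumerate(flows):
        # One flow node carrying the flow-level statistics...
        graph["nodes"]["flow"].append({"id": i, **flow["stats"]})
        for payload in flow["packets"]:
            # ...linked to one packet node per packet payload.
            pid = len(graph["nodes"]["packet"])
            graph["nodes"]["packet"].append({"id": pid, "payload_len": len(payload)})
            graph["edges"].append(("flow", i, "contains", "packet", pid))
    return graph
```

A two-packet flow thus yields one flow node, two packet nodes, and two `contains` edges; a heterogeneous GNN then classifies the whole graph rather than individual flows.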

[AI-63] SPICED: Syntactical Bug and Trojan Pattern Identification in A/MS Circuits using LLM-Enhanced Detection

链接: https://arxiv.org/abs/2408.16018
作者: Jayeeta Chaudhuri,Dhruv Thapar,Arjun Chaudhuri,Farshad Firouzi,Krishnendu Chakrabarty
关键词-EN: playing key roles, modern electronics, playing key, signal processing, crucial in modern
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at PAINE’24

点击查看摘要

Abstract:Analog and mixed-signal (A/MS) integrated circuits (ICs) are crucial in modern electronics, playing key roles in signal processing, amplification, sensing, and power management. Many IC companies outsource manufacturing to third-party foundries, creating security risks such as stealthy analog Trojans. Traditional detection methods, including embedding circuit watermarks or conducting hardware-based monitoring, often impose significant area and power overheads, and may not effectively identify all types of Trojans. To address these shortcomings, we propose SPICED, a Large Language Model (LLM)-based framework that operates within the software domain, eliminating the need for hardware modifications for Trojan detection and localization. This is the first work using LLM-aided techniques for detecting and localizing syntactical bugs and analog Trojans in circuit netlists, requiring no explicit training and incurring zero area overhead. Our framework employs chain-of-thought reasoning and few-shot examples to teach anomaly detection rules to LLMs. With the proposed method, we achieve an average Trojan coverage of 93.32% and an average true positive rate of 93.4% in identifying Trojan-impacted nodes for the evaluated analog benchmark circuits. These experimental results validate the effectiveness of LLMs in detecting and locating both syntactical bugs and Trojans within analog netlists.

[AI-64] Differentially Private Publication of Electricity Time Series Data in Smart Grids

链接: https://arxiv.org/abs/2408.16017
作者: Sina Shaham,Gabriel Ghinita,Bhaskar Krishnamachari,Cyrus Shahabi
关键词-EN: energy policy decisions, study consumer behavior, guide energy policy, Smart grids, valuable data source
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Smart grids are a valuable data source to study consumer behavior and guide energy policy decisions. In particular, time-series of power consumption over geographical areas are essential in deciding the optimal placement of expensive resources (e.g., transformers, storage elements) and their activation schedules. However, publication of such data raises significant privacy issues, as it may reveal sensitive details about personal habits and lifestyles. Differential privacy (DP) is well-suited for sanitization of individual data, but current DP techniques for time series lead to significant loss in utility, due to the existence of temporal correlation between data readings. We introduce STPT (Spatio-Temporal Private Timeseries), a novel method for DP-compliant publication of electricity consumption data that analyzes spatio-temporal attributes and captures both micro and macro patterns by leveraging RNNs. Additionally, it employs a partitioning method for releasing electricity consumption time series based on identified patterns. We demonstrate through extensive experiments, on both real-world and synthetic datasets, that STPT significantly outperforms existing benchmarks, providing a well-balanced trade-off between data utility and user privacy.
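For contrast, the naive per-reading baseline that STPT improves on is just the Laplace mechanism (standard DP, not the paper's method; the sensitivity value is an assumption of this toy setup):

```python
import numpy as np

def dp_release(series, epsilon, sensitivity=1.0, seed=0):
    """Laplace-mechanism release of a consumption time series.

    The naive event-level DP baseline, not STPT: every reading receives
    independent Laplace noise of scale sensitivity/epsilon. The
    sensitivity value here is an assumption of the toy setup.
    """
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon
    return series + rng.laplace(0.0, scale, size=len(series))
```

Smaller epsilon means stronger privacy and proportionally larger noise on every reading; STPT's point is that exploiting spatio-temporal structure before adding noise recovers much of the utility this baseline destroys.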

[AI-65] Meta-Learning for Federated Face Recognition in Imbalanced Data Regimes

链接: https://arxiv.org/abs/2408.16003
作者: Arwin Gansekoele,Emiel Hess,Sandjai Bhulai
关键词-EN: concerns surrounding face, surrounding face image, growing privacy concerns, privacy concerns surrounding, Federated Face Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: To appear in the IEEE FLTA 2024 proceedings

点击查看摘要

Abstract:The growing privacy concerns surrounding face image data demand new techniques that can guarantee user privacy. One such face recognition technique that claims to achieve better user privacy is Federated Face Recognition (FFR), a subfield of Federated Learning (FL). However, FFR faces challenges due to the heterogeneity of the data, given the large number of classes that need to be handled. To overcome this problem, solutions are sought in the field of personalized FL. This work introduces three new data partitions based on the CelebA dataset, each with a different form of data heterogeneity. It also proposes Hessian-Free Model Agnostic Meta-Learning (HF-MAML) in an FFR setting. We show that HF-MAML scores higher in verification tests than current FFR models on three different CelebA data partitions. In particular, the verification scores improve the most in heterogeneous data partitions. To balance personalization with the development of an effective global model, an embedding regularization term is introduced for the loss function. This term can be combined with HF-MAML and is shown to increase global model verification performance. Lastly, this work performs a fairness analysis, showing that HF-MAML and its embedding regularization extension can improve fairness by reducing the standard deviation over the client evaluation scores.
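The Hessian-free shortcut at the heart of HF-MAML can be seen in a first-order MAML step on a toy objective (our illustration; the paper applies it to face-embedding models, not quadratics):

```python
import numpy as np

def fo_maml_step(w, tasks, inner_lr=0.1, meta_lr=0.05):
    """One first-order (Hessian-free) MAML meta-update on a toy objective.

    Each task is a target vector t with loss L_t(w) = 0.5 * ||w - t||^2.
    The inner loop adapts w once per task; the meta-update then uses the
    gradient at the adapted point directly, dropping the second-order
    (Hessian) term, which is the approximation HF-MAML-style methods
    exploit. Toy setup, not the paper's face-recognition training loop.
    """
    meta_grad = np.zeros_like(w)
    for t in tasks:
        grad = w - t                       # inner-loop gradient of L_t
        w_adapted = w - inner_lr * grad    # one adaptation step per task
        meta_grad += w_adapted - t         # first-order meta-gradient
    return w - meta_lr * meta_grad / len(tasks)
```

Repeated meta-updates drive w toward an initialization from which each task (here, each client's data distribution) is reachable in one cheap adaptation step, without ever forming a Hessian.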

[AI-66] Anchor-Controlled Generative Adversarial Network for High-Fidelity Electromagnetic and Structurally Diverse Metasurface Design

链接: https://arxiv.org/abs/2408.16231
作者: Yunhui Zeng,Hongkun Cao,Xin Jin
关键词-EN: presents significant challenges, designing free-form metasurfaces, free-form metasurfaces presents, metasurfaces presents significant, high electromagnetic response
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:In optoelectronics, designing free-form metasurfaces presents significant challenges, particularly in achieving high electromagnetic response fidelity due to the complex relationship between physical structures and electromagnetic behaviors. A key difficulty arises from the one-to-many mapping dilemma, where multiple distinct physical structures can yield similar electromagnetic responses, complicating the design process. This paper introduces a novel generative framework, the Anchor-controlled Generative Adversarial Network (AcGAN), which prioritizes electromagnetic fidelity while effectively navigating the one-to-many challenge to create structurally diverse metasurfaces. Unlike existing methods that mainly replicate physical appearances, AcGAN excels in generating a variety of structures that, despite their differences in physical attributes, exhibit similar electromagnetic responses, thereby accommodating fabrication constraints and tolerances. We introduce the Spectral Overlap Coefficient (SOC) as a precise metric to measure the spectral fidelity between generated designs and their targets. Additionally, a cluster-guided controller refines input processing, ensuring multi-level spectral integration and enhancing electromagnetic fidelity. The integration of AnchorNet into our loss function facilitates a nuanced assessment of electromagnetic qualities, supported by a dynamic loss weighting strategy that optimizes spectral alignment. Collectively, these innovations represent a transformative stride in metasurface inverse design, advancing electromagnetic response-oriented engineering and overcoming the complexities of the one-to-many mapping dilemma. Empirical evidence underscores AcGAN’s effectiveness in streamlining the design process, achieving superior electromagnetic precision, and fostering a broad spectrum of design possibilities.
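A spectral-overlap score of the kind SOC measures can be sketched as follows (the paper's exact SOC formula is not reproduced; the normalized-overlap form below is an assumption):

```python
import numpy as np

def spectral_overlap(s1, s2):
    """A plausible spectral-overlap score between two sampled spectra.

    Both non-negative spectra are normalized to unit area and the
    pointwise minimum is summed: 1.0 for identical shapes, 0.0 for
    disjoint support. Assumed form for illustration only; not the
    paper's SOC definition.
    """
    a = np.array(s1, dtype=float)
    b = np.array(s2, dtype=float)
    a /= a.sum()
    b /= b.sum()
    return float(np.minimum(a, b).sum())
```

Because the score depends only on the response curves, two structurally different metasurface designs that produce near-identical spectra both score close to 1.0 against the target, which is exactly the one-to-many behavior the framework exploits.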

[AI-67] SSDM: Scalable Speech Dysfluency Modeling

链接: https://arxiv.org/abs/2408.16221
作者: Jiachen Lian,Xuanru Zhou,Zoe Ezzes,Jet Vonk,Brittany Morin,David Baquirin,Zachary Mille,Maria Luisa Gorno Tempini,Gopala Anumanchipalli
关键词-EN: core module, module for spoken, Speech dysfluency modeling, speech therapy, dysfluency
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Speech dysfluency modeling is the core module for spoken language learning, and speech therapy. However, there are three challenges. First, current state-of-the-art solutions suffer from poor scalability. Second, there is a lack of a large-scale dysfluency corpus. Third, there is not an effective learning framework. In this paper, we propose SSDM: Scalable Speech Dysfluency Modeling, which (1) adopts articulatory gestures as scalable forced alignment; (2) introduces connectionist subsequence aligner (CSA) to achieve dysfluency alignment; (3) introduces a large-scale simulated dysfluency corpus called Libri-Dys; and (4) develops an end-to-end system by leveraging the power of large language models (LLMs). We expect SSDM to serve as a standard in the area of dysfluency modeling. Demo is available at this https URL.

[AI-68] A More Unified Theory of Transfer Learning

链接: https://arxiv.org/abs/2408.16189
作者: Steve Hanneke,Samory Kpotufe
关键词-EN: target risk decreases, source risk decreases, risk decreases, fast target risk, delta
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We show that some basic moduli of continuity δ – which measure how fast target risk decreases as source risk decreases – appear to be at the root of many of the classical relatedness measures in transfer learning and related literature. Namely, bounds in terms of δ recover many of the existing bounds in terms of other measures of relatedness – both in regression and classification – and can at times be tighter. We are particularly interested in general situations where the learner has access to both source data and some or no target data. The unified perspective allowed by the moduli δ allows us to extend many existing notions of relatedness at once to these scenarios involving target data: interestingly, while δ itself might not be efficiently estimated, adaptive procedures exist – based on reductions to confidence sets – which can get nearly tight rates in terms of δ with no prior distributional knowledge. Such adaptivity to unknown δ immediately implies adaptivity to many classical relatedness notions, in terms of combined source and target samples’ sizes.
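For orientation, one standard way such a modulus of continuity is formalized in the domain-adaptation literature is as a nondecreasing bound on target excess risk in terms of source excess risk (this is a paraphrase to fix ideas; the paper's exact definition may differ):

```latex
% Excess risks over a hypothesis class \mathcal{H}:
%   \mathcal{E}_S(h) = R_S(h) - \inf_{h'\in\mathcal{H}} R_S(h'), \quad
%   \mathcal{E}_T(h) = R_T(h) - \inf_{h'\in\mathcal{H}} R_T(h').
% A modulus of continuity is a nondecreasing \delta such that
\[
  \mathcal{E}_T(h) \;\le\; \delta\big(\mathcal{E}_S(h)\big)
  \qquad \text{for all } h \in \mathcal{H}.
\]
% A \delta that is small near 0 captures "target risk decreases fast
% as source risk decreases," which is the sense used in the abstract.
```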

[AI-69] Identification of Prognostic Biomarkers for Stage III Non-Small Cell Lung Carcinoma in Female Nonsmokers Using Machine Learning

链接: https://arxiv.org/abs/2408.16068
作者: Huili Zheng,Qimin Zhang,Yiru Gong,Zheyan Liu,Shaohan Chen
关键词-EN: cancer-related deaths globally, non-small cell lung, stage III NSCLC, cell lung cancer, Lung cancer remains
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: This paper has been accepted for publication in the IEEE ICBASE 2024 conference

点击查看摘要

Abstract:Lung cancer remains a leading cause of cancer-related deaths globally, with non-small cell lung cancer (NSCLC) being the most common subtype. This study aimed to identify key biomarkers associated with stage III NSCLC in non-smoking females using gene expression profiling from the GDS3837 dataset. Utilizing XGBoost, a machine learning algorithm, the analysis achieved a strong predictive performance with an AUC score of 0.835. The top biomarkers identified - CCAAT enhancer binding protein alpha (C/EBP-alpha), lactate dehydrogenase A4 (LDHA), UNC-45 myosin chaperone B (UNC-45B), checkpoint kinase 1 (CHK1), and hypoxia-inducible factor 1 subunit alpha (HIF-1-alpha) - have been validated in the literature as being significantly linked to lung cancer. These findings highlight the potential of these biomarkers for early diagnosis and personalized therapy, emphasizing the value of integrating machine learning with molecular profiling in cancer research.
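The study's headline number is an AUC of 0.835 from an XGBoost classifier. As a reminder of what that metric measures, here is a minimal stdlib-only sketch of AUC via the Mann-Whitney formulation (the probability that a random positive outranks a random negative); it is illustrative only and does not reproduce the study's pipeline:

```python
def roc_auc(labels, scores):
    """AUC as the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive is scored
    higher (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two positives, two negatives.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```

An AUC of 0.835 thus means a randomly chosen stage III case is ranked above a randomly chosen control about 83.5% of the time.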

[AI-70] A Tutorial on Brownian Motion for Biostatisticians

链接: https://arxiv.org/abs/2408.16011
作者: Elvis Han Cui
关键词-EN: theory for Biostatisticians, fundamental stochastic process, Brownian Motion, exploration of Brownian, in-depth exploration
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This manuscript provides an in-depth exploration of Brownian Motion, a fundamental stochastic process in probability theory for Biostatisticians. It begins with foundational definitions and properties, including the construction of Brownian motion and its Markovian characteristics. The document delves into advanced topics such as the Karhunen-Loeve expansion, reflection principles, and Levy’s modulus of continuity. Through rigorous proofs and theorems, the manuscript examines the non-differentiability of Brownian paths, the behavior of zero sets, and the significance of local time. The notes also cover important results like Donsker’s theorem and Blumenthal’s 0-1 law, emphasizing their implications in the study of stochastic processes.
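The construction via summed Gaussian increments that the tutorial starts from is easy to simulate; the sketch below (stdlib only, names ours) also checks the quadratic-variation property the manuscript discusses, namely that the quadratic variation of Brownian motion on [0, T] equals T:

```python
import math
import random

def brownian_path(n_steps, t_max=1.0, seed=0):
    """Simulate standard Brownian motion on [0, t_max] by cumulatively
    summing independent N(0, dt) increments."""
    rng = random.Random(seed)
    dt = t_max / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w

path = brownian_path(200_000)
# Empirical quadratic variation: sum of squared increments,
# which should be close to t_max = 1.0 for a fine grid.
qv = sum((b - a) ** 2 for a, b in zip(path, path[1:]))
```

For 200,000 steps the empirical quadratic variation lands within a few thousandths of 1.0, while (as the tutorial proves) the paths themselves are nowhere differentiable.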

[AI-71] Novel Methods for Analyzing Cellular Interactions in Deep Learning-Based Image Cytometry: Spatial Interaction Potential and Co-Localization Index

链接: https://arxiv.org/abs/2408.16008
作者: Toru Nagasaka,Kimihiro Yamashita,Mitsugu Fujita
关键词-EN: learning-based image cytometry, deep learning-based image, image cytometry, study presents, learning-based image
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The study presents a novel approach for quantifying cellular interactions in digital pathology using deep learning-based image cytometry. Traditional methods struggle with the diversity and heterogeneity of cells within tissues. To address this, we introduce the Spatial Interaction Potential (SIP) and the Co-Localization Index (CLI), leveraging deep learning classification probabilities. SIP assesses the potential for cell-to-cell interactions, similar to an electric field, while CLI incorporates distances between cells, accounting for dynamic cell movements. Our approach enhances traditional methods, providing a more sophisticated analysis of cellular interactions. We validate SIP and CLI through simulations and apply them to colorectal cancer specimens, demonstrating strong correlations with actual biological data. This innovative method offers significant improvements in understanding cellular interactions and has potential applications in various fields of digital pathology.

计算机视觉

[CV-0] 3D Whole-body Grasp Synthesis with Directional Controllability

链接: https://arxiv.org/abs/2408.16770
作者: Georgios Paschalidis,Romana Wilschut,Dimitrije Antić,Omid Taheri,Dimitrios Tzionas
关键词-EN: realistically grasp objects, mixed reality, whole-bodies that realistically, Synthesizing, realistically grasp
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Synthesizing 3D whole-bodies that realistically grasp objects is useful for animation, mixed reality, and robotics. This is challenging, because the hands and body need to look natural w.r.t. each other, the grasped object, as well as the local scene (i.e., a receptacle supporting the object). Only recent work tackles this, with a divide-and-conquer approach; it first generates a “guiding” right-hand grasp, and then searches for bodies that match this. However, the guiding-hand synthesis lacks controllability and receptacle awareness, so it likely has an implausible direction (i.e., a body can’t match this without penetrating the receptacle) and needs corrections through major post-processing. Moreover, the body search needs exhaustive sampling and is expensive. These are strong limitations. We tackle these with a novel method called CWGrasp. Our key idea is that performing geometry-based reasoning “early on,” instead of “too late,” provides rich “control” signals for inference. To this end, CWGrasp first samples a plausible reaching-direction vector (used later for both the arm and hand) from a probabilistic model built via raycasting from the object and collision checking. Then, it generates a reaching body with a desired arm direction, as well as a “guiding” grasping hand with a desired palm direction that complies with the arm’s one. Eventually, CWGrasp refines the body to match the “guiding” hand, while plausibly contacting the scene. Notably, generating already-compatible “parts” greatly simplifies the “whole.” Moreover, CWGrasp uniquely tackles both right- and left-hand grasps. We evaluate on the GRAB and ReplicaGrasp datasets. CWGrasp outperforms baselines, at lower runtime and budget, while all components help performance. Code and models will be released.

[CV-1] PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning MICCAI2024

链接: https://arxiv.org/abs/2408.16769
作者: Noor Hussein,Fahad Shamshad,Muzammal Naseer,Karthik Nandakumar
关键词-EN: medical image analysis, medical image-text pairs, Medical vision-language models, Medical vision-language, medical image-text
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:Medical vision-language models (Med-VLMs) trained on large datasets of medical image-text pairs and later fine-tuned for specific tasks have emerged as a mainstream paradigm in medical image analysis. However, recent studies have highlighted the susceptibility of these Med-VLMs to adversarial attacks, raising concerns about their safety and robustness. Randomized smoothing is a well-known technique for turning any classifier into a model that is certifiably robust to adversarial perturbations. However, this approach requires retraining the Med-VLM-based classifier so that it classifies well under Gaussian noise, which is often infeasible in practice. In this paper, we propose a novel framework called PromptSmooth to achieve efficient certified robustness of Med-VLMs by leveraging the concept of prompt learning. Given any pre-trained Med-VLM, PromptSmooth adapts it to handle Gaussian noise by learning textual prompts in a zero-shot or few-shot manner, achieving a delicate balance between accuracy and robustness, while minimizing the computational overhead. Moreover, PromptSmooth requires only a single model to handle multiple noise levels, which substantially reduces the computational cost compared to traditional methods that rely on training a separate model for each noise level. Comprehensive experiments based on three Med-VLMs and across six downstream datasets of various imaging modalities demonstrate the efficacy of PromptSmooth. Our code and models are available at this https URL.
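The randomized smoothing that PromptSmooth builds on has a simple core: classify many Gaussian-perturbed copies of the input and take a majority vote. A toy, framework-free sketch (the base classifier here is a stand-in, not a Med-VLM, and the certification radius computation is omitted):

```python
import random
from collections import Counter

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the randomized-smoothing prediction:
    majority vote of the base classifier over Gaussian-perturbed
    copies of x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    return votes.most_common(1)[0][0]

# Toy linear classifier on 2-D inputs, standing in for the model.
toy = lambda v: int(v[0] + v[1] > 0.0)
pred = smoothed_predict(toy, [1.0, 1.0])
```

PromptSmooth's contribution is making the base model classify well under this noise via learned prompts, instead of retraining it.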

[CV-2] SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

链接: https://arxiv.org/abs/2408.16768
作者: Ziyu Guo,Renrui Zhang,Xiangyang Zhu,Chengzhuo Tong,Peng Gao,Chunyuan Li,Pheng-Ann Heng
关键词-EN: exploration adapting Segment, Segment Anything Model, preliminary exploration adapting, adapting Segment, preliminary exploration
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Work in progress. Online Demo: this https URL . Code: this https URL

点击查看摘要

Abstract:We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection. Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point. To our best knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation. Online Demo: this https URL . Code: this https URL .

[CV-3] ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

链接: https://arxiv.org/abs/2408.16767
作者: Fangfu Liu,Wenqiang Sun,Hanyang Wang,Yikai Wang,Haowen Sun,Junliang Ye,Jun Zhang,Yueqi Duan
关键词-EN: producing realistic, results from hundreds, real world, scene reconstruction, scene
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.

[CV-4] CSGO: Content-Style Composition in Text-to-Image Generation

链接: https://arxiv.org/abs/2408.16766
作者: Peng Xing,Haofan Wang,Yanpeng Sun,Qixun Wang,Xu Bai,Hao Ai,Renyuan Huang,Zechao Li
关键词-EN: shown exceptional capabilities, controlled image generation, shown exceptional, fueled interest, image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training free-based methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct a dataset IMAGStyle, the first large-scale style transfer dataset containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualization and access to the source code can be located on the project page: this https URL.

[CV-5] UV-free Texture Generation with Denoising and Geodesic Heat Diffusions

链接: https://arxiv.org/abs/2408.16762
作者: Simone Foti,Stefanos Zafeiriou,Tolga Birdal
关键词-EN: standard UV-based texturing, wasted UV space, standard UV-based, UV-based texturing, prominent issues
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Seams, distortions, wasted UV space, vertex-duplication, and varying resolution over the surface are the most prominent issues of the standard UV-based texturing of meshes. These issues are particularly acute when automatic UV-unwrapping techniques are used. For this reason, instead of generating textures in automatically generated UV-planes like most state-of-the-art methods, we propose to represent textures as coloured point-clouds whose colours are generated by a denoising diffusion probabilistic model constrained to operate on the surface of 3D objects. Our sampling and resolution agnostic generative model heavily relies on heat diffusion over the surface of the meshes for spatial communication between points. To enable processing of arbitrarily sampled point-cloud textures and ensure long-distance texture consistency we introduce a fast re-sampling of the mesh spectral properties used during the heat diffusion and introduce a novel heat-diffusion-based self-attention mechanism. Our code and pre-trained models are available at this http URL.

[CV-6] OmniRe: Omni Urban Scene Reconstruction

链接: https://arxiv.org/abs/2408.16760
作者: Ziyu Chen,Jiawei Yang,Jiahui Huang,Riccardo de Lutio,Janick Martinez Esturo,Boris Ivanovic,Or Litany,Zan Gojcic,Sanja Fidler,Marco Pavone,Li Song,Yue Wang
关键词-EN: efficiently reconstructing high-fidelity, high-fidelity dynamic urban, reconstructing high-fidelity dynamic, dynamic, dynamic urban
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: See the project page for code, video results and demos: this https URL

点击查看摘要

Abstract:We introduce OmniRe, a holistic approach for efficiently reconstructing high-fidelity dynamic urban scenes from on-device logs. Recent methods for modeling driving sequences using neural radiance fields or Gaussian Splatting have demonstrated the potential of reconstructing challenging dynamic scenes, but often overlook pedestrians and other non-vehicle dynamic actors, hindering a complete pipeline for dynamic urban scene reconstruction. To that end, we propose a comprehensive 3DGS framework for driving scenes, named OmniRe, that allows for accurate, full-length reconstruction of diverse dynamic objects in a driving log. OmniRe builds dynamic neural scene graphs based on Gaussian representations and constructs multiple local canonical spaces that model various dynamic actors, including vehicles, pedestrians, and cyclists, among many others. This capability is unmatched by existing methods. OmniRe allows us to holistically reconstruct different objects present in the scene, subsequently enabling the simulation of reconstructed scenarios with all actors participating in real-time (~60Hz). Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We believe our work fills a critical gap in driving reconstruction.

[CV-7] Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

链接: https://arxiv.org/abs/2408.16757
作者: Hongjun Wang,Sagar Vaze,Kai Han
关键词-EN: Detecting test-time distribution, machine learning models, test-time distribution shift, safely deployed machine, deployed machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted to IJCV, preprint version

点击查看摘要

Abstract:Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: this https URL
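A "scoring rule sensitive to the deep feature magnitude," as in finding (iii), can be as simple as thresholding the L2 norm of the penultimate-layer features. A minimal illustrative sketch (function names and the threshold are ours, not the paper's):

```python
import math

def feature_norm_score(features):
    """OOD score from deep-feature magnitude: larger L2 norms are
    treated as more in-distribution. One simple instance of the
    magnitude-sensitive scoring rules the paper finds promising."""
    return math.sqrt(sum(f * f for f in features))

def is_in_distribution(features, threshold):
    """Flag a sample as in-distribution if its feature norm clears
    a (task-specific, tuned) threshold."""
    return feature_norm_score(features) >= threshold

score = feature_norm_score([3.0, 4.0])  # → 5.0
```

In practice the threshold is calibrated on held-out in-distribution data, e.g. to a target true-positive rate.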

[CV-8] VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

链接: https://arxiv.org/abs/2408.16730
作者: Shiwei Wu,Joya Chen,Kevin Qinghong Lin,Qimeng Wang,Yan Gao,Qianli Xu,Tong Xu,Yao Hu,Enhong Chen,Mike Zheng Shou
关键词-EN: vision tokens, vision tokens generally, frame streaming scenarios, dense video frame, large vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens “skipping layers” rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately ~42% time and ~30% memory savings for the entire training. Moreover, our method reduces the computation in the context and avoids decreasing the vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.
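The per-layer token-skipping idea can be sketched abstractly: a router scores the vision tokens, only the top fraction go through the layer's computation, and the rest pass through unchanged. A toy, framework-free sketch (the real VideoLLM-MoD router and layer transform live inside a transformer; the names and the stand-in transform below are ours):

```python
def mixture_of_depths_layer(tokens, scores, keep_ratio=0.2, transform=None):
    """Route only the top-scoring fraction of tokens through the layer
    computation; the rest skip it and are passed to the next layer
    unchanged (the residual path)."""
    transform = transform or (lambda t: t * 2)  # stand-in for the layer op
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    out = list(tokens)
    for i in keep:
        out[i] = transform(tokens[i])
    return out, keep

# Keep 40% of 5 tokens: the two with the highest router scores.
out, kept = mixture_of_depths_layer(
    [1, 2, 3, 4, 5], [0.1, 0.9, 0.2, 0.8, 0.3], keep_ratio=0.4
)
```

With an 80% skip rate (keep_ratio=0.2), only one token in five incurs the layer's cost, which is where the reported time and memory savings come from.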

[CV-9] Prediction-Feedback DETR for Temporal Action Detection

链接: https://arxiv.org/abs/2408.16729
作者: Jihwan Kim,Miso Lee,Cheol-Ho Cho,Jihyun Lee,Jae-Pil Heo
关键词-EN: Temporal Action Detection, Temporal Action, Action Detection, real-world video applications, video applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, it has recently been identified that the attention collapse in self-attention causes the performance degradation of DETR for TAD. Building upon previous research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes predictions to restore the collapse and align the cross- and self-attention with predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.

[CV-10] H-SGANet: Hybrid Sparse Graph Attention Network for Deformable Medical Image Registration

链接: https://arxiv.org/abs/2408.16719
作者: Yufeng Zhou,Wenming Cao
关键词-EN: Convolutional Neural Network, Transformer has emerged, Convolutional Neural, large parameter space, sparse graph attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The integration of Convolutional Neural Network (ConvNet) and Transformer has emerged as a strong candidate for image registration, leveraging the strengths of both models and a large parameter space. However, this hybrid model, treating brain MRI volumes as grid or sequence structures, faces challenges in accurately representing anatomical connectivity, diverse brain regions, and vital connections contributing to the brain’s internal architecture. Concerns also arise regarding the computational expense and GPU memory usage associated with this model. To tackle these issues, a lightweight hybrid sparse graph attention network (H-SGANet) has been developed. This network incorporates a central mechanism, Sparse Graph Attention (SGA), based on a Vision Graph Neural Network (ViG) with predetermined anatomical connections. The SGA module expands the model’s receptive field and seamlessly integrates into the network. To further amplify the advantages of the hybrid network, the Separable Self-Attention (SSA) is employed as an enhanced token mixer, integrated with depth-wise convolution to constitute SSAFormer. This strategic integration is designed to more effectively extract long-range dependencies. As a hybrid ConvNet-ViG-Transformer model, H-SGANet offers threefold benefits for volumetric medical image registration. It optimizes fixed and moving images concurrently through a hybrid feature fusion layer and an end-to-end learning framework. Compared to VoxelMorph, a model with a similar parameter count, H-SGANet demonstrates significant performance enhancements of 3.5% and 1.5% in Dice score on the OASIS dataset and LPBA40 dataset, respectively.

[CV-11] One-Shot Learning Meets Depth Diffusion in Multi-Object Videos

链接: https://arxiv.org/abs/2408.16704
作者: Anisha Jain
关键词-EN: Creating editable videos, Creating editable, depict complex interactions, task in filmmaking, depict complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Creating editable videos that depict complex interactions between multiple objects in various artistic styles has long been a challenging task in filmmaking. Progress is often hampered by the scarcity of data sets that contain paired text descriptions and corresponding videos that showcase these interactions. This paper introduces a novel depth-conditioning approach that significantly advances this field by enabling the generation of coherent and diverse videos from just a single text-video pair using a pre-trained depth-aware Text-to-Image (T2I) model. Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms. During inference, we use the DDIM inversion to provide structural guidance for video generation. This innovative technique allows for continuously controllable depth in videos, facilitating the generation of multi-object interactions while maintaining the concept generation and compositional strengths of the original T2I model across various artistic styles, such as photorealism, animation, and impressionism.

[CV-12] GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative Models

链接: https://arxiv.org/abs/2408.16700
作者: Moreno D’Incà,Elia Peruzzo,Massimiliano Mancini,Xingqian Xu,Humphrey Shi,Nicu Sebe
关键词-EN: Recent progress, high-quality image generation, enabled high-quality image, enabled high-quality, biases
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under review. Code: this https URL

点击查看摘要

Abstract:Recent progress in Text-to-Image (T2I) generative models has enabled high-quality image generation. As performance and accessibility increase, these models are gaining significant attraction and popularity: ensuring their fairness and safety is a priority to prevent the dissemination and perpetuation of biases. However, existing studies in bias detection focus on closed sets of predefined biases (e.g., gender, ethnicity). In this paper, we propose a general framework to identify, quantify, and explain biases in an open set setting, i.e. without requiring a predefined set. This pipeline leverages a Large Language Model (LLM) to propose biases starting from a set of captions. Next, these captions are used by the target generative model for generating a set of images. Finally, Vision Question Answering (VQA) is leveraged for bias evaluation. We show two variations of this framework: OpenBias and GradBias. OpenBias detects and quantifies biases, while GradBias determines the contribution of individual prompt words on biases. OpenBias effectively detects both well-known and novel biases related to people, objects, and animals and highly aligns with existing closed-set bias detection methods and human judgment. GradBias shows that neutral words can significantly influence biases and it outperforms several baselines, including state-of-the-art foundation models. Code available here: this https URL.

[CV-13] Generic Objects as Pose Probes for Few-Shot View Synthesis

链接: https://arxiv.org/abs/2408.16690
作者: Zhirui Gao,Renjiao Yi,Chenyang Zhu,Ke Zhuang,Wei Chen,Kai Xu
关键词-EN: Radiance fields including, Gaussians demonstrate great, Radiance fields, fields including NeRFs, Gaussians demonstrate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Radiance fields including NeRFs and 3D Gaussians demonstrate great potential in high-fidelity rendering and scene reconstruction, while they require a substantial number of posed images as inputs. COLMAP is frequently employed for preprocessing to estimate poses, while it necessitates a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often use calibration boards but they are not common in images. We propose a novel idea of utilizing everyday objects, commonly found in both images and real life, as “pose probes”. The probe object is automatically segmented by SAM, whose shape is initialized from a cube. We apply a dual-branch volume rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serves as initial poses. PnP matching, requiring only a few features, is suitable for feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets. We demonstrate its effectiveness, particularly in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different objects in a scene yields comparable performance.

[CV-14] PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification

链接: https://arxiv.org/abs/2408.16684
作者: Lei Tan,Pingyang Dai,Jie Chen,Liujuan Cao,Yongjian Wu,Rongrong Ji
关键词-EN: accurately identify objects, Extracting robust feature, Extracting robust, non-overlapping cameras, re-identification to accurately
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Extracting robust feature representation is critical for object re-identification to accurately identify objects across non-overlapping cameras. Although having a strong representation ability, the Vision Transformer (ViT) tends to overfit on most distinct regions of training data, limiting its generalizability and attention to holistic object features. Meanwhile, due to the structural difference between CNN and ViT, fine-grained strategies that effectively address this issue in CNN do not continue to be successful in ViT. To address this issue, by observing the latent diverse representation hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. The PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representation of multi-head self-attention without the typical loss of feature richness induced by concatenation and FFN layers post-attention. To avoid the homogenization of attention heads and promote robust part-based feature learning, two head diversity constraints are imposed: attention diversity constraint and correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of the PartFormer. Specifically, our framework significantly outperforms state-of-the-art by 2.4% mAP scores on the most challenging MSMT17 dataset.

[CV-15] Space3D-Bench: Spatial 3D Question Answering Benchmark

Link: https://arxiv.org/abs/2408.16662
Authors: Emilia Szymanska,Mihai Dusmanu,Jan-Willem Buurlage,Mahdi Rad,Marc Pollefeys
Keywords-EN: environment poses challenges, Answering questions, foundation models due, environment poses, poses challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Answering questions about the spatial properties of the environment poses challenges for existing language and vision foundation models due to a lack of understanding of the 3D world notably in terms of relationships between objects. To push the field forward, multiple 3D QA datasets were proposed which, overall, provide a variety of questions, but they individually focus on particular aspects of 3D reasoning or are limited in terms of data modalities. To address this, we present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset which offers a variety of data modalities: point clouds, posed RGB-D images, navigation meshes and 3D object detections. To ensure that the questions cover a wide range of 3D objectives, we propose an indoor spatial questions taxonomy inspired by geographic information systems and use it to balance the dataset accordingly. Moreover, we provide an assessment system that grades natural language responses based on predefined ground-truth answers by leveraging a Vision Language Model’s comprehension of both text and images to compare the responses with ground-truth textual information or relevant visual data. Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval, achieving an accuracy of 67% on the proposed dataset.

[CV-16] Eigen-Cluster VIS: Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency

Link: https://arxiv.org/abs/2408.16661
Authors: Farnoosh Arefi,Amir M. Mansourian,Shohreh Kasaei
Keywords-EN: Video Instance Segmentation, Video Instance, Instance Segmentation, Eigen-Cluster VIS method, improved significantly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, 5 tables

Abstract:The performance of Video Instance Segmentation (VIS) methods has improved significantly with the advent of transformer networks. However, these networks often face challenges in training due to the high annotation cost. To address this, unsupervised and weakly-supervised methods have been developed to reduce the dependency on annotations. This work introduces a novel weakly-supervised method called Eigen-cluster VIS that, without requiring any mask annotations, achieves competitive accuracy compared to other VIS approaches. This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Cluster Coefficient (QCC). The TEL ensures temporal coherence by leveraging the eigenvalues of the Laplacian matrix derived from graph adjacency matrices. By minimizing the mean absolute error (MAE) between the eigenvalues of adjacent frames, this loss function promotes smooth transitions and stable segmentation boundaries over time, reducing temporal discontinuities and improving overall segmentation quality. The QCC employs the K-means method to ensure the quality of spatio-temporal clusters without relying on ground truth masks. Using the Davies-Bouldin score, the QCC provides an unsupervised measure of feature discrimination, allowing the model to self-evaluate and adapt to varying object distributions, enhancing robustness during the testing phase. These enhancements are computationally efficient and straightforward, offering significant performance gains without additional annotated data. The proposed Eigen-Cluster VIS method is evaluated on the YouTube-VIS 2019/2021 and OVIS datasets, demonstrating that it effectively narrows the performance gap between the fully-supervised and weakly-supervised VIS approaches. The code is available on: this https URL
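The Temporal Eigenvalue Loss described above can be sketched directly from the abstract: take the graph Laplacian of each frame's adjacency matrix and minimize the mean absolute error between the eigenvalue spectra of adjacent frames. The sketch below assumes symmetric adjacency matrices and the unnormalized Laplacian L = D - A, since the abstract does not specify the graph construction.

```python
import numpy as np

def laplacian_eigenvalues(adjacency: np.ndarray) -> np.ndarray:
    """Eigenvalues of the unnormalized graph Laplacian L = D - A,
    sorted ascending (eigvalsh handles the symmetric case)."""
    degree = np.diag(adjacency.sum(axis=1))
    return np.linalg.eigvalsh(degree - adjacency)

def temporal_eigenvalue_loss(adj_frames: list) -> float:
    """MAE between Laplacian eigenvalues of adjacent frames, averaged
    over the clip, promoting temporally coherent segmentations."""
    loss = 0.0
    for a_prev, a_next in zip(adj_frames[:-1], adj_frames[1:]):
        loss += np.abs(laplacian_eigenvalues(a_prev)
                       - laplacian_eigenvalues(a_next)).mean()
    return loss / (len(adj_frames) - 1)
```

A clip whose per-frame graphs are identical incurs zero loss; any spectral drift between consecutive frames is penalized.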

[CV-17] DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving

Link: https://arxiv.org/abs/2408.16647
Authors: Yongjie Fu,Anmol Jain,Xuan Di,Xu Chen,Zhaobin Mo
Keywords-EN: technologies necessitates increasingly, necessitates increasingly sophisticated, increasingly sophisticated methods, driving technologies necessitates, autonomous driving technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The advancement of autonomous driving technologies necessitates increasingly sophisticated methods for understanding and predicting real-world scenarios. Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving. In this paper, we propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them. To achieve this, we employ a video generation framework grounded in denoising diffusion probabilistic models (DDPM) aimed at predicting real-world video sequences. We then explore the adequacy of our generated videos for use in VLMs by employing a pre-trained model known as Efficient In-context Learning on Egocentric Videos (EILEV). The diffusion model is trained with the Waymo open dataset and evaluated using the Fréchet Video Distance (FVD) score to ensure the quality and realism of the generated videos. Corresponding narrations are provided by EILEV for these generated videos, which may be beneficial in the autonomous driving domain. These narrations can enhance traffic scene understanding, aid in navigation, and improve planning capabilities. The integration of video generation with VLMs in the DriveGenVLM framework represents a significant step forward in leveraging advanced AI models to address complex challenges in autonomous driving.
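The DDPM machinery the framework builds on has a standard closed-form forward process, x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps. The sketch below uses the linear beta schedule from the original DDPM paper; the schedule DriveGenVLM actually uses is not stated in the abstract.

```python
import numpy as np

def ddpm_forward(x0: np.ndarray, t: int, betas: np.ndarray,
                 rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Linear schedule from the original DDPM paper (T = 1000 steps)
betas = np.linspace(1e-4, 0.02, 1000)
```

At small t the sample stays close to the clean frame x_0; near t = T the cumulative product alpha_bar_t is nearly zero, so x_t is almost pure Gaussian noise, which is what the denoising network learns to invert.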

[CV-18] SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection ICPR2024

Link: https://arxiv.org/abs/2408.16645
Authors: Rohit Venkata Sai Dulam,Chandra Kambhamettu
Keywords-EN: Salient Object Detection, Salient Object, Object Detection, feature refinement modules, ImageNet pre-trained backbone
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICPR 2024

Abstract:Salient Object Detection (SOD) has traditionally relied on feature refinement modules that utilize the features of an ImageNet pre-trained backbone. However, this approach limits the possibility of pre-training the entire network because of the distinct nature of SOD and image classification. Additionally, the architecture of these backbones, originally built for image classification, is sub-optimal for a dense prediction task like SOD. To address these issues, we propose a novel encoder-decoder-style neural network called SODAWideNet++ that is designed explicitly for SOD. Inspired by the vision transformer's ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large dilated convolutions and self-attention. Specifically, we use attention features to guide long-range information extracted by multiple dilated convolutions, thus taking advantage of the inductive biases of a convolution operation and the input dependency brought by self-attention. In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end. Further, we supervise the background predictions along with the foreground to push our model to generate accurate saliency predictions. SODAWideNet++ performs competitively on five different datasets while only containing 35% of the trainable parameters compared to the state-of-the-art models. The code and pre-computed saliency maps are provided at this https URL.

[CV-19] 3D Pose-Based Temporal Action Segmentation for Figure Skating: A Fine-Grained and Jump Procedure-Aware Annotation Approach

Link: https://arxiv.org/abs/2408.16638
Authors: Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii
Keywords-EN: Understanding human actions, Understanding human, Temporal Action Segmentation, including sports, figure skating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 7th ACM International Workshop on Multimedia Content Analysis in Sports

Abstract:Understanding human actions from videos is essential in many domains, including sports. In figure skating, technical judgments are performed by watching skaters’ 3D movements, and part of the judging procedure can be regarded as a Temporal Action Segmentation (TAS) task. TAS in figure skating, which automatically assigns temporal semantics to video, is actively researched. However, there is a lack of datasets and effective methods for TAS tasks requiring 3D pose data. In this study, we first created the FS-Jump3D dataset of complex and dynamic figure skating jumps using optical markerless motion capture. We also propose a new fine-grained figure skating jump TAS dataset annotation method with which TAS models can learn jump procedures. In the experimental results, we validated the usefulness of 3D pose features as input and the fine-grained dataset for the TAS model in figure skating. The FS-Jump3D dataset is available at this https URL.
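A TAS model's per-frame predictions are typically collapsed into labeled segments before scoring against annotations like the jump-procedure labels above. A minimal, generic sketch of that step (not specific to this paper's pipeline):

```python
def frames_to_segments(labels):
    """Collapse a per-frame label sequence into (start, end, label)
    segments with inclusive frame indices."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at a label change or at sequence end
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments
```

For example, `frames_to_segments(["glide", "glide", "jump", "jump", "land"])` yields three segments, one per contiguous action run.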

[CV-20] Turbulence Strength C_n^2 Estimation from Video using Physics-based Deep Learning

Link: https://arxiv.org/abs/2408.16623
Authors: Ripon Kumar Saha,Esen Salcin,Jihoo Kim,Joseph Smith,Suren Jayasuriya
Keywords-EN: dynamic image distortion, image distortion due, long distance suffer, refractive indices, long distance
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Code available: this https URL

Abstract:Images captured from a long distance suffer from dynamic image distortion due to turbulent flow of air cells with random temperatures, and thus refractive indices. This phenomenon, known as image dancing, is commonly characterized by the refractive-index structure constant C_n^2 as a measure of the turbulence strength. For many applications, such as atmospheric forecast models, long-range/astronomy imaging, aviation safety, and optical communication technology, C_n^2 estimation is critical for accurately sensing the turbulent environment. Previous methods for C_n^2 estimation include estimation from meteorological data (temperature, relative humidity, wind shear, etc.) for single-point measurements, two-ended pathlength measurements from optical scintillometers for path-averaged C_n^2, and more recently estimating C_n^2 from passive video cameras for low cost and hardware complexity. In this paper, we present a comparative analysis of classical image gradient methods for C_n^2 estimation and modern deep learning-based methods leveraging convolutional neural networks. To enable this, we collect a dataset of video capture along with reference scintillometer measurements for ground truth, and we release this unique dataset to the scientific community. We observe that deep learning methods can achieve higher accuracy when trained on similar data, but suffer from generalization errors to other, unseen imagery as compared to classical methods. To overcome this trade-off, we present a novel physics-based network architecture that combines learned convolutional layers with a differentiable image gradient method that maintains high accuracy while being generalizable across image datasets.

[CV-21] Towards Infusing Auxiliary Knowledge for Distracted Driver Detection KDD

Link: https://arxiv.org/abs/2408.16621
Authors: Ishwar B Balappanawar,Ashmit Chamoli,Ruwan Wickramarachchi,Aditya Mishra,Ponnurangam Kumaraguru,Amit P. Sheth
Keywords-EN: road accidents globally, accidents globally, Distracted driving, distracted driving involves, road accidents
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at KiL 2024: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference

Abstract:Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver’s pose. Specifically, we construct a unified framework that integrates the scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver’s actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.

[CV-22] FastForensics: Efficient Two-Stream Design for Real-Time Image Manipulation Detection BMVC2024

Link: https://arxiv.org/abs/2408.16582
Authors: Yangxiang Zhang,Yuezun Li,Ao Luo,Jiaran Zhou,Junyu Dong
Keywords-EN: rise in popularity, spread of falsified, falsified media, media on social, social platforms
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: BMVC 2024

Abstract:With the rise in popularity of portable devices, the spread of falsified media on social platforms has become rampant. This necessitates the timely identification of authentic content. However, most advanced detection methods are computationally heavy, hindering their real-time application. In this paper, we describe an efficient two-stream architecture for real-time image manipulation detection. Our method consists of two-stream branches targeting the cognitive and inspective perspectives. In the cognitive branch, we propose efficient wavelet-guided Transformer blocks to capture the global manipulation traces related to frequency. This block contains an interactive wavelet-guided self-attention module that integrates wavelet transformation with efficient attention design, interacting with the knowledge from the inspective branch. The inspective branch consists of simple convolutions that capture fine-grained traces and interact bidirectionally with Transformer blocks to provide mutual support. Our method is lightweight (~8M parameters) but achieves competitive performance compared to many other counterparts, demonstrating its efficacy in image manipulation detection and its potential for portable integration.
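Wavelet guidance of the kind described above operates on frequency subbands of the feature map. As an illustration of what such a decomposition produces, here is a one-level 2D Haar transform (using the simple averaging/differencing variant, which differs from the paper's actual wavelet and integration scheme):

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level 2D Haar decomposition of an even-sized 2D array into
    LL (low-freq), LH, HL, and HH (high-freq detail) subbands."""
    # Average/difference along rows, then along columns
    a = (x[0::2, :] + x[1::2, :]) / 2
    d = (x[0::2, :] - x[1::2, :]) / 2
    ll = (a[:, 0::2] + a[:, 1::2]) / 2
    lh = (a[:, 0::2] - a[:, 1::2]) / 2
    hl = (d[:, 0::2] + d[:, 1::2]) / 2
    hh = (d[:, 0::2] - d[:, 1::2]) / 2
    return ll, lh, hl, hh
```

Manipulation traces tend to show up in the high-frequency subbands (LH/HL/HH), which is why frequency-aware attention can surface artifacts that spatial features miss.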

[CV-23] MST-KD: Multiple Specialized Teachers Knowledge Distillation for Fair Face Recognition ECCV2024

Link: https://arxiv.org/abs/2408.16563
Authors: Eduarda Caldeira,Jaime S. Cardoso,Ana F. Sequeira,Pedro C. Neto
Keywords-EN: distill equally robust, equally robust information, equally robust, subjects is insufficient, student network
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024 ABAW

Abstract:As in school, one teacher to cover all subjects is insufficient to distill equally robust information to a student. Hence, each subject is taught by a highly specialised teacher. Following a similar philosophy, we propose a multiple specialized teacher framework to distill knowledge to a student network. In our approach, directed at face recognition use cases, we train four teachers on one specific ethnicity, leading to four highly specialized and biased teachers. Our strategy learns a projection of these four teachers into a common space and distills that information to a student network. Our results highlighted increased performance and reduced bias for all our experiments. In addition, we further show that having biased/specialized teachers is crucial by showing that our approach achieves better results than when knowledge is distilled from four teachers trained on balanced datasets. Our approach represents a step forward to the understanding of the importance of ethnicity-specific features.
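Multi-teacher distillation of this kind can be sketched with a temperature-softened cross-entropy between the combined teacher distribution and the student. Note two assumptions not in the abstract: the teachers are simply averaged here (the paper instead learns a projection into a common space), and the temperature T=4 is illustrative.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T: float = 4.0):
    """Cross-entropy between the averaged softened teacher distribution
    and the softened student distribution (equals KL up to a constant
    that does not depend on the student)."""
    p_teachers = np.mean([softmax(t / T) for t in teacher_logits_list], axis=0)
    log_q_student = np.log(softmax(student_logits / T) + 1e-12)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures
    return float(-(p_teachers * log_q_student).sum(axis=-1).mean() * T * T)
```

The loss is minimized when the student's softened distribution matches the combined teachers, which is what lets one compact network absorb four specialists.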

[CV-24] OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation ECCV2024

Link: https://arxiv.org/abs/2408.16547
Authors: Yuchen Che,Ryo Furukawa,Asako Kanezaki
Keywords-EN: pose estimation focuses, Category-level articulated object, Category-level articulated, estimation focuses, pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in ECCV 2024

Abstract:Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates reconstruction with a canonical pose and joint state for the entire input object, and it estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to the state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset.

[CV-25] Spurfies: Sparse Surface Reconstruction using Local Geometry Priors

Link: https://arxiv.org/abs/2408.16544
Authors: Kevin Raj,Christopher Wewer,Raza Yunus,Eddy Ilg,Jan Eric Lenssen
Keywords-EN: introduce Spurfies, Spurfies, geometry, geometry priors trained, appearance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:We introduce Spurfies, a novel method for sparse-view surface reconstruction that disentangles appearance and geometry information to utilize local geometry priors trained on synthetic data. Recent research heavily focuses on 3D reconstruction using dense multi-view setups, typically requiring hundreds of images. However, these methods often struggle with few-view scenarios. Existing sparse-view reconstruction techniques often rely on multi-view stereo networks that need to learn joint priors for geometry and appearance from a large amount of data. In contrast, we introduce a neural point representation that disentangles geometry and appearance to train a local geometry prior using a subset of the synthetic ShapeNet dataset only. During inference, we utilize this surface prior as additional constraint for surface and appearance reconstruction from sparse input views via differentiable volume rendering, restricting the space of possible solutions. We validate the effectiveness of our method on the DTU dataset and demonstrate that it outperforms previous state of the art by 35% in surface quality while achieving competitive novel view synthesis quality. Moreover, in contrast to previous works, our method can be applied to larger, unbounded scenes, such as Mip-NeRF 360.
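The differentiable volume rendering mentioned above composites samples along each ray with the standard weights w_i = T_i (1 - exp(-sigma_i * delta_i)), where T_i is the accumulated transmittance. This is the generic formulation used across such pipelines, not Spurfies-specific code; the densities and spacings below are illustrative.

```python
import numpy as np

def volume_render_weights(sigmas: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Per-sample compositing weights along a ray:
    w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    T_i = prod_{j<i} exp(-sigma_j * delta_j)."""
    alpha = 1.0 - np.exp(-sigmas * deltas)          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

sigmas = np.array([0.1, 2.0, 5.0, 0.5])  # illustrative densities
deltas = np.full(4, 0.25)                # uniform sample spacing
w = volume_render_weights(sigmas, deltas)
```

The weights are non-negative and sum to 1 - exp(-sum(sigma * delta)), i.e. the ray's total opacity; colors or surface constraints are then blended with these weights, and every step is differentiable with respect to the densities.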

[CV-26] GRPose: Learning Graph Relations for Human Image Generation with Pose Priors

Link: https://arxiv.org/abs/2408.16540
Authors: Xiangchen Yin,Donglin Di,Lei Fan,Hao Li,Chen Wei,Xiaofei Gou,Yang Song,Xiao Sun,Xun Yang
Keywords-EN: made significant progress, Recent methods, human image generation, pose, pose priors
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code will be released at this https URL

Abstract:Recent methods using diffusion models have made significant progress in human image generation with various additional controls such as pose priors. However, existing approaches still struggle to generate high-quality images with consistent pose alignment, resulting in unsatisfactory outputs. In this paper, we propose a framework delving into the graph relations of pose priors to provide control information for human image generation. The main idea is to establish a graph topological structure between the pose priors and latent representation of diffusion models to capture the intrinsic associations between different pose parts. A Progressive Graph Integrator (PGI) is designed to learn the spatial relationships of the pose priors with the graph structure, adopting a hierarchical strategy within an Adapter to gradually propagate information across different pose parts. A pose perception loss is further introduced based on a pretrained pose estimation network to minimize the pose differences. Extensive qualitative and quantitative experiments conducted on the Human-Art and LAION-Human datasets demonstrate that our model achieves superior performance, with a 9.98% increase in pose average precision compared to the latest benchmark model. The code is released on *******.

[CV-27] Are Pose Estimators Ready for the Open World? STAGE: Synthetic Data Generation Toolkit for Auditing 3D Human Pose Estimators

Link: https://arxiv.org/abs/2408.16536
Authors: Nikita Kister,István Sárándi,Anna Khoreva,Gerard Pons-Moll
Keywords-EN: progressed tremendously, years as measured, measured on standard, pose estimators, pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The estimation of 3D human poses from images has progressed tremendously over the last few years as measured on standard benchmarks. However, performance in the open world remains underexplored, as current benchmarks cannot capture its full extent. Especially in safety-critical systems, it is crucial that 3D pose estimators are audited before deployment, and their sensitivity towards single factors or attributes occurring in the operational domain is thoroughly examined. Nevertheless, we currently lack a benchmark that would enable such fine-grained analysis. We thus present STAGE, a GenAI data toolkit for auditing 3D human pose estimators. We enable a text-to-image model to control the 3D human body pose in the generated image. This allows us to create customized annotated data covering a wide range of open-world attributes. We leverage STAGE and generate a series of benchmarks to audit the sensitivity of popular pose estimators towards attributes such as gender, ethnicity, age, clothing, location, and weather. Our results show that the presence of such naturally occurring attributes can cause severe degradation in the performance of pose estimators and leads us to question if they are ready for open-world deployment.

[CV-28] A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions

Link: https://arxiv.org/abs/2408.16530
Authors: Yu Wang,Shaohua Wang,Yicheng Li,Mingchun Liu
Keywords-EN: essential environmental awareness, autonomous driving systems, providing essential environmental, autonomous driving, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, 3D object perception has become a crucial component in the development of autonomous driving systems, providing essential environmental awareness. However, as perception tasks in autonomous driving evolve, their variants have increased, leading to diverse insights from industry and academia. Currently, there is a lack of comprehensive surveys that collect and summarize these perception tasks and their developments from a broader perspective. This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques. We provide a comprehensive analysis of the strengths and limitations of each approach, highlighting advancements in accuracy and robustness. Furthermore, we discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks. We also explore cooperative perception methods that extend the perception range through collaborative communication. By providing a holistic view of the current state and future developments in 3D object perception, we aim to offer a more comprehensive understanding of perception tasks for autonomous driving. Additionally, we have established an active repository to provide continuous updates on the latest advancements in this field, accessible at: this https URL.

[CV-29] Towards Modality-agnostic Label-efficient Segmentation with Entropy-Regularized Distribution Alignment

Link: https://arxiv.org/abs/2408.16520
Authors: Liyao Tang,Zhe Chen,Shanshan Zhao,Chaoyue Wang,Dacheng Tao
Keywords-EN: limited ground-truth labels, Label-efficient segmentation aims, aims to perform, segmentation, limited ground-truth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended version of arXiv:2305.15832; code at this https URL

Abstract:Label-efficient segmentation aims to perform effective segmentation on input data using only sparse and limited ground-truth labels for training. This topic is widely studied in 3D point cloud segmentation due to the difficulty of annotating point clouds densely, while it is also essential for cost-effective segmentation on 2D images. Until recently, pseudo-labels have been widely employed to facilitate training with limited ground-truth labels, and promising progress has been witnessed in both the 2D and 3D segmentation. However, existing pseudo-labeling approaches could suffer heavily from the noise and variations in unlabelled data, which would result in significant discrepancies between generated pseudo-labels and current model predictions during training. We analyze that this can further confuse and affect the model learning process, which proves to be a shared problem in label-efficient learning across both 2D and 3D modalities. To address this issue, we propose a novel learning strategy to regularize the pseudo-labels generated for training, thus effectively narrowing the gaps between pseudo-labels and model predictions. More specifically, our method introduces an Entropy Regularization loss and a Distribution Alignment loss for label-efficient learning, resulting in an ERDA learning strategy. Interestingly, by using KL distance to formulate the distribution alignment loss, ERDA reduces to a deceptively simple cross-entropy-based loss which optimizes both the pseudo-label generation module and the segmentation model simultaneously. In addition, we innovate in the pseudo-label generation to make our ERDA consistently effective across both 2D and 3D data modalities for segmentation. Enjoying simplicity and more modality-agnostic pseudo-label generation, our method has shown outstanding performance in fully utilizing all unlabeled data points for training across …
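The cross-entropy reduction the abstract describes can be sketched as follows. Using the identity H(p, q) = H(p) + KL(p || q), a single cross-entropy between the pseudo-label distribution p and the model prediction q simultaneously aligns the two distributions (the KL term) and penalizes pseudo-label entropy (the H(p) term). The weighting and the actual pseudo-label generation in ERDA may differ; this is an illustrative sketch only.

```python
import numpy as np

def erda_style_loss(pseudo_probs: np.ndarray, model_probs: np.ndarray) -> float:
    """Cross-entropy H(p, q) between per-point pseudo-label
    distributions p and model predictions q, averaged over points.
    H(p, q) = H(p) + KL(p || q), so minimizing it both sharpens the
    pseudo-labels and aligns them with the predictions."""
    eps = 1e-12
    return float(-(pseudo_probs * np.log(model_probs + eps)).sum(axis=-1).mean())
```

Because the same scalar is differentiable with respect to both p and q, one backward pass updates the pseudo-label generator and the segmentation model together, which matches the "deceptively simple" formulation described above.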

[CV-30] Alignment is All You Need: A Training-free Augmentation Strategy for Pose-guided Video Generation ICML2024

Link: https://arxiv.org/abs/2408.16506
Authors: Xiaoyu Jin,Zunnan Xu,Mingwen Ou,Wenming Yang
Keywords-EN: graphics and vision, transformative field, field in computer, computer graphics, dynamic and realistic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVG@ICML 2024

Abstract:Character animation is a transformative field in computer graphics and vision, enabling dynamic and realistic video animations from static images. Despite advancements, maintaining appearance consistency in animations remains a challenge. Our approach addresses this by introducing a training-free framework that ensures the generated video sequence preserves the reference image’s subtleties, such as physique and proportions, through a dual alignment strategy. We decouple skeletal and motion priors from pose information, enabling precise control over animation generation. Our method also improves pixel-level alignment for conditional control from the reference character, enhancing the temporal consistency and visual cohesion of animations. Our method significantly enhances the quality of video generation without the need for large datasets or expensive computational resources.

[CV-31] A Simple and Generalist Approach for Panoptic Segmentation

Link: https://arxiv.org/abs/2408.16504
Authors: Nedyalko Prisadnikov,Wouter Van Gansbeke,Danda Pani Paudel,Luc Van Gool
Keywords-EN: vision models aim, Generalist vision models, vision tasks, Edge Distance Sampling, Generalist vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalist vision models aim for one and the same architecture for a variety of vision tasks. While such shared architecture may seem attractive, generalist models tend to be outperformed by their bespoken counterparts, especially in the case of panoptic segmentation. We address this problem by introducing two key contributions, without compromising the desirable properties of generalist models. These contributions are: (i) a positional-embedding (PE) based loss for improved centroid regressions; (ii) Edge Distance Sampling (EDS) for the better separation of instance boundaries. The PE-based loss facilitates a better per-pixel regression of the associated instance’s centroid, whereas EDS contributes by carefully handling the void regions (caused by missing labels) and smaller instances. These two simple yet effective modifications significantly improve established baselines, while achieving state-of-the-art results among all generalist solutions. More specifically, our method achieves a panoptic quality (PQ) of 52.5 on the COCO dataset, which is an improvement of 10 points over the best model with similar approach (Painter), and surpasses the best-performing diffusion-based method, Pix2Seq-D, by 2 points. Furthermore, we provide insights into and an in-depth analysis of our contributions through exhaustive experiments. Our source code and model weights will be made publicly available.
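The panoptic quality metric the abstract reports is defined as PQ = sum of IoUs over matched (TP) pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a prediction matches a ground-truth segment when their IoU exceeds 0.5. A minimal sketch of the final aggregation step (segment matching itself is assumed done upstream):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum(IoU over matched TP pairs) / (|TP| + 0.5*|FP| + 0.5*|FN|).

    matched_ious: IoU values of matched prediction/ground-truth pairs (TP),
    num_fp/num_fn: unmatched predicted / ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0
```

For instance, two matches with IoUs 0.8 and 0.6 plus one false positive and one false negative give PQ = 1.4 / 3 ≈ 0.467; PQ also factors as segmentation quality (mean matched IoU) times recognition quality (an F1-style term).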

[CV-32] Locally Grouped and Scale-Guided Attention for Dense Pest Counting

Link: https://arxiv.org/abs/2408.16503
Authors: Chang-Hwan Son
Keywords-EN: predict densely distributed, densely distributed pests, distributed pests captured, digital traps, dense pest counting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study introduces a new dense pest counting problem to predict densely distributed pests captured by digital traps. Unlike traditional detection-based counting models for sparsely distributed objects, trap-based pest counting must deal with dense pest distributions that pose challenges such as severe occlusion, wide pose variation, and similar appearances in colors and textures. To address these problems, it is essential to incorporate the local attention mechanism, which identifies locally important and unimportant areas to learn locally grouped features, thereby enhancing discriminative performance. Accordingly, this study presents a novel design that integrates locally grouped and scale-guided attention into a multiscale CenterNet framework. To group local features with similar attributes, a straightforward method is introduced using the heatmap predicted by the first hourglass containing pest centroid information, which eliminates the need for complex clustering models. To enhance attentiveness, the pixel attention module transforms the heatmap into a learnable map. Subsequently, scale-guided attention is deployed to make the object and background features more discriminative, achieving multiscale feature fusion. Through experiments, the proposed model is verified to enhance object features based on local grouping and discriminative feature attention learning. Additionally, the proposed model is highly effective in overcoming occlusion and pose variation problems, making it more suitable for dense pest counting. In particular, the proposed model outperforms state-of-the-art models by a large margin, with a remarkable contribution to dense pest counting.
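The centroid heatmap the first hourglass predicts is, in CenterNet-style models, supervised with Gaussian peaks rendered at object centers. A generic sketch of that target construction (the kernel width and max-merging rule are standard conventions, not details taken from this paper):

```python
import numpy as np

def centroid_heatmap(h: int, w: int, centers, sigma: float = 2.0) -> np.ndarray:
    """Render a Gaussian peak at each (row, col) centroid; overlapping
    peaks keep the element-wise maximum, as in CenterNet-style targets."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w))
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)
    return heat
```

Each peak reaches exactly 1.0 at its centroid, so local maxima of the predicted heatmap can be read off directly as pest locations, which is what makes the heatmap usable for grouping local features without a clustering model.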

[CV-33] UAV-Based Human Body Detector Selection and Fusion for Geolocated Saliency Map Generation

链接: https://arxiv.org/abs/2408.16501
作者: Piotr Rudol,Patrick Doherty,Mariusz Wzorek,Chattrakul Sombattheera
关键词-EN: Unmanned Aerial Vehicles, Aerial Vehicles, Search and Rescue, Unmanned Aerial, reliably geolocating objects
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 42 pages, 19 figures

点击查看摘要

Abstract:The problem of reliably detecting and geolocating objects of different classes in soft real-time is essential in many application areas, such as Search and Rescue performed using Unmanned Aerial Vehicles (UAVs). This research addresses the complementary problems of system contextual vision-based detector selection, allocation, and execution, in addition to the fusion of detection results from teams of UAVs for the purpose of accurately and reliably geolocating objects of interest in a timely manner. In an offline step, an application-independent evaluation of vision-based detectors from a system perspective is first performed. Based on this evaluation, the most appropriate algorithms for online object detection for each platform are selected automatically before a mission, taking into account a number of practical system considerations, such as the available communication links, video compression used, and the available computational resources. The detection results are fused using a method for building maps of salient locations which takes advantage of a novel sensor model for vision-based detections for both positive and negative observations. A number of simulated and real flight experiments are also presented, validating the proposed method.

[CV-34] CogVLM2: Visual Language Models for Image and Video Understanding

链接: https://arxiv.org/abs/2408.16500
作者: Wenyi Hong,Weihan Wang,Ming Ding,Wenmeng Yu,Qingsong Lv,Yan Wang,Yean Cheng,Shiyu Huang,Junhui Ji,Zhao Xue,Lei Zhao,Zhuoyi Yang,Xiaotao Gu,Xiaohan Zhang,Guanyu Feng,Da Yin,Zihan Wang,Ji Qi,Xixuan Song,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Yuxiao Dong,Jie Tang
关键词-EN: enhanced vision-language fusion, continuously exploring VLMs, efficient higher-resolution architecture, Beginning with VisualGLM, VisualGLM and CogVLM
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344×1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in this https URL and this https URL, contributing to the advancement of the field.

[CV-35] Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

链接: https://arxiv.org/abs/2408.16486
作者: Zhengqing Gao,Xiang Ao,Xu-Yao Zhang,Cheng-Lin Liu
关键词-EN: Adapting pre-trained models, Adapting pre-trained, Adapting, challenging problem, pre-trained models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: PRCV 2024

点击查看摘要

Abstract:Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research focuses on fine-tuning such models to downstream tasks. Prompt tuning methods achieved huge improvements by learning context vectors on few-shot data. However, through the evaluation under open-set adaptation setting with the test data including new classes, we find that there exists a dilemma that learned prompts have worse generalization abilities than hand-crafted prompts. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach, which leverages the maximum concept matching (MCM) scores as dynamic weights to generate an input-conditioned prompt for each image during test. Through extensive experiments on 11 different datasets, we show that our proposed method outperforms all comparison methods on average considering both base and new classes. The code is available at this https URL
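To make the dynamic-weighting idea concrete, here is a hypothetical NumPy sketch (not the authors' code; embedding sizes, the temperature, and all data are invented): the MCM score is taken as the highest softmax-normalized cosine similarity between the image and the base-class text features, and it is then used to mix learned and hand-crafted prompt features per test image.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Unit-normalize vectors so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mcm_score(img_feat, class_feats, temperature=0.01):
    # Maximum concept matching: highest softmax-normalized cosine
    # similarity between the image and any base-class text feature.
    sims = l2norm(class_feats) @ l2norm(img_feat)
    return softmax(sims / temperature).max()

def input_conditioned_prompt(img_feat, learned, handcrafted, class_feats):
    # Weight the learned prompt more when the image looks like the base
    # classes, and fall back toward the hand-crafted prompt otherwise.
    w = mcm_score(img_feat, class_feats)
    return w * learned + (1.0 - w) * handcrafted

rng = np.random.default_rng(0)
img = rng.normal(size=512)               # toy image embedding
classes = rng.normal(size=(10, 512))     # toy base-class text embeddings
learned = rng.normal(size=512)           # toy learned prompt feature
handcrafted = rng.normal(size=512)       # toy hand-crafted prompt feature
prompt = input_conditioned_prompt(img, learned, handcrafted, classes)
print(prompt.shape)  # (512,)
```

The linear blend is purely illustrative; the exact way the two prompt sources are combined in the actual method follows the paper.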

[CV-36] MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

链接: https://arxiv.org/abs/2408.16478
作者: Linyan Yang,Lukas Hoyer,Mark Weber,Tobias Fischer,Dengxin Dai,Laura Leal-Taixé,Marc Pollefeys,Daniel Cremers,Luc Van Gool
关键词-EN: Unsupervised Domain Adaptation, labeled source domain, unlabeled target domain, Unsupervised Domain, Domain Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.
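The complementary masking at the core of MICDrop can be sketched in a few lines. This is a toy NumPy version under assumed (C, H, W) feature shapes, not the released implementation: one binary spatial mask drops image-encoder features, and its inverse drops the corresponding depth-encoder features, so each spatial position must be explained by exactly one modality.

```python
import numpy as np

def complementary_dropout(img_feats, depth_feats, drop_prob=0.5, seed=None):
    """Mask image features with a random binary spatial mask and depth
    features with its inverse, so every position keeps exactly one
    modality. Feature maps are assumed to have shape (C, H, W)."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(img_feats.shape[1:]) > drop_prob).astype(img_feats.dtype)
    return img_feats * mask, depth_feats * (1.0 - mask), mask

img = np.ones((8, 4, 4))    # toy image-encoder features
depth = np.ones((8, 4, 4))  # toy depth-encoder features
img_m, depth_m, mask = complementary_dropout(img, depth, seed=0)
# At every position, exactly one of the two modalities survives.
print(np.allclose(img_m + depth_m, 1.0))  # True
```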

[CV-37] Creating a Segmented Pointcloud of Grapevines by Combining Multiple Viewpoints Through Visual Odometry

链接: https://arxiv.org/abs/2408.16472
作者: Michael Adlerstein,Angelo Bratta,João Carlos Virgolino Soares,Giovanni Dessy,Miguel Fernandes,Matteo Gatti,Claudio Semini
关键词-EN: Grapevine winter pruning, Grapevine winter, process that significantly, significantly influences, influences the quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Grapevine winter pruning is a labor-intensive and repetitive process that significantly influences the quality and quantity of the grape harvest and produced wine of the following season. It requires a careful and expert detection of the point to be cut. Because of its complexity, repetitive nature and time constraint, the task requires skilled labor that needs to be trained. This extended abstract presents the computer vision pipeline employed in project Vinum, using detectron2 as a segmentation network and keypoint visual odometry to merge different observations into a single pointcloud used to make informed pruning decisions.

[CV-38] Multi-source Domain Adaptation for Panoramic Semantic Segmentation

链接: https://arxiv.org/abs/2408.16469
作者: Jing Jiang,Sicheng Zhao,Jiankun Zhu,Wenbo Tang,Zhaopan Xu,Jidong Yang,Pengfei Xu,Hongxun Yao
关键词-EN: Panoramic semantic segmentation, received widespread attention, widespread attention recently, attention recently due, Panoramic semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Panoramic semantic segmentation has received widespread attention recently due to its comprehensive 360° field of view. However, labeling such images demands greater resources compared to pinhole images. As a result, many unsupervised domain adaptation methods for panoramic semantic segmentation have emerged, utilizing real pinhole images or low-cost synthetic panoramic images. But, the segmentation model lacks understanding of the panoramic structure when only utilizing real pinhole images, and it lacks perception of real-world scenes when only adopting synthetic panoramic images. Therefore, in this paper, we propose a new task of multi-source domain adaptation for panoramic semantic segmentation, aiming to utilize both real pinhole and synthetic panoramic images in the source domains, enabling the segmentation model to perform well on unlabeled real panoramic images in the target domain. Further, we propose Deformation Transform Aligner for Panoramic Semantic Segmentation (DTA4PASS), which converts all pinhole images in the source domains into panoramic-like images, and then aligns the converted source domains with the target domain. Specifically, DTA4PASS consists of two main components: Unpaired Semantic Morphing (USM) and Distortion Gating Alignment (DGA). Firstly, in USM, the Semantic Dual-view Discriminator (SDD) assists in training the diffeomorphic deformation network, enabling the effective transformation of pinhole images without paired panoramic views. Secondly, DGA assigns pinhole-like and panoramic-like features to each image by gating, and aligns these two features through uncertainty estimation. DTA4PASS outperforms the previous state-of-the-art methods by 1.92% and 2.19% on the outdoor and indoor multi-source domain adaptation scenarios, respectively. The source code will be released.

[CV-39] Spiking Diffusion Models

链接: https://arxiv.org/abs/2408.16467
作者: Jiahang Cao,Hanzhong Guo,Ziqing Wang,Deming Zhou,Hao Cheng,Qiang Zhang,Renjing Xu
关键词-EN: Artificial Neural Networks, Spiking Neural Networks, traditional Artificial Neural, Neural Networks, witnessed Spiking Neural
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transactions on Artificial Intelligence

点击查看摘要

Abstract:Recent years have witnessed Spiking Neural Networks (SNNs) gaining attention for their ultra-low energy consumption and high biological plausibility compared with traditional Artificial Neural Networks (ANNs). Despite their distinguished properties, the application of SNNs in the computationally intensive field of image generation is still under exploration. In this paper, we propose the Spiking Diffusion Models (SDMs), an innovative family of SNN-based generative models that excel in producing high-quality samples with significantly reduced energy consumption. In particular, we propose a Temporal-wise Spiking Mechanism (TSM) that allows SNNs to capture more temporal features from a bio-plasticity perspective. In addition, we propose a threshold-guided strategy that can further improve the performances by up to 16.7% without any additional training. We also make the first attempt to use the ANN-SNN approach for SNN-based generation tasks. Extensive experimental results reveal that our approach not only exhibits comparable performance to its ANN counterpart with few spiking time steps, but also outperforms previous SNN-based generative models by a large margin. Moreover, we also demonstrate the high-quality generation ability of SDM on large-scale datasets, e.g., LSUN bedroom. This development marks a pivotal advancement in the capabilities of SNN-based generation, paving the way for future research avenues to realize low-energy and low-latency generative applications. Our code is available at this https URL.

[CV-40] Weakly Supervised Object Detection for Automatic Tooth-marked Tongue Recognition

链接: https://arxiv.org/abs/2408.16451
作者: Yongcun Zhang,Jiajun Xu,Yina He,Shaozi Li,Zhiming Luo,Huangwei Lei
关键词-EN: Traditional Chinese Medicine, Chinese Medicine, individual health status, Traditional Chinese, crucial diagnostic method
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Tongue diagnosis in Traditional Chinese Medicine (TCM) is a crucial diagnostic method that can reflect an individual’s health status. Traditional methods for identifying tooth-marked tongues are subjective and inconsistent because they rely on practitioner experience. We propose a novel fully automated Weakly Supervised method using Vision transformer and Multiple instance learning (WSVM) for tongue extraction and tooth-marked tongue recognition. Our approach first accurately detects and extracts the tongue region from clinical images, removing any irrelevant background information. Then, we implement an end-to-end weakly supervised object detection method. We utilize Vision Transformer (ViT) to process tongue images in patches and employ multiple instance loss to identify tooth-marked regions with only image-level annotations. WSVM achieves high accuracy in tooth-marked tongue classification, and visualization experiments demonstrate its effectiveness in pinpointing these regions. This automated approach enhances the objectivity and accuracy of tooth-marked tongue diagnosis. It provides significant clinical value by assisting TCM practitioners in making precise diagnoses and treatment recommendations. Code is available at this https URL.

[CV-41] What to Preserve and What to Transfer: Faithful Identity-Preserving Diffusion-based Hairstyle Transfer

链接: https://arxiv.org/abs/2408.16450
作者: Chaeyeon Chung,Sunghyun Park,Jeongho Kim,Jaegul Choo
关键词-EN: image editing field, face image, Hairstyle transfer, editing field, field that modifies
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hairstyle transfer is a challenging task in the image editing field that modifies the hairstyle of a given face image while preserving its other appearance and background features. The existing hairstyle transfer approaches heavily rely on StyleGAN, which is pre-trained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations of head poses or focal lengths. To address this issue, we propose a one-stage hairstyle transfer diffusion model, HairFusion, that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the input of the model, where the original hair information is thoroughly eliminated. Next, we introduce a hair align cross-attention (Align-CA) to accurately align the reference hairstyle with the face image while considering the difference in their face shape. To enhance the preservation of the face image’s original features, we leverage adaptive hair blending during the inference, where the output’s hair regions are estimated by the cross-attention map in Align-CA and blended with non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to the existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features. The codes are available at this https URL.
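The adaptive hair blending step amounts to soft alpha compositing with the attention-derived hair map. The toy NumPy sketch below is illustrative only (the map here is hand-made; in the paper it is estimated from the Align-CA cross-attention):

```python
import numpy as np

def adaptive_blend(generated, original, hair_attn):
    """Blend a generated image with the original face image using a soft
    hair-region map with values in [0, 1] and shape (H, W). Hair regions
    come from the generated result; everything else is copied from the
    original to preserve identity and background."""
    a = hair_attn[..., None]          # broadcast the map over RGB channels
    return a * generated + (1.0 - a) * original

gen = np.full((4, 4, 3), 0.9)            # toy "generated" image
orig = np.zeros((4, 4, 3))               # toy original face image
attn = np.zeros((4, 4)); attn[:2] = 1.0  # pretend the top half is hair
out = adaptive_blend(gen, orig, attn)
print(out[0, 0, 0], out[3, 0, 0])  # 0.9 0.0
```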

[CV-42] Enhancing Sound Source Localization via False Negative Elimination

链接: https://arxiv.org/abs/2408.16448
作者: Zengjie Song,Jiangshe Zhang,Yuxi Wang,Junsong Fan,Zhaoxiang Zhang
关键词-EN: source localization aims, Sound source localization, localize objects emitting, source localization, localization aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2203.13412

点击查看摘要

Abstract:Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior arts can lead to the false negative issue, where the sounds semantically similar to visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state-of-the-arts. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: this https URL.
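One way to picture SACL's false-negative removal is a contrastive loss whose denominator simply excludes candidates that are too similar to the anchor. The NumPy sketch below is an illustrative stand-in, not the paper's method: the fixed cosine threshold, the temperature, and all embeddings are invented, whereas the actual approach relies on learned semantic awareness.

```python
import numpy as np

def filtered_info_nce(anchor, positive, negatives, tau=0.1, fn_thresh=0.8):
    """InfoNCE-style contrastive loss that removes likely false negatives:
    candidates whose cosine similarity to the anchor exceeds fn_thresh are
    excluded from the denominator instead of being pushed away."""
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos_sim = a @ p
    neg_sims = n @ a
    kept = neg_sims[neg_sims < fn_thresh]        # drop likely false negatives
    logits = np.concatenate(([pos_sim], kept)) / tau
    logits -= logits.max()                       # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(1)
anchor = rng.normal(size=64)
positive = anchor + 0.1 * rng.normal(size=64)
negatives = rng.normal(size=(16, 64))
negatives[0] = anchor + 0.05 * rng.normal(size=64)  # a planted "false negative"
loss = filtered_info_nce(anchor, positive, negatives)
print(loss > 0)  # True
```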

[CV-43] Mismatched: Evaluating the Limits of Image Matching Approaches and Benchmarks

链接: https://arxiv.org/abs/2408.16445
作者: Sierra Bonilla,Chiara Di Vece,Rema Daher,Xinwei Ju,Danail Stoyanov,Francisco Vasconcelos,Sophia Bano
关键词-EN: three-dimensional modeling, active research field, computer vision, field in computer, applications ranging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Three-dimensional (3D) reconstruction from two-dimensional images is an active research field in computer vision, with applications ranging from navigation and object tracking to segmentation and three-dimensional modeling. Traditionally, parametric techniques have been employed for this task. However, recent advancements have seen a shift towards learning-based methods. Given the rapid pace of research and the frequent introduction of new image matching methods, it is essential to evaluate them. In this paper, we present a comprehensive evaluation of various image matching methods using a structure-from-motion pipeline. We assess the performance of these methods on both in-domain and out-of-domain datasets, identifying key limitations in both the methods and benchmarks. We also investigate the impact of edge detection as a pre-processing step. Our analysis reveals that image matching for 3D reconstruction remains an open challenge, necessitating careful selection and tuning of models for specific scenarios, while also highlighting mismatches in how metrics currently represent method performance.

[CV-44] Integrating Features for Recognizing Human Activities through Optimized Parameters in Graph Convolutional Networks and Transformer Architectures

链接: https://arxiv.org/abs/2408.16442
作者: Mohammad Belal(1),Taimur Hassan(2),Abdelfatah Hassan(1),Nael Alsheikh(1),Noureldin Elhendawi(1),Irfan Hussain(1) ((1) Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates, (2) Abu Dhabi University, Abu Dhabi, United Arab Emirates)
关键词-EN: categorize human actions, employs computer vision, machine vision, computer vision, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 6 pages, 1 figure, conference

点击查看摘要

Abstract:Human activity recognition is a major field of study that employs computer vision, machine vision, and deep learning techniques to categorize human actions. The field of deep learning has made significant progress, with architectures that are extremely effective at capturing human dynamics. This study emphasizes the influence of feature fusion on the accuracy of activity recognition. This technique addresses the limitation of conventional models, which face difficulties in identifying activities because of their limited capacity to understand spatial and temporal features. The technique employs sensory data obtained from four publicly available datasets: HuGaDB, PKU-MMD, LARa, and TUG. The accuracy and F1-score of two deep learning models, specifically a Transformer model and a Parameter-Optimized Graph Convolutional Network (PO-GCN), were evaluated using these datasets. The feature fusion technique integrated the final layer features from both models and inputted them into a classifier. Empirical evidence demonstrates that PO-GCN outperforms standard models in activity recognition. HuGaDB demonstrated a 2.3% improvement in accuracy and a 2.2% increase in F1-score. TUG showed a 5% increase in accuracy and a 0.5% rise in F1-score. On the other hand, LARa and PKU-MMD achieved lower accuracies of 64% and 69% respectively. This indicates that the integration of features enhanced the performance of both the Transformer model and PO-GCN.
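The fusion step described above, concatenating the two models' final-layer features and feeding the joint vector to a classifier, can be sketched as follows. All weights, dimensions, and the class count are invented for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(feat_transformer, feat_gcn, W, b):
    """Late fusion: concatenate the final-layer features of the two models
    and feed the joint vector to a linear softmax classifier.
    W has shape (num_classes, d1 + d2)."""
    fused = np.concatenate([feat_transformer, feat_gcn], axis=-1)
    return softmax(fused @ W.T + b)

rng = np.random.default_rng(0)
f_tr = rng.normal(size=128)    # toy Transformer features
f_gcn = rng.normal(size=64)    # toy PO-GCN features
num_classes = 5
W = rng.normal(size=(num_classes, 128 + 64)) * 0.01
b = np.zeros(num_classes)
probs = fuse_and_classify(f_tr, f_gcn, W, b)
print(probs.shape, round(float(probs.sum()), 6))  # (5,) 1.0
```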

[CV-45] Discriminative Spatial-Semantic VOS Solution: 1st Place Solution for 6th LSVOS

链接: https://arxiv.org/abs/2408.16431
作者: Deshui Miao,Yameng Gu,Xin Li,Zhenyu He,Yaowei Wang,Ming-Hsuan Yang
关键词-EN: Video object segmentation, current VOS methods, VOS methods struggle, prolonged object motions, Video object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 1st Place Solution for 6th LSVOS VOS Track. arXiv admin note: substantial text overlap with arXiv:2406.04600

点击查看摘要

Abstract:Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions. To address these challenges, the MOSE dataset aims to enhance object recognition and differentiation in complex environments, while the LVOS dataset focuses on segmenting objects exhibiting long-term, intricate movements. This report introduces a discriminative spatial-temporal VOS model that utilizes discriminative object features as query representations. The semantic understanding of spatial-semantic modules enables it to recognize object parts, while salient features highlight more distinctive object characteristics. Our model, trained on extensive VOS datasets, achieved first place (80.90% J&F) on the test set of the 6th LSVOS challenge in the VOS Track, demonstrating its effectiveness in tackling the aforementioned challenges. The code will be available at this https URL.

[CV-46] COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation ECCV2024

链接: https://arxiv.org/abs/2408.16426
作者: Jiefeng Li,Ye Yuan,Davis Rempe,Haotian Zhang,Pavlo Molchanov,Cewu Lu,Jan Kautz,Umar Iqbal
关键词-EN: Estimating global human, Estimating global, motion, global human motion, human motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ECCV 2024

点击查看摘要

Abstract:Estimating global human motion from moving cameras is challenging due to the entanglement of human and camera motions. To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss to alleviate the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms the state-of-the-art methods in terms of global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.

[CV-47] Text-Enhanced Zero-Shot Action Recognition: A training-free approach ICPR2024

链接: https://arxiv.org/abs/2408.16412
作者: Massimo Bosetti,Shibingfeng Zhang,Bendetta Liberatori,Giacomo Zara,Elisa Ricci,Paolo Rota
关键词-EN: leveraging joint learning, demonstrated remarkable performance, Vision-language models, leveraging joint, textual representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted to ICPR 2024

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR.
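The training-free recipe reduces to prototype matching in a shared embedding space. Below is a hypothetical NumPy sketch: each action class is summarized by the mean of its descriptor text embeddings, and the video embedding is assigned to the closest prototype by cosine similarity. In practice the embeddings would come from a pretrained VLM; here they are random toys.

```python
import numpy as np

def zero_shot_classify(video_feat, descriptor_feats_per_class):
    """Training-free recognition: each action class is represented by the
    mean of its descriptor text embeddings; the video is assigned to the
    class whose prototype has the highest cosine similarity."""
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    protos = np.stack([norm(d).mean(axis=0) for d in descriptor_feats_per_class])
    scores = norm(protos) @ norm(video_feat)
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(2)
# Toy setup: 3 action classes, a few descriptor embeddings each.
classes = [rng.normal(size=(4, 32)) for _ in range(3)]
video = classes[1].mean(axis=0)    # a video resembling class 1's descriptors
pred, scores = zero_shot_classify(video, classes)
print(pred)  # 1
```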

[CV-48] IBO: Inpainting-Based Occlusion to Enhance Explainable Artificial Intelligence Evaluation in Histopathology

链接: https://arxiv.org/abs/2408.16395
作者: Pardis Afshar,Sajjad Hashembeiki,Pouya Khani,Emad Fatemizadeh,Mohammad Hossein Rohban
关键词-EN: accurate cancer diagnosis, treatment planning, crucial for accurate, accurate cancer, cancer diagnosis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 19 pages, 6 figures

点击查看摘要

Abstract:Histopathological image analysis is crucial for accurate cancer diagnosis and treatment planning. While deep learning models, especially convolutional neural networks, have advanced this field, their “black-box” nature raises concerns about interpretability and trustworthiness. Explainable Artificial Intelligence (XAI) techniques aim to address these concerns, but evaluating their effectiveness remains challenging. A significant issue with current occlusion-based XAI methods is that they often generate Out-of-Distribution (OoD) samples, leading to inaccurate evaluations. In this paper, we introduce Inpainting-Based Occlusion (IBO), a novel occlusion strategy that utilizes a Denoising Diffusion Probabilistic Model to inpaint occluded regions in histopathological images. By replacing cancerous areas with realistic, non-cancerous tissue, IBO minimizes OoD artifacts and preserves data integrity. We evaluate our method on the CAMELYON16 dataset through two phases: first, by assessing perceptual similarity using the Learned Perceptual Image Patch Similarity (LPIPS) metric, and second, by quantifying the impact on model predictions through Area Under the Curve (AUC) analysis. Our results demonstrate that IBO significantly improves perceptual fidelity, achieving nearly twice the improvement in LPIPS scores compared to the best existing occlusion strategy. Additionally, IBO increased the precision of XAI performance prediction from 42% to 71% compared to traditional methods. These results demonstrate IBO’s potential to provide more reliable evaluations of XAI techniques, benefiting histopathology and other applications. The source code for this study is available at this https URL.

[CV-49] Exploiting temporal information to detect conversational groups in videos and predict the next speaker

链接: https://arxiv.org/abs/2408.16380
作者: Lucrezia Tosato,Victor Fortier,Isabelle Bloch,Catherine Pelachaud
关键词-EN: social interactions, human human interaction, introduced the concept, describe the spatial, spatial arrangement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to Pattern Recognition Letter, 8 pages, 10 figures

点击查看摘要

Abstract:Studies in human-human interaction have introduced the concept of F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives. It aims at detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits time information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach makes use of a recurrent neural network, the Long Short Term Memory (LSTM), to predict who will take the speaker’s turn in a conversation group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
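The prediction pipeline (per-frame engagement features fed to an LSTM, then a linear head scoring each participant as the next speaker) can be sketched with a minimal NumPy LSTM cell. Everything below is illustrative: random untrained weights, invented feature dimensions, and a toy four-person group.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Minimal LSTM that consumes a sequence of per-frame engagement
    feature vectors and outputs next-speaker scores per participant.
    Weights are random here; in practice they would be trained."""

    def __init__(self, in_dim, hid_dim, n_people, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell, output gates.
        self.W = rng.normal(scale=0.1, size=(4 * hid_dim, in_dim + hid_dim))
        self.b = np.zeros(4 * hid_dim)
        self.W_out = rng.normal(scale=0.1, size=(n_people, hid_dim))
        self.hid_dim = hid_dim

    def forward(self, seq):
        h = np.zeros(self.hid_dim)
        c = np.zeros(self.hid_dim)
        for x in seq:                       # one engagement vector per frame
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return self.W_out @ h               # next-speaker logits

model = TinyLSTM(in_dim=6, hid_dim=16, n_people=4)
frames = np.random.default_rng(1).normal(size=(20, 6))  # 20 toy frames
logits = model.forward(frames)
print(logits.shape)  # (4,)
```

Taking the argmax of the logits would pick the predicted next speaker.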

[CV-50] Law of Vision Representation in MLLMs

链接: https://arxiv.org/abs/2408.16357
作者: Shijia Yang,Bohan Zhai,Quanzeng You,Jianbo Yuan,Hongxia Yang,Chenfeng Xu
关键词-EN: multimodal large language, Vision Representation, multimodal large, large language models, cross-modal alignment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The code is available at this https URL

点击查看摘要

Abstract:We present the “Law of Vision Representation” in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
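The claimed linear relationship makes representation selection cheap: fit accuracy against AC score on a few evaluated settings, then rank the remaining candidates by predicted accuracy instead of training each one. The NumPy sketch below uses invented numbers purely to illustrate the procedure:

```python
import numpy as np

# Toy AC scores and benchmark accuracies for several vision-representation
# settings (values invented for illustration, not taken from the paper).
ac_scores = np.array([0.41, 0.48, 0.55, 0.60, 0.66, 0.72])
accuracy  = np.array([52.0, 55.5, 58.9, 61.2, 64.0, 67.1])

# Least-squares linear fit: accuracy ≈ slope * AC + intercept.
slope, intercept = np.polyfit(ac_scores, accuracy, deg=1)

# Pearson correlation quantifies how linear the relationship is.
r = np.corrcoef(ac_scores, accuracy)[0, 1]
print(round(r, 3))  # close to 1 for this toy data

# Rank candidate settings by predicted accuracy without training each one.
candidates = np.array([0.50, 0.63, 0.70])
predicted = slope * candidates + intercept
best = int(np.argmax(predicted))
print(best)  # 2  (the highest-AC candidate)
```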

[CV-51] Toward Robust Early Detection of Alzheimer's Disease via an Integrated Multimodal Learning Approach

链接: https://arxiv.org/abs/2408.16343
作者: Yifei Chen,Shenghao Zhu,Zhaojie Fang,Chang Liu,Binfeng Zou,Yuhe Wang,Shuo Chang,Fan Jia,Feiwei Qin,Jin Fan,Yong Peng,Changmiao Wang
关键词-EN: Alzheimer Disease, complex neurodegenerative disorder, neurodegenerative disorder marked, executive dysfunction, memory loss
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Alzheimer’s Disease (AD) is a complex neurodegenerative disorder marked by memory loss, executive dysfunction, and personality changes. Early diagnosis is challenging due to subtle symptoms and varied presentations, often leading to misdiagnosis with traditional unimodal diagnostic methods due to their limited scope. This study introduces an advanced multimodal classification model that integrates clinical, cognitive, neuroimaging, and EEG data to enhance diagnostic accuracy. The model incorporates a feature tagger with a tabular data coding architecture and utilizes the TimesBlock module to capture intricate temporal patterns in Electroencephalograms (EEG) data. By employing Cross-modal Attention Aggregation module, the model effectively fuses Magnetic Resonance Imaging (MRI) spatial information with EEG temporal data, significantly improving the distinction between AD, Mild Cognitive Impairment, and Normal Cognition. Simultaneously, we have constructed the first AD classification dataset that includes three modalities: EEG, MRI, and tabular data. Our innovative approach aims to facilitate early diagnosis and intervention, potentially slowing the progression of AD. The source code and our private ADMC dataset are available at this https URL.

[CV-52] P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising ECCV2024

链接: https://arxiv.org/abs/2408.16325
作者: Mathias Vogel,Keisuke Tateno,Marc Pollefeys,Federico Tombari,Marie-Julie Rakotosaona,Francis Engelmann
关键词-EN: adapts Diffusion Schrödinger, Diffusion Schrödinger bridges, Diffusion Schrödinger, adapts Diffusion, Schrödinger bridges
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 Project page: this https URL

点击查看摘要

Abstract:In this work, we tackle the task of point cloud denoising through a novel framework that adapts Diffusion Schrödinger bridges to point clouds. Unlike previous approaches that predict point-wise displacements from point features or learned noise distributions, our method learns an optimal transport plan between paired point clouds. Experiments on object datasets like PU-Net and real-world datasets such as ScanNet++ and ARKitScenes show that P2P-Bridge achieves significant improvements over existing methods. While our approach demonstrates strong results using only point coordinates, we also show that incorporating additional features, such as color information or point-wise DINOv2 features, further enhances the performance. Code and pretrained models are available at this https URL.

[CV-53] BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autonomous Driving

链接: https://arxiv.org/abs/2408.16322
作者: Manuel Alejandro Diaz-Zapata(CHROMA),Wenqian Liu(CHROMA, UGA),Robin Baruffa(CHROMA),Christian Laugier(CHROMA, E-MOTION, Inria)
关键词-EN: optimizing neural network, Current research, neural network models, driving focuses solely, typically nuScenes
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Current research in semantic bird’s-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice leads to the development of highly specialized models that may fail when faced with different environments or sensor setups, a problem known as domain shift. In this paper, we conduct a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models to assess their performance across different training and testing datasets and setups, as well as different semantic categories. We investigate the influence of different sensors, such as cameras and LiDAR, on the models’ ability to generalize to diverse conditions and scenarios. Additionally, we conduct multi-dataset training experiments that improve models’ BEV segmentation performance compared to single-dataset training. Our work addresses the gap in evaluating BEV segmentation models under cross-dataset validation, and our findings underscore the importance of enhancing model generalizability and adaptability to ensure more robust and reliable BEV segmentation approaches for autonomous driving applications.

[CV-54] ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding ACM-MM2024

链接: https://arxiv.org/abs/2408.16314
作者: Minghang Zheng,Jiahua Zhang,Qingchao Chen,Yuxin Peng,Yang Liu
关键词-EN: natural language query, Visual grounding aims, Visual grounding, Semantic-sensitive Visual Grounding, language query
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model’s understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of target objects described in queries. Secondly, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category and pseudo queries based on their spatial relationships. The proposed ResVG model significantly improves the model’s ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments to validate the effectiveness of our methods on five datasets. Code is available at this https URL.

[CV-55] FA-YOLO: Research On Efficient Feature Selection YOLO Improved Algorithm Based On FMDS and AGMF Modules

链接: https://arxiv.org/abs/2408.16313
作者: Yukang Huo,Mingyuan Yao,Qingbin Tian,Tonghao Wang,Ruifeng Wang,Haihua Wang
关键词-EN: YOLO series, FMDS Module, Module, AGMF Module, FMDS Module branch
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages and 4 figures

点击查看摘要

Abstract:Over the past few years, the YOLO series of models has emerged as one of the dominant methodologies in the realm of object detection. Many studies have advanced these baseline models by modifying their architectures, enhancing data quality, and developing new loss functions. However, current models still exhibit deficiencies in processing feature maps, such as overlooking the fusion of cross-scale features and a static fusion approach that lacks the capability for dynamic feature adjustment. To address these issues, this paper introduces an efficient Fine-grained Multi-scale Dynamic Selection Module (FMDS Module), which applies a more effective dynamic feature selection and fusion method on fine-grained multi-scale feature maps, significantly enhancing the detection accuracy of small, medium, and large-sized targets in complex environments. Furthermore, this paper proposes an Adaptive Gated Multi-branch Focus Fusion Module (AGMF Module), which utilizes multiple parallel branches to perform complementary fusion of various features captured by the gated unit branch, FMDS Module branch, and TripletAttention branch. This approach further enhances the comprehensiveness, diversity, and integrity of feature fusion. This paper integrates the FMDS Module and the AGMF Module into YOLOv9 to develop a novel object detection model named FA-YOLO. Extensive experimental results show that under identical experimental conditions, FA-YOLO achieves an outstanding 66.1% mean Average Precision (mAP) on the PASCAL VOC 2007 dataset, representing a 1.0% improvement over YOLOv9’s 65.1%. Additionally, the detection accuracies of FA-YOLO for small, medium, and large targets are 44.1%, 54.6%, and 70.8%, respectively, showing improvements of 2.0%, 3.1%, and 0.9% compared to YOLOv9’s 42.1%, 51.5%, and 69.9%.

[CV-56] Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning ECCV2024

链接: https://arxiv.org/abs/2408.16310
作者: Luyao Tang,Yuxuan Yuan,Chaoqi Chen,Kunze Huang,Xinghao Ding,Yue Huang
关键词-EN: leveraging prompt engineering, made incredible strides, leveraging prompt, made incredible, incredible strides
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work is accepted by ECCV 2024 EVAL-FoMo Workshop

点击查看摘要

Abstract:Foundation models have made incredible strides in achieving zero-shot or few-shot generalization, leveraging prompt engineering to mimic the problem-solving approach of human intelligence. However, when it comes to some foundation models like Segment Anything, there is still a challenge in performing well on out-of-distribution data, including camouflaged and medical images. Inconsistent prompting strategies during fine-tuning and testing further compound the issue, leading to decreased performance. Drawing inspiration from how human cognition processes new environments, we introduce SlotSAM, a method that reconstructs features from the encoder in a self-supervised manner to create object-centric representations. These representations are then integrated into the foundation model, bolstering its object-level perceptual capabilities while reducing the impact of distribution-related variables. The beauty of SlotSAM lies in its simplicity and adaptability to various tasks, making it a versatile solution that significantly enhances the generalization abilities of foundation models. Through limited parameter fine-tuning in a bootstrap manner, our approach paves the way for improved generalization in novel environments. The code is available at this http URL.

[CV-57] Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach

链接: https://arxiv.org/abs/2408.16305
作者: Mian Zou,Baosheng Yu,Yibing Zhan,Siwei Lyu,Kede Ma
关键词-EN: recent years, multimedia forensics, forensics and security, security community, remarkable progress
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, the multimedia forensics and security community has seen remarkable progress in multitask learning for DeepFake (i.e., face forgery) detection. The prevailing strategy has been to frame DeepFake detection as a binary classification problem augmented by manipulation-oriented auxiliary tasks. This strategy focuses on learning features specific to face manipulations, which exhibit limited generalizability. In this paper, we delve deeper into semantics-oriented multitask learning for DeepFake detection, leveraging the relationships among face semantics via joint embedding. We first propose an automatic dataset expansion technique that broadens current face forgery datasets to support semantics-oriented DeepFake detection tasks at both the global face attribute and local face region levels. Furthermore, we resort to joint embedding of face images and their corresponding labels (depicted by textual descriptions) for prediction. This approach eliminates the need for manually setting task-agnostic and task-specific parameters typically required when predicting labels directly from images. In addition, we employ a bi-level optimization strategy to dynamically balance the fidelity loss weightings of various tasks, making the training process fully automated. Extensive experiments on six DeepFake datasets show that our method improves the generalizability of DeepFake detection and, meanwhile, renders some degree of model interpretation by providing human-understandable explanations.

[CV-58] Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models ECCV2024

链接: https://arxiv.org/abs/2408.16296
作者: Kengo Nakata,Daisuke Miyashita,Youyang Ng,Yasuto Hoshi,Jun Deguchi
关键词-EN: rethink sparse lexical, sparse lexical representations, lexical representations, image retrieval, image
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: Accepted to ECCV 2024 Workshops: 2nd Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV)

点击查看摘要

Abstract:In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
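The retrieval setup described here, sparse lexical matching over keywords extracted from images, can be sketched with a plain inverted index. The image IDs and keywords below are invented for illustration; in the paper's actual pipeline an M-LLM produces the keywords from the images:

```python
from collections import defaultdict

def build_index(image_keywords):
    """image_keywords: dict image_id -> list of keyword strings
    (e.g. extracted from images by an M-LLM)."""
    index = defaultdict(set)
    for image_id, words in image_keywords.items():
        for w in words:
            index[w.lower()].add(image_id)
    return index

def search(index, query_keywords):
    """Rank images by how many query keywords they match."""
    scores = defaultdict(int)
    for w in query_keywords:
        for image_id in index.get(w.lower(), ()):
            scores[image_id] += 1
    return sorted(scores, key=lambda i: (-scores[i], i))

# Toy corpus: keyword lists standing in for M-LLM outputs.
corpus = {
    "img1": ["dog", "grass", "park"],
    "img2": ["cat", "sofa"],
    "img3": ["dog", "beach"],
}
idx = build_index(corpus)
print(search(idx, ["dog", "park"]))  # → ['img1', 'img3']
```

Iteratively appending matched keywords to the query, as the abstract suggests, would simply re-run `search` with an expanded keyword list.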

[CV-59] Convolutional Neural Network Compression Based on Low-Rank Decomposition

链接: https://arxiv.org/abs/2408.16289
作者: Yaping He,Linhao Jiang,Di Wu
关键词-EN: impose significant computational, significant computational loads, Deep neural networks, Deep neural, memory consumption
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 1 figures

点击查看摘要

Abstract:Deep neural networks typically impose significant computational loads and memory consumption. Moreover, the large parameters pose constraints on deploying the model on edge devices such as embedded systems. Tensor decomposition offers a clear advantage in compressing large-scale weight tensors. Nevertheless, direct utilization of low-rank decomposition typically leads to significant accuracy loss. This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF) with orthogonal regularization. Initially, the model undergoes over-parameterization and training, with orthogonal regularization applied to enhance its likelihood of achieving the accuracy of the original model. Secondly, VBMF is employed to estimate the rank of the weight tensor at each layer. Our framework is sufficiently general to apply to other convolutional neural networks and easily adaptable to incorporate other tensor decomposition methods. Experimental results show that for both high and low compression ratios, our compression model exhibits advanced performance.
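As background for the compression step, here is a minimal sketch of low-rank factorization of a (flattened) weight matrix together with an orthogonality penalty of the kind the abstract mentions. It uses plain truncated SVD with a given rank rather than VBMF rank estimation, so it is only an approximation of the paper's procedure:

```python
import numpy as np

def low_rank_factors(W, rank):
    """Compress W (m x n) into factors A (m x r) and B (r x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

def orthogonal_penalty(W):
    """||W W^T - I||_F^2, a regularizer nudging rows toward orthonormality."""
    G = W @ W.T
    return float(np.sum((G - np.eye(G.shape[0])) ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 16))  # exactly rank 3
A, B = low_rank_factors(W, rank=3)
print(np.allclose(A @ B, W))  # → True
```

In practice the rank would come from VBMF per layer, and the two factors replace one convolution with two cheaper ones.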

[CV-60] SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models

链接: https://arxiv.org/abs/2408.16273
作者: Guangxi Li,Yinsheng Song,Mingkai Zheng
关键词-EN: considerable challenge due, image recognition pose, dominant classes, classes with numerous, minority classes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages

点击查看摘要

Abstract:Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant classes with numerous examples and many minority classes with few samples. Recently, the use of large generative models to create synthetic data for image classification has been realized, but utilizing synthetic data to address the challenge of long-tailed recognition remains relatively unexplored. In this work, we propose the use of synthetic data as a complement to long-tailed datasets to eliminate the impact of data imbalance. To tackle this real-synthetic mixed dataset, we designed a two-branch model that contains Synthetic-Aware and Unaware branches (SAU). The core ideas are (1) a synthetic-unaware branch for classification that mixes real and synthetic data and treats all data equally without distinguishing between them. (2) A synthetic-aware branch for improving the robustness of the feature extractor by distinguishing between real and synthetic data and learning their discrepancies. Extensive experimental results demonstrate that our method can improve the accuracy of long-tailed image recognition. Notably, our approach achieves state-of-the-art Top-1 accuracy and significantly surpasses other methods on CIFAR-10-LT and CIFAR-100-LT datasets across various imbalance factors. Our code is available at this https URL.

[CV-61] Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding

链接: https://arxiv.org/abs/2408.16272
作者: Kaijing Ma,Haojian Huang,Jin Chen,Haodong Chen,Pengliang Ji,Xianghao Zang,Han Fang,Chao Ban,Hao Sun,Mulin Chen,Xuelong Li
关键词-EN: Existing Video Temporal, Video Temporal Grounding, Temporal Grounding, Existing Video, overlook open-world challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Ongoing work: 28pages, 19 figures, 7 tables. Code is available at: https://kaijing.space/SRAM/

点击查看摘要

Abstract:Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say “I do not know” in scenarios beyond its handling capacity. However, the direct application of traditional DER theory and its regularizer reveals structural flaws, leading to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful attempt of DER in VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.
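The DER component mentioned here builds on a Normal-Inverse-Gamma evidential head. As a sketch, this is the standard per-sample negative log-likelihood from Amini et al.'s Deep Evidential Regression, not the paper's modified Geom-regularized variant:

```python
import math

def evidential_nll(y, gamma, nu, alpha, beta):
    """NLL of the Normal-Inverse-Gamma evidential head (standard DER form):
    gamma is the prediction; nu, alpha, beta parameterize the evidence."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

# The loss shrinks as the prediction gamma approaches the target y.
close = evidential_nll(y=1.0, gamma=1.1, nu=1.0, alpha=2.0, beta=1.0)
far = evidential_nll(y=1.0, gamma=3.0, nu=1.0, alpha=2.0, beta=1.0)
print(close < far)  # → True
```

The evidence parameters also yield an uncertainty estimate, which is what lets a model say "I do not know" on out-of-distribution inputs.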

[CV-62] UDD: Dataset Distillation via Mining Underutilized Regions

链接: https://arxiv.org/abs/2408.16268
作者: Shiguang Wang,Zhongyu Zhang,Jian Cheng
关键词-EN: Dataset distillation synthesizes, underutilized regions, Dataset distillation, dataset distillation focused, synthesizes a small
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: PRCV2024

点击查看摘要

Abstract:Dataset distillation synthesizes a small dataset such that a model trained on this set approximates the performance of the original dataset. Recent studies on dataset distillation focused primarily on the design of the optimization process, with methods such as gradient matching, feature alignment, and training trajectory matching. However, little attention has been given to the issue of underutilized regions in synthetic images. In this paper, we propose UDD, a novel approach to identify and exploit the underutilized regions to make them informative and discriminative, and thus improve the utilization of the synthetic dataset. Technically, UDD involves two underutilized-region searching policies for different conditions, i.e., a response-based policy and a data jittering-based policy. Compared with previous works, these two policies are utilization-sensitive, equipping the model with the ability to dynamically adjust the underutilized regions during the training process. Additionally, we analyze the current model optimization problem and design a category-wise feature contrastive loss, which can enhance the distinguishability of different categories and alleviate the shortcomings of the existing multi-formation methods. Experimentally, our method improves the utilization of the synthetic dataset and outperforms the state-of-the-art methods on various datasets, such as MNIST, FashionMNIST, SVHN, CIFAR-10, and CIFAR-100. For example, the improvements on CIFAR-10 and CIFAR-100 are 4.0% and 3.7% over the next best method with IPC=1, by mining the underutilized regions.
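The response-based searching policy can be illustrated in miniature: given a per-pixel response map, pixels whose activation falls below a quantile threshold are flagged as underutilized. The threshold and toy map below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def underutilized_mask(response_map, quantile=0.25):
    """Response-based policy sketch: flag pixels whose response magnitude
    falls below the given quantile of the map as underutilized."""
    thresh = np.quantile(response_map, quantile)
    return response_map < thresh

# Toy 3x3 response map; low values mark regions the model barely uses.
resp = np.array([[0.90, 0.10, 0.80],
                 [0.05, 0.70, 0.60],
                 [0.02, 0.50, 0.40]])
mask = underutilized_mask(resp, quantile=0.25)
print(int(mask.sum()))  # → 2
```

A distillation step would then concentrate updates on the masked pixels so they become informative rather than wasted capacity.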

[CV-63] Improving Diffusion-based Data Augmentation with Inversion Spherical Interpolation

链接: https://arxiv.org/abs/2408.16266
作者: Yanghao Wang,Long Chen
关键词-EN: Data Augmentation, original training set, visual recognition tasks, synthesizing faithful, training set
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve various visual recognition tasks. With the powerful image generation ability, diffusion-based DA has shown strong performance gains on different benchmarks. In this paper, we analyze today’s diffusion-based DA methods, and argue that they cannot take account of both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting final classification performance. To this end, we propose a novel Diffusion-based Inversion Interpolation DA method: Diff-II. Specifically, Diff-II consists of three main steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting spherical interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on multiple image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.
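Step 2, spherical interpolation between two inversions, is standard slerp. A minimal sketch follows; the 2-D vectors are toy stand-ins for diffusion inversion latents:

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two latent vectors, t in [0, 1]."""
    v0, v1 = np.asarray(v0, float), np.asarray(v1, float)
    cos = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    if np.isclose(theta, 0.0):            # nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
print(np.round(mid, 4))  # → [0.7071 0.7071]
```

Unlike linear interpolation, slerp keeps intermediate latents near the sphere where diffusion latents live, which is why it is the common choice here.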

[CV-64] Low Saturation Confidence Distribution-based Test-Time Adaptation for Cross-Domain Remote Sensing Image Classification

链接: https://arxiv.org/abs/2408.16265
作者: Yu Liang,Xiucheng Zhang,Juepeng Zheng,Jianxi Huang,Haohuan Fu
关键词-EN: Unsupervised Domain Adaptation, Source-free Domain Adaptation, test time adaptation, Domain Adaptation, Unsupervised Domain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Although the Unsupervised Domain Adaptation (UDA) method has improved the effect of remote sensing image classification tasks, most of them are still limited by access to the source domain (SD) data. Designs such as Source-free Domain Adaptation (SFDA) solve the challenge of a lack of SD data, however, they still rely on a large amount of target domain data and thus cannot achieve fast adaptations, which seriously hinders their further application in broader scenarios. The real-world applications of cross-domain remote sensing image classification require a balance of speed and accuracy at the same time. Therefore, we propose a novel and comprehensive test time adaptation (TTA) method – Low Saturation Confidence Distribution Test Time Adaptation (LSCD-TTA), which is the first attempt to solve such scenarios through the idea of TTA. LSCD-TTA specifically considers the distribution characteristics of remote sensing images, including three main parts that concentrate on different optimization directions: First, low saturation distribution (LSD) considers the dominance of low-confidence samples during the later TTA stage. Second, weak-category cross-entropy (WCCE) increases the weight of categories that are more difficult to classify with less prior knowledge. Finally, diverse categories confidence (DIV) comprehensively considers the category diversity to alleviate the deviation of the sample distribution. By weighting the abovementioned three modules, the model can widely, quickly and accurately adapt to the target domain without requiring much prior knowledge of the target distribution, repeated data access, or manual annotation. We evaluate LSCD-TTA on three remote-sensing image datasets. The experimental results show that LSCD-TTA achieves a significant gain of 4.96%-10.51% with Resnet-50 and 5.33%-12.49% with Resnet-101 in average accuracy compared to other state-of-the-art DA and TTA methods.

[CV-65] Advancing Architectural Floorplan Design with Geometry-enhanced Graph Diffusion

链接: https://arxiv.org/abs/2408.16258
作者: Sizhe Hu,Wenming Wu,Yuntao Wang,Benzhu Xu,Liping Zheng
关键词-EN: Automating architectural floorplan, Automating architectural, offering a faster, cost-effective alternative, sketches by architects
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automating architectural floorplan design is vital for housing and interior design, offering a faster, cost-effective alternative to manual sketches by architects. However, existing methods, including rule-based and learning-based approaches, face challenges in design complexity and constrained generation with extensive post-processing, and tend to produce obvious geometric inconsistencies such as misalignment, overlap, and gaps. In this work, we propose a novel generative framework for vector floorplan design via structural graph generation, called GSDiff, focusing on wall junction generation and wall segment prediction to capture both geometric and semantic aspects of structural graphs. To improve the geometric rationality of generated structural graphs, we propose two innovative geometry enhancement methods. In wall junction generation, we propose a novel alignment loss function to improve geometric consistency. In wall segment prediction, we propose a random self-supervision method to enhance the model’s perception of the overall geometric structure, thereby promoting the generation of reasonable geometric structures. Employing the diffusion model and the Transformer model, as well as the geometry enhancement strategies, our framework can generate wall junctions, wall segments and room polygons with structural and semantic information, resulting in structural graphs that accurately represent floorplans. Extensive experiments show that the proposed method surpasses existing techniques, enabling free generation and constrained generation, marking a shift towards structure generation in architectural design.

[CV-66] EvLight: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset Novel Method and More

链接: https://arxiv.org/abs/2408.16254
作者: Kanghao Chen,Guoqiang Liang,Hangyu Li,Yunfan Lu,Lin Wang
关键词-EN: cameras offer significant, offer significant advantages, Event cameras offer, high dynamic range, primarily due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Journal extension based on EvLight ( arXiv:2404.00834 )

点击查看摘要

Abstract:Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This dataset was curated using a robotic arm that traces a consistent non-linear trajectory, achieving spatial alignment precision under 0.03mm and temporal alignment with errors under 0.01s for 90% of the dataset. Based on the dataset, we propose EvLight++, a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios. Firstly, we design a multi-scale holistic fusion branch to integrate structural and textural information from both images and events. To counteract variations in regional illumination and noise, we introduce Signal-to-Noise Ratio (SNR)-guided regional feature selection, enhancing features from high SNR regions and augmenting those from low SNR regions by extracting structural information from events. To incorporate temporal information and ensure temporal coherence, we further introduce a recurrent module and temporal loss in the whole pipeline. Extensive experiments on our dataset and the synthetic SDSD dataset demonstrate that EvLight++ significantly outperforms both single image- and video-based methods by 1.37 dB and 3.71 dB, respectively. To further explore its potential in downstream tasks like semantic segmentation and monocular depth estimation, we extend our datasets by adding pseudo segmentation and depth labels via meticulous annotation efforts with foundation models. Experiments under diverse low-light scenes show that the enhanced results achieve a 15.97% improvement in mIoU for semantic segmentation.
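The SNR-guided regional feature selection can be caricatured as follows: estimate a per-pixel SNR map, then keep image-branch features where SNR is high and fall back to event-branch features elsewhere. The box-filter SNR estimate and the hard threshold below are illustrative simplifications of the paper's learned mechanism:

```python
import numpy as np

def snr_map(image, kernel=3):
    """Rough per-pixel SNR estimate: signal = local mean (box filter),
    noise = |image - local mean| (plus a small epsilon)."""
    pad = kernel // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    h, w = image.shape
    local_mean = np.array([[padded[i:i + kernel, j:j + kernel].mean()
                            for j in range(w)] for i in range(h)])
    return local_mean / (np.abs(image - local_mean) + 1e-6)

def snr_guided_fuse(feat_img, feat_evt, snr, thresh=1.0):
    """Keep image-branch features in high-SNR regions, event-branch elsewhere."""
    mask = (snr > thresh).astype(float)
    return mask * feat_img + (1.0 - mask) * feat_evt

# A noise-free (constant) region has very high SNR, so the image branch wins.
smooth = np.ones((4, 4))
fused = snr_guided_fuse(np.full((4, 4), 2.0), np.zeros((4, 4)), snr_map(smooth))
print(np.allclose(fused, 2.0))  # → True
```

In the actual model the selection is soft and learned, but the intuition is the same: trust pixels only where the signal dominates the noise.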

[CV-67] Anno-incomplete Multi-dataset Detection

链接: https://arxiv.org/abs/2408.16247
作者: Yiran Xu,Haoxiang Zhong,Kai Wu,Jialin Li,Yong Liu,Chengjie Wang,Shu-Tao Xia,Hongen Liao
关键词-EN: shown outstanding performance, detectors have shown, shown outstanding, outstanding performance, Annotation-incomplete Multi-dataset Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Object detectors have shown outstanding performance on various public datasets. However, annotating a new dataset for a new task is usually unavoidable in practice, since 1) a single existing dataset usually does not contain all object categories needed; 2) using multiple datasets usually suffers from annotation incompletion and heterogeneous features. We propose a novel problem as “Annotation-incomplete Multi-dataset Detection”, and develop an end-to-end multi-task learning architecture which can accurately detect all the object categories with multiple partially annotated datasets. Specifically, we propose an attention feature extractor which helps to mine the relations among different datasets. Besides, a knowledge amalgamation training strategy is incorporated to accommodate heterogeneous features from different sources. Extensive experiments on different object detection datasets demonstrate the effectiveness of our methods, and improvements of 2.17% and 2.10% in mAP can be achieved on COCO and VOC, respectively.

[CV-68] Neural Spectral Decomposition for Dataset Distillation ECCV2024

链接: https://arxiv.org/abs/2408.16236
作者: Shaolei Yang,Shen Cheng,Mingbo Hong,Haoqiang Fan,Xing Wei,Shuaicheng Liu
关键词-EN: Neural Spectrum Decomposition, generic decomposition framework, propose Neural Spectrum, propose Neural, generic decomposition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:In this paper, we propose Neural Spectrum Decomposition, a generic decomposition framework for dataset distillation. Unlike previous methods, we consider the entire dataset as a high-dimensional observation that is low-rank across all dimensions. We aim to discover the low-rank representation of the entire dataset and perform distillation efficiently. Toward this end, we learn a set of spectrum tensors and transformation matrices, which, through simple matrix multiplication, reconstruct the data distribution. Specifically, a spectrum tensor can be mapped back to the image space by a transformation matrix, and efficient information sharing during the distillation learning process is achieved through pairwise combinations of different spectrum vectors and transformation matrices. Furthermore, we integrate a trajectory matching optimization method guided by a real distribution. Our experimental results demonstrate that our approach achieves state-of-the-art performance on benchmarks, including CIFAR10, CIFAR100, Tiny ImageNet, and ImageNet Subset. Our code is available at this https URL.
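The reconstruction step, mapping spectrum vectors back to image space through transformation matrices and sharing information via pairwise combinations, reduces to matrix multiplication. A toy sketch with made-up dimensions (random factors standing in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                                 # flattened image dim, spectrum dim
spectra = rng.standard_normal((2, r))        # 2 learned spectrum vectors
transforms = rng.standard_normal((2, d, r))  # 2 learned transformation matrices

# Pairwise combinations: every (transform, spectrum) pair yields one
# synthetic sample, so 2 spectra x 2 transforms give 4 distilled "images".
images = np.stack([T @ s for T in transforms for s in spectra])
print(images.shape)  # → (4, 16)
```

The appeal is storage: the factors hold far fewer parameters than the reconstructed samples, yet recombining them multiplies the number of distinct distilled images.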

[CV-69] LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement

链接: https://arxiv.org/abs/2408.16235
作者: Ye Yu,Fengxin Chen,Jun Yu,Zhen Kan
关键词-EN: made significant advancements, low-light image enhancement, recent low-light image, low visual quality, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While recent low-light image enhancement (LLIE) methods have made significant advancements, they still face challenges in terms of low visual quality and weak generalization ability when applied to complex scenarios. To address these issues, we propose a semi-supervised method based on latent mean-teacher and Gaussian process, named LMT-GP. We first design a latent mean-teacher framework that integrates both labeled and unlabeled data, as well as their latent vectors, into model training. Meanwhile, we use a mean-teacher-assisted Gaussian process learning strategy to establish a connection between the latent and pseudo-latent vectors obtained from the labeled and unlabeled data. To guide the learning process, we utilize an assisted Gaussian process regression (GPR) loss function. Furthermore, we design a pseudo-label adaptation module (PAM) to ensure the reliability of the network learning. To demonstrate our method’s generalization ability and effectiveness, we apply it to multiple LLIE datasets and high-level vision tasks. Experiment results demonstrate that our method achieves high generalization performance and image quality. The code is available at this https URL.
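The mean-teacher part of such a framework conventionally maintains the teacher as an exponential moving average (EMA) of the student. A minimal sketch on plain parameter dictionaries; the decay value is a common default, not necessarily the paper's:

```python
def ema_update(teacher, student, decay=0.99):
    """Mean-teacher sketch: each teacher weight is an exponential moving
    average of the corresponding student weight."""
    return {name: decay * teacher[name] + (1 - decay) * student[name]
            for name in teacher}

teacher = {"w": 1.0}
student = {"w": 2.0}
teacher = ema_update(teacher, student, decay=0.9)
print(round(teacher["w"], 2))  # → 1.1
```

Because the teacher averages many student states, its latent vectors are smoother targets, which is what the Gaussian-process regression in LMT-GP builds on.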

[CV-70] PSE-Net: Channel Pruning for Convolutional Neural Networks with Parallel-subnets Estimator

链接: https://arxiv.org/abs/2408.16233
作者: Shiguang Wang,Tao Xie,Haijun Liu,Xingcheng Zhang,Jian Cheng
关键词-EN: compress deep neural, deep neural networks, Channel Pruning, widespread techniques, compress deep
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10pages, Neural Networks

点击查看摘要

Abstract:Channel Pruning is one of the most widespread techniques used to compress deep neural networks while maintaining their performance. Currently, a typical pruning algorithm leverages neural architecture search to directly find networks with a configurable width, the key step of which is to identify representative subnets for various pruning ratios by training a supernet. However, current methods mainly follow a serial training strategy to optimize the supernet, which is very time-consuming. In this work, we introduce PSE-Net, a novel parallel-subnets estimator for efficient channel pruning. Specifically, we propose a parallel-subnets training algorithm that simulates the forward-backward pass of multiple subnets by dropping extraneous features on the batch dimension, so that various subnets can be trained in one round. Our proposed algorithm facilitates the efficiency of supernet training and equips the network with the ability to interpolate the accuracy of unsampled subnets, enabling PSE-Net to effectively evaluate and rank the subnets. Over the trained supernet, we develop a prior-distribution-based sampling algorithm to boost the performance of classical evolutionary search. This algorithm utilizes the prior information of the supernet training phase to assist in the search for optimal subnets while tackling the challenge of discovering samples that satisfy resource constraints due to the long-tail distribution of network configurations. Extensive experiments demonstrate that PSE-Net outperforms previous state-of-the-art channel pruning methods on the ImageNet dataset while retaining superior supernet training efficiency. For example, under the 300M FLOPs constraint, our pruned MobileNetV2 achieves 75.2% Top-1 accuracy on the ImageNet dataset, exceeding the original MobileNetV2 by 2.6 points while costing only 30%/16% of the time of BCNet/AutoAlim.
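The batch-dimension trick described above can be sketched as follows: disjoint slices of one batch are assigned to different subnet widths, and channels beyond each subnet's width are zeroed out, so several subnets share one forward-backward pass. This is a hypothetical toy, not the PSE-Net implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def parallel_subnet_masks(batch, width_ratios):
    """Simulate several subnets in one pass by masking extraneous channels
    on disjoint slices of the batch dimension (illustrative sketch).

    batch: (N, C) features; width_ratios: one channel-keep ratio per subnet.
    """
    n, c = batch.shape
    out = batch.copy()
    chunk = n // len(width_ratios)
    for i, r in enumerate(width_ratios):
        keep = int(round(c * r))
        # Drop channels beyond this subnet's width for its batch slice.
        out[i * chunk:(i + 1) * chunk, keep:] = 0.0
    return out

feats = rng.standard_normal((8, 10))
masked = parallel_subnet_masks(feats, [1.0, 0.5])
# First half of the batch keeps all 10 channels, second half keeps 5.
```

One backbone pass over `masked` then yields gradients for both the full-width and the half-width subnet at once.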

[CV-71] Enhancing Conditional Image Generation with Explainable Latent Space Manipulation

链接: https://arxiv.org/abs/2408.16232
作者: Kshitij Pathania
关键词-EN: gradient-based selective attention, Selective Attention Manipulation, gradient-based selective, selective attention, selective attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages , 5 figures

点击查看摘要

Abstract:In the realm of image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes a novel approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross-attention maps of the cross-attention layers and the gradients of the denoised latent vector, deriving importance scores for the elements of the denoised latent vector related to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on the Places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Frechet Inception Distance (FID) scores compared to baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation, offering a significant advancement in text-to-image synthesis tasks.
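The mask-derivation step above can be illustrated with a tiny gradient-weighted attention sketch: weight each attention element by its gradient, keep the positive part as an importance score, and threshold to obtain a binary subject mask. The weighting rule and the threshold are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def grad_sam_mask(attn, grad, quantile=0.7):
    """Toy gradient-weighted attention mask: relevance = relu(attn * grad),
    thresholded at a quantile to yield a binary subject mask."""
    importance = np.maximum(attn * grad, 0.0)   # gradient-weighted relevance
    thresh = np.quantile(importance, quantile)  # keep the top share
    return importance >= thresh

attn = np.array([[0.9, 0.1], [0.2, 0.8]])
grad = np.array([[1.0, -1.0], [0.5, 1.0]])
mask = grad_sam_mask(attn, grad, quantile=0.5)
```

During denoising, such a mask would select which latent elements to preserve from the reference and which to regenerate from the prompt.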

[CV-72] Revisiting 360 Depth Estimation with PanoGabor: A New Fusion Perspective

链接: https://arxiv.org/abs/2408.16227
作者: Zhijie Shen,Chunyu Lin,Lang Nie,Kang Liao
关键词-EN: images pose great, Gabor filters, Depth estimation, introduce Gabor filters, Gabor
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Depth estimation from a monocular 360 image is important to the perception of the entire 3D environment. However, the inherent distortion and large field of view (FoV) in 360 images pose great challenges for this task. To this end, existing mainstream solutions typically introduce additional perspective-based 360 representations (e.g., Cubemap) to achieve effective feature extraction. Nevertheless, regardless of the introduced representations, they eventually need to be unified into the equirectangular projection (ERP) format for the subsequent depth estimation, which inevitably reintroduces the troublesome distortions. In this work, we propose an oriented distortion-aware Gabor Fusion framework (PGFuse) to address the above challenges. First, we introduce Gabor filters that analyze texture in the frequency domain, thereby extending the receptive fields and enhancing depth cues. To address the reintroduced distortions, we design a linear latitude-aware distortion representation method to generate customized, distortion-aware Gabor filters (PanoGabor filters). Furthermore, we design a channel-wise and spatial-wise unidirectional fusion module (CS-UFM) that integrates the proposed PanoGabor filters to unify other representations into the ERP format, delivering effective and distortion-free features. Considering the orientation sensitivity of the Gabor transform, we introduce a spherical gradient constraint to stabilize this sensitivity. Experimental results on three popular indoor 360 benchmarks demonstrate the superiority of the proposed PGFuse to existing state-of-the-art solutions. Code will be made available upon acceptance.
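A minimal sketch of a latitude-aware Gabor filter: a standard Gabor kernel whose Gaussian envelope is widened by the ERP horizontal stretch factor 1/cos(latitude), used here as a simple linear proxy for the distortion the paper models. The actual PanoGabor parameterization may differ; all parameter values below are assumptions.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam):
    """A standard real Gabor kernel (cosine carrier, Gaussian envelope)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def pano_gabor(size, sigma, theta, lam, latitude):
    """Hypothetical latitude-aware variant: widen the envelope by the
    equirectangular stretch factor 1 / cos(latitude)."""
    stretch = 1.0 / max(np.cos(latitude), 1e-3)  # avoid blow-up at the poles
    return gabor_kernel(size, sigma * stretch, theta, lam)

k_eq = pano_gabor(9, 2.0, 0.0, 4.0, latitude=0.0)        # equator: no stretch
k_hi = pano_gabor(9, 2.0, 0.0, 4.0, latitude=np.pi / 3)  # 60 deg: 2x wider envelope
```

The wider envelope at high latitude mirrors how ERP pixels near the poles cover less of the scene horizontally.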

[CV-73] LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

链接: https://arxiv.org/abs/2408.16224
作者: Jingyi Wang,Jianzhong Ju,Jian Luan,Zhidong Deng
关键词-EN: typically employ vision, employ vision encoders, vision encoders based, Vision Transformer, Recent advances
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in large vision-language models (VLMs) typically employ vision encoders based on the Vision Transformer (ViT) architecture. The division of the images into patches by ViT results in a fragmented perception, thereby hindering the visual understanding capabilities of VLMs. In this paper, we propose an innovative enhancement to address this limitation by introducing a Scene Graph Expression (SGE) module in VLMs. This module extracts and structurally expresses the complex semantic information within images, thereby improving the foundational perception and understanding abilities of VLMs. Extensive experiments demonstrate that integrating our SGE module significantly enhances the VLM’s performance in vision-language tasks, indicating its effectiveness in preserving intricate semantic details and facilitating better visual understanding. Code and data would be available.

[CV-74] Training-free Video Temporal Grounding using Large-scale Pre-trained Models ECCV2024

链接: https://arxiv.org/abs/2408.16219
作者: Minghang Zheng,Xinhao Cai,Qingchao Chen,Yuxin Peng,Yang Liu
关键词-EN: Video temporal grounding, identify video segments, temporal grounding aims, Video temporal, temporal grounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under the across-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use the pre-trained visual language models (VLMs) to select the best proposal according to the vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, making them struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events within the same video; (2) comprehend and be sensitive to the dynamic transition of events (the transition from one event to another) in the video. To address these issues, we first propose leveraging large language models (LLMs) to analyze multiple sub-events contained in the query text and analyze the temporal order and relationships between these events. Second, we split a sub-event into dynamic transition and static status parts and propose the dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and leverage the order and relationships between sub-events provided by LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
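The dynamic/static scoring idea can be sketched with toy per-frame relevance scores: a static score averages relevance inside a proposal, a dynamic score rewards a rise in relevance when entering it, and proposals are ranked by their sum. The scoring formulas are simplified stand-ins for the VLM-based functions in the paper.

```python
def static_score(scores, start, end):
    """Mean per-frame relevance inside [start, end): a stand-in for the
    VLM-based static scoring described in the abstract."""
    seg = scores[start:end]
    return sum(seg) / len(seg)

def dynamic_score(scores, start, end):
    """Rise in relevance when entering the segment, rewarding proposals
    that capture the transition into the event (simplified proxy)."""
    before = sum(scores[:start]) / max(start, 1)
    return static_score(scores, start, end) - before

def best_proposal(scores, proposals):
    """Rank (start, end) proposals by combined static + dynamic score."""
    return max(proposals, key=lambda p: static_score(scores, *p) + dynamic_score(scores, *p))

# Toy per-frame relevance for one sub-event; the true event spans frames 3-5.
frame_scores = [0.1, 0.1, 0.2, 0.9, 0.95, 0.9, 0.2]
best = best_proposal(frame_scores, [(0, 3), (3, 6), (4, 7)])
```

Note how the late proposal (4, 7) loses: its static score is diluted and its boundary misses the transition, which is exactly what the dynamic term penalizes.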

[CV-75] M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

链接: https://arxiv.org/abs/2408.16213
作者: Jonggwon Park,Soobum Kim,Byungmu Yoon,Jihun Hyun,Kyoyun Choi
关键词-EN: large language models, including healthcare, artificial intelligence, impacted various domains, rapid evolution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the multi-tasking capabilities of LLMs or lacking clinical accuracy. This paper presents M4CXR, a multi-modal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multi-image, and multi-study contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and also demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR’s versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.

[CV-76] Uni-3DAD: GAN-Inversion Aided Universal 3D Anomaly Detection on Model-free Products

链接: https://arxiv.org/abs/2408.16201
作者: Jiayu Liu,Shancong Mou,Nathan Gaw,Yinan Wang
关键词-EN: Anomaly detection, Anomaly, manufacturing systems, detection, point clouds
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (i.e., "Cookie", "Potato", etc.), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we propose a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.
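To show why fusing the two modules matters, here is a dependency-free toy: the paper fuses module outputs with a One-Class SVM, but this sketch substitutes a simple max over min-max-normalized scores, so it illustrates only the fusion idea, not the actual method. All score values are made up.

```python
def normalize(scores):
    """Min-max normalize a list of anomaly scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo or 1.0) for s in scores]

def fuse_anomaly_scores(feature_scores, recon_scores):
    """Combine the two detection modules described above; a max over
    normalized scores stands in for the paper's OCSVM fusion."""
    f, r = normalize(feature_scores), normalize(recon_scores)
    return [max(a, b) for a, b in zip(f, r)]

# A dent is only visible to the feature module, a missing piece only to the
# reconstruction module; fusion flags both.
feature_scores = [0.1, 0.9, 0.2, 0.1]   # high on sample 1 (dent)
recon_scores   = [0.1, 0.2, 0.1, 0.8]   # high on sample 3 (missing region)
fused = fuse_anomaly_scores(feature_scores, recon_scores)
```

Either module alone would miss one of the two defects; the fused score flags both samples 1 and 3.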

[CV-77] PolarBEVDet: Exploring Polar Representation for Multi-View 3D Object Detection in Bird's-Eye-View

链接: https://arxiv.org/abs/2408.16200
作者: Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao
关键词-EN: polar BEV representation, Cartesian BEV representation, polar BEV, BEV representation, BEV
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Recently, LSS-based multi-view 3D object detection provides an economical and deployment-friendly solution for autonomous driving. However, all the existing LSS-based methods transform multi-view image features into a Cartesian Bird's-Eye-View (BEV) representation, which does not take into account the non-uniform image information distribution and hardly exploits the view symmetry. In this paper, in order to adapt the image information distribution and preserve the view symmetry by regular convolution, we propose to employ the polar BEV representation to substitute the Cartesian BEV representation. To achieve this, we elaborately tailor three modules: a polar view transformer to generate the polar BEV representation, a polar temporal fusion module for fusing historical polar BEV features and a polar detection head to predict the polar-parameterized representation of the object. In addition, we design a 2D auxiliary detection head and a spatial attention enhancement module to improve the quality of feature extraction in perspective view and BEV, respectively. Finally, we integrate the above improvements into a novel multi-view 3D object detector, PolarBEVDet. Experiments on nuScenes show that PolarBEVDet achieves superior performance. The code is available at this https URL.
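To make the representational difference concrete, here is a toy sketch of binning a BEV point into a polar grid instead of a Cartesian one: radius and azimuth replace the (x, y) cell indices, so angular resolution is constant around the ego vehicle. The grid sizes and range are hypothetical, not the paper's configuration.

```python
import math

def polar_bev_bin(x, y, num_radial=4, num_angular=8, max_range=40.0):
    """Assign a ground-plane point to a (radial, angular) polar BEV cell."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)
    r_idx = min(int(r / max_range * num_radial), num_radial - 1)
    a_idx = int(theta / (2 * math.pi) * num_angular) % num_angular
    return r_idx, a_idx
```

Rotating the scene around the ego vehicle only shifts `a_idx`, which is the view symmetry a regular convolution over the polar grid can exploit.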

[CV-78] DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning

链接: https://arxiv.org/abs/2408.16195
作者: Zeyi Bo(1),Wuxi Sun(1),Ye Jin(1) ((1) Harbin Institute of Technology)
关键词-EN: Video Foundation Model, recent years, reach billion-level, parameters of backbones, continue to increase
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, the parameters of backbones for Video Understanding tasks have continued to increase, even reaching the billion level. Whether fine-tuning a specific task on a Video Foundation Model or pre-training a model designed for the specific task incurs a lot of overhead. How to make these models provide value beyond their own tasks becomes a worthy question. Multi-Task Learning (MTL) allows a visual task to acquire rich shareable knowledge from other tasks during joint training. It has been fully explored in image recognition tasks, especially dense prediction tasks. Nevertheless, it is rarely used in the video domain due to the lack of multi-label video data. In this paper, a heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address the above problem. Unlike its counterpart in the image domain, a Double-Layer Mapper (DLM) is proposed to extract the shareable knowledge into visual prompts and align it with the representation of the primary task. Extensive experiments prove that our DLM-VMTL performs better than baselines on 6 different video understanding tasks and 11 datasets.

[CV-79] Estimating Dynamic Flow Features in Groups of Tracked Objects

链接: https://arxiv.org/abs/2408.16190
作者: Tanner D. Harms,Steven L. Brunton,Beverley J. McKeon
关键词-EN: Interpreting motion captured, Interpreting motion, wide range, range of computer, motion
类目: Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
*备注: 21 pages, 6 figures

点击查看摘要

Abstract:Interpreting motion captured in image sequences is crucial for a wide range of computer vision applications. Typical estimation approaches include optical flow (OF), which approximates the apparent motion instantaneously in a scene, and multiple object tracking (MOT), which tracks the motion of subjects over time. Often, the motion of objects in a scene is governed by some underlying dynamical system which could be inferred by analyzing the motion of groups of objects. Standard motion analyses, however, are not designed to intuit flow dynamics from trajectory data, making such measurements difficult in practice. The goal of this work is to extend gradient-based dynamical systems analyses to real-world applications characterized by complex, feature-rich image sequences with imperfect tracers. The tracer trajectories are tracked using deep vision networks and gradients are approximated using Lagrangian gradient regression (LGR), a tool designed to estimate spatial gradients from sparse data. From gradients, dynamical features such as regions of coherent rotation and transport barriers are identified. The proposed approach is affordably implemented and enables advanced studies including the motion analysis of two distinct object classes in a single image sequence. Two examples of the method are presented on data sets for which standard gradient-based analyses do not apply.
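The gradient-approximation step above can be illustrated with a minimal least-squares stand-in for Lagrangian gradient regression: fit a linear velocity model to sparse tracer samples and read the velocity gradient tensor off the coefficients. This is a simplified global fit, not the LGR implementation, and the test flow is synthetic.

```python
import numpy as np

def lagrangian_gradient(positions, velocities):
    """Least-squares estimate of a (constant) velocity gradient tensor from
    sparse tracer samples: solve velocities ~ v0 + G @ positions."""
    n = positions.shape[0]
    A = np.hstack([positions, np.ones((n, 1))])   # [x y 1] design matrix
    coef, *_ = np.linalg.lstsq(A, velocities, rcond=None)
    return coef[:2].T                             # G with G[i, j] = du_i / dx_j

# Solid-body rotation u = (-y, x): its gradient is [[0, -1], [1, 0]].
rng = np.random.default_rng(2)
pts = rng.standard_normal((20, 2))
vel = np.stack([-pts[:, 1], pts[:, 0]], axis=1)
G = lagrangian_gradient(pts, vel)
```

From G, the antisymmetric part gives the vorticity (here G[1,0] - G[0,1] = 2), which is the kind of coherent-rotation feature the abstract mentions.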

[CV-80] VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

链接: https://arxiv.org/abs/2408.16176
作者: M. Maruf,Arka Daw,Kazi Sajeed Mehrab,Harish Babu Manogaran,Abhilash Neog,Medha Sawhney,Mridul Khurana,James P. Balhoff,Yasin Bakis,Bahadir Altintas,Matthew J. Thompson,Elizabeth G. Campolongo,Josef C. Uyeda,Hilmar Lapp,Henry L. Bart,Paula M. Mabee,Yu Su,Wei-Lun Chao,Charles Stewart,Tanya Berger-Wolf,Wasila Dahdul,Anuj Karpatne
关键词-EN: large vision-language models, accelerating scientific discoveries, biologically relevant questions, providing novel opportunities, vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 36 pages, 37 figures, 7 tables

点击查看摘要

Abstract:Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images. The code and datasets for running all the analyses reported in this paper can be found at this https URL.

[CV-81] Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV2024

链接: https://arxiv.org/abs/2408.16154
作者: Dilermando Queiroz,Anderson Carlos,Maíra Fatoretto,André Anjos,Lilian Berton,Luis Filipe Nakayama
关键词-EN: Foundation model, diverse domains, emerged as robust, label efficiency, efficiency in diverse
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024

点击查看摘要

Abstract:Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
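The fairness gap discussed above (maximum AUC minus minimum AUC across demographic groups) is easy to state in code; the group names and numbers below are made up purely for illustration.

```python
def auc_gap(group_aucs):
    """Gap between the best- and worst-performing demographic group,
    the disparity metric the abstract refers to."""
    return max(group_aucs.values()) - min(group_aucs.values())

# Hypothetical per-group AUCs for a supervised baseline vs. a fine-tuned
# foundation model; a smaller gap means a fairer model under this metric.
supervised = {"male": 0.91, "female": 0.85}
foundation = {"male": 0.90, "female": 0.88}
```

Under this metric, the foundation model above narrows the gap from 0.06 to 0.02, mirroring the kind of comparison reported in the abstract.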

[CV-82] Using Backbone Foundation Model for Evaluating Fairness in Chest Radiography Without Demographic Data MICCAI2024

链接: https://arxiv.org/abs/2408.16130
作者: Dilermando Queiroz,André Anjos,Lilian Berton
关键词-EN: Ensuring consistent performance, Ensuring consistent, machine learning models, advancing medical image, diverse populations
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint of paper to be presented at Fairness of AI in Medical Imaging (FAIMI) during MICCAI 2024

点击查看摘要

Abstract:Ensuring consistent performance across diverse populations and incorporating fairness into machine learning models are crucial for advancing medical image diagnostics and promoting equitable healthcare. However, many databases do not provide protected attributes or contain unbalanced representations of demographic groups, complicating the evaluation of model performance across different demographics and the application of bias mitigation techniques that rely on these attributes. This study aims to investigate the effectiveness of using the backbone of Foundation Models as an embedding extractor for creating groups that represent protected attributes, such as gender and age. We propose utilizing these groups in different stages of bias mitigation, including pre-processing, in-processing, and evaluation. Using databases in in-distribution and out-of-distribution scenarios, we find that the method can create groups that represent gender in both databases and reduce the performance difference across the gender attribute by 4.44% in-distribution and by 6.16% out-of-distribution. However, the model lacks robustness in handling age attributes, underscoring the need for more fundamentally fair and robust Foundation models. These findings suggest a role in promoting fairness assessment in scenarios where we lack knowledge of attributes, contributing to the development of more equitable medical diagnostics.

[CV-83] ChartEye: A Deep Learning Framework for Chart Information Extraction

链接: https://arxiv.org/abs/2408.16123
作者: Osama Mustafa,Muhammad Khizer Ali,Momina Moetesum,Imran Siddiqi
关键词-EN: inspired recent research, automated chart understanding, data visualization, domains has inspired, inspired recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 Pages, and 11 Figures

点击查看摘要

Abstract:The widespread use of charts and infographics as a means of data visualization in various domains has inspired recent research in automated chart understanding. However, information extraction from chart images is a complex multitasked process due to style variations and, as a consequence, it is challenging to design an end-to-end system. In this study, we propose a deep learning-based framework that provides a solution for key steps in the chart information extraction pipeline. The proposed framework utilizes hierarchical vision transformers for the tasks of chart-type and text-role classification, and YOLOv7 for text detection. The detected text is then enhanced using Super Resolution Generative Adversarial Networks to improve the recognition output of the OCR. Experimental results on a benchmark dataset show that our proposed framework achieves excellent performance at every stage with F1-scores of 0.97 for chart-type classification, 0.91 for text-role classification, and a mean Average Precision of 0.95 for text detection.

[CV-84] Negative Binomial Matrix Completion

链接: https://arxiv.org/abs/2408.16113
作者: Yu Lu,Kevin Bui,Roummel F. Marcia
关键词-EN: Poisson matrix completion, information in matrices, Matrix completion, Matrix completion focuses, focuses on recovering
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: 6 pages, Accepted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP)

点击查看摘要

Abstract:Matrix completion focuses on recovering missing or incomplete information in matrices. This problem arises in various applications, including image processing and network analysis. Previous research proposed Poisson matrix completion for count data with noise that follows a Poisson distribution, which assumes that the mean and variance are equal. Since overdispersed count data, whose variance is greater than the mean, is more likely to occur in realistic settings, we assume that the noise follows the negative binomial (NB) distribution, which can be more general than the Poisson distribution. In this paper, we introduce NB matrix completion by proposing a nuclear-norm regularized model that can be solved by proximal gradient descent. In our experiments, we demonstrate that the NB model outperforms Poisson matrix completion in various noise and missing data settings on real data.
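A sketch of the core building block of such a nuclear-norm regularized proximal gradient scheme: the proximal operator of the nuclear norm is singular value thresholding. The negative binomial likelihood gradient step is omitted; this is a generic sketch, not the paper's code.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm tau * ||X||_*, applied after each (NB log-likelihood) gradient
    step in a proximal gradient scheme."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

A = np.outer([1.0, 2.0], [3.0, 4.0])  # rank-1 test matrix, sigma_max ~ 11.18
shrunk = svt(A, 1.0)
```

Small thresholds shrink but preserve the dominant low-rank structure, while a threshold above the largest singular value annihilates the matrix entirely; this is how the regularizer promotes low rank in the completed matrix.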

[CV-85] 3D Reconstruction with Spatial Memory

链接: https://arxiv.org/abs/2408.16061
作者: Hengyi Wang,Lourdes Agapito
关键词-EN: approach for dense, unordered image collections, global coordinate system, unordered image, image collections
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present Spann3R, a novel approach for dense 3D reconstruction from ordered or unordered image collections. Built on the DUSt3R paradigm, Spann3R uses a transformer-based architecture to directly regress pointmaps from images without any prior knowledge of the scene or camera parameters. Unlike DUSt3R, which predicts per image-pair pointmaps each expressed in its local coordinate frame, Spann3R can predict per-image pointmaps expressed in a global coordinate system, thus eliminating the need for optimization-based global alignment. The key idea of Spann3R is to manage an external spatial memory that learns to keep track of all previous relevant 3D information. Spann3R then queries this spatial memory to predict the 3D structure of the next frame in a global coordinate system. Taking advantage of DUSt3R’s pre-trained weights, and further fine-tuning on a subset of datasets, Spann3R shows competitive performance and generalization ability on various unseen datasets and can process ordered image collections in real time. Project page: this https URL

[CV-86] Many-Worlds Inverse Rendering

链接: https://arxiv.org/abs/2408.16005
作者: Ziyi Zhang,Nicolas Roussel,Wenzel Jakob
关键词-EN: physically-based inverse renderer, Discontinuous visibility, inverse renderer, remain a major, major bottleneck
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Discontinuous visibility changes remain a major bottleneck when optimizing surfaces within a physically-based inverse renderer. Many previous works have proposed sophisticated algorithms and data structures to sample visibility silhouettes more efficiently. Our work presents another solution: instead of differentiating a tentative surface locally, we differentiate a volumetric perturbation of a surface. We refer to this as a many-worlds representation because it models a non-interacting superposition of conflicting explanations (worlds) of the input dataset. Each world is optically isolated from others, leading to a new transport law that distinguishes our method from prior work based on exponential random media. The resulting Monte Carlo algorithm is simpler and more efficient than prior methods. We demonstrate that our method promotes rapid convergence, both in terms of the total iteration count and the cost per iteration.

[CV-87] Meta-Learning for Federated Face Recognition in Imbalanced Data Regimes

链接: https://arxiv.org/abs/2408.16003
作者: Arwin Gansekoele,Emiel Hess,Sandjai Bhulai
关键词-EN: concerns surrounding face, surrounding face image, growing privacy concerns, privacy concerns surrounding, Federated Face Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: To appear in the IEEE FLTA 2024 proceedings

点击查看摘要

Abstract:The growing privacy concerns surrounding face image data demand new techniques that can guarantee user privacy. One such face recognition technique that claims to achieve better user privacy is Federated Face Recognition (FFR), a subfield of Federated Learning (FL). However, FFR faces challenges due to the heterogeneity of the data, given the large number of classes that need to be handled. To overcome this problem, solutions are sought in the field of personalized FL. This work introduces three new data partitions based on the CelebA dataset, each with a different form of data heterogeneity. It also proposes Hessian-Free Model Agnostic Meta-Learning (HF-MAML) in an FFR setting. We show that HF-MAML scores higher in verification tests than current FFR models on three different CelebA data partitions. In particular, the verification scores improve the most in heterogeneous data partitions. To balance personalization with the development of an effective global model, an embedding regularization term is introduced for the loss function. This term can be combined with HF-MAML and is shown to increase global model verification performance. Lastly, this work performs a fairness analysis, showing that HF-MAML and its embedding regularization extension can improve fairness by reducing the standard deviation over the client evaluation scores.

[CV-88] Sparse Signal Reconstruction for Overdispersed Low-photon Count Biomedical Imaging Using \ell_p Total Variation

链接: https://arxiv.org/abs/2408.16622
作者: Yu Lu,Roummel F. Marcia
关键词-EN: including medical imaging, applications involving low-photon, involving low-photon signal, low-photon signal recovery, negative binomial model
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: 5 pages, Accepted by the IEEE International Symposium on Biomedical Imaging (ISBI)

点击查看摘要

Abstract:The negative binomial model, which generalizes the Poisson distribution model, can be found in applications involving low-photon signal recovery, including medical imaging. Recent studies have explored several regularization terms for the negative binomial model, such as the \ell_p quasi-norm with 0 < p < 1, the \ell_1 norm, and the total variation (TV) quasi-seminorm for promoting sparsity in signal recovery. These penalty terms have been shown to improve image reconstruction outcomes. In this paper, we investigate the \ell_p quasi-seminorm, both isotropic and anisotropic \ell_p TV quasi-seminorms, within the framework of the negative binomial statistical model. This problem can be formulated as an optimization problem, which we solve using a gradient-based approach. We present comparisons between the negative binomial and Poisson statistical models using the \ell_p TV quasi-seminorm as well as common penalty terms. Our experimental results highlight the efficacy of the proposed method.
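
As an illustration of the regularizers discussed above, here is a minimal NumPy sketch of isotropic and anisotropic \ell_p TV quasi-seminorms of a 2D image; the forward-difference discretization and replicated boundary are our implementation choices, not necessarily the paper's:

```python
import numpy as np

def lp_tv(img, p=0.5, isotropic=True):
    # forward differences with a replicated boundary, so the last
    # row/column contributes zero gradient
    dx = np.diff(img, axis=1, append=img[:, -1:])
    dy = np.diff(img, axis=0, append=img[-1:, :])
    if isotropic:
        # isotropic: p-th powers of the \ell_2 gradient magnitudes
        return float(np.sum(np.sqrt(dx ** 2 + dy ** 2) ** p))
    # anisotropic: p-th powers of |dx| and |dy| summed separately
    return float(np.sum(np.abs(dx) ** p + np.abs(dy) ** p))

img = np.zeros((2, 3))
img[:, 2] = 1.0  # a vertical step edge
print(lp_tv(img, p=0.5, isotropic=True), lp_tv(img, p=0.5, isotropic=False))
```

With p < 1 the penalty grows sub-linearly in the gradient magnitude, which is why such quasi-norms promote sparser gradients (sharper edges) than the convex \ell_1 TV.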

[CV-89] A Deep-Learning-Based Label-free No-Reference Image Quality Assessment Metric: Application in Sodium MRI Denoising

链接: https://arxiv.org/abs/2408.16481
作者: Shuaiyu Yuan,Tristan Whitmarsh,Dimitri A Kessler,Otso Arponen,Mary A McLean,Gabrielle Baxter,Frank Riemer,Aneurin J Kennerley,William J Brackenbury,Fiona J Gilbert,Joshua D Kaggie
关键词-EN: multinuclear MRI techniques, inherently low signal, MRI techniques, low image quality, sodium MRI
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:New multinuclear MRI techniques, such as sodium MRI, generally suffer from low image quality due to an inherently low signal. Postprocessing methods, such as image denoising, have been developed for image enhancement. However, assessing these enhanced images is challenging, especially when high-resolution, high-signal reference images are lacking, as in sodium MRI. No-reference Image Quality Assessment (NR-IQA) metrics are approaches to solve this problem. Existing learning-based NR-IQA metrics rely on labels derived from subjective human opinions or metrics like Signal-to-Noise Ratio (SNR), which are either time-consuming or lack accurate ground truths, resulting in unreliable assessment. We note that deep learning (DL) models are specialized to their training set, meaning that deviations of the input testing data from the training data will reduce prediction accuracy. Therefore, we propose a novel DL-based NR-IQA metric, the Model Specialization Metric (MSM), which does not depend on ground-truth images or labels. MSM measures the difference between the input image and the model’s prediction for evaluating the quality of the input image. Experiments conducted on both simulated distorted proton T1-weighted MR images and denoised sodium MR images demonstrate that MSM exhibits superior evaluation performance on various simulated noises and distortions. MSM also has a substantial agreement with the expert evaluations, achieving an average Cohen’s Kappa coefficient of 0.6528, outperforming existing NR-IQA metrics.
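
A heavily simplified sketch of the MSM idea (the mean-squared-discrepancy form and the toy blur model below are our assumptions; the paper's actual model and distance may differ): score an image by how far it deviates from the model's prediction for it.

```python
import numpy as np

def msm_score(image, model):
    # hypothetical MSM-style score: discrepancy between an input and the
    # model's prediction; inputs far from the training distribution yield
    # a larger discrepancy, i.e. a worse estimated quality
    residual = model(image) - image
    return float(np.mean(residual ** 2))

def toy_model(x):
    # stand-in "denoising" model: a 5-point box blur via slicing
    out = x.copy()
    out[1:-1, 1:-1] = (x[:-2, 1:-1] + x[2:, 1:-1] +
                       x[1:-1, :-2] + x[1:-1, 2:] + x[1:-1, 1:-1]) / 5.0
    return out

rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
noisy = clean + rng.normal(0.0, 0.5, clean.shape)
print(msm_score(clean, toy_model), msm_score(noisy, toy_model))  # 0.0 vs > 0
```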

[CV-90] Improving 3D deep learning segmentation with biophysically motivated cell synthesis

链接: https://arxiv.org/abs/2408.16471
作者: Roman Bruch,Mario Vitacolonna,Elina Nürnberg,Simeon Sauer,Rüdiger Rudolf,Markus Reischl
关键词-EN: Biomedical research increasingly, research increasingly relies, accurate feature extraction, Biomedical research, single-cell level
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Biomedical research increasingly relies on 3D cell culture models, and AI-based analysis can potentially facilitate detailed and accurate feature extraction on a single-cell level. However, this requires precise segmentation of 3D cell datasets, which in turn demands high-quality ground truth for training. Manual annotation, the gold standard for ground truth data, is too time-consuming and thus not feasible for the generation of large 3D training datasets. To address this, we present a novel framework for generating 3D training data, which integrates biophysical modeling for realistic cell shape and alignment. Our approach allows the in silico generation of coherent membrane and nuclei signals that enable the training of segmentation models utilizing both channels for improved performance. Furthermore, we present a new GAN training scheme that generates not only image data but also matching labels. Quantitative evaluation shows superior performance of biophysically motivated synthetic training data, even outperforming manual annotation and pretrained models. This underscores the potential of incorporating biophysical modeling for enhancing synthetic training data quality.

[CV-91] NeRF-CA: Dynamic Reconstruction of X-ray Coronary Angiography with Extremely Sparse-views

链接: https://arxiv.org/abs/2408.16355
作者: Kirsten W.H. Maas,Danny Ruijters,Anna Vilanova,Nicola Pezzotti
关键词-EN: Neural Radiance Field, two-dimensional X-ray coronary, two-dimensional X-ray, background, reconstruction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dynamic three-dimensional (4D) reconstruction from two-dimensional X-ray coronary angiography (CA) remains a significant clinical problem. Challenges include sparse-view settings, intra-scan motion, and complex vessel morphology such as structure sparsity and background occlusion. Existing CA reconstruction methods often require extensive user interaction or large training datasets. On the other hand, Neural Radiance Field (NeRF), a promising deep learning technique, has successfully reconstructed high-fidelity static scenes for natural and medical scenes. Recent work, however, identified that sparse-views, background occlusion, and dynamics still pose a challenge when applying NeRF in the X-ray angiography context. Meanwhile, many successful works for natural scenes propose regularization for sparse-view reconstruction or scene decomposition to handle dynamics. However, these techniques do not directly translate to the CA context, where both challenges and background occlusion are significant. This paper introduces NeRF-CA, the first step toward a 4D CA reconstruction method that achieves reconstructions from sparse coronary angiograms with cardiac motion. We leverage the motion of the coronary artery to decouple the scene into a dynamic coronary artery component and static background. We combine this scene decomposition with tailored regularization techniques. These techniques enforce the separation of the coronary artery from the background by enforcing dynamic structure sparsity and scene smoothness. By uniquely combining these approaches, we achieve 4D reconstructions from as few as four angiogram sequences. This setting aligns with clinical workflows while outperforming state-of-the-art X-ray sparse-view NeRF reconstruction techniques. We validate our approach quantitatively and qualitatively using 4D phantom datasets and ablation studies.

[CV-92] Learned Image Transmission with Hierarchical Variational Autoencoder

链接: https://arxiv.org/abs/2408.16340
作者: Guangyi Zhang,Hanlei Li,Yunlong Cai,Qiyu Hu,Guanding Yu,Runmin Zhang
关键词-EN: joint source-channel coding, hierarchical variational autoencoder, innovative hierarchical joint, hierarchical joint source-channel, source-channel coding
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce an innovative hierarchical joint source-channel coding (HJSCC) framework for image transmission, utilizing a hierarchical variational autoencoder (VAE). Our approach leverages a combination of bottom-up and top-down paths at the transmitter to autoregressively generate multiple hierarchical representations of the original image. These representations are then directly mapped to channel symbols for transmission by the JSCC encoder. We extend this framework to scenarios with a feedback link, modeling transmission over a noisy channel as a probabilistic sampling process and deriving a novel generative formulation for JSCC with feedback. Compared with existing approaches, our proposed HJSCC provides enhanced adaptability by dynamically adjusting transmission bandwidth, encoding these representations into varying amounts of channel symbols. Additionally, we introduce a rate attention module to guide the JSCC encoder in optimizing its encoding strategy based on prior information. Extensive experiments on images of varying resolutions demonstrate that our proposed model outperforms existing baselines in rate-distortion performance and maintains robustness against channel noise.

[CV-93] Enhanced Control for Diffusion Bridge in Image Restoration

链接: https://arxiv.org/abs/2408.16303
作者: Conghan Yue,Zhengwei Peng,Junlong Ma,Dongyu Zhang
关键词-EN: Image restoration, damaged low-quality image, low-quality image back, Image restoration refers, low-quality images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image restoration refers to the process of restoring a damaged low-quality image back to its corresponding high-quality image. Typically, we use convolutional neural networks to directly learn the mapping from low-quality images to high-quality images, achieving image restoration. Recently, a special type of diffusion bridge model has achieved more advanced results in image restoration. It can transform the direct mapping from low-quality to high-quality images into a diffusion process, restoring low-quality images through a reverse process. However, the current diffusion bridge restoration models do not emphasize the idea of conditional control, which may affect performance. This paper introduces the ECDB model, which enhances the control of the diffusion bridge with low-quality images as conditions. Moreover, in response to the characteristic of diffusion models having a low denoising level at larger values of $t$, we also propose a Conditional Fusion Schedule, which more effectively handles the conditional feature information of various modules. Experimental results prove that the ECDB model has achieved state-of-the-art results in many image restoration tasks, including deraining, inpainting and super-resolution. Code is available at this https URL.

[CV-94] Fine-grained Classification of Port Wine Stains Using Optical Coherence Tomography Angiography

链接: https://arxiv.org/abs/2408.16277
作者: Xiaofeng Deng,Defu Chen,Bowen Liu,Xiwan Zhang,Haixia Qiu,Wu Yuan,Hongliang Ren
关键词-EN: port wine stains, PWS, subsequent treatment planning, PWS lesions, Accurate classification
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Accurate classification of port wine stains (PWS, vascular malformations present at birth), is critical for subsequent treatment planning. However, the current method of classifying PWS based on the external skin appearance rarely reflects the underlying angiopathological heterogeneity of PWS lesions, resulting in inconsistent outcomes with the common vascular-targeted photodynamic therapy (V-PDT) treatments. Conversely, optical coherence tomography angiography (OCTA) is an ideal tool for visualizing the vascular malformations of PWS. Previous studies have shown no significant correlation between OCTA quantitative metrics and the PWS subtypes determined by the current classification approach. This study proposes a new classification approach for PWS using both OCT and OCTA. By examining the hypodermic histopathology and vascular structure of PWS, we have devised a fine-grained classification method that subdivides PWS into five distinct types. To assess the angiopathological differences of various PWS subtypes, we have analyzed six metrics related to vascular morphology and depth information of PWS lesions. The five PWS types present significant differences across all metrics compared to the conventional subtypes. Our findings suggest that an angiopathology-based classification accurately reflects the heterogeneity in PWS lesions. This research marks the first attempt to classify PWS based on angiopathology, potentially guiding more effective subtyping and treatment strategies for PWS.

[CV-95] Single-Photon 3D Imaging with Equi-Depth Photon Histograms

链接: https://arxiv.org/abs/2408.16150
作者: Kaustubh Sadekar,David Maier,Atul Ingle
关键词-EN: Single-photon cameras present, avenue for high-resolution, present a promising, promising avenue, histograms
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Single-photon cameras (SPCs) present a promising avenue for high-resolution 3D imaging. They have ultra-high sensitivity – down to individual photons – and can record photon arrival times with extremely high (sub-nanosecond) resolution. Single-photon 3D cameras estimate the round-trip time of a laser pulse by forming equi-width (EW) histograms of detected photon timestamps. Acquiring and transferring such EW histograms requires high bandwidth and in-pixel memory, making SPCs less attractive in resource-constrained settings such as mobile devices and AR/VR headsets. In this work we propose a 3D sensing technique based on equi-depth (ED) histograms. ED histograms compress timestamp data more efficiently than EW histograms, reducing the bandwidth requirement. Moreover, to reduce the in-pixel memory requirement, we propose a lightweight algorithm to estimate ED histograms in an online fashion without explicitly storing the photon timestamps. This algorithm is amenable to future in-pixel implementations. We propose algorithms that process ED histograms to perform 3D computer-vision tasks of estimating scene distance maps and performing visual odometry under challenging conditions such as high ambient light. Our work paves the way towards lower bandwidth and reduced in-pixel memory requirements for SPCs, making them attractive for resource-constrained 3D vision applications. Project page: https://www.computational.camera/pedh
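
The EW-vs-ED distinction can be sketched in a few lines of NumPy (the toy timestamp distribution below is our own illustration, not data from the paper): equi-width bins slice the time axis uniformly, while equi-depth bins place boundaries at quantiles so every bin holds roughly the same photon count, automatically concentrating resolution around the laser-return peak.

```python
import numpy as np

def equi_width_edges(timestamps, n_bins):
    # EW: bins of equal temporal width
    return np.linspace(timestamps.min(), timestamps.max(), n_bins + 1)

def equi_depth_edges(timestamps, n_bins):
    # ED: boundaries at quantiles, so every bin holds ~the same photon count
    return np.quantile(timestamps, np.linspace(0.0, 1.0, n_bins + 1))

# toy photon timestamps: a laser return near t=40 plus uniform ambient light
rng = np.random.default_rng(1)
ts = np.concatenate([rng.normal(40.0, 1.0, 800),
                     rng.uniform(0.0, 100.0, 200)])
counts, _ = np.histogram(ts, bins=equi_depth_edges(ts, 8))
print(counts)  # each of the 8 ED bins holds roughly 1000/8 = 125 photons
```

Since the counts per ED bin are (by construction) nearly constant, only the bin boundaries need to be transmitted, which is the bandwidth saving the abstract describes.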

[CV-96] Alternating Direction Method of Multipliers for Negative Binomial Model with The Weighted Difference of Anisotropic and Isotropic Total Variation ICME

链接: https://arxiv.org/abs/2408.16117
作者: Yu Lu,Kevin Bui,Roummel F. Marcia
关键词-EN: measurement data represent, data represent counts, medical imaging, hitting a detector, photons hitting
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注: 6 pages, Accepted by the IEEE International Conference on Multimedia and Expo (ICME)

点击查看摘要

Abstract:In many applications such as medical imaging, the measurement data represent counts of photons hitting a detector. Such counts in low-photon settings are often modeled using a Poisson distribution. However, this model assumes that the mean and variance of the signal’s noise distribution are equal. For overdispersed data where the variance is greater than the mean, the negative binomial distribution is a more appropriate statistical model. In this paper, we propose an optimization approach for recovering images corrupted by overdispersed Poisson noise. In particular, we incorporate a weighted anisotropic-isotropic total variation regularizer, which avoids staircasing artifacts that are introduced by a regular total variation penalty. We use an alternating direction method of multipliers, where each subproblem has a closed-form solution. Numerical experiments demonstrate the effectiveness of our proposed approach, especially in very photon-limited settings.
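
The overdispersion motivating the negative binomial model is easy to verify numerically (NumPy's `negative_binomial(n, p)` with n = r and p = r/(r+mu) has mean mu and variance mu + mu²/r):

```python
import numpy as np

# Poisson forces variance == mean, while the negative binomial
# (mean mu, dispersion r) is overdispersed: variance = mu + mu**2/r > mu.
rng = np.random.default_rng(0)
mu, r = 5.0, 2.0
poisson_counts = rng.poisson(mu, 200_000)
nb_counts = rng.negative_binomial(r, r / (r + mu), 200_000)

print(poisson_counts.mean(), poisson_counts.var())  # both ~5.0
print(nb_counts.mean(), nb_counts.var())            # ~5.0 and ~17.5
```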

机器学习

[LG-0] A Score-Based Density Formula with Applications in Diffusion Generative Models

链接: https://arxiv.org/abs/2408.16765
作者: Gen Li,Yuling Yan
关键词-EN: Score-based generative models, achieving unprecedented success, diffusion generative models, Score-based generative, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving unprecedented success in generating realistic and diverse content. Despite empirical advances, the theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we address this question by establishing a density formula for a continuous-time diffusion process, which can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the connection between the target density and the score function associated with each step of the forward process. Building on this, we demonstrate that the minimizer of the optimization objective for training DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion loss.
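
As background for reading this abstract (standard material from score-based generative modeling, not the paper's new formula): the forward process of an SGM is typically written as an Itô SDE

$$\mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,$$

whose marginal densities $p_t$ define the score function

$$s(x, t) = \nabla_x \log p_t(x).$$

The DDPM forward step $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, I)$ is a discretization of such a process; the paper's density formula relates the target density $p_0$ to the scores along this continuous-time limit.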

[LG-1] UV-free Texture Generation with Denoising and Geodesic Heat Diffusions

链接: https://arxiv.org/abs/2408.16762
作者: Simone Foti,Stefanos Zafeiriou,Tolga Birdal
关键词-EN: standard UV-based texturing, wasted UV space, standard UV-based, UV-based texturing, prominent issues
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Seams, distortions, wasted UV space, vertex-duplication, and varying resolution over the surface are the most prominent issues of the standard UV-based texturing of meshes. These issues are particularly acute when automatic UV-unwrapping techniques are used. For this reason, instead of generating textures in automatically generated UV-planes like most state-of-the-art methods, we propose to represent textures as coloured point-clouds whose colours are generated by a denoising diffusion probabilistic model constrained to operate on the surface of 3D objects. Our sampling and resolution agnostic generative model heavily relies on heat diffusion over the surface of the meshes for spatial communication between points. To enable processing of arbitrarily sampled point-cloud textures and ensure long-distance texture consistency we introduce a fast re-sampling of the mesh spectral properties used during the heat diffusion and introduce a novel heat-diffusion-based self-attention mechanism. Our code and pre-trained models are available at this http URL.

[LG-2] Reinforcement Learning without Human Feedback for Last Mile Fine-Tuning of Large Language Models

链接: https://arxiv.org/abs/2408.16753
作者: Alec Solway
关键词-EN: align language models, human preference signals, Reinforcement learning, likelihood maximization, align language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning is used to align language models with human preference signals after first pre-training the model to predict the next token of text within a large corpus using likelihood maximization. Before being deployed in a specific domain, models are often further fine-tuned on task specific data. Since human preferences are often unavailable for the last step, it is performed using likelihood maximization as that is the typical default method. However, reinforcement learning has other advantages besides facilitating alignment to a human derived reward function. For one, whereas likelihood maximization is a form of imitation learning in which the model is trained on what to do under ideal conditions, reinforcement learning is not limited to demonstrating actions just for optimally reached states and trains a model what to do under a range of scenarios as it explores the policy space. In addition, it also trains a model what not to do, suppressing competitive but poor actions. This work develops a framework for last-mile fine-tuning using reinforcement learning and tests whether it garners performance gains. The experiments center on abstractive summarization, but the framework is general and broadly applicable. Use of the procedure produced significantly better results than likelihood maximization when comparing raw predictions. For the specific data tested, the gap could be bridged by employing post-processing of the maximum likelihood outputs. Nonetheless, the framework offers a new avenue for model optimization in situations where post-processing may be less straightforward or effective, and it can be extended to include more complex classes of undesirable outputs to penalize and train against, such as hallucinations.
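
The contrast drawn above between imitation (likelihood maximization) and reinforcement learning can be sketched at the level of a single sequence; this schematic is our illustration, not the paper's implementation:

```python
import numpy as np

def mle_loss(logp_reference_tokens):
    # imitation: always push up the log-probability of the reference tokens
    return -np.sum(logp_reference_tokens)

def reinforce_loss(logp_sampled_tokens, reward, baseline):
    # policy gradient: scale by the advantage, so below-baseline samples are
    # actively suppressed ("training what not to do")
    return -(reward - baseline) * np.sum(logp_sampled_tokens)

logp = np.log(np.array([0.4, 0.6, 0.5]))  # token log-probs of one sample
print(mle_loss(logp))
print(reinforce_loss(logp, reward=1.0, baseline=0.5))  # imitate: positive loss
print(reinforce_loss(logp, reward=0.0, baseline=0.5))  # suppress: sign flips
```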

[LG-3] A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models

链接: https://arxiv.org/abs/2408.16751
作者: Yi-Lin Tuan,William Yang Wang
关键词-EN: including unlikelihood training, maximum likelihood estimation, average treatment effect, exponential maximizing average, maximizing average treatment
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Beyond maximum likelihood estimation (MLE), the standard objective of a language model (LM) that maximizes the probabilities of good examples, many studies have explored ways that also penalize bad examples for enhancing the quality of output distribution, including unlikelihood training, exponential maximizing average treatment effect (ExMATE), and direct preference optimization (DPO). To systematically compare these methods and further provide a unified recipe for LM optimization, in this paper, we present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. Through both mathematical results and experiments on CausalDialogue and Anthropic HH-RLHF datasets, we identify distinct functional characteristics among these methods. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
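
To make the reward-good/penalize-bad idea concrete, here is the per-token form of unlikelihood training, one of the loss families the paper compares (the probabilities are toy values of our choosing):

```python
import numpy as np

def unlikelihood_loss(p_good, p_bad):
    # reward probability on a good token, penalize probability on a bad one;
    # plain MLE (-log p_good alone) has no second term
    return float(-np.log(p_good) - np.log(1.0 - p_bad))

# lowering the model's probability on a bad example strictly lowers the loss
print(unlikelihood_loss(0.9, 0.5))
print(unlikelihood_loss(0.9, 0.1))  # smaller: less mass on the bad token
```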

[LG-4] Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

链接: https://arxiv.org/abs/2408.16725
作者: Zhifei Xie,Changqiao Wu
关键词-EN: achieved significant progress, Recent advances, significant progress, achieved significant, Recent
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 10 pages

点击查看摘要

Abstract:Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model’s language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method “Any Model Can Talk”. We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

[LG-5] A GREAT Architecture for Edge-Based Graph Problems Like TSP

链接: https://arxiv.org/abs/2408.16717
作者: Attila Lischka,Jiaming Wu,Morteza Haghir Chehreghani,Balázs Kulcsár
关键词-EN: tackle combinatorial optimization, combinatorial optimization problems, routing problems, neural network-based approaches, proposed to tackle
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:In recent years, many neural network-based approaches have been proposed to tackle combinatorial optimization problems such as routing problems. Many of these approaches are based on graph neural networks (GNNs) or related transformers, operating on the Euclidean coordinates representing the routing problems. However, GNNs are inherently not well suited to operate on dense graphs, such as in routing problems. Furthermore, models operating on Euclidean coordinates cannot be applied to non-Euclidean versions of routing problems that are often found in real-world settings. To overcome these limitations, we propose a novel GNN-related edge-based neural model called Graph Edge Attention Network (GREAT). We evaluate the performance of GREAT in the edge-classification task to predict optimal edges in the Traveling Salesman Problem (TSP). We can use such a trained GREAT model to produce sparse TSP graph instances, keeping only the edges GREAT finds promising. Compared to other, non-learning-based methods to sparsify TSP graphs, GREAT can produce very sparse graphs while keeping most of the optimal edges. Furthermore, we build a reinforcement learning-based GREAT framework, which we apply to Euclidean and non-Euclidean asymmetric TSP. This framework achieves state-of-the-art results.
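
Given per-edge scores like those GREAT predicts, one simple way to produce a sparse TSP instance is to keep only the top-k outgoing edges per node; the top-k rule itself is our assumption, since the abstract only says the most promising edges are kept:

```python
import numpy as np

def sparsify_top_k(edge_scores, k):
    # keep the k highest-scoring outgoing edges per node
    n = edge_scores.shape[0]
    keep = np.zeros_like(edge_scores, dtype=bool)
    top = np.argsort(-edge_scores, axis=1)[:, :k]
    keep[np.arange(n)[:, None], top] = True
    return keep

rng = np.random.default_rng(0)
scores = rng.random((6, 6))        # stand-in for learned edge scores
np.fill_diagonal(scores, -np.inf)  # exclude self-loops
mask = sparsify_top_k(scores, 2)
print(mask.sum(axis=1))  # 2 edges kept per node
```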

[LG-6] Enhanced forecasting of stock prices based on variational mode decomposition, PatchTST, and adaptive scale-weighted layer

链接: https://arxiv.org/abs/2408.16707
作者: Xiaorui Xue,Shaofang Li,Xiaonan Wang
关键词-EN: recent years highlight, recent years, years highlight, highlight the critical, financial strategies
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The significant fluctuations in stock index prices in recent years highlight the critical need for accurate forecasting to guide investment and financial strategies. This study introduces a novel composite forecasting framework that integrates variational mode decomposition (VMD), PatchTST, and adaptive scale-weighted layer (ASWL) to address these challenges. Utilizing datasets of four major stock indices (S&P 500, DJI, SSEC, and FTSE) from 2000 to 2024, the proposed method first decomposes the raw price series into intrinsic mode functions (IMFs) using VMD. Each IMF is then modeled with PatchTST to capture temporal patterns effectively. The ASWL module is applied to incorporate scale information, enhancing prediction accuracy. The final forecast is derived by aggregating predictions from all IMFs. The VMD-PatchTST-ASWL framework demonstrates significant improvements in forecasting accuracy compared to traditional models, showing robust performance across different indices. This innovative approach provides a powerful tool for stock index price forecasting, with potential applications in various financial analysis and investment decision-making contexts.

[LG-7] SympGNNs: Symplectic Graph Neural Networks for identifying high-dimensional Hamiltonian systems and node classification

链接: https://arxiv.org/abs/2408.16698
作者: Alan John Varghese,Zhen Zhang,George Em Karniadakis
关键词-EN: Existing neural network, Graph Neural Networks, learn Hamiltonian systems, neural network models, high-dimensional Hamiltonian systems
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 17 pages, 10 figures

点击查看摘要

Abstract:Existing neural network models to learn Hamiltonian systems, such as SympNets, although accurate in low dimensions, struggle to learn the correct dynamics for high-dimensional many-body systems. Herein, we introduce Symplectic Graph Neural Networks (SympGNNs) that can effectively handle system identification in high-dimensional Hamiltonian systems, as well as node classification. SympGNNs combine symplectic maps with permutation equivariance, a property of graph neural networks. Specifically, we propose two variants of SympGNNs: i) G-SympGNN and ii) LA-SympGNN, arising from different parameterizations of the kinetic and potential energy. We demonstrate the capabilities of SympGNN on two physical examples: a 40-particle coupled harmonic oscillator, and a 2000-particle molecular dynamics simulation in a two-dimensional Lennard-Jones potential. Furthermore, we demonstrate the performance of SympGNN in the node classification task, achieving accuracy comparable to the state-of-the-art. We also empirically show that SympGNN can overcome the oversmoothing and heterophily problems, two key challenges in the field of graph neural networks.
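
For readers unfamiliar with the symplectic maps that SympNet/SympGNN layers preserve by construction: a linear map M on phase space is symplectic iff MᵀJM = J, which is straightforward to check numerically (the example maps below are textbook cases, not from the paper):

```python
import numpy as np

def is_symplectic(M, tol=1e-9):
    # defining property of a linear symplectic map: M^T J M = J
    n = M.shape[0] // 2
    J = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-np.eye(n), np.zeros((n, n))]])
    return bool(np.allclose(M.T @ J @ M, J, atol=tol))

# a shear in phase space (q, p) -> (q + a*p, p) is symplectic;
# a uniform scaling is not (it changes phase-space volume)
shear = np.array([[1.0, 0.7], [0.0, 1.0]])
scaling = np.diag([2.0, 2.0])
print(is_symplectic(shear), is_symplectic(scaling))  # True False
```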

[LG-8] CW-CNN & CW-AN: Convolutional Networks and Attention Networks for CW-Complexes

链接: https://arxiv.org/abs/2408.16686
作者: Rahul Khorana
关键词-EN: structured data points, CW-complex structured data, data points, structured data, learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel framework for learning on CW-complex structured data points. Recent advances have discussed CW-complexes as ideal learning representations for problems in cheminformatics. However, there is a lack of available machine learning methods suitable for learning on CW-complexes. In this paper we develop notions of convolution and attention that are well defined for CW-complexes. These notions enable us to create the first neural network that can receive a CW-complex as input. We illustrate and interpret this framework in the context of supervised prediction.

[LG-9] A Catalog of Fairness-Aware Practices in Machine Learning Engineering

链接: https://arxiv.org/abs/2408.16683
作者: Gianmario Voria,Giulia Sellitto,Carmine Ferrara,Francesco Abate,Andrea De Lucia,Filomena Ferrucci,Gemma Catolino,Fabio Palomba
关键词-EN: decision-making processes raises, processes raises concerns, learning widespread adoption, Machine learning, Machine learning widespread
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning’s widespread adoption in decision-making processes raises concerns about fairness, particularly regarding the treatment of sensitive features and potential discrimination against minorities. The software engineering community has responded by developing fairness-oriented metrics, empirical studies, and approaches. However, there remains a gap in understanding and categorizing practices for engineering fairness throughout the machine learning lifecycle. This paper presents a novel catalog of practices for addressing fairness in machine learning derived from a systematic mapping study. The study identifies and categorizes 28 practices from existing literature, mapping them onto different stages of the machine learning lifecycle. From this catalog, the authors extract actionable items and implications for both researchers and practitioners in software engineering. This work aims to provide a comprehensive resource for integrating fairness considerations into the development and deployment of machine learning systems, enhancing their reliability, accountability, and credibility.

[LG-10] Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

链接: https://arxiv.org/abs/2408.16673
作者: Ziniu Li,Congliang Chen,Tian Xu,Zeyu Qin,Jiancong Xiao,Ruoyu Sun,Zhi-Quan Luo
关键词-EN: Large language models, Large language, rely on Supervised, language models rely, specialize in downstream
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT, but it often leads to overfitting and limited output diversity due to its aggressive updates to the data distribution. This paper aims to address these issues by introducing the maximum entropy principle, which favors models with flatter distributions that still effectively capture the data. Specifically, we develop a new distribution matching method called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to the UltraFeedback dataset to develop general instruction-following abilities, GEM exhibits reduced overfitting, evidenced by lower perplexity and better performance on the IFEval benchmark. Furthermore, GEM enhances output diversity, leading to performance gains of up to 7 points on math reasoning and code generation tasks using best-of-n sampling, even without domain-specific data. Second, when fine-tuning with domain-specific datasets for math reasoning and code generation, GEM also shows less overfitting and improvements of up to 10 points compared with CE.
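As a toy illustration of the maximum-entropy idea behind GEM (not the actual GEM training procedure, which optimizes a reverse KL objective during fine-tuning), the sketch below shows how adding an entropy bonus to cross entropy shifts preference toward a flatter distribution that still fits the label; the distributions and the weight `beta` are made up for illustration:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def ce_loss(p, label):
    return -math.log(p[label])

def gem_style_loss(p, label, beta=0.5):
    # cross entropy minus an entropy bonus (maximum-entropy principle);
    # beta is an illustrative weight, not a value from the paper
    return ce_loss(p, label) - beta * entropy(p)

peaked = [0.97, 0.01, 0.01, 0.01]  # aggressive fit to label 0
flat = [0.70, 0.10, 0.10, 0.10]    # flatter fit that still favors label 0

# Pure CE always prefers the peaked distribution; the entropy-regularized
# objective prefers the flatter one that still captures the data.
assert ce_loss(peaked, 0) < ce_loss(flat, 0)
assert gem_style_loss(flat, 0) < gem_style_loss(peaked, 0)
```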

[LG-11] Iterative Graph Alignment

链接: https://arxiv.org/abs/2408.16667
作者: Fangyuan Yu,Hardeep Singh Arora,Matt Johnson
关键词-EN: generalizable causal relationships, capturing generalizable causal, compressing diverse narratives, causal relationships, intelligence by capturing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:By compressing diverse narratives, LLMs go beyond memorization, achieving intelligence by capturing generalizable causal relationships. However, they suffer from local ‘representation gaps’ due to insufficient training data diversity, limiting their real-world utility, especially in tasks requiring strict alignment to rules. Traditional alignment methods relying on heavy human annotations are inefficient and unscalable. Recent self-alignment techniques also fall short, as they often depend on self-selection based prompting and memorization-based learning. To address these issues, we introduce Iterative Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical graphs and reference answers. The student model (LLM) identifies local knowledge gaps by attempting to align its responses with these references, collaborating with helper models to generate diverse answers. These aligned responses are then used for iterative supervised fine-tuning (SFT). Our evaluations across five rule-based scenarios demonstrate IGP’s effectiveness, with a 73.12% alignment improvement in Claude Sonnet 3.5, and Llama3-8B-Instruct achieving an 86.20% improvement, outperforming Claude Sonnet 3.5 in rule-based alignment.

[LG-12] Optimal Parallelization of Boosting

链接: https://arxiv.org/abs/2408.16653
作者: Arthur da Cunha,Mikael Møller Høgsgaard,Kasper Green Larsen
关键词-EN: established strong lower, training rounds, Recent works, strong lower bounds, total parallel work
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works on the parallel complexity of Boosting have established strong lower bounds on the tradeoff between the number of training rounds p and the total parallel work per round t . These works have also presented highly non-trivial parallel algorithms that shed light on different regions of this tradeoff. Despite these advancements, a significant gap persists between the theoretical lower bounds and the performance of these algorithms across much of the tradeoff space. In this work, we essentially close this gap by providing both improved lower bounds on the parallel complexity of weak-to-strong learners, and a parallel Boosting algorithm whose performance matches these bounds across the entire p vs.~ t compromise spectrum, up to logarithmic factors. Ultimately, this work settles the true parallel complexity of Boosting algorithms that are nearly sample-optimal.

[LG-13] Towards Efficient Modelling of String Dynamics: A Comparison of State Space and Koopman based Deep Learning Methods

链接: https://arxiv.org/abs/2408.16650
作者: Rodrigo Diaz,Carlos De La Vega Martin,Mark Sandler
关键词-EN: State Space Models, State Space, examination of State, Koopman-based deep learning, non-linear stiff strings
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Computational Physics (physics.comp-ph)
*备注: Accepted to DAFx2024

点击查看摘要

Abstract:This paper presents an examination of State Space Models (SSM) and Koopman-based deep learning methods for modelling the dynamics of both linear and non-linear stiff strings. Through experiments with datasets generated under different initial conditions and sample rates, we assess the capacity of these models to accurately model the complex behaviours observed in string dynamics. Our findings indicate that our proposed Koopman-based model performs as well as or better than other existing approaches in non-linear cases for long-sequence modelling. We inform the design of these architectures with the structure of the problems at hand. Although challenges remain in extending model predictions beyond the training horizon (i.e., extrapolation), the focus of our investigation lies in the models’ ability to generalise across different initial conditions within the training time interval. This research contributes insights into the physical modelling of dynamical systems (in particular those addressing musical acoustics) by offering a comparative overview of these and previous methods and introducing innovative strategies for model improvement. Our results highlight the efficacy of these models in simulating non-linear dynamics and emphasise their wide-ranging applicability in accurately modelling dynamical systems over extended sequences.

[LG-14] 3D Pose-Based Temporal Action Segmentation for Figure Skating: A Fine-Grained and Jump Procedure-Aware Annotation Approach

链接: https://arxiv.org/abs/2408.16638
作者: Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii
关键词-EN: Understanding human actions, Understanding human, Temporal Action Segmentation, including sports, figure skating
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 7th ACM International Workshop on Multimedia Content Analysis in Sports

点击查看摘要

Abstract:Understanding human actions from videos is essential in many domains, including sports. In figure skating, technical judgments are performed by watching skaters’ 3D movements, and part of the judging procedure can be regarded as a Temporal Action Segmentation (TAS) task. TAS tasks in figure skating, which automatically assign temporal semantics to video, are actively researched. However, there is a lack of datasets and effective methods for TAS tasks requiring 3D pose data. In this study, we first created the FS-Jump3D dataset of complex and dynamic figure skating jumps using optical markerless motion capture. We also propose a new fine-grained figure skating jump TAS dataset annotation method with which TAS models can learn jump procedures. In the experimental results, we validated the usefulness of 3D pose features as input and the fine-grained dataset for the TAS model in figure skating. The FS-Jump3D dataset is available at this https URL.

[LG-15] Turbulence Strength C_n^2 Estimation from Video using Physics-based Deep Learning

链接: https://arxiv.org/abs/2408.16623
作者: Ripon Kumar Saha,Esen Salcin,Jihoo Kim,Joseph Smith,Suren Jayasuriya
关键词-EN: dynamic image distortion, image distortion due, long distance suffer, refractive indices, long distance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Code Available: this https URL

点击查看摘要

Abstract:Images captured from a long distance suffer from dynamic image distortion due to turbulent flow of air cells with random temperatures, and thus refractive indices. This phenomenon, known as image dancing, is commonly characterized by its refractive-index structure constant C_n^2 as a measure of the turbulence strength. For many applications, such as atmospheric forecast models, long-range/astronomy imaging, aviation safety, and optical communication technology, C_n^2 estimation is critical for accurately sensing the turbulent environment. Previous methods for C_n^2 estimation include estimation from meteorological data (temperature, relative humidity, wind shear, etc.) for single-point measurements, two-ended pathlength measurements from optical scintillometers for path-averaged C_n^2, and, more recently, estimation from passive video cameras at low cost and hardware complexity. In this paper, we present a comparative analysis of classical image gradient methods for C_n^2 estimation and modern deep learning-based methods leveraging convolutional neural networks. To enable this, we collect a dataset of video capture along with reference scintillometer measurements for ground truth, and we release this unique dataset to the scientific community. We observe that deep learning methods can achieve higher accuracy when trained on similar data, but suffer from generalization errors to other, unseen imagery as compared to classical methods. To overcome this trade-off, we present a novel physics-based network architecture that combines learned convolutional layers with a differentiable image gradient method that maintains high accuracy while being generalizable across image datasets.
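For intuition about the image-gradient family of methods mentioned above, here is a minimal gradient-energy statistic on a single frame; real classical estimators relate such statistics across frames to C_n^2 through calibrated optical models, which this hypothetical toy omits entirely:

```python
import numpy as np

def gradient_energy(frame):
    # mean squared spatial gradient of one frame (a crude sharpness statistic)
    gy, gx = np.gradient(frame.astype(float))
    return float((gx ** 2 + gy ** 2).mean())

rng = np.random.default_rng(0)
flat = np.full((32, 32), 128.0)                   # featureless frame
textured = rng.integers(0, 256, (32, 32)).astype(float)

assert gradient_energy(flat) == 0.0               # no structure, no gradients
assert gradient_energy(textured) > gradient_energy(flat)
```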

[LG-16] Towards Infusing Auxiliary Knowledge for Distracted Driver Detection KDD

链接: https://arxiv.org/abs/2408.16621
作者: Ishwar B Balappanawar,Ashmit Chamoli,Ruwan Wickramarachchi,Aditya Mishra,Ponnurangam Kumaraguru,Amit P. Sheth
关键词-EN: road accidents globally, accidents globally, Distracted driving, distracted driving involves, road accidents
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at KiL 2024: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference

点击查看摘要

Abstract:Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver’s pose. Specifically, we construct a unified framework that integrates the scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver’s actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.

[LG-17] Hyperdimensional Vector Tsetlin Machines with Applications to Sequence Learning and Generation

链接: https://arxiv.org/abs/2408.16620
作者: Christian D. Blakely
关键词-EN: adding numerous advantages, vanilla Tsetlin machines, Tsetlin machine clause, Tsetlin machines, machine learning model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We construct a two-layered model for learning and generating sequential data that is both computationally fast and competitive with vanilla Tsetlin machines, adding numerous advantages. Through the use of hyperdimensional vector computing (HVC) algebras and Tsetlin machine clause structures, we demonstrate that the combination inherits the generality of data encoding and decoding of HVC together with the fast, interpretable nature of Tsetlin machines to yield a powerful machine learning model. We apply the approach to forecasting, generating new sequences, and classification. For the latter, we derive results for the entire UCR Time Series Archive and compare with the standard benchmarks to see how well the method competes in time series classification.
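The HVC side of such a model rests on a small algebra over high-dimensional bipolar vectors. Below is a minimal sketch of the standard bind/bundle/similarity operations (dimension and seed chosen arbitrarily; this is generic HVC, not the paper's exact encoding scheme):

```python
import random

random.seed(0)
D = 10_000  # HVC typically uses ~10k-dimensional bipolar vectors

def rand_hv():
    return [random.choice((-1, 1)) for _ in range(D)]

def bind(a, b):   # elementwise product: associates two vectors (invertible)
    return [x * y for x, y in zip(a, b)]

def bundle(*vs):  # elementwise majority: superposes several vectors
    return [1 if sum(col) > 0 else -1 for col in zip(*vs)]

def sim(a, b):    # normalized dot product in [-1, 1]
    return sum(x * y for x, y in zip(a, b)) / D

a, b, c = rand_hv(), rand_hv(), rand_hv()
s = bundle(a, b, c)

assert sim(s, a) > 0.3                # a bundle stays similar to its parts
assert abs(sim(s, rand_hv())) < 0.1   # ...but not to unrelated vectors
assert bind(bind(a, b), b) == a       # binding with b twice recovers a
```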

[LG-18] Blending Low and High-Level Semantics of Time Series for Better Masked Time Series Generation

链接: https://arxiv.org/abs/2408.16613
作者: Johan Vik Mathisen,Erlend Lokna,Daesoo Lee,Erlend Aune
关键词-EN: utilize vector quantization-based, vector quantization-based tokenization, time series generation, effectively model complex, model complex distributions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:State-of-the-art approaches in time series generation (TSG), such as TimeVQVAE, utilize vector quantization-based tokenization to effectively model complex distributions of time series. These approaches first learn to transform time series into a sequence of discrete latent vectors, and then a prior model is learned to model the sequence. The discrete latent vectors, however, only capture low-level semantics (e.g., shapes). We hypothesize that higher-fidelity time series can be generated by training a prior model on more informative discrete latent vectors that contain both low and high-level semantics (e.g., characteristic dynamics). In this paper, we introduce a novel framework, termed NC-VQVAE, to integrate self-supervised learning into those TSG methods to derive a discrete latent space where low and high-level semantics are captured. Our experimental results demonstrate that NC-VQVAE results in a considerable improvement in the quality of synthetic samples.
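The vector-quantization step these TSG models build on maps each latent vector to its nearest codebook entry. A minimal sketch with a hypothetical, well-separated toy codebook:

```python
import numpy as np

codebook = np.arange(32.0).reshape(8, 4)  # 8 toy codes, 4-dim latents

def quantize(z):
    # index of the nearest codebook entry for each latent vector (rows of z)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

z = codebook[[2, 5]] + 0.1  # latents lying near codes 2 and 5
assert quantize(z).tolist() == [2, 5]
```

A prior model is then trained over these discrete indices rather than the raw series.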

[LG-19] Data Quality Monitoring through Transfer Learning on Anomaly Detection for the Hadron Calorimeters

链接: https://arxiv.org/abs/2408.16612
作者: Mulugeta Weldezgina Asres,Christian Walter Omlin,Long Wang,Pavel Parygin,David Yu,Jay Dittmann, TheCMS-HCAL Collaboration
关键词-EN: including monitoring, proliferation of sensors, sensors brings, brings an immense, volume of spatio-temporal
类目: Machine Learning (cs.LG)
*备注: 28 pages, 15 figures, and 9 tables

点击查看摘要

Abstract:The proliferation of sensors brings an immense volume of spatio-temporal (ST) data in many domains for various purposes, including monitoring, diagnostics, and prognostics applications. Data curation is a time-consuming process for a large volume of data, making it challenging and expensive to deploy data analytics platforms in new environments. Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task. Despite the triumph of TL in fields like computer vision and natural language processing, efforts on complex ST models for anomaly detection (AD) applications are limited. In this study, we present the potential of TL within the context of AD for the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN. We have transferred the ST AD models trained on data collected from one part of a calorimeter to another. We have investigated different configurations of TL on semi-supervised autoencoders of the ST AD models – transferring convolutional, graph, and recurrent neural networks of both the encoder and decoder networks. The experiment results demonstrate that TL effectively enhances the model learning accuracy on a target subdetector. The TL achieves promising data reconstruction and AD performance while substantially reducing the trainable parameters of the AD models. It also improves robustness against anomaly contamination in the training data sets of the semi-supervised AD models.
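The headline efficiency effect, transferring pre-trained weights and freezing them so that only a small part of the model remains trainable, can be sketched in miniature (component names and sizes here are hypothetical; the paper transfers convolutional, graph, and recurrent autoencoder parts):

```python
# weight blobs keyed by component name; sizes are made up for illustration
pretrained = {"encoder": [0.1] * 900, "decoder": [0.2] * 100}

def transfer(source, freeze=("encoder",)):
    model = {k: list(v) for k, v in source.items()}   # copy source weights
    trainable = {k: v for k, v in model.items() if k not in freeze}
    return model, trainable

model, trainable = transfer(pretrained)
n_total = sum(len(v) for v in model.values())
n_train = sum(len(v) for v in trainable.values())
assert (n_total, n_train) == (1000, 100)  # 10x fewer trainable parameters
```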

[LG-20] sEMG-Driven Physics-Informed Gated Recurrent Networks for Modeling Upper Limb Multi-Joint Movement Dynamics

链接: https://arxiv.org/abs/2408.16599
作者: Rajnish Kumar,Anand Gupta,Suriya Prakash Muthukrishnan,Lalan Kumar,Sitikantha Roy
关键词-EN: advanced human-machine interfaces, systems offer great, enhancing human strength, offer great potential, rehabilitation systems offer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exoskeletons and rehabilitation systems offer great potential for enhancing human strength and recovery through advanced human-machine interfaces (HMIs) that adapt to movement dynamics. However, the real-time application of physics-informed neural networks (PINNs) is limited by their reliance on fixed input lengths and surrogate models. This study introduces a novel physics-informed Gated Recurrent Network (PiGRN) designed to predict multi-joint torques using surface electromyography (sEMG) data. The PiGRN model employs a Gated Recurrent Unit (GRU) to convert time-series sEMG inputs into multi-joint kinematics and external loads, which are then integrated into an equation of motion to ensure consistency with physical laws. Experimental validation with sEMG data from five participants performing elbow flexion-extension tasks showed that the PiGRN model accurately predicted joint torques for 10 unfamiliar movements, with RMSE values between 4.02% and 11.40% and correlation coefficients ranging from 0.87 to 0.98. These findings highlight the PiGRN’s potential for real-time exoskeleton and rehabilitation applications. Future research will explore more diverse datasets, improve musculoskeletal models, and investigate unsupervised learning methods.
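The physics-informed part of such a loss can be sketched for a single joint: alongside the usual data term, penalize torque predictions that violate a simple equation of motion. The dynamics and parameter values below are a hypothetical toy, not the paper's multi-joint musculoskeletal model:

```python
import math

# toy single-joint rigid-arm dynamics: tau = I*theta_ddot + m*g*l*sin(theta)
# (parameter values are hypothetical, not from the paper)
I, m, g, l = 0.05, 1.0, 9.81, 0.3

def physics_residual(theta, theta_ddot, tau_pred):
    tau_physics = I * theta_ddot + m * g * l * math.sin(theta)
    return (tau_pred - tau_physics) ** 2

def pinn_loss(tau_pred, tau_meas, theta, theta_ddot, lam=1.0):
    data_loss = (tau_pred - tau_meas) ** 2         # fit measured torque
    return data_loss + lam * physics_residual(theta, theta_ddot, tau_pred)

# a prediction consistent with the equation of motion incurs no physics penalty
theta, theta_ddot = 0.5, 2.0
tau_true = I * theta_ddot + m * g * l * math.sin(theta)
assert pinn_loss(tau_true, tau_true, theta, theta_ddot) < 1e-12
```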

[LG-21] High-Dimensional Sparse Data Low-rank Representation via Accelerated Asynchronous Parallel Stochastic Gradient Descent

链接: https://arxiv.org/abs/2408.16592
作者: Qicong Hu,Hao Wu
关键词-EN: describe real-world node, real-world node interactions, node interactions, characterized by high, high dimensionality
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Data characterized by high dimensionality and sparsity are commonly used to describe real-world node interactions. Low-rank representation (LR) can map high-dimensional sparse (HDS) data to low-dimensional feature spaces and infer node interactions via modeling data latent associations. Unfortunately, existing optimization algorithms for LR models are computationally inefficient and slowly convergent on large-scale datasets. To address this issue, this paper proposes an Accelerated Asynchronous Parallel Stochastic Gradient Descent (A2PSGD) for High-Dimensional Sparse Data Low-rank Representation, with three-fold ideas: a) establishing a lock-free scheduler to simultaneously respond to scheduling requests from multiple threads; b) introducing a greedy algorithm-based load balancing strategy for balancing the computational load among threads; c) incorporating Nesterov’s accelerated gradient into the learning scheme to accelerate model convergence. Empirical studies show that A2PSGD outperforms existing optimization algorithms for HDS data LR in both accuracy and training time.
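The greedy load-balancing idea in (b) can be illustrated with the classic longest-processing-time heuristic: always hand the heaviest pending task to the currently least-loaded thread. This is a generic sketch, not the paper's exact scheduler:

```python
import heapq

def balance(loads, n_threads):
    """Greedy longest-processing-time assignment of task loads to threads."""
    heap = [(0.0, t) for t in range(n_threads)]   # (current load, thread id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_threads)]
    for load in sorted(loads, reverse=True):      # heaviest tasks first
        total, t = heapq.heappop(heap)            # least-loaded thread
        assignment[t].append(load)
        heapq.heappush(heap, (total + load, t))
    return assignment

parts = balance([7, 5, 4, 3, 3, 2], 2)
assert sorted(sum(p) for p in parts) == [12, 12]  # perfectly balanced here
```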

[LG-22] CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions INTERSPEECH2024

链接: https://arxiv.org/abs/2408.16589
作者: Laurin Wagner,Bernhard Thallinger,Mario Zusag
关键词-EN: Whisper speech recognition, decoder cross-attention scores, applying dynamic time, dynamic time warping, recognition model significantly
类目: Machine Learning (cs.LG)
*备注: Published at INTERSPEECH2024

点击查看摘要

Abstract:We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder’s cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at this https URL.
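The timestamp-extraction step relies on dynamic time warping. Below is a minimal textbook DTW over 1-D sequences; the model applies the same recurrence to cross-attention score matrices rather than raw samples:

```python
def dtw(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignment moves
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# identical sequences align at zero cost; a time-stretched copy stays cheap
assert dtw([1, 2, 3], [1, 2, 3]) == 0.0
assert dtw([1, 2, 3], [1, 1, 2, 2, 3, 3]) == 0.0
```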

[LG-23] Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation RECSYS’2024

链接: https://arxiv.org/abs/2408.16578
作者: Viet-Anh Tran,Guillaume Salha-Galvan,Bruno Sguerra,Romain Hennequin
关键词-EN: leverage sequential recommender, based on past, past sequences, PISA, listening
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 11 pages. Accepted by RecSys’2024, full paper

点击查看摘要

Abstract:Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson’s ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from this http URL and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use.
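PISA's attention is inspired by ACT-R's declarative memory module. ACT-R's base-level learning equation, A = ln(sum_j t_j^(-d)), already captures the repeat-aware intuition that frequently and recently accessed items have higher memory activation; a minimal sketch (decay d = 0.5 is ACT-R's conventional default, and this is the generic equation, not PISA's full embedding architecture):

```python
import math

def base_level_activation(lags, d=0.5):
    # ACT-R base-level learning: A = ln(sum_j t_j^(-d)),
    # where t_j is the time elapsed since the j-th past access
    return math.log(sum(t ** -d for t in lags))

# a song listened to often and recently beats one heard once, long ago
assert base_level_activation([1, 2, 3]) > base_level_activation([50])
```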

[LG-24] Seeking the Sufficiency and Necessity Causal Features in Multimodal Representation Learning

链接: https://arxiv.org/abs/2408.16577
作者: Boyu Chen,Junjie Liu,Zhu Li,Mengyue Yang
关键词-EN: learning models’ ability, enhance deep learning, deep learning models’, high Probability, models’ ability
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learning representations with a high Probability of Necessary and Sufficient Causes (PNS) has been shown to enhance deep learning models’ ability. This task involves identifying causal features that are both sufficient (guaranteeing the outcome) and necessary (without which the outcome cannot occur). However, current research predominantly focuses on unimodal data, and extending PNS learning to multimodal settings presents significant challenges. The challenges arise as the conditions for PNS identifiability, Exogeneity and Monotonicity, need to be reconsidered in a multimodal context, where sufficient and necessary causal features are distributed across different modalities. To address this, we first propose conceptualizing multimodal representations as comprising modality-invariant and modality-specific components. We then analyze PNS identifiability for each component, while ensuring non-trivial PNS estimation. Finally, we formulate tractable optimization objectives that enable multimodal models to learn high-PNS representations, thereby enhancing their predictive performance. Experiments demonstrate the effectiveness of our method on both synthetic and real-world data.

[LG-25] An Adaptive Latent Factorization of Tensors Model for Embedding Dynamic Communication Network

链接: https://arxiv.org/abs/2408.16573
作者: Xin Liao,Qicong Hu,Peng Tang
关键词-EN: Dynamic Communication Network, Communication Network, Dynamic Communication, Big-data applications, communication nodes increases
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:The Dynamic Communication Network (DCN) describes the interactions over time among various communication nodes, and it is widely used in Big-data applications as a data source. As the number of communication nodes increases and temporal slots accumulate, each node interacts with only a few nodes in a given temporal slot, so the DCN can be represented by a High-Dimensional Sparse (HDS) tensor. In order to extract rich behavioral patterns from an HDS tensor in DCN, this paper proposes an Adaptive Temporal-dependent Tensor low-rank representation (ATT) model. It adopts a three-fold approach: a) designing a temporal-dependent method to reconstruct the temporal feature matrix, thereby precisely representing the data by capturing temporal patterns; b) achieving hyper-parameter adaptation of the model via Differential Evolutionary Algorithms (DEA) to avoid tedious hyper-parameter tuning; c) employing nonnegative learning schemes for the model parameters to effectively handle the nonnegativity inherent in HDS data. The experimental results on four real-world DCNs demonstrate that the proposed ATT model significantly outperforms several state-of-the-art models in both prediction errors and convergence rounds.

[LG-26] Identifying Terrain Physical Parameters from Vision – Towards Physical-Parameter-Aware Locomotion and Navigation

链接: https://arxiv.org/abs/2408.16567
作者: Jiaqi Chen,Jonas Frey,Ruyi Zhou,Takahiro Miki,Georg Martius,Marco Hutter
关键词-EN: non-geometric hazards, physical properties, physical, essential for robotic, deal with non-geometric
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Identifying the physical properties of the surrounding environment is essential for robotic locomotion and navigation to deal with non-geometric hazards, such as slippery and deformable terrains. It would be of great benefit for robots to anticipate these extreme physical properties before contact; however, estimating environmental physical parameters from vision is still an open challenge. Animals can achieve this by using their prior experience and knowledge of what they have seen and how it felt. In this work, we propose a cross-modal self-supervised learning framework for vision-based environmental physical parameter estimation, which paves the way for future physical-property-aware locomotion and navigation. We bridge the gap between existing policies trained in simulation and identification of physical terrain parameters from vision. We propose to train a physical decoder in simulation to predict friction and stiffness from multi-modal input. The trained network allows the labeling of real-world images with physical parameters in a self-supervised manner to further train a visual network during deployment, which can densely predict the friction and stiffness from image data. We validate our physical decoder in simulation and the real world using a quadruped ANYmal robot, outperforming an existing baseline method. We show that our visual network can predict the physical properties in indoor and outdoor experiments while allowing fast adaptation to new environments.

[LG-27] Android Malware Detection Based on RGB Images and Multi-feature Fusion

链接: https://arxiv.org/abs/2408.16555
作者: Zhiqiang Wang,Qiulong Yu,Sicheng Yuan
关键词-EN: mobile device security, Android malware detection, Android malware, Android, malware
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 9 pages,10 figures

点击查看摘要

Abstract:With the widespread adoption of smartphones, Android malware has become a significant challenge in the field of mobile device security. Current Android malware detection methods often rely on feature engineering to construct dynamic or static features, which are then used for learning. However, static feature-based methods struggle to counter code obfuscation, packing, and signing techniques, while dynamic feature-based methods involve time-consuming feature extraction. Image-based methods for Android malware detection offer better resilience against malware variants and polymorphic malware. This paper proposes an end-to-end Android malware detection technique based on RGB images and multi-feature fusion. The approach involves extracting Dalvik Executable (DEX) files, AndroidManifest.xml files, and API calls from APK files, converting them into grayscale images, and enhancing their texture features using Canny edge detection, histogram equalization, and adaptive thresholding techniques. These grayscale images are then combined into an RGB image containing multi-feature fusion information, which is analyzed using mainstream image classification models for Android malware detection. Extensive experiments demonstrate that the proposed method effectively captures Android malware characteristics, achieving an accuracy of up to 97.25%, outperforming existing detection methods that rely solely on DEX files as classification features. Additionally, ablation experiments confirm the effectiveness of using the three key files for feature representation in the proposed approach.
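The multi-feature fusion step, combining three per-source grayscale images into one RGB image, can be sketched directly; random arrays stand in for the real DEX, manifest, and API-call images, and the toy size is arbitrary:

```python
import numpy as np

def fuse_to_rgb(dex_gray, manifest_gray, api_gray):
    # stack the three per-source grayscale images as the R, G, B channels
    return np.stack([dex_gray, manifest_gray, api_gray], axis=-1)

rng = np.random.default_rng(0)
h = w = 64  # toy size; real pipelines rescale APK-derived byte plots
dex = rng.integers(0, 256, (h, w), dtype=np.uint8)
manifest = rng.integers(0, 256, (h, w), dtype=np.uint8)
api = rng.integers(0, 256, (h, w), dtype=np.uint8)

rgb = fuse_to_rgb(dex, manifest, api)
assert rgb.shape == (64, 64, 3) and rgb.dtype == np.uint8
assert (rgb[..., 0] == dex).all()  # red channel carries the DEX features
```

The fused image can then be fed to any standard image classifier.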

[LG-28] SALSA: Speedy ASR-LLM Synchronous Aggregation INTERSPEECH2024

链接: https://arxiv.org/abs/2408.16542
作者: Ashish Mittal,Darshan Prabhu,Sunita Sarawagi,Preethi Jyothi
关键词-EN: Harnessing pre-trained LLMs, Harnessing pre-trained, improve ASR systems, ASR systems, ASR
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Harnessing pre-trained LLMs to improve ASR systems, particularly for low-resource languages, is now an emerging area of research. Existing methods range from using LLMs for ASR error correction to tightly coupled systems that replace the ASR decoder with the LLM. These approaches either increase decoding time or require expensive training of the cross-attention layers. We propose SALSA, which couples the decoder layers of the ASR to the LLM decoder, while synchronously advancing both decoders. Such coupling is performed with a simple projection of the last decoder state, and is thus significantly more training efficient than earlier approaches. A challenge of our proposed coupling is handling the mismatch between the tokenizers of the LLM and ASR systems. We handle this mismatch using cascading tokenization with respect to the LLM and ASR vocabularies. We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.

[LG-29] SFR-GNN: Simple and Fast Robust GNNs against Structural Attacks

链接: https://arxiv.org/abs/2408.16537
作者: Xing Ai,Guanyu Zhu,Yulin Zhu,Yu Zheng,Gaolei Li,Jianhua Li,Kai Zhou
关键词-EN: demonstrated commendable performance, Graph Neural Networks, graph-structured data, demonstrated commendable, commendable performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated commendable performance for graph-structured data. Yet, GNNs are often vulnerable to adversarial structural attacks as embedding generation relies on graph topology. Existing efforts are dedicated to purifying the maliciously modified structure or applying adaptive aggregation, thereby enhancing the robustness against adversarial structural attacks. It is inevitable for a defender to consume heavy computational costs due to lacking prior knowledge about modified structures. To this end, we propose an efficient defense method, called Simple and Fast Robust Graph Neural Network (SFR-GNN), supported by mutual information theory. The SFR-GNN first pre-trains a GNN model using node attributes and then fine-tunes it over the modified graph in the manner of contrastive learning, which is free of purifying modified structures and adaptive aggregation, thus achieving great efficiency gains. Consequently, SFR-GNN exhibits a 24%–162% speedup compared to advanced robust models, demonstrating superior robustness for node classification tasks.

[LG-30] TinyTNAS: GPU-Free Time-Bound Hardware-Aware Neural Architecture Search for TinyML Time Series Classification

链接: https://arxiv.org/abs/2408.16535
作者: Bidyut Saha,Riya Samanta,Soumya K. Ghosh,Ram Babu Roy
关键词-EN: time series classification, hardware-aware multi-objective Neural, tool specifically designed, multi-objective Neural Architecture, TinyML time series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we present TinyTNAS, a novel hardware-aware multi-objective Neural Architecture Search (NAS) tool specifically designed for TinyML time series classification. Unlike traditional NAS methods that rely on GPU capabilities, TinyTNAS operates efficiently on CPUs, making it accessible for a broader range of applications. Users can define constraints on RAM, FLASH, and MAC operations to discover optimal neural network architectures within these parameters. Additionally, the tool allows for time-bound searches, ensuring the best possible model is found within a user-specified duration. In experiments on the benchmark datasets UCI HAR, PAMAP2, WISDM, MIT-BIH, and the PTB Diagnostic ECG Database, TinyTNAS demonstrates state-of-the-art accuracy with significant reductions in RAM, FLASH, MAC usage, and latency. For example, on the UCI HAR dataset, TinyTNAS achieves a 12x reduction in RAM usage, a 144x reduction in MAC operations, and a 78x reduction in FLASH memory while maintaining superior accuracy and reducing latency by 149x. Similarly, on the PAMAP2 and WISDM datasets, it achieves a 6x reduction in RAM usage, a 40x reduction in MAC operations, an 83x reduction in FLASH, and a 67x reduction in latency, all while maintaining superior accuracy. Notably, the search process completes within 10 minutes in a CPU environment. These results highlight TinyTNAS’s capability to optimize neural network architectures effectively for resource-constrained TinyML applications, ensuring both efficiency and high performance. The code for TinyTNAS is available at the GitHub repository and can be accessed at this https URL.
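The constraint- and time-bounded search described above can be sketched in a few lines. The resource model below (a small 1-D CNN with stride-2 pooling) and the use of parameter count as a stand-in for validation accuracy are illustrative assumptions, not TinyTNAS's actual estimator:

```python
import itertools
import random
import time

SEQ_LEN, CHANNELS = 128, 9  # e.g. 9-channel inertial windows as in UCI HAR

def resources(filters, kernel, layers):
    """Rough proxies: parameters ~ FLASH, peak activations ~ RAM, plus MACs."""
    params = macs = 0
    c_in, length = CHANNELS, SEQ_LEN
    for _ in range(layers):
        params += c_in * filters * kernel + filters   # conv weights + bias
        macs += c_in * filters * kernel * length
        c_in, length = filters, length // 2           # stride-2 pooling
    return params, macs, filters * length             # (flash, macs, ram)

def search(max_params, max_macs, max_ram, time_budget_s=0.25):
    """Time-bound random search that keeps only budget-feasible models."""
    space = list(itertools.product([8, 16, 32, 64], [3, 5, 7], [1, 2, 3]))
    rng = random.Random(0)
    best, deadline = None, time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        arch = rng.choice(space)
        p, m, r = resources(*arch)
        if p <= max_params and m <= max_macs and r <= max_ram:
            score = p   # stand-in for the candidate's validation accuracy
            if best is None or score > best[0]:
                best = (score, arch)
    return best

print(search(max_params=20_000, max_macs=2_000_000, max_ram=2_048))
```

Any architecture the loop returns is guaranteed to satisfy the user-defined RAM, FLASH, and MAC budgets, and the wall-clock deadline bounds the search as in the paper's time-bound mode.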

[LG-31] Multitask learning for improved scour detection: A dynamic wave tank study

链接: https://arxiv.org/abs/2408.16527
作者: Simon M. Brealy,Aidan J. Hughes,Tina A. Dardeno,Lawrence A. Bull,Robin S. Mills,Nikolaos Dervilis,Keith Worden
关键词-EN: Population-based structural health, structural health monitoring, Population-based structural, health monitoring, aims to share
类目: Machine Learning (cs.LG)
*备注: 25 pages, 12 figures, early work features in ISWHM 2023 conference proceedings and available here: arXiv:2402.19295 . Submitted to the Renewable Energy journal

点击查看摘要

Abstract:Population-based structural health monitoring (PBSHM) aims to share information between members of a population. An offshore wind (OW) farm could be considered as a population of nominally-identical wind-turbine structures. However, benign variations exist among members, such as geometry, sea-bed conditions and temperature differences. These factors could influence structural properties and therefore the dynamic response, making it more difficult to detect structural problems via traditional SHM techniques. This paper explores the use of a Bayesian hierarchical model as a means of multitask learning, to infer foundation stiffness distribution parameters at both population and local levels. To do this, observations of natural frequency from populations of structures were first generated from both numerical and experimental models. These observations were then used in a partially-pooled Bayesian hierarchical model in tandem with surrogate FE models of the structures to infer foundation stiffness parameters. Finally, it is demonstrated how the learned parameters may be used as a basis to perform more robust anomaly detection (as compared to a no-pooling approach), e.g. as a result of scour.

[LG-32] Adaptive Variational Continual Learning via Task-Heuristic Modelling

链接: https://arxiv.org/abs/2408.16517
作者: Fan Yang
关键词-EN: Variational continual learning, turn-key learning algorithm, generalized variational continual, Variational continual, continual learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Variational continual learning (VCL) is a turn-key learning algorithm that has state-of-the-art performance among the best continual learning models. In our work, we explore an extension of the generalized variational continual learning (GVCL) model, named AutoVCL, which combines task heuristics for informed learning and model optimization. We demonstrate that our model outperforms the standard GVCL with fixed hyperparameters, benefiting from the automatic adjustment of the hyperparameter based on the difficulty and similarity of the incoming task compared to the previous tasks.

[LG-33] On-device AI: Quantization-aware Training of Transformers in Time-Series

链接: https://arxiv.org/abs/2408.16495
作者: Tianheng Ling,Gregor Schiele
关键词-EN: Artificial Intelligence, Transformer model, pervasive computing, Programmable Gate Arrays, Field Programmable Gate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper is accepted by 2023 IEEE International Conference on Pervasive Computing and Communications(PhD Forum)

点击查看摘要

Abstract:Artificial Intelligence (AI) models for time-series in pervasive computing keep getting larger and more complicated. The Transformer model is by far the most compelling of these AI models. However, it is difficult to obtain the desired performance when deploying such a massive model on a sensor device with limited resources. My research focuses on optimizing the Transformer model for time-series forecasting tasks. The optimized model will be deployed as hardware accelerators on embedded Field Programmable Gate Arrays (FPGAs). I will investigate the impact of applying Quantization-aware Training to the Transformer model to reduce its size and runtime memory footprint while maximizing the advantages of FPGAs.
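The core mechanism of quantization-aware training is simulating integer rounding in the forward pass while (during training) passing gradients straight through. A minimal sketch of symmetric per-tensor fake quantization follows; the exact scheme used in this work may differ:

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantize-dequantize with a symmetric per-tensor scale: the forward
    pass sees the rounding that integer hardware will apply, while training
    would treat this op as identity (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.array([0.12, -0.5, 0.031, 0.25])
wq, scale = fake_quant(w)
print(np.max(np.abs(w - wq)))  # rounding error is at most scale / 2
```

Training against these rounded weights lets the model adapt to the precision loss before deployment on the FPGA.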

[LG-34] An Exploratory Deep Learning Approach for Predicting Subsequent Suicidal Acts in Chinese Psychological Support Hotlines

链接: https://arxiv.org/abs/2408.16463
作者: Changwei Song,Qing Zhao,Jianqiang Li,Yining Chen,Yongsheng Tong,Guanghui Fu
关键词-EN: suicide risk assessment, individual risk scores, effective suicide prevention, suicide prevention measure, Psychological support hotlines
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Psychological support hotlines are an effective suicide prevention measure that typically relies on professionals using suicide risk assessment scales to predict individual risk scores. However, the accuracy of scale-based predictive methods for suicide risk assessment can vary widely depending on the expertise of the operator. This limitation underscores the need for more reliable methods, prompting this research’s innovative exploration of the use of artificial intelligence to improve the accuracy and efficiency of suicide risk prediction within the context of psychological support hotlines. The study included data from 1,549 subjects from 2015-2017 in China who contacted a psychological support hotline. Each participant was followed for 12 months to identify instances of suicidal behavior. We proposed a novel multi-task learning method that uses the large-scale pre-trained model Whisper for feature extraction and fits psychological scales while predicting the risk of suicide. The proposed method yields a 2.4 percentage point improvement in F1-score compared to the traditional manual approach based on the psychological scales. Our model demonstrated superior performance compared to the other eight popular models. To our knowledge, this study is the first to apply deep learning to long-term speech data to predict suicide risk in China, indicating great potential for clinical applications. The source code is publicly available at this https URL.

[LG-35] HYGENE: A Diffusion-based Hypergraph Generation Method

链接: https://arxiv.org/abs/2408.16457
作者: Dorian Gailhard,Enzo Tartaglione,Lirida Naviner De Barros,Jhony H. Giraldo
关键词-EN: including social networks, powerful mathematical structures, high-order relationships, including social, social networks
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注: arXiv admin note: text overlap with arXiv:2312.11529 by other authors

点击查看摘要

Abstract:Hypergraphs are powerful mathematical structures that can model complex, high-order relationships in various domains, including social networks, bioinformatics, and recommender systems. However, generating realistic and diverse hypergraphs remains challenging due to their inherent complexity and lack of effective generative models. In this paper, we introduce a diffusion-based Hypergraph Generation (HYGENE) method that addresses these challenges through a progressive local expansion approach. HYGENE works on the bipartite representation of hypergraphs, starting with a single pair of connected nodes and iteratively expanding it to form the target hypergraph. At each step, nodes and hyperedges are added in a localized manner using a denoising diffusion process, which allows for the construction of the global structure before refining local details. Our experiments demonstrated the effectiveness of HYGENE, proving its ability to closely mimic a variety of properties in hypergraphs. To the best of our knowledge, this is the first attempt to employ deep learning models for hypergraph generation, and our work aims to lay the groundwork for future research in this area.

[LG-36] Do Recommender Systems Promote Local Music? A Reproducibility Study Using Music Streaming Data

链接: https://arxiv.org/abs/2408.16430
作者: Kristina Matrosova,Lilian Marey,Guillaume Salha-Galvan,Thomas Louail,Olivier Bodini,Manuel Moussallam
关键词-EN: discussing prior findings, local music representation, local music, recommender systems, recommender systems exhibit
类目: Information Retrieval (cs.IR); Databases (cs.DB); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines the influence of recommender systems on local music representation, discussing prior findings from an empirical study on the LFM-2b public dataset. This prior study argued that different recommender systems exhibit algorithmic biases shifting music consumption either towards or against local content. However, LFM-2b users do not reflect the diverse audience of music streaming services. To assess the robustness of this study’s conclusions, we conduct a comparative analysis using proprietary listening data from a global music streaming service, which we publicly release alongside this paper. We observe significant differences in local music consumption patterns between our dataset and LFM-2b, suggesting that caution should be exercised when drawing conclusions on local music based solely on LFM-2b. Moreover, we show that the algorithmic biases exhibited in the original work vary in our dataset, and that several unexplored model parameters can significantly influence these biases and affect the study’s conclusion on both datasets. Finally, we discuss the complexity of accurately labeling local music, emphasizing the risk of misleading conclusions due to unreliable, biased, or incomplete labels. To encourage further research and ensure reproducibility, we have publicly shared our dataset and code.

[LG-37] Gradient-free variational learning with conditional mixture networks

链接: https://arxiv.org/abs/2408.16429
作者: Conor Heins,Hao Wu,Dimitrije Markovic,Alexander Tschantz,Jeff Beck,Christopher Buckley
关键词-EN: Balancing computational efficiency, Balancing computational, robust predictive performance, critical applications, computational efficiency
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 16 pages main text (3 figures), including references. 9 pages supplementary material (5 figures)

点击查看摘要

Abstract:Balancing computational efficiency with robust predictive performance is crucial in supervised learning, especially for critical applications. Standard deep learning models, while accurate and scalable, often lack probabilistic features like calibrated predictions and uncertainty quantification. Bayesian methods address these issues but can be computationally expensive as model and data complexity increase. Previous work shows that fast variational methods can reduce the compute requirements of Bayesian methods by eliminating the need for gradient computation or sampling, but are often limited to simple models. We demonstrate that conditional mixture networks (CMNs), a probabilistic variant of the mixture-of-experts (MoE) model, are suitable for fast, gradient-free inference and can solve complex classification tasks. CMNs employ linear experts and a softmax gating network. By exploiting conditional conjugacy and Pólya-Gamma augmentation, we furnish Gaussian likelihoods for the weights of both the linear experts and the gating network. This enables efficient variational updates using coordinate ascent variational inference (CAVI), avoiding traditional gradient-based optimization. We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository. Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation, while maintaining competitive runtime and full posterior distributions over all model parameters. Moreover, as input size or the number of experts increases, computation time scales competitively with MLE and other gradient-based solutions like black-box variational inference (BBVI), making CAVI-CMN a promising tool for deep, fast, and gradient-free Bayesian networks.
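A forward pass of such a conditional mixture network, linear experts mixed by a softmax gating network, can be sketched as follows. The sizes are illustrative, and the CAVI updates themselves (the paper's contribution) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

D, K, C = 2, 3, 2                       # hypothetical: 3 experts, 2 classes
gate_W = rng.normal(size=(D, K))        # softmax gating network
expert_W = rng.normal(size=(K, D, C))   # one linear classifier per expert

def predict_proba(X):
    """Mixture prediction: gate-weighted average of the experts' softmaxes."""
    gates = softmax(X @ gate_W)                                 # (N, K)
    expert_p = softmax(np.einsum('nd,kdc->nkc', X, expert_W))   # (N, K, C)
    return np.einsum('nk,nkc->nc', gates, expert_p)             # (N, C)

X = rng.normal(size=(5, D))
p = predict_proba(X)
print(p.sum(axis=1))  # valid distributions: each row sums to 1
```

In the paper, the weights of both the experts and the gating network carry Gaussian posteriors updated by coordinate ascent rather than the fixed point estimates used here.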

[LG-38] A Comparative Study of Hyperparameter Tuning Methods

链接: https://arxiv.org/abs/2408.16425
作者: Subhasis Dasgupta,Jaydip Sen
关键词-EN: Tree-structured Parzen Estimator, hyperparameter optimization increases, algorithms Tree-structured Parzen, bias and variance, increases in complexity
类目: Machine Learning (cs.LG)
*备注: This chapter has been accepted in the edited volume titles “Data Science in Theory and Practice”, editor J Sen S Roy Choudhury. The volume is expected to be published in October 2024 by Cambridge Scholars Publishing, New Castle upon Tyne, UK. This chapter is 34 pages long and it contains 11 tables and 8 images

点击查看摘要

Abstract:The study emphasizes the challenge of finding the optimal trade-off between bias and variance, especially as hyperparameter optimization increases in complexity. Through empirical analysis, three hyperparameter tuning algorithms Tree-structured Parzen Estimator (TPE), Genetic Search, and Random Search are evaluated across regression and classification tasks. The results show that nonlinear models, with properly tuned hyperparameters, significantly outperform linear models. Interestingly, Random Search excelled in regression tasks, while TPE was more effective for classification tasks. This suggests that there is no one-size-fits-all solution, as different algorithms perform better depending on the task and model type. The findings underscore the importance of selecting the appropriate tuning method and highlight the computational challenges involved in optimizing machine learning models, particularly as search spaces expand.
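The Random Search baseline evaluated in the study fits in a few lines. The search space and the scoring function below are hypothetical stand-ins for a real model with cross-validated accuracy:

```python
import math
import random

# Hypothetical search space: categorical depth/estimator choices plus a
# log-uniform learning-rate range.
space = {
    "max_depth": [2, 4, 8, 16],
    "learning_rate": (1e-4, 1e-1),
    "n_estimators": [50, 100, 200],
}

def sample(rng):
    lo, hi = space["learning_rate"]
    return {
        "max_depth": rng.choice(space["max_depth"]),
        "learning_rate": math.exp(rng.uniform(math.log(lo), math.log(hi))),
        "n_estimators": rng.choice(space["n_estimators"]),
    }

def evaluate(cfg):
    # Stand-in for a cross-validated score, peaked at depth 8 and lr ~ 1e-2.
    return -abs(cfg["max_depth"] - 8) - 100 * abs(cfg["learning_rate"] - 1e-2)

rng = random.Random(0)
samples = [sample(rng) for _ in range(100)]
best = max(samples, key=evaluate)
print(best)
```

TPE and genetic search differ only in how the next candidate is proposed; the evaluate-and-keep-best loop is the same.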

[LG-39] Fourier Spectral Physics Informed Neural Network: An Efficient and Low-Memory PINN

链接: https://arxiv.org/abs/2408.16414
作者: Tianchi Yu,Yiming Qi,Ivan Oseledets,Shiyi Chen
关键词-EN: solving partial differential, partial differential equations, physics-informed neural networks, growing investigations, investigations into solving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:With growing investigations into solving partial differential equations by physics-informed neural networks (PINNs), more accurate and efficient PINNs are required to meet the practical demands of scientific computing. One bottleneck of current PINNs is computing the high-order derivatives via automatic differentiation which often necessitates substantial computing resources. In this paper, we focus on removing the automatic differentiation of the spatial derivatives and propose a spectral-based neural network that substitutes the differential operator with a multiplication. Compared to the PINNs, our approach requires lower memory and shorter training time. Thanks to the exponential convergence of the spectral basis, our approach is more accurate. Moreover, to handle the different situations between physics domain and spectral domain, we provide two strategies to train networks by their spectral information. Through a series of comprehensive experiments, We validate the aforementioned merits of our proposed network.
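The key substitution, replacing automatic differentiation of spatial derivatives with a multiplication in Fourier space, can be demonstrated on a periodic grid. This illustrates the spectral-derivative idea only, not the paper's full network:

```python
import numpy as np

N = 64
x = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
u = np.sin(3.0 * x)

# On a periodic grid, d/dx becomes multiplication by i*k in Fourier space:
# no automatic differentiation is needed for the spatial derivative.
ik = 1j * np.fft.fftfreq(N, d=1.0 / N)   # i * integer wavenumbers
du = np.fft.ifft(ik * np.fft.fft(u)).real

err = np.max(np.abs(du - 3.0 * np.cos(3.0 * x)))
print(err)  # machine-precision error: spectral accuracy
```

The error is at machine precision for band-limited functions, which is the exponential convergence the abstract credits for the method's accuracy.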

[LG-40] DeepSPoC: A Deep Learning-Based PDE Solver Governed by Sequential Propagation of Chaos

链接: https://arxiv.org/abs/2408.16403
作者: Kai Du,Yongle Xie,Tao Zhou,Yuancheng Zhou
关键词-EN: recently developed tool, related nonlinear Fokker-Planck, Sequential propagation, stochastic differential equations, nonlinear Fokker-Planck equations
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential propagation of chaos (SPoC) is a recently developed tool to solve mean-field stochastic differential equations and their related nonlinear Fokker-Planck equations. Based on the theory of SPoC, we present a new method (deepSPoC) that combines the interacting particle system of SPoC and deep learning. Under the framework of deepSPoC, two classes of frequently used deep models are considered: fully connected neural networks and normalizing flows. For high-dimensional problems, spatially adaptive methods are designed to further improve the accuracy and efficiency of deepSPoC. We analyze the convergence of the framework of deepSPoC under some simplified conditions and also provide a posteriori error estimation for the algorithm. Finally, we test our methods on a wide range of different types of mean-field equations.

[LG-41] Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization

链接: https://arxiv.org/abs/2408.16393
作者: Maria Laura Santoni,Elena Raponi,Aneta Neumann,Frank Neumann,Mike Preuss,Carola Doerr
关键词-EN: favor structurally diverse, structurally diverse design, diverse design choices, real-world applications, users often favor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In real-world applications, users often favor structurally diverse design choices over one high-quality solution. It is hence important to consider more solutions that decision-makers can compare and further explore based on additional criteria. Alongside the existing approaches of evolutionary diversity optimization, quality diversity, and multimodal optimization, this paper presents a fresh perspective on this challenge by considering the problem of identifying a fixed number of solutions with a pairwise distance above a specified threshold while maximizing their average quality. We obtain first insight into these objectives by performing a subset selection on the search trajectories of different well-established search heuristics, whether specifically designed with diversity in mind or not. We emphasize that the main goal of our work is not to present a new algorithm but to look at the problem in a more fundamental and theoretically tractable way by asking the question: What trade-off exists between the minimum distance within batches of solutions and the average quality of their fitness? These insights also provide us with a way of making general claims concerning the properties of optimization problems that shall be useful in turn for benchmarking algorithms of the approaches enumerated above. A possibly surprising outcome of our empirical study is the observation that naive uniform random sampling establishes a very strong baseline for our problem, hardly ever outperformed by the search trajectories of the considered heuristics. We interpret these results as a motivation to develop algorithms tailored to produce diverse solutions of high average quality.
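The objective studied above, a fixed-size batch with pairwise distance above a threshold and maximal average quality, admits a simple greedy subset-selection sketch. The 1-D toy problem below is illustrative; the paper selects from search trajectories of real heuristics:

```python
import random

def select_diverse(points, fitness, k, d_min):
    """Best-first greedy: accept a solution only if it keeps a pairwise
    distance of at least d_min to everything already chosen."""
    order = sorted(range(len(points)), key=lambda i: -fitness[i])
    chosen = []
    for i in order:
        if all(abs(points[i] - points[j]) >= d_min for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

rng = random.Random(0)
points = [rng.uniform(0.0, 10.0) for _ in range(50)]   # 1-D search space
fitness = [-(p - 5.0) ** 2 for p in points]            # quality peaks at x = 5
sel = select_diverse(points, fitness, k=3, d_min=1.0)
print(sorted(points[i] for i in sel))
```

Raising d_min forces the batch further from the fitness peak, which is exactly the diversity-fitness trade-off the paper quantifies.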

[LG-42] TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis

链接: https://arxiv.org/abs/2408.16391
作者: Lena Sasal,Daniel Busby,Abdenour Hadid
关键词-EN: shown significant capabilities, Graph neural networks, data remains limited, handling structured data, graph attention network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNN) have shown significant capabilities in handling structured data, yet their application to dynamic, temporal data remains limited. This paper presents a new type of graph attention network, called TempoKGAT, which combines time-decaying weight and a selective neighbor aggregation mechanism on the spatial domain, which helps uncover latent patterns in the graph data. In this approach, a top-k neighbor selection based on the edge weights is introduced to represent the evolving features of the graph data. We evaluated the performance of our TempoKGAT on multiple datasets from the traffic, energy, and health sectors involving spatio-temporal data. We compared the performance of our approach to several state-of-the-art methods found in the literature on several open-source datasets. Our method shows superior accuracy on all datasets. These results indicate that TempoKGAT builds on existing methodologies to optimize prediction accuracy and provide new insights into model interpretation in temporal contexts.
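The two ingredients named above, time-decaying edge weights and top-k neighbor selection, can be sketched as a toy aggregation step. The exponential decay and the plain weighted-sum aggregation are illustrative assumptions, not TempoKGAT's exact operator:

```python
import numpy as np

def aggregate(node_feats, edges, ages, decay=0.5, k=2):
    """edges: (src, dst, weight) triples; ages: elapsed time per edge.
    Each edge weight is decayed by exp(-decay * age), then only the k
    strongest incoming neighbors are aggregated per destination node."""
    n, d = node_feats.shape
    out = np.zeros((n, d))
    for dst in range(n):
        inc = [(w * np.exp(-decay * a), s)
               for (s, t, w), a in zip(edges, ages) if t == dst]
        inc.sort(reverse=True)            # keep the k strongest neighbors
        for w, s in inc[:k]:
            out[dst] += w * node_feats[s]
    return out

feats = np.eye(4)                                  # one-hot node features
edges = [(0, 3, 1.0), (1, 3, 1.0), (2, 3, 1.0)]
ages = [0.0, 1.0, 5.0]                             # the edge from node 2 is oldest
h = aggregate(feats, edges, ages)
print(h[3])  # node 2's stale edge is dropped by the top-k selection
```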

[LG-43] Addressing Common Misinterpretations of KART and UAT in Neural Network Literature

链接: https://arxiv.org/abs/2408.16389
作者: Vugar Ismailov
关键词-EN: Universal Approximation Theorem, Kolmogorov-Arnold Representation Theorem, Representation Theorem, Approximation Theorem, Universal Approximation
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 6 pages

点击查看摘要

Abstract:This note addresses the Kolmogorov-Arnold Representation Theorem (KART) and the Universal Approximation Theorem (UAT), focusing on their common misinterpretations in some papers related to neural network approximation. Our remarks aim to support a more accurate understanding of KART and UAT among neural network specialists.

[LG-44] TG-PhyNN: An Enhanced Physically-Aware Graph Neural Network framework for forecasting Spatio-Temporal Data

链接: https://arxiv.org/abs/2408.16379
作者: Zakaria Elabid,Lena Sasal,Daniel Busby,Abdenour Hadid
关键词-EN: Graph Neural Networks, remains a challenge, Accurately forecasting dynamic, Neural Network framework, Accurately forecasting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately forecasting dynamic processes on graphs, such as traffic flow or disease spread, remains a challenge. While Graph Neural Networks (GNNs) excel at modeling and forecasting spatio-temporal data, they often lack the ability to directly incorporate underlying physical laws. This work presents TG-PhyNN, a novel Temporal Graph Physics-Informed Neural Network framework. TG-PhyNN leverages the power of GNNs for graph-based modeling while simultaneously incorporating physical constraints as a guiding principle during training. This is achieved through a two-step prediction strategy that enables the calculation of physical equation derivatives within the GNN architecture. Our findings demonstrate that TG-PhyNN significantly outperforms traditional forecasting models (e.g., GRU, LSTM, GAT) on real-world spatio-temporal datasets like PedalMe (traffic flow), COVID-19 spread, and Chickenpox outbreaks. These datasets are all governed by well-defined physical principles, which TG-PhyNN effectively exploits to offer more reliable and accurate forecasts in various domains where physical processes govern the dynamics of data. This paves the way for improved forecasting in areas like traffic flow prediction, disease outbreak prediction, and potentially other fields where physics plays a crucial role.

[LG-45] Do Graph Neural Networks Work for High Entropy Alloys?

链接: https://arxiv.org/abs/2408.16337
作者: Hengrui Zhang,Ruishu Huang,Jie Chen,James M. Rondinelli,Wei Chen
关键词-EN: Graph neural networks, neural networks, crystals and molecules, excelled in predictive, Graph neural
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have excelled in predictive modeling for both crystals and molecules, owing to the expressiveness of graph representations. High-entropy alloys (HEAs), however, lack chemical long-range order, limiting the applicability of current graph representations. To overcome this challenge, we propose a representation of HEAs as a collection of local environment (LE) graphs. Based on this representation, we introduce the LESets machine learning model, an accurate, interpretable GNN for HEA property prediction. We demonstrate the accuracy of LESets in modeling the mechanical properties of quaternary HEAs. Through analyses and interpretation, we further extract insights into the modeling and design of HEAs. In a broader sense, LESets extends the potential applicability of GNNs to disordered materials with combinatorial complexity formed by diverse constituents and their flexible configurations.

[LG-46] GL-TSVM: A robust and smooth twin support vector machine with guardian loss function

链接: https://arxiv.org/abs/2408.16336
作者: Mushir Akhtar,M. Tanveer,Mohd. Arshad
关键词-EN: support vector machine, Twin support vector, times lower computational, vector machine, garnered significant attention
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2404.18101

点击查看摘要

Abstract:Twin support vector machine (TSVM), a variant of support vector machine (SVM), has garnered significant attention due to its 3/4 times lower computational complexity compared to SVM. However, due to the utilization of the hinge loss function, TSVM is sensitive to outliers or noise. To remedy it, we introduce the guardian loss (G-loss), a novel loss function distinguished by its asymmetric, bounded, and smooth characteristics. We then fuse the proposed G-loss function into the TSVM and yield a robust and smooth classifier termed GL-TSVM. Further, to adhere to the structural risk minimization (SRM) principle and reduce overfitting, we incorporate a regularization term into the objective function of GL-TSVM. To address the optimization challenges of GL-TSVM, we devise an efficient iterative algorithm. The experimental analysis on UCI and KEEL datasets substantiates the effectiveness of the proposed GL-TSVM in comparison to the baseline models. Moreover, to showcase the efficacy of the proposed GL-TSVM in the biomedical domain, we evaluated it on the breast cancer (BreaKHis) and schizophrenia datasets. The outcomes strongly demonstrate the competitiveness of the proposed GL-TSVM against the baseline models.

[LG-47] Self-Improving Diffusion Models with Synthetic Data

链接: https://arxiv.org/abs/2408.16333
作者: Sina Alemohammad,Ahmed Imtiaz Humayun,Shruti Agarwal,John Collomosse,Richard Baraniuk
关键词-EN: synthetic data, training increasingly large, data, increasingly large generative, synthetic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model’s generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model’s synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
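The negative-guidance idea can be sketched in the style of classifier-free guidance, with the model trained on self-synthesized data supplying the negative term; the exact formula SIMS uses may differ from this assumption:

```python
import numpy as np

def guided_eps(eps_real, eps_synth, w=1.5):
    """Steer the denoising direction away from the synthetic-data manifold
    by extrapolating past the self-synthesized model's noise prediction."""
    return eps_real + w * (eps_real - eps_synth)

eps_real = np.array([0.2, -0.1])    # base model's noise prediction
eps_synth = np.array([0.5, -0.4])   # prediction of the model trained on
                                    # self-synthesized data (negative term)
print(guided_eps(eps_real, eps_synth))
```

With w > 0 the combined prediction moves opposite to the synthetic model's direction, which is how self-generated data can guide sampling toward the real distribution instead of collapsing onto its own outputs.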

[LG-48] Minimising changes to audit when updating decision trees

链接: https://arxiv.org/abs/2408.16321
作者: Anj Simmons,Scott Barnett,Anupam Chaudhuri,Sankhya Singh,Shangeetha Sivasothy
关键词-EN: Interpretable models, training data, models are important, Interpretable, model is updated
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Interpretable models are important, but what happens when the model is updated on new training data? We propose an algorithm for updating a decision tree while minimising the number of changes to the tree that a human would need to audit. We achieve this via a greedy approach that incorporates the number of changes to the tree as part of the objective function. We compare our algorithm to existing methods and show that it sits in a sweet spot between final accuracy and number of changes to audit.
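The greedy objective, accuracy minus a penalty on the number of changes a human must audit, can be illustrated with a toy candidate-selection step. The real algorithm edits the tree itself; the lambda weight here is a hypothetical choice:

```python
# Score each candidate update by accuracy on new data minus a penalty on
# the number of changed tree nodes a human would need to audit.
def best_update(candidates, lam=0.02):
    """candidates: list of (accuracy_on_new_data, num_changed_nodes)."""
    return max(range(len(candidates)),
               key=lambda i: candidates[i][0] - lam * candidates[i][1])

candidates = [
    (0.90, 0),    # keep the old tree unchanged
    (0.93, 1),    # tweak one split threshold
    (0.95, 8),    # re-grow a whole subtree
]
print(best_update(candidates))  # 1: the small edit beats the full re-grow
```

Tuning lam moves the result along the accuracy-vs-audit-effort trade-off the paper describes as a sweet spot.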

[LG-49] Passenger hazard perception based on EEG signals for highly automated driving vehicles

链接: https://arxiv.org/abs/2408.16315
作者: Ashton Yu Xuan Tan,Yingkai Yang,Xiaofei Zhang,Bowen Li,Xiaorong Gao,Sifa Zheng,Jianqiang Wang,Xinyu Gu,Jun Li,Yang Zhao,Yuxin Zhang,Tania Stathaki
关键词-EN: recent accidents involving, accidents involving automated, involving automated systems, EEG Decoding Strategy, Passenger Cognitive Model
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans’ sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and the Passenger EEG Decoding Strategy (PEDS). Central to PEDS is a novel Convolutional Recurrent Neural Network (CRNN) that captures spatial and temporal EEG data patterns. The CRNN, combined with stacking algorithms, achieves an accuracy of 85.0% \pm 3.18% . Our findings highlight the predictive power of pre-event EEG data, enhancing the detection of hazardous scenarios and offering a network-driven framework for safer autonomous vehicles.

[LG-50] Physics of Language Models: Part 2.2 How to Learn From Mistakes on Grade-School Math Problems

链接: https://arxiv.org/abs/2408.16293
作者: Tian Ye,Zicheng Xu,Yuanzhi Li,Zeyuan Allen-Zhu
关键词-EN: demonstrated remarkable performance, solving reasoning tasks, occasionally make reasoning, make reasoning mistakes, Language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2407.20311

点击查看摘要

Abstract:Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to “self-correct” their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating “error-correction” data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
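The "error-correction" pretrain data described above (an erroneous step immediately followed by its correction) can be prepared with a helper like the one below. The `[BACK]` retry token and the exact sequence format are assumptions for illustration; the paper works with a synthetic math dataset:

```python
def make_error_correction_sample(correct_steps, wrong_step, position,
                                 retry_token="[BACK]"):
    """Build one pretraining sequence: insert an erroneous step at
    `position`, immediately followed by a retry token and the correct
    continuation of the solution."""
    return (correct_steps[:position]
            + [wrong_step, retry_token]
            + correct_steps[position:])

sample = make_error_correction_sample(
    ["a = 3 + 4", "b = a * 2"], wrong_step="a = 3 + 5", position=0)
```

The model then sees the mistake and its immediate correction in a single auto-regressive stream, with no multi-round prompting.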

[LG-51] Flexible framework for generating synthetic electrocardiograms and photoplethysmograms

链接: https://arxiv.org/abs/2408.16291
作者: Katri Karhinoja,Antti Vasankari,Jukka-Pekka Sirkiä,Antti Airola,David Wong,Matti Kaisti
关键词-EN: generating synthetic biosignals, quantity and variety, variety of health, model, generating synthetic
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:By generating synthetic biosignals, the quantity and variety of health data can be increased. This is especially useful when training machine learning models, enabling data augmentation and the introduction of more physiologically plausible variation to the data. For these purposes, we have developed a synthetic biosignal model for two signal modalities, electrocardiography (ECG) and photoplethysmography (PPG). The model produces realistic signals that account for physiological effects such as breathing modulation and changes in heart rate due to physical stress. Arrhythmic signals can be generated with beat intervals extracted from real measurements. The model also includes a flexible approach to adding different kinds of noise and signal artifacts. The noise is generated from power spectral densities extracted from both measured noisy signals and modeled power spectra. Importantly, the model also automatically produces labels for noise, segmentation (e.g. P and T waves, QRS complex, for electrocardiograms), and artifacts. We assessed how this comprehensive model can be used in practice to improve the performance of models trained on ECG or PPG data. For example, we trained an LSTM to detect ECG R-peaks using both real ECG signals from the MIT-BIH arrhythmia set and our new generator. The F1 score of the model was 0.83 using real data, in comparison to 0.98 using our generator. In addition, the model can be used, for example, in signal segmentation, quality detection, and benchmarking of detection algorithms. The model code has been released at this https URL.
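The breathing-modulation effect mentioned in the abstract can be illustrated with a toy waveform: a heart-rate sinusoid whose amplitude is modulated at the breathing frequency. All parameter names and the waveform shape below are assumptions; the paper's model is far richer, using real beat intervals, spectral noise models, artifacts, and automatic labels:

```python
import math

def synthetic_ppg(duration_s, fs, hr_bpm, breath_hz=0.25, mod_depth=0.1):
    """Toy PPG-like signal: sin at the heart rate, amplitude-modulated
    by a slower breathing sinusoid. Returns `duration_s * fs` samples."""
    n = int(duration_s * fs)
    f_hr = hr_bpm / 60.0
    return [(1.0 + mod_depth * math.sin(2 * math.pi * breath_hz * t / fs))
            * math.sin(2 * math.pi * f_hr * t / fs)
            for t in range(n)]

sig = synthetic_ppg(duration_s=10, fs=100, hr_bpm=72)
```

The envelope stays within 1 ± `mod_depth`, so the generated amplitude never exceeds 1.1 here.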

[LG-52] OpenFGL: A Comprehensive Benchmarks for Federated Graph Learning

链接: https://arxiv.org/abs/2408.16288
作者: Xunkai Li,Yinlin Zhu,Boyang Pang,Guochen Yan,Yeyu Yan,Zening Li,Zhengyu Wu,Wentao Zhang,Rong-Hua Li,Guoren Wang
关键词-EN: promising distributed training, distributed training paradigm, multiple local systems, graph neural networks, direct data sharing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
*备注: Under Review

点击查看摘要

Abstract:Federated graph learning (FGL) has emerged as a promising distributed training paradigm for graph neural networks across multiple local systems without direct data sharing. This approach is particularly beneficial in privacy-sensitive scenarios and offers a new perspective on addressing scalability challenges in large-scale graph learning. Despite the proliferation of FGL, the diverse motivations from practical applications, spanning various research backgrounds and experimental settings, pose a significant challenge to fair evaluation. To fill this gap, we propose OpenFGL, a unified benchmark designed for the primary FGL scenarios: Graph-FL and Subgraph-FL. Specifically, OpenFGL includes 38 graph datasets from 16 application domains, 8 federated data simulation strategies that emphasize graph properties, and 5 graph-based downstream tasks. Additionally, it offers 18 recently proposed SOTA FGL algorithms through a user-friendly API, enabling a thorough comparison and comprehensive evaluation of their effectiveness, robustness, and efficiency. Empirical results demonstrate the ability of FGL while also revealing its potential limitations, offering valuable insights for future exploration in this thriving field.

[LG-53] Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

链接: https://arxiv.org/abs/2408.16286
作者: Toshinori Kitamura,Tadashi Kozuno,Wataru Kumagai,Kenta Hoshino,Yohei Hosoe,Kazumi Kasaura,Masashi Hamaya,Paavo Parmas,Yutaka Matsuo
关键词-EN: real-world control applications, Designing a safe, control applications, crucial in real-world, real-world control
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Designing a safe policy for uncertain environments is crucial in real-world control applications. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional Lagrangian max-min formulation with policy gradient methods can become trapped in suboptimal solutions by encountering a sum of conflicting gradients from the objective and constraint functions during its inner minimization problem. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a binary search algorithm with a policy gradient subroutine and prove that it identifies an ε-optimal policy in an RCMDP with Õ(ε⁻⁴) policy evaluations.
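The outer loop of the proposed method, a binary search over the epigraph threshold, can be sketched as below. The inner policy-gradient subroutine is abstracted into a feasibility oracle `feasible(b)`, which is an assumption for illustration:

```python
def epigraph_binary_search(feasible, lo, hi, tol=1e-4):
    """Bisect on the epigraph threshold b: find (approximately) the
    smallest b for which some policy keeps worst-case cumulative cost
    at most b while satisfying the constraints. `feasible(b)` stands in
    for the paper's policy-gradient subroutine."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid  # a feasible policy exists at threshold mid
        else:
            lo = mid
    return hi

# Toy oracle: thresholds of at least 2.5 are achievable.
b_star = epigraph_binary_search(lambda b: b >= 2.5, lo=0.0, hi=10.0)
```

Because feasibility is monotone in b, bisection converges to the optimal threshold within `tol`.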

[LG-54] ART: Actually Robust Training

链接: https://arxiv.org/abs/2408.16285
作者: Sebastian Chwilczyński,Kacper Trębacz,Karol Cyganik,Mateusz Małecki,Dariusz Brzezinski
关键词-EN: deep learning captures, developing deep learning, deep learning, programmers and researchers, captures the attention
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Current interest in deep learning captures the attention of many programmers and researchers. Unfortunately, the lack of a unified schema for developing deep learning models results in methodological inconsistencies, unclear documentation, and problems with reproducibility. Some guidelines have been proposed, yet currently, they lack practical implementations. Furthermore, neural network training often takes on the form of trial and error, lacking a structured and thoughtful process. To alleviate these issues, in this paper, we introduce Art, a Python library designed to help automatically impose rules and standards while developing deep learning pipelines. Art divides model development into a series of smaller steps of increasing complexity, each concluded with a validation check improving the interpretability and robustness of the process. The current version of Art comes equipped with nine predefined steps inspired by Andrej Karpathy’s Recipe for Training Neural Networks, a visualization dashboard, and integration with loggers such as Neptune. The code related to this paper is available at: this https URL.
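The "series of smaller steps, each concluded with a validation check" idea can be sketched with a minimal pipeline runner. This API is a hypothetical stand-in, not the Art library's actual interface:

```python
def run_pipeline(steps, state):
    """Run development steps in order; each step ends with a validation
    check, mirroring Art's step-with-check structure (API assumed)."""
    for name, step, check in steps:
        state = step(state)
        if not check(state):
            raise RuntimeError(f"validation failed at step: {name}")
    return state

# Two toy steps inspired by common training-recipe checks.
steps = [
    ("overfit-one-batch", lambda s: {**s, "loss": 0.01}, lambda s: s["loss"] < 0.1),
    ("full-training",     lambda s: {**s, "acc": 0.91},  lambda s: s["acc"] > 0.5),
]
final = run_pipeline(steps, {})
```

Failing a check halts development early, which is the interpretability and robustness benefit the abstract describes.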

[LG-55] Enhancing Customer Churn Prediction in Telecommunications: An Adaptive Ensemble Learning Approach

链接: https://arxiv.org/abs/2408.16284
作者: Mohammed Affan Shaikhsurab,Pramod Magadum
关键词-EN: Support Vector Machine, poses a significant, discontinuation of services, services by existing, significant challenge
类目: Machine Learning (cs.LG)
*备注: 12 pages,2 figures

点击查看摘要

Abstract:Customer churn, the discontinuation of services by existing customers, poses a significant challenge to the telecommunications industry. This paper proposes a novel adaptive ensemble learning framework for highly accurate customer churn prediction. The framework integrates multiple base models, including XGBoost, LightGBM, LSTM, a Multi-Layer Perceptron (MLP) neural network, and Support Vector Machine (SVM). These models are strategically combined using a stacking ensemble method, further enhanced by meta-feature generation from base model predictions. A rigorous data preprocessing pipeline, coupled with a multi-faceted feature engineering approach, optimizes model performance. The framework is evaluated on three publicly available telecom churn datasets, demonstrating substantial accuracy improvements over state-of-the-art techniques. The research achieves a remarkable 99.28% accuracy, signifying a major advancement in churn prediction. The implications of this research for developing proactive customer retention strategies within the telecommunications industry are discussed.
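The stacking step, where base-model predictions become meta-features for a second-level model, can be sketched as follows. The toy base models and the fixed meta-weights are assumptions; in the paper the stacker is trained on these meta-features:

```python
def stack_features(base_models, X):
    """Meta-features: each base model's churn probability per customer."""
    return [[m(x) for m in base_models] for x in X]

def meta_predict(weights, feats, threshold=0.5):
    """A linear meta-model over base predictions (weights assumed)."""
    return [1 if sum(w * f for w, f in zip(weights, row)) >= threshold else 0
            for row in feats]

# Two toy "base models" scoring a customer by monthly charge.
base_models = [lambda x: min(1.0, x / 100.0),
               lambda x: 0.9 if x > 80 else 0.2]
feats = stack_features(base_models, [90, 30])
preds = meta_predict([0.5, 0.5], feats)
```

A high-charge customer gets a churn prediction of 1, a low-charge customer 0, based only on the stacked base outputs.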

[LG-56] Web Service QoS Prediction via Extended Canonical Polyadic-based Tensor Network

链接: https://arxiv.org/abs/2408.16278
作者: Qu Wang,Hao Wu
关键词-EN: numerous web services, web services, tensor network, similar functionalities, Today
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures

点击查看摘要

Abstract:Today, numerous web services with similar functionalities are available on the Internet. Users often evaluate the Quality of Service (QoS) to choose the best option among them. Predicting the QoS values of these web services is a significant challenge in the field of web services. A Canonical Polyadic (CP)-based tensor network model has proven to be efficient for predicting dynamic QoS data. However, current CP-based tensor network models do not consider the correlation of users and services in the low-dimensional latent feature space, thereby limiting the model’s prediction capability. To tackle this issue, this paper proposes an Extended Canonical polyadic-based Tensor Network (ECTN) model. It models the correlation of users and services by building a relation dimension between user features and service features in low-dimensional space, and then designs an extended CP decomposition structure to improve prediction accuracy. Experiments are conducted on two public dynamic QoS datasets, and the results show that, compared with state-of-the-art QoS prediction models, ECTN obtains higher prediction accuracy.
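For context, the plain CP core that ECTN extends predicts a QoS entry as a sum over rank-one components; the extended relation dimension itself is not reproduced here:

```python
def cp_predict(U, S, T, i, j, k):
    """Rank-R CP estimate of the QoS entry for (user i, service j, time k):
    y_hat = sum_r U[i][r] * S[j][r] * T[k][r]."""
    return sum(u * s * t for u, s, t in zip(U[i], S[j], T[k]))

U = [[1.0, 2.0]]               # one user, rank 2 latent features
S = [[3.0, 1.0]]               # one service
T = [[1.0, 1.0], [0.5, 2.0]]   # two time slices
y00 = cp_predict(U, S, T, 0, 0, 0)
y01 = cp_predict(U, S, T, 0, 0, 1)
```

Each factor matrix holds low-dimensional latent features; ECTN's contribution is correlating the user and service factors through an added relation dimension before this reconstruction.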

[LG-57] On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

链接: https://arxiv.org/abs/2408.16262
作者: Yi Wan,Huizhen Yu,Richard S. Sutton
关键词-EN: Markov decision processes, analyzes reinforcement learning, paper analyzes reinforcement, RVI Q-learning algorithms, RVI Q-learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and comprised of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
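A single RVI Q-learning update has the following shape. Taking f(Q) to be Q at a fixed reference state-action pair is one common choice; the paper's analysis covers more general choices of f:

```python
def rvi_q_update(Q, actions, s, a, r, s_next, alpha, ref):
    """One RVI Q-learning step under the average-reward criterion:
    Q(s,a) += alpha * (r - f(Q) + max_b Q(s',b) - Q(s,a)),
    with f(Q) = Q[ref] for a fixed reference pair `ref`."""
    f_q = Q[ref]
    target = r - f_q + max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(0, "a"): 0.0, (0, "b"): 1.0, (1, "a"): 2.0, (1, "b"): 0.0}
rvi_q_update(Q, ["a", "b"], s=0, a="a", r=1.0, s_next=1, alpha=0.5, ref=(0, "a"))
```

Subtracting f(Q) keeps the Q-values bounded, which is what makes relative value iteration suitable for the average-reward setting where undiscounted values would otherwise diverge.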

[LG-58] Evaluating Time-Series Training Dataset through Lens of Spectrum in Deep State Space Models

链接: https://arxiv.org/abs/2408.16261
作者: Sekitoshi Kanai,Yasutoshi Ida,Kazuki Adachi,Mihiro Uchida,Tsukasa Yoshida,Shin’ya Yamaguchi
关键词-EN: state space models, deep SSMs, deep neural networks, SSMs, deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:This study investigates a method to evaluate time-series datasets in terms of the performance of deep neural networks (DNNs) with state space models (deep SSMs) trained on the dataset. SSMs have attracted attention as components inside DNNs to address time-series data. Since deep SSMs have powerful representation capacities, training datasets play a crucial role in solving a new task. However, the effectiveness of training datasets cannot be known until deep SSMs are actually trained on them. This can increase the cost of data collection for new tasks, as a trial-and-error process of data collection and time-consuming training are needed to achieve the necessary performance. To advance the practical use of deep SSMs, the metric of datasets to estimate the performance early in the training can be one key element. To this end, we introduce the concept of data evaluation methods used in system identification. In system identification of linear dynamical systems, the effectiveness of datasets is evaluated by using the spectrum of input signals. We introduce this concept to deep SSMs, which are nonlinear dynamical systems. We propose the K-spectral metric, which is the sum of the top-K spectra of signals inside deep SSMs, by focusing on the fact that each layer of a deep SSM can be regarded as a linear dynamical system. Our experiments show that the K-spectral metric has a large absolute value of the correlation coefficient with the performance and can be used to evaluate the quality of training datasets.
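The K-spectral metric, the sum of the top-K spectral magnitudes of signals inside the model, can be sketched with a plain DFT. The toy input lists stand in for the signals observed at each layer of a deep SSM:

```python
import cmath

def dft_magnitudes(x):
    """Magnitude spectrum (non-negative frequencies) of a real signal."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def k_spectral_metric(signals, k):
    """Sum of the top-K spectral magnitudes across all given signals."""
    mags = [m for s in signals for m in dft_magnitudes(s)]
    return sum(sorted(mags)[-k:])

metric = k_spectral_metric([[1.0, 0.0, 1.0, 0.0]], k=2)
```

A signal alternating 1, 0, 1, 0 puts all its energy at DC and the Nyquist frequency (magnitude 2 each), so the top-2 sum is 4.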

[LG-59] Coalitions of AI-based Methods Predict 15-Year Risks of Breast Cancer Metastasis Using Real-World Clinical Data with AUC up to 0.9

链接: https://arxiv.org/abs/2408.16256
作者: Xia Jiang,Yijun Zhou,Alan Wells,Adam Brufsky
关键词-EN: breast cancers newly, Breast cancer, cancers responsible, Breast, deaths
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Breast cancer is one of the two cancers responsible for the most deaths in women, with about 42,000 deaths each year in the US. That there are over 300,000 breast cancers newly diagnosed each year suggests that only a fraction of the cancers result in mortality. Thus, most of the women undergo seemingly curative treatment for localized cancers, but a significant fraction later succumb to metastatic disease for which current treatments are only temporizing for the vast majority. The current prognostic metrics are of little actionable value for 4 of the 5 women seemingly cured after local treatment, and many women are exposed to morbid and even mortal adjuvant therapies unnecessarily, with these adjuvant therapies reducing metastatic recurrence by only a third. Thus, there is a need for better prognostics to target aggressive treatment at those who are likely to relapse and spare those who were actually cured. While there is a plethora of molecular and tumor-marker assays in use and under development to detect recurrence early, these are time-consuming, expensive, and still often unvalidated as to actionable prognostic utility. A different approach would use large-data techniques to determine clinical and histopathological parameters that would provide accurate prognostics using existing data. Herein, we report on machine learning, together with grid search and Bayesian networks, to develop algorithms that present an AUC of up to 0.9 in ROC analyses, using only extant data. Such algorithms could be rapidly translated to clinical management as they do not require testing beyond routine tumor evaluations.

[LG-60] Iterated Energy-based Flow Matching for Sampling from Boltzmann Densities

链接: https://arxiv.org/abs/2408.16249
作者: Dongyeop Woo,Sungsoo Ahn
关键词-EN: training a generator, generator from evaluations, unnormalized densities, energy-based flow matching, Monte Carlo estimation
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this work, we consider the problem of training a generator from evaluations of energy functions or unnormalized densities. This is a fundamental problem in probabilistic inference, which is crucial for scientific applications such as learning the 3D coordinate distribution of a molecule. To solve this problem, we propose iterated energy-based flow matching (iEFM), the first off-policy approach to train continuous normalizing flow (CNF) models from unnormalized densities. We introduce the simulation-free energy-based flow matching objective, which trains the model to predict the Monte Carlo estimation of the marginal vector field constructed from known energy functions. Our framework is general and can be extended to variance-exploding (VE) and optimal transport (OT) conditional probability paths. We evaluate iEFM on a two-dimensional Gaussian mixture model (GMM) and an eight-dimensional four-particle double-well potential (DW-4) energy function. Our results demonstrate that iEFM outperforms existing methods, showcasing its potential for efficient and scalable probabilistic modeling in complex high-dimensional systems.
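For reference, the optimal-transport conditional probability path mentioned in the abstract has the standard flow-matching regression target below. iEFM's contribution is replacing the data sample x1 with a Monte Carlo estimate driven by the known energy function; that estimator is not reproduced in this sketch:

```python
def ot_conditional_vector_field(x, x1, t):
    """Target vector field for the OT conditional path in flow matching
    (with sigma_min = 0): u_t(x | x1) = (x1 - x) / (1 - t)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]

u = ot_conditional_vector_field(x=[0.0, 1.0], x1=[1.0, 1.0], t=0.5)
```

The model is trained to regress onto this field so that integrating it transports noise samples to the target distribution.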

[LG-61] PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation

链接: https://arxiv.org/abs/2408.16246
作者: Wenlun Zhang,Shimpei Ando,Yung-Chin Chen,Satomi Miyagi,Shinya Takamaeda-Yamazaki,Kentaro Yoshioka
关键词-EN: neural network processing, deep neural network, network processing, promising approach, approach to enhance
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Approximate computing emerges as a promising approach to enhance the efficiency of compute-in-memory (CiM) systems in deep neural network processing. However, traditional approximate techniques often significantly trade off accuracy for power efficiency, and fail to reduce data transfer between main memory and CiM banks, which dominates power consumption. This paper introduces a novel probabilistic approximate computation (PAC) method that leverages statistical techniques to approximate multiply-and-accumulation (MAC) operations, reducing approximation error by 4X compared to existing approaches. PAC enables efficient sparsity-based computation in CiM systems by simplifying complex MAC vector computations into scalar calculations. Moreover, PAC enables sparsity encoding and eliminates the LSB activations transmission, significantly reducing data reads and writes. This sets PAC apart from traditional approximate computing techniques, minimizing not only computation power but also memory accesses by 50%, thereby boosting system-level efficiency. We developed PACiM, a sparsity-centric architecture that fully exploits sparsity to reduce bit-serial cycles by 81% and achieves a peak 8b/8b efficiency of 14.63 TOPS/W in 65 nm CMOS while maintaining high accuracy of 93.85/72.36/66.02% on CIFAR-10/CIFAR-100/ImageNet benchmarks using a ResNet-18 model, demonstrating the effectiveness of our PAC methodology.

[LG-62] Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

链接: https://arxiv.org/abs/2408.16245
作者: Sully F. Chen,Robert J. Steele,Beakal Lemeneh,Shivanand P. Lad,Eric Oermann
关键词-EN: architecture has revolutionized, revolutionized bioinformatics, bioinformatics and driven, driven progress, understanding and prediction
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 27 pages, 5 figures

点击查看摘要

Abstract:The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen incredible success in downstream tasks in each domain and have achieved particularly noteworthy breakthroughs in sequences of peptides and structural modeling. However, these single-omic models are naturally incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions. We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given oligonucleotide and peptide, as well as the effect on this binding interaction due to mutations in the oligonucleotide sequence (ΔΔG). Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction. Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omics distributions, suggesting a more generalized or foundational approach to building these models.

[LG-63] Enhancing Conditional Image Generation with Explainable Latent Space Manipulation

链接: https://arxiv.org/abs/2408.16232
作者: Kshitij Pathania
关键词-EN: gradient-based selective attention, Selective Attention Manipulation, gradient-based selective, selective attention, selective attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages , 5 figures

点击查看摘要

Abstract:In the realm of image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes a novel approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross attention maps of the cross attention layers and gradients for the denoised latent vector, deriving importance scores of elements of the denoised latent vector related to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on the Places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Fréchet Inception Distance (FID) scores compared to baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation, offering a significant advancement in text-to-image synthesis tasks.

[LG-64] Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation

链接: https://arxiv.org/abs/2408.16228
作者: Vivek Myers,Bill Chunyuan Zheng,Oier Mees,Sergey Levine,Kuan Fang
关键词-EN: Learned language-conditioned robot, Learned language-conditioned, set of instructions, struggle to effectively, effectively adapt
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 27 pages, 14 figures

点击查看摘要

Abstract:Learned language-conditioned robot policies often struggle to effectively adapt to new real-world tasks even when pre-trained across a diverse set of instructions. We propose a novel approach for few-shot adaptation to unseen tasks that exploits the semantic understanding of task decomposition provided by vision-language models (VLMs). Our method, Policy Adaptation via Language Optimization (PALO), combines a handful of demonstrations of a task with proposed language decompositions sampled from a VLM to quickly enable rapid nonparametric adaptation, avoiding the need for a larger fine-tuning dataset. We evaluate PALO on extensive real-world experiments consisting of challenging unseen, long-horizon robot manipulation tasks. We find that PALO is able to consistently complete long-horizon, multi-tier tasks in the real world, outperforming state-of-the-art pre-trained generalist policies and methods that have access to the same demonstrations.

[LG-65] Targeted Cause Discovery with Data-Driven Learning

链接: https://arxiv.org/abs/2408.16218
作者: Jang-Hyun Kim,Claudia Skok Gibbs,Sangdoo Yun,Hyun Oh Song,Kyunghyun Cho
关键词-EN: machine learning approach, inferring causal variables, approach for inferring, target variable, inferring causal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: preprint

点击查看摘要

Abstract:We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our goal is to identify both direct and indirect causes within a system, thereby efficiently regulating the target variable when the difficulty and cost of intervening on each causal variable vary. Our method employs a neural network trained to identify causality through supervised learning on simulated data. By implementing a local-inference strategy, we achieve linear complexity with respect to the number of variables, efficiently scaling up to thousands of variables. Empirical results demonstrate the effectiveness of our method in identifying causal relationships within large-scale gene regulatory networks, outperforming existing causal discovery methods that primarily focus on direct causality. We validate our model’s generalization capability across novel graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. Implementation codes are available at this https URL.

[LG-66] ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics

链接: https://arxiv.org/abs/2408.16208
作者: Oishi Banerjee,Agustina Saenz,Kay Wu,Warren Clements,Adil Zia,Dominic Buensalido,Helen Kavnoudias,Alain S. Abi-Ghanem,Nour El Ghawi,Cibele Luna,Patricia Castillo,Khaled Al-Surimi,Rayyan A. Daghistani,Yuh-Min Chen,Heng-sheng Chao,Lars Heiliger,Moon Kim,Johannes Haubold,Frederic Jonske,Pranav Rajpurkar
关键词-EN: rapidly expanding capabilities, rapidly expanding, expanding capabilities, capabilities of generative, generative AI models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, a LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.

[LG-67] Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation

链接: https://arxiv.org/abs/2408.16204
作者: Lun Wang
关键词-EN: enhancing auto-speech recognition, gradient clipping method, recently shown potential, auto-speech recognition, recently shown
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Micro-batch clipping, a gradient clipping method, has recently shown potential in enhancing auto-speech recognition (ASR) model performance. However, the underlying mechanism behind this improvement remains mysterious, particularly the observation that only certain micro-batch sizes are beneficial. In this paper, we make the first attempt to explain this phenomenon. Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases. Under this assumption, the convergence analysis shows that micro-batch clipping can improve the convergence rate asymptotically at the cost of an additional constant bias that does not diminish with more training iterations. The bias is dependent on a few factors and can be minimized at specific micro-batch size, thereby elucidating the existence of the sweet-spot micro-batch size observed previously. We also verify the effectiveness of micro-batch clipping beyond speech models on vision and language models, and show promising performance gains in these domains. An exploration of potential limitations shows that micro-batch clipping is less effective when training data originates from multiple distinct domains.
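The basic operation being analyzed, clipping each micro-batch gradient before averaging, can be sketched as below; the choice of micro-batch size, which the paper studies, is outside this snippet:

```python
import math

def micro_batch_clip(grads, clip_norm):
    """Clip each micro-batch gradient to L2 norm `clip_norm`,
    then average across micro-batches."""
    def clip(g):
        n = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / n) if n > 0 else 1.0
        return [v * scale for v in g]
    clipped = [clip(g) for g in grads]
    dim = len(grads[0])
    return [sum(g[i] for g in clipped) / len(grads) for i in range(dim)]

# One large micro-batch gradient (norm 5) gets scaled to norm 1;
# the zero gradient is untouched.
avg = micro_batch_clip([[3.0, 4.0], [0.0, 0.0]], clip_norm=1.0)
```

Because clipping caps the influence of any single micro-batch, samples that would impede convergence contribute less, which is the data-pruning intuition the abstract invokes.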

[LG-68] Short-Term Electricity-Load Forecasting by Deep Learning: A Comprehensive Survey

链接: https://arxiv.org/abs/2408.16202
作者: Qi Dong,Rubing Huang,Chenhui Cui,Dave Towey,Ling Zhou,Jinyu Tian,Jianzhou Wang
关键词-EN: Short-Term Electricity-Load Forecasting, Short-Term Electricity-Load, power system, STELF, impact electricity demand
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Short-Term Electricity-Load Forecasting (STELF) refers to the prediction of the immediate demand (in the next few hours to several days) for the power system. Various external factors, such as weather changes and the emergence of new electricity consumption scenarios, can impact electricity demand, causing load data to fluctuate and become non-linear, which increases the complexity and difficulty of STELF. In the past decade, deep learning has been applied to STELF, modeling and predicting electricity demand with high accuracy, and contributing significantly to the development of STELF. This paper provides a comprehensive survey on deep-learning-based STELF over the past ten years. It examines the entire forecasting process, including data pre-processing, feature extraction, deep-learning modeling and optimization, and results evaluation. This paper also identifies some research challenges and potential research directions to be further investigated in future work.

[LG-69] Uni-3DAD: GAN-Inversion Aided Universal 3D Anomaly Detection on Model-free Products

链接: https://arxiv.org/abs/2408.16201
作者: Jiayu Liu,Shancong Mou,Nathan Gaw,Yinan Wang
关键词-EN: Anomaly detection, Anomaly, manufacturing systems, detection, point clouds
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is a long-standing challenge in manufacturing systems. Traditionally, anomaly detection has relied on human inspectors. However, 3D point clouds have gained attention due to their robustness to environmental factors and their ability to represent geometric data. Existing 3D anomaly detection methods generally fall into two categories. One compares scanned 3D point clouds with design files, assuming these files are always available. However, such assumptions are often violated in many real-world applications where model-free products exist, such as fresh produce (i.e., "Cookie", "Potato", etc.), dentures, bone, etc. The other category compares patches of scanned 3D point clouds with a library of normal patches named memory bank. However, those methods usually fail to detect incomplete shapes, which is a fairly common defect type (i.e., missing pieces of different products). The main challenge is that missing areas in 3D point clouds represent the absence of scanned points. This makes it infeasible to compare the missing region with existing point cloud patches in the memory bank. To address these two challenges, we proposed a unified, unsupervised 3D anomaly detection framework capable of identifying all types of defects on model-free products. Our method integrates two detection modules: a feature-based detection module and a reconstruction-based detection module. Feature-based detection covers geometric defects, such as dents, holes, and cracks, while the reconstruction-based method detects missing regions. Additionally, we employ a One-class Support Vector Machine (OCSVM) to fuse the detection results from both modules. The results demonstrate that (1) our proposed method outperforms the state-of-the-art methods in identifying incomplete shapes and (2) it still maintains comparable performance with the SOTA methods in detecting all other types of anomalies.

[LG-70] Variational Mode-Driven Graph Convolutional Network for Spatiotemporal Traffic Forecasting

链接: https://arxiv.org/abs/2408.16191
作者: Osama Ahmad,Zubair Khalid
关键词-EN: focuses on spatio-temporal, paper focuses, data, graph neural networks, Abstract
类目: Machine Learning (cs.LG)
*备注: IEEE Transactions on Intelligent Transportation Systems Submission, 2024

点击查看摘要

Abstract:This paper focuses on spatio-temporal (ST) traffic prediction using graph neural networks. Given that ST data consists of non-stationary and complex time events, interpreting and predicting such trends is comparatively complicated. Representation of ST data in modes helps us infer behavior and assess the impact of noise on prediction applications. We propose a framework that decomposes ST data into modes using the variational mode decomposition (VMD) method, which is then fed into the neural network for forecasting future states. This hybrid approach is known as a variational mode graph convolutional network (VMGCN). Instead of exhaustively searching for the number of modes, they are determined using the reconstruction loss from the real-time application data. We also study the significance of each mode and the impact of bandwidth constraints on different horizon predictions in traffic flow data. We evaluate the performance of our proposed network on the LargeST dataset for both short and long-term predictions. Our framework yields better results compared to state-of-the-art methods.
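The mode-count selection rule above (pick the number of modes from the reconstruction loss) can be illustrated with a crude frequency-domain stand-in for VMD. Real VMD solves a variational optimization problem; the `spectral_modes` helper below, which treats each dominant rFFT coefficient as one narrow-band "mode", is only an assumption-laden proxy for it:

```python
import numpy as np

def spectral_modes(signal, k):
    """Crude stand-in for VMD: the k strongest rFFT coefficients,
    each reconstructed alone as one narrow-band 'mode'."""
    spec = np.fft.rfft(signal)
    order = np.argsort(np.abs(spec))[::-1][:k]
    modes = []
    for idx in order:
        masked = np.zeros_like(spec)
        masked[idx] = spec[idx]
        modes.append(np.fft.irfft(masked, n=len(signal)))
    return modes

def choose_num_modes(signal, max_k=10, tol=1e-6):
    """Smallest k whose mode sum reconstructs the signal within tol
    (relative error), mirroring the reconstruction-loss selection rule."""
    for k in range(1, max_k + 1):
        recon = np.sum(spectral_modes(signal, k), axis=0)
        err = np.linalg.norm(signal - recon) / np.linalg.norm(signal)
        if err < tol:
            return k
    return max_k

t = np.linspace(0, 1, 256, endpoint=False)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t) \
    + 0.25 * np.sin(2 * np.pi * 40 * t)
print(choose_num_modes(x))  # 3 distinct sinusoids -> k = 3
```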

[LG-71] Real-Time Energy Pricing in New Zealand: An Evolving Stream Analysis PRICAI

链接: https://arxiv.org/abs/2408.16187
作者: Yibin Sun,Heitor Murilo Gomes,Bernhard Pfahringer,Albert Bifet
关键词-EN: Electricity Market Information, Market Information, Electricity Market, representing real-time time-series, Zealand government
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 Pages, 8 figures, short version accepted by PRICAI

点击查看摘要

Abstract:This paper introduces a group of novel datasets representing real-time time-series and streaming data of energy prices in New Zealand, sourced from the Electricity Market Information (EMI) website maintained by the New Zealand government. The datasets are intended to address the scarcity of proper datasets for streaming regression learning tasks. We conduct extensive analyses and experiments on these datasets, covering preprocessing techniques, regression tasks, prediction intervals, concept drift detection, and anomaly detection. Our experiments demonstrate the datasets’ utility and highlight the challenges and opportunities for future research in energy price forecasting.
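A concept-drift check of the kind the experiments cover can be sketched as a window-mean comparison on a synthetic price stream; the window size, threshold, and prices below are invented for illustration, not drawn from the EMI datasets:

```python
from collections import deque

def detect_drift(stream, window=30, threshold=3.0):
    """Flag indices where the recent window mean departs from the
    reference window mean by more than `threshold` reference std-devs."""
    ref, recent, drifts = deque(maxlen=window), deque(maxlen=window), []
    for i, price in enumerate(stream):
        recent.append(price)
        if len(ref) < window:        # warm-up: fill the reference window
            ref.append(price)
            continue
        mu = sum(ref) / window
        var = sum((p - mu) ** 2 for p in ref) / window
        std = max(var ** 0.5, 1e-9)
        recent_mu = sum(recent) / len(recent)
        if abs(recent_mu - mu) > threshold * std:
            drifts.append(i)
            ref = deque(recent, maxlen=window)  # reset reference after drift
    return drifts

# Synthetic half-hourly prices: stable around 80-84 $/MWh, then a jump to ~150.
stream = [80.0 + (i % 5) for i in range(100)] + [150.0 + (i % 5) for i in range(100)]
print(detect_drift(stream))  # first flag shortly after index 100
```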

[LG-72] CardBench: A Benchmark for Learned Cardinality Estimation in Relational Databases

链接: https://arxiv.org/abs/2408.16170
作者: Yannis Chronis,Yawen Wang,Yu Gan,Sami Abu-El-Haija,Chelsea Lin,Carsten Binnig,Fatma Özcan
关键词-EN: enabling high query, high query performance, Cardinality estimation, learned cardinality estimation, Cardinality
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cardinality estimation is crucial for enabling high query performance in relational databases. Recently, learned cardinality estimation models have been proposed to improve accuracy, but there is no systematic benchmark or dataset that allows researchers to evaluate the progress made by new learned approaches, or even to develop new learned approaches systematically. In this paper, we are releasing a benchmark containing thousands of queries over 20 distinct real-world databases for learned cardinality estimation. In contrast to other initial benchmarks, our benchmark is much more diverse and can be used for training and testing learned models systematically. Using this benchmark, we explored whether learned cardinality estimation can be transferred to an unseen dataset in a zero-shot manner. We trained GNN-based and transformer-based models to study the problem in three setups: (1) instance-based, (2) zero-shot, and (3) fine-tuned. Our results show that while we get promising results for zero-shot cardinality estimation on simple single-table queries, the accuracy drops as soon as we add joins. However, we show that with fine-tuning, we can still utilize pre-trained models for cardinality estimation, significantly reducing training overheads compared to instance-specific models. We are open sourcing our scripts to collect statistics, generate queries and training datasets to foster more extensive research, also from the ML community, on the important problem of cardinality estimation, and in particular to improve on recent directions such as pre-trained cardinality estimation.
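Learned estimators in this literature are commonly scored by the q-error, the larger of the over- and under-estimation factors; whether CardBench reports exactly this statistic is an assumption here, but the metric itself is standard:

```python
def q_error(estimated, true_card):
    """q-error: max of over- and under-estimation factors.
    Always >= 1, with 1 meaning a perfect estimate."""
    est, true = max(estimated, 1.0), max(true_card, 1.0)  # guard against zeros
    return max(est / true, true / est)

def median_q_error(pairs):
    """Median q-error over a workload of (estimate, true cardinality) pairs."""
    errs = sorted(q_error(e, t) for e, t in pairs)
    n = len(errs)
    return errs[n // 2] if n % 2 else 0.5 * (errs[n // 2 - 1] + errs[n // 2])

workload = [(100, 100), (50, 200), (4000, 1000), (1, 10)]
print([round(q_error(e, t), 2) for e, t in workload])  # [1.0, 4.0, 4.0, 10.0]
print(median_q_error(workload))  # 4.0
```

Being a ratio, q-error treats a 4x over-estimate and a 4x under-estimate symmetrically, which is why it is preferred over absolute error for query optimizer studies.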

[LG-73] Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network

链接: https://arxiv.org/abs/2408.16169
作者: Duncan Taylor,Melissa Humphries
关键词-EN: DNA profiles, DNA profile electrophoretic, DNA, electrophoretic signal measuring, signal measuring fluorescence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 9 Figures

点击查看摘要

Abstract:DNA profiles are made up from multiple series of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts ‘read’ DNA profiles using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network, ANN, to carry out the task of classifying fluorescence types into categories in DNA profile electrophoretic signal. But the creation of the necessarily large amount of labelled training data for the ANN is time consuming and expensive, and a limiting factor in the ability to robustly train the ANN. If realistic, prelabelled, training data could be simulated then this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task. With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information, and then use the generator from the GAN as a ‘realism filter’ that applies the noise and artefact elements exhibited in typical electrophoretic signal.

[LG-74] LeMON: Learning to Learn Multi-Operator Networks

链接: https://arxiv.org/abs/2408.16168
作者: Jingmin Sun,Zecheng Zhang,Hayden Schaeffer
关键词-EN: Single-operator learning involves, Single-operator learning, multi-operator learning, operator embedding structure, deep neural network
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Single-operator learning involves training a deep neural network to learn a specific operator, whereas recent work in multi-operator learning uses an operator embedding structure to train a single neural network on data from multiple operators. Thus, multi-operator learning is capable of predicting a range of operators within one model. In this work, we propose pretraining and fine-tuning strategies for solving PDEs using multi-operator learning. One key aspect is that by increasing the number of families of operators used in pretraining, a PDE foundation model can be fine-tuned to downstream tasks involving new PDEs with a limited number of samples, thus outperforming single operator neural networks. Specifically, a multi-operator learning model pre-trained with data from diverse PDE families can predict unseen operators after fine-tuning with only a limited number of operators from the new family, enabling it to serve as a data-free PDE solver. We also show that the proposed training and fine-tuning method is able to predict new operators in zero-shot prediction without samples. Additionally, we introduce a PDE-agnostic meta-learning algorithm to improve the adaptability of the model to various PDEs by providing a better parameter initialization process. To address the needs of applications with limited computing resources, we explore low-rank adaptation methods that reduce computational costs while enhancing solver accuracy. Lastly, by examining the scaling law with respect to the number of operator families, we establish and highlight its potential for broad adaptation in PDE-solving tasks.

[LG-75] Free Lunch in the Forest: Functionally-Identical Pruning of Boosted Tree Ensembles

链接: https://arxiv.org/abs/2408.16167
作者: Youssouf Emine,Alexandre Forel,Idriss Malek,Thibaut Vidal
关键词-EN: including boosting methods, including boosting, tabular data, boosting methods, ensembles
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Tree ensembles, including boosting methods, are highly effective and widely used for tabular data. However, large ensembles lack interpretability and require longer inference times. We introduce a method to prune a tree ensemble into a reduced version that is “functionally identical” to the original model. In other words, our method guarantees that the prediction function stays unchanged for any possible input. As a consequence, this pruning algorithm is lossless for any aggregated metric. We formalize the problem of functionally identical pruning on ensembles, introduce an exact optimization model, and provide a fast yet highly effective method to prune large ensembles. Our algorithm iteratively prunes considering a finite set of points, which is incrementally augmented using an adversarial model. In multiple computational experiments, we show that our approach is a “free lunch”, significantly reducing the ensemble size without altering the model’s behavior. Thus, we can preserve state-of-the-art performance at a fraction of the original model’s size.
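The core idea (drop trees whose removal leaves the prediction function unchanged on a finite, adversarially grown point set) can be shown on a toy additive ensemble of stumps. The greedy single/pair search below is a simplification of the paper's exact optimization model, and the grid plays the role of its adversarially augmented point set:

```python
import itertools

# Toy "boosted ensemble": the prediction is the sum of stump outputs.
def stump(feature, threshold, left, right):
    return lambda x: left if x[feature] <= threshold else right

trees = [
    stump(0, 0.5, -1.0, 2.0),
    stump(1, 0.3, 0.5, -0.5),
    stump(1, 0.3, -0.5, 0.5),   # exactly cancels the previous stump
    stump(0, 0.9, 0.0, 0.0),    # contributes nothing anywhere
]

def predict(ensemble, x):
    return sum(t(x) for t in ensemble)

def prune_identical(ensemble, points):
    """Greedily drop single trees or pairs whose removal leaves the
    prediction unchanged on every point in `points`."""
    kept = list(ensemble)
    changed = True
    while changed:
        changed = False
        for r in (1, 2):
            for combo in itertools.combinations(range(len(kept)), r):
                trial = [t for i, t in enumerate(kept) if i not in combo]
                if all(abs(predict(trial, p) - predict(kept, p)) < 1e-12
                       for p in points):
                    kept = trial
                    changed = True
                    break
            if changed:
                break
    return kept

grid = [(a / 4, b / 4) for a in range(5) for b in range(5)]
pruned = prune_identical(trees, grid)
print(len(trees), "->", len(pruned))  # 4 -> 1
```

Here the zero stump and the canceling pair are removed, yet the pruned ensemble predicts identically on every grid point, which is what makes the pruning lossless for any aggregated metric.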

[LG-76] CLPNets: Coupled Lie-Poisson Neural Networks for Multi-Part Hamiltonian Systems with Symmetries

链接: https://arxiv.org/abs/2408.16160
作者: Christopher Eldred,François Gay-Balmaz,Vakhtang Putkaradze
关键词-EN: accurately compute data-based, compute data-based prediction, prediction of Hamiltonian, Hamiltonian systems, equations over time
类目: Machine Learning (cs.LG)
*备注: 52 pages, 9 figures

点击查看摘要

Abstract:To accurately compute data-based prediction of Hamiltonian systems, especially the long-term evolution of such systems, it is essential to utilize methods that preserve the structure of the equations over time. We consider a case that is particularly challenging for data-based methods: systems with interacting parts that do not reduce to pure momentum evolution. Such systems are essential in scientific computations. For example, any discretization of a continuum elastic rod can be viewed as interacting elements that can move and rotate in space, with each discrete element moving on the group of rotations and translations SE(3). We develop a novel method of data-based computation and complete phase space learning of such systems. We follow the original framework of SympNets (Jin et al., 2020) building the neural network from canonical phase space mappings, and transformations that preserve the Lie-Poisson structure (LPNets) as in (Eldred et al., 2024). We derive a novel system of mappings that are built into neural networks for coupled systems. We call such networks Coupled Lie-Poisson Neural Networks, or CLPNets. We consider increasingly complex examples for the applications of CLPNets: rotation of two rigid bodies about a common axis, the free rotation of two rigid bodies, and finally the evolution of two connected and interacting SE(3) components. Our method preserves all Casimir invariants of each system to machine precision, irrespective of the quality of the training data, and preserves energy to high accuracy. Our method also shows good resistance to the curse of dimensionality, requiring only a few thousand data points for all cases studied, with the effective dimension varying from three to eighteen. Additionally, the method is highly economical in memory requirements, requiring only about 200 parameters for the most complex case considered.

[LG-77] Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? ECCV2024

链接: https://arxiv.org/abs/2408.16154
作者: Dilermando Queiroz,Anderson Carlos,Maíra Fatoretto,André Anjos,Lilian Berton,Luis Filipe Nakayama
关键词-EN: Foundation model, diverse domains, emerged as robust, label efficiency, efficiency in diverse
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint of paper to be presented at Fairness and Ethics Towards Transparent AI: Facing the Challenge through Model Debiasing (FAILED) during ECCV 2024

点击查看摘要

Abstract:Foundation models have emerged as robust models with label efficiency in diverse domains. In medical imaging, these models contribute to the advancement of medical diagnoses due to the difficulty in obtaining labeled data. However, it is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model. This research examines the bias in the Foundation model (RetFound) when it is applied to fine-tune the Brazilian Multilabel Ophthalmological Dataset (BRSET), which has a different population than the pre-training dataset. The model evaluation, in comparison with supervised learning, shows that the Foundation Model has the potential to reduce the gap between the maximum AUC and minimum AUC evaluations across gender and age groups. However, in a data-efficient generalization, the model increases the bias when the data amount decreases. These findings suggest that when deploying a Foundation Model in real-life scenarios with limited data, the possibility of fairness issues should be considered.
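The AUC-gap fairness measure discussed above (maximum minus minimum AUC across demographic groups) is straightforward to compute directly; the groups and scores below are synthetic, not from BRSET:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic (ties counted as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_gap(groups):
    """Max minus min AUC across demographic groups: the fairness gap."""
    aucs = {g: auc(l, s) for g, (l, s) in groups.items()}
    return max(aucs.values()) - min(aucs.values()), aucs

groups = {
    "female": ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),   # perfectly ranked
    "male":   ([1, 1, 0, 0], [0.9, 0.3, 0.6, 0.1]),   # one inversion
}
gap, per_group = auc_gap(groups)
print(per_group, round(gap, 2))  # gap of 0.25 between the groups
```

A model that narrows this gap without degrading either group's AUC is what the abstract means by the Foundation Model "reducing the gap between the maximum AUC and minimum AUC evaluations".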

[LG-78] Improving the Prediction of Individual Engagement in Recommendations Using Cognitive Models

链接: https://arxiv.org/abs/2408.16147
作者: Roderick Seow,Yunfan Zhao,Duncan Wood,Milind Tambe,Cleotilde Gonzalez
关键词-EN: public health programs, limited resources, behaviors change, crucial for deciding, maternal health program
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:For public health programs with limited resources, the ability to predict how behaviors change over time and in response to interventions is crucial for deciding when and to whom interventions should be allocated. Using data from a real-world maternal health program, we demonstrate how a cognitive model based on Instance-Based Learning (IBL) Theory can augment existing purely computational approaches. Our findings show that, compared to general time-series forecasters (e.g., LSTMs), IBL models, which reflect human decision-making processes, better predict the dynamics of individuals’ states. Additionally, IBL provides estimates of the volatility in individuals’ states and their sensitivity to interventions, which can improve the efficiency of training of other time series models.
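IBL predictions rest on ACT-R-style activation and blending: recent, frequent experiences get high activation, and the predicted value blends remembered outcomes by a softmax over activations. A noise-free sketch follows; the decay, temperature, and engagement instances are hypothetical, not taken from the maternal-health data:

```python
import math

def activation(timestamps, now, decay=0.5):
    """ACT-R style activation: more recent and more frequent observations
    yield higher activation (the usual noise term is omitted for clarity)."""
    return math.log(sum((now - t) ** (-decay) for t in timestamps))

def blended_value(instances, now, temperature=0.25):
    """IBL blending: outcomes weighted by a softmax of instance activations."""
    acts = [activation(ts, now) for _, ts in instances]
    m = max(acts)
    weights = [math.exp((a - m) / temperature) for a in acts]
    z = sum(weights)
    return sum(w / z * outcome for w, (outcome, _) in zip(weights, instances))

# Two remembered engagement outcomes: a recent low engagement (0.2) and an
# older high engagement (0.9); timestamps are in arbitrary time units.
instances = [(0.2, [9, 10]), (0.9, [1, 2])]
v = blended_value(instances, now=11)
print(round(v, 3))  # close to 0.2: the recent instance dominates
```

The recency-driven weighting is what lets IBL models track the volatility in individuals' states that general time-series forecasters tend to smooth over.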

[LG-79] Thinner Latent Spaces: Detecting dimension and imposing invariance through autoencoder gradient constraints

链接: https://arxiv.org/abs/2408.16138
作者: George A. Kevrekidis,Mauro Maggioni,Soledad Villar,Yannis G. Kevrekidis
关键词-EN: achieving disentangled representations, Conformal Autoencoders, neural network architecture, imposes orthogonality conditions, architecture that imposes
类目: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal Autoencoders are a neural network architecture that imposes orthogonality conditions between the gradients of latent variables towards achieving disentangled representations of data. In this letter we show that orthogonality relations within the latent layer of the network can be leveraged to infer the intrinsic dimensionality of nonlinear manifold data sets (locally characterized by the dimension of their tangent space), while simultaneously computing encoding and decoding (embedding) maps. We outline the relevant theory relying on differential geometry, and describe the corresponding gradient-descent optimization algorithm. The method is applied to standard data sets and we highlight its applicability, advantages, and shortcomings. In addition, we demonstrate that the same computational technology can be used to build coordinate invariance to local group actions when defined only on a (reduced) submanifold of the embedding space.
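The orthogonality condition between gradients of latent variables can be checked numerically. The finite-difference penalty below is a sketch of the idea (zero exactly when latent gradients are mutually orthogonal), not the paper's actual training loss; the two toy encoders are invented:

```python
import numpy as np

def latent_gradients(encoder, x, eps=1e-5):
    """Finite-difference gradients of each latent coordinate w.r.t. x.
    Row i holds the gradient of latent i."""
    z0 = encoder(x)
    grads = np.zeros((len(z0), len(x)))
    for j in range(len(x)):
        xp = x.copy()
        xp[j] += eps
        grads[:, j] = (encoder(xp) - z0) / eps
    return grads

def orthogonality_penalty(encoder, x):
    """Sum of squared inner products between gradients of distinct latents:
    zero when the latent coordinates have orthogonal (conformal) gradients."""
    g = latent_gradients(encoder, x)
    gram = g @ g.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

x = np.array([0.3, -0.7])
orth = lambda v: np.array([v[0] + v[1], v[0] - v[1]])  # orthogonal gradients
skew = lambda v: np.array([v[0] + v[1], v[0]])          # non-orthogonal
print(orthogonality_penalty(orth, x), orthogonality_penalty(skew, x))
```

In training, this penalty would be applied through automatic differentiation rather than finite differences; the finite-difference form just makes the condition concrete.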

[LG-80] Using Backbone Foundation Model for Evaluating Fairness in Chest Radiography Without Demographic Data MICCAI2024

链接: https://arxiv.org/abs/2408.16130
作者: Dilermando Queiroz,André Anjos,Lilian Berton
关键词-EN: Ensuring consistent performance, Ensuring consistent, machine learning models, advancing medical image, diverse populations
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint of paper to be presented at Fairness of AI in Medical Imaging (FAIMI) during MICCAI 2024

点击查看摘要

Abstract:Ensuring consistent performance across diverse populations and incorporating fairness into machine learning models are crucial for advancing medical image diagnostics and promoting equitable healthcare. However, many databases do not provide protected attributes or contain unbalanced representations of demographic groups, complicating the evaluation of model performance across different demographics and the application of bias mitigation techniques that rely on these attributes. This study aims to investigate the effectiveness of using the backbone of Foundation Models as an embedding extractor for creating groups that represent protected attributes, such as gender and age. We propose utilizing these groups in different stages of bias mitigation, including pre-processing, in-processing, and evaluation. Using databases in in-distribution and out-of-distribution scenarios, we find that the method can create groups that represent gender in both databases, reducing the performance difference across gender groups by 4.44% in-distribution and by 6.16% out-of-distribution. However, the model lacks robustness in handling age attributes, underscoring the need for more fundamentally fair and robust Foundation models. These findings suggest a role in promoting fairness assessment in scenarios where we lack knowledge of attributes, contributing to the development of more equitable medical diagnostics.

[LG-81] Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation Optimization and Evaluation INTERSPEECH2024

链接: https://arxiv.org/abs/2408.16126
作者: Ke Chen,Jiaqi Su,Taylor Berg-Kirkpatrick,Shlomo Dubnov,Zeyu Jin
关键词-EN: Achieving robust speech, Achieving robust, open challenge, robust speech separation, overlapping speakers
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: In Proceedings of the 25th Annual Conference of the International Speech Communication Association, Interspeech 2024

点击查看摘要

Abstract:Achieving robust speech separation for overlapping speakers in various acoustic environments with noise and reverberation remains an open challenge. Although existing datasets are available to train separators for specific scenarios, they do not effectively generalize across diverse real-world scenarios. In this paper, we present a novel data simulation pipeline that produces diverse training data from a range of acoustic environments and content, and propose new training paradigms to improve quality of a general speech separation model. Specifically, we first introduce AC-SIM, a data simulation pipeline that incorporates broad variations in both content and acoustics. Then we integrate multiple training objectives into the permutation invariant training (PIT) to enhance separation quality and generalization of the trained model. Finally, we conduct comprehensive objective and human listening experiments across separation architectures and benchmarks to validate our methods, demonstrating substantial improvement of generalization on both non-homologous and real-world test sets.
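The permutation invariant training (PIT) objective the paper extends can be written in a few lines: take the minimum MSE over all alignments of estimated sources to reference sources, so the model is not penalized for emitting the correct sources in a different output order. The toy signals below are illustrative:

```python
import itertools
import numpy as np

def pit_mse(estimates, references):
    """Permutation-invariant training loss: minimum mean-squared error
    over all alignments of estimated to reference sources."""
    n = len(references)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        mse = np.mean([np.mean((estimates[p] - references[i]) ** 2)
                       for i, p in enumerate(perm)])
        best = min(best, mse)
    return best

t = np.linspace(0, 1, 200)
s1, s2 = np.sin(2 * np.pi * 4 * t), np.sign(np.sin(2 * np.pi * 2 * t))
refs = np.stack([s1, s2])
swapped = np.stack([s2, s1])  # correct sources, wrong output order
print(pit_mse(swapped, refs))  # 0.0: PIT ignores the output ordering
```

The paper's contribution sits on top of this: integrating multiple training objectives into the PIT framework rather than replacing the permutation search itself.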

[LG-82] ChartEye: A Deep Learning Framework for Chart Information Extraction

链接: https://arxiv.org/abs/2408.16123
作者: Osama Mustafa,Muhammad Khizer Ali,Momina Moetesum,Imran Siddiqi
关键词-EN: inspired recent research, automated chart understanding, data visualization, domains has inspired, inspired recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 Pages, and 11 Figures

点击查看摘要

Abstract:The widespread use of charts and infographics as a means of data visualization in various domains has inspired recent research in automated chart understanding. However, information extraction from chart images is a complex multi-task process due to style variations and, as a consequence, it is challenging to design an end-to-end system. In this study, we propose a deep learning-based framework that provides a solution for key steps in the chart information extraction pipeline. The proposed framework utilizes hierarchical vision transformers for the tasks of chart-type and text-role classification, and YOLOv7 for text detection. The detected text is then enhanced using Super Resolution Generative Adversarial Networks to improve the recognition output of the OCR. Experimental results on a benchmark dataset show that our proposed framework achieves excellent performance at every stage with F1-scores of 0.97 for chart-type classification, 0.91 for text-role classification, and a mean Average Precision of 0.95 for text detection.

[LG-83] Variational Mode Decomposition and Linear Embeddings are What You Need For Time-Series Forecasting

链接: https://arxiv.org/abs/2408.16122
作者: Hafizh Raihan Kurnia Putra,Novanto Yudistira,Tirana Noor Fatyanosa
关键词-EN: faces challenges due, Variational Mode Decomposition, VMD, inaccurate predictions, faces challenges
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: For associated repository, see this https URL

点击查看摘要

Abstract:Time-series forecasting often faces challenges due to data volatility, which can lead to inaccurate predictions. Variational Mode Decomposition (VMD) has emerged as a promising technique to mitigate volatility by decomposing data into distinct modes, thereby enhancing forecast accuracy. In this study, we integrate VMD with linear models to develop a robust forecasting framework. Our approach is evaluated on 13 diverse datasets, including ETTm2, WindTurbine, M4, and 10 air quality datasets from various Southeast Asian cities. The effectiveness of the VMD strategy is assessed by comparing Root Mean Squared Error (RMSE) values from models utilizing VMD against those without it. Additionally, we benchmark linear-based models against well-known neural network architectures such as LSTM, BLSTM, and RNN. The results demonstrate a significant reduction in RMSE across nearly all models following VMD application. Notably, the Linear + VMD model achieved the lowest average RMSE in univariate forecasting at 0.619. In multivariate forecasting, the DLinear + VMD model consistently outperformed others, attaining the lowest RMSE across all datasets with an average of 0.019. These findings underscore the effectiveness of combining VMD with linear models for superior time-series forecasting.

[LG-84] RAIN: Reinforcement Algorithms for Improving Numerical Weather and Climate Models

链接: https://arxiv.org/abs/2408.16118
作者: Pritthijit Nath,Henry Moss,Emily Shuckburgh,Mark Webb
关键词-EN: study explores integrating, explores integrating reinforcement, integrating reinforcement learning, key parameterisation challenges, address key parameterisation
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 24 pages, 9 figures

点击查看摘要

Abstract:This study explores integrating reinforcement learning (RL) with idealised climate models to address key parameterisation challenges in climate science. Current climate models rely on complex mathematical parameterisations to represent sub-grid scale processes, which can introduce substantial uncertainties. RL offers capabilities to enhance these parameterisation schemes, including direct interaction, handling sparse or delayed feedback, continuous online learning, and long-term optimisation. We evaluate the performance of eight RL algorithms on two idealised environments: one for temperature bias correction, another for radiative-convective equilibrium (RCE) imitating real-world computational constraints. Results show different RL approaches excel in different climate scenarios with exploration algorithms performing better in bias correction, while exploitation algorithms proving more effective for RCE. These findings support the potential of RL-based parameterisation schemes to be integrated into global climate models, improving accuracy and efficiency in capturing complex climate dynamics. Overall, this work represents an important first step towards leveraging RL to enhance climate model accuracy, critical for improving climate understanding and predictions. Code accessible at this https URL.

[LG-85] Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations

链接: https://arxiv.org/abs/2408.16115
作者: Richard Bergna,Sergio Calvo-Ordoñez,Felix L. Opolka,Pietro Liò,Jose Miguel Hernandez-Lobato
关键词-EN: Ordinary Differential Equations, Stochastic Differential Equations, Graph Neural Ordinary, Neural Ordinary Differential, Differential Equations
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under review. 9 pages including appendix

点击查看摘要

Abstract:We address the problem of learning uncertainty-aware representations for graph-structured data. While Graph Neural Ordinary Differential Equations (GNODE) are effective in learning node representations, they fail to quantify uncertainty. To address this, we introduce Latent Graph Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by embedding randomness through Brownian motion to quantify uncertainty. We provide theoretical guarantees for LGNSDE and empirically show better performance in uncertainty quantification.
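The Brownian-motion mechanism behind LGNSDE can be illustrated with a plain Euler-Maruyama simulation of a latent SDE: an ensemble of sample paths whose spread at the end time quantifies uncertainty. The drift, diffusion, and dimensions below are invented, not the paper's graph-structured model:

```python
import numpy as np

def euler_maruyama(drift, diffusion, z0, t_end=1.0, steps=100, n_paths=200, seed=0):
    """Simulate dz = drift(z) dt + diffusion(z) dW for n_paths sample paths."""
    rng = np.random.default_rng(seed)
    dt = t_end / steps
    z = np.tile(z0, (n_paths, 1)).astype(float)
    for _ in range(steps):
        dw = rng.normal(scale=np.sqrt(dt), size=z.shape)  # Brownian increment
        z = z + drift(z) * dt + diffusion(z) * dw
    return z

# Ornstein-Uhlenbeck-style latent dynamics: pull toward 0, constant noise.
zT = euler_maruyama(lambda z: -z, lambda z: 0.3 * np.ones_like(z),
                    z0=np.array([1.0, -1.0]))
print(zT.mean(axis=0), zT.std(axis=0))  # mean shrinks toward 0; spread = uncertainty
```

Dropping the diffusion term recovers a deterministic neural-ODE-style trajectory, which is exactly the GNODE behavior the abstract says cannot quantify uncertainty.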

[LG-86] Negative Binomial Matrix Completion

链接: https://arxiv.org/abs/2408.16113
作者: Yu Lu,Kevin Bui,Roummel F. Marcia
关键词-EN: Poisson matrix completion, information in matrices, Matrix completion, Matrix completion focuses, focuses on recovering
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: 6 pages, Accepted by the IEEE International Workshop on Machine Learning for Signal Processing (MLSP)

点击查看摘要

Abstract:Matrix completion focuses on recovering missing or incomplete information in matrices. This problem arises in various applications, including image processing and network analysis. Previous research proposed Poisson matrix completion for count data with noise that follows a Poisson distribution, which assumes that the mean and variance are equal. Since overdispersed count data, whose variance is greater than the mean, is more likely to occur in realistic settings, we assume that the noise follows the negative binomial (NB) distribution, which can be more general than the Poisson distribution. In this paper, we introduce NB matrix completion by proposing a nuclear-norm regularized model that can be solved by proximal gradient descent. In our experiments, we demonstrate that the NB model outperforms Poisson matrix completion in various noise and missing data settings on real data.
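The proximal gradient solver mentioned above hinges on the proximal operator of the nuclear norm, i.e. singular value thresholding (SVT); the NB likelihood enters only through the gradient step. A minimal sketch of the SVT step (the rank-2 test matrix and the threshold tau are illustrative):

```python
import numpy as np

def svt(matrix, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the
    singular values by tau."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return u @ np.diag(s_shrunk) @ vt, s_shrunk

rng = np.random.default_rng(1)
low_rank = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 8))  # rank-2 signal
noisy = low_rank + 0.05 * rng.normal(size=(8, 8))
recovered, sv = svt(noisy, tau=0.5)
print(int(np.sum(sv > 1e-9)))  # small (noise-level) singular values are zeroed
```

In the full algorithm, each iteration alternates a gradient step on the negative NB log-likelihood of the observed entries with this SVT step, which is what promotes a low-rank completion.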

[LG-87] EPO: Hierarchical LLM Agents with Environment Preference Optimization

链接: https://arxiv.org/abs/2408.16090
作者: Qi Zhao,Haotian Fu,Chen Sun,George Konidaris
关键词-EN: tasks present significant, present significant challenges, multiple steps, decision-making tasks present, present significant
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment’s feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.

[LG-88] Ensuring Equitable Financial Decisions: Leveraging Counterfactual Fairness and Deep Learning for Bias

Link: https://arxiv.org/abs/2408.16088
Authors: Saish Shinde
Keywords-EN: machine learning models, recent years due, learning models, machine learning, raised in recent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 8 pages, 7 figures

Click to view abstract

Abstract:Concerns regarding fairness and bias have been raised in recent years due to the growing use of machine learning models in crucial decision-making processes, especially when it comes to delicate characteristics like gender. In order to address biases in machine learning models, this research paper investigates advanced bias mitigation techniques, with a particular focus on counterfactual fairness in conjunction with data augmentation. The study looks into how these integrated approaches can lessen gender bias in the financial industry, specifically in loan approval procedures. We show that these approaches are effective in achieving more equitable results through thorough testing and assessment on a skewed financial dataset. The findings emphasize how crucial it is to use fairness-aware techniques when creating machine learning models in order to guarantee morally righteous and impartial decision-making.

[LG-89] Scaling Up Diffusion and Flow-based XGBoost Models ICML2024

Link: https://arxiv.org/abs/2408.16046
Authors: Jesse C. Cresswell,Taewoo Kim
Keywords-EN: tabular data generation, machine learning methods, tabular data, machine learning, developed on small
Subjects: Machine Learning (cs.LG)
*Comments: Presented at ICML 2024 Workshop on AI for Science

Click to view abstract

Abstract:Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at this https URL.

[LG-90] Fairness Accuracy and Unreliable Data

Link: https://arxiv.org/abs/2408.16040
Authors: Kevin Stangl
Keywords-EN: strategic classification, algorithmic robustness, machine learning, investigates three areas, areas targeted
Subjects: Machine Learning (cs.LG)
*Comments: PhD thesis

Click to view abstract

Abstract:This thesis investigates three areas targeted at improving the reliability of machine learning: fairness in machine learning, strategic classification, and algorithmic robustness. Each of these domains has special properties or structure that can complicate learning. A theme throughout this thesis is thinking about ways in which a 'plain' empirical risk minimization algorithm will be misleading or ineffective because of a mismatch between classical learning theory assumptions and specific properties of some data distribution in the wild. Theoretical understanding in each of these domains can help guide best practices and allow for the design of effective, reliable, and robust systems.

[LG-91] Systematic Evaluation of Synthetic Data Augmentation for Multi-class NetFlow Traffic KDD ECML

Link: https://arxiv.org/abs/2408.16034
Authors: Maximilian Wolf,Dieter Landes,Andreas Hotho,Daniel Schlör
Keywords-EN: ongoing research challenge, research challenge, cyber-attacks in computer, crucial and ongoing, ongoing research
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 16 pages, 3 figures, submitted to Machine Learning for CyberSecurity @ ECML PKDD

Click to view abstract

Abstract:The detection of cyber-attacks in computer networks is a crucial and ongoing research challenge. Machine learning-based attack classification offers a promising solution, as these models can be continuously updated with new data, enhancing the effectiveness of network intrusion detection systems (NIDS). Unlike binary classification models that simply indicate the presence of an attack, multi-class models can identify specific types of attacks, allowing for more targeted and effective incident responses. However, a significant drawback of these classification models is their sensitivity to imbalanced training data. Recent advances suggest that generative models can assist in data augmentation, claiming to offer superior solutions for imbalanced datasets. Classical balancing methods, although less novel, also provide potential remedies for this issue. Despite these claims, a comprehensive comparison of these methods within the NIDS domain is lacking. Most existing studies focus narrowly on individual methods, making it difficult to compare results due to varying experimental setups. To close this gap, we designed a systematic framework to compare classical and generative resampling methods for class balancing across multiple popular classification models in the NIDS domain, evaluated on several NIDS benchmark datasets. Our experiments indicate that resampling methods for balancing training data do not reliably improve classification performance. Although some instances show performance improvements, the majority of results indicate decreased performance, with no consistent trend in favor of a specific resampling technique enhancing a particular classifier.
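As a concrete instance of the "classical balancing methods" compared here, random oversampling simply replicates minority-class rows until every class matches the majority count. A hypothetical minimal NumPy version (not any of the studied implementations):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples (with replacement) until all
    classes reach the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for c, n in zip(classes, counts):
        if n < target:
            idx = rng.choice(np.flatnonzero(y == c), size=target - n, replace=True)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.concatenate(Xs), np.concatenate(ys)
```

Generative resampling replaces the `rng.choice` duplication step with samples drawn from a model fitted to the minority class; the paper's finding is that neither variant reliably helps downstream classifiers.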

[LG-92] An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

Link: https://arxiv.org/abs/2408.16032
Authors: Shuang Feng,Grace Feng
Keywords-EN: understanding webpage contexts, enabled understanding webpage, Recent advancements, large language models, webpage contexts
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity – a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO) as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates the RL agents trained using generative trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments demonstrate that agents trained on generated trajectories exhibited comparable task performance to those trained using human trajectories. This has demonstrated an example of an extremely low-cost data-efficient way of training reinforcement learning agents. Also, with limited training time (2 hours), without utilizing any images, a DPO agent achieved a 19% success rate after approximately 3000 steps or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate.
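The DPO objective used above reduces to a logistic loss on the margin between policy and reference log-probabilities of the chosen versus rejected response. A schematic NumPy version (the summed per-response token log-probabilities are assumed to be precomputed; this is the textbook form, not this project's code):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss (Rafailov et al., 2023): -log sigmoid of the scaled
    preference margin, computed stably as log(1 + exp(-margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return np.logaddexp(0.0, -margin)
```

When the policy assigns the chosen response a much higher relative log-probability than the rejected one, the margin is large and the loss approaches zero; with no preference the loss is log 2.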

[LG-93] EMP: Enhance Memory in Data Pruning

Link: https://arxiv.org/abs/2408.16031
Authors: Jinying Xiao,Ping Li,Jie Nie,Zhe Tang
Keywords-EN: shown strong performance, fine-tuning costs, research has shifted, memory, shown strong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning. Previous methods used sample loss as an evaluation criterion, aiming to select the most “difficult” samples for training. However, when the pruning rate increases, the number of times each sample is trained becomes more evenly distributed, which causes many critical or general samples to not be effectively fitted. We refer to this as Low-Frequency Learning (LFL). In other words, LFL prevents the model from remembering most samples. In our work, we decompose the scoring function of LFL, provide a theoretical explanation for the inefficiency of LFL, and propose adding a memory term to the scoring function to enhance the model’s memory capability, along with an approximation of this memory term. Similarly, we explore memory in Self-Supervised Learning (SSL), marking the first discussion on SSL memory. Using contrastive learning, we derive the memory term both theoretically and experimentally. Finally, we propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model’s memory of data, thereby improving its performance. We evaluated the performance of EMP in tasks such as image classification, natural language understanding, and model pre-training. The results show that EMP can improve model performance under extreme pruning rates. For example, in the CIFAR100-ResNet50 pre-training task, with 70% pruning, EMP outperforms current methods by 2.2%.

[LG-94] A Deep Learning Approach to Localizing Multi-level Airway Collapse Based on Snoring Sounds

Link: https://arxiv.org/abs/2408.16030
Authors: Ying-Chieh Hsu,Stanley Yung-Chuan Liu,Chao-Jung Huang,Chi-Wei Wu,Ren-Kai Cheng,Jane Yung-Jen Hsu,Shang-Ran Huang,Yuan-Ren Cheng,Fu-Shun Hsu
Keywords-EN: obstructive sleep apnea, drug-induced sleep endoscopy, classify snoring sounds, snoring sounds excited, Support Vector Machine
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:This study investigates the application of machine/deep learning to classify snoring sounds excited at different levels of the upper airway in patients with obstructive sleep apnea (OSA) using data from drug-induced sleep endoscopy (DISE). The snoring sounds of 39 subjects were analyzed and labeled according to the Velum, Oropharynx, Tongue Base, and Epiglottis (VOTE) classification system. The dataset, comprising 5,173 one-second segments, was used to train and test models, including Support Vector Machine (SVM), Bidirectional Long Short-Term Memory (BiLSTM), and ResNet-50. The ResNet-50, a convolutional neural network (CNN), showed the best overall performance in classifying snoring acoustics, particularly in identifying multi-level obstructions. The study emphasizes the potential of integrating snoring acoustics with deep learning to improve the diagnosis and treatment of OSA. However, challenges such as limited sample size, data imbalance, and differences between pharmacologically induced and natural snoring sounds were noted, suggesting further research to enhance model accuracy and generalizability.

[LG-95] Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis

Link: https://arxiv.org/abs/2408.16029
Authors: Sijie Mai,Yu Zhao,Ying Zeng,Jianhua Yao,Haifeng Hu
Keywords-EN: sentiment analysis aims, effectively integrate information, Multimodal sentiment analysis, Multimodal, unimodal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Multimodal sentiment analysis aims to effectively integrate information from various sources to infer sentiment, where in many cases there are no annotations for unimodal labels. Therefore, most works rely on multimodal labels for training. However, there exists the noisy label problem for the learning of unimodal signals as multimodal annotations are not always the ideal substitutes for the unimodal ones, failing to achieve finer optimization for individual modalities. In this paper, we explore the learning of unimodal labels under the weak supervision from the annotated multimodal labels. Specifically, we propose a novel meta uni-label generation (MUG) framework to address the above problem, which leverages the available multimodal labels to learn the corresponding unimodal labels by the meta uni-label correction network (MUCN). We first design a contrastive-based projection module to bridge the gap between unimodal and multimodal representations, so as to use multimodal annotations to guide the learning of MUCN. Afterwards, we propose unimodal and multimodal denoising tasks to train MUCN with explicit supervision via a bi-level optimization strategy. We then jointly train unimodal and multimodal learning tasks to extract discriminative unimodal features for multimodal inference. Experimental results suggest that MUG outperforms competitive baselines and can learn accurate unimodal labels.

[LG-96] ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data

Link: https://arxiv.org/abs/2408.16028
Authors: Weizhou Wang,Eric Liu,Xiangyu Guo,David Lie
Keywords-EN: Supervised learning-based software, fall short due, Supervised learning-based, Large Language Models, Large Language
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:

Click to view abstract

Abstract:Supervised learning-based software vulnerability detectors often fall short due to the inadequate availability of labelled training data. In contrast, Large Language Models (LLMs) such as GPT-4, are not trained on labelled data, but when prompted to detect vulnerabilities, LLM prediction accuracy is only marginally better than random guessing. In this paper, we explore a different approach by reframing vulnerability detection as one of anomaly detection. Since the vast majority of code does not contain vulnerabilities and LLMs are trained on massive amounts of such code, vulnerable code can be viewed as an anomaly from the LLM’s predicted code distribution, freeing the model from the need for labelled data to provide a learnable representation of vulnerable code. Leveraging this perspective, we demonstrate that LLMs trained for code generation exhibit a significant gap in prediction accuracy when prompted to reconstruct vulnerable versus non-vulnerable code. Using this insight, we implement ANVIL, a detector that identifies software vulnerabilities at line-level granularity. Our experiments explore the discriminating power of different anomaly scoring methods, as well as the sensitivity of ANVIL to context size. We also study the effectiveness of ANVIL on various LLM families, and conduct leakage experiments on vulnerabilities that were discovered after the knowledge cutoff of our evaluated LLMs. On a collection of vulnerabilities from the Magma benchmark, ANVIL outperforms state-of-the-art line-level vulnerability detectors, LineVul and LineVD, which have been trained with labelled data, despite ANVIL having never been trained with labelled vulnerabilities. Specifically, our approach achieves 1.62x to 2.18x better Top-5 accuracies and 1.02x to 1.29x better ROC scores on line-level vulnerability detection tasks.
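The core scoring idea, that vulnerable code is an anomaly under the LLM's predicted distribution, can be sketched by ranking lines by how surprised the model is by their tokens. The scoring function below is an illustrative assumption (mean negative log-likelihood); ANVIL itself compares several anomaly scoring variants:

```python
import numpy as np

def line_anomaly_scores(token_logprobs_per_line):
    """Score each source line by the mean negative log-likelihood a code
    LLM assigns to its tokens; higher = more surprising = more anomalous."""
    return [float(-np.mean(lp)) for lp in token_logprobs_per_line]

def rank_lines(scores, top_k=5):
    """Indices of the top-k most anomalous lines."""
    return list(np.argsort(scores)[::-1][:top_k])
```

Feeding these scores into a Top-5 ranking is exactly how line-level detectors such as those in the evaluation are usually compared.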

[LG-97] Toward Time-Continuous Data Inference in Sparse Urban CrowdSensing

Link: https://arxiv.org/abs/2408.16027
Authors: Ziyu Sun,Haoyang Su,Hanqi Sun,En Wang,Wenbin Liu
Keywords-EN: Mobile Crowd Sensing, leverages mobile users, Mobile Crowd, smart portable devices, leverages mobile
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*Comments: 11 pages, 11 figures

Click to view abstract

Abstract:Mobile Crowd Sensing (MCS) is a promising paradigm that leverages mobile users and their smart portable devices to perform various real-world tasks. However, due to budget constraints and the inaccessibility of certain areas, Sparse MCS has emerged as a more practical alternative, collecting data from a limited number of target subareas and utilizing inference algorithms to complete the full sensing map. While existing approaches typically assume a time-discrete setting with data remaining constant within each sensing cycle, this simplification can introduce significant errors, especially when dealing with long cycles, as real-world sensing data often changes continuously. In this paper, we go from fine-grained completion, i.e., the subdivision of sensing cycles into minimal time units, towards a more accurate, time-continuous completion. We first introduce Deep Matrix Factorization (DMF) as a neural network-enabled framework and enhance it with a Recurrent Neural Network (RNN-DMF) to capture temporal correlations in these finer time slices. To further deal with the continuous data, we propose TIME-DMF, which captures temporal information across unequal intervals, enabling time-continuous completion. Additionally, we present the Query-Generate (Q-G) strategy within TIME-DMF to model the infinite states of continuous data. Extensive experiments across five types of sensing tasks demonstrate the effectiveness of our models and the advantages of time-continuous completion.
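The starting point of the DMF family is plain matrix factorization fitted only on observed entries, which DMF deepens with neural layers and RNN-DMF/TIME-DMF extend with temporal structure. A shallow gradient-descent sketch of that starting point (hyperparameters are illustrative, not from the paper):

```python
import numpy as np

def mf_complete(Y, mask, rank=1, lr=0.01, iters=2000, seed=0):
    """Fit Y ~ U @ V.T on observed entries (mask == 1) by gradient
    descent, then use U @ V.T to fill in the unobserved entries."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        R = mask * (U @ V.T - Y)         # residual on observed entries only
        U, V = U - lr * R @ V, V - lr * R.T @ U
    return U @ V.T
```

Replacing the bilinear map `U @ V.T` with stacked nonlinear layers gives DMF; conditioning the factors on fine-grained or irregular time steps gives RNN-DMF and TIME-DMF.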

[LG-98] Improving Adversarial Robustness in Android Malware Detection by Reducing the Impact of Spurious Correlations ECAI2024 ESORICS2024

Link: https://arxiv.org/abs/2408.16025
Authors: Hamid Bostani,Zhengyu Zhao,Veelasha Moonsamy
Keywords-EN: demonstrated significant advancements, Machine learning, demonstrated significant, significant advancements, remains a major
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments: The paper is accepted at the ESORICS 2024 Workshop on Security and Artificial Intelligence (SECAI 2024)

Click to view abstract

Abstract:Machine learning (ML) has demonstrated significant advancements in Android malware detection (AMD); however, the resilience of ML against realistic evasion attacks remains a major obstacle for AMD. One of the primary factors contributing to this challenge is the scarcity of reliable generalizations. Malware classifiers with limited generalizability tend to overfit spurious correlations derived from biased features. Consequently, adversarial examples (AEs), generated by evasion attacks, can modify these features to evade detection. In this study, we propose a domain adaptation technique to improve the generalizability of AMD by aligning the distribution of malware samples and AEs. Specifically, we utilize meaningful feature dependencies, reflecting domain constraints in the feature space, to establish a robust feature space. Training on the proposed robust feature space enables malware classifiers to learn from predefined patterns associated with app functionality rather than from individual features. This approach helps mitigate spurious correlations inherent in the initial feature space. Our experiments conducted on DREBIN, a renowned Android malware detector, demonstrate that our approach surpasses the state-of-the-art defense, Sec-SVM, when facing realistic evasion attacks. In particular, our defense can improve adversarial robustness by up to 55% against realistic evasion attacks compared to Sec-SVM.

[LG-99] Characterizing Physician Referral Networks with Ricci Curvature

Link: https://arxiv.org/abs/2408.16022
Authors: Jeremy Wayland,Russel J. Funk,Bastian Rieck
Keywords-EN: United States remains, United States, Physician Referral Networks, States remains, quality healthcare access
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Identifying (a) systemic barriers to quality healthcare access and (b) key indicators of care efficacy in the United States remains a significant challenge. To improve our understanding of regional disparities in care delivery, we introduce a novel application of curvature, a geometrical-topological property of networks, to Physician Referral Networks. Our initial findings reveal that Forman-Ricci and Ollivier-Ricci curvature measures, which are known for their expressive power in characterizing network structure, offer promising indicators for detecting variations in healthcare efficacy while capturing a range of significant regional demographic features. We also present APPARENT, an open-source tool that leverages Ricci curvature and other network features to examine correlations between regional Physician Referral Networks structure, local census data, healthcare effectiveness, and patient outcomes.
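Of the two curvature notions named above, Forman-Ricci is the cheaper, purely combinatorial one. In its simplest form for an unweighted graph (ignoring triangles and edge weights, which the full definition includes), the curvature of an edge is 4 minus the degrees of its endpoints:

```python
def forman_ricci(edges):
    """Simplest combinatorial Forman-Ricci curvature, F(u,v) = 4 - deg(u) - deg(v),
    for an unweighted graph given as a list of undirected edges (each listed once).
    Strongly negative edges are 'bridge-like' links between hubs."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return {(u, v): 4 - deg[u] - deg[v] for u, v in edges}
```

On a referral network, edges incident to high-degree hub physicians get very negative curvature, which is the kind of structural signal the authors correlate with regional care measures.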

[LG-100] XG-NID: Dual-Modality Network Intrusion Detection using a Heterogeneous Graph Neural Network and Large Language Model

Link: https://arxiv.org/abs/2408.16021
Authors: Yasir Ali Farrukh,Syed Wali,Irfan Khan,Nathaniel D. Bastian
Keywords-EN: rapidly evolving field, largely untapped area, heterogeneous graph structure, flow-level and packet-level, intrusion detection remains
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 19 pages, 6 figures

Click to view abstract

Abstract:In the rapidly evolving field of cybersecurity, the integration of flow-level and packet-level information for real-time intrusion detection remains a largely untapped area of research. This paper introduces “XG-NID,” a novel framework that, to the best of our knowledge, is the first to fuse flow-level and packet-level data within a heterogeneous graph structure, offering a comprehensive analysis of network traffic. Leveraging a heterogeneous graph neural network (GNN) with graph-level classification, XG-NID uniquely enables real-time inference while effectively capturing the intricate relationships between flow and packet payload data. Unlike traditional GNN-based methodologies that predominantly analyze historical data, XG-NID is designed to accommodate the heterogeneous nature of network traffic, providing a robust and real-time defense mechanism. Our framework extends beyond mere classification; it integrates Large Language Models (LLMs) to generate detailed, human-readable explanations and suggest potential remedial actions, ensuring that the insights produced are both actionable and comprehensible. Additionally, we introduce a new set of flow features based on temporal information, further enhancing the contextual and explainable inferences provided by our model. To facilitate practical application and accessibility, we developed “GNN4ID,” an open-source tool that enables the extraction and transformation of raw network traffic into the proposed heterogeneous graph structure, seamlessly integrating flow and packet-level data. Our comprehensive quantitative comparative analysis demonstrates that XG-NID achieves an F1 score of 97% in multi-class classification, outperforming existing baseline and state-of-the-art methods. This sets a new standard in Network Intrusion Detection Systems by combining innovative data fusion with enhanced interpretability and real-time capabilities.

[LG-101] SPICED: Syntactical Bug and Trojan Pattern Identification in A/MS Circuits using LLM-Enhanced Detection

Link: https://arxiv.org/abs/2408.16018
Authors: Jayeeta Chaudhuri,Dhruv Thapar,Arjun Chaudhuri,Farshad Firouzi,Krishnendu Chakrabarty
Keywords-EN: playing key roles, modern electronics, playing key, signal processing, crucial in modern
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted at PAINE’24

Click to view abstract

Abstract:Analog and mixed-signal (A/MS) integrated circuits (ICs) are crucial in modern electronics, playing key roles in signal processing, amplification, sensing, and power management. Many IC companies outsource manufacturing to third-party foundries, creating security risks such as stealthy analog Trojans. Traditional detection methods, including embedding circuit watermarks or conducting hardware-based monitoring, often impose significant area and power overheads, and may not effectively identify all types of Trojans. To address these shortcomings, we propose SPICED, a Large Language Model (LLM)-based framework that operates within the software domain, eliminating the need for hardware modifications for Trojan detection and localization. This is the first work using LLM-aided techniques for detecting and localizing syntactical bugs and analog Trojans in circuit netlists, requiring no explicit training and incurring zero area overhead. Our framework employs chain-of-thought reasoning and few-shot examples to teach anomaly detection rules to LLMs. With the proposed method, we achieve an average Trojan coverage of 93.32% and an average true positive rate of 93.4% in identifying Trojan-impacted nodes for the evaluated analog benchmark circuits. These experimental results validate the effectiveness of LLMs in detecting and locating both syntactical bugs and Trojans within analog netlists.

[LG-102] Differentially Private Publication of Electricity Time Series Data in Smart Grids

Link: https://arxiv.org/abs/2408.16017
Authors: Sina Shaham,Gabriel Ghinita,Bhaskar Krishnamachari,Cyrus Shahabi
Keywords-EN: energy policy decisions, study consumer behavior, guide energy policy, Smart grids, valuable data source
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Smart grids are a valuable data source to study consumer behavior and guide energy policy decisions. In particular, time-series of power consumption over geographical areas are essential in deciding the optimal placement of expensive resources (e.g., transformers, storage elements) and their activation schedules. However, publication of such data raises significant privacy issues, as it may reveal sensitive details about personal habits and lifestyles. Differential privacy (DP) is well-suited for sanitization of individual data, but current DP techniques for time series lead to significant loss in utility, due to the existence of temporal correlation between data readings. We introduce STPT (Spatio-Temporal Private Timeseries), a novel method for DP-compliant publication of electricity consumption data that analyzes spatio-temporal attributes and captures both micro and macro patterns by leveraging RNNs. Additionally, it employs a partitioning method for releasing electricity consumption time series based on identified patterns. We demonstrate through extensive experiments, on both real-world and synthetic datasets, that STPT significantly outperforms existing benchmarks, providing a well-balanced trade-off between data utility and user privacy.
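For contrast with STPT, the textbook event-level DP baseline simply adds independent Laplace noise, scaled by sensitivity/epsilon, to each reading, which ignores exactly the temporal correlation the paper exploits. An illustrative sketch of that baseline (not the paper's mechanism):

```python
import numpy as np

def laplace_release(series, epsilon, sensitivity=1.0, seed=0):
    """Event-level Laplace mechanism: perturb each reading with
    Laplace(0, sensitivity/epsilon) noise before publication."""
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon
    return series + rng.laplace(0.0, scale, size=len(series))
```

The smaller the privacy budget epsilon, the larger the noise scale and the lower the utility; correlation-aware methods such as STPT aim to spend that budget more efficiently.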

[LG-103] Artificial Neural Network and Deep Learning: Fundamentals and Theory

Link: https://arxiv.org/abs/2408.16002
Authors: M. M. Hammad
Keywords-EN: Neural Network, Fundamentals and Theory, Artificial Neural Network, neural network optimization, Neural
Subjects: Machine Learning (cs.LG)
*Comments: 517 pages. arXiv admin note: text overlap with arXiv:2407.11090 , arXiv:2407.19258 , arXiv:2310.00004 ; text overlap with arXiv:2109.14545 , arXiv:1502.03167 , arXiv:1412.6980 , arXiv:2003.00547 , arXiv:2212.08989 by other authors

Click to view abstract

Abstract:“Artificial Neural Network and Deep Learning: Fundamentals and Theory” offers a comprehensive exploration of the foundational principles and advanced methodologies in neural networks and deep learning. This book begins with essential concepts in descriptive statistics and probability theory, laying a solid groundwork for understanding data and probability distributions. As the reader progresses, they are introduced to matrix calculus and gradient optimization, crucial for training and fine-tuning neural networks. The book delves into multilayer feed-forward neural networks, explaining their architecture, training processes, and the backpropagation algorithm. Key challenges in neural network optimization, such as activation function saturation, vanishing and exploding gradients, and weight initialization, are thoroughly discussed. The text covers various learning rate schedules and adaptive algorithms, providing strategies to optimize the training process. Techniques for generalization and hyperparameter tuning, including Bayesian optimization and Gaussian processes, are also presented to enhance model performance and prevent overfitting. Advanced activation functions are explored in detail, categorized into sigmoid-based, ReLU-based, ELU-based, miscellaneous, non-standard, and combined types. Each activation function is examined for its properties and applications, offering readers a deep understanding of their impact on neural network behavior. The final chapter introduces complex-valued neural networks, discussing complex numbers, functions, and visualizations, as well as complex calculus and backpropagation algorithms. This book equips readers with the knowledge and skills necessary to design, and optimize advanced neural network models, contributing to the ongoing advancements in artificial intelligence.
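Two of the activation families the book catalogs (ELU-based and the miscellaneous GELU), written out in NumPy; these are the standard textbook definitions, and the book's variants may differ in parameterization:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha*(exp(x)-1) otherwise; smooth and
    saturating to -alpha for large negative inputs."""
    return np.where(x > 0, x, alpha * np.expm1(x))

def gelu(x):
    """GELU, tanh approximation; widely used in transformer models."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```

Plotting these side by side with ReLU makes the book's taxonomy concrete: ELU avoids the dead-neuron problem with a bounded negative tail, while GELU weights inputs by an approximate Gaussian CDF.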

[LG-104] Subspace Representation Learning for Sparse Linear Arrays to Localize More Sources than Sensors: A Deep Learning Methodology

Link: https://arxiv.org/abs/2408.16605
Authors: Kuan-Lin Chen,Bhaskar D. Rao
Keywords-EN: utilize semidefinite programming, semidefinite programming, long relied, relied on minimizing, minimizing a distance
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 13 pages. Submitted to the IEEE Transactions on Signal Processing

Click to view abstract

Abstract:Localizing more sources than sensors with a sparse linear array (SLA) has long relied on minimizing a distance between two covariance matrices and recent algorithms often utilize semidefinite programming (SDP). Although deep neural network (DNN)-based methods offer new alternatives, they still depend on covariance matrix fitting. In this paper, we develop a novel methodology that estimates the co-array subspaces from a sample covariance for SLAs. Our methodology trains a DNN to learn signal and noise subspace representations that are invariant to the selection of bases. To learn such representations, we propose loss functions that gauge the separation between the desired and the estimated subspace. In particular, we propose losses that measure the length of the shortest path between subspaces viewed on a union of Grassmannians, and prove that it is possible for a DNN to approximate signal subspaces. The computation of learning subspaces of different dimensions is accelerated by a new batch sampling strategy called consistent rank sampling. The methodology is robust to array imperfections due to its geometry-agnostic and data-driven nature. In addition, we propose a fully end-to-end gridless approach that directly learns angles to study the possibility of bypassing subspace methods. Numerical results show that learning such subspace representations is more beneficial than learning covariances or angles. It outperforms conventional SDP-based methods such as the sparse and parametric approach (SPA) and existing DNN-based covariance reconstruction methods for a wide range of signal-to-noise ratios (SNRs), snapshots, and source numbers for both perfect and imperfect arrays.
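The "length of the shortest path between subspaces viewed on a union of Grassmannians" comes from principal angles, which are computable from one SVD. A generic sketch of that geodesic distance (the paper's actual loss functions build on it; this is not their implementation):

```python
import numpy as np

def grassmann_distance(A, B):
    """Geodesic distance between the column spans of A and B on the
    Grassmannian: sqrt of the sum of squared principal angles, obtained
    from the singular values of Qa^T Qb after orthonormalizing."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), -1.0, 1.0)
    return float(np.sqrt(np.sum(np.arccos(s) ** 2)))
```

Because the distance depends only on the spans, not on the particular bases, a loss built from it is invariant to basis selection, which is exactly the representation property the methodology asks of the DNN.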

[LG-105] Super-Resolution works for coastal simulations

链接: https://arxiv.org/abs/2408.16553
作者: Zhi-Song Liu,Markus Buttner,Vadym Aizinger,Andreas Rupp
关键词-EN: Learning fine-scale details, Learning fine-scale, coastal ocean simulation, challenging task, fine-scale details
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:Learning fine-scale details of a coastal ocean simulation from a coarse representation is a challenging task. For real-world applications, high-resolution simulations are necessary to advance understanding of many coastal processes, specifically, to predict flooding resulting from tsunamis and storm surges. We propose a Deep Network for Coastal Super-Resolution (DNCSR) for spatiotemporal enhancement to efficiently learn the high-resolution numerical solution. Given images of coastal simulations produced on low-resolution computational meshes using low polynomial order discontinuous Galerkin discretizations and a coarse temporal resolution, the proposed DNCSR learns to produce high-resolution free surface elevation and velocity visualizations in both time and space. To efficiently model the dynamic changes over time and space, we propose grid-aware spatiotemporal attention to project the temporal features to the spatial domain for non-local feature matching. The coordinate information is also utilized via positional encoding. For the final reconstruction, we use the spatiotemporal bilinear operation to interpolate the missing frames and then expand the feature maps to the frequency domain for residual mapping. Besides data-driven losses, the proposed physics-informed loss guarantees gradient consistency and momentum changes. Their combination contributes to the overall 24% improvements in RMSE. To train the proposed model, we propose a large-scale coastal simulation dataset and use it for model optimization and evaluation. Our method shows superior super-resolution quality and fast computation compared to the state-of-the-art methods.

[LG-106] Statistical and Geometrical properties of regularized Kernel Kullback-Leibler divergence

链接: https://arxiv.org/abs/2408.16543
作者: Clémentine Chazal,Anna Korba,Francis Bach
关键词-EN: introduced by Bach, kernel covariance operators, covariance operators, kernel Hilbert space, KKL
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:In this paper, we study the statistical and geometrical properties of the Kullback-Leibler divergence with kernel covariance operators (KKL) introduced by Bach [2022]. Unlike the classical Kullback-Leibler (KL) divergence that involves density ratios, the KKL compares probability distributions through covariance operators (embeddings) in a reproducing kernel Hilbert space (RKHS), and computes the Kullback-Leibler quantum divergence. This novel divergence hence shares parallel but different aspects with both the standard Kullback-Leibler between probability distributions and kernel embeddings metrics such as the maximum mean discrepancy. A limitation of the original KKL divergence is that it is not defined for distributions with disjoint supports. To solve this problem, we propose in this paper a regularised variant that guarantees that the divergence is well defined for all distributions. We derive bounds that quantify the deviation of the regularised KKL to the original one, as well as finite-sample bounds. In addition, we provide a closed-form expression for the regularised KKL, specifically applicable when the distributions consist of finite sets of points, which makes it implementable. Furthermore, we derive a Wasserstein gradient descent scheme of the KKL divergence in the case of discrete distributions, and study empirically its properties to transport a set of points to a target distribution.
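In finite dimensions, the KKL-style comparison reduces to a von Neumann (quantum) relative entropy between two covariance matrices. A toy sketch under that finite-dimensional stand-in, with a simple eigenvalue floor so the logarithms stay defined (this smoothing is an illustrative regularizer, not the paper's exact regularised variant):

```python
import numpy as np

def quantum_kl(S_p, S_q, eps=1e-8):
    """tr[S_p (log S_p - log S_q)] for symmetric PSD matrices,
    with eigenvalues floored by eps so the logs are defined."""
    def logm_sym(S):
        w, V = np.linalg.eigh(S)
        return (V * np.log(np.maximum(w, eps))) @ V.T
    return float(np.trace(S_p @ (logm_sym(S_p) - logm_sym(S_q))))

rng = np.random.default_rng(0)

def cov_embedding(X):
    """Empirical covariance of unit-norm feature vectors
    (a trace-one stand-in for a kernel covariance operator)."""
    Phi = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Phi.T @ Phi / len(Phi)

S_p = cov_embedding(rng.normal(size=(200, 5)))
S_q = cov_embedding(rng.normal(loc=1.0, size=(200, 5)))
print(quantum_kl(S_p, S_p))  # 0.0: divergence of an operator to itself
print(quantum_kl(S_p, S_q))  # positive for distinct trace-one operators
```

Non-negativity of the second value is Klein's inequality for density (trace-one, PSD) matrices, mirroring the non-negativity of the classical KL.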

[LG-107] WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

链接: https://arxiv.org/abs/2408.16532
作者: Shengpeng Ji,Ziyue Jiang,Xize Cheng,Yifu Chen,Minghui Fang,Jialong Zuo,Qian Yang,Ruiqi Li,Ziang Zhang,Xiaoda Yang,Rongjie Huang,Yidi Jiang,Qian Chen,Siqi Zheng,Wen Wang,Zhou Zhao
关键词-EN: modeling natural signals, natural signals, Language models, effectively applied, applied to modeling
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Working in progress. arXiv admin note: text overlap with arXiv:2402.12208

点击查看摘要

Abstract:Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at this https URL.
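The headline compression figure is easy to sanity-check: at a 24 kHz sampling rate, 40 or 75 tokens per second means each discrete token summarizes several hundred raw samples.

```python
sr = 24_000  # sampling rate in Hz, as stated in the abstract

for tokens_per_second in (40, 75):
    samples_per_token = sr // tokens_per_second
    print(f"{tokens_per_second} tokens/s -> {samples_per_token} samples per token")
# 40 tokens/s -> 600 samples per token
# 75 tokens/s -> 320 samples per token
```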

[LG-108] Machine learning models for daily rainfall forecasting in Northern Tropical Africa using tropical wave predictors

链接: https://arxiv.org/abs/2408.16349
作者: Athul Rasheeda Satheesh,Peter Knippertz,Andreas H. Fink
关键词-EN: Numerical weather prediction, Numerical weather, simpler climatology-based precipitation, tropical Africa, northern tropical Africa
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerical weather prediction (NWP) models often underperform compared to simpler climatology-based precipitation forecasts in northern tropical Africa, even after statistical postprocessing. AI-based forecasting models show promise but have avoided precipitation due to its complexity. Synoptic-scale forcings like African easterly waves and other tropical waves (TWs) are important for predictability in tropical Africa, yet their value for predicting daily rainfall remains unexplored. This study uses two machine-learning models–gamma regression and a convolutional neural network (CNN)–trained on TW predictors from satellite-based GPM IMERG data to predict daily rainfall during the July-September monsoon season. Predictor variables are derived from the local amplitude and phase information of seven TW from the target and up-and-downstream neighboring grids at 1-degree spatial resolution. The ML models are combined with Easy Uncertainty Quantification (EasyUQ) to generate calibrated probabilistic forecasts and are compared with three benchmarks: Extended Probabilistic Climatology (EPC15), ECMWF operational ensemble forecast (ENS), and a probabilistic forecast from the ENS control member using EasyUQ (CTRL EasyUQ). The study finds that downstream predictor variables offer the highest predictability, with downstream tropical depression (TD)-type wave-based predictors being most important. Other waves like mixed-Rossby gravity (MRG), Kelvin, and inertio-gravity waves also contribute significantly but show regional preferences. ENS forecasts exhibit poor skill due to miscalibration. CTRL EasyUQ shows improvement over ENS and marginal enhancement over EPC15. Both gamma regression and CNN forecasts significantly outperform benchmarks in tropical Africa. This study highlights the potential of ML models trained on TW-based predictors to improve daily precipitation forecasts in tropical Africa.

[LG-109] Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks

链接: https://arxiv.org/abs/2408.16215
作者: Yan Dai,Longbo Huang
关键词-EN: stochastic queueing systems, Stochastic Network Optimization, stochastic queueing, Stochastic Network, Stochastic
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems. It has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary with time, which fails to capture the non-stationary components in many real-world scenarios. Many existing algorithms also assume knowledge of network conditions before decision, which rules out applications where unpredictability is present. Motivated by these issues, we consider Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of i) maximizing some unknown and time-varying utility function associated to scheduler’s actions, where ii) the underlying network is a non-stationary multi-hop one whose conditions change arbitrarily with time, and iii) only bandit feedback (effect of actually deployed actions) is revealed after decisions. Our proposed UMO2 algorithm ensures network stability and also matches the utility maximization performance of any “mildly varying” reference policy up to a polynomially decaying gap. To our knowledge, no previous ANO algorithm handled multi-hop networks or achieved utility guarantees under bandit feedback, whereas ours can do both. Technically, our method builds upon a novel integration of online learning into Lyapunov analyses: To handle complex inter-dependencies among queues in multi-hop networks, we propose meticulous techniques to balance online learning and Lyapunov arguments. To tackle the learning obstacles due to potentially unbounded queue sizes, we design a new online linear optimization algorithm that automatically adapts to loss magnitudes. To maximize utility, we propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suits drastically varying queue lengths. Our new insights in online learning can be of independent interest.

[LG-110] The Application of Machine Learning in Tidal Evolution Simulation of Star-Planet Systems

链接: https://arxiv.org/abs/2408.16212
作者: Shuaishuai Guo,Jianheng Guo,KaiFan Ji,Hui Liu,Lei Xing
关键词-EN: evolutionary curves, close-in hot Jupiters, hot Jupiter systems, predicted evolutionary curves, astronomical data
类目: Earth and Planetary Astrophysics (astro-ph.EP); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the release of a large amount of astronomical data, an increasing number of close-in hot Jupiters have been discovered. Calculating their evolutionary curves using star-planet interaction models presents a challenge. To expedite the generation of evolutionary curves for these close-in hot Jupiter systems, we utilized tidal interaction models established on MESA to create 15,745 samples of star-planet systems and 7,500 samples of stars. Additionally, we employed a neural network (Multi-Layer Perceptron - MLP) to predict the evolutionary curves of the systems, including stellar effective temperature, radius, stellar rotation period, and planetary orbital period. The median relative errors of the predicted evolutionary curves were found to be 0.15%, 0.43%, 2.61%, and 0.57%, respectively. Furthermore, the speed at which we generate evolutionary curves exceeds that of model-generated curves by more than four orders of magnitude. We also extracted features of planetary migration states and utilized lightGBM to classify the samples into 6 categories for prediction. We found that by combining three types that undergo long-term double synchronization into one label, the classifier effectively recognized these features. Apart from systems experiencing long-term double synchronization, the median relative errors of the predicted evolutionary curves were all below 4%. Our work provides an efficient method to save significant computational resources and time with minimal loss in accuracy. This research also lays the foundation for analyzing the evolutionary characteristics of systems under different migration states, aiding in the understanding of the underlying physical mechanisms of such systems. Finally, to a large extent, our approach could replace the calculations of theoretical models.

[LG-111] A More Unified Theory of Transfer Learning

链接: https://arxiv.org/abs/2408.16189
作者: Steve Hanneke,Samory Kpotufe
关键词-EN: target risk decreases, source risk decreases, risk decreases, fast target risk, delta
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We show that some basic moduli of continuity \delta – which measure how fast target risk decreases as source risk decreases – appear to be at the root of many of the classical relatedness measures in transfer learning and related literature. Namely, bounds in terms of \delta recover many of the existing bounds in terms of other measures of relatedness – both in regression and classification – and can at times be tighter. We are particularly interested in general situations where the learner has access to both source data and some or no target data. The unified perspective allowed by the moduli \delta allows us to extend many existing notions of relatedness at once to these scenarios involving target data: interestingly, while \delta itself might not be efficiently estimated, adaptive procedures exist – based on reductions to confidence sets – which can get nearly tight rates in terms of \delta with no prior distributional knowledge. Such adaptivity to unknown \delta immediately implies adaptivity to many classical relatedness notions, in terms of combined source and target samples’ sizes.

[LG-112] Single-Loop Deterministic and Stochastic Interior-Point Algorithms for Nonlinearly Constrained Optimization

链接: https://arxiv.org/abs/2408.16186
作者: Frank E. Curtis,Xin Jiang,Qi Wang
关键词-EN: solving nonlinearly constrained, nonlinearly constrained continuous, constrained continuous optimization, continuous optimization problems, tested for solving
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An interior-point algorithm framework is proposed, analyzed, and tested for solving nonlinearly constrained continuous optimization problems. The main setting of interest is when the objective and constraint functions may be nonlinear and/or nonconvex, and when constraint values and derivatives are tractable to compute, but objective function values and derivatives can only be estimated. The algorithm is intended primarily for a setting that is similar for stochastic-gradient methods for unconstrained optimization, namely, the setting when stochastic-gradient estimates are available and employed in place of gradients of the objective, and when no objective function values (nor estimates of them) are employed. This is achieved by the interior-point framework having a single-loop structure rather than the nested-loop structure that is typical of contemporary interior-point methods. For completeness, convergence guarantees for the framework are provided both for deterministic and stochastic settings. Numerical experiments show that the algorithm yields good performance on a large set of test problems.
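The single-loop structure can be illustrated on a one-dimensional toy problem: a log-barrier term keeps iterates feasible while its weight decays inside the same loop that takes stochastic gradient steps on the objective, with no inner solve per barrier value. Everything below (the problem, step size, and decay schedule) is an illustrative sketch, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: minimize E[(x - 1)^2] subject to x <= 0.5, using only
# noisy gradient estimates of the objective. Constrained optimum: x* = 0.5.
def stoch_obj_grad(x):
    return 2.0 * (x - 1.0) + 0.05 * rng.normal()

x, lr = 0.0, 0.002
for t in range(1, 20_001):
    mu = 1.0 / np.sqrt(t)                   # barrier weight decays in the SAME loop
    g = stoch_obj_grad(x) + mu / (0.5 - x)  # adds grad of -mu * log(0.5 - x)
    x -= lr * g
    x = min(x, 0.5 - 1e-3)                  # safeguard: stay strictly interior
print(round(x, 3))  # close to the constrained optimum 0.5, from the interior
```

The iterate settles near 0.5 - mu, so shrinking mu within the loop drives it toward the constrained optimum without ever evaluating the objective itself.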

[LG-113] A Minibatch-SGD-Based Learning Meta-Policy for Inventory Systems with Myopic Optimal Policy

链接: https://arxiv.org/abs/2408.16181
作者: Jiameng Lyu,Jinxing Xie,Shilin Yuan,Yuan Zhou
关键词-EN: Stochastic gradient descent, Stochastic gradient, gradient descent, proven effective, effective in solving
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Forthcoming in Management Science

点击查看摘要

Abstract:Stochastic gradient descent (SGD) has proven effective in solving many inventory control problems with demand learning. However, it often faces the pitfall of an infeasible target inventory level that is lower than the current inventory level. Several recent works (e.g., Huh and Rusmevichientong (2009), Shi et al. (2016)) have successfully resolved this issue in various inventory systems. However, their techniques are rather sophisticated and difficult to apply to more complicated scenarios such as multi-product and multi-constraint inventory systems. In this paper, we address the infeasible-target-inventory-level issue from a new technical perspective – we propose a novel minibatch-SGD-based meta-policy. Our meta-policy is flexible enough to be applied to a general inventory systems framework covering a wide range of inventory management problems with myopic clairvoyant optimal policy. By devising the optimal minibatch scheme, our meta-policy achieves a regret bound of \mathcal{O}(\sqrt{T}) for the general convex case and \mathcal{O}(\log T) for the strongly convex case. To demonstrate the power and flexibility of our meta-policy, we apply it to three important inventory control problems: multi-product and multi-constraint systems, multi-echelon serial systems, and one-warehouse and multi-store systems by carefully designing application-specific subroutines. We also conduct extensive numerical experiments to demonstrate that our meta-policy enjoys competitive regret performance, high computational efficiency, and low variances among a wide range of applications.
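The demand-learning setting is easiest to see on the classic single-product newsvendor, where plain SGD on the order-up-to level converges to a critical quantile of demand; the infeasibility pitfall arises because such a target can fall below on-hand inventory. A textbook-SGD sketch (costs, demand distribution, and step sizes are illustrative; this is the baseline, not the paper's minibatch meta-policy):

```python
import numpy as np

rng = np.random.default_rng(1)
h, b = 1.0, 3.0   # per-unit holding (overage) and backlog (underage) costs
S = 0.0           # order-up-to level being learned

for t in range(1, 50_001):
    d = rng.exponential(10.0)       # observed demand
    grad = h if S > d else -b       # subgradient of E[h(S-D)^+ + b(D-S)^+]
    S -= (50.0 / t) * grad
print(round(S, 2))  # near the b/(b+h) = 0.75 demand quantile, -10*ln(0.25) ~ 13.86
```

Nothing in this update prevents the new target S from dropping below current inventory, which is exactly the feasibility issue the meta-policy above is built to handle.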

[LG-114] A nudge to the truth: atom conservation as a hard constraint in models of atmospheric composition using an uncertainty-weighted correction

链接: https://arxiv.org/abs/2408.16109
作者: Patrick Obin Sturm,Sam J. Silva
关键词-EN: Computational models, atmospheric composition, Computational, physically consistent, conservation laws
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures (main text); 11 pages, 4 figures (supporting information). This version of the manuscript is a preprint and not peer-reviewed

点击查看摘要

Abstract:Computational models of atmospheric composition are not always physically consistent. For example, not all models respect fundamental conservation laws such as conservation of atoms in an interconnected chemical system. In well performing models, these nonphysical deviations are often ignored because they are frequently minor, and thus only need a small nudge to perfectly conserve mass. Here we introduce a method that anchors a prediction from any numerical model to physically consistent hard constraints, nudging concentrations to the nearest solution that respects the conservation laws. This closed-form model-agnostic correction uses a single matrix operation to minimally perturb the predicted concentrations to ensure that atoms are conserved to machine precision. To demonstrate this approach, we train a gradient boosting decision tree ensemble to emulate a small reference model of ozone photochemistry and test the effect of the correction on accurate but non-conservative predictions. The nudging approach minimally perturbs the already well-predicted results for most species, but decreases the accuracy of important oxidants, including radicals. We develop a weighted extension of this nudging approach that considers the uncertainty and magnitude of each species in the correction. This species-level weighting approach is essential to accurately predict important low concentration species such as radicals. We find that applying the uncertainty-weighted correction to the nonphysical predictions slightly improves overall accuracy, by nudging the predictions to a more likely mass-conserving solution.
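The closed-form, single-matrix-operation correction described above amounts to a (weighted) least-squares projection onto the conservation constraints A x = b, where A counts atoms per species. A minimal sketch on a hypothetical three-species oxygen system (the species, stoichiometry, and weighting scheme are ours, not the paper's reference model):

```python
import numpy as np

def conserve(x_pred, A, b, w=None):
    """Minimally nudge predicted concentrations x_pred onto the
    atom-balance constraints A @ x = b (weighted least squares).
    Larger w[i] lets species i absorb more of the correction."""
    w = np.ones(len(x_pred)) if w is None else np.asarray(w, float)
    W = np.diag(w)
    residual = A @ x_pred - b
    return x_pred - W @ A.T @ np.linalg.solve(A @ W @ A.T, residual)

# Hypothetical system: species O, O2, O3 with oxygen-atom counts 1, 2, 3.
A = np.array([[1.0, 2.0, 3.0]])
x_true = np.array([1.0, 2.0, 1.0])
b = A @ x_true                         # 8 oxygen atoms, to be conserved
x_pred = np.array([1.10, 1.90, 1.05])  # accurate but non-conservative output
x_fix = conserve(x_pred, A, b)
print(A @ x_fix - b)  # ~[0.]: atoms conserved to machine precision
```

Setting w from per-species uncertainty and magnitude, as the abstract describes, steers the correction away from low-concentration species such as radicals.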

[LG-115] Unlocking Global Optimality in Bilevel Optimization: A Pilot Study

链接: https://arxiv.org/abs/2408.16087
作者: Quan Xiao,Tianyi Chen
关键词-EN: resurgence of interest, witnessed a resurgence, critical role, role in trustworthy, efficient machine learning
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bilevel optimization has witnessed a resurgence of interest, driven by its critical role in trustworthy and efficient machine learning applications. Recent research has focused on proposing efficient methods with provable convergence guarantees. However, while many prior works have established convergence to stationary points or local minima, obtaining the global optimum of bilevel optimization remains an important yet open problem. The difficulty lies in the fact that unlike many prior non-convex single-level problems, this bilevel problem does not admit a “benign” landscape, and may indeed have multiple spurious local solutions. Nevertheless, attaining the global optimality is indispensable for ensuring reliability, safety, and cost-effectiveness, particularly in high-stakes engineering applications that rely on bilevel optimization. In this paper, we first explore the challenges of establishing a global convergence theory for bilevel optimization, and present two sufficient conditions for global convergence. We provide algorithm-specific proofs to rigorously substantiate these sufficient conditions along the optimization trajectory, focusing on two specific bilevel learning scenarios: representation learning and data hypercleaning (a.k.a. reweighting). Experiments corroborate the theoretical findings, demonstrating convergence to global minimum in both cases.

[LG-116] Analysis of Diagnostics (Part II): Prevalence Linear Independence and Unsupervised Learning

链接: https://arxiv.org/abs/2408.16035
作者: Paul N. Patrone,Raquel A. Binder,Catherine S. Forconi,Ann M. Moormann,Anthony J. Kearsley
关键词-EN: number of elements, classification theory, two-part series, diagnostic testing, testing to understand
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:This is the second manuscript in a two-part series that uses diagnostic testing to understand the connection between prevalence (i.e. number of elements in a class), uncertainty quantification (UQ), and classification theory. Part I considered the context of supervised machine learning (ML) and established a duality between prevalence and the concept of relative conditional probability. The key idea of that analysis was to train a family of discriminative classifiers by minimizing a sum of prevalence-weighted empirical risk functions. The resulting outputs can be interpreted as relative probability level-sets, which thereby yield uncertainty estimates in the class labels. This procedure also demonstrated that certain discriminative and generative ML models are equivalent. Part II considers the extent to which these results can be extended to tasks in unsupervised learning through recourse to ideas in linear algebra. We first observe that the distribution of an impure population, for which the class of a corresponding sample is unknown, can be parameterized in terms of a prevalence. This motivates us to introduce the concept of linearly independent populations, which have different but unknown prevalence values. Using this, we identify an isomorphism between classifiers defined in terms of impure and pure populations. In certain cases, this also leads to a nonlinear system of equations whose solution yields the prevalence values of the linearly independent populations, fully realizing unsupervised learning as a generalization of supervised learning. We illustrate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).

信息检索

[IR-0] Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

链接: https://arxiv.org/abs/2408.16672
作者: Rohan Jha,Bo Wang,Michael Günther,Saba Sturua,Mohammad Kalim Akram,Han Xiao
关键词-EN: proven highly effective, Multi-vector dense models, Multi-vector dense, proven highly, highly effective
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce several improvements to the ColBERT model architecture and training pipeline, leveraging techniques successful in the more established single-vector embedding model paradigm, particularly those suited for heterogeneous multilingual data. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks, while also cutting storage requirements by up to 50% compared to previous models.
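ColBERT's late interaction is the MaxSim operator: every query token embedding is matched against its best document token, and the per-token maxima are summed. A minimal sketch with random unit vectors standing in for learned, L2-normalized token embeddings:

```python
import numpy as np

def late_interaction_score(Q, D):
    """ColBERT-style MaxSim: for each query token embedding, take the
    maximum similarity over all document token embeddings, then sum.
    Q: (num_q_tokens, dim), D: (num_d_tokens, dim), rows L2-normalized."""
    sim = Q @ D.T                    # token-level cosine similarities
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
def normed(n, d):
    M = rng.normal(size=(n, d))
    return M / np.linalg.norm(M, axis=1, keepdims=True)

Q = normed(4, 8)
doc_match = np.vstack([Q, normed(6, 8)])  # contains every query token exactly
doc_rand = normed(10, 8)
print(late_interaction_score(Q, doc_match))  # 4.0: each query token matched
print(late_interaction_score(Q, doc_rand))   # strictly lower
```

Because scoring is a max over precomputable document-token vectors, documents can be indexed offline, which is the efficiency advantage over cross-encoders noted above.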

[IR-1] Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation RECSYS’2024

链接: https://arxiv.org/abs/2408.16578
作者: Viet-Anh Tran,Guillaume Salha-Galvan,Bruno Sguerra,Romain Hennequin
关键词-EN: leverage sequential recommender, based on past, past sequences, PISA, listening
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 11 pages. Accepted by RecSys’2024, full paper

点击查看摘要

Abstract:Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson’s ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from this http URL and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use.
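The ACT-R component that PISA draws on is commonly summarized by the base-level activation equation, B = ln(sum_j t_j^{-d}), where t_j is the time elapsed since each past exposure and d is a decay rate: items accessed recently and often have high activation. A toy rendering (d = 0.5 is ACT-R's conventional default; the listening-session framing is ours, not PISA's attention mechanism):

```python
import numpy as np

def base_level_activation(ages, d=0.5):
    """ACT-R base-level activation: ln(sum of age^-d over past exposures).
    ages: time elapsed since each previous listen (arbitrary units)."""
    return float(np.log(np.sum(np.asarray(ages, dtype=float) ** -d)))

print(base_level_activation([1, 2, 10]))  # song repeated recently: high activation
print(base_level_activation([30]))        # heard once, long ago: low activation
```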

[IR-2] Is text normalization relevant for classifying medieval charters?

链接: https://arxiv.org/abs/2408.16446
作者: Florian Atzenhofer-Baumgartner,Tamás Kovács
关键词-EN: Middle High German, High German charters, specifically focusing, study examines, examines the impact
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: This preprint has not undergone peer review or any post-submission improvements or corrections

点击查看摘要

Abstract:This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.

[IR-3] Do Recommender Systems Promote Local Music? A Reproducibility Study Using Music Streaming Data

链接: https://arxiv.org/abs/2408.16430
作者: Kristina Matrosova,Lilian Marey,Guillaume Salha-Galvan,Thomas Louail,Olivier Bodini,Manuel Moussallam
关键词-EN: discussing prior findings, local music representation, local music, recommender systems, recommender systems exhibit
类目: Information Retrieval (cs.IR); Databases (cs.DB); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper examines the influence of recommender systems on local music representation, discussing prior findings from an empirical study on the LFM-2b public dataset. This prior study argued that different recommender systems exhibit algorithmic biases shifting music consumption either towards or against local content. However, LFM-2b users do not reflect the diverse audience of music streaming services. To assess the robustness of this study’s conclusions, we conduct a comparative analysis using proprietary listening data from a global music streaming service, which we publicly release alongside this paper. We observe significant differences in local music consumption patterns between our dataset and LFM-2b, suggesting that caution should be exercised when drawing conclusions on local music based solely on LFM-2b. Moreover, we show that the algorithmic biases exhibited in the original work vary in our dataset, and that several unexplored model parameters can significantly influence these biases and affect the study’s conclusion on both datasets. Finally, we discuss the complexity of accurately labeling local music, emphasizing the risk of misleading conclusions due to unreliable, biased, or incomplete labels. To encourage further research and ensure reproducibility, we have publicly shared our dataset and code.

[IR-4] SynDL: A Large-Scale Synthetic Test Collection

Link: https://arxiv.org/abs/2408.16312
Authors: Hossein A. Rahmani,Xi Wang,Emine Yilmaz,Nick Craswell,Bhaskar Mitra,Paul Thomas
Keywords-EN: information retrieval research, Information Retrieval, crucial role
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages

Click to view abstract

Abstract:Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, according to the Cranfield paradigm and the research into publicly available datasets, the existing information retrieval research studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments - a time-intensive and expensive process. Recent studies have shown the strong capability of Large Language Models (LLMs) in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the missing large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional language model synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from the previous years of tracks. We compare system evaluation with past human labels from past years and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.
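The "highly correlated system rankings" claim can be checked with a rank correlation such as Kendall's tau between the system orderings induced by human and synthetic labels. A minimal sketch with made-up system scores (the system names and values are assumptions for illustration):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same systems:
    (concordant pairs - discordant pairs) / (total non-tied pairs)."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        da = scores_a[s] - scores_a[t]
        db = scores_b[s] - scores_b[t]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n = concordant + discordant
    return (concordant - discordant) / n if n else 0.0

# Hypothetical effectiveness scores under human vs. synthetic judgments.
human = {"sysA": 0.61, "sysB": 0.55, "sysC": 0.48}
synthetic = {"sysA": 0.59, "sysB": 0.52, "sysC": 0.50}
tau = kendall_tau(human, synthetic)
```

A tau near 1 means the synthetic labels rank systems in essentially the same order as human assessors, which is the property the test collection needs.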

[IR-5] Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models ECCV2024

Link: https://arxiv.org/abs/2408.16296
Authors: Kengo Nakata,Daisuke Miyashita,Youyang Ng,Yasuto Hoshi,Jun Deguchi
Keywords-EN: sparse lexical representations, image retrieval
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted to ECCV 2024 Workshops: 2nd Workshop on Traditional Computer Vision in the Age of Deep Learning (TradiCV)

Click to view abstract

Abstract:In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
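Once an M-LLM has turned each image into text, retrieval reduces to ordinary sparse keyword matching. The sketch below uses a plain term-frequency inverted index as a stand-in for the paper's actual sparse retrieval algorithm; the captions and scoring scheme are illustrative assumptions:

```python
from collections import Counter

def build_index(captions):
    """Inverted index over M-LLM-generated textual descriptions of images:
    term -> {image_id: term frequency}."""
    index = {}
    for image_id, text in captions.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[image_id] = tf
    return index

def search(index, query):
    """Score images by summed term frequency of matched query keywords,
    returned best-first."""
    scores = Counter()
    for term in query.lower().split():
        for image_id, tf in index.get(term, {}).items():
            scores[image_id] += tf
    return [img for img, _ in scores.most_common()]

# Hypothetical M-LLM outputs for three images.
captions = {
    "img1": "a black dog running on the beach",
    "img2": "a cat sleeping on a sofa",
    "img3": "two dogs playing with a ball",
}
index = build_index(captions)
results = search(index, "dog beach")
```

The paper's key-expansion step would augment the captions before indexing (e.g., adding "dogs" alongside "dog"), which this bare-bones sketch omits.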

[IR-6] Efficient Transfer Learning Framework for Cross-Domain Click-Through Rate Prediction

Link: https://arxiv.org/abs/2408.16238
Authors: Qi Liu,Xingyuan Tang,Jianqiang Huang,Xiangqian Yu,Haoran Jin,Jin Chen,Yuanhao Pu,Defu Lian,Tan Qu,Zhe Wang,Jia Cheng,Jun Lei
Keywords-EN: industrial recommendation systems, CTR model, natural content, advertisement CTR models, tiny CTR model
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Natural content and advertisement coexist in industrial recommendation systems but differ in data distribution. Concretely, traffic related to the advertisement is considerably sparser compared to that of natural content, which motivates the development of transferring knowledge from the richer source natural content domain to the sparser advertising domain. The challenges include the inefficiencies arising from the management of extensive source data and the problem of ‘catastrophic forgetting’ that results from the CTR model’s daily updating. To this end, we propose a novel tri-level asynchronous framework, i.e., Efficient Transfer Learning Framework for Cross-Domain Click-Through Rate Prediction (E-CDCTR), to transfer comprehensive knowledge of natural content to advertisement CTR models. This framework consists of three key components: Tiny Pre-training Model (TPM), which trains a tiny CTR model with several basic features on long-term natural data; Complete Pre-training Model (CPM), which trains a CTR model with the same network structure and input features as the target advertisement model on short-term natural data; Advertisement CTR model (A-CTR), which derives its parameter initialization from CPM together with multiple historical embeddings from TPM as extra features and then fine-tunes on advertisement data. TPM provides richer representations of user and item for both the CPM and A-CTR, effectively alleviating the forgetting problem inherent in the daily updates. CPM further enhances the advertisement model by providing knowledgeable initialization, thereby alleviating the data sparsity challenges typically encountered by advertising CTR models. Such a tri-level cross-domain transfer learning framework offers an efficient solution to address both data sparsity and ‘catastrophic forgetting’, yielding remarkable improvements.
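The tri-level flow (TPM, then CPM, then A-CTR) can be sketched structurally. The functions below are placeholders that only show how the three stages feed each other; the return values are stand-ins, not real training code:

```python
def train_tpm(natural_long_term):
    """Tiny Pre-training Model (TPM): a small CTR model with a few basic
    features, trained on long-term natural-content data. Its output here
    stands in for the historical embeddings the later stages reuse."""
    return {"embeddings": "tpm_embeddings", "trained_on": len(natural_long_term)}

def train_cpm(natural_short_term, tpm):
    """Complete Pre-training Model (CPM): same network structure and input
    features as the target ad model, trained on short-term natural data,
    enriched with TPM embeddings."""
    return {"weights": "cpm_weights", "uses": tpm["embeddings"],
            "trained_on": len(natural_short_term)}

def train_actr(ad_data, cpm, tpm):
    """Advertisement CTR model (A-CTR): initialized from CPM weights, with
    TPM historical embeddings as extra features, fine-tuned on sparse ad data."""
    return {"init": cpm["weights"], "extra_features": tpm["embeddings"],
            "finetuned_on": len(ad_data)}

# Rich natural traffic flows into the two pre-training stages; the sparse
# advertisement data is only used for the final fine-tune.
tpm = train_tpm(natural_long_term=range(1000))
cpm = train_cpm(natural_short_term=range(100), tpm=tpm)
actr = train_actr(ad_data=range(10), cpm=cpm, tpm=tpm)
```

The asynchrony in the framework's name refers to the three stages running on different update cadences, which this linear sketch does not capture.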

[IR-7] Efficient k-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures

Link: https://arxiv.org/abs/2408.16036
Authors: Ala-Eddine Benrazek,Zineddine Kouahla,Brahim Farou,Hamid Seridi,Ibtissem Kemouguette
Keywords-EN: Big IoT Data, Internet of Things, data space
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Performance (cs.PF)
Comments: 28 pages, 21 figures, 1 table

Click to view abstract

Abstract:The proliferation of interconnected devices in the Internet of Things (IoT) has led to an exponential increase in data, commonly known as Big IoT Data. Efficient retrieval of this heterogeneous data demands a robust indexing mechanism for effective organization. However, a significant challenge remains: the overlap in data space partitions during index construction. This overlap increases node access during search and retrieval, resulting in higher resource consumption, performance bottlenecks, and impedes system scalability. To address this issue, we propose three innovative heuristics designed to quantify and strategically reduce data space partition overlap. The volume-based method (VBM) offers a detailed assessment by calculating the intersection volume between partitions, providing deeper insights into spatial relationships. The distance-based method (DBM) enhances efficiency by using the distance between partition centers and radii to evaluate overlap, offering a streamlined yet accurate approach. Finally, the object-based method (OBM) provides a practical solution by counting objects across multiple partitions, delivering an intuitive understanding of data space dynamics. Experimental results demonstrate the effectiveness of these methods in reducing search time, underscoring their potential to improve data space partitioning and enhance overall system performance.
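Of the three heuristics, the distance-based method (DBM) is the simplest to state: two ball-shaped partitions can overlap only if the distance between their centers is less than the sum of their radii. A minimal sketch (the ball representation of partitions is an assumption about the index's geometry, and the coordinates are made up):

```python
import math

def dbm_overlap(center_a, radius_a, center_b, radius_b):
    """Distance-based method (DBM): two ball partitions overlap when the
    distance between their centers is less than the sum of their radii."""
    return math.dist(center_a, center_b) < radius_a + radius_b

# Centers 3 apart with radii summing to 4: the balls intersect.
overlapping = dbm_overlap((0.0, 0.0), 2.0, (3.0, 0.0), 2.0)
# Centers 5 apart with radii summing to 2: clearly disjoint.
disjoint = dbm_overlap((0.0, 0.0), 1.0, (5.0, 0.0), 1.0)
```

VBM would instead compute the intersection volume (more informative, more expensive), and OBM would count objects assigned to multiple partitions; DBM trades precision for a single distance computation per pair.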

[IR-8] An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

Link: https://arxiv.org/abs/2408.16032
Authors: Shuang Feng,Grace Feng
Keywords-EN: large language models, webpage contexts, recent advancements
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Recent advancements in large language models (LLMs) have enabled understanding webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity – a notable achievement is the success of InstructGPT. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases. In this project, several RL methods are implemented and evaluated using the WebShop benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product. The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO) as used in InstructGPT, and Direct Preference Optimization (DPO). This report also evaluates the RL agents trained using generative trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment. The simulated online experiments demonstrate that agents trained on generated trajectories exhibited comparable task performance to those trained using human trajectories. This demonstrates an extremely low-cost, data-efficient way of training reinforcement learning agents. Also, with limited training time (2 hours), without utilizing any images, a DPO agent achieved a 19% success rate after approximately 3000 steps or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate.
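The DPO objective mentioned above learns from preference pairs directly, with no separate reward model. A single-pair sketch of the standard DPO loss, -log σ(β[(log π_w - log π_ref,w) - (log π_l - log π_ref,l)]); the log-probabilities below are made-up numbers and β is a hyperparameter:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: penalizes the policy when it does
    not prefer the chosen trajectory (relative to the frozen reference
    policy) over the rejected one."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen trajectory more than the reference does:
# positive margin, small loss.
good = dpo_loss(-2.0, -5.0, -3.0, -4.0)
# Preference reversed: negative margin, larger loss.
bad = dpo_loss(-5.0, -2.0, -4.0, -3.0)
```

In the report's setting, "chosen" and "rejected" would be generated WebShop trajectories ranked by purchase reward rather than human preference labels.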

[IR-9] Ranking evaluation metrics from a group-theoretic perspective

Link: https://arxiv.org/abs/2408.16009
Authors: Chiara Balestra,Andreas Mayr,Emmanuel Müller
Keywords-EN: ranking evaluation metrics, newly proposed models, validate the merits
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Confronted with the challenge of identifying the most suitable metric to validate the merits of newly proposed models, the decision-making process is anything but straightforward. Given that comparing rankings introduces its own set of formidable challenges and the likely absence of a universal metric applicable to all scenarios, the scenario does not get any better. Furthermore, metrics designed for specific contexts, such as for Recommender Systems, sometimes extend to other domains without a comprehensive grasp of their underlying mechanisms, resulting in unforeseen outcomes and potential misuses. Complicating matters further, distinct metrics may emphasize different aspects of rankings, frequently leading to seemingly contradictory comparisons of model results and hindering the trustworthiness of evaluations. We unveil these aspects in the domain of ranking evaluation metrics. Firstly, we show instances resulting in inconsistent evaluations, sources of potential mistrust in commonly used metrics; by quantifying the frequency of such disagreements, we prove that these are common in rankings. Afterward, we conceptualize rankings using the mathematical formalism of symmetric groups detaching from possible domains where the metrics have been created; through this approach, we can rigorously and formally establish essential mathematical properties for ranking evaluation metrics, essential for a deeper comprehension of the source of inconsistent evaluations. We conclude with a discussion, connecting our theoretical analysis to the practical applications, highlighting which properties are important in each domain where rankings are commonly evaluated. In conclusion, our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust but as the need to carefully choose how to evaluate our models in the future. 
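One concrete source of the "seemingly contradictory comparisons" is that two common metrics can order the same pair of rankings differently. A minimal illustration with binary relevance lists (the two runs are hypothetical):

```python
def reciprocal_rank(rels):
    """Reciprocal rank of the first relevant item (0 if none)."""
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def precision_at_k(rels, k):
    """Fraction of relevant items among the top-k results."""
    return sum(rels[:k]) / k

run_a = [1, 0, 0, 0, 1, 1]   # first hit at rank 1, the rest near the bottom
run_b = [0, 1, 1, 1, 0, 0]   # first hit only at rank 2, but dense top-4

# Reciprocal rank favors run_a (1.0 vs 0.5)...
rr_prefers_a = reciprocal_rank(run_a) > reciprocal_rank(run_b)
# ...while Precision@3 favors run_b (2/3 vs 1/3).
p3_prefers_b = precision_at_k(run_b, 3) > precision_at_k(run_a, 3)
```

The paper's group-theoretic framing makes such disagreements precise by treating rankings as elements of a symmetric group and asking which formal properties each metric satisfies.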

Attachment download

Click to download today's full paper list